

Timezone: Pacific/Honolulu

Registration Desk: Registration/Badge Pickup Tue 21 Oct 07:00 a.m.  


Reception: Welcome & Awards Tue 21 Oct 08:00 a.m.  


Oral 1A: Multi-modal learning Tue 21 Oct 08:45 a.m.  

Oral
David G. Shatwell · Ishan Rajendrakumar Dave · Swetha Sirnam · Mubarak Shah

[ Exhibit Hall III ]

Abstract
Timestamp prediction aims to determine when an image was captured using only visual information, supporting applications such as metadata correction, retrieval, and digital forensics. In outdoor scenarios, hourly estimates rely on cues like brightness, hue, and shadow positioning, while seasonal changes and weather inform date estimation. However, these visual cues significantly depend on geographic context, closely linking timestamp prediction to geo-localization. To address this interdependence, we introduce GT-Loc, a novel retrieval-based method that jointly predicts the capture time (hour and month) and geo-location (GPS coordinates) of an image. Our approach employs separate encoders for images, time, and location, aligning their embeddings within a shared high-dimensional feature space. Recognizing the cyclical nature of time, we utilize Random Fourier Features for effective temporal representation. Instead of conventional contrastive learning with hard positives and negatives, we propose a metric-learning objective providing soft targets by modeling temporal differences over a cyclical toroidal surface. We present new benchmarks demonstrating that our joint optimization surpasses methods focused solely on time prediction and even those utilizing geo-location during inference. Additionally, our approach achieves competitive results on standard geo-localization tasks, while the unified embedding space facilitates compositional and text-based image retrieval.
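The mention of Random Fourier Features for cyclical time suggests an encoding in which the hour (or month) is first placed on a circle before a random projection. The sketch below is only an illustration of that idea; the dimensionality, frequency scale, and the function name cyclic_rff are assumptions, not the GT-Loc time encoder.

import numpy as np

def cyclic_rff(value, period, num_features=16, scale=1.0, seed=0):
    """Encode a cyclic quantity (e.g., hour in [0, 24)) with random Fourier features.

    The value is first mapped onto the unit circle so that 23:59 and 00:00
    land close together, then projected with random frequencies and phases.
    """
    rng = np.random.default_rng(seed)
    angle = 2.0 * np.pi * (value / period)
    point = np.array([np.cos(angle), np.sin(angle)])       # point on the unit circle
    W = rng.normal(scale=scale, size=(num_features, 2))    # random projection
    b = rng.uniform(0.0, 2.0 * np.pi, size=num_features)   # random phases
    return np.sqrt(2.0 / num_features) * np.cos(W @ point + b)

# Example: hour 23.5 and hour 0.5 should produce nearby embeddings.
e_late = cyclic_rff(23.5, period=24)
e_early = cyclic_rff(0.5, period=24)
print(np.linalg.norm(e_late - e_early))  # small distance across the midnight wrap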
Oral
Mustafa Shukor · Enrico Fini · Victor Guilherme Turrisi da Costa · Matthieu Cord · Joshua Susskind · Alaaeldin El-Nouby

[ Exhibit Hall III ]

Abstract
Building general-purpose models that can effectively perceive the world through multimodal signals has been a long-standing goal. Current approaches involve integrating separately pre-trained components, such as connecting vision encoders to LLMs and continuing training on multimodal data. While such approaches exhibit remarkable sample efficiency, it remains an open question whether such late-fusion architectures are inherently superior. In this work, we revisit the architectural design of native multimodal models (NMMs)—those trained from the ground up on all modalities—and conduct an extensive scaling laws study, spanning 457 trained models with different architectures and training mixtures. Our investigation reveals no inherent advantage to late-fusion architectures over early-fusion ones, which do not rely on pre-trained image encoders or tokenizers. On the contrary, early-fusion exhibits stronger performance at lower parameter count, is more efficient to train, and is easier to deploy. Motivated by the strong performance of the early-fusion architectures, we show that incorporating Mixture of Experts (MoEs) allows models to learn modality-specific weights, significantly benefiting performance.
Oral
Shuai Tan · Bill Gong · Bin Ji · Ye Pan

[ Exhibit Hall III ]

Abstract
Talking head generation is gaining significant importance across various domains, with a growing demand for high-quality rendering. However, existing methods often suffer from identity leakage (IL) and rendering artifacts (RA), particularly in extreme cases. Through an in-depth analysis of previous approaches, we identify two key insights: (1) IL arises from identity information embedded within motion features, and (2) this identity information can be leveraged to address RA. Building on these findings, this paper introduces FixTalk, a novel framework designed to simultaneously resolve both issues for high-quality talking head generation. Firstly, we propose an **Enhanced Motion Indicator (EMI)** to effectively decouple identity information from motion features, mitigating the impact of IL on generated talking heads. To address RA, we introduce an **Enhanced Detail Indicator (EDI)**, which utilizes the leaked identity information to supplement missing details, thus fixing the artifacts. Extensive experiments demonstrate that FixTalk effectively mitigates IL and RA, achieving superior performance compared to state-of-the-art methods.
Oral
Derong Jin · Ruohan Gao

[ Exhibit Hall III ]

Abstract
An immersive acoustic experience enabled by spatial audio is just as crucial as the visual aspect in creating realistic virtual environments. However, existing methods for room impulse response estimation rely either on data-demanding learning-based models or computationally expensive physics-based modeling. In this work, we introduce Audio-Visual Differentiable Room Acoustic Rendering (AV-DAR), a framework that leverages visual cues extracted from multi-view images and acoustic beam tracing for physics-based room acoustic rendering. Experiments across six real-world environments from two datasets demonstrate that our multimodal, physics-based approach is efficient, interpretable, and accurate, significantly outperforming a series of prior methods. Notably, on the Real Acoustic Field dataset, AV-DAR achieves comparable performance to models trained on 10 times more data while delivering relative gains ranging from 16.6% to 50.9% when trained at the same scale.
Oral
Yi Li · Hualiang Wang · Xinpeng Ding · Haonan Wang · Xiaomeng Li

[ Exhibit Hall III ]

Abstract
Multimodal large language models (MLLMs) are broadly empowering various fields. Despite their advancements, the explainability of MLLMs remains less explored, hindering deeper understanding, model credibility, and effective visualization. Unlike conventional vision models (e.g., CNNs, ViTs, CLIP) that produce a single output, MLLMs generate sequences of tokens progressively, where each generated token depends on the previous context. Therefore, earlier context tokens can introduce redundant activations that interfere with the explanation of later tokens beyond their original information. Existing studies often overlook this issue, but our observations reveal that these redundant correlations can significantly hurt the reliability of explanations. To address this, we propose an estimated causal inference method to mitigate the interference of context to achieve high-quality MLLM explanation, with a novel rank Gaussian filter to further reduce activation noises. We term this method Token Activation Map (TAM) to highlight the consideration of interactions between tokens. The name also indicates that TAM excels at explaining multiple generated tokens of an MLLM, in contrast to the Class Activation Map (CAM), which targets a single prediction. Our TAM method significantly outperforms existing SoTA methods, showcasing high-quality visualization results that can be utilized for various scenarios, such as object localization, failure case analysis, video visualization, MLLMs visual …

Oral 1B: Structure and Motion Tue 21 Oct 08:45 a.m.  

Oral
Frano Rajič · Haofei Xu · Marko Mihajlovic · Siyuan Li · Irem Demir · Emircan Gündoğdu · Lei Ke · Sergey Prokudin · Marc Pollefeys · Siyu Tang

[ Kalakaua Ballroom ]

Abstract
We introduce the first data-driven multi-view 3D point tracker, designed to track arbitrary points in dynamic scenes using multiple camera views. Unlike existing monocular trackers, which struggle with depth ambiguities and occlusion, or previous multi-camera methods that require over 20 cameras and tedious per-sequence optimization, our feed-forward model directly predicts 3D correspondences using a practical number of cameras (e.g., four), enabling robust and accurate online tracking. Our tracker fuses multi-view features into a unified point cloud and applies k-nearest-neighbors correlation alongside a transformer-based update to reliably estimate long-range 3D correspondences, even under occlusion. We train on 5K synthetic multi-view Kubric sequences and evaluate on two real-world benchmarks—Panoptic Studio and DexYCB—where we achieve median trajectory errors of 3.2 cm and 2.3 cm, respectively. Notably, on DexYCB, our method surpasses the strongest single-view tracker by 58.2% and a simpler multi-view triplane-based baseline by 46.5%. It also generalizes better to diverse camera setups of 1–8 cameras with varying vantage points and video lengths of 24–150 frames. By releasing our pre-trained tracker alongside training and evaluation datasets, we aim to set a new standard for multi-view 3D tracking research and provide a practical tool for a wide range of real-world applications.
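A minimal sketch of the "k-nearest-neighbors correlation" step, assuming correlation simply means dot products between a query point's feature and the features of its k nearest neighbours in the fused point cloud; the tracker's actual feature fusion and correlation layers are not reproduced here.

import torch

def knn_correlation(query_xyz, query_feat, cloud_xyz, cloud_feat, k=16):
    """For each query point, find its k nearest neighbours in the fused point
    cloud and return per-neighbour feature correlations (dot products).

    query_xyz:  (Q, 3)   query point positions
    query_feat: (Q, C)   query point features
    cloud_xyz:  (N, 3)   fused multi-view point cloud positions
    cloud_feat: (N, C)   fused multi-view point cloud features
    """
    dists = torch.cdist(query_xyz, cloud_xyz)                   # (Q, N)
    knn_idx = dists.topk(k, largest=False).indices              # (Q, k)
    neigh_feat = cloud_feat[knn_idx]                            # (Q, k, C)
    corr = torch.einsum('qc,qkc->qk', query_feat, neigh_feat)   # (Q, k)
    return corr, knn_idx

# Toy usage with random data.
Q, N, C = 8, 1024, 64
corr, idx = knn_correlation(torch.randn(Q, 3), torch.randn(Q, C),
                            torch.randn(N, 3), torch.randn(N, C))
print(corr.shape)  # torch.Size([8, 16])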
Oral
Jonathan Ventura · Viktor Larsson · Fredrik Kahl

[ Kalakaua Ballroom ]

Abstract
Spherical motion is a special case of camera motion where the camera moves on the imaginary surface of a sphere with the optical axis normal to the surface. Common sources of spherical motion are a person capturing a stereo panorama with a phone held in an outstretched hand, or a hemi-spherical camera rig used for multi-view scene capture. However, traditional structure-from-motion pipelines tend to fail on spherical camera motion sequences, especially when the camera is facing outward. Building upon prior work addressing the calibrated case, we explore uncalibrated reconstruction from spherical motion, assuming a fixed but unknown focal length parameter. We show that, although two-view spherical motion is always a critical case, self-calibration is possible from three or more views. Through analysis of the relationship between focal length and spherical relative pose, we devise a global structure-from-motion approach for uncalibrated reconstruction. We demonstrate the effectiveness of our approach on real-world captures in various settings, even when the camera motion deviates from perfect spherical motion.
Oral
Simon Kiefhaber · Stefan Roth · Simone Schaub-Meyer

[ Kalakaua Ballroom ]

Abstract
Cost volumes are used in every modern optical flow estimator, but due to their computational and space complexity, they are often a limiting factor in optical flow methods regarding both processing speed and the resolution of input frames. Motivated by our empirical observation that cost volumes lose their importance once all other network parts of, e.g., a RAFT-based pipeline have been sufficiently trained, we introduce a training strategy that allows us to remove the cost volume from optical flow estimators throughout training. This leads to significantly improved inference speed and reduced memory requirements. Using our training strategy, we create three different models covering different compute budgets. Our most accurate model reaches state-of-the-art accuracy while being 1.2× faster and having a 6× lower memory footprint than comparable models; our fastest model is capable of processing Full HD frames at 20 FPS using only 500 MB of memory.
Oral
Jerred Chen · Ronald Clark

[ Kalakaua Ballroom ]

Abstract
In many robotics and VR/AR applications, fast camera motions cause a high level of motion blur, causing existing camera pose estimation methods to fail. In this work, we propose a novel framework that leverages motion blur as a rich cue for motion estimation rather than treating it as an unwanted artifact. Our approach works by predicting a dense motion flow field and a monocular depth map directly from a single motion-blurred image. We then recover the instantaneous camera velocity by solving a linear least squares problem under the small motion assumption. In essence, our method produces an IMU-like measurement that robustly captures fast and aggressive camera movements. To train our model, we construct a large-scale dataset with realistic synthetic motion blur derived from ScanNet++v2 and further refine our model by training end-to-end on real data using our fully differentiable pipeline. Extensive evaluations on real-world benchmarks demonstrate that our method achieves state-of-the-art angular and translational velocity estimates, outperforming current methods like MASt3R and COLMAP.
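Under the small-motion assumption, the instantaneous camera velocity can be recovered from predicted flow and depth with a closed-form least-squares solve using the classic motion-field equations. The snippet below is a sketch with one common sign convention and unit focal length; the paper's exact parameterization may differ.

import numpy as np

def solve_camera_velocity(flow, depth, xs, ys):
    """Recover linear velocity t and angular velocity w from dense optical
    flow and depth under the small-motion (instantaneous) model.

    flow : (N, 2) flow vectors (u, v) in normalized image coordinates
    depth: (N,)   per-pixel depth Z
    xs, ys: (N,)  normalized pixel coordinates (focal length = 1)
    """
    N = xs.shape[0]
    A = np.zeros((2 * N, 6))
    b = flow.reshape(-1)
    invZ = 1.0 / depth
    # u-rows: coefficients for (tx, ty, tz, wx, wy, wz)
    A[0::2, 0] = -invZ
    A[0::2, 2] = xs * invZ
    A[0::2, 3] = xs * ys
    A[0::2, 4] = -(1.0 + xs ** 2)
    A[0::2, 5] = ys
    # v-rows
    A[1::2, 1] = -invZ
    A[1::2, 2] = ys * invZ
    A[1::2, 3] = 1.0 + ys ** 2
    A[1::2, 4] = -xs * ys
    A[1::2, 5] = -xs
    sol, *_ = np.linalg.lstsq(A, b, rcond=None)
    return sol[:3], sol[3:]  # (t, w): an IMU-like velocity estimate

# Sanity check: synthesize flow from a known velocity and recover it.
rng = np.random.default_rng(0)
xs, ys = rng.uniform(-0.5, 0.5, 500), rng.uniform(-0.5, 0.5, 500)
depth = rng.uniform(1.0, 5.0, 500)
t_true, w_true = np.array([0.1, -0.05, 0.2]), np.array([0.01, 0.02, -0.03])
u = (-t_true[0] + xs * t_true[2]) / depth + xs * ys * w_true[0] - (1 + xs**2) * w_true[1] + ys * w_true[2]
v = (-t_true[1] + ys * t_true[2]) / depth + (1 + ys**2) * w_true[0] - xs * ys * w_true[1] - xs * w_true[2]
t_est, w_est = solve_camera_velocity(np.stack([u, v], axis=1), depth, xs, ys)
print(np.allclose(t_est, t_true), np.allclose(w_est, w_true))  # True True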
Oral
Mark YU · Wenbo Hu · Jinbo Xing · Ying Shan

[ Kalakaua Ballroom ]

Abstract
We present TrajectoryCrafter, a novel approach to redirect camera trajectories for monocular videos. By disentangling deterministic view transformations from stochastic content generation, our method achieves precise control over user-specified camera trajectories. We propose a novel dual-stream conditional video diffusion model that concurrently integrates point cloud renders and source videos as conditions, ensuring accurate view transformations and coherent 4D content generation. Instead of leveraging scarce multi-view videos, we curate a hybrid training dataset that combines web-scale monocular videos with static multi-view datasets via our double-reprojection strategy, significantly improving generalization across diverse scenes. Extensive evaluations on multi-view and large-scale monocular videos demonstrate the superior performance of our method. Code and the pre-trained model will be released.

Invited Talk: Sheperd Doeleman

Taking pictures and making movies of black holes

Black holes are cosmic objects so small and dense that nothing, not even light, can escape their gravitational pull. Until recently, no one had ever seen what a black hole actually looked like. Einstein's theories predict that a distant observer should see a ring of light encircling the black hole, which forms when radiation emitted by infalling hot gas is lensed by the extreme gravity near the event horizon. The Event Horizon Telescope (EHT) is a global array of radio dishes, linked together by a network of atomic clocks to form an Earth-sized virtual telescope that can resolve the nearest supermassive black holes, where this ring feature may be measured. On April 10th, 2019, the EHT project reported success: we have imaged a black hole and have seen the predicted strong gravitational lensing. In 2022, our team again saw this phenomenon towards the supermassive black hole at the center of our Milky Way galaxy. This talk will cover the background of the project, the technique, and the imaging strategies employed. Expansion of the global array to a next-generation EHT, enabling capture of multi-color movies of black holes, will be discussed.

Sheperd Doeleman

 

Shep Doeleman is Founding Director of the Event Horizon Telescope (EHT) project and led the international team that made the first image of a black hole. He received his bachelor's from Reed College, a PhD in astrophysics from MIT, and spent a year in Antarctica conducting space-science experiments where he got hooked on doing research in challenging circumstances. After serving as assistant director of MIT’s Haystack Observatory and receiving a Guggenheim Fellowship in 2012, he moved to the Harvard-Smithsonian Center for Astrophysics. There he co-founded the Black Hole Initiative – the first center dedicated to the interdisciplinary study of black holes – which is supported by the John Templeton Foundation. He now leads the next-generation EHT (ngEHT), which has a goal of making movies of black holes to answer the next set of big questions. photograph © Brigitte Lacombe for the Breakthrough Prize



Demonstration: Demos 1 Tue 21 Oct 11:30 a.m.  

  • Spatially-Varying Autofocus, Yingsi Qin, Aswin C Sankaranarayanan, Matthew O'Toole
  • Measuring the Performance of 6 DoF Object Pose Estimation for Robotic Bin Picking, Kamel Saidi
  • Multi-turn Consistent Image Editing, Zijun Zhou
  • AstroLoc: Robust Space to Ground Image Localizer, Gabriele Berton, Alex Stoken, Carlo Masone

Poster Session 1 & Exhibit Hall Tue 21 Oct 11:45 a.m.  

Poster
Li Li · Peilin Cai · Yuxiao Zhou · Zhiyu Ni · Renjie Liang · QIN YOU · Yi Nian · Zhengzhong Tu · Xiyang Hu · Yue Zhao

[ Exhibit Hall I ]

Abstract
Out-of-Distribution (OOD) detection is critical for ensuring the reliability of machine learning models in safety-critical applications such as autonomous driving and medical diagnosis. While deploying personalized OOD detection directly on edge devices is desirable, it remains challenging due to large model sizes and the computational infeasibility of on-device training. Federated learning partially addresses this but still requires gradient computation and backpropagation, exceeding the capabilities of many edge devices. To overcome these challenges, we propose SecDOOD, a secure cloud-device collaboration framework for efficient on-device OOD detection without requiring device-side backpropagation. SecDOOD utilizes cloud resources for model training while ensuring user data privacy by retaining sensitive information on-device. Central to SecDOOD is a HyperNetwork-based personalized parameter generation module, which adapts cloud-trained models to device-specific distributions by dynamically generating local weight adjustments, effectively combining central and local information without local fine-tuning. Additionally, our dynamic feature sampling and encryption strategy selectively encrypts only the most informative feature channels, largely reducing encryption overhead without compromising detection performance. Extensive experiments across multiple datasets and OOD scenarios demonstrate that SecDOOD achieves performance comparable to fully fine-tuned models, enabling secure, efficient, and personalized OOD detection on resource-limited edge devices. To enhance accessibility and reproducibility, our code is publicly available at https://anonymous.4open.science/r/SecDOOD/.
Poster
Yu Zheng · Boyang Gong · Fanye Kong · Yueqi Duan · Bingyao Yu · Wenzhao Zheng · Lei Chen · Jiwen Lu · Jie Zhou

[ Exhibit Hall I ]

Abstract
In this paper, we propose a Counterfactually Decoupled Attention Learning (CDAL) method for open-world model attribution. Existing methods rely on handcrafted design of region partitioning or feature space, which could be confounded by the spurious statistical correlations and struggle with novel attacks in open-world scenarios. To address this, CDAL explicitly models the causal relationships between the attentional visual traces and source model attribution, and counterfactually decouples the discriminative model-specific artifacts from confounding source biases for comparison. In this way, the resulting causal effect provides a quantification on the quality of learned attention maps, thus encouraging the network to capture essential generation patterns that generalize to unseen source models by maximizing the effect. Extensive experiments on existing open-world model attribution benchmarks show that with minimal computational overhead, our method consistently improves state-of-the-art models by large margins, particularly for unseen novel attacks.
Poster
Wenxuan Bao · Ruxi Deng · Ruizhong Qiu · Tianxin Wei · Hanghang Tong · Jingrui He

[ Exhibit Hall I ]

Abstract
Test-time adaptation with pre-trained vision-language models has gained increasing attention for addressing distribution shifts during testing. Among these approaches, memory-based algorithms stand out due to their training-free nature and ability to leverage historical test data. However, existing test-time adaptation methods are typically designed for a single domain with abundant data. In decentralized settings such as federated learning, applying these methods individually to each client suffers from limited test data, while directly sharing a single global memory via the server prevents proper personalization to each client's unique distribution. To address this, we propose Latte, a novel framework where each client maintains a local memory to store embeddings from its own historical test data and an external memory to store class prototypes from other relevant clients. During communication, each client retrieves prototypes from similar clients under the server’s coordination to expand its memory. For local adaptation, Latte utilizes both embedding similarity and uncertainty to enhance model performance. Our theoretical analysis shows that Latte effectively leverages in-distribution clients while remaining robust to out-of-distribution clients. Extensive experiments on domain adaptation and corruption benchmarks validate that Latte achieves superior performance in decentralized settings, while introducing only negligible communication and computation costs.
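One way to picture the memory-based prediction using "embedding similarity and uncertainty": each stored embedding votes on the query, weighted by its cosine similarity and its confidence (low predictive entropy). The weighting below is an illustrative assumption, not Latte's exact rule, and the function name memory_predict is hypothetical.

import numpy as np

def memory_predict(query, mem_embs, mem_probs, tau=0.07):
    """Classify a query embedding with a memory of past test embeddings.

    query:     (D,)    L2-normalized query embedding
    mem_embs:  (M, D)  L2-normalized stored embeddings
    mem_probs: (M, K)  class probabilities recorded for each stored embedding
    """
    sims = mem_embs @ query                                     # cosine similarity to each entry
    entropy = -(mem_probs * np.log(mem_probs + 1e-12)).sum(1)   # predictive entropy per entry
    confidence = 1.0 - entropy / np.log(mem_probs.shape[1])     # 1 = certain, 0 = uniform
    weights = np.exp(sims / tau) * confidence                   # similarity x certainty
    return weights @ mem_probs / weights.sum()

rng = np.random.default_rng(0)
mem_embs = rng.normal(size=(32, 8)); mem_embs /= np.linalg.norm(mem_embs, axis=1, keepdims=True)
mem_probs = rng.dirichlet(np.ones(5), size=32)
query = mem_embs[0] + 0.05 * rng.normal(size=8); query /= np.linalg.norm(query)
print(memory_predict(query, mem_embs, mem_probs).round(3))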
Poster
Zixin Wang · Dong Gong · Sen Wang · Zi Huang · Yadan Luo

[ Exhibit Hall I ]

Abstract
Contrastive Language-Image Pretraining (CLIP) excels at learning generalizable image representations but often falls short in zero-shot inference on certain downstream datasets. Test-time adaptation (TTA) mitigates this issue by adjusting components like normalization layers or context prompts, yet it typically requires large batch sizes and extensive augmentations, leading to high computational costs. This raises a key question: Can VLMs' performance drop in specific test cases be mitigated through efficient, training-free approaches? To explore the solution, we investigate token condensation (TC) techniques, originally designed to enhance vision transformer efficiency by refining token usage during inference. We observe that informative tokens improve visual-text alignment in VLMs like CLIP on unseen datasets. However, existing TC methods often fail to maintain in-distribution performance when reducing tokens, prompting us to ask: How can we transform TC into an effective "free-lunch" adaptation strategy for VLMs? To address this, we propose Token Condensation as Adaptation (TCA), a training-free adaptation method that takes a step beyond standard TC. Rather than passively discarding tokens, TCA condenses token representation by introducing reservoir-based domain anchor tokens for information-preserving token reduction and logit correction. TCA achieves up to a 21.4% performance improvement over the strongest baseline on cross-dataset benchmark and the CIFAR-100-Corrupted dataset …
Poster
Qifan Yu · Zhebei Shen · Zhongqi Yue · Yang Wu · Bosheng Qin · Wenqiao Zhang · Yunfei Li · Juncheng Li · Siliang Tang · Yueting Zhuang

[ Exhibit Hall I ]

Abstract
Instruction tuning fine-tunes pre-trained Multi-modal Large Language Models (MLLMs) to handle real-world tasks. However, the rapid expansion of visual instruction datasets introduces data redundancy, leading to excessive computational costs. We propose a collaborative framework, DataTailor, which leverages three key principles—informativeness, uniqueness, and representativeness—for effective data selection. We argue that a valuable sample should be informative of the task, non-redundant, and represent the sample distribution (i.e., not an outlier). We further propose practical ways to score against each principle, which automatically adapt to a given dataset without tedious hyperparameter tuning. Comprehensive experiments on various benchmarks demonstrate that DataTailor achieves 101.3% of the performance of full-data fine-tuning with only 15% of the data, significantly reducing computational costs while maintaining superior results. This exemplifies the "Less is More" philosophy in MLLM development. The code is available at https://anonymous.4open.science/r/DataTailor-5BC3.
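A toy rendering of scoring samples against the three principles named above, where informativeness is proxied by per-sample loss, uniqueness by distance to the nearest neighbour, and representativeness by average similarity to the rest of the data; these stand-in scores and the equal-weight combination are assumptions, not DataTailor's actual criteria.

import numpy as np

def principle_scores(feats, losses):
    """Toy per-sample scores in the spirit of the three principles.

    feats:  (N, D) sample embeddings
    losses: (N,)   per-sample task loss (stand-in for informativeness)
    """
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sims = feats @ feats.T
    np.fill_diagonal(sims, -np.inf)
    informativeness = (losses - losses.min()) / (losses.max() - losses.min() + 1e-12)
    uniqueness = 1.0 - sims.max(axis=1)        # far from its nearest neighbour
    np.fill_diagonal(sims, 0.0)
    representativeness = sims.mean(axis=1)     # close to the overall distribution
    return (informativeness + uniqueness + representativeness) / 3.0

rng = np.random.default_rng(0)
scores = principle_scores(rng.normal(size=(100, 16)), rng.uniform(size=100))
keep = np.argsort(scores)[-15:]                # e.g., select 15% of the data
print(keep.shape)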
Poster
Zhe Cao · Jin Zhang · Ruiheng Zhang

[ Exhibit Hall I ]

Abstract
Real-world infrared imagery presents unique challenges for vision-language models due to the scarcity of aligned text data and domain-specific characteristics. Although existing methods have advanced the field, their reliance on synthetic infrared images generated through style transfer from visible images limits their ability to capture the unique characteristics of the infrared modality. To address this, we propose IRGPT, the first multi-modal large language model for real-world infrared images, built upon a large-scale InfraRed-Text Dataset (IR-TD) comprising over 260K authentic image-text pairs. The proposed IR-TD dataset contains real infrared images paired with meticulously handcrafted texts, where the initial drafts originated from two complementary processes: (1) LLM-generated descriptions of visible images, and (2) rule-based descriptions of annotations. Furthermore, we introduce a bi-cross-modal curriculum transfer learning strategy that systematically transfers knowledge from visible to infrared domains by considering the difficulty scores of both infrared-visible and infrared-text. Evaluated on a benchmark of 9 tasks (e.g., recognition, grounding), IRGPT achieves state-of-the-art performance even compared with larger-scale models.
Poster
Ziqi Wang · Chang Che · Qi Wang · Yangyang Li · Zenglin Shi · Meng Wang

[ Exhibit Hall I ]

Abstract
Visual instruction tuning (VIT) enables multimodal large language models (MLLMs) to effectively handle a wide range of vision tasks by framing them as language-based instructions. Building on this, continual visual instruction tuning (CVIT) extends the capability of MLLMs to incrementally learn new tasks, accommodating evolving functionalities. While prior work has advanced CVIT through the development of new benchmarks and approaches to mitigate catastrophic forgetting, these efforts largely follow traditional continual learning paradigms, neglecting the unique challenges specific to CVIT. We identify a dual form of catastrophic forgetting in CVIT, where MLLMs not only forget previously learned visual understanding but also experience a decline in instruction following abilities as they acquire new tasks. To address this, we introduce the Separable Mixture of Low-Rank Adaptation (SMoLoRA) framework, which employs separable routing through two distinct modules—one for visual understanding and another for instruction following. This dual-routing design enables specialized adaptation in both domains, preventing forgetting while improving performance. Furthermore, we propose a new CVIT benchmark that goes beyond existing benchmarks by additionally evaluating a model's ability to generalize to unseen tasks and handle diverse instructions across various tasks. Extensive experiments demonstrate that SMoLoRA outperforms existing methods in mitigating dual forgetting, improving generalization to …
Poster
Jiale Zhao · XINYANG JIANG · Junyao Gao · Yuhao Xue · Cairong Zhao

[ Exhibit Hall I ]

Abstract
Unified vision-language models (VLMs) have recently shown remarkable progress, enabling a single model to flexibly address diverse tasks through different instructions within a shared computational architecture. This instruction-based control mechanism creates unique security challenges, as adversarial inputs must remain effective across multiple task instructions that may be unpredictably applied to process the same malicious content. In this paper, we introduce CrossVLAD, a new benchmark dataset carefully curated from MSCOCO with GPT-4-assisted annotations for systematically evaluating cross-task adversarial attacks on unified VLMs. CrossVLAD centers on the object-change objective—consistently manipulating a target object's classification across four downstream tasks—and proposes a novel success rate metric that measures simultaneous misclassification across all tasks, providing a rigorous evaluation of adversarial transferability. To tackle this challenge, we present CRAFT (Cross-task Region-based Attack Framework with Token-alignment), an efficient region-centric attack method. Extensive experiments on Florence-2 and other popular unified VLMs demonstrate that our method outperforms existing approaches in both overall cross-task attack performance and targeted object-change success rates, highlighting its effectiveness in adversarially influencing unified VLMs across diverse tasks.
Poster
Chongjie Si · Zhiyi Shi · Xuehui Wang · Yichen Xiao · Xiaokang Yang · Wei Shen

[ Exhibit Hall I ]

Abstract
Adapting pre-trained foundation models for diverse downstream tasks is a core practice in artificial intelligence. However, the wide range of tasks and high computational costs make full fine-tuning impractical. To overcome this, parameter-efficient fine-tuning (PEFT) methods like LoRA have emerged and are becoming a growing research focus. Despite the success of these methods, they are primarily designed for linear layers, focusing on two-dimensional matrices while largely ignoring higher-dimensional parameter spaces like convolutional kernels. Moreover, directly applying these methods to higher-dimensional parameter spaces often disrupts their structural relationships. Given the rapid advancements in matrix-based PEFT methods, rather than designing a specialized strategy, we propose a generalization that extends matrix-based PEFT methods to higher-dimensional parameter spaces without compromising their structural properties. Specifically, we treat parameters as elements of a Lie group, with updates modeled as perturbations in the corresponding Lie algebra. These perturbations are mapped back to the Lie group through the exponential map, ensuring smooth, consistent updates that preserve the inherent structure of the parameter space. Extensive experiments on computer vision and natural language processing validate the effectiveness and versatility of our approach, demonstrating clear improvements over existing methods.
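A minimal sketch of the exponential-map update described above for a square 2-D weight: a trainable low-rank element of the Lie algebra is mapped to the group with the matrix exponential and applied multiplicatively to the frozen weight. How the paper handles convolutional kernels and non-square shapes is not reproduced; the factor shapes and zero initialization are assumptions.

import torch

def lie_peft_update(W, A, B):
    """Update a frozen weight by a low-rank perturbation in the Lie algebra,
    mapped back to the group with the matrix exponential, so the update is a
    smooth multiplicative deformation rather than an additive one.

    W: (d, d) frozen pre-trained weight
    A: (d, r), B: (r, d) trainable low-rank factors (the algebra element is A @ B)
    """
    delta = A @ B                                 # perturbation in the Lie algebra
    return torch.linalg.matrix_exp(delta) @ W     # exponential map, then act on W

d, r = 16, 2
W = torch.randn(d, d)
A = torch.zeros(d, r, requires_grad=True)         # zero init => exp(0) = I, no change at start
B = torch.randn(r, d, requires_grad=True)
W_adapted = lie_peft_update(W, A, B)
print(torch.allclose(W_adapted, W))               # True before any training step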
Poster
Jiaer Xia · Bingkui Tong · Yuhang Zang · Rui Shao · Kaiyang Zhou

[ Exhibit Hall I ]

Abstract
Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in interpreting images using natural language. However, without using large-scale datasets for retraining, these models are difficult to adapt to specialized vision tasks, e.g., chart understanding. This problem is caused by a mismatch between pre-training and downstream datasets: pre-training datasets primarily concentrate on scenes and objects but contain limited information about specialized, non-object images, such as charts and tables. In this paper, we share an interesting finding that training an MLLM with chain-of-thought (CoT) reasoning data can facilitate model adaptation in specialized vision tasks, especially under data-limited regimes. However, we identify a critical issue within CoT data distilled from pre-trained MLLMs, i.e., the data often contains multiple factual errors in the reasoning steps. To address the problem, we propose Grounded Chain-of-Thought (GCoT), a simple bootstrapping-based approach that aims to inject grounding information (i.e., bounding boxes) into CoT data, essentially making the reasoning steps more faithful to input images. We evaluate our approach on five specialized vision tasks, which cover a variety of visual formats including charts, tables, receipts, and reports. The results demonstrate that under data-limited regimes our approach significantly improves upon fine-tuning and distillation.
Poster
Qidong Huang · Xiaoyi Dong · Pan Zhang · Yuhang Zang · Yuhang Cao · Jiaqi Wang · Weiming Zhang · Nenghai Yu

[ Exhibit Hall I ]

Abstract
Multi-modal pre-training plays a pivotal role in aligning two modalities for Large Vision-Language Models (LVLMs), while evaluating its training quality usually requires the costly supervised fine-tuning (SFT) stage to verify the downstream benchmark scores. Loss, perplexity, and in-context evaluation results are commonly used pre-training metrics for Large Language Models (LLMs), while we observed that these metrics are less indicative when quantifying the pre-trained LVLMs. Due to the lack of proper metrics, research on LVLMs in the critical pre-training stage is greatly hindered, including the choice of training data, efficient module design, etc. In this paper, we first present Modality Integration Rate (MIR), an effective, robust, and generalized metric to indicate the multi-modal pre-training quality of LVLMs without SFT. This metric evaluates LVLM pre-training from the inter-modal distribution distance perspective, and is 1) Effective, representing the pre-training quality and showing a positive relation with benchmark performance after SFT, 2) Robust toward different training/evaluation data, and 3) Generalizable across training configurations and architecture choices. Complementing MIR, we further propose learnable Modality Calibration (MoCa), a lightweight module to narrow the modality gap at each language model layer during training. A series of experiments are conducted to explore the effectiveness of MIR and MoCa, …
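As one concrete instance of an "inter-modal distribution distance", the sketch below computes a Fréchet distance between Gaussians fitted to vision-token and text-token hidden states; MIR's actual definition (per-layer normalization and aggregation) may differ, so treat this purely as an illustration.

import numpy as np
from scipy import linalg

def frechet_distance(x, y):
    """Fréchet distance between Gaussians fitted to two feature sets.

    Used here as one simple way to quantify how far apart the vision-token and
    text-token feature distributions are; smaller means better alignment.
    """
    mu_x, mu_y = x.mean(0), y.mean(0)
    cov_x, cov_y = np.cov(x, rowvar=False), np.cov(y, rowvar=False)
    covmean = linalg.sqrtm(cov_x @ cov_y).real
    return float(((mu_x - mu_y) ** 2).sum() + np.trace(cov_x + cov_y - 2 * covmean))

rng = np.random.default_rng(0)
vision_tokens = rng.normal(0.0, 1.0, size=(512, 32))   # hidden states of image tokens
text_tokens = rng.normal(0.5, 1.0, size=(512, 32))     # hidden states of text tokens
print(frechet_distance(vision_tokens, text_tokens))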
Poster
Sicheng Mo · Thao Nguyen · Xun Huang · Siddharth Iyer · Yijun Li · Yuchen Liu · Abhishek Tandon · Eli Shechtman · Krishna Kumar Singh · Yong Jae Lee · Bolei Zhou · Yuheng Li

[ Exhibit Hall I ]

Abstract
We propose X-Fusion, a framework that extends pretrained Large Language Models (LLMs) for multimodal tasks while preserving their language capabilities. X-Fusion employs a dual-tower design with modality-specific weights, keeping the LLM’s parameters frozen while integrating vision-specific information for both understanding and generation. We find that incorporating understanding-focused data improves generation quality, reducing image data noise enhances overall performance, and feature alignment accelerates convergence for smaller models but has minimal impact on larger ones. Our findings provide valuable insights into building efficient unified multimodal models.
Poster
Yuxuan Cai · Jiangning Zhang · Haoyang He · Xinwei He · Ao Tong · Zhenye Gan · Chengjie Wang · Zhucun Xue · Yong Liu · Xiang Bai

[ Exhibit Hall I ]

Abstract
The success of Large Language Models (LLMs) has inspired the development of Multimodal Large Language Models (MLLMs) for unified understanding of vision and language. However, the increasing model size and computational complexity of large-scale MLLMs (l-MLLMs) limit their use in resource-constrained scenarios. Although small-scale MLLMs (s-MLLMs) are designed to reduce computational costs, they typically suffer from performance degradation. To mitigate this limitation, we propose a novel framework to transfer knowledge from l-MLLMs to s-MLLMs. Specifically, we introduce Multimodal Distillation (MDist) to transfer the teacher model's robust representations across both visual and linguistic modalities, and Relation Distillation (RDist) to transfer the teacher model's ability to capture visual token relationships. Additionally, we propose a three-stage training scheme to fully exploit the potential of the proposed distillation strategy: 1) Distilled Pre-Training to strengthen the alignment between visual-linguistic representations in s-MLLMs, 2) Supervised Fine-Tuning to equip the s-MLLMs with multimodal understanding capacity, and 3) Distilled Fine-Tuning to refine the s-MLLM's knowledge. Our approach significantly improves s-MLLM performance without altering the model architecture. Extensive experiments and ablation studies validate the effectiveness of each proposed component. Code will be available.
Poster
Tajamul Ashraf · Janibul Bashir

[ Exhibit Hall I ]

Abstract
We focus on the source-free domain adaptive object detection (SFDAOD) problem, where source data is unavailable during adaptation and the model must adapt to the unlabeled target domain. Most approaches to this problem rely on self-supervision via a student-teacher (ST) framework, where pseudo-labels are generated by a source-pretrained model for further fine-tuning. We observe that the student model's performance often degrades drastically due to collapse of the teacher model, primarily caused by highly noisy pseudo-labels resulting from domain bias, discrepancies, and a significant domain shift across domains. To obtain reliable pseudo-labels, we propose a Target-based Iterative Query-Token Adversarial Network (TITAN), which separates the target images into two subsets: those that are similar to the source (easy) and those that are dissimilar (hard). We propose a strategy to estimate variance to partition the target domain, leveraging the insight that higher detection variances correspond to higher recall and greater similarity to the source domain. We also incorporate query-token-based adversarial modules into a student-teacher baseline framework to reduce the domain gap between the two feature representations. Experiments conducted on four natural imaging datasets and two challenging medical datasets have substantiated the superior performance of TITAN …
Poster
Yixu Wang · Yan Teng · Yingchun Wang · Xingjun Ma

[ Exhibit Hall I ]

Abstract
Parameter-Efficient Fine-Tuning (PEFT) methods like LoRA have transformed vision model adaptation, enabling the rapid deployment of customized models. However, the compactness of LoRA adaptations introduces new safety concerns, particularly their vulnerability to model extraction attacks. This paper introduces a new focus of model extraction attacks, named LoRA extraction, which extracts LoRA-adapted models built on a public pre-trained model. We then propose a novel extraction method called StolenLoRA, which trains a substitute model to extract the functionality of a LoRA-adapted model using synthetic data. StolenLoRA leverages a Large Language Model to craft effective prompts for data generation, and it incorporates a Disagreement-based Semi-supervised Learning (DSL) strategy to maximize information gain from limited queries. Our experiments demonstrate the effectiveness of StolenLoRA, achieving up to a 96.60% attack success rate with only 10k queries, even in cross-backbone scenarios where the attacker and victim models utilize different pre-trained backbones. These findings reveal the specific vulnerability of LoRA-adapted models to this type of extraction and underscore the urgent need for robust defense mechanisms tailored to PEFT methods. We also explore a preliminary defense strategy based on diversified LoRA deployments, highlighting its potential to mitigate such attacks.
Poster
Huanjin Yao · Jiaxing Huang · Yawen Qiu · Michael K. Chen · Wenzheng Liu · Wei Zhang · wenjie zeng · Xikun ZHANG · Jingyi Zhang · YuXin Song · Wenhao Wu · Dacheng Tao

[ Exhibit Hall I ]

Abstract
Reasoning plays a crucial role in advancing Multimodal Large Language Models (MLLMs) toward Artificial General Intelligence. However, existing MLLM benchmarks often fall short in precisely and comprehensively evaluating long-chain reasoning abilities from three key aspects: (1) lack of difficulty and diversity, (2) susceptibility to guessability and memorization, (3) inadequate assessment of intermediate reasoning steps. To fill this gap, we introduce **MMReason**, a new benchmark designed to precisely and comprehensively evaluate MLLM long-chain reasoning capability with diverse, open-ended, challenging questions. First, we curate challenging questions requiring multi-step reasoning from various fields (i.e., 6 disciplines) and multiple difficulty levels (i.e., from pre-university to university, and from foundational to competition tiers). Second, these questions are reformulated into an open-ended format and filtered using a multi-model voting technique to eliminate shortcut cases related to guessing and memorization, ensuring robust reasoning evaluations. Third, we annotate the questions with detailed step-by-step solutions, and design a reference-based ternary scoring mechanism to reliably assess intermediate reasoning steps. With MMReason, we benchmark popular leading MLLMs and provide an in-depth analysis of their reasoning capabilities. We hope MMReason will serve as a valuable resource for advancing MLLM reasoning research.
Poster
Subhajit Maity · Ayan Bhunia · Subhadeep Koley · Pinaki Chowdhury · Aneeshan Sain · Yi-Zhe Song

[ Exhibit Hall I ]

Abstract
Keypoint detection, integral to modern machine perception, faces challenges in few-shot learning, particularly when source data from the same distribution as the query is unavailable. This gap is addressed by leveraging sketches, a popular form of human expression, providing a source-free alternative. However, challenges arise in mastering cross-modal embeddings and handling user-specific sketch styles. Our proposed framework overcomes these hurdles with a prototypical setup, combined with a grid-based locator and prototypical domain adaptation. We also demonstrate success in few-shot convergence across novel keypoints and classes through extensive experiments.
Poster
Luyao Tang · Kunze Huang · Yuxuan Yuan · Chenxin Li · Xiaotong Tu · Xinghao Ding · Chaoqi Chen · Yue Huang

[ Exhibit Hall I ]

Abstract
Human perceptual systems excel at inducing and recognizing objects across both known and novel categories, a capability far beyond current machine learning frameworks. While generalized category discovery (GCD) aims to bridge this gap, existing methods predominantly focus on optimizing objective functions. We present an orthogonal solution, inspired by the human cognitive process for novel object understanding: decomposing objects into visual primitives and establishing cross-knowledge comparisons. We propose ConGCD, which establishes primitive-oriented representations through high-level semantic reconstruction, binding intra-class shared attributes via deconstruction. Mirroring human preference diversity in visual processing, where distinct individuals leverage dominant or contextual cues, we implement dominant and contextual consensus units to capture class-discriminative patterns and inherent distributional invariants, respectively. A consensus scheduler dynamically optimizes activation pathways, with final predictions emerging through multiplex consensus integration. Extensive evaluations across coarse- and fine-grained benchmarks demonstrate ConGCD's effectiveness as a consensus-aware paradigm.
Poster
Jincheng Li · Chunyu Xie · Ji Ao · Dawei Leng · Yuhui Yin

[ Exhibit Hall I ]

Abstract
Large multimodal models (LMMs) have garnered widespread attention and interest within the artificial intelligence research and industrial communities, owing to their remarkable capability in multimodal understanding, reasoning, and in-context learning, among others. While LMMs have demonstrated promising results in tackling multimodal tasks like image captioning, visual question answering, and visual grounding, the object detection capabilities of LMMs exhibit a significant gap compared to specialist detectors. To bridge the gap, we depart from the conventional methods of integrating heavy detectors with LMMs and propose LMM-Det, a simple yet effective approach that leverages a large multimodal model for vanilla object detection without relying on specialized detection modules. Specifically, we conduct a comprehensive exploratory analysis of what happens when a large multimodal model meets object detection, revealing that the recall rate degrades significantly compared with specialist detection models. To mitigate this, we propose to increase the recall rate by introducing data distribution adjustment and inference optimization tailored for object detection. We re-organize the instruction conversations to enhance the object detection capabilities of large multimodal models. We claim that a large multimodal model possesses detection capability without any extra modules such as a specialist detection model or a region proposal network. Extensive experiments support our claim and show the effectiveness of …
Poster
Dongyue Wu · Zilin Guo · Jialong Zuo · Nong Sang · Changxin Gao

[ Exhibit Hall I ]

Abstract
The ever-growing size of training datasets enhances the generalization capability of modern machine learning models but also incurs exorbitant computational costs. Existing data pruning approaches aim to accelerate training by removing those less important samples. However, they often rely on gradients or proxy models, leading to prohibitive additional costs of gradient back-propagation and proxy model training. In this paper, we propose Partial Forward Blocking (PFB), a novel framework for lossless training acceleration. The efficiency of PFB stems from its unique pipeline: sample importance is assessed based on features extracted from the shallow layers of the target model. Less important samples are then pruned, allowing only the retained ones to proceed with the subsequent forward pass and loss back-propagation. This mechanism significantly reduces the computational overhead of deep-layer forward passes and back-propagation for pruned samples, while also eliminating the need for auxiliary backward computations and proxy model training. Moreover, PFB introduces probability density as an indicator of sample importance. Combined with an adaptive distribution estimation module, our method dynamically prioritizes relatively rare samples, aligning with the constantly evolving training state. Extensive experiments demonstrate the significant superiority of PFB in performance and speed. On ImageNet, PFB achieves a 0.5% accuracy improvement and 33% training time reduction …
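A toy version of this pipeline: shallow features are computed for the whole batch, a simple density estimate flags the most common (least informative) samples, and only the retained samples go through the deep layers and backpropagation. The model split, the Gaussian-kernel density estimate, and the keep ratio are illustrative assumptions, not PFB's adaptive distribution estimation module.

import torch
import torch.nn as nn

# Stand-in "target model" split into shallow and deep parts.
shallow = nn.Sequential(nn.Flatten(), nn.Linear(32 * 32 * 3, 256), nn.ReLU())
deep = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 10))
criterion = nn.CrossEntropyLoss()

def partial_forward_blocking(x, y, keep_ratio=0.7, bandwidth=1.0):
    with torch.no_grad():
        f = shallow(x)                                    # cheap shallow features
        d2 = torch.cdist(f, f).pow(2)
        density = torch.exp(-d2 / (2 * bandwidth ** 2)).mean(dim=1)
    k = max(1, int(keep_ratio * x.shape[0]))
    keep = density.topk(k, largest=False).indices         # keep the rarest samples
    logits = deep(shallow(x[keep]))                       # re-run with grad for the kept subset only
    return criterion(logits, y[keep])

loss = partial_forward_blocking(torch.randn(64, 3, 32, 32), torch.randint(0, 10, (64,)))
loss.backward()
print(float(loss))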
Poster
Qianhao Yuan · Qingyu Zhang · yanjiang liu · Jiawei Chen · Yaojie Lu · Hongyu Lin · Jia Zheng · Xianpei Han · Le Sun

[ Exhibit Hall I ]

Abstract
Multimodal Large Language Models (MLLMs) suffer from high computational costs due to their massive size and the large number of visual tokens. In this paper, we investigate layer-wise redundancy in MLLMs by introducing a novel metric, Layer Contribution (LC), which quantifies the impact of a layer's transformations on visual and text tokens, respectively. The calculation of LC involves measuring the divergence in model output that results from removing the layer's transformations on the specified tokens. Our pilot experiment reveals that many layers of MLLMs exhibit minimal contribution during the processing of visual tokens. Motivated by this observation, we propose ShortV, a training-free method that leverages LC to identify ineffective layers and freezes visual token updates in these layers. Experiments show that ShortV can freeze visual tokens in approximately 60% of the MLLM layers, thereby dramatically reducing computational costs related to updating visual tokens. For example, it achieves a 50% reduction in FLOPs on LLaVA-NeXT-13B while maintaining superior performance. The code will be publicly available.
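One way to read the Layer Contribution idea: run the model with and without a given layer's transformation applied to the visual tokens and measure how much the output distribution changes. The toy below uses a small transformer stack and KL divergence on the final token's distribution; both the architecture and the divergence choice are assumptions rather than ShortV's exact setup.

import torch
import torch.nn as nn
import torch.nn.functional as F

layers = nn.ModuleList([nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
                        for _ in range(6)])
head = nn.Linear(64, 1000)                        # stand-in for the LM head

def forward(tokens, skip_layer=None, visual_mask=None):
    h = tokens
    for i, layer in enumerate(layers):
        out = layer(h)
        if i == skip_layer:                       # freeze visual tokens at this layer
            out = torch.where(visual_mask[..., None], h, out)
        h = out
    return F.log_softmax(head(h[:, -1]), dim=-1)  # distribution for the last (text) token

tokens = torch.randn(2, 32, 64)
visual_mask = torch.zeros(2, 32, dtype=torch.bool)
visual_mask[:, :24] = True                        # first 24 positions play the role of image tokens
ref = forward(tokens)
for i in range(len(layers)):
    lc = F.kl_div(forward(tokens, skip_layer=i, visual_mask=visual_mask), ref,
                  log_target=True, reduction='batchmean')
    print(f"layer {i}: contribution on visual tokens = {lc.item():.4f}")  # small => candidate for freezing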
Poster
Haoran Chen · Ping Wang · Zihan Zhou · Xu Zhang · Zuxuan Wu · Yu-Gang Jiang

[ Exhibit Hall I ]

Abstract
Class-incremental learning (CIL) enables models to learn new classes progressively while preserving knowledge of previously learned ones. Recent advances in this field have shifted towards parameter-efficient fine-tuning techniques, with many approaches building upon the framework that maintains a pool of learnable prompts. Although effective, these methods introduce substantial computational overhead, primarily due to prompt pool querying and increased input sequence lengths from prompt concatenation. In this work, we present a novel prompt-based approach that addresses this limitation. Our method trains a single set of shared prompts across all tasks and, rather than concatenating prompts to the input, directly modifies the CLS token's attention computation by adding the prompts to it. This simple and lightweight design not only significantly reduces computational complexity—both in terms of inference costs and the number of trainable parameters—but also eliminates the need to optimize prompt lengths for different downstream tasks, offering a more efficient yet powerful solution for rehearsal-free class-incremental learning. Extensive experiments across a diverse range of CIL benchmarks demonstrate the effectiveness of our approach, highlighting its potential to establish a new prompt-based CIL paradigm. Furthermore, experiments on general recognition benchmarks beyond the CIL setting also show strong performance, positioning our method as a promising …
Poster
Liming Lu · Shuchao Pang · Xu Zheng · Xiang GU · Anan Du · Yunhuai Liu · Yongbin Zhou

[ Exhibit Hall I ]

Abstract
Adversarial robustness distillation (ARD) aims to transfer both performance and robustness from a teacher model to a lightweight student model, enabling resilient performance in resource-constrained scenarios. Though existing ARD approaches enhance the student model's robustness, an inevitable by-product is degraded performance on clean examples. We summarize the causes of this problem inherent in existing dual-teacher methods as: ① the divergent optimization objectives of the dual-teacher models, i.e., the clean and robust teachers, impede effective knowledge transfer to the student model, and ② the iteratively generated adversarial examples during training lead to performance deterioration of the robust teacher model. To address these challenges, we propose a novel Cyclic Iterative ARD (CIARD) method with two key innovations: ① a multi-teacher framework with contrastive push-loss alignment to resolve conflicts in dual-teacher optimization objectives, and ② continuous adversarial retraining to maintain dynamic teacher robustness against performance degradation from the varying adversarial examples. Extensive experiments on CIFAR-10, CIFAR-100, and Tiny-ImageNet demonstrate that CIARD achieves remarkable performance with an average 3.53% improvement in adversarial defense rates across various attack scenarios and a 5.87% increase in clean sample accuracy, establishing a new benchmark for balancing model robustness and generalization. Our code is available at https://github.com/CIARD2025/CIARD.
Poster
Wan Jiang · He Wang · Xin Zhang · Dan Guo · Zhaoxin Fan · Yunfeng Diao · Richang Hong

[ Exhibit Hall I ]

Abstract
Score-based Generative Models (SGMs) have demonstrated remarkable generalization capabilities, e.g., generating unseen but natural data. However, the greater the generalization power, the more likely the unintended generalization, and the more dangerous the abuse. Despite these concerns, unlearning for SGMs remains unexplored. To fill this gap, we first examine the current 'gold standard' in Machine Unlearning (MU), i.e., re-training the model after removing the undesirable training data, and find it does not work in SGMs. Further analysis of score functions reveals that the MU 'gold standard' does not alter the original score function, which explains its ineffectiveness. Building on this insight, we propose the first Moderated Score-based Generative Model (MSGM), which introduces a novel score adjustment strategy that redirects the score function away from undesirable data during the continuous-time stochastic differential equation process. Albeit designed for SGMs, MSGM is a general and flexible MU framework compatible with diverse diffusion architectures, training strategies, and downstream tasks. The code will be shared upon acceptance.
Poster
David Fan · Shengbang Tong · Jiachen Zhu · Koustuv Sinha · Zhuang Liu · Xinlei Chen · Michael Rabbat · Nicolas Ballas · Yann LeCun · Amir Bar · Saining Xie

[ Exhibit Hall I ]

Abstract
Visual Self-Supervised Learning (SSL) currently underperforms Contrastive Language-Image Pretraining (CLIP) in multimodal settings such as Visual Question Answering (VQA). This multimodal gap is often attributed to the semantics introduced by language supervision, even though visual SSL and CLIP models are often trained on different data. In this work, we ask the question: "Do visual self-supervised approaches lag behind CLIP due to the lack of language supervision, or differences in the training data?" We study this question by training both visual SSL and CLIP models on the same MetaCLIP data, and leveraging VQA as a diverse testbed for vision encoders. In this controlled setup, visual SSL models scale better than CLIP models in terms of data and model capacity, and visual SSL performance does not saturate even after scaling up to 7B parameters. Consequently, we observe visual SSL methods achieve CLIP-level performance on a wide range of VQA and classic vision benchmarks. These findings demonstrate that pure visual SSL can match language-supervised visual pretraining at scale, opening new opportunities for vision-centric representation learning.
Poster
Jianfeng Dong · Danfeng Luo · Daizong Liu · Jie Sun · Xiaoye Qu · Xun Yang · Dongsheng Liu · Xun Wang

[ Exhibit Hall I ]

Abstract
Unsupervised Fine-grained Visual Representation Learning (FVRL) aims to learn discriminative features to distinguish subtle differences among visually similar categories without using labeled fine-grained data. Existing works, which typically learn representations from target data, often struggle to capture subtle inter-class variations due to limited prior fine-grained knowledge. To alleviate this, this paper proposes LLM-assisted Entropy-based Adaptive Distillation (LEAD), a novel unsupervised FVRL framework that selectively distills fine-grained knowledge from a powerful teacher model built upon pre-trained models. Specifically, we first harness the powerful reasoning capabilities of Large Language Models (LLMs) to generate contextual knowledge of fine-grained category-aware descriptions, enriching semantic priors in the teacher model. These descriptions are then used to form a prototype-driven fine-grained classifier, which acts as an assistant to generate rich knowledge with a frozen vision-language model. Besides, to achieve effective knowledge transfer, we further introduce an entropy-based adaptive mechanism, which dynamically adjusts the distillation strength based on the information entropy to identify and prioritize valuable knowledge. Extensive experimental results on three fine-grained datasets demonstrate the effectiveness and efficiency of our proposed LEAD for unsupervised FVRL. Our source code is available at https://anonymous.4open.science/r/EAD-FFAB.
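A compact illustration of entropy-based adaptive distillation: the per-sample KL distillation term is scaled by the teacher's certainty (low entropy, stronger transfer). The specific weighting below is an assumption for illustration, not LEAD's formulation.

import torch
import torch.nn.functional as F

def entropy_weighted_distillation(student_logits, teacher_logits, T=2.0):
    """KL distillation loss whose per-sample strength grows with teacher certainty."""
    t_prob = F.softmax(teacher_logits / T, dim=-1)
    s_logp = F.log_softmax(student_logits / T, dim=-1)
    kl = (t_prob * (t_prob.clamp_min(1e-12).log() - s_logp)).sum(-1)       # per-sample KL
    entropy = -(t_prob * t_prob.clamp_min(1e-12).log()).sum(-1)            # teacher entropy
    weight = 1.0 - entropy / torch.log(torch.tensor(float(teacher_logits.shape[-1])))
    return (weight * kl).mean() * T * T

loss = entropy_weighted_distillation(torch.randn(8, 100), torch.randn(8, 100))
print(float(loss))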
Poster
Chenxin Li · Yifan Liu · Panwang Pan · Hengyu Liu · Xinyu Liu · Wuyang Li · Cheng Wang · Weihao Yu · Yiyang LIN · Yixuan Yuan

[ Exhibit Hall I ]

Abstract
Developing systems that can interpret diverse real-world signals remains a fundamental challenge in multimodal learning. Current approaches to multimodal fusion face significant obstacles stemming from inherent modal heterogeneity. While existing methods attempt to enhance fusion through cross-modal alignment or interaction mechanisms, they often struggle to balance effective integration with preserving modality-specific information, and frequently neglect crucial contextual nuances unique to each modality. We introduce ModBridge, a novel framework grounded in conditional information maximization principles that addresses these limitations. Our approach reframes multimodal fusion through two key innovations: (1) we formulate fusion as a conditional mutual information optimization problem with an integrated protective margin that simultaneously encourages cross-modal information sharing while safeguarding against over-fusion that could eliminate unique modal characteristics; and (2) we enable fine-grained contextual fusion by leveraging modality-specific conditions (such as audio event detection signals) to guide the integration process. Comprehensive evaluations across multiple benchmarks demonstrate that ModBridge consistently outperforms state-of-the-art multimodal architectures, establishing a more principled and effective approach to multimodal learning that better captures complementary information across diverse input signals.
Poster
Qiyu Xu · Zhanxuan Hu · Yu Duan · Ercheng Pei · Yonghang Tai

[ Exhibit Hall I ]

Abstract
Generalized Category Discovery (GCD) aims to classify unlabeled data from both known and unknown categories by leveraging knowledge from labeled known categories. While existing methods have made notable progress, they often overlook a hidden stumbling block in GCD: distracted attention. Specifically, when processing unlabeled data, models tend to focus not only on key objects in the image but also on task-irrelevant background regions, leading to suboptimal feature extraction. To remove this stumbling block, we propose Attention Focusing (AF), an adaptive mechanism designed to sharpen the model's focus by pruning non-informative tokens. AF consists of two simple yet effective components: Token Importance Measurement (TIME) and Token Adaptive Pruning (TAP), working in a cascade. TIME quantifies token importance across multiple scales, while TAP prunes non-informative tokens by utilizing the multi-scale importance scores provided by TIME. AF is a lightweight, plug-and-play module that integrates seamlessly into existing GCD methods with minimal computational overhead. When incorporated into one prominent GCD method, SimGCD, AF achieves up to 15.4% performance improvement over the baseline with minimal computational overhead. The implementation code is provided at https://anonymous.4open.science/r/AFGCD-E652.
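A minimal sketch of the score-and-prune idea behind the TIME/TAP cascade: score patch tokens with an importance signal and keep only the top fraction. Here the importance score is the CLS token's attention averaged over heads, a single-scale stand-in for the multi-scale scores described above; the keep ratio is likewise an assumption.

import torch

def prune_tokens(tokens, cls_attn, keep_ratio=0.6):
    """Drop the least informative patch tokens.

    tokens:   (B, N, D) patch tokens (CLS excluded)
    cls_attn: (B, H, N) attention from CLS to each patch token, per head
    """
    importance = cls_attn.mean(dim=1)                      # (B, N) head-averaged importance
    k = max(1, int(keep_ratio * tokens.shape[1]))
    idx = importance.topk(k, dim=1).indices                # indices of informative tokens
    idx = idx.unsqueeze(-1).expand(-1, -1, tokens.shape[-1])
    return torch.gather(tokens, 1, idx)                    # (B, k, D)

tokens = torch.randn(2, 196, 384)
cls_attn = torch.rand(2, 6, 196).softmax(dim=-1)
print(prune_tokens(tokens, cls_attn).shape)                # torch.Size([2, 117, 384])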
Poster
Chenyu Mu · Yijun Qu · Jiexi Yan · Erkun Yang · Cheng Deng

[ Exhibit Hall I ]

Abstract
The sample selection approach is a widely adopted strategy for learning with noisy labels, where examples with lower losses are effectively treated as clean during training. However, this clean set often becomes dominated by easy examples, limiting the model’s meaningful exposure to more challenging cases and reducing its expressive power. To overcome this limitation, we introduce a novel metric called Dynamic Center Distance (DCD), which can quantify sample difficulty and provide information that critically complements loss values. Unlike approaches that rely on predictions, DCD is computed in feature space as the distance between sample features and a dynamically updated center, established through a proposed meta-learning framework. Building on preliminary semi-supervised training that captures fundamental data patterns, we incorporate DCD to further refine the classification loss, down-weighting well-classified examples and strategically focusing training on a sparse set of hard instances. This strategy prevents easy examples from dominating the classifier, leading to more robust learning. Extensive experiments across multiple benchmark datasets, including synthetic and real-world noise settings, as well as natural and medical images, consistently demonstrate the effectiveness of our method.
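A minimal sketch of reweighting the classification loss with a feature-space distance to a dynamically updated center follows. The paper establishes the center through a meta-learning framework; the exponential-moving-average update and the weight normalization below are simplifying assumptions made only for illustration.

```python
import torch
import torch.nn.functional as F

class DynamicCenter:
    # The paper learns the center via meta-learning; an EMA update is used
    # here purely as a stand-in.
    def __init__(self, dim, momentum=0.99):
        self.center = torch.zeros(dim)
        self.m = momentum

    def update(self, feats):                       # feats: (B, dim)
        self.center = self.m * self.center + (1 - self.m) * feats.mean(0)

    def dcd(self, feats):
        # Dynamic Center Distance: per-sample distance to the current center.
        return torch.norm(feats - self.center, dim=1)

def dcd_weighted_ce(logits, targets, dcd_scores, gamma=1.0):
    # Emphasize hard (large-DCD) samples so easy examples do not dominate.
    w = (dcd_scores / (dcd_scores.mean() + 1e-8)) ** gamma
    return (w * F.cross_entropy(logits, targets, reduction="none")).mean()
```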
Poster
Zhengzhuo Xu · Sinan Du · Yiyan Qi · Siwen Lu · Chengjin Xu · Chun Yuan · Jian Guo

[ Exhibit Hall I ]

Abstract
Multimodal Large Language Models (MLLMs) have emerged as powerful tools for chart comprehension. However, they rely heavily on content extracted via OCR, which leads to numerical hallucinations when chart textual annotations are sparse. While existing methods focus on scaling instructions, they fail to address the fundamental challenge, i.e., reasoning with visual perception. In this paper, we identify a critical observation: MLLMs exhibit weak grounding in chart elements and proportional relationships, as evidenced by their inability to localize key positions to match their reasoning. To bridge this gap, we propose PointCoT, which integrates reflective interaction into chain-of-thought reasoning for charts. By prompting MLLMs to generate bounding boxes and re-render charts based on location annotations, we establish connections between textual reasoning steps and visual grounding regions. We further introduce an automated pipeline to construct ChartPoint-SFT-62k, a dataset featuring 19.2K high-quality chart samples with step-by-step CoT, bounding boxes, and re-rendered visualizations. Leveraging this data, we develop two instruction-tuned models, ChartPointQ2 and ChartPointQ2.5, which outperform state-of-the-art models across several chart benchmarks, e.g., +5.04\% on ChartBench.
Poster
Xinyu Sun · Zhikun Zhao · congyan lang · Bing Li · Juan Wang

[ Exhibit Hall I ]

Abstract
The image signal processing (ISP) pipeline is responsible for converting the RAW images collected from the sensor into high-quality RGB images. It contains a series of image processing modules and associated ISP hyperparameters. Recent learning-based approaches aim to automate ISP hyperparameter optimization using solely image data. However, their unimodal nature limits their ability to capture richer contextual information, reducing robustness and adaptability across diverse application scenarios. To address this limitation, we propose a Multimodal Large Language Model (MLLM)-guided ISP hyperparameter optimization framework, which integrates textual insights generated by MLLMs into the optimization process. By incorporating both high-level semantic cues and low-level image quality descriptors, our method enhances contextual understanding and task adaptability. Additionally, we introduce a Dynamic Pair Generation (DPG) refinement strategy based on Direct Preference Optimization (DPO), facilitating efficient preference alignment without the need for extensive human-labeled data. This novel framework not only improves the directional consistency of optimization but also significantly reduces the computational and data preparation overhead. We validate our proposed methods on both high-level and low-level vision tasks, demonstrating superior performance compared to existing methods.
Poster
Xinyu Fang · Zhijian Chen · Kai Lan · Lixin Ma · Shengyuan Ding · Yingji Liang · Xiangyu Zhao · Farong Wen · Zicheng Zhang · Guofeng Zhang · Haodong Duan · Kai Chen · Dahua Lin

[ Exhibit Hall I ]

Abstract
Creativity is a fundamental aspect of intelligence, involving the ability to generate novel and appropriate solutions across diverse contexts. While Large Language Models (LLMs) have been extensively evaluated for their creative capabilities, the assessment of Multimodal Large Language Models (MLLMs) in this domain remains largely unexplored. To address this gap, we introduce Creation-MMBench, a multimodal benchmark specifically designed to evaluate the creative capabilities of MLLMs in real-world, image-based tasks. The benchmark comprises 765 test cases spanning 51 fine-grained tasks. To ensure rigorous evaluation, we define instance-specific evaluation criteria for each test case, guiding the assessment of both general response quality and factual consistency with visual inputs. Experimental results reveal that current open-source MLLMs significantly underperform compared to proprietary models in creative tasks. Furthermore, our analysis demonstrates that visual fine-tuning can negatively impact the base LLM’s creative abilities. Creation-MMBench provides valuable insights for advancing MLLM creativity and establishes a foundation for future improvements in multimodal generative intelligence. Full data and evaluation code will be released soon.
Poster
Jiawei Gu · Ziyue Qiao · Zechao Li

[ Exhibit Hall I ]

Abstract
Out-of-Distribution (OOD) detection is critical for safely deploying deep models in open-world environments, where inputs may lie outside the training distribution. During inference on a model trained exclusively with In-Distribution (ID) data, we observe a salient \emph{gradient} phenomenon: around an ID sample, the local gradient directions for “enhancing” that sample’s predicted class remain relatively consistent, whereas OOD samples—unseen in training—exhibit disorganized or conflicting gradient directions in the same neighborhood. Motivated by this observation, we propose an inference-stage technique to \emph{short-circuit} those feature coordinates that spurious gradients exploit to inflate OOD confidence, while leaving ID classification largely intact. To circumvent the expense of recomputing the logits after this gradient short-circuit, we further introduce a local first-order approximation that accurately captures the post-modification outputs without a second forward pass. Experiments on standard OOD benchmarks show our approach yields substantial improvements. Moreover, the method is lightweight and requires minimal changes to the standard inference pipeline, offering a practical path toward robust OOD detection in real-world applications.
Poster
Xiaofei Hui · Haoxuan Qu · Ping Hu · Hossein Rahmani · Jun Liu

[ Exhibit Hall I ]

Abstract
Alongside the rapid development of Large Multimodal Models (LMMs) like GPT-4V, privacy concerns also rise. As LMMs are commonly deployed as cloud services, users are typically required to upload their personal images and videos to the cloud to access these services, raising great concerns about visual privacy leakage. In this paper, we investigate the critical but underexplored problem of keeping an LMM's good performance while protecting visual privacy information in the input data. We tackle this problem in the practical scenario where the LMM remains a black box, i.e., we can only access its input and output without knowing the LMM's internal information. To address such a challenging problem, we propose a new Privacy-Aware Boundary Probing (PABP) framework, which, from a novel perspective, converts this problem into a privacy optimization problem guided by the decision boundary between the "satisfactory" and "unsatisfactory" LMM utility states. We propose two tailored schemes, Gradually-Expanding-Probing (GEP) and Prior-Guided-Probing (PGP), to maintain satisfactory LMM performance while achieving privacy protection. We show the effectiveness of our framework on different benchmarks (code will be released).
Poster
Shiming Chen · Bowen Duan · Salman Khan · Fahad Khan

[ Exhibit Hall I ]

Abstract
Large-scale vision-language models (VLMs), such as CLIP, have achieved remarkable success in zero-shot learning (ZSL) by leveraging large-scale visual-text pair datasets. However, these methods often lack interpretability, as they compute the similarity between an entire query image and the embedded category words, making it difficult to explain their predictions. One approach to address this issue is to develop interpretable models by integrating language, where classifiers are built using discrete attributes, similar to human perception. This introduces a new challenge: how to effectively align local visual features with corresponding attributes based on pre-trained VLMs. To tackle this, we propose LaZSL, a locally-aligned vision-language model for interpretable ZSL. LaZSL employs local visual-semantic alignment via optimal transport to perform interaction between visual regions and their associated attributes, facilitating effective alignment and providing interpretable similarity without the need for additional training. Extensive experiments demonstrate that our method offers several advantages, including enhanced interpretability, improved accuracy, and strong domain generalization.
Poster
Yang Tian · Zheng Lu · Mingqi Gao · Zheng Liu · Bo Zhao

[ Exhibit Hall I ]

Abstract
The ability of machines to fully comprehend scientific papers reflects a high level of Artificial General Intelligence; it requires reasoning across fragmented and heterogeneous sources of information, a complex and practically significant challenge. While Vision-Language Models (VLMs) have made remarkable strides in various tasks, particularly those involving reasoning over evidence from a single image or text page, their ability to use cross-source information for reasoning remains an open problem. This work presents MMCR, a high-difficulty benchmark designed to evaluate VLMs' capacity for reasoning with cross-source information from scientific papers. The benchmark comprises 276 high-quality questions, meticulously annotated by humans across 7 subjects and 10 task types. Experiments with 18 VLMs demonstrate that cross-source reasoning presents a substantial challenge for existing models. Notably, even the top-performing model, GPT-4o, achieved only 48.55% overall accuracy, with just 20% accuracy in multi-table comprehension tasks, while the second-best model, Qwen2.5-VL-72B, reached 39.86% overall accuracy. These results highlight the pressing need to develop VLMs capable of effectively utilizing cross-source information for reasoning.
Poster
Debasmit Das · Hyoungwoo Park · Munawar Hayat · Seokeon Choi · Sungrack Yun · Fatih Porikli

[ Exhibit Hall I ]

Abstract
Foundation models are pre-trained on large-scale datasets and subsequently fine-tuned on small-scale datasets using parameter-efficient fine-tuning (PEFT) techniques like low-rank adapters (LoRA). In most previous works, LoRA weight matrices are randomly initialized with a fixed rank across all attachment points. In this paper, we improve the convergence and final performance of LoRA fine-tuning using our proposed data-driven weight initialization method, ConsNoTrainLoRA (CNTLoRA). We express LoRA initialization as a domain shift problem where we use multiple constraints relating the pre-training and fine-tuning activations. By reformulating these constraints, we obtain a closed-form estimate of LoRA weights that depends on pre-training weights and fine-tuning activation vectors and hence requires no training during initialization. This weight estimate is decomposed to initialize the up and down matrices, with the flexibility of variable ranks. With the proposed initialization method, we fine-tune on downstream tasks such as image generation, image classification and image understanding. Both quantitative and qualitative results demonstrate that CNTLoRA outperforms standard and data-driven weight initialization methods. Extensive analyses and ablations further elucidate the design choices of our framework, providing an optimal recipe for faster convergence and enhanced performance.
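The final step of turning a closed-form weight estimate into LoRA factors can be sketched generically: factor the estimate with a truncated SVD and split the singular values between the two matrices. The estimate itself (everything upstream of `delta_w_hat` below) is the paper's contribution and is not reproduced here; the snippet shows only a hypothetical decomposition step.

```python
import torch

def init_lora_from_estimate(delta_w_hat: torch.Tensor, rank: int):
    # delta_w_hat: (d_out, d_in) closed-form estimate of the weight update.
    # Truncated SVD gives a rank-`rank` factorization delta_w_hat ~= B @ A.
    U, S, Vh = torch.linalg.svd(delta_w_hat, full_matrices=False)
    B = U[:, :rank] * S[:rank].sqrt()               # "up" matrix,   (d_out, rank)
    A = S[:rank].sqrt().unsqueeze(1) * Vh[:rank]    # "down" matrix, (rank, d_in)
    return A, B
```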
Poster
Xiao Zhang · Fei Wei · Yong Wang · Wenda Zhao · Feiyi Li · Xiangxiang Chu

[ Exhibit Hall I ]

Abstract
Zero-shot domain adaptation (ZSDA) presents substantial challenges due to the lack of images in the target domain. Previous approaches leverage Vision-Language Models (VLMs) to tackle this challenge, exploiting their zero-shot learning capabilities. However, these methods primarily address domain distribution shifts and overlook the misalignment between the detection task and VLMs, which rely on manually crafted prompts. To overcome these limitations, we propose the unified prompt and representation enhancement (UPRE) framework, which jointly optimizes both textual prompts and visual representations. Specifically, our approach introduces a multi-view domain prompt that combines linguistic domain priors with detection-specific knowledge, and a visual representation enhancement module that produces domain style variations. Furthermore, we introduce multi-level enhancement strategies, including relative domain distance and positive-negative separation, which align multi-modal representations at the image level and capture diverse visual representations at the instance level, respectively. Extensive experiments conducted on nine benchmark datasets demonstrate the superior performance of our framework in ZSDA detection scenarios.
Poster
Youneng Bao · Yiping Liu · Zhuo Chen · Yongsheng Liang · Mu Li · Kede Ma

[ Exhibit Hall I ]

Abstract
The "scale-is-everything" paradigm in machine learning has resulted in escalating computational and storage demands as datasets and models grow increasingly large. Dataset distillation addresses this challenge by compressing datasets into compact latent representations that generate synthetic data capable of matching the performance of models trained on the original data, which we formulate as a rate-utility optimization problem. Existing dataset distillation methods fail to achieve Pareto optimality due to their inability to jointly optimize compression rate and utility within a differentiable framework. Drawing inspiration from learned image compression (LIC), we propose a unified framework where latent representations are modeled as optimizable parameter grids (codes) together with a generator (decoder) that transforms codes into synthesized images. This approach subsumes nearly all existing latent representations while explicitly modeling the rate as an optimizable term through precise entropy estimation of the latent. To quantify compression efficiency, we introduce bits per class (BPC), a novel metric for distilled datasets. We optimize the uniform latent representation according to the joint rate-utility trade-off and achieve state-of-the-art results on CIFAR-10/100 and ImageNet-128. For instance, on the ImageNet-Subset dataset, our method achieves a 170$\times$ compression rate improvement over the baseline approach while maintaining comparable utility. The framework is compatible with most existing distillation algorithms …
Poster
Shangbo Wu · Yu-an Tan · Ruinan Ma · Wencong Ma · Dehua Zhu · Yuanzhang Li

[ Exhibit Hall I ]

Abstract
The ability of deep neural networks (DNNs) comes from extracting and interpreting features from the data provided. By exploiting intermediate features in DNNs instead of relying on hard labels, we craft adversarial perturbations that generalize more effectively, boosting black-box transferability. In previous work, these features ubiquitously come from supervised learning. Inspired by the exceptional synergy between self-supervised learning and the Transformer architecture, this paper explores whether exploiting self-supervised Vision Transformer (ViT) representations can improve adversarial transferability. We present dSVA, a generative dual self-supervised ViT features attack that exploits both global structural features from contrastive learning (CL) and local textural features from masked image modeling (MIM), the self-supervised learning paradigm duo for ViTs. We design a novel generative training framework that incorporates a generator to create black-box adversarial examples, and strategies to train the generator by exploiting joint features and the attention mechanism of self-supervised ViTs. Our findings show that CL and MIM enable ViTs to attend to distinct feature tendencies, which, when exploited in tandem, yield great adversarial generalizability. By disrupting dual deep features distilled by self-supervised ViTs, we achieve remarkable black-box transferability to models of various architectures, outperforming state-of-the-art methods.
Poster
Hai Huang · Yan Xia · Shulei Wang · Hanting Wang · Minghui Fang · Shengpeng Ji · Sashuai Zhou · Tao Jin · Zhou Zhao

[ Exhibit Hall I ]

Abstract
This paper extends Cross Modal Generalization (CMG) to open-set environments by proposing the more challenging Open-set Cross Modal Generalization (OSCMG) task. This task evaluates multimodal unified representations in open-set conditions, addressing the limitations of prior closed-set cross-modal evaluations. OSCMG requires not only cross-modal knowledge transfer but also robust generalization to unseen classes within new modalities, a scenario frequently encountered in real-world applications. Existing multimodal unified representation work lacks consideration for open-set environments. To tackle this, we propose MICU, comprising two key components: Fine-Coarse Masked multimodal InfoNCE (FCMI) and Cross modal Unified Jigsaw Puzzles (CUJP). FCMI enhances multimodal alignment by applying contrastive learning at both holistic semantic and temporal levels, incorporating masking to enhance generalization. CUJP enhances feature diversity and model uncertainty by integrating modality-agnostic feature selection with self-supervised learning, thereby strengthening the model’s ability to handle unknown categories in open-set tasks. Extensive experiments on CMG and the newly proposed OSCMG validate the effectiveness of our approach. Code is available in supplementary material.
Poster
ZUYU ZHANG · Ning Chen · Yongshan Liu · Qinghua Zhang · Xu Zhang

[ Exhibit Hall I ]

Abstract
Single Domain Generalization (SDG) aims to develop models capable of generalizing to unseen target domains using only one source domain, a task complicated by substantial domain shifts and limited data diversity. Existing SDG approaches primarily rely on data augmentation techniques, which struggle to effectively adapt training dynamics to accommodate large domain shifts. To address this, we propose LEAwareSGD, a novel Lyapunov Exponent (LE)-guided optimization approach inspired by dynamical systems theory. By leveraging LE measurements to modulate the learning rate, LEAwareSGD encourages model training near the edge of chaos, a critical state that optimally balances stability and adaptability. This dynamic adjustment allows the model to explore a wider parameter space and capture more generalizable features, ultimately enhancing the model's generalization capability. Extensive experiments on PACS, OfficeHome, and DomainNet demonstrate that LEAwareSGD yields substantial generalization gains, achieving up to 9.47% improvement on PACS in low-data regimes. These results underscore the effectiveness of training near the edge of chaos for enhancing model generalization capability in SDG tasks.
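The core mechanic, modulating the learning rate with a Lyapunov exponent (LE) estimate so that training stays near the edge of chaos (LE close to zero), can be sketched as below. The finite-time, two-trajectory LE estimate and the exponential learning-rate rule are placeholder assumptions; the paper's actual estimator and schedule may differ.

```python
import math
import torch

def finite_time_le(step_fn, w, eps=1e-4, n_steps=20):
    # Crude finite-time Lyapunov exponent: how fast two nearby parameter
    # trajectories diverge under the same update rule step_fn(w) -> w_next.
    w_a = w.clone()
    w_b = w + eps * torch.randn_like(w)
    d0 = torch.norm(w_b - w_a)
    for _ in range(n_steps):
        w_a, w_b = step_fn(w_a), step_fn(w_b)
    d1 = torch.norm(w_b - w_a) + 1e-12
    return (torch.log(d1 / d0) / n_steps).item()

def le_aware_lr(base_lr, le, target=0.0, gain=0.5):
    # Shrink the step size when dynamics look chaotic (le > target),
    # enlarge it when they look overly stable (le < target).
    return base_lr * math.exp(-gain * (le - target))
```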
Poster
Xiaoyue Mi · Fan Tang · Zonghan Yang · Danding Wang · Juan Cao · Peng Li · Yang Liu

[ Exhibit Hall I ]

Abstract
Despite the remarkable advances that have been made in continual learning, the adversarial vulnerability of such methods has not been fully discussed. We delve into the adversarial robustness of memory-based continual learning algorithms and observe limited robustness improvement when directly applying adversarial training techniques. Our preliminary studies reveal the twin challenges for building adversarially robust continual learners: \textbf{accelerated forgetting} in continual learning and \textbf{gradient obfuscation} in adversarial robustness. In this study, we put forward a novel adversarially robust memory-based continual learner that adjusts data logits to mitigate the forgetting of past knowledge caused by adversarial samples. Furthermore, we devise a gradient-based data selection mechanism to overcome the gradient obfuscation caused by limited stored data. The proposed approach can be widely integrated with existing memory-based continual learning and adversarial training algorithms in a plug-and-play way. Extensive experiments on Split-CIFAR10/100 and Split-Tiny-ImageNet demonstrate the effectiveness of our approach, achieving a maximum forgetting reduction of 34.17% on adversarial data for ResNet, and 20.10% for ViT.
Poster
Amirhossein Ansari · Ke Wang · Pulei Xiong

[ Exhibit Hall I ]

Abstract
Recent advancements in Vision-Language Models like CLIP have enabled zero-shot OOD detection by leveraging both image and textual label information. Among these, negative label-based methods such as NegLabel and CSP have shown promising results by utilizing a lexicon of words to define negative labels for distinguishing OOD samples. However, these methods suffer from detecting in-distribution samples as OOD due to negative labels that are subcategories of in-distribution labels or proper nouns. They also face limitations in handling images that match multiple in-distribution and negative labels. We propose NegRefine, a novel negative label refinement framework for zero-shot OOD detection. By introducing a filtering mechanism to exclude subcategory labels and proper nouns from the negative label set, and incorporating a multi-matching-aware scoring function that dynamically adjusts the contributions of multiple labels matching an image, NegRefine ensures a more robust separation between in-distribution and OOD samples. We evaluate NegRefine on large-scale benchmarks, including ImageNet-1K. Source code is available in the supplementary material.
Poster
Yue Duan · Taicai Chen · Lei Qi · Yinghuan Shi

[ Exhibit Hall I ]

Abstract
Semi-supervised continual learning (SSCL) seeks to leverage both labeled and unlabeled data in a sequential learning setup, aiming to reduce annotation costs while managing continual data arrival. SSCL introduces complex challenges, including ensuring effective unlabeled learning (UL), while balancing memory stability (MS) and learning plasticity (LP). Previous SSCL efforts have typically focused on isolated aspects of the three, while this work presents USP, a divide-and-conquer framework designed to synergistically enhance these three aspects: (1) Feature Space Reservation (FSR) strategy for LP, which constructs reserved feature locations for future classes by shaping old classes into an equiangular tight frame; (2) Divide-and-Conquer Pseudo-labeling (DCP) approach for UL, which assigns reliable pseudo-labels across both high- and low-confidence unlabeled data; and (3) Class-mean-anchored Unlabeled Distillation (CUD) for MS, which reuses DCP's outputs to anchor unlabeled data to stable class means for distillation to prevent forgetting. Comprehensive evaluations show USP outperforms prior SSCL methods, with gains up to 18.26% in the last accuracy, validating its effectiveness. The source code will be made available upon acceptance of the paper.
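The Feature Space Reservation idea, shaping old-class features into an equiangular tight frame (ETF) so that unused anchor directions remain reserved for future classes, can be illustrated with the standard simplex-ETF construction below. Embedding via a random orthonormal basis and assigning the reserved columns this way are assumptions for illustration, not the paper's exact recipe.

```python
import math
import torch
import torch.nn.functional as F

def simplex_etf(total_classes: int, dim: int) -> torch.Tensor:
    # Returns `total_classes` unit-norm anchors with equal pairwise cosine
    # similarity -1/(total_classes - 1); requires dim >= total_classes.
    c = total_classes
    m = math.sqrt(c / (c - 1)) * (torch.eye(c) - torch.ones(c, c) / c)
    q, _ = torch.linalg.qr(torch.randn(dim, c))   # random orthonormal embedding
    return F.normalize(q @ m, dim=0)              # (dim, c); one column per class

# Usage sketch: anchor the K classes seen so far to the first K columns and
# keep the remaining columns reserved for classes that arrive later.
anchors = simplex_etf(total_classes=100, dim=512)
```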
Poster
Xiaorui Jiang · Buyun He · Peng Yuan Zhou · Xinyue Chen · Jingcai Guo · Jie Xu · Yong Liao

[ Exhibit Hall I ]

Abstract
Incomplete multi-view clustering (IMVC) has gained increasing attention due to its ability to analyze incomplete multi-view data. Although deep IMVC methods have achieved significant progress, they still face two challenges: (I) method-specific, inseparable designs limit their application; (II) non-independent and identically distributed (Non-IID) missing patterns have not been considered, causing performance degradation. To address these issues, we propose a novel unified framework that bridges from deep MVC to deep IMVC, while emphasizing robustness against Non-IID missing patterns. Our framework has a two-stage process: (I) multi-view learning on complete data, where our framework is modularly established to be compatible with different multi-view interaction objectives; (II) transfer learning and clustering on incomplete data, where we propose a multi-view domain adversarial learning method to improve the model's robustness to Non-IID missing patterns. Moreover, an intra-view and inter-view imputation strategy is introduced for more reliable clustering. Based on our unified framework, we easily construct multiple IMVC instances, and extensive experiments verify their clustering effectiveness.
Poster
Vedaant V Jain · Gabriel Kreiman · Felipe Feitosa

[ Exhibit Hall I ]

Abstract
Despite significant advancements in image segmentation and object detection, understanding complex scenes remains a significant challenge. Here, we focus on graphical humor as a paradigmatic example of image interpretation that requires elucidating the interaction of different scene elements in the context of prior cognitive knowledge. This paper introduces HumorDB, a novel, controlled, and carefully curated dataset designed to evaluate and advance visual humor understanding by AI systems. The dataset comprises diverse images spanning photos, cartoons, sketches, and AI-generated content, including minimally contrastive pairs where subtle edits differentiate between humorous and non-humorous versions. We evaluate humans, state-of-the-art vision models, and large vision-language models on three tasks: binary humor classification, funniness rating prediction, and pairwise humor comparison. The results reveal a gap between current AI systems and human-level humor understanding. While pretrained vision-language models perform better than vision-only models, they still struggle with abstract sketches and subtle humor cues. Analysis of attention maps shows that even when models correctly classify humorous images, they often fail to focus on the precise regions that make the image funny. Preliminary mechanistic interpretability studies and evaluation of model explanations provide initial insights into how different architectures process humor. Our results identify promising trends and current limitations, …
Poster
Zhenghao He · Sanchit Sinha · Guangzhi Xiong · Aidong Zhang

[ Exhibit Hall I ]

Abstract
Concept Activation Vectors (CAVs) provide a powerful approach for interpreting deep neural networks by quantifying their sensitivity to human-defined concepts. However, when computed independently at different layers, CAVs often exhibit inconsistencies, making cross-layer comparisons unreliable. To address this issue, we propose the Global Concept Activation Vector (GCAV), a novel framework that unifies CAVs into a single, semantically consistent representation. Our method leverages contrastive learning to align concept representations across layers and employs an attention-based fusion mechanism to construct a globally integrated CAV. By doing so, our method significantly reduces the variance in TCAV scores while preserving concept relevance, ensuring more stable and reliable concept attributions. To evaluate the effectiveness of GCAV, we introduce Testing with Global Concept Activation Vectors (TGCAV) as a method to apply TCAV on GCAV-based representations. We conduct extensive experiments on multiple deep neural networks, demonstrating that our method effectively mitigates concept inconsistency across layers, enhances concept localization, and improves robustness against adversarial perturbations. By integrating cross-layer information into a coherent framework, our method offers a more comprehensive and interpretable understanding of how deep learning models encode human-defined concepts.
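For context, the per-layer quantities that GCAV unifies can be computed with the standard CAV/TCAV recipe: fit a linear boundary between concept and random activations at a layer, take its normal as the CAV, and score concept sensitivity by the sign of directional derivatives. The sketch below shows only that standard baseline computation, not the proposed contrastive alignment or attention-based fusion.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def compute_cav(concept_acts, random_acts):
    # Single-layer CAV: the unit normal of a linear boundary separating
    # concept activations from random activations at one layer.
    X = np.concatenate([concept_acts, random_acts])
    y = np.concatenate([np.ones(len(concept_acts)), np.zeros(len(random_acts))])
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    v = clf.coef_.ravel()
    return v / np.linalg.norm(v)

def tcav_score(grads, cav):
    # Fraction of examples whose class-logit gradient (w.r.t. that layer's
    # activations) has a positive directional derivative along the CAV.
    return float(np.mean(grads @ cav > 0))
```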
Poster
Junhao Dong · Jiao Liu · Xinghua Qu · YEW-SOON ONG

[ Exhibit Hall I ]

Abstract
Adversarially robust knowledge distillation transfers the robustness of a large-scale teacher model to a lightweight student while preserving natural performance. However, foundation Vision-Language Models (VLMs) also demand the transfer of zero-shot inference capabilities. We find that standard robust distillation using untargeted adversarial examples fails to transfer out-of-distribution (zero-shot) robustness, as these adversaries primarily push inputs away from their original distribution, exploring a limited portion of the teacher's decision space and missing more diverse failure modes. A natural solution is to generate multiple targeted adversaries that traverse diverse paths across decision boundaries. Thus, these adversaries probe a broader region of the teacher's decision surface. However, naive targeted adversary optimization often converges to local optima within a single category's decision region, limiting diversity. To address this, we propose a Multi-Objective Optimization (MOO)-based adversarial distillation framework that transfers robustness from large VLMs to lightweight ones by exploiting adversaries with two main objectives: misclassification and category-level adversarial diversity. Theoretically, we show that optimizing for diversity mitigates adversarial collapse into local optima, ensuring adversaries span multiple decision regions and capture the teacher's generalizable robust features. Extensive experiments demonstrate the superiority of our method over state-of-the-art adversarial learning across diverse scenarios.
Poster
Shangpin Peng · Senqiao Yang · Li Jiang · Zhuotao Tian

[ Exhibit Hall I ]

Abstract
Multimodal large language models (MLLMs) have revolutionized cross-modal understanding but continue to struggle with hallucinations - fabricated content contradicting visual inputs. Existing hallucination mitigation methods either incur prohibitive computational costs or introduce distribution mismatches between training data and model outputs. We identify a critical insight: hallucinations predominantly emerge at the early stages of text generation and propagate through subsequent outputs. To address this, we propose **SENTINEL** (**S**entence-level **E**arly i**N**tervention **T**hrough **IN**-domain pr**E**ference **L**earning), a framework that eliminates dependency on human annotations. Specifically, we first bootstrap high-quality in-domain preference pairs by iteratively sampling model outputs, validating object existence through cross-checking with two open-vocabulary detectors, and classifying sentences into hallucinated/non-hallucinated categories. Subsequently, we use context-coherent positive samples and hallucinated negative samples to iteratively build context-aware preference data. Finally, we train models using a context-aware preference loss (C-DPO) that emphasizes discriminative learning at the sentence level where hallucinations initially manifest. Experimental results show that SENTINEL reduces hallucinations by 90\% compared with the original model and outperforms the previous state-of-the-art method on both hallucination benchmarks and general capabilities benchmarks, demonstrating its superiority and generalization ability. The proposed models, datasets and code will be made publicly available.
Poster
Daniel DeAlcala · Aythami Morales · Julian Fierrez · Gonzalo Mancera · Ruben Tolosana · Javier Ortega-Garcia

[ Exhibit Hall I ]

Abstract
Active Membership Inference Test (aMINT) is a method designed to detect if given data was used during the training of machine learning models. In Active MINT, we propose a novel multi-task learning process that involves training simultaneously two models: the original or Audited Model, and a secondary model, referred to as the MINT Model, responsible for identifying the data used for training the Audited Model. This novel multi-task learning approach has been designed to incorporate the auditability of the model as an optimization objective during the training process of neural networks. The proposed approach incorporates intermediate activation maps as inputs to MINT layers, which are trained to enhance the detection of the training data. We present results using a wide range of neural networks, from lighter architectures like MobileNet to more complex ones such as Vision Transformers, evaluated across 5 public benchmarks. Our proposed Active MINT achieves over 80% accuracy in detecting if given data was used for training, significantly outperforming previous approaches in the literature. Our proposed aMINT and related methodological developments contribute to increasing transparency in AI training, therefore facilitating stronger safeguards in AI deployments in order to achieve proper security, privacy, and copyright protection (Code will be …
Poster
Rui Ma · Qilong Wang · Bing Cao · Qinghua Hu · Yahong Han

[ Exhibit Hall I ]

Abstract
Recently, vision-language models (e.g., CLIP) with prompt learning have shown great potential in few-shot learning. However, an open issue remains for the effective extension of CLIP-based models to few-shot open-set recognition (FSOR), which requires classifying known classes and detecting unknown samples using a few known samples. The core challenge is that unknown samples and their textual descriptions are unavailable. To address this, we propose an Unknown Text Learning (UTL) method for CLIP-based FSOR tasks with only known samples. UTL involves two key components, i.e., universal unknown words optimization (U$^{2}$WO) and unknown label smoothing (ULS). Specifically, U$^{2}$WO constructs the universal space of unknown words with basis vectors and characterizes unknown text based on a linear combination of those basis vectors. To efficiently learn unknown text without unknown samples, ULS is presented to perform contrastive learning between unknown text and known samples by regulating the label of unknown classes to a small constant, which flexibly empowers unknown text to be non-matching with, and confused on, known visual samples. In addition, our UTL incorporates an additional context for known classes to mitigate conflicts of context optimization between known and unknown classes. UTL effectively regularizes the predicted probability by integrating learnable unknown text. …
Poster
Longhua Li · Lei Qi · Xin Geng

[ Exhibit Hall I ]

Abstract
Edge computing in person re-identification (ReID) is crucial for reducing the load on central cloud servers and ensuring user privacy. Conventional methods for obtaining compact models require computations for each individual student model. When multiple models of varying sizes are needed to accommodate different resource conditions, this leads to repetitive and cumbersome calculations. To address this challenge, we propose a novel knowledge inheritance approach named OSKT (One-Shot Knowledge Transfer), which consolidates the knowledge of the teacher model into an intermediate carrier called a weight chain. When a downstream scenario demands a model that meets specific resource constraints, this weight chain can be expanded to the target model size without additional computation. OSKT significantly outperforms state-of-the-art compression methods, with the added advantage of one-time knowledge transfer that eliminates the need for frequent computations for each target model. On the Market1501 benchmark, using pre-trained ResNet50 or ViT-S as the teacher model, OSKT generates smaller student models (1/64th and 1/10th the parameters respectively) achieving accuracies of 89.4\% and 87.1\%, outperforming pruning (80.7\%, 74.1\%) and knowledge distillation (65.7\%, 38.7\%).
Poster
Xiefan Guo · Miaomiao Cui · Liefeng Bo · Di Huang

[ Exhibit Hall I ]

Abstract
Backpropagation-based approaches aim to align diffusion models with reward functions through end-to-end backpropagation of the reward gradient within the denoising chain, offering a promising perspective. However, due to the computational costs and the risk of gradient explosion associated with the lengthy denoising chain, existing approaches struggle to achieve complete gradient backpropagation, leading to suboptimal results. In this paper, we introduce Shortcut-based Fine-Tuning (ShortFT), an efficient fine-tuning strategy that utilizes the shorter denoising chain. More specifically, we employ the recently researched trajectory-preserving few-step diffusion model, which enables a shortcut over the original denoising chain, and construct a shortcut-based denoising chain of shorter length. The optimization on this chain notably enhances the efficiency and effectiveness of fine-tuning the foundational model. Our method has been rigorously tested and can be effectively applied to various reward functions, significantly improving alignment performance and surpassing state-of-the-art alternatives.
Poster
Mahdiyar Molahasani · Azadeh Motamedi · Michael Greenspan · Il-Min Kim · Ali Etemad

[ Exhibit Hall I ]

Abstract
We introduce Projection-based Reduction of Implicit Spurious bias in vision-language Models (PRISM), a new data-free and task-agnostic solution for bias mitigation in VLMs like CLIP. VLMs often inherit and amplify biases in their training data, leading to skewed predictions. PRISM is designed to debias VLMs without relying on predefined bias categories or additional external data. It operates in two stages: first, an LLM is prompted with simple class prompts to generate scene descriptions that contain spurious correlations. Next, PRISM uses our novel contrastive-style debiasing loss to learn a projection that maps the embeddings onto a latent space that minimizes spurious correlations while preserving the alignment between image and text embeddings. Extensive experiments demonstrate that PRISM outperforms current debiasing methods on the commonly used Waterbirds and CelebA datasets. We make our code public at: https://anonymous.4open.science/r/PRISM_official.
Poster
Zhaoyang Li · Zhu Teng · Baopeng Zhang · Jianping Fan

[ Exhibit Hall I ]

Abstract
Deepfake detection methods are becoming increasingly crucial for identity security and have recently been employed to support legal proceedings. However, these methods often exhibit unfairness due to flawed logical reasoning, undermining the reliability of their predictions and raising concerns about their applicability in legal contexts. To mitigate this bias, existing approaches typically rely on predefined demographic attributes, such as race and gender. However, these assumptions are inherently limited, as different deepfake detectors exhibit substantial variations in fairness performance, often uncovering intricate and unforeseen bias patterns. To this end, we propose the Adversarial Open-Unfairness Discovery and Mitigation Network (AdvOU), a novel framework designed to mitigate unpredictable unfairness in deepfake detection. Our approach strengthens general deepfake detectors by equipping them with a lightweight Unfairness Regulator (UR), which dynamically identifies and mitigates bias. Furthermore, we propose an adversarial learning paradigm that alternates between the training of the Open-Unfairness Discovery (OUD) module and the Unfairness Adversarial Mitigation (UAM) module. The former intensifies unfairness within UR to reveal underlying bias patterns, while the latter leverages fairness in the detector by enforcing adversarial robustness against unfairness. Extensive experiments on widely used deepfake datasets validate the effectiveness of our approach, outperforming state-of-the-art methods in both fairness and …
Poster
Young-Jun Lee · Byung-Kwan Lee · Jianshu Zhang · Yechan Hwang · Byungsoo Ko · Han-Gyu Kim · Dongyu Yao · Xuankun Rong · Eojin Joo · Seung-Ho Han · Bowon Ko · Ho-Jin Choi

[ Exhibit Hall I ]

Abstract
Vision-and-Language Models (VLMs) have shown impressive capabilities on single-turn benchmarks, yet real-world applications often demand more intricate multi-turn dialogues. Existing multi-turn datasets (e.g., MMDU, ConvBench) only partially capture the breadth and depth of conversational scenarios encountered by users. In this work, we introduce MultiVerse, a novel multi-turn conversation benchmark featuring 647 dialogues—each averaging four turns—derived from a diverse set of 12 popular VLM evaluation benchmarks. With 484 tasks and 484 interaction goals, MultiVerse covers a wide range of topics, from factual knowledge and perception to advanced reasoning tasks such as mathematics and coding. To facilitate robust assessment, we propose a checklist-based evaluation method that leverages GPT-4o as the automated evaluator, measuring performance across 37 key aspects, including perceptual accuracy, linguistic clarity, and factual correctness. We evaluate 18 VLMs on MultiVerse, revealing that even the strongest models (e.g., GPT-4o) achieve only a 50% success rate in complex multi-turn conversations, highlighting the dataset's challenging nature. Notably, we find that providing full dialogue context significantly enhances performance for smaller or weaker models, emphasizing the importance of in-context learning. We believe MultiVerse offers a comprehensive landscape for evaluating the multi-turn interaction abilities of VLMs.
Poster
Han Qiu · Peng Gao · Lewei Lu · Xiaoqin Zhang · Ling Shao · Shijian Lu

[ Exhibit Hall I ]

Abstract
Multimodal large language models (MLLMs) have demonstrated promising spatial understanding capabilities, such as referencing and grounding object descriptions. Despite their successes, MLLMs still fall short in fine-grained spatial perception abilities, such as generating detailed region descriptions or accurately localizing objects. Additionally, they often fail to respond to the user's requirements for desired fine-grained spatial understanding. This issue might arise because existing approaches primarily focus on tuning MLLMs to model pre-annotated instruction data to inject spatial knowledge, without direct supervision of MLLMs' actual responses. We address this issue with Spatial Preference Rewarding (SPR), an approach that enhances MLLMs' spatial capabilities by rewarding detailed responses with precise object localization over vague or inaccurate responses. With randomly selected image regions and region descriptions from MLLMs, SPR introduces semantic and localization scores to comprehensively evaluate the text quality and localization quality of MLLM-generated descriptions. We also refine the MLLM descriptions with better localization accuracy and pair the best-scored refinement with the initial description of the lowest score for direct preference optimization, thereby enhancing fine-grained alignment with visual input. Extensive experiments over standard referring and grounding benchmarks show that SPR improves MLLM spatial understanding capabilities effectively with minimal overhead in training. Data and code will …
Poster
Chi-Ping Su · Ching-Hsun Tseng · Bin Pu · Lei Zhao · Jiewen Yang · Zhuangzhuang Chen · Shin-Jye Lee

[ Exhibit Hall I ]

Abstract
Knowledge distillation (KD) enables a smaller "student" model to mimic a larger "teacher" model by transferring knowledge from the teacher's output or features. However, most KD methods treat all samples uniformly, overlooking the varying learning value of each sample and thereby limiting effectiveness. In this paper, we propose Entropy-based Adaptive Knowledge Distillation (EA-KD), a simple yet effective plug-and-play KD method that prioritizes learning from valuable samples. EA-KD quantifies each sample’s learning value by strategically combining the entropy of the teacher and student output, then dynamically reweights the distillation loss to place greater emphasis on high-entropy samples. Extensive experiments across diverse KD frameworks and tasks—including image classification, object detection, and large language model (LLM) distillation—demonstrate that EA-KD consistently enhances performance, achieving state-of-the-art results with negligible computational cost. Our code will be publicly available.
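EA-KD's reweighting can be sketched compactly. The snippet below computes per-sample entropies of the teacher and student outputs, combines them additively (the abstract only says the two are "strategically combined", so the sum and the batch normalization are assumptions), and uses the result to reweight a standard temperature-scaled KD loss.

```python
import torch
import torch.nn.functional as F

def ea_kd_loss(student_logits, teacher_logits, T=4.0):
    # Per-sample weights grow with the combined teacher/student output
    # entropy, so high-entropy (high learning value) samples dominate.
    def entropy(logits):
        p = (logits / T).softmax(dim=1)
        return -(p * p.clamp_min(1e-8).log()).sum(dim=1)

    w = entropy(teacher_logits) + entropy(student_logits)
    w = w / (w.mean() + 1e-8)                       # normalize within the batch
    kl = F.kl_div((student_logits / T).log_softmax(dim=1),
                  (teacher_logits / T).softmax(dim=1),
                  reduction="none").sum(dim=1) * (T * T)
    return (w * kl).mean()
```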
Poster
Guohao Sun · Can Qin · Yihao Feng · Zeyuan Chen · Ran Xu · Sohail Dianat · MAJID RABBANI · Raghuveer Rao · Zhiqiang Tao

[ Exhibit Hall I ]

Abstract
Preference optimization algorithms typically enhance LLM response quality by leveraging human feedback on multiple answers given a fixed instruction. However, these methods often fail to capture the dynamic nature of conversational exchanges. For large vision-language models (LVLMs), direct preference optimization (DPO) can over-emphasize linguistic nuances while overlooking visual context. To address this challenge, we introduce structured policy optimization (SPO) -- a novel preference optimization method that simultaneously aligns preference instructions, responses, and dialogue interactions to improve multi-modal understanding and reasoning capabilities. The efficacy of SPO is attributed to one key design: treating the questioning and answering as sequential actions and binding them through a trajectory reward. This reward formulation better aligns with real-world dialogue studies and eliminates the need for fixed instructions. We evaluate our models on interleaved benchmarks, including image, multi-image, and video-based understanding and reasoning tasks. Experimental results show that fine-tuning an LVLM with the proposed SPO on multi-modal preference data aligns with human preferences more efficiently than DPO.
Poster
Ivan Sabolic · Matej Grcic · Siniša Šegvić

[ Exhibit Hall I ]

Abstract
We propose VIBE, a model-agnostic framework that trains classifiers resilient to backdoor attacks. The key concept behind our approach is to treat malicious inputs and corrupted labels from the training dataset as observed random variables, while the actual clean labels are latent. VIBE then recovers the corresponding latent clean label posterior through variational inference. The resulting training procedure follows the expectation-maximization (EM) algorithm. The E-step infers the clean pseudolabels by solving an entropy-regularized optimal transport problem, while the M-step updates the classifier parameters via gradient descent. Being modular, VIBE can seamlessly integrate with recent advancements in self-supervised representation learning, which enhance its ability to resist backdoor attacks. We experimentally validate the method's effectiveness against contemporary backdoor attacks on standard datasets, a large-scale setup with 1$k$ classes, and a dataset poisoned with multiple attacks. VIBE consistently outperforms previous defenses across all tested scenarios.
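The E-step described above admits a compact sketch: a Sinkhorn-style solver for entropy-regularized optimal transport between training samples and class labels, whose transport plan serves as the per-sample pseudolabel distribution. The uniform sample marginal, the way the class marginal is supplied, and the temperature are assumptions, and the M-step (ordinary gradient descent on the pseudolabels) is omitted.

```python
import torch

def sinkhorn_pseudolabels(logits, class_marginal, eps=0.1, n_iters=50):
    # Entropy-regularized OT between n samples (uniform mass 1/n each) and
    # c classes (mass = class_marginal). The normalized transport plan rows
    # act as pseudolabel distributions.
    n, c = logits.shape
    K = torch.exp((logits - logits.max(dim=1, keepdim=True).values) / eps).clamp_min(1e-30)
    v = torch.full((c,), 1.0 / c)
    row_marginal = torch.full((n,), 1.0 / n)
    for _ in range(n_iters):
        u = row_marginal / (K @ v)
        v = class_marginal / (K.T @ u)
    plan = u[:, None] * K * v[None, :]
    return plan / plan.sum(dim=1, keepdim=True)
```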
Poster
Hongyang He · Hongyang Xie · Haochen You · Victor Sanchez

[ Exhibit Hall I ]

Abstract
Semi-supervised learning (SSL) is often hindered by learning biases when imbalanced datasets are used for training, which limits its effectiveness in real-world applications. In this paper, we propose Semi-ViM, a novel SSL framework based on Vision Mamba, a bidirectional state space model (SSM) that serves as a superior alternative to Transformer-based architectures for visual representation learning. Semi-ViM effectively deals with label imbalance and improves model stability through two key innovations: LyapEMA, a stability-aware parameter update mechanism inspired by Lyapunov theory, and SSMixup, a novel mixup strategy applied at the hidden state level of bidirectional SSMs. Experimental results on ImageNet-1K and ImageNet-LT demonstrate that Semi-ViM significantly outperforms state-of-the-art SSL models, achieving 85.40% accuracy with only 10% of the labeled data, surpassing Transformer-based methods such as Semi-ViT.
Poster
Marco P. Apolinario · Sakshi Choudhary · Kaushik Roy

[ Exhibit Hall I ]

Abstract
Continual learning (CL) — the ability to progressively acquire and integrate new concepts — is essential to intelligent systems to adapt to dynamic environments. However, deep neural networks struggle with catastrophic forgetting (CF) when learning tasks sequentially, as training for new tasks often overwrites previously learned knowledge. To address this, recent approaches constrain updates to orthogonal subspaces using gradient projection, effectively preserving important gradient directions for previous tasks. While effective in reducing forgetting, these approaches inadvertently hinder forward knowledge transfer (FWT), particularly when tasks are highly correlated. In this work, we propose Conceptor-based gradient projection for Deep Continual Learning (CODE-CL), a novel method that leverages conceptor matrix representations, a form of regularized reconstruction, to adaptively handle highly correlated tasks. CODE-CL mitigates CF by projecting gradients onto pseudo-orthogonal subspaces of previous task feature spaces while simultaneously promoting FWT. It achieves this by learning a linear combination of shared basis directions, allowing efficient balance between stability and plasticity and transfer of knowledge between overlapping input feature representations. Extensive experiments on continual learning benchmarks validate CODE-CL’s efficacy, demonstrating superior performance, reduced forgetting, and improved FWT as compared to state-of-the-art methods.
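The conceptor machinery referenced above has a classic closed form: C = R(R + alpha^{-2} I)^{-1}, with R the feature correlation matrix, and a gradient can be softly projected away from the subspace a previous task's conceptor captures via (I - C)g. The sketch below shows only this textbook construction; CODE-CL's pseudo-orthogonal projection and its learned combination of shared basis directions are not reproduced.

```python
import torch

def conceptor(features: torch.Tensor, alpha: float = 4.0) -> torch.Tensor:
    # Classic conceptor matrix C = R (R + alpha^{-2} I)^{-1},
    # where R is the correlation matrix of features (shape: n x d).
    n, d = features.shape
    R = features.T @ features / n
    return R @ torch.linalg.inv(R + (alpha ** -2) * torch.eye(d))

def project_gradient(grad: torch.Tensor, C_prev: torch.Tensor) -> torch.Tensor:
    # Soft projection of a d-dimensional gradient away from the subspace
    # captured by the previous tasks' conceptor.
    d = C_prev.shape[0]
    return (torch.eye(d) - C_prev) @ grad
```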
Poster
Hao Ban · Gokul Ram Subramani · Kaiyi Ji

[ Exhibit Hall I ]

Abstract
Multi-task learning (MTL) enables a joint model to capture commonalities across multiple tasks, reducing computation costs and improving data efficiency. However, a major challenge in MTL optimization is task conflicts, where the task gradients differ in direction or magnitude, limiting model performance compared to single-task counterparts. Sharpness-aware minimization (SAM) minimizes task loss while simultaneously reducing the sharpness of the loss landscape. Our empirical observations show that SAM effectively mitigates task conflicts in MTL. Motivated by these findings, we explore integrating SAM into MTL but face two key challenges. On one hand, both the average loss gradient and individual task gradients--referred to as global and local information--contribute to SAM, but how to combine them remains unclear. On the other hand, directly computing each task gradient introduces significant computational and memory overheads. To address these challenges, we propose SAMO, a lightweight **S**harpness-**A**ware **M**ulti-task **O**ptimization approach, that leverages a joint global-local perturbation. The local perturbations are approximated using only forward passes and are layerwise normalized to improve efficiency. Extensive experiments on a suite of multi-task benchmarks demonstrate both the effectiveness and efficiency of our method.
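For reference, a vanilla SAM update (the global perturbation only) is sketched below. SAMO's contribution, the joint global-local perturbation with forward-pass-approximated, layerwise-normalized local terms, is not reproduced, and the helper signature is an assumption.

```python
import torch

def sam_step(model, loss_fn, batch, base_opt, rho=0.05):
    # Plain SAM: perturb weights along the (global) loss gradient, then take
    # the optimizer step using the gradient at the perturbed point.
    base_opt.zero_grad()
    loss = loss_fn(model, batch)
    loss.backward()
    params = [p for p in model.parameters() if p.grad is not None]
    grads = [p.grad.detach().clone() for p in params]
    norm = torch.sqrt(sum((g ** 2).sum() for g in grads)) + 1e-12
    with torch.no_grad():                 # climb to w + rho * g / ||g||
        for p, g in zip(params, grads):
            p.add_(rho * g / norm)
    model.zero_grad()
    loss_fn(model, batch).backward()      # gradient at the perturbed weights
    with torch.no_grad():                 # undo the perturbation
        for p, g in zip(params, grads):
            p.sub_(rho * g / norm)
    base_opt.step()
    base_opt.zero_grad()
    return loss.detach()
```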
Poster
Haidong Kang · Lianbo Ma · Pengjun Chen · Guo Yu · Xingwei Wang · Min Huang

[ Exhibit Hall I ]

Abstract
Training-free Neural Architecture Search (NAS) has emerged as an efficient way to discover high-performing lightweight models with zero-cost proxies (e.g., activation-based proxies (AZP)). In this paper, we observe a new \textit{negative correlation phenomenon}: the correlations of the AZP dramatically decrease and become negative as the number of convolutions increases, significantly degrading the prediction performance of AZP over target architectures. No existing works focus on such negative correlation and its underlying mechanism. To address this, through deep analysis of the architectural characteristics scored by AZP, we propose a series of AZP design principles and reveal a potential reason for the above phenomenon: \textit{high non-linearity dramatically degrades the magnitude of the AZP score}. These findings show that existing AZP designs do not obey the proposed principles. Finally, grounded in these insights, we propose a simple yet efficient \underline{N}egative \underline{C}orrelations-\underline{D}efied (\textbf{NCD}) method, which utilizes stochastic activation masking (SAM) and non-linear rescaling (NIR) to effectively eliminate the negative correlation of AZP and significantly improve performance. Extensive experimental results validate the effectiveness and efficiency of our method, outperforming state-of-the-art methods on 12 mainstream search spaces with 4 real-world tasks.
Poster
yan wang · Da-Wei Zhou · Han-Jia Ye

[ Exhibit Hall I ]

Abstract
Class-Incremental Learning (CIL) requires a learning system to continually learn new classes without forgetting. Although Pre-trained Models (PTMs) have shown excellent performance in CIL, catastrophic forgetting still occurs as the model learns new concepts. Existing methods often freeze the pre-trained network and adapt to incremental tasks using additional lightweight modules. At inference time, the model must accurately identify the most suitable module, and retrieving irrelevant modules can lead to a decline in performance. Additionally, the selected module concentrates solely on task-specific knowledge and neglects the general knowledge shared across tasks, so it is prone to making erroneous predictions when presented with several similar classes from different tasks. To address the aforementioned challenges, we propose integrating Task-Specific and Universal Adapters (TUNA) in this paper. Specifically, we design an orthogonal mechanism to train task-specific adapters, so that they can capture the most crucial features relevant to their respective tasks. Furthermore, we introduce an adapter fusion strategy to construct a universal adapter, which encodes the shared general knowledge across tasks. During inference, we combine predictions from both the task-specific adapter and the universal adapter to effectively utilize both specialized and general knowledge. Extensive experiments on various benchmark datasets demonstrate …
Poster
David A Kelly · Akchunya Chanchal · Nathan Blake

[ Exhibit Hall I ]

Abstract
Machine learning for image classification is an active and rapidly developing field. With the proliferation of classifiers of different sizes and different architectures, the problem of choosing the right model becomes more and more important. While we can assess a model's classification accuracy statistically, our understanding of the way these models work is unfortunately quite limited. In order to gain insight into the decision-making process of different vision models, we propose using minimal sufficient pixel sets. These pixels capture the essence of an image through the lens of the model. By comparing the position, overlap, and size of these pixel sets, we identify that different architectures have statistically different minimal pixel sets, in both size and position. In particular, ConvNext and EVA models differ markedly from the others. We also identify that misclassified images are associated with statistically significantly larger pixel sets than correct classifications.
Poster
Mainak Biswas · Ambedkar Dukkipati · Devarajan Sridharan

[ Exhibit Hall I ]

Abstract
Deep learning models are seldom deployed widely for real-world applications (e.g., medicine), because source models do not generalize well to "domain-shifted" target data. Many successful domain adaptation approaches require full access to source data and reliably labeled target data. Yet, such requirements are unrealistic in scenarios where source data cannot be shared, either because of privacy concerns or because the data are too large and incur prohibitive storage or computation costs. Moreover, resource constraints may limit the availability of labeled targets. We illustrate this challenge in a neuroscience setting where source data are unavailable, labeled target data are meager, and predictions involve continuous-valued outputs. We build upon Contradistinguisher (CUDA), an efficient framework that learns a shared model across the labeled source and unlabeled target samples, without intermediate alignment of representations. Yet, CUDA was designed for unsupervised DA, with full access to source data and for classification tasks. We develop CRAFT -- a CUDA-based Regularization Approach for Flexible Training -- for source-free (SF), semi-supervised transfer of pretrained models in regression tasks. We showcase the efficacy of CRAFT in two important neuroscience settings: gaze prediction with electroencephalography (EEG) data and "brain age" prediction with structural MRI data. For both datasets, CRAFT yielded up to $9\%$ …
Poster
Wenwen Yu · Zhibo Yang · Yuliang Liu · Xiang Bai

[ Exhibit Hall I ]

Abstract
Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in document understanding. However, their reasoning processes remain largely black-box, making it difficult to ensure reliability and trustworthiness, especially in high-stakes domains such as legal, financial, and medical document analysis. Existing methods use fixed Chain-of-Thought (CoT) reasoning with supervised fine-tuning (SFT) but suffer from catastrophic forgetting, poor adaptability, and limited generalization across domain tasks. In this paper, we propose DocThinker, a rule-based Reinforcement Learning (RL) framework for dynamic inference-time reasoning. Instead of relying on static CoT templates, DocThinker autonomously refines reasoning strategies via policy learning, generating interpretable intermediate results, including structured reasoning processes, rephrased questions, regions of interest (RoI) supporting the answer, and the final answer. By integrating multi-objective rule-based rewards and KL-constrained optimization, our method mitigates catastrophic forgetting and enhances both adaptability and transparency. Extensive experiments on multiple benchmarks demonstrate that DocThinker significantly improves generalization while producing more interpretable and human-understandable reasoning steps. Our findings highlight RL as a powerful alternative for enhancing explainability and adaptability in MLLM-based document understanding.
Poster
Lingyun Huang · Jianxu Mao · Junfei YI · Ziming Tao · Yaonan Wang

[ Exhibit Hall I ]

Abstract
In recent years, the rapid expansion of model sizes has introduced huge computational overhead. To address this issue, Parameter-Efficient Fine-Tuning (PEFT) methods have been introduced. These methods optimize large-scale pre-trained models for specific tasks by fine-tuning a select group of parameters. Among these PEFT methods, adapter-based and prompt-based methods are the primary techniques. Specifically, in the field of visual fine-tuning, adapters gain prominence over prompts because of the latter’s relatively weaker performance and efficiency. In this context, we conducted a detailed analysis of Visual Prompt Tuning (VPT) and attributed its shortcomings to the deployment of prompts in VPT. Consequently, we propose Cross Visual Prompt Tuning (CVPT), which introduces cross-attention to directly capture the relationships between prompts and the original tokens, allowing the prompts to integrate visual features efficiently. This changes the original deployment of prompts, thereby decoupling the prompts from the original tokens and avoiding the distortion of self-attention. Furthermore, we introduce a weight-sharing mechanism to initialize the parameters of cross-attention, which avoids introducing massive learnable parameters for cross-attention and enhances its representative capability. We conduct comprehensive testing across 25 datasets, and the results indicate that CVPT significantly improves VPT’s performance and efficiency in visual tasks. For example, on …
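A rough sketch of the cross-attention idea described above, assuming learnable prompts act as queries over frozen ViT tokens. The dimensions, prompt count, and initialization are illustrative choices and omit CVPT's weight-sharing initialization details.

```python
import torch
import torch.nn as nn

class CrossPromptAttention(nn.Module):
    """Learnable prompts attend to image tokens via cross-attention,
    keeping the prompts outside the backbone's self-attention."""
    def __init__(self, dim: int = 768, num_prompts: int = 10, num_heads: int = 12):
        super().__init__()
        self.prompts = nn.Parameter(torch.zeros(num_prompts, dim))
        nn.init.trunc_normal_(self.prompts, std=0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len, dim) patch/CLS tokens from the frozen ViT.
        b = tokens.size(0)
        queries = self.prompts.unsqueeze(0).expand(b, -1, -1)
        updated_prompts, _ = self.cross_attn(queries, tokens, tokens)
        return updated_prompts  # (batch, num_prompts, dim)

# Example: refresh prompts against a batch of ViT tokens.
layer = CrossPromptAttention()
out = layer(torch.randn(2, 197, 768))
print(out.shape)  # torch.Size([2, 10, 768])
```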
Poster
Hang Du · Jiayang Zhang · Guoshun Nan · Wendi Deng · Zhenyan Chen · Chenyang Zhang · Wang Xiao · Shan Huang · Yuqi Pan · Tao Qi · Sicong Leng

[ Exhibit Hall I ]

Abstract
Multi-image Interleaved Reasoning aims to improve Multimodal Large Language Models' (MLLMs) ability to jointly comprehend and reason across multiple images and their associated textual contexts, introducing unique challenges beyond single-image or non-interleaved multi-image tasks.While current multi-image benchmarks overlook interleaved textual contexts and neglect distinct relationships between individual images and their associated texts, enabling models to reason over multi-image interleaved data may significantly enhance their comprehension of complex scenes and better capture cross-modal correlations.To bridge this gap, we introduce a novel benchmark \textbf{MIR}, requiring joint reasoning over multiple images accompanied by interleaved textual contexts to accurately associate image regions with corresponding texts and logically connect information across images.To enhance MLLMs' ability to comprehend multi-image interleaved data, we introduce reasoning steps for each instance within the benchmark and propose a stage-wise curriculum learning strategy. This strategy follows an ``easy to hard'' approach, progressively guiding models from simple to complex scenarios, thereby enhancing their ability to handle challenging tasks.Extensive experiments benchmarking multiple MLLMs demonstrate that our method significantly enhances models' reasoning performance on MIR and other established benchmarks, highlighting the challenges current MLLMs face with multi-image interleaved reasoning.We believe that MIR will encourage further research into multi-image interleaved reasoning, facilitating advancements in MLLMs' …
Poster
Guanghui Shi · Xuefeng liang · Wenjie Li · Xiaoyu Lin

[ Exhibit Hall I ]

Abstract
Learning fine-grained representations from coarse labels for fine-grained visual recognition (FGVR) is a challenging yet valuable task, as it alleviates the reliance on labor-intensive fine-grained annotations. Early approaches focused primarily on minimizing intra-fine-grained-class variation but overlooked inter-fine-grained-class separability, resulting in limited FGVR performance. Subsequent studies employed a top-down paradigm to enhance separability via deep clustering, yet these methods require predefining the number of fine-grained classes, which is often impractical to obtain. Here, we introduce a bottom-up learning paradigm that constructs a hierarchical dendrogram by iteratively merging similar instances/clusters, inferring higher-level semantics from lowest-level instances without predefining class numbers. Leveraging this, we propose BuCSFR, a novel method that integrates a Bottom-up Construction (BuC) module to build the dendrogram based on a minimal information loss criterion, and a Separable Fine-grained Representation (SFR) module that treats dendrogram nodes as pseudo-labels to ensure representation separability. The synergistic interaction between these modules enables iterative enhancement, grounded theoretically in the Expectation-Maximization (EM) framework. Extensive experiments on five benchmark datasets demonstrate the superiority of our approach, showcasing its effectiveness in learning separable representations for FGVR.
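For intuition, a bottom-up dendrogram of the kind described can be built with standard agglomerative clustering. The Ward linkage below is only a stand-in for the paper's minimal-information-loss merging criterion, and `fcluster` merely illustrates how dendrogram nodes could serve as pseudo-labels.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy features standing in for instance-level representations.
features = np.random.randn(200, 64)

# Bottom-up (agglomerative) dendrogram: each instance starts as its own
# cluster and the closest pair is merged at every step.
dendrogram = linkage(features, method="ward")

# Cutting the dendrogram at some level yields pseudo-labels that a
# representation-learning module could train against.
pseudo_labels = fcluster(dendrogram, t=20, criterion="maxclust")
print(pseudo_labels.shape, pseudo_labels.max())
```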
Poster
Jeong Woon Lee · Hyoseok Hwang

[ Exhibit Hall I ]

Abstract
Reinforcement learning (RL) has proven its potential in complex decision-making tasks. Yet, many RL systems rely on manually crafted state representations, requiring effort in feature engineering. Visual Reinforcement Learning (VRL) offers a way to address this challenge by enabling agents to learn directly from raw visual input. Nonetheless, VRL continues to face generalization issues, as models often overfit to specific domain features. To tackle this issue, we propose Diffusion Guided Adaptive Augmentation (DGA2), an augmentation method that utilizes Stable Diffusion to enhance domain diversity. We introduce an Adaptive Domain Shift strategy that dynamically adjusts the degree of domain shift according to the agent’s learning progress for effective augmentation with Stable Diffusion. Additionally, we employ saliency masks to preserve the semantics of the data. Our experiments on the DMControl-GB, Adroit, and Procgen environments demonstrate that DGA2 improves generalization performance compared to existing data augmentation and generalization methods.
Poster
Tianjiao Jiang · Zhen Zhang · Yuhang Liu · Javen Qinfeng Shi

[ Exhibit Hall I ]

Abstract
Few-shot learning (FSL) aims to enable models to learn effectively from limited labeled data. However, existing methods often struggle with overfitting due to the high dimensionality of feature spaces and the small sample sizes typically available. More precisely, the features used in most FSL applications can be viewed as a mixture of latent disentangled features. As a result, the learner is often required to implicitly infer the mixing procedure, which involves estimating a large number of parameters and frequently leads to overfitting. Building on recent theoretical advances in multi-modal contrastive learning, we propose the Causal CLIP Adapter (CCA), a novel approach that disentangles visual features obtained from CLIP by applying independent component analysis (ICA). While ICA effectively disentangles latent features, it may inadvertently introduce misalignment in the feature space. To address this, we leverage CLIP's inherent cross-modal alignment and enhance it both unidirectionally and bidirectionally through fine-tuning and cross-attention mechanisms. The logits from uni-modal and cross-modal classifications are then combined linearly to improve overall classification accuracy. Extensive experiments conducted across 11 benchmark datasets demonstrate that our method consistently outperforms state-of-the-art (SOTA) techniques in terms of robustness to distributional shifts and resistance to adversarial noise, all while maintaining computational efficiency. These …
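A toy sketch of the disentangle-then-combine recipe described above, assuming random arrays in place of real CLIP features and a hypothetical linear probe: ICA disentangles the visual features, and uni-modal and cross-modal logits are combined linearly, as in the abstract. The component count and mixing weight are illustrative.

```python
import numpy as np
from sklearn.decomposition import FastICA

# Toy stand-ins for CLIP visual and text features of a few-shot task.
rng = np.random.default_rng(0)
visual_feats = rng.normal(size=(160, 512))   # (num_shots, feat_dim)
text_feats = rng.normal(size=(10, 512))      # (num_classes, feat_dim)

# Disentangle visual features with ICA.
ica = FastICA(n_components=64, random_state=0)
visual_ica = ica.fit_transform(visual_feats)            # (160, 64)

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Cross-modal logits from image-text cosine similarity; uni-modal logits
# from a hypothetical linear probe on the ICA features (random weights here).
cross_modal_logits = l2_normalize(visual_feats) @ l2_normalize(text_feats).T
uni_modal_logits = visual_ica @ rng.normal(size=(64, 10))
combined = 0.5 * cross_modal_logits + 0.5 * uni_modal_logits
print(combined.shape)  # (160, 10)
```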
Poster
Frano Rajič · Haofei Xu · Marko Mihajlovic · Siyuan Li · Irem Demir · Emircan Gündoğdu · Lei Ke · Sergey Prokudin · Marc Pollefeys · Siyu Tang

[ Exhibit Hall I ]

Abstract
We introduce the first data-driven multi-view 3D point tracker, designed to track arbitrary points in dynamic scenes using multiple camera views. Unlike existing monocular trackers, which struggle with depth ambiguities and occlusion, or previous multi-camera methods that require over 20 cameras and tedious per-sequence optimization, our feed-forward model directly predicts 3D correspondences using a practical number of cameras (e.g., four), enabling robust and accurate online tracking. Our tracker fuses multi-view features into a unified point cloud and applies k-nearest-neighbors correlation alongside a transformer-based update to reliably estimate long-range 3D correspondences, even under occlusion. We train on 5K synthetic multi-view Kubric sequences and evaluate on two real-world benchmarks—Panoptic Studio and DexYCB—where we achieve median trajectory errors of 3.2 cm and 2.3 cm, respectively. Notably, on DexYCB, our method surpasses the strongest single-view tracker by 58.2% and a simpler multi-view triplane-based baseline by 46.5%. It also generalizes better to diverse camera setups of 1–8 cameras with varying vantage points and video lengths of 24–150 frames. By releasing our pre-trained tracker alongside training and evaluation datasets, we aim to set a new standard for multi-view 3D tracking research and provide a practical tool for a wide range of real-world applications.
Poster
Simon Kiefhaber · Stefan Roth · Simone Schaub-Meyer

[ Exhibit Hall I ]

Abstract
Cost volumes are used in every modern optical flow estimator, but due to their computational and space complexity, they are often a limiting factor regarding both processing speed and the resolution of input frames. Motivated by our empirical observation that cost volumes lose their importance once all other network parts of, e.g., a RAFT-based pipeline have been sufficiently trained, we introduce a training strategy that allows the cost volume to be removed from optical flow estimators throughout training. This leads to significantly improved inference speed and reduced memory requirements. Using our training strategy, we create three different models covering different compute budgets. Our most accurate model reaches state-of-the-art accuracy while being $1.2\times$ faster and having a $6\times$ lower memory footprint than comparable models; our fastest model is capable of processing Full HD frames at $20\mathrm{FPS}$ using only $500\mathrm{MB}$ of memory.
Poster
Mengmeng Sheng · Zeren Sun · Tianfei Zhou · Xiangbo Shu · Jinshan Pan · Yazhou Yao

[ Exhibit Hall I ]

Abstract
Label noise learning (LNL), a practical challenge in real-world applications, has recently attracted significant attention. While demonstrating promising effectiveness, existing LNL approaches typically rely on various forms of prior knowledge, such as noise rates or thresholds, to sustain performance. This dependence limits their adaptability and practicality in real-world scenarios where such priors are usually unavailable. To this end, we propose a novel LNL approach, termed CA2C (Combined Asymmetric Co-learning and Co-training), which alleviates the reliance on prior knowledge through an integration of complementary learning paradigms. Specifically, we first introduce an asymmetric co-learning strategy with paradigm deconstruction. This strategy trains two models simultaneously under distinct learning paradigms, harnessing their complementary strengths to enhance robustness against noisy labels. Then, we propose an asymmetric co-training strategy with cross-guidance label generation, wherein knowledge exchange is facilitated between the twin models to mitigate error accumulation. Moreover, we design a confidence-based re-weighting approach for label disambiguation, enhancing robustness against potential disambiguation failures. Extensive experiments on synthetic and real-world noisy datasets demonstrate the effectiveness and superiority of CA2C.
Poster
Paul Roetzer · Florian Bernard

[ Exhibit Hall I ]

Abstract
Geometric consistency, i.e., the preservation of neighbourhoods, is a natural and strong prior in 3D shape matching. Geometrically consistent matchings are crucial for many downstream applications, such as texture transfer or statistical shape modelling. Yet, in practice, geometric consistency is often overlooked, or only achieved under severely limiting assumptions (e.g., a good initialisation). In this work, we propose a novel formalism for computing globally optimal and geometrically consistent matchings between 3D shapes which is scalable in practice. Our key idea is to represent the surface of the source shape as a collection of cyclic paths, which are then consistently matched to the target shape. Mathematically, we construct a hyper product graph (between source and target shape), and then cast 3D shape matching as a minimum-cost circulation flow problem in this hyper graph, which yields globally geometrically consistent matchings between both shapes. We empirically show that our formalism is efficiently solvable and that it leads to high-quality results.
Poster
Nurbek Tastan · Karthik Nandakumar

[ Exhibit Hall I ]

Abstract
While foundation models (FMs) pre-trained on large-scale data exhibit good zero-shot performance in many downstream tasks, there is often scope for performance improvement via task-specific adaptation of the FM. However, the data required for this adaptation is typically spread across multiple entities (data owners) and cannot be collated at a central location due to privacy concerns. At the same time, a learning service provider (LSP) who owns the FM cannot share the model with data owners due to proprietary reasons. In this work, we propose the **BlindFed** framework, which enables multiple data owners to collaboratively adapt an FM (owned by an LSP) for a specific downstream task while preserving the interests of both the data owners and the LSP. Specifically, data owners do not see the FM as well as each other's data, and the LSP does not see sensitive task-specific data. The BlindFed framework relies on fully homomorphic encryption (FHE) and consists of three key innovations: (i) We introduce **FHE-friendly architectural modifications** of the given FM, leveraging existing tools such as polynomial approximations and low-rank parallel adapters. (ii) We propose a **two-stage split learning** process, where FHE-friendly FM blocks are learned through offline knowledge distillation and task-specific local parallel …
Poster
Yuyang Ji · Zeyi Huang · Haohan Wang · Yong Jae Lee

[ Exhibit Hall I ]

Abstract
In this paper, we study domain generalization, where the goal is to develop models that can effectively generalize from multiple source domains to unseen target domains. Different from traditional approaches that aim to create a single, style-invariant model, we propose a new ``Customized Domain Adapters'' method, named CDA. This method leverages parameter-efficient adapters to construct a model with domain-specific components, each component focusing on learning from its respective domain. We focus on integrating the unique strengths of different adapter architectures, such as ViT and CNN, to create a model adept at handling the distinct statistical properties of each domain. Our experimental results on standard domain generalization datasets demonstrate the superiority of our method over traditional approaches, showcasing its enhanced adaptability and robustness in domain generalization tasks.
Poster
Xingyu Miao · Haoran Duan · Quanhao Qian · Jiuniu Wang · Yang Long · Ling Shao · Deli Zhao · Ran Xu · Gongjie Zhang

[ Exhibit Hall I ]

Abstract
Spatial intelligence is emerging as a transformative frontier in AI, yet it remains constrained by the scarcity of large-scale 3D datasets. Unlike the abundant 2D imagery, acquiring 3D data typically requires specialized sensors and laborious annotation. In this work, we present a scalable pipeline that converts single-view images into comprehensive, scale- and appearance-realistic 3D representations — including point clouds, camera poses, depth maps, and pseudo-RGBD — via integrated depth estimation, camera calibration, and scale calibration. Our method bridges the gap between the vast repository of imagery and the increasing demand for spatial scene understanding. By automatically generating authentic, scale-aware 3D data from images, we significantly reduce data collection costs and open new avenues for advancing spatial intelligence. We release multiple generated spatial datasets, i.e., COCO-3D and Objects365-v2-3D, and demonstrate through extensive experiments that our generated data can benefit various spatial tasks, ranging from basic perception to MLLM-based reasoning. These results validate our pipeline as an effective solution for developing AI systems capable of perceiving, understanding, and interacting with physical environments.
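One core step of such a pipeline, lifting an estimated depth map to a point cloud with a pinhole camera model, can be sketched as follows. The intrinsics here are assumed known, whereas the pipeline described above estimates camera calibration and metric scale as well.

```python
import numpy as np

def unproject_depth(depth: np.ndarray, fx: float, fy: float,
                    cx: float, cy: float) -> np.ndarray:
    """Lift a metric depth map to a 3D point cloud with a pinhole model.

    depth: (H, W) array of per-pixel depths along the camera's z-axis.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    points = np.stack([x, y, depth], axis=-1)   # (H, W, 3)
    return points.reshape(-1, 3)

# Example with a synthetic depth map and rough intrinsics.
depth = np.ones((480, 640), dtype=np.float32) * 2.0
cloud = unproject_depth(depth, fx=600.0, fy=600.0, cx=320.0, cy=240.0)
print(cloud.shape)  # (307200, 3)
```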
Poster
Liwen Xiao · Zhiyu Pan · Zhicheng Wang · Zhiguo Cao · Wei Li

[ Exhibit Hall I ]

Abstract
Accurate prediction of multi-agent future trajectories is crucial for autonomous driving systems to make safe and efficient decisions. Trajectory refinement has emerged as a key strategy to enhance prediction accuracy. However, existing refinement methods often overlook the topological relationships between trajectories, which are vital for improving prediction precision. Inspired by braid theory, we propose a novel trajectory refinement approach, Soft-Braid Refiner (SRefiner), guided by the soft-braid topological structure of trajectories using Soft-Braid Attention. Soft-Braid Attention captures spatio-temporal topological relationships between trajectories by considering both spatial proximity and vehicle motion states at ``soft intersection points". Additionally, we extend this approach to model interactions between trajectories and lanes, further improving the prediction accuracy. SRefiner is a multi-iteration, multi-agent framework that iteratively refines trajectories, incorporating topological information to enhance interactions within traffic scenarios. SRefiner achieves significant performance improvements over four baseline methods across two datasets, establishing a new state-of-the-art in trajectory refinement.
Poster
Rui Liu · Sheng Fan · Wenguan Wang · Yi Yang

[ Exhibit Hall I ]

Abstract
Underwater visual simultaneous localization and mapping (SLAM) faces critical challenges in light attenuation and degraded geometric consistency. Despite recent advances of visual SLAM in indoor and urban scenes, these approaches typically assume a clear medium and neglect medium-light interactions, leading to performance degradation in underwater environments. To overcome these limitations, we propose DUV-SLAM, a dense underwater visual SLAM framework that integrates uncertainty-aware geometry estimation with physics-inspired neural scattering modeling. Our method introduces two core innovations: i) depth uncertainty quantification derived from differentiable bundle adjustment, which propagates geometric confidence to guide mapping optimization; and ii) a neural-Gaussian hybrid representation that combines adaptive 3D Gaussians for underwater reconstruction with a neural field capturing wavelength-dependent medium properties, optimized using a combination of photometric, geometric, and distribution losses. Experiments on synthetic and real-world datasets demonstrate that DUV-SLAM achieves high-quality monocular reconstruction while maintaining real-time efficiency and robust tracking accuracy. Our code will be released.
Poster
Junming Liu · Siyuan Meng · Yanting Gao · Song Mao · Pinlong Cai · Guohang Yan · Yirong Chen · Zilin Bian · DING WANG · Botian Shi

[ Exhibit Hall I ]

Abstract
Multimodal reasoning in Large Language Models (LLMs) struggles with incomplete knowledge and hallucination artifacts, challenges that textual Knowledge Graphs (KGs) only partially mitigate due to their modality isolation. While Multimodal Knowledge Graphs (MMKGs) promise enhanced cross-modal understanding, their practical construction is impeded by the semantic narrowness of manual text annotations and inherent noise in visual-semantic entity linkages. In this paper, we propose Vision-align-to-Language integrated Knowledge Graph (VaLiK), a novel approach for constructing MMKGs that enhances LLMs' reasoning through cross-modal information supplementation. Specifically, we cascade pre-trained Vision-Language Models (VLMs) to align image features with text, transforming them into descriptions that encapsulate image-specific information. Furthermore, we develop a cross-modal similarity verification mechanism to quantify semantic consistency, effectively filtering out noise introduced during feature alignment. Even without manually annotated image captions, the refined descriptions alone suffice to construct the MMKG. Compared to conventional MMKG construction paradigms, our approach achieves substantial storage efficiency gains while maintaining direct entity-to-image linkage capability. Experimental results on multimodal reasoning tasks demonstrate that LLMs augmented with VaLiK outperform previous state-of-the-art models.
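A hedged sketch of a cross-modal similarity check of the kind described, using an off-the-shelf CLIP model from Hugging Face Transformers. The model choice and the threshold value are illustrative assumptions, not the paper's configuration; loading the model downloads pretrained weights.

```python
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def passes_similarity_check(image: Image.Image, description: str,
                            threshold: float = 0.25) -> bool:
    """Keep a VLM-generated description only if it is semantically
    consistent with the image, measured by CLIP cosine similarity."""
    inputs = processor(text=[description], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img * txt).sum()) >= threshold
```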
Poster
Eyad Alshami · Shashank Agnihotri · Bernt Schiele · Margret Keuper

[ Exhibit Hall I ]

Abstract
It has been observed that deep neural networks (DNNs) often use both genuine and spurious features. In this work, we propose ``Amending Inherent Interpretability via Self-Supervised Masking'' (AIM), a simple yet surprisingly effective method that promotes the network’s utilization of genuine features over spurious alternatives without requiring additional annotations. In particular, AIM uses features at multiple encoding stages to guide a self-supervised, sample-specific feature-masking process. As a result, AIM allows training well-performing and inherently interpretable models that faithfully summarize the decision process. When tested on challenging datasets designed to assess reliance on spurious features and out-of-domain generalization, AIM networks demonstrate significant dual benefits: evaluations show that AIM improves interpretability, as measured by the Energy Pointing Game (EPG) score, by $\sim$6$-$37\%, while simultaneously enhancing accuracy by $\sim$10$-$40\%. These impressive performance gains are further validated on the standard in-domain CUB-200 dataset for fine-grained classification. The results provide compelling evidence supporting our hypothesis that AIM finds genuine and meaningful features that directly contribute to its improved human interpretability.
Poster
Suorong Yang · Peijia Li · Furao Shen · Jian Zhao

[ Exhibit Hall I ]

Abstract
Modern deep architectures often rely on large-scale datasets, but training on these datasets incurs high computational and storage overhead. Real-world datasets often contain substantial redundancies, prompting the need for more data-efficient training paradigms. Data selection has shown promise to mitigate redundancy by identifying the most representative samples, thereby reducing training costs without compromising performance. Existing methods typically rely on static scoring metrics or pretrained models, overlooking the combined effect of selected samples and their evolving dynamics during training. To address this, we introduce the concept of $\epsilon$-sample cover, which quantifies sample redundancy based on inter-sample relationships, capturing the intrinsic structure of the dataset. Based on this, we reformulate data selection as a reinforcement learning (RL) process, where a lightweight RL agent optimizes the selection policy by leveraging $\epsilon$-sample cover derived from evolving dataset distribution as a reward signal. Extensive experiments across benchmark datasets and diverse architectures demonstrate that our method consistently outperforms existing state-of-the-art baselines. Models trained with our selected datasets show enhanced generalization performance with improved training efficiency. Code will be made publicly available soon.
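For intuition about the $\epsilon$-sample cover notion, the sketch below greedily picks a subset so that every sample lies within distance $\epsilon$ of a selected one. The paper instead uses the cover as a reward signal for an RL selection policy, so this direct greedy construction is only illustrative.

```python
import numpy as np

def greedy_epsilon_cover(features: np.ndarray, eps: float) -> list:
    """Greedily select samples so that every sample lies within `eps`
    of some selected sample (an epsilon-cover of the dataset)."""
    n = features.shape[0]
    uncovered = np.ones(n, dtype=bool)
    selected = []
    while uncovered.any():
        idx = int(np.flatnonzero(uncovered)[0])       # pick any uncovered sample
        selected.append(idx)
        dists = np.linalg.norm(features - features[idx], axis=1)
        uncovered &= dists > eps                       # its eps-ball is now covered
    return selected

# Toy usage: cover 1,000 random feature vectors.
feats = np.random.randn(1000, 32)
cover = greedy_epsilon_cover(feats, eps=6.0)
print(len(cover), "samples cover the dataset")
```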
Poster
Jiaqi Jin · Siwei Wang · Zhibin Dong · Xihong Yang · Xinwang Liu · En Zhu · Kunlun He

[ Exhibit Hall I ]

Abstract
Multi-view clustering leverages complementary representations from diverse sources to enhance performance. However, real-world data often suffer from incomplete cases due to factors like privacy concerns and device malfunctions. A key challenge is effectively utilizing available instances to recover missing views. Existing methods frequently overlook the heterogeneity among views during recovery, leading to significant distribution discrepancies between recovered and true data. Additionally, many approaches focus on cross-view correlations, neglecting insights from intra-view reliable structure and cross-view clustering structure. To address these issues, we propose BURG, a novel method for incomplete multi-view clustering with distri**B**ution d**U**al-consistency **R**ecovery **G**uidance. We treat each sample as a distinct category and perform cross-view distribution transfer to predict the distribution space of missing views. To compensate for the lack of reliable category information, we design a dual-consistency guided recovery strategy that includes intra-view alignment guided by neighbor-aware consistency and cross-view alignment guided by prototypical consistency. Extensive experiments on benchmarks demonstrate the superiority of BURG in the incomplete multi-view scenario.
Poster
Daniil Zverev · Thaddäus Wiedemer · Ameya Prabhu · Matthias Bethge · Wieland Brendel · A. Sophia Koepke

[ Exhibit Hall I ]

Abstract
Designing effective foundation models requires high-quality evaluation datasets. With the emergence of audio-visual foundation models, reliable assessment of their multi-modal understanding is essential. The current gold standard for evaluating audio-visual understanding is the popular classification dataset VGGSound. However, our analysis identifies several critical issues in VGGSound, including incomplete labelling, partially overlapping classes, and misaligned modalities. These flaws lead to distorted evaluations of models' true auditory and visual capabilities. To address these limitations, we introduce VGGSounder, a comprehensively re-annotated, multi-label test set extending VGGSound that is explicitly designed to accurately evaluate audio-visual foundation models. VGGSounder features detailed modality annotations, enabling precise analyses of modality-specific performance and revealing previously unnoticed model limitations. We believe VGGSounder offers a robust and reliable benchmark supporting the future development of audio-visual foundation models.
Poster
Chen Zhu · Wangbo Zhao · Huiwen Zhang · Yuhao Zhou · Weidong Tang · Shuo Wang · Rui Xie · Yuzhang Shang · Xiaojiang Peng · Kai Wang · Dawei Yang

[ Exhibit Hall I ]

Abstract
Vision Transformer (ViT) has emerged as a foundational model in computer vision, excelling in generalization and adaptation to downstream tasks. However, supporting diverse resource constraints typically requires retraining multiple, size-specific ViTs, which is both time-consuming and expensive. In this paper, we propose *Efficient Elastic ViT Adaptation*, a single ViT framework that encapsulates multiple submodels of varying sizes, eliminating the need for repeated adaptation. We introduce elastic configurations along four key dimensions—embedding dimension, attention heads, MLP expansion ratio, and layer depth—and a lightweight router that selects the optimal submodel under different computational budgets. Training proceeds in two stages: *Staged Elastic Adaptation* progressively introduces complexity for efficient joint training of submodels while preserving as much pre-trained knowledge as possible; subsequently, we integrate the router to refine the model by balancing accuracy and MACs, guiding it to initially focus on a small set of promising submodels for faster convergence within the large design space. Our approach captures an exponentially large family of submodels in a single adaptation process. Extensive experiments demonstrate that, for any resource constraint, the router identifies the best submodel, delivering high performance and reduced overhead compared to previous methods.
Poster
Maan Qraitem · Piotr Teterwak · Kate Saenko · Bryan Plummer

[ Exhibit Hall I ]

Abstract
Vision-language models (VLMs) (e.g., CLIP, LLaVA) are trained on large-scale, lightly curated web datasets, leading them to learn unintended correlations between semantic concepts and unrelated visual signals. These associations degrade model accuracy by causing predictions to rely on incidental patterns rather than genuine visual understanding. Prior work has weaponized these correlations as an attack vector to manipulate model predictions, such as inserting a deceiving class text onto the image in a typographic attack. These attacks succeed due to VLMs' text-heavy bias—a result of captions that echo visible words rather than describing content. However, this attack has focused solely on text that matches the target class exactly, overlooking a broader range of correlations, including non-matching text and graphical symbols, which arise from the abundance of branding content in web-scale data. To address this gap, we introduce artifact-based attacks: a novel class of manipulations that mislead models using both non-matching text and graphical elements. Unlike typographic attacks, these artifacts are not predefined, making them harder to defend against but also more challenging to find. We address this by framing artifact attacks as a search problem and demonstrate their effectiveness across five datasets, with some artifacts reinforcing each other to reach 100\% attack …
Poster
Zhixuan Li · Binqian Xu · Xiangbo Shu · Jiachao Zhang · Yazhou Yao · Guo-Sen Xie · Jinhui Tang

[ Exhibit Hall I ]

Abstract
The combination of Large Language Models (LLMs) and Federated Learning (FL) to leverage privacy-preserving data has emerged as a promising approach to further enhance the Parameter-Efficient Fine-Tuning (PEFT) capabilities of LLMs. In real-world FL settings with resource heterogeneity, the training process of Low-Rank Adaptation (LoRA), the representative PEFT method, still faces two major challenges: aggregation noise and aggregation misalignment. In this paper, we propose a novel Tensor-aggregated LoRA (Te-LoRA) for federated fine-tuning, built on an alternating-freeze training strategy that avoids aggregation noise without additional server-side computational costs while mitigating the aggregation suboptimality caused by parameter misalignment between heterogeneous LoRAs. To address the aggregation suboptimality issue in particular, we design a Pre-Aggregation Alignment strategy (PAA-strategy) and a Tensor-to-Matrix strategy (T2M-strategy) for aligning heterogeneous LoRAs and aggregating them into a unified tensor, which is then decomposed into matrices adapted for client download. Extensive experiments demonstrate the effectiveness and robustness of Te-LoRA in both homogeneous and heterogeneous settings.
Poster
Changsheng Gao · Yifan Ma · Qiaoxi Chen · Xu yenan · Dong Liu · Weisi Lin

[ Exhibit Hall I ]

Abstract
Large models have achieved remarkable performance across various tasks, yet they incur significant computational costs and privacy concerns during both training and inference. Distributed deployment has emerged as a potential solution, but it necessitates the exchange of intermediate information between model segments, with feature representations serving as crucial information carriers. To optimize information exchange, feature coding is required to reduce transmission and storage overhead. Despite its importance, feature coding for large models remains an under-explored area. In this paper, we draw attention to large model feature coding and make three fundamental contributions. First, we introduce a comprehensive dataset encompassing diverse features generated by three representative types of large models. Second, we establish unified test conditions, enabling standardized evaluation pipelines and fair comparisons across future feature coding studies. Third, we introduce two baseline methods derived from widely used image coding techniques and benchmark their performance on the proposed dataset. These contributions aim to provide a foundation for future research and inspire broader engagement in this field. To support a long-term study, all source code and the dataset will be made publicly available and actively maintained.
Poster
Xiao Liu · Nan Pu · Haiyang Zheng · Wenjing Li · Nicu Sebe · Zhun Zhong

[ Exhibit Hall I ]

Abstract
In this paper, we investigate a practical yet challenging task: On-the-fly Category Discovery (OCD). This task focuses on the online identification of newly arriving stream data that may belong to both known and unknown categories, utilizing the category knowledge from only labeled data. Existing OCD methods are devoted to fully mining transferable knowledge from only labeled data. However, the transferability learned by these methods is limited because the knowledge contained in known categories is often insufficient, especially when few annotated data/categories are available in fine-grained recognition. To mitigate this limitation, we propose a diffusion-based OCD framework, dubbed DiffGRE, which integrates Generation, Refinement, and Encoding in a multi-stage fashion. Specifically, we first design an attribute-composition generation method based on cross-image interpolation in the diffusion latent space to synthesize novel samples. Then, we propose a diversity-driven refinement approach to select the synthesized images that differ from known categories for subsequent OCD model training. Finally, we leverage a semi-supervised leader encoding to inject additional category knowledge contained in synthesized data into the OCD models, which can benefit the discovery of both known and unknown categories during the on-the-fly inference process. Extensive experiments demonstrate the superiority of our DiffGRE over previous methods on six …
Poster
Zhifeng Gu · Bing WANG

[ Exhibit Hall I ]

Abstract
Humans perceive the world through multimodal cues to understand and interact with the environment. Learning a scene representation for multiple modalities enhances comprehension of the physical world. However, modality conflicts, arising from inherent distinctions among different modalities, present two critical challenges: property disparity and granularity disparity. To address these challenges, we propose a general framework, MMOne, to represent multiple modalities in one scene, which can be readily extended to additional modalities. Specifically, a modality modeling module with a novel modality indicator is proposed to capture the unique properties of each modality. Additionally, we design a multimodal decomposition mechanism to separate multimodal Gaussians into single-modal Gaussians based on modality differences. We address the essential distinctions among modalities by disentangling multimodal information into shared and modality-specific components, resulting in a more compact and efficient multimodal scene representation. Extensive experiments demonstrate that our method consistently enhances the representation capability for each modality and is scalable to additional modalities. The code will be released.
Poster
Shengyuan Ding · Wu Shenxi · Xiangyu Zhao · Yuhang Zang · Haodong Duan · Xiaoyi Dong · Pan Zhang · Yuhang Cao · Dahua Lin · Jiaqi Wang

[ Exhibit Hall I ]

Abstract
The Instruction Following (IF) ability measures how well Multi-modal Large Language Models (MLLMs) understand exactly what users are telling them and do it correctly. Existing multimodal instruction-following training data is scarce, the benchmarks are simple with atomic instructions, and the evaluation strategies are imprecise for tasks demanding exact output constraints. To address this, we present MM-IFEngine, an effective pipeline to generate high-quality image-instruction pairs. Our MM-IFEngine pipeline yields large-scale, diverse, and high-quality training data MM-IFInstruct-23k, which is suitable for Supervised Fine-Tuning (SFT) and extended as MM-IFDPO-23k for Direct Preference Optimization (DPO). We further introduce MM-IFEval, a challenging and diverse multi-modal instruction-following benchmark that includes (1) both textual constraints for output responses and visual constraints tied to the input images, and (2) a comprehensive evaluation pipeline incorporating rule-based assessment and LLM-as-a-Judge evaluation. We conduct SFT and DPO experiments and demonstrate that fine-tuning MLLMs on MM-IFInstruct-23k and MM-IFDPO-23k achieves notable gains on various IF benchmarks, such as MM-IFEval (+11.8$\%$), MIA (+7.7$\%$), and IFEval (+10.5$\%$).
Poster
Wujie Sun · Defang Chen · Siwei Lyu · Genlang Chen · Chun Chen · Can Wang

[ Exhibit Hall I ]

Abstract
Recent research on knowledge distillation has increasingly focused on logit distillation because of its simplicity, effectiveness, and versatility in model compression. In this paper, we introduce Refined Logit Distillation (RLD) to address the limitations of current logit distillation methods. Our approach is motivated by the observation that even high-performing teacher models can make incorrect predictions, creating an exacerbated divergence between the standard distillation loss and the cross-entropy loss, which can undermine the consistency of the student model's learning objectives. Previous attempts to use labels to empirically correct teacher predictions may undermine the class correlation. In contrast, our RLD employs labeling information to dynamically refine teacher logits. In this way, our method can effectively eliminate misleading information from the teacher while preserving crucial class correlations, thus enhancing the value and efficiency of distilled knowledge. Experimental results on CIFAR-100 and ImageNet demonstrate its superiority over existing methods. The code is provided in the supplementary material.
Poster
Zongheng Tang · Yi Liu · Yifan Sun · Yulu Gao · Jinyu Chen · Runsheng Xu · Si Liu

[ Exhibit Hall I ]

Abstract
Collaborative perception shares information among different agents and helps solve problems that individual agents may face, e.g., occlusions and a small sensing range. Prior methods usually separate multi-agent fusion and multi-time fusion into two consecutive steps. In contrast, this paper proposes an efficient collaborative perception method that aggregates the observations from different agents (space) and different times into a unified spatio-temporal space simultaneously. The unified spatio-temporal space brings two benefits, i.e., efficient feature transmission and superior feature fusion. 1) Efficient feature transmission: each static object yields a single observation in the spatio-temporal space, and thus requires transmission only once (whereas prior methods re-transmit all the object features multiple times). 2) Superior feature fusion: merging the multi-agent and multi-time fusion into a unified spatio-temporal aggregation enables a more holistic perspective, thereby enhancing perception performance in challenging scenarios. Consequently, our Collaborative perception with Spatio-temporal Transformer (CoST) gains improvement in both efficiency and accuracy. Notably, CoST is not tied to any specific method and is compatible with a majority of previous methods, enhancing their accuracy while reducing the transmission bandwidth.
Poster
Kiseong Hong · Gyeong-Hyeon Kim · Eunwoo Kim

[ Exhibit Hall I ]

Abstract
Prompt-based continual learning provides a rehearsal-free solution by tuning small sets of parameters while keeping pre-trained models frozen. To meet the complex demands of sequential tasks, it is crucial to integrate task-specific knowledge within prompts effectively. However, existing works rely on either fixed learned prompts (i.e., prompts whose representations remain unchanged during new task learning) or on prompts generated from an uninformative task-shared space, limiting the representational diversity of the integrated prompt. To address this issue, we propose a novel prompt-evolving mechanism to adaptively aggregate base prompts (i.e., task-specific prompts) into a unified prompt while ensuring diversity. By transforming and aligning all base prompts, both previously learned and newly introduced, our approach continuously evolves accumulated knowledge to facilitate learning new tasks. We further introduce a learnable probabilistic gate that adaptively determines which layers to activate during the evolution process. We validate our method on image classification and video action recognition tasks in class-incremental learning, achieving average gains of 9.07% and 7.40% over existing methods across all scenarios.
Poster
Mengyu Gao · Qiulei Dong

[ Exhibit Hall I ]

Abstract
Prompt learning has recently attracted much attention for adapting pre-trained vision-language models (e.g., CLIP) to downstream recognition tasks. However, most of the existing CLIP-based prompt learning methods in the literature only show a limited ability to handle fine-grained datasets. To address this issue, we propose a causality-guided text prompt learning method via visual granulation for CLIP, called CaPL, where the explored visual granulation technique constructs sets of visual granules for the text prompt to capture subtle discrepancies among different fine-grained classes through causal inference. The CaPL method contains the following two modules: (1) an attribute disentanglement module is proposed to decompose visual features into non-individualized attributes (shared by some classes) and individualized attributes (specific to single classes) using a Brownian Bridge Diffusion Model; (2) a granule learning module is proposed to construct visual granules by integrating the aforementioned attributes for recognition under two causal inference strategies. Thanks to the learned visual granules, a more discriminative text prompt is expected to be learned. Extensive experimental results on 15 datasets demonstrate that our CaPL method significantly outperforms the state-of-the-art prompt learning methods, especially on fine-grained datasets.
Poster
Xinhua Lu · Runhe Lai · Yanqi Wu · Kanghao Chen · Wei-Shi Zheng · Ruixuan Wang

[ Exhibit Hall I ]

Abstract
Pre-trained vision-language models (VLMs) have advanced out-of-distribution (OOD) detection recently. However, existing CLIP-based methods often focus on learning OOD-related knowledge to improve OOD detection, showing limited generalization or reliance on external large-scale auxiliary datasets. In this study, instead of delving into the intricate OOD-related knowledge, we propose an innovative CLIP-based framework based on Forced prompt leArning (FA), designed to make full use of the In-Distribution (ID) knowledge and ultimately boost the effectiveness of OOD detection. Our key insight is to learn a prompt (i.e., forced prompt) that contains more diversified and richer descriptions of the ID classes beyond the textual semantics of class labels. Specifically, it promotes better discernment for ID images, by forcing more notable semantic similarity between ID images and the learnable forced prompt. Moreover, we introduce a forced coefficient, encouraging the forced prompt to learn more comprehensive and nuanced descriptions of the ID classes. In this way, FA is capable of achieving notable improvements in OOD detection, even when trained without any external auxiliary datasets, while maintaining an identical number of trainable parameters as CoOp. Extensive empirical evaluations confirm our method consistently outperforms current state-of-the-art methods. The codes will be released publicly.
Poster
Zongyang Ma · Yuxin Chen · Ziqi Zhang · Zhongang Qi · Chunfeng Yuan · Shaojie Zhu · Chengxiang Zhuo · Bing Li · Ye Liu · Zang Li · Ying Shan · Weiming Hu

[ Exhibit Hall I ]

Abstract
Mathematical problems in real-world scenarios are often presented in a purely visual form, where the textual problem statement and accompanying math figures, e.g., geometry figures and functional graphs, are integrated into a single image. This vision-form problem-solving task requires precise comprehension and reasoning over both textual and graphical elements in the images, posing a significant challenge to current Multimodal Large Language Models (MLLMs), which process text and math figures in isolation. In this work, we propose VisionMath, the first exploration of a vision-form mathematical problem-solving model, which employs a three-stage progressive multimodal reasoning alignment strategy to systematically enhance task-specific capabilities. Building upon an LLM proficient in unimodal mathematical reasoning, VisionMath first establishes foundational OCR capabilities by learning to capture rendered mathematical problem images. Subsequently, the model develops a comprehensive understanding of figure structures and properties by learning from figure descriptions and mathematical educational videos. Finally, the model's reasoning capacity is activated using a carefully constructed visual-form problem-solving dataset, VisionMath-IT, with chain-of-thought annotations. For comprehensive evaluation, we construct multilingual benchmarks covering diverse problem types, including geometry, algebra, and function problems, in both English and Chinese. Our model weights, data and code will be made publicly available.
Poster
Xiyao Wang · Zhengyuan Yang · Linjie Li · Hongjin Lu · Yuancheng Xu · Chung-Ching Lin · Kevin Lin · Furong Huang · Lijuan Wang

[ Exhibit Hall I ]

Abstract
Despite significant advancements in vision-language models (VLMs), effective approaches for enhancing response quality by scaling inference-time computation are still lacking. This capability is known to be a core step towards self-improving models in recent large language model studies. In this paper, we present the Vision Value Model (VisVM), which can guide VLM inference-time search to generate responses with better visual comprehension. Specifically, VisVM not only evaluates the generated sentence quality in the current search step, but also anticipates the quality of subsequent sentences that may result from the current step, thus providing a long-term value. In this way, VisVM steers VLMs away from generating sentences prone to hallucinations or insufficient detail, thereby producing higher quality responses. Experimental results demonstrate that VisVM-guided search significantly enhances VLMs' ability to generate descriptive captions with richer visual details and fewer hallucinations, compared with greedy decoding and search methods with other visual reward signals. Furthermore, we find that self-training the model with the VisVM-guided captions improves the VLM's performance across a wide range of multimodal benchmarks, indicating the potential for developing self-improving VLMs.
Poster
George Stoica · Vivek Ramanujan · Xiang Fan · Ali Farhadi · Ranjay Krishna · Judy Hoffman

[ Exhibit Hall I ]

Abstract
Unconditional flow matching trains diffusion models to efficiently transport samples from a source distribution to samples of a target distribution by enforcing that the flows between sample pairs from the source and target distributions are unique. However, in conditional settings (e.g., class-conditioned models), this uniqueness is no longer guaranteed—flows from different conditions may overlap, leading to more ambiguous generations. We introduce Contrastive Flow Matching (CFM), an extension to the flow-matching objective that explicitly enforces uniqueness across all conditional flows, enhancing condition separation. Our approach adds a contrastive objective that maximizes dissimilarities between predicted flows from arbitrary sample pairs. We validate Contrastive Flow Matching by conducting extensive experiments across varying SiT model sizes on the popular ImageNet-1k (256x256) and (512x512) benchmarks. Notably, we find that training models with CFM (1) improves training speed by a factor of up to 2x, (2) requires up to 5x fewer de-noising steps, and (3) lowers FID by up to 8.9 compared to training the same models with flow matching. We commit to releasing our code upon publication.
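The objective described above can be sketched as a standard conditional flow-matching regression plus a term that pushes each predicted flow away from the target of a randomly paired sample. The weight `lam`, the linear interpolation path, and the toy velocity model are assumptions for illustration, not the paper's exact formulation or settings.

```python
import torch

def contrastive_flow_matching_loss(model, x0, x1, cond, lam: float = 0.05):
    """Flow-matching loss plus an illustrative contrastive term that
    increases dissimilarity to the flow target of a shuffled sample pair."""
    b = x0.size(0)
    t = torch.rand(b, 1, device=x0.device)
    x_t = (1.0 - t) * x0 + t * x1           # linear interpolation path
    target = x1 - x0                         # conditional flow-matching target
    pred = model(x_t, t, cond)

    fm_loss = ((pred - target) ** 2).mean()

    # Negative targets: flow targets of a random re-pairing of the batch.
    perm = torch.randperm(b, device=x0.device)
    contrast = -((pred - target[perm]) ** 2).mean()  # maximize dissimilarity

    return fm_loss + lam * contrast

# Toy usage with a small MLP acting on flattened samples.
net = torch.nn.Sequential(torch.nn.Linear(16 + 1 + 8, 64), torch.nn.ReLU(),
                          torch.nn.Linear(64, 16))
def velocity(x, t, c):
    return net(torch.cat([x, t, c], dim=-1))

loss = contrastive_flow_matching_loss(velocity, torch.randn(4, 16),
                                      torch.randn(4, 16), torch.randn(4, 8))
loss.backward()
```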
Poster
Wenjin Zhang · Xinyu Li · Chenyang Gao · Ivan Marsic

[ Exhibit Hall I ]

Abstract
Deep learning models rely on large-scale labeled datasets, but collecting such data is expensive and time-consuming. Semi-supervised learning (SSL) mitigates this issue by learning from a small set of labeled samples along with a large pool of unlabeled data. However, existing SSL methods struggle with fine-grained classification when dealing with visually similar classes, as they rely solely on visual features and ignore the semantic information within label names. This paper introduces \algo, an SSL enhancement approach that utilizes semantic information from label names to guide visual feature learning, addressing the challenges of fine-grained classification. By aligning text embeddings from label names with visual features, our method helps the model capture subtle visual distinctions that purely visual representations may overlook. To enhance robustness, we propose two key components: (1) text embedding de-similarity (TEDS) to reduce confusion caused by similar text embeddings across different class names, and (2) a class-aware visual-text alignment loss to accurately define positive and negative pairs during visual-text alignment. Our method achieves state-of-the-art performance on the latest SSL benchmarks. Additionally, on the challenging Food-101 dataset, which contains many visually similar classes and uses only 404 labeled images, our approach improves performance by approximately 13.6\% over the second-best method. Code is …
Poster
Haoyang Liu · Peiran Wang · Yijiang Li · Tiancheng Xing · Vibhu Dalal · Luwei LI · Jingrui He · Haohan Wang

[ Exhibit Hall I ]

Abstract
Dataset Distillation (DD) aims to generate a compact synthetic dataset that enables models to achieve performance comparable to training on the full large dataset, significantly reducing computational costs. Drawing from optimal transport theory, we introduce WMDD (Dataset Distillation with Wasserstein Metric-based Feature Matching), a straightforward yet powerful method that employs the Wasserstein metric to enhance distribution matching. We compute the Wasserstein barycenter of features from a pretrained classifier to capture essential characteristics of the original data distribution. By optimizing synthetic data to align with this barycenter in feature space and leveraging per-class BatchNorm statistics to preserve intra-class variations, WMDD maintains the efficiency of distribution matching approaches while achieving state-of-the-art results across various high-resolution datasets. Our extensive experiments demonstrate WMDD's effectiveness and adaptability, highlighting its potential for advancing machine learning applications at scale.
Poster
Chenxu Zhao · Wei Qian · Aobo Chen · Mengdi Huai

[ Exhibit Hall I ]

Abstract
Recent studies have shown that deep learning models are vulnerable to membership inference attacks (MIAs), which aim to infer whether a data record was used to train a target model or not. To analyze and study these vulnerabilities, various MIA methods have been proposed. Despite the significance and popularity of MIAs, existing works are limited in providing guarantees on the false discovery rate (FDR), which refers to the expected proportion of false discoveries among the identified positive discoveries. However, ensuring false discovery rate guarantees is very challenging, because the underlying distribution is usually unknown and the estimated non-member probabilities often exhibit interdependence. To tackle these challenges, in this paper, we design a novel membership inference attack method which can provide guarantees on the false discovery rate. Additionally, we show that our method can also provide a marginal probability guarantee on labeling true non-member data as member data. Notably, our method can work as a wrapper that can be seamlessly integrated with existing MIA methods in a post-hoc manner, while also providing FDR control. We perform the theoretical analysis for our method. Extensive experiments in various settings (e.g., the black-box setting and the …
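To make the FDR notion concrete, the sketch below applies the classic Benjamini-Hochberg step-up procedure to hypothetical per-record membership p-values. This is standard FDR control shown only for illustration; the method described above additionally handles unknown distributions and dependent scores, which plain Benjamini-Hochberg does not.

```python
import numpy as np

def benjamini_hochberg(p_values: np.ndarray, alpha: float = 0.1) -> np.ndarray:
    """Benjamini-Hochberg selection at FDR level `alpha`.

    Among the records flagged as members (True), the expected fraction of
    false flags is controlled at `alpha` (under independence assumptions).
    """
    m = len(p_values)
    order = np.argsort(p_values)
    ranked = p_values[order]
    thresholds = alpha * np.arange(1, m + 1) / m
    below = np.flatnonzero(ranked <= thresholds)
    rejected = np.zeros(m, dtype=bool)
    if below.size:
        rejected[order[: below[-1] + 1]] = True
    return rejected  # True = declared "member"

# Toy usage: p-values from a hypothetical per-record membership test.
p = np.concatenate([np.random.uniform(0, 0.02, 50),   # likely members
                    np.random.uniform(0, 1.0, 950)])   # likely non-members
flags = benjamini_hochberg(p, alpha=0.1)
print(flags.sum(), "records flagged as members")
```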
Poster
Chongyan Chen · Yu-Yun Tseng · Zhuoheng Li · Anush Venkatesh · Danna Gurari

[ Exhibit Hall I ]

Abstract
No existing work on visual question answering explicitly acknowledges that there can be ambiguity regarding where the content described in the question is located in the image. To fill this gap, we introduce VQ-FocusAmbiguity, the first VQA dataset that visually grounds each region described in the question that is necessary to arrive at the answer. We next analyze and compare our dataset to existing datasets to reveal its unique properties. Finally, we benchmark modern models for two novel tasks related to acknowledging focus ambiguity: recognizing whether a visual question has focus ambiguity and locating all plausible focus regions within the image. Results show that the dataset is challenging for modern models. To facilitate future progress on these tasks, we publicly share the dataset with an evaluation server at https://placeholder.github.io/.
Poster
Chengyao Qian · Trung Le · Mehrtash Harandi

[ Exhibit Hall I ]

Abstract
Knowledge distillation (KD) is an effective method for enhancing a small model, termed the student, by training it under the supervision of larger teacher models. However, existing studies indicate that a substantial capacity gap between the student and teacher can lead to poor learning for the student model. This capacity gap problem limits the applicability of KD and necessitates careful selection of the teacher's size. Despite its importance, the underlying cause of the capacity gap problem remains underexplored. In this paper, we reveal that a substantial disparity in the output distributions of teacher and student models is a key factor behind this issue. To demonstrate this, we decompose the KD loss into two components: class-wise similarity and inner-class distribution, and analyze the contribution of each term. Our analysis shows that a large distributional mismatch can lead to poor student learning. Inspired by this observation, we propose the Adapted Inner-class Distribution (AID) method, wherein the teacher model is fine-tuned to optimize its inner-class distribution to better align with the student's capacity prior to knowledge distillation. This approach effectively bridges the capacity gap between teacher and student models and consistently achieves state-of-the-art performance across a diverse range of architectures.
Poster
Hongyu Zhu · Sichu Liang · Wenwen Wang · Zhuomeng Zhang · Fangqi Li · Shi-Lin Wang

[ Exhibit Hall I ]

Abstract
Modern over-parameterized deep models are highly data-dependent, with large scale general-purpose and domain-specific datasets serving as the bedrock for rapid advancements. However, many datasets are proprietary or contain sensitive information, making unrestricted model training problematic. In the open world where data thefts cannot be fully prevented, Dataset Ownership Verification (DOV) has emerged as a promising method to protect copyright by detecting unauthorized model training and tracing illicit activities. Due to its diversity and superior stealth, evading DOV is considered extremely challenging. However, this paper identifies that previous studies have relied on oversimplistic evasion attacks for evaluation, leading to a false sense of security. We introduce a unified evasion framework, in which a teacher model first learns from the copyright dataset and then transfers task-relevant yet identifier-independent domain knowledge to a surrogate student using an out-of-distribution (OOD) dataset as the intermediary. Leveraging Vision-Language Models and Large Language Models, we curate the most informative and reliable subsets from the OOD gallery set as the final transfer set, and propose selectively transferring task-oriented knowledge to achieve a better trade-off between generalization and evasion effectiveness. Experiments across diverse datasets covering eleven DOV methods demonstrate our approach simultaneously eliminates all copyright identifiers and significantly outperforms …
Poster
Jinjia Peng · Zeze Tao · Huibing Wang · Meng Wang · Yang Wang

[ Exhibit Hall I ]

Abstract
Deep neural networks are susceptible to adversarial examples, which can lead to incorrect predictions by introducing imperceptible perturbations. Transfer-based attacks create adversarial examples for surrogate models and transfer these examples to victim models deployed in black-box scenarios. Recent studies reveal that adversarial examples in flat loss landscapes can alleviate overfitting on surrogate models and exhibit superior transferability. However, these works ignore the influence of perturbation directions, resulting in limited transferability. To overcome this limitation, this paper proposes a new attack method named Residual Perturbation Attack (ResPA), which employs the residual gradient as the perturbation direction to guide the adversarial examples toward the flat regions of the loss function. Specifically, ResPA conducts an exponential moving average operation on the input gradients to obtain the first moment as the referenced gradient, which encompasses the direction information of historical gradients. Moreover, to avoid over-relying on local flatness, instead of directly using the current gradient as the perturbation direction, ResPA further considers the residual between the current gradient and the referenced gradient, which can capture the changes in the global perturbation direction. Comprehensive experimental comparisons show that ResPA can remarkably enhance adversarial transferability. In addition, ResPA can be naturally combined with …
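The described update rule is concrete enough to sketch. Below is an illustrative, simplified version (not the authors' code): an I-FGSM-style loop that keeps an exponential moving average of input gradients as the referenced gradient and steps along the sign of the residual between the current and referenced gradients. Step sizes, the decay factor, and the toy model are assumptions.

```python
# ResPA-like residual-gradient attack, heavily simplified for illustration.
import torch
import torch.nn.functional as F

def respa_like_attack(model, x, y, eps=8/255, alpha=2/255, steps=10, beta=0.9):
    x_adv = x.clone().detach()
    ref_grad = torch.zeros_like(x)                      # EMA of past gradients
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        ref_grad = beta * ref_grad + (1 - beta) * grad  # referenced gradient
        residual = grad - ref_grad                      # residual direction
        x_adv = x_adv.detach() + alpha * residual.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)
    return x_adv.detach()

# Toy usage with a linear "surrogate" model on random data.
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
x, y = torch.rand(4, 3, 32, 32), torch.randint(0, 10, (4,))
x_adv = respa_like_attack(model, x, y)
print((x_adv - x).abs().max())
```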
Poster
Ioannis Sarridis · Christos Koutlis · Symeon Papadopoulos · Christos Diou

[ Exhibit Hall I ]

Abstract
Mitigating biases in computer vision models is an essential step towards the trustworthiness of artificial intelligence models. Existing bias mitigation methods focus on a small set of predefined biases, limiting their applicability in visual datasets where multiple, possibly unknown biases exist. To address this limitation, we introduce MAVias, an open-set bias mitigation approach leveraging foundation models to discover spurious associations between visual attributes and target classes. MAVias first captures a wide variety of visual features in natural language via a foundation image tagging model, and then leverages a large language model to select those visual features defining the target class, resulting in a set of language-coded potential visual biases. We then translate this set of potential biases into vision-language embeddings and introduce an in-processing bias mitigation approach to prevent the model from encoding information related to them. Our experiments on diverse datasets, including CelebA, Waterbirds, ImageNet, and UrbanCars, show that MAVias effectively detects and mitigates a wide range of biases in visual recognition tasks, outperforming the current state-of-the-art.
Poster
Junfu Tan · Peiguang Jing · Yu Zhu · YU LIU

[ Exhibit Hall I ]

Abstract
Open-set fine-grained recognition (OSFGR) is a core problem in building open-world intelligent systems. The challenge lies in the gradual semantic drift during the transition from coarse-grained to fine-grained categories. Although existing methods leverage hierarchical representations to assist progressive reasoning, they neglect semantic consistency across hierarchies. To address this, we propose a multimodal progressive bidirectional reasoning framework: (1) In forward reasoning, the model progressively refines visual features to capture hierarchical structural representations, while (2) in backward reasoning, variational inference integrates multimodal information to enforce consistency in category-aware latent spaces. This mechanism mitigates semantic drift through bidirectional information flow and cross-hierarchical feature consistency constraints. Extensive experiments on the iNat2021-OSR dataset, the largest open-set fine-grained dataset with over 600K images, demonstrate that our proposed method achieves superior performance over state-of-the-art methods.
Poster
Paul Albert · Frederic Zhang · Hemanth Saratchandran · Anton Hengel · Ehsan Abbasnejad

[ Exhibit Hall I ]

Abstract
Parameter-efficient fine-tuning (PEFT) has become a standard for adapting large pre-trained models. While low-rank adaptation (LoRA) has achieved notable success, recent studies highlight its limitations when compared to full-rank variants, particularly when scaling to demanding tasks such as vision-language classification or common-sense reasoning. We propose to quantitatively compare full and rank-restricted PEFT methods using a spectrum-controlled matrix approximation benchmark. Our results validate LoRA's rank limitations when approximating matrices with highly decorrelated or high-frequency features. We further show that full-rank methods can reduce LoRA's approximation error on these matrix types for an equal parameter count. Our evaluation then extends beyond synthetic tasks, where we observe that LoRA's restricted working subspace can produce high-norm updates, leading to over-fitting and poor out-of-distribution generalization. We address these limits by introducing KRAdapter, a novel PEFT algorithm that uses properties of the Khatri-Rao matrix product to produce weight matrices of higher effective rank and lower norm than related PEFT algorithms. We show the performance improvements of KRAdapter on vision-language models up to 1B parameters and 8B for LLMs, where we report 20 to 25 points of accuracy improvement over LoRA when reasoning on commonsense tasks unseen during training. Crucially, KRAdapter maintains the favorable training speed and …
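The Khatri-Rao claim lends itself to a quick numerical illustration. The sketch below makes several assumptions (matrix sizes, the row-wise/face-splitting variant of the Khatri-Rao product, and an entropy-based effective-rank measure) and is not the paper's construction; it only contrasts the effective rank of a LoRA-style low-rank product with a matrix whose rows are Kronecker products of rows of two thin factors, at an equal parameter count.

```python
# Effective-rank comparison: LoRA-style product vs. row-wise Khatri-Rao build.
import numpy as np

def effective_rank(W):
    s = np.linalg.svd(W, compute_uv=False)
    p = s / s.sum()
    return float(np.exp(-(p * np.log(p + 1e-12)).sum()))  # entropy-based measure

rng = np.random.default_rng(0)
d, r = 256, 16
# LoRA-style: W = A @ B.T, 2*d*r parameters, rank at most r.
A, B = rng.standard_normal((d, r)), rng.standard_normal((d, r))
W_lora = A @ B.T
# Row-wise Khatri-Rao: row i of W is kron(U[i], V[i]); same 2*d*r parameters,
# but the resulting d x (r*r) = 256 x 256 matrix is generically near full rank.
U, V = rng.standard_normal((d, r)), rng.standard_normal((d, r))
W_kr = np.einsum("ia,ib->iab", U, V).reshape(d, r * r)

print("LoRA-style effective rank:", effective_rank(W_lora))  # bounded by r
print("Khatri-Rao effective rank:", effective_rank(W_kr))    # much closer to d
```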
Poster
Yongwei Jiang · Yixiong Zou · Yuhua Li · Ruixuan Li

[ Exhibit Hall I ]

Abstract
Few-Shot Class-Incremental Learning (FSCIL) faces dual challenges of data scarcity and incremental learning in real-world scenarios. While pool-based prompting methods have demonstrated success in traditional incremental learning, their effectiveness in FSCIL settings remains unexplored. This paper presents the first study of current prompt pool methods in FSCIL tasks, revealing an unanticipated performance degradation in incremental sessions. Through comprehensive analysis, we identify that this phenomenon stems from token-dimension saturation: with limited data, excessive prompts compete for task-relevant information, leading to model overfitting. Based on this finding, we propose LGSP-Prompt (Local-Global Spatial Prompting), which innovatively shifts pool-based prompt learning from the token dimension to the spatial dimension. LGSP-Prompt generates spatial prompts by synergistically combining local spatial features and global frequency-domain representations to highlight key patterns in input images. We construct two spatial prompt pools enabling dynamic prompt selection to maintain acquired knowledge while effectively learning novel sessions. Extensive experiments demonstrate that our approach achieves state-of-the-art performance across multiple FSCIL benchmarks, showing significant advantages in both base knowledge preservation and incremental learning. Our codes will be released.
Poster
Liping Yi · Han Yu · Gang Wang · xiaoguang Liu · Xiaoxiao Li

[ Exhibit Hall I ]

Abstract
Model-heterogeneous federated learning (MHFL) is a challenging FL paradigm designed to allow FL clients to train structurally heterogeneous models under the coordination of an FL server. Existing MHFL methods face significant limitations in transferring global knowledge to clients, as they share only partial homogeneous model parameters or rely on distance losses, leading to inferior model generalization. To bridge this gap, we propose a novel model-heterogeneous Federated learning method with Representation Angle Learning (FedRAL). It consists of three innovative designs: (1) We first introduce representation angle learning into MHFL. Specifically, we embed a homogeneous square matrix into the local heterogeneous model of each client, which learns the angle information of local representations. These homogeneous representation angle square matrices are aggregated on the server to fuse the representation angle knowledge shared by clients, enhancing the generalization of local representations. (2) As different clients might have heterogeneous system resources, we propose an adaptive diagonal sparsification strategy to reduce the number of parameters of the representation angle square matrices uploaded to the server, improving FL communication efficiency. (3) To enable the effective fusion of sparsified homogeneous local representation angle square matrices, we design an element-wise weighted aggregation approach. Experiments …
Poster
Haiyang Guo · Fanhu Zeng · Fei Zhu · Wenzhuo Liu · Da-Han Wang · Jian Xu · Xu-Yao Zhang · Cheng-Lin Liu

[ Exhibit Hall I ]

Abstract
A vast amount of instruction tuning data is crucial for the impressive performance of Large Multimodal Models (LMMs), but the associated computational costs and data collection demands during supervised fine-tuning make it impractical for most researchers. Federated learning (FL) has the potential to leverage all distributed data and training resources to reduce the overhead of joint training. However, most existing methods assume a fixed number of tasks, while in real-world scenarios, clients continuously encounter new knowledge and often struggle to retain old tasks due to memory constraints. In this work, we introduce the Federated Continual Instruction Tuning (FCIT) benchmark to model this real-world challenge. Our benchmark includes two realistic scenarios, encompassing four different settings and twelve carefully curated instruction tuning datasets. To address the challenges posed by FCIT, we propose dynamic knowledge organization to effectively integrate updates from different tasks during training and subspace selective activation to allocate task-specific output during inference. Extensive experimental results demonstrate that our proposed method significantly enhances model performance across varying levels of data heterogeneity and catastrophic forgetting. Our source code and dataset will be made publicly available.
Poster
Yiyuan Zhang · Handong Li · Jing Liu · Xiangyu Yue

[ Exhibit Hall I ]

Abstract
This work introduces Multimodal Context (MiCo), a scalable pretraining framework designed to advance omni-modal intelligence—an AI system capable of understanding and learning from multiple modalities to achieve universal representation learning. MiCo allows for efficient scaling of both the number of modalities and the volume of data, along with model parameters, during the pretraining phase. We evaluate the pretrained models across a diverse set of tasks, including: (i) single-modality perception benchmarks covering 10 distinct modalities, (ii) 25 cross-modal tasks spanning retrieval, question-answering, and captioning, and (iii) 18 large-scale multimodal language model benchmarks. MiCo consistently delivers state-of-the-art results, setting 37 new benchmarks across these tasks. The pretrained models, along with the collected datasets and codebase, will be made publicly available to support the development of omni-modal intelligence and broader research in multimodal learning.
Poster
Luong Tran · Thieu Vo · Anh Nguyen · Sang Dinh · Van Nguyen

[ Exhibit Hall I ]

Abstract
Multi-label learning is a challenging computer vision task that requires assigning multiple categories to each image. However, fully annotating large-scale datasets is often impractical due to high costs and effort, motivating the study of learning from partially annotated data. In the extreme case of Single Positive Multi-Label Learning (SPML), each image is provided with only one positive label, while all other labels remain unannotated. Traditional SPML methods that treat missing labels as unknown or negative tend to yield inaccuracies and false negatives, and integrating various pseudo-labeling strategies can introduce additional noise. To address these challenges, we propose the Generalized Pseudo-Label Robust Loss (GPR Loss), a novel loss function that effectively learns from diverse pseudo-labels while mitigating noise. Complementing this, we introduce a simple yet effective Dynamic Augmented Multi-focus Pseudo-labeling (DAMP) technique. Together, these contributions form the Adaptive and Efficient Vision-Language Pseudo-Labeling (AEVLP) framework. Extensive experiments on four benchmark datasets demonstrate that our framework significantly advances multi-label classification, achieving state-of-the-art results.
Poster
Aritra Bhowmik · Mohammad Mahdi Derakhshani · Dennis Koelma · Yuki Asano · Martin Oswald · Cees Snoek

[ Exhibit Hall I ]

Abstract
Spatial awareness is key to enable embodied multimodal AI systems. Yet, without vast amounts of spatial supervision, current Multimodal Large Language Models (MLLMs) struggle at this task. In this paper, we introduce TWIST & SCOUT, a framework that equips pre-trained MLLMs with visual grounding ability without forgetting their existing image and language understanding skills. To this end, we propose TWIST, a twin-expert stepwise tuning module that modifies the decoder of the language model using one frozen module pre-trained on image understanding tasks and another learnable one for visual grounding tasks. This allows the MLLM to retain previously learned knowledge and skills, while acquiring what is missing. To fine-tune the model effectively, we generate a high-quality synthetic dataset we call SCOUT, which mimics human reasoning in visual grounding. This dataset provides rich supervision signals, describing a step-by-step multimodal reasoning process, thereby simplifying the task of visual grounding. We evaluate our approach on several standard benchmark datasets, encompassing grounded image captioning, zero-shot localization, and visual grounding tasks. Our method consistently delivers strong performance across all tasks, while retaining the pre-trained image understanding capabilities.
Poster
Oindrila Saha · Logan Lawrence · Grant Horn · Subhransu Maji

[ Exhibit Hall I ]

Abstract
Transductive zero-shot learning with vision-language models leverages image-image similarities within the dataset to achieve better classification accuracy compared to the inductive setting. However, there is little work that explores the structure of the language space in this context. We propose GTA-CLIP, a novel technique that incorporates supervision from language models for joint transduction in language and vision spaces. Our approach is iterative and consists of three steps: (i) incrementally exploring the attribute space by querying language models, (ii) an attribute-augmented transductive inference procedure, and (iii) fine-tuning the language and vision encoders based on inferred labels within the dataset. Through experiments with CLIP encoders, we demonstrate that GTA-CLIP yields an average performance improvement of 9.5% and 4.0% across 12 datasets and 3 encoders, over CLIP and transductive CLIP respectively, in the zero-shot setting. We also observe similar improvements in a few-shot setting. We present ablation studies that demonstrate the value of each step and visualize how the vision and language spaces evolve over iterations driven by transductive learning.
Poster
Shengao Wang · Arjun Chandra · Aoming Liu · Boqing Gong · Venkatesh Saligrama

[ Exhibit Hall I ]

Abstract
Human infants rapidly develop visual reasoning skills from minimal input, suggesting that developmentally inspired pretraining could significantly enhance the efficiency of vision-language models (VLMs). Although recent efforts have leveraged infant-inspired datasets like SAYCam, existing evaluation benchmarks remain misaligned—they are either too simplistic, narrowly scoped, or tailored for large-scale pretrained models. Additionally, training exclusively on infant data overlooks the broader, diverse input from which infants naturally learn. To address these limitations, we propose BabyVLM, a novel framework comprising comprehensive in-domain evaluation benchmarks and a synthetic training dataset created via child-directed transformations of existing datasets. We demonstrate that VLMs trained with our synthetic dataset achieve superior performance on BabyVLM tasks compared to models trained solely on SAYCam or existing general-purpose datasets. BabyVLM thus provides a robust, developmentally aligned evaluation tool and illustrates how compact models trained on carefully curated data can generalize effectively, opening pathways toward data-efficient vision-language learning paradigms.
Poster
Oscar Mañas · Pierluca D'Oro · Koustuv Sinha · Adriana Romero-Soriano · Michal Drozdzal · Aishwarya Agrawal

[ Exhibit Hall I ]

Abstract
As Multimodal Large Language Models (MLLMs) gain widespread applicability, it is becoming increasingly desirable to adapt them for diverse user needs. In this paper, we study the adaptation of MLLMs through controlled decoding. To achieve this, we introduce the first method for reward-guided decoding of MLLMs and demonstrate its application in improving their visual grounding. Our method involves building reward models for visual grounding and using them to guide the MLLM's decoding process. Concretely, we build two separate reward models to independently control the degree of object precision and recall in the model's output. Our approach enables on-the-fly controllability of an MLLM's inference process in two ways: first, by giving control over the relative importance of each reward function during decoding, allowing a user to dynamically trade off object precision and recall in image captioning tasks; second, by giving control over the breadth of the search during decoding, allowing the user to control the trade-off between the amount of test-time compute and the degree of visual grounding. We evaluate our method on standard object hallucination benchmarks, showing that it provides significant controllability over MLLM inference, while matching or surpassing the performance of existing hallucination mitigation methods.
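A stripped-down sketch of the decoding control described above, with all specifics assumed: here the reward guidance is applied by re-scoring a pool of candidate continuations with a weighted sum of the language-model log-probability and two reward scores, where the weights are the user-facing precision/recall knobs. The actual method steers decoding itself rather than performing a post-hoc rerank, and all function names are placeholders.

```python
# Reward-guided candidate selection (illustrative stand-in, not the paper's method).
from typing import Callable, List

def guided_select(candidates: List[str],
                  lm_logprob: Callable[[str], float],
                  reward_precision: Callable[[str], float],
                  reward_recall: Callable[[str], float],
                  w_precision: float = 1.0,
                  w_recall: float = 1.0) -> str:
    """Pick the candidate maximising log p(y|x) + w_p * R_p(y) + w_r * R_r(y)."""
    def score(y: str) -> float:
        return (lm_logprob(y)
                + w_precision * reward_precision(y)
                + w_recall * reward_recall(y))
    return max(candidates, key=score)

# Toy usage with placeholder scoring functions.
caps = ["a dog on grass", "a dog and a frisbee on grass", "a cat indoors"]
best = guided_select(
    caps,
    lm_logprob=lambda y: -0.1 * len(y),                     # fake LM score
    reward_precision=lambda y: 0.0 if "cat" in y else 1.0,  # penalise hallucination
    reward_recall=lambda y: float("dog" in y) + float("frisbee" in y),
    w_precision=2.0, w_recall=1.0)
print(best)
```

Raising `w_recall` relative to `w_precision` (or vice versa) is the kind of on-the-fly trade-off the abstract describes.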
Poster
Jefferson Hernandez · Jing Shi · Simon Jenni · Vicente Ordonez · Kushal Kafle

[ Exhibit Hall I ]

Abstract
Traditional alignment methods for Large Vision and Language Models (LVLMs) primarily rely on human-curated preference data. Human-generated preference data is costly; machine-generated preference data is limited in quality; and self-supervised preference data often introduces hallucinations. To overcome these limitations, we propose a novel Panel-of-Peers learning framework inspired by collaborative learning among humans. This approach leverages a panel of LVLMs, each evaluating and learning from their collective outputs through an iterative self-improvement process. By simulating a peer review system, our models generate, assess, and refine outputs in response to a curated set of prompts, mimicking a classroom learning environment. We demonstrate that this methodology enhances model performance without requiring extensive human-labeled datasets. Our experiments show significant improvement across multiple benchmarks, demonstrating the potential of peer evaluations as a scalable alternative to self-supervised alignment. Notably, we show that Panel-of-Peers increases the average score on fifteen benchmarks from 48% to 57%.
Poster
Michihiro Kuroki · Toshihiko Yamasaki

[ Exhibit Hall I ]

Abstract
Although saliency maps can highlight important regions to explain the reasoning behind image classification in artificial intelligence (AI), the meaning of these regions is left to the user's interpretation. In contrast, concept-based explanations decompose AI predictions into human-understandable concepts, clarifying their contributions. However, few methods can simultaneously reveal what concepts an image classifier learns, which regions are associated with them, and how they contribute to predictions. We propose a novel concept-based explanation method, Concept-based Explanation via Fusion of Activation Maps (CE-FAM). It employs a branched network that shares activation maps with an image classifier and learns to mimic the embeddings of a Vision and Language Model (VLM). The branched network predicts concepts in an image, and their corresponding regions are represented by a weighted sum of activation maps, with weights given by the gradients of the concept prediction scores. Their contributions are quantified based on their impact on the image classification score. Our method provides a general framework for identifying the concept regions and their contributions while leveraging VLM knowledge to handle arbitrary concepts without requiring an annotated dataset. Furthermore, we introduce a novel evaluation metric to assess the accuracy of the concept regions. Our qualitative and quantitative evaluations demonstrate our …
Poster
Zihan Zhou · LI LI · Yanli Ren · Chuan Qin · Guorui Feng

[ Exhibit Hall I ]

Abstract
Adversarial examples, crafted with imperceptible perturbations, reveal a significant vulnerability of Deep Neural Networks (DNNs). More critically, the transferability of adversarial examples allows attackers to induce unreasonable predictions without requiring knowledge about the target model. DNNs exhibit spatial invariance, meaning that the position of an object does not affect the classification result. However, existing input transformation-based adversarial attacks solely focus on behavioral patterns at a singular position, failing to fully exploit the spatial invariance exhibited by DNNs across multiple positions, thus constraining the transferability of adversarial examples. To address this, we propose a multi-scale, multi-position input transformation-based attack called Spatial Invariance Diversity (SID). Specifically, SID uses hybrid spatial-spectral fusion mechanisms within localized receptive fields, followed by multi-scale spatial downsampling and positional perturbations via random transformations, thereby crafting an ensemble of inputs to activate diverse behavioral patterns for effective adversarial perturbations. Extensive experiments on the ImageNet dataset demonstrate that SID could achieve better transferability than the current state-of-the-art input transformation-based attacks. Additionally, SID can be flexibly integrated with other input transformation-based or gradient-based attacks, further enhancing the transferability of adversarial examples.
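A rough sketch of the ensemble idea in the abstract, with every detail assumed (scale set, shift range, single-step update; this is not the authors' SID implementation): average input gradients over several rescaled and randomly shifted copies of the image, then take an FGSM-style step.

```python
# Multi-scale, multi-position gradient averaging for a single attack step.
import torch
import torch.nn.functional as F

def multi_position_step(model, x, y, alpha=2/255, scales=(1.0, 0.9, 0.8), shifts=3):
    grad_sum, n = torch.zeros_like(x), 0
    for s in scales:
        size = max(8, int(x.shape[-1] * s))
        for _ in range(shifts):
            dx, dy = torch.randint(-4, 5, (2,)).tolist()
            xt = torch.roll(x, shifts=(dx, dy), dims=(-2, -1))        # reposition
            xt = F.interpolate(xt, size=size, mode="bilinear", align_corners=False)
            xt = F.interpolate(xt, size=x.shape[-1], mode="bilinear",
                               align_corners=False).requires_grad_(True)
            loss = F.cross_entropy(model(xt), y)
            grad_sum += torch.autograd.grad(loss, xt)[0]
            n += 1
    return (x + alpha * (grad_sum / n).sign()).clamp(0, 1)

model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
x, y = torch.rand(2, 3, 32, 32), torch.randint(0, 10, (2,))
print(multi_position_step(model, x, y).shape)
```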
Poster
Yongxin Guo · Lin Wang · Xiaoying Tang · Tao Lin

[ Exhibit Hall I ]

Abstract
Federated Learning (FL) is a privacy-preserving distributed machine learning paradigm. Nonetheless, the substantial distribution shifts among clients pose a considerable challenge to the performance of current FL algorithms. To mitigate this challenge, various methods have been proposed to enhance the FL training process. This paper tackles the issue of data heterogeneity from another perspective: improving FL algorithms prior to the actual training stage. Specifically, we introduce the Client2Vec mechanism, which generates a unique client index that contains clients' distribution shift information for each client before the commencement of FL training. Subsequently, we leverage the generated client index to enhance the subsequent FL training process. To demonstrate the effectiveness of the proposed Client2Vec method, we conduct three case studies that assess the impact of the client index on the FL training process. These case studies encompass enhanced client sampling, model aggregation, and local training. Extensive experiments conducted on diverse datasets and model architectures show the efficacy of Client2Vec across all three case studies. Our code will be publicly available.
Poster
xinyi lai · Luojun Lin · Weijie Chen · yuanlong yu

[ Exhibit Hall I ]

Abstract
Long-Tailed Class-Incremental Learning (LT-CIL) faces critical challenges due to biased gradient updates arising from imbalanced data distributions and the inherent stability-plasticity trade-off, which collectively degrade tail-class performance and induce catastrophic forgetting. To address these limitations, we introduce Geometric Prototype Alignment (GPA), a model-agnostic classifier initialization method that calibrates learning dynamics through geometric feature space alignment. GPA initializes classifier weights by aligning them with frozen class prototypes on the unit hypersphere, explicitly disentangling magnitude imbalance from directional discriminability. During incremental training, we introduce Dynamic Anchoring to adjust weights while preserving geometric consistency, thereby balancing plasticity for new classes while maintaining stability for previously learned knowledge. When integrated into state-of-the-art CIL frameworks such as LUCIR and DualPrompt, GPA demonstrates significant improvements: achieving an average incremental accuracy increase of 12.3% and decreasing forgetting rates by 12.2% on CIFAR100-LT. Theoretical analysis reveals that GPA accelerates convergence by 2.7x and achieves nearly Fisher-optimal decision boundaries. Our work lays a geometric foundation for stable representation learning in LT-CIL scenarios, addressing both catastrophic forgetting and tail-class degradation.
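The initialization step lends itself to a small sketch. Assumptions: prototypes are per-class means of frozen features, the classifier is a plain linear layer, and the bias handling is a guess; the Dynamic Anchoring stage is not shown.

```python
# GPA-like classifier initialization: copy unit-norm class prototypes into weights.
import torch

@torch.no_grad()
def prototype_init(classifier: torch.nn.Linear, feats: torch.Tensor,
                   labels: torch.Tensor, num_classes: int):
    """Set each classifier row to the L2-normalized prototype of its class."""
    for c in range(num_classes):
        idx = labels == c
        if idx.any():
            proto = feats[idx].mean(dim=0)
            classifier.weight[c] = proto / proto.norm().clamp_min(1e-12)
    if classifier.bias is not None:
        classifier.bias.zero_()

feats = torch.randn(100, 64)            # frozen backbone features (toy)
labels = torch.randint(0, 10, (100,))
clf = torch.nn.Linear(64, 10)
prototype_init(clf, feats, labels, num_classes=10)
print(clf.weight.norm(dim=1))           # initialized rows lie on the unit hypersphere
```

Keeping weight directions on the unit hypersphere is what separates directional discriminability from the magnitude imbalance the abstract refers to.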
Poster
PRAFFUL KHOBA · Zijian Wang · Chetan Arora · Mahsa Baktashmotlagh

[ Exhibit Hall I ]

Abstract
Selecting an optimal Parameter-Efficient Fine-Tuning (PEFT) technique for a downstream task is a fundamental challenge in transfer learning. Unlike full fine-tuning, where all model parameters are updated, PEFT techniques modify only a small subset of parameters while keeping the backbone frozen, making them computationally efficient. However, this introduces a unique problem: selecting the most effective PEFT method for a given dataset. Existing transferability estimation (TE) metrics primarily focus on ranking distinct architectures and struggle to detect subtle embedding differences introduced by various PEFT methods sharing the same backbone. To address this limitation, we propose a novel diffusion-based metric explicitly designed for PEFT selection. Unlike conventional metrics, our approach models the fine-grained geometric relationships of embedding spaces through a diffusion process, effectively quantifying intra-class compactness and inter-class separability. Extensive evaluations on the VTAB-1k benchmark validate our method’s effectiveness, demonstrating a substantial 68.95% improvement over LogME, 1297.29% over $\mathcal{N}$LEEP, 149.75% over NCTI, and 140.46% over SFDA—four widely used TE methods designed for ranking pre-trained models.
Poster
Liang Chen · Ghazi Shazan Ahmad · Tianjun Yao · Lingqiao Liu · Zhiqiang Shen

[ Exhibit Hall I ]

Abstract
Pretrained vision-language models (VLMs), such as CLIP, achieve remarkable zero-shot performance, yet their downstream potential hinges on effective fine-tuning. Most adaptation methods typically focus on refining representations from separate modalities (text or vision) but neglect the critical role of their fused representations in the decision-making process, i.e., the rational matrix that drives the final prediction. To bridge the gap, we propose a simple yet effective Rational Adaptation (RAda) to explicitly exploit the final fused representation during fine-tuning. RAda employs a learned mask, obtained from a lightweight attention layer attached at the end of a VLM, to dynamically calibrate the contribution of each element in the rational matrix, enabling targeted adjustments to the final cross-modal interactions without incurring costly modifications to intermediate features. Experiments in different settings (i.e., updating or freezing pretrained encoders during adaptation, and test-time training that can only access the unlabeled test data) show that RAda serves as a versatile fine-tuning technique, improving the baseline with minimal code and performing comparably against current arts in most settings.
Poster
Qi Guo · Zhen Tian · Minghao Yao · Saiyu Qi · Yong Qi · Bingyi Liu

[ Exhibit Hall I ]

Abstract
Federated Unlearning (FU) should satisfy three key requirements: a guarantee of data erasure, preservation of model utility, and reduction of unlearning time. Recent studies focus on identifying and modifying original model parameters relevant to unlearning data. While they can achieve faster unlearning, they degrade the model performance on remaining data or fail to forget unlearning data due to the difficulty in isolating specific parameters of the unlearning data. By revisiting the representation distribution of the optimal unlearning models (i.e., the retrained models), we observe that unlearning data tends to cluster within semantically related categories of remaining data. This inspired us to transform the distribution of unlearning data to fuse with similar categories in the remaining data for effective FU. Based on this insight, we propose a novel framework, named FUCRT, to achieve Federated Unlearning via Class-aware Representation Transformation. FUCRT consists of two key components: (1) a transformation class identification strategy (TCI) that leverages the original model to identify appropriate transformation classes for unlearning data, and (2) a targeted transformation learning process (TTL) with cross-class fusion mechanism to ensure effective and consistent transformation of unlearning data. Extensive experiments on four datasets demonstrate that FUCRT not only achieves 100% of data erasure …
Poster
Tianhong Gao · Yannian Fu · Weiqun Wu · Haixiao Yue · Shanshan Liu · Gang Zhang

[ Exhibit Hall I ]

Abstract
Large Language Models (LLMs), enhanced through agent tuning, have demonstrated remarkable capabilities in Chain-of-Thought (CoT) and tool utilization, significantly surpassing the performance of standalone models. However, the multimodal domain still lacks a large-scale, high-quality agent tuning dataset to unlock the full potential of multimodal large language models. To bridge this gap, we introduce MMAT-1M, the first million-scale multimodal agent tuning dataset designed to support CoT, reflection, and dynamic tool usage. Our dataset is constructed through a novel four-stage data engine: 1) We first curate publicly available multimodal datasets containing question-answer pairs; 2) Then, leveraging GPT-4o, we generate rationales for the original question-answer pairs and dynamically integrate API calls and Retrieval Augmented Generation (RAG) information through a multi-turn paradigm; 3) Furthermore, we refine the rationales through reflection to ensure logical consistency and accuracy, creating a multi-turn dialogue dataset with both Rationale and Reflection (RR); 4) Finally, to enhance efficiency, we optionally compress multi-turn dialogues into a One-turn Rationale and Reflection format (ORR). By fine-tuning open-source multimodal models on the MMAT-1M, we observe significant performance gains. For instance, the InternVL2.5-8B-RR model achieves an average improvement of 2.7% across eight public benchmarks and 8.8% on the RAG benchmark Dyn-VQA, demonstrating the dataset's effectiveness …
Poster
Biao Zhang · Jing Ren · Peter Wonka

[ Exhibit Hall I ]

Abstract
Neural representations of 3D data have been widely adopted across various applications, particularly in recent work leveraging coordinate-based networks to model scalar or vector fields. However, these approaches face inherent challenges, such as handling thin structures and non-watertight geometries, which limit their flexibility and accuracy. In contrast, we propose a novel geometric data representation that models geometry as distributions, a powerful representation that makes no assumptions about surface genus, connectivity, or boundary conditions. Our approach uses diffusion models with a novel network architecture to learn surface point distributions, capturing fine-grained geometric details. We evaluate our representation qualitatively and quantitatively across various object types, demonstrating its effectiveness in achieving high geometric fidelity. Additionally, we explore applications using our representation, such as textured mesh representation, neural surface compression, dynamic object modeling, and rendering, highlighting its potential to advance 3D geometric learning.
Poster
Zhenghao Gao · Shengjie Xu · Zijing Li · Meixi Chen · Chaojian Yu · Yuanjie Shao · Changxin Gao

[ Exhibit Hall I ]

Abstract
Adversarial attack plays a critical role in evaluating the robustness of deep learning models. Jacobian-based Saliency Map Attack (JSMA) is an interpretable adversarial method that offers excellent pixel-level control and provides valuable insights into model vulnerabilities. However, its quadratic computational complexity $O(M^2 \times N)$ renders it impractical for large-scale datasets, limiting its application despite its inherent value. This paper proposes FastJSMA, an efficient attack method that addresses these computational limitations. Our approach introduces a gradient decoupling mechanism that decomposes the Jacobian calculation into complementary class suppression ($g^-$) and class excitation ($g^+$) gradients, reducing complexity to $O(M\sqrt{N})$. Additionally, we implement a class probing mechanism and an adaptive saliency threshold to further optimize the process. Experimental results across multiple datasets demonstrate that FastJSMA maintains high attack success rates (98.4% relative efficiency) while dramatically reducing computation time—requiring only 1.8% of JSMA's processing time on CIFAR-100 and successfully operating on ImageNet where traditional JSMA fails due to memory constraints. This advancement enables the practical application of interpretable saliency map-based attacks on large-scale datasets, balancing effectiveness with computational efficiency.
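A toy sketch of the gradient-decoupling idea named in the abstract (class suppression $g^-$ and class excitation $g^+$), with the gating and combination rule chosen for illustration rather than taken from the paper:

```python
# Decoupled excitation/suppression gradients combined into a pixel saliency map.
import torch

def decoupled_saliency(model, x, target):
    x = x.clone().detach().requires_grad_(True)
    logits = model(x)
    z_target = logits[:, target].sum()                       # excite the target class
    z_rest = (logits.sum(dim=1) - logits[:, target]).sum()   # suppress the rest
    g_plus = torch.autograd.grad(z_target, x, retain_graph=True)[0]
    g_minus = torch.autograd.grad(z_rest, x)[0]
    saliency = g_plus * (-g_minus)         # high where target rises and rest falls
    saliency[(g_plus < 0) | (g_minus > 0)] = 0.0             # JSMA-style gating
    return saliency

model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 8 * 8, 5))
x = torch.rand(2, 3, 8, 8)
print(decoupled_saliency(model, x, target=3).shape)
```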
Poster
Yuhang Li · Zhuying Li · Yuheng Jia

[ Exhibit Hall I ]

Abstract
The problem of learning from long-tailed noisy data, referred to as Long-Tailed Noisy Label Learning (LTNLL), presents significant challenges in deep learning. LTNLL datasets are typically affected by two primary issues: class imbalance and label noise. While previous methods have addressed these problems separately, the simultaneous presence of both in real-world applications remains underexplored. In this paper, we introduce a simple yet effective method, **I**nstances **B**enefitting **C**lasses (**IBC**). Our philosophy is to simultaneously overcome overfitting to noisy classes and transfer knowledge between semantically related classes. At the instance level, we propose selecting top-$k$ semantically similar classes and use them to construct soft labels. Specifically, we soften noisy hard labels by reducing the probability of noisy classes and reallocating this probability to the semantically similar classes. **This reduces the model's overconfidence in noisy classes while enhancing its focus on tail classes.** We next propose a novel shot-specific multi-expert ensemble learning framework to make knowledge transfer more targeted, where we maintain multiple shot-specific soft labels for each instance, with each expert supervised by one of these labels. By integrating these experts, we demonstrate that IBC exhibits more separable representations, improving both overall and partition performance. Extensive experiments show that IBC outperforms existing …
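The label-softening step can be sketched directly from the description above. Assumptions: the class-similarity matrix comes from some embedding of class semantics, the values of k and the retained probability mass are arbitrary, and the reallocation weights use a softmax; none of this is the paper's exact recipe.

```python
# Soften noisy hard labels by moving probability mass to top-k similar classes.
import numpy as np

def soften_labels(hard_labels, class_sim, k=3, keep=0.6):
    """hard_labels: (N,) ints; class_sim: (C, C) class-to-class similarity."""
    n, c = len(hard_labels), class_sim.shape[0]
    soft = np.zeros((n, c))
    for i, y in enumerate(hard_labels):
        sims = class_sim[y].copy()
        sims[y] = -np.inf                         # exclude the labeled class itself
        topk = np.argsort(sims)[-k:]              # k most similar classes
        w = np.exp(class_sim[y, topk])
        soft[i, y] = keep                         # shrink confidence in noisy label
        soft[i, topk] = (1 - keep) * w / w.sum()  # reallocate to similar classes
    return soft

rng = np.random.default_rng(0)
class_emb = rng.standard_normal((10, 32))         # toy class embeddings
sim = class_emb @ class_emb.T
print(soften_labels(rng.integers(0, 10, size=4), sim).round(2))
```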
Poster
Junbo Zhao · Ting Zhang · Jiayu Sun · Mi Tian · Hua Huang

[ Exhibit Hall I ]

Abstract
Geometry problem solving has garnered increasing attention due to its potential applications in the intelligent education field. Inspired by the observation that text often introduces ambiguities that diagrams can clarify, this paper presents Pi-GPS, a novel framework that unleashes the power of diagrammatic information to resolve textual ambiguities, an aspect largely overlooked in prior research. Specifically, we design a micro module comprising a rectifier and verifier: the rectifier employs MLLMs to disambiguate text based on the diagrammatic context, while the verifier ensures that the rectified output adheres to geometric rules, mitigating model hallucinations. Additionally, we explore the impact of LLMs in the theorem predictor based on the disambiguated formal language. Empirical results demonstrate that Pi-GPS surpasses state-of-the-art models, achieving a nearly 10% improvement on Geometry3K over prior neural-symbolic approaches. We hope this work highlights the significance of resolving textual ambiguity in multimodal mathematical reasoning, a crucial factor limiting performance.
Poster
Heejeong Nam · Jinwoo Ahn · Keummin Ka · Jiwan Chung · Youngjae Yu

[ Exhibit Hall I ]

Abstract
Human communication often relies on visual cues to resolve ambiguity. While humans can intuitively integrate these cues, AI systems often find it challenging to engage in sophisticated multimodal reasoning. We introduce VAGUE, a benchmark evaluating multimodal AI systems' ability to integrate visual context for intent disambiguation. VAGUE consists of 1.6K ambiguous textual expressions, each paired with an image and multiple-choice interpretations, where the correct answer is only apparent with visual context. The dataset spans both staged, complex (Visual Commonsense Reasoning) and natural, personal (Ego4D) scenes, ensuring diversity. Our experiments reveal that existing multimodal AI models struggle to infer the speaker's true intent. While performance consistently improves from the introduction of more visual cues, the overall accuracy remains far below human performance, highlighting a critical gap in multimodal reasoning. Analysis of failure cases demonstrates that current models fail to distinguish true intent from superficial correlations in the visual scene, indicating that they perceive images but do not effectively reason with them.
Poster
Han Jiang · Wenfei Yang · Tianzhu Zhang · Yongdong Zhang

[ Exhibit Hall I ]

Abstract
Single domain generalized object detection aims to train an object detector on a single source domain and generalize it to any unseen domain. Although existing approaches based on data augmentation exhibit promising results, they overlook domain discrepancies across multiple augmented domains, which limits the performance of object detectors. To tackle these problems, we propose a novel diffusion-based framework, termed SDG-DiffDet, to mitigate the impact of domain gaps on object detectors. The proposed SDG-DiffDet consists of a memory-guided diffusion module and a source-guided denoising module. Specifically, in the memory-guided diffusion module, we design feature statistics memories that mine diverse style information from local parts to augment source features. The augmented features further serve as noise in the diffusion process, enabling the model to capture distribution differences between practical domain distributions. In the source-guided denoising module, we design a text-guided condition to facilitate distribution transfer from any unseen distribution to the source distribution in the denoising process. By combining these two designs, our proposed SDG-DiffDet effectively models feature augmentation and target-to-source distribution transfer within a unified diffusion framework, thereby enhancing the generalization ability of the object detector. Extensive experiments demonstrate that the proposed SDG-DiffDet achieves state-of-the-art performance across two challenging scenarios.
Poster
Chikai Shang · Mengke Li · Yiqun Zhang · Zhen Chen · Jinlin Wu · Fangqing Gu · Yang Lu · Yiu-ming Cheung

[ Exhibit Hall I ]

Abstract
Visual prompt tuning (VPT) provides an efficient and effective solution for adapting pre-trained models to various downstream tasks by incorporating learnable prompts. However, most prior art indiscriminately applies a fixed prompt distribution across different tasks, neglecting that the importance of each block differs depending on the task. In this paper, we investigate adaptive distribution optimization (ADO) by addressing two key questions: (1) How to appropriately and formally define ADO, and (2) How to design an adaptive distribution strategy guided by this definition? Through in-depth analysis, we provide an affirmative answer that properly adjusting the distribution significantly improves VPT performance, and further uncover a key insight that a nested relationship exists between ADO and VPT. Based on these findings, we propose a new VPT framework, termed PRO-VPT (iterative Prompt RelOcation-based VPT), which adaptively adjusts the distribution building upon a nested optimization formulation. Specifically, we develop a prompt relocation strategy for ADO derived from this formulation, comprising two optimization steps: identifying and pruning idle prompts, followed by determining the optimal blocks for their relocation. By iteratively performing prompt relocation and VPT, our proposal adaptively learns the optimal prompt distribution, thereby unlocking the full potential of VPT. Extensive experiments demonstrate that our proposal significantly …
Poster
Sarthak Kumar Maharana · Baoming Zhang · Leonid Karlinsky · Rogerio Feris · Yunhui Guo

[ Exhibit Hall I ]

Abstract
Although open-vocabulary classification models like Contrastive Language Image Pretraining (CLIP) have demonstrated strong zero-shot learning capabilities, their robustness to common image corruptions remains poorly understood. Through extensive experiments, we show that zero-shot CLIP lacks robustness to common image corruptions during test-time, necessitating the adaptation of CLIP to unlabeled corrupted images using test-time adaptation (TTA). However, we found that existing TTA methods have severe limitations in adapting CLIP due to their unimodal nature. To address these limitations, we propose BATCLIP, a bimodal online TTA method designed to improve CLIP's robustness to common image corruptions. The key insight of our approach is not only to adapt the visual encoders for improving image features but also to strengthen the alignment between image and text features by promoting a stronger association between the image class prototype, computed using pseudo-labels, and the corresponding text feature. We evaluate our approach on benchmark image corruption datasets and achieve state-of-the-art results in online TTA for CLIP. Furthermore, we evaluate our proposed TTA approach on various domain generalization datasets to demonstrate its generalization capabilities.
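A simplified sketch of the prototype-text alignment signal described above, with the pseudo-labeling rule, the cosine penalty, and tensor shapes all assumed for illustration:

```python
# Align pseudo-labeled image prototypes with their class text embeddings.
import torch
import torch.nn.functional as F

def prototype_alignment_loss(img_feats, txt_feats):
    """img_feats: (B, D) image embeddings; txt_feats: (C, D) class text embeddings."""
    img = F.normalize(img_feats, dim=-1)
    txt = F.normalize(txt_feats, dim=-1)
    pseudo = (img @ txt.t()).argmax(dim=1)                # CLIP-style pseudo-labels
    loss, used = 0.0, 0
    for c in pseudo.unique():
        proto = F.normalize(img[pseudo == c].mean(dim=0), dim=-1)
        loss = loss + (1.0 - proto @ txt[c])              # 1 - cosine similarity
        used += 1
    return loss / max(used, 1)

img_feats, txt_feats = torch.randn(16, 512), torch.randn(10, 512)
print(prototype_alignment_loss(img_feats, txt_feats))
```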
Poster
Xingshuo Han · Xuanye Zhang · Xiang Lan · Haozhao Wang · Shengmin Xu · Shen Ren · Jason Zeng · Ming Wu · Michael Heinrich · Tianwei Zhang

[ Exhibit Hall I ]

Abstract
By using a control variate to calibrate the local gradient of each client, Scaffold has been widely known as a powerful solution to mitigate the impact of data heterogeneity in Federated Learning. Although Scaffold achieves significant performance improvements, we show that this superiority is at the cost of increased security vulnerabilities. Specifically, this paper presents BadSFL, the first backdoor attack targeting Scaffold, which turns benign clients into accomplices to amplify the attack effect. The core idea of BadSFL is to uniquely tamper with the control variate to subtly steer benign clients' local gradient updates towards the attacker's poisoned direction, effectively turning them into unwitting accomplices, significantly enhancing the backdoor persistence. Additionally, BadSFL leverages a GAN-enhanced poisoning strategy to enrich the attacker’s dataset, maintaining high accuracy on both benign and backdoored samples while remaining stealthy. Extensive experiments demonstrate that BadSFL achieves superior attack durability, maintaining effectiveness for over 60 global rounds—lasting up to three times longer than existing baselines even after ceasing malicious model injections.
Poster
Sanjoy Chowdhury · Sayan Nag · Subhrajyoti Dasgupta · Yaoting Wang · Mohamed Elhoseiny · Ruohan Gao · Dinesh Manocha

[ Exhibit Hall I ]

Abstract
With the rapid advancement of Multi-modal Large Language Models (MLLMs), several diagnostic benchmarks have recently been developed to assess these models' multi-modal reasoning proficiency. However, these benchmarks are restricted to assessing primarily the visual aspect and do not examine holistic audio-visual (AV) understanding. Moreover, there are currently no benchmarks that investigate the capabilities of audio-visual LLMs (AVLLMs) to calibrate their responses when presented with perturbed inputs. To this end, we introduce the Audio-Visual Trustworthiness assessment Benchmark (AVTrustBench), comprising 600K samples spanning 9 meticulously crafted tasks, evaluating the capabilities of AVLLMs across three distinct dimensions: Adversarial attack, Compositional reasoning, and Modality-specific dependency. Using our benchmark, we extensively evaluate 16 state-of-the-art AVLLMs. The findings reveal that the majority of existing models fall significantly short of achieving human-like comprehension, offering valuable insights for future research directions. To alleviate the limitations in the existing approaches, we further propose a robust, model-agnostic, calibrated audio-visual preference optimization-based training strategy, CAVPref, obtaining a gain of up to 30.19% across all 9 tasks. We will publicly release our code and benchmark to facilitate future research in this direction.
Poster
Cheng-Fu Yang · Da Yin · Wenbo Hu · Heng Ji · Nanyun Peng · Bolei Zhou · Kai-Wei Chang

[ Exhibit Hall I ]

Abstract
Humans recognize objects after observing only a few examples, a remarkable capability enabled by their inherent language understanding of the real-world environment. Developing verbalized and interpretable representation can significantly improve model generalization in low-data settings. In this work, we propose Verbalized Representation Learning (VRL), a novel approach for automatically extracting human-interpretable features for object recognition using few-shot data. Our method uniquely captures inter-class differences and intra-class commonalities in the form of natural language by employing a Vision-Language Model (VLM) to identify key discriminative features between different classes and shared characteristics within the same class. These verbalized features are then mapped to numeric vectors through the VLM. The resulting feature vectors can be further utilized to train and infer with downstream classifiers. Experimental results show that, at the same model scale, VRL achieves a 24% absolute improvement over prior state-of-the-art methods while using 95% less data and a smaller model. Furthermore, compared to human-labeled attributes, the features learned by VRL exhibit a 20% absolute gain when used for downstream classification tasks.
Poster
Hongxin Li · Jingran Su · Jingfan CHEN · Zheng Ju · Yuntao Chen · Li Qing · Zhaoxiang Zhang

[ Exhibit Hall I ]

Abstract
Building autonomous agents that perceive and operate graphical user interfaces (GUIs) like humans has long been a vision in the field of artificial intelligence. Central to these agents is the capability for GUI interaction, which involves GUI understanding and planning capabilities. Existing methods have tried developing GUI agents based on the multi-modal comprehension ability of vision-language models (VLMs). However, the limited scenario, insufficient size, and heterogeneous action spaces hinder the progress of building generalist GUI agents. To resolve these issues, this paper proposes UIPro, a novel generalist GUI agent trained with extensive multi-platform and multi-task GUI interaction data, coupled with a unified action space. We first curate a comprehensive dataset encompassing 20.6 million GUI understanding tasks to pre-train UIPro, granting it a strong GUI grounding capability which is key to downstream GUI agent tasks. Subsequently, we establish a unified action space to harmonize heterogeneous GUI agent task datasets and produce a merged dataset to foster the action prediction ability of UIPro via continued fine-tuning. Experimental results demonstrate UIPro's superior performance across multiple GUI task benchmarks on various platforms, highlighting the effectiveness of our approach. We will release the data curation programs and cleaned dataset.
Poster
Han Ji · Yuqi Feng · Jiahao Fan · Yanan Sun

[ Exhibit Hall I ]

Abstract
Evaluation is a critical but costly procedure in neural architecture search (NAS). Performance predictors have been widely adopted to reduce evaluation costs by directly estimating architecture performance. The effectiveness of predictors is heavily influenced by the choice of loss functions. While traditional predictors employ regression loss functions to evaluate the absolute accuracy of architectures, recent approaches have explored various ranking-based loss functions, such as pairwise and listwise ranking losses, to focus on the ranking of architecture performance. Despite their success in NAS, the effectiveness and characteristics of these loss functions have not been thoroughly investigated. In this paper, we conduct the first comprehensive study on loss functions in performance predictors, categorizing them into three main types: regression, ranking, and weighted loss functions. Specifically, we assess eight loss functions using a range of NAS-relevant metrics on 13 tasks across five search spaces. Our results reveal that specific categories of loss functions can be effectively combined to enhance predictor-based NAS. Furthermore, our findings could provide practical guidance for selecting appropriate loss functions for various tasks. We hope this work provides meaningful insights to guide the development of loss functions for predictor-based methods in the NAS community.
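To make the loss-family distinction concrete, here is a minimal contrast between a regression loss and one common ranking loss for a performance predictor; the margin and hinge form are illustrative choices, not necessarily the ones studied in the paper:

```python
# Regression (MSE) vs. pairwise hinge ranking loss for accuracy prediction.
import torch

def mse_loss(pred, acc):
    return ((pred - acc) ** 2).mean()

def pairwise_ranking_loss(pred, acc, margin=0.05):
    """Penalise pairs (i, j) where architecture i truly outperforms j but the
    predictor does not separate them by at least `margin`."""
    true_gap = acc.unsqueeze(1) - acc.unsqueeze(0)    # (N, N) ground-truth gaps
    pred_gap = pred.unsqueeze(1) - pred.unsqueeze(0)  # (N, N) predicted gaps
    mask = true_gap > 0
    return torch.relu(margin - pred_gap[mask]).mean()

pred = torch.tensor([0.71, 0.69, 0.80, 0.75])
acc = torch.tensor([0.70, 0.72, 0.78, 0.74])
print(mse_loss(pred, acc), pairwise_ranking_loss(pred, acc))
```

The ranking loss is indifferent to absolute accuracy values, which is exactly why the two families can behave differently in predictor-based NAS.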
Poster
Umaima Rahman · Mohammad Yaqub · Dwarikanath Mahapatra

[ Exhibit Hall I ]

Abstract
We introduce DiMPLe (Disentangled Multi-Modal Prompt Learning), a novel approach to disentangle invariant and spurious features across vision and language modalities in multi-modal learning. Spurious correlations in visual data often hinder out-of-distribution (OOD) performance. Unlike prior methods focusing solely on image features, DiMPLe disentangles features within and across modalities while maintaining consistent alignment, enabling better generalization to novel classes and robustness to distribution shifts. Our method combines three key objectives: (1) mutual information minimization between invariant and spurious features, (2) spurious feature regularization, and (3) contrastive learning on invariant features. Extensive experiments demonstrate that DiMPLe achieves superior performance compared to CoOp-OOD when averaged across 11 diverse datasets, with absolute gains of 15.27 in base class accuracy and 44.31 in novel class accuracy. The code will be released publicly upon acceptance.
Poster
Jonathan Roberts · Kai Han · Samuel Albanie

[ Exhibit Hall I ]

Abstract
Large multimodal models (LMMs) have exhibited proficiencies across many visual tasks. Although numerous well-known benchmarks exist to evaluate model performance, they increasingly have insufficient headroom. As such, there is a pressing need for a new generation of benchmarks challenging enough for the next generation of LMMs. One area in which LMMs show potential is graph analysis, specifically, the tasks an analyst might typically perform when interpreting figures, such as estimating the mean, intercepts, or correlations of functions and data series. In this work, we introduce GRAB, a graph analysis benchmark, fit for current and future frontier LMMs. Our benchmark is entirely synthetic, ensuring high-quality, noise-free questions. GRAB comprises 2170 questions, covering four tasks and 23 graph properties. We evaluate 20 LMMs on GRAB, finding it to be a challenging benchmark, with the highest-performing model attaining a score of just 21.7%. Finally, we conduct various ablations to investigate where the models succeed and struggle. We release GRAB to encourage progress in this important, growing domain.
Poster
Bowei Guo · Shengkun Tang · Cong Zeng · Zhiqiang Shen

[ Exhibit Hall I ]

Abstract
Diffusion models are renowned for their generative capabilities, yet their pretraining processes exhibit distinct phases of learning speed that have been entirely overlooked in prior post-training acceleration efforts in the community. In this study, we introduce a novel framework called ***MosaicDiff*** that aligns diffusion pretraining dynamics with post-training sampling acceleration via trajectory-aware structural pruning. Our approach leverages the observation that the middle, fast-learning stage of diffusion pretraining requires more conservative pruning to preserve critical model features, while the early and later, slow-learning stages benefit from a more aggressive pruning strategy. This adaptive pruning mechanism is the first to explicitly mirror the inherent learning speed variations of diffusion pretraining, thereby harmonizing the model's inner training dynamics with its accelerated sampling process. Extensive experiments on DiT and SDXL demonstrate that our method achieves significant speed-ups in sampling without compromising output quality, outperforming previous state-of-the-art methods by large margins, also providing a new viewpoint for more efficient and robust training-free diffusion acceleration.
Poster
Yunqi Liu · Xiaohui Cui · Ouyang Xue

[ Exhibit Hall I ]

Abstract
Vision-language pre-training (VLP) models leverage large-scale cross-modal pre-training to align vision and text modalities, achieving impressive performance on tasks like image-text retrieval and visual grounding. However, these models are highly vulnerable to adversarial attacks, raising critical concerns about their robustness and reliability in safety-critical applications. Existing black-box attack methods are limited by insufficient data augmentation mechanisms or the disruption of global semantic structures, leading to poor adversarial transferability. To address these challenges, we propose the Global-Local Enhanced Adversarial Multimodal attack (GLEAM), a unified framework for generating transferable adversarial examples in vision-language tasks. GLEAM introduces a local feature enhancement module that achieves diverse local deformations while maintaining global semantic and geometric integrity. It also incorporates a global distribution expansion module, which expands feature space coverage through dynamic transformations. Additionally, a cross-modal feature alignment module leverages intermediate adversarial states to guide text perturbations. This enhances cross-modal consistency and adversarial text transferability. Extensive experiments on Flickr30K and MSCOCO datasets show that GLEAM outperforms state-of-the-art methods, with over 10%-30% higher attack success rates in image-text retrieval tasks and over 30% improved transferability on large models like Claude 3.5 Sonnet and GPT-4o. GLEAM provides a robust tool for exposing vulnerabilities in VLP models and offers …
Poster
YI ZHANG · Yuhang Chen · Zhen Chen · Wenjie Ruan · Xiaowei Huang · Siddartha Khastgir · Xingyu Zhao

[ Exhibit Hall I ]

Abstract
Deep learning (DL) has shown transformative potential across industries, yet its sensitivity to adversarial examples (AEs) limits its reliability and broader deployment. Research on DL robustness has developed various techniques, with adversarial training (AT) established as a leading approach to counter AEs. Traditional AT focuses on worst-case robustness (WCR), but recent work has introduced probabilistic robustness (PR), which evaluates the proportion of AEs within a local perturbation range, providing an overall assessment of the model's local robustness and acknowledging residual risks that are more practical to manage. However, existing AT methods are fundamentally designed to improve WCR, and no dedicated methods currently target PR. To bridge this gap, we reformulate a new min-max optimization as the theoretical foundation for AT focused on PR, and introduce an AT-PR training scheme with effective numerical algorithms to solve the new optimization problem. Our experiments, based on 38 DL models trained on common datasets and architectures, demonstrate that AT-PR achieves higher improvements in PR than AT-WCR methods and shows more consistent effectiveness across varying local inputs, with a smaller trade-off in model generalization. Open-source tools and all experiments are publicly accessible.
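The probabilistic-robustness quantity itself is easy to illustrate with a naive Monte-Carlo estimator: sample points in a small L-infinity ball around an input and measure the misclassified proportion. Sample count, epsilon, and the toy model are assumptions, and this shows only the evaluation target, not the AT-PR training scheme.

```python
# Monte-Carlo estimate of the adversarial-example proportion in an L-inf ball.
import torch

@torch.no_grad()
def pr_estimate(model, x, y, eps=8/255, n_samples=200):
    noise = (torch.rand(n_samples, *x.shape[1:]) * 2 - 1) * eps
    perturbed = (x + noise).clamp(0, 1)          # x broadcasts over the samples
    preds = model(perturbed).argmax(dim=1)
    return (preds != y).float().mean().item()    # proportion of misclassifications

model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
x = torch.rand(1, 3, 32, 32)
y = model(x).argmax(dim=1)                       # use the clean prediction as label
print("estimated AE proportion:", pr_estimate(model, x, y))
```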
Poster
Sijia Chen · Bin Song

[ Exhibit Hall I ]

Abstract
Visual Language Models (VLMs) have achieved remarkable success in many domains due to their ability to perform step-by-step reasoning. However, progress in the telecommunication (Telecom) domain remains limited, primarily due to the lack of high-quality datasets and domain-specific insights. In this paper, we introduce RMultiplex200K, a multimodal dataset designed to present step-wise reasoning rationales and correctness scores for real-world Telecom questions. This enables VLMs to engage in step-level reasoning and verification using multimodal information, thereby facilitating reliable problem-solving. RMultiplex200K is highly scalable as it is constructed without human annotations, relying instead on our automatic plan-based annotation (ApPA) method, which automatically synthesizes reasoning steps labeled with reward scores. With this dataset, we introduce TC-NAVIGATOR, a new mechanism for training multimodal process reward models to serve as reliable reasoning verifiers for VLMs. For instance, the Qwen-2-VL-72B and Llama-3.2-90B models, which initially achieve only 21.3\% and 19.8\% accuracy on practice Telecom questions, reach 48.5\% and 46.1\%, respectively, after training with RMultiplex200K and verifying with TC-NAVIGATOR.
Poster
Yijun Liang · Shweta Bhardwaj · Tianyi Zhou

[ Exhibit Hall I ]

Abstract
Low-quality or scarce data has posed significant challenges for training deep neural networks in practice. While classical data augmentation cannot contribute very different new data, diffusion models open up a new door to build self-evolving AI by generating high-quality and diverse synthetic data through text-guided prompts. However, text-only guidance cannot control synthetic images' proximity to the original images, resulting in out-of-distribution data detrimental to model performance. To overcome this limitation, we study image guidance to achieve a spectrum of interpolations between synthetic and real images. With stronger image guidance, the generated images are similar to the training data but hard to learn, while with weaker image guidance, the synthetic images are easier for the model to learn but exhibit a larger distribution gap from the original data. The generated full spectrum of data enables us to build a novel "Diffusion CurricuLum (DisCL)". DisCL adjusts the image guidance level of image synthesis for each training stage: it identifies and focuses on hard samples for the model and assesses the most effective guidance level of synthetic images to improve hard data learning. We apply DisCL to two challenging tasks: long-tail (LT) classification and learning from low-quality data. It focuses on lower-guidance images …
Poster
Hongrui Yu · Lu Qi · Wanyu Lin · Jian Chen · Hailong Sun · chengbin sun

[ Exhibit Hall I ]

Abstract
Backdoor attacks pose a significant threat to deep neural networks (DNNs), as attackers can inject a backdoor by tampering with only a few samples. The variety of backdoor attacks makes comprehensive defense extremely challenging. Previous defenses typically assume that backdoor samples are out-of-distribution (OOD) data of benign samples. However, backdoor samples can also be in-distribution (ID) data of benign samples and hard to identify as outliers, potentially causing defenses to fail. To address this issue, we propose a two-stage backdoor defense based on Enhanced Splitting and Trap Isolation (ESTI), leveraging attackers' tampering to defend against their attacks. In the first stage, we introduce backdoored models in conjunction with a benign model to split the dataset into a reliable clean subset and a poisoned subset. In the second stage, we introduce a trap mechanism to isolate the poisoned subset into a trap class to train a trap-model. The trap-model can flip the predictions of poisoned samples from the attacker's target class to the trap class. Through extensive experiments on three benchmark datasets and five model architectures, we demonstrate that ESTI effectively defends against various backdoor attacks while maintaining model performance on benign data, proving the superiority of our approach. Our code …
Poster
David G. Shatwell · Ishan Rajendrakumar Dave · Swetha Sirnam · Mubarak Shah

[ Exhibit Hall I ]

Abstract
Timestamp prediction aims to determine when an image was captured using only visual information, supporting applications such as metadata correction, retrieval, and digital forensics. In outdoor scenarios, hourly estimates rely on cues like brightness, hue, and shadow positioning, while seasonal changes and weather inform date estimation. However, these visual cues significantly depend on geographic context, closely linking timestamp prediction to geo-localization. To address this interdependence, we introduce GT-Loc, a novel retrieval-based method that jointly predicts the capture time (hour and month) and geo-location (GPS coordinates) of an image. Our approach employs separate encoders for images, time, and location, aligning their embeddings within a shared high-dimensional feature space. Recognizing the cyclical nature of time, we utilize Random Fourier Features for effective temporal representation. Instead of conventional contrastive learning with hard positives and negatives, we propose a metric-learning objective providing soft targets by modeling temporal differences over a cyclical toroidal surface. We present new benchmarks demonstrating that our joint optimization surpasses methods focused solely on time prediction and even those utilizing geo-location during inference. Additionally, our approach achieves competitive results on standard geo-localization tasks, while the unified embedding space facilitates compositional and text-based image retrieval.
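
The cyclical time representation mentioned above can be illustrated in a few lines: map hour and month onto a torus (two circles) and apply Random Fourier Features to that embedding. The dimensions, bandwidth `sigma`, and fixed seed below are illustrative assumptions; GT-Loc's actual time encoder is learned jointly with the image and location encoders.

```python
import numpy as np

def cyclical_rff(hour, month, dim=64, sigma=1.0, seed=0):
    """Encode (hour, month) on a torus, then apply Random Fourier Features.
    Illustrative sketch only, not the trained GT-Loc time encoder."""
    rng = np.random.default_rng(seed)
    # angles on the two circles of the torus
    theta_h = 2 * np.pi * hour / 24.0
    theta_m = 2 * np.pi * month / 12.0
    # (cos, sin) embedding makes the representation periodic in hour and month
    point = np.array([np.cos(theta_h), np.sin(theta_h),
                      np.cos(theta_m), np.sin(theta_m)])
    # standard RFF approximation of an RBF kernel on that 4-D embedding
    W = rng.normal(scale=1.0 / sigma, size=(dim, point.shape[0]))
    b = rng.uniform(0, 2 * np.pi, size=dim)
    return np.sqrt(2.0 / dim) * np.cos(W @ point + b)

print(cyclical_rff(hour=23.5, month=12).shape)  # (64,)
```
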
Poster
Mark YU · Wenbo Hu · Jinbo Xing · Ying Shan

[ Exhibit Hall I ]

Abstract
We present TrajectoryCrafter, a novel approach to redirect camera trajectories for monocular videos. By disentangling deterministic view transformations from stochastic content generation, our method achieves precise control over user-specified camera trajectories. We propose a novel dual-stream conditional video diffusion model that concurrently integrates point cloud renders and source videos as conditions, ensuring accurate view transformations and coherent 4D content generation. Instead of leveraging scarce multi-view videos, we curate a hybrid training dataset that combines web-scale monocular videos with static multi-view datasets through our innovative double-reprojection strategy, significantly fostering robust generalization across diverse scenes. Extensive evaluations on multi-view and large-scale monocular videos demonstrate the superior performance of our method. Code and the pre-trained model will be released.
Poster
Minghe Gao · Xuqi Liu · Zhongqi Yue · Yang Wu · Shuang Chen · Juncheng Li · Siliang Tang · Fei Wu · Tat-Seng Chua · Yueting Zhuang

[ Exhibit Hall I ]

Abstract
Recent advancements in reward signal usage for Large Language Models (LLMs) are remarkable. However, significant challenges exist when transitioning reward signals to the multimodal domain, including labor-intensive annotations, over-reliance on one-step rewards, and inadequate evaluation. To address these issues, we propose SVIP, a novel approach to train a step-level multi-dimensional Chain-of-Thought (CoT) reward model automatically. It generates code for solving visual tasks and transforms the analysis of code blocks into the evaluation of CoT steps as training samples. Then, we train the SVIP-Reward model using a multi-head attention mechanism called TriAtt-CoT. The advantages of SVIP-Reward are evident throughout both MLLM training and inference. We also introduce a benchmark for CoT reward model training and testing. Experimental results demonstrate that SVIP-Reward improves MLLM performance across training and inference-time scaling, yielding better results on benchmarks while reducing hallucinations and enhancing reasoning ability.
Poster
Lujun Li · Dezhi Li · Cheng Lin · Wei Li · Wei Xue · Sirui Han · Yike Guo

[ Exhibit Hall I ]

Abstract
Low-Rank Adaptation (LoRA) is a widely used method for efficiently fine-tuning large models by introducing low-rank matrices into weight updates. However, existing LoRA techniques fail to account for activation information, such as outliers, which significantly impact model performance. This omission leads to suboptimal adaptation and slower convergence. To address this limitation, we present Activation-Informed Low-Rank Adaptation (AIRA), a novel approach that integrates activation information into initialization, training, and rank assignment to enhance model performance. Specifically, AIRA introduces: (1) Outlier-weighted SVD decomposition to reduce approximation errors in low-rank weight initialization, (2) Outlier-driven dynamic rank assignment using offline optimization for better layer-wise adaptation, and (3) Activation-informed training to amplify updates on significant weights. This cascaded activation-informed paradigm enables faster convergence and fewer fine-tuned parameters while maintaining high performance. Extensive experiments on multiple large models demonstrate that AIRA outperforms state-of-the-art LoRA variants, achieving superior performance-efficiency trade-offs in vision-language instruction tuning, few-shot learning, and image generation. Code is available in the Appendix.
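
The "outlier-weighted SVD" initialization can be sketched as a weighted low-rank factorization of a pretrained weight, in which input channels with large activation magnitudes are approximated more faithfully. The weighting scheme, rank, and variable names below are illustrative assumptions, not AIRA's exact procedure.

```python
import torch

def outlier_weighted_lora_init(W, act_scale, rank=16, eps=1e-6):
    """Initialize LoRA factors (B, A) from a weighted SVD of the pretrained
    weight W (d_out, d_in). `act_scale` (d_in,) holds per-channel activation
    magnitudes, so columns feeding large (outlier) activations are fit more
    accurately. Sketch of the idea only."""
    s = act_scale.clamp_min(eps)
    U, S, Vh = torch.linalg.svd(W * s, full_matrices=False)   # weighted SVD
    root = S[:rank].sqrt()
    B = U[:, :rank] * root                                    # (d_out, r)
    A = (root[:, None] * Vh[:rank]) / s                       # (r, d_in), undo weighting
    return B, A  # B @ A is the best rank-r fit to W under the activation weighting

W = torch.randn(512, 256)
scale = torch.rand(256) + 0.5
B, A = outlier_weighted_lora_init(W, scale, rank=64)
print(torch.linalg.norm(W - B @ A) / torch.linalg.norm(W))  # relative fit error
```
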
Poster
Harry Cheng · Yangyang Guo · Qingpei Guo · Ming Yang · Tian Gan · Weili Guan · Liqiang Nie

[ Exhibit Hall I ]

Abstract
Multi-modal Large Language Models (MLLMs) have dramatically advanced the research field recently and delivered powerful vision-language understanding capabilities. However, these models often inherit deep-rooted social biases from their training data, leading to uncomfortable responses with respect to attributes such as race and gender. This paper addresses the issue of social biases in MLLMs by i) introducing a comprehensive Counterfactual dataset with multiple social concepts (CMSC), which complements existing datasets by providing 18 diverse and balanced social concepts; and ii) proposing a Counter-Stereotype Debiasing (CSD) strategy that mitigates social biases in MLLMs by leveraging the opposites of prevalent stereotypes. CSD incorporates both a novel bias-aware data sampling method and a loss rescaling method, thereby enabling the model to more effectively reduce biases. We conduct extensive experiments with four prevalent MLLM architectures. The results demonstrate the advantage of the CMSC dataset and the edge of the CSD strategy in reducing social biases compared to existing competing methods, without compromising the overall performance on general multi-modal reasoning benchmarks.
Poster
Shizhen Zhao · Jiahui Liu · Xin Wen · Haoru Tan · Xiaojuan Qi

[ Exhibit Hall I ]

Abstract
Pre-trained vision foundation models have transformed many computer vision tasks. Despite their strong ability to learn discriminative and generalizable features crucial for out-of-distribution (OOD) detection, their impact on this task remains underexplored. Motivated by this gap, we systematically investigate representative vision foundation models for OOD detection. Our findings reveal that a pre-trained DINOv2 model, even without fine-tuning on in-domain (ID) data, naturally provides a highly discriminative feature space for OOD detection, achieving performance comparable to existing state-of-the-art methods without requiring complex designs. Beyond this, we explore how fine-tuning foundation models on in-domain (ID) data can enhance OOD detection. However, we observe that the performance of vision foundation models remains unsatisfactory in scenarios with a large semantic space. This is due to the increased complexity of decision boundaries as the number of categories grows, which complicates the optimization process. To mitigate this, we propose the Mixture of Feature Experts (MoFE) module, which partitions features into subspaces, effectively capturing complex data distributions and refining decision boundaries. Further, we introduce a Dynamic-$\beta$ Mixup strategy, which samples interpolation weights from a dynamic beta distribution. This adapts to varying levels of learning difficulty across categories, improving feature learning for more challenging categories. Extensive experiments demonstrate …
Poster
Zhen Zhou · Tong Wang · Yunkai Ma · Xiao Tan · Fengshui Jing

[ Exhibit Hall I ]

Abstract
Existing language instruction-guided online 3D reconstruction systems mainly rely on explicit instructions or queryable maps, showing inadequate capability to handle implicit and complex instructions. In this paper, we first introduce a reasoning reconstruction task. This task inputs an implicit instruction involving complex reasoning and an RGB-D sequence, and outputs incremental 3D reconstruction of instances that conform to the instruction. To handle this task, we propose LIRA: Language Instructed Reconstruction Assistant. It leverages a multimodal large language model to actively reason about the implicit instruction and obtain instruction-relevant 2D candidate instances and their attributes. Then, candidate instances are back-projected into the incrementally reconstructed 3D geometric map, followed by instance fusion and target instance inference. In LIRA, to achieve higher instance fusion quality, we propose TIFF, a Text-enhanced Instance Fusion module operating within Fragment bounding volume, which is learning-based and fuses multiple keyframes simultaneously. Since the evaluation system for this task is not well established, we propose a benchmark ReasonRecon comprising the largest collection of scene-instruction data samples involving implicit reasoning. Experiments demonstrate that LIRA outperforms existing methods in the reasoning reconstruction task and is capable of running in real time. Code and benchmark will be publicly available.
Poster
Gyuejeong Lee · Daeyoung Choi

[ Exhibit Hall I ]

Abstract
Federated learning (FL) enables collaborative model training across distributed clients without centralizing data. However, existing approaches like Federated Averaging ($\texttt{FedAvg}$) often perform poorly with heterogeneous data distributions, failing to achieve personalization due to their inability to capture class-specific information effectively. To overcome $\texttt{FedAvg}$'s personalization limitations, we propose Class-wise Federated Averaging ($\texttt{cwFedAvg}$), a novel personalized FL (PFL) framework that performs Federated Averaging for each class. $\texttt{cwFedAvg}$ creates class-specific global models via weighted aggregation of local models using class distributions, then combines them to generate personalized local models. To facilitate effective class-wise aggregation, we further propose Weight Distribution Regularizer ($\texttt{WDR}$), which encourages deep networks to encode class-specific information efficiently by aligning empirical and approximated class distributions derived from output layer weights. Our experiments demonstrate $\texttt{cwFedAvg}$'s superior performance over existing PFL methods through efficient personalization while maintaining $\texttt{FedAvg}$'s communication cost and avoiding additional local training and pairwise computations.
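
The class-wise aggregation step can be sketched directly: for each class, average client models with weights proportional to each client's share of that class. The dictionary-of-arrays model representation below is an illustrative simplification of the idea.

```python
import numpy as np

def classwise_fedavg(local_weights, class_dists):
    """Build one global model per class by averaging client models with
    weights proportional to each client's share of that class.
    `local_weights`: list of dicts {param_name: np.ndarray}, one per client.
    `class_dists`:   (n_clients, n_classes) array of class proportions.
    Illustrative sketch of the class-wise aggregation idea only."""
    class_dists = np.asarray(class_dists, dtype=float)
    n_clients, n_classes = class_dists.shape
    globals_per_class = []
    for c in range(n_classes):
        w = class_dists[:, c]
        w = w / (w.sum() + 1e-12)          # normalize over clients for class c
        agg = {k: sum(w[i] * local_weights[i][k] for i in range(n_clients))
               for k in local_weights[0]}
        globals_per_class.append(agg)
    return globals_per_class

# toy usage: 2 clients, 3 classes, one scalar "layer"
locals_ = [{"fc": np.array([1.0])}, {"fc": np.array([3.0])}]
dists = [[0.9, 0.1, 0.5], [0.1, 0.9, 0.5]]
print([g["fc"] for g in classwise_fedavg(locals_, dists)])
```
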
Poster
Young Kyun Jang · Ser-Nam Lim

[ Exhibit Hall I ]

Abstract
Modern retrieval systems often struggle with upgrading to new and more powerful models due to the incompatibility of embeddings between the old and new models. This necessitates a costly process known as backfilling, which involves re-computing the embeddings for a large number of data samples. In vision, Backward-compatible Training (BT) has been proposed to ensure that the new model aligns with the old model's embeddings. This paper extends the concept of vision-only BT to the field of cross-modal retrieval, marking the first attempt to address Cross-modal BT (XBT). Our goal is to achieve backward-compatibility between Vision-Language Pretraining (VLP) models, such as CLIP, for the cross-modal retrieval task. To address XBT challenges, we propose an efficient solution: a projection module that maps the new model's embeddings to those of the old model. This module, pretrained solely with text data, significantly reduces the number of image-text pairs required for XBT learning, and, once it is pretrained, it avoids using the old model during training. Furthermore, we utilize parameter-efficient training strategies that improve efficiency and preserve the off-the-shelf new model's knowledge by avoiding any modifications. Experimental results on cross-modal retrieval datasets demonstrate the effectiveness of XBT and its potential to enable backfill-free upgrades …
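
The projection module can be sketched as a small MLP trained with a text-only alignment loss that maps new-model embeddings into the old model's space; the dimensions, depth, and cosine loss below are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class XBTProjector(nn.Module):
    """Maps new-model embeddings into the old model's embedding space.
    Sketch only: dimensions, depth, and loss are illustrative choices."""
    def __init__(self, new_dim=768, old_dim=512, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(new_dim, hidden), nn.GELU(), nn.Linear(hidden, old_dim))

    def forward(self, z_new):
        return self.net(z_new)

def alignment_loss(proj, z_new_text, z_old_text):
    """Cosine alignment on text embeddings only (no image-text pairs needed)."""
    z_hat = nn.functional.normalize(proj(z_new_text), dim=-1)
    z_old = nn.functional.normalize(z_old_text, dim=-1)
    return 1.0 - (z_hat * z_old).sum(dim=-1).mean()

proj = XBTProjector()
print(alignment_loss(proj, torch.randn(16, 768), torch.randn(16, 512)))
```
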
Poster
Yin Xie · Kaicheng Yang · Xiang An · Kun Wu · Yongle Zhao · Weimo Deng · Zimin Ran · Yumeng Wang · Ziyong Feng · Roy Miles · Ismail Elezi · Jiankang Deng

[ Exhibit Hall I ]

Abstract
The vision towers of Multimodal Language Models (MLLMs) have significantly enhanced the performance of large multimodal models. This success is primarily attributed to extensive language alignment training, which enhances human-like understanding. However, these models predominantly rely on global category representations, limiting their performance in tasks that require localized representations, such as grounding, OCR, and segmentation. To address this limitation, we propose a novel Locality-Aware Cluster Contrastive Learning strategy. Our approach leverages local feature clustering and contrastive learning to improve the model's ability to understand and represent localized information. Furthermore, our method can be easily scaled to billion-level training, ensuring its applicability to large-scale datasets and models. We demonstrate the effectiveness of our method by achieving state-of-the-art results on the Visual Question Answering (VQA) and RefCOCO benchmarks, showcasing its superior capabilities in handling tasks that require fine-grained visual understanding. Our results indicate a significant improvement in performance, validating the potential of our approach in advancing MLLM tasks. It outperforms the widely used SigLIP.
Poster
Jieyi Tan · Chengwei Zhang · Bo Dang · Yansheng Li

[ Exhibit Hall I ]

Abstract
Traditional Remote Sensing Foundation Models (RSFMs) are pre-trained with a data-centralized paradigm, through self-supervision on large-scale curated remote sensing data. For each institution, however, pre-training RSFMs with limited data in a standalone manner may lead to suboptimal performance, while aggregating remote sensing data from multiple institutions for centralized pre-training raises privacy concerns. Seeking collaboration is a promising way to resolve this dilemma, allowing multiple institutions to collaboratively train RSFMs without sharing private data. In this paper, we propose FedSense, a novel privacy-preserving pre-training framework that enables such collaboration. However, this is a non-trivial task, hindered by a vicious cycle that results from model drift caused by remote sensing data heterogeneity and from high communication overhead. To break this vicious cycle, we introduce Federated Mutual-guidance Learning. Specifically, we propose a Server-to-Clients Guidance (SCG) mechanism to guide client updates towards globally flat optimal solutions. Additionally, we propose a Clients-to-Server Guidance (CSG) mechanism to inject local knowledge into the server via low-bit communication. Extensive experiments on four downstream tasks demonstrate the effectiveness of our FedSense in both full-precision and communication-reduced scenarios, showcasing remarkable communication efficiency and performance gains.
Poster
meihan wu · Tao Chang · Cui Miao · Jie Zhou · Chun Li · Xiangyu Xu · Ming Li · Xiaodong Wang

[ Exhibit Hall I ]

Abstract
Federated learning research has recently shifted from Convolutional Neural Networks (CNNs) to Vision Transformers (ViTs) due to their superior capacity. Training ViTs demands higher computational resources because they lack the 2D inductive biases inherent in CNNs. However, efficient federated training of ViTs on resource-constrained clients remains unexplored in the community. In this paper, we propose EFTViT, a hierarchical federated framework that leverages masked images to enable efficient, full-parameter training on resource-constrained clients, offering substantial benefits for learning on heterogeneous data. In general, we patchify images and randomly mask a portion of the patches, observing that excluding them from training has minimal impact on performance while substantially reducing computation costs and enhancing data content privacy protection. Specifically, EFTViT comprises a series of lightweight local modules and a larger global module, updated independently on clients and the central server, respectively. The local modules are trained on unmasked image patches, while the global module is trained on intermediate patch features uploaded from the local clients, balanced through a proposed median sampling strategy to protect the privacy of client data distributions. We analyze the computational complexity and privacy protection of EFTViT. Extensive experiments on …
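
The patchify-and-mask step can be sketched in a few lines of PyTorch; the patch size and mask ratio below are illustrative values, not the paper's settings.

```python
import torch

def random_mask_patches(images, patch_size=16, mask_ratio=0.75):
    """Patchify a batch of images and keep only a random subset of patches.
    `images`: (B, C, H, W). Returns (B, n_keep, C*patch_size*patch_size) and
    the kept patch indices."""
    B, C, H, W = images.shape
    p = patch_size
    patches = images.unfold(2, p, p).unfold(3, p, p)          # B, C, H/p, W/p, p, p
    patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * p * p)
    n = patches.shape[1]
    n_keep = max(1, int(n * (1 - mask_ratio)))
    idx = torch.rand(B, n).argsort(dim=1)[:, :n_keep]          # random subset per image
    kept = torch.gather(
        patches, 1, idx.unsqueeze(-1).expand(-1, -1, patches.shape[-1]))
    return kept, idx

kept, idx = random_mask_patches(torch.randn(2, 3, 224, 224))
print(kept.shape)  # (2, 49, 768) with the defaults above
```
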
Poster
Qinqian Lei · Bo Wang · Robby Tan

[ Exhibit Hall I ]

Abstract
Zero-shot human-object interaction (HOI) detection remains a challenging task, particularly in generalizing to unseen actions. Existing methods address this challenge by tapping Vision-Language Models (VLMs) to access knowledge beyond the training data. However, they either struggle to distinguish actions involving the same object or demonstrate limited generalization to unseen classes. In this paper, we introduce LoRD-HOI (Low-Rank Decomposed VLM Feature Adaptation for Zero-Shot HOI Detection), a novel approach that both enhances generalization to unseen classes and improves action distinction. In training, LoRD-HOI decomposes VLM text features for given HOI classes via low-rank factorization, producing class-shared basis features and adaptable weights. These features and weights form a compact HOI representation that preserves shared information across classes, enhancing generalization to unseen classes. Subsequently, we refine action distinction by adapting weights for each HOI class and introducing human-object tokens to enrich visual interaction representations. To further distinguish unseen actions, we guide the weight adaptation with LLM-derived action regularization. Experimental results show that our method sets a new state-of-the-art across zero-shot HOI settings on HICO-DET, achieving an unseen-class mAP of 27.91 in the unseen-verb setting.
Poster
Haipeng Xiong · Kai Xu · Angela Yao

[ Exhibit Hall I ]

Abstract
This work questions a common assumption in OOD detection: that models with higher in-distribution (ID) accuracy tend to have better OOD performance. Recent findings show this assumption does not always hold. A direct observation is that later versions of torchvision models improve ID accuracy but suffer from a significant drop in OOD performance. We systematically diagnose torchvision training recipes and explain this effect by analyzing the maximal logits of ID and OOD samples. We then propose post-hoc and training-time solutions to mitigate the OOD decrease by fixing problematic augmentations in torchvision recipes. Both solutions enhance OOD detection and maintain strong ID performance. Code will be released upon acceptance.
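
A maximal-logit diagnostic of the kind described above can be sketched as follows, assuming a PyTorch classifier `model` and a data loader; higher scores are treated as more in-distribution.

```python
import torch

@torch.no_grad()
def max_logit_score(model, loader, device="cpu"):
    """Collect the maximum logit per sample, a standard post-hoc OOD score
    used here only to illustrate the kind of diagnostic described above."""
    model.eval().to(device)
    scores = []
    for x, _ in loader:
        logits = model(x.to(device))
        scores.append(logits.max(dim=1).values.cpu())
    return torch.cat(scores)

# Compare score distributions of ID vs. OOD loaders, e.g. their means,
# or feed both sets of scores to an AUROC routine to quantify separability.
```
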
Poster
Han Yu · Kehan Li · Dongbai Li · Yue He · Xingxuan Zhang · Peng Cui

[ Exhibit Hall I ]

Abstract
Recently, there has been gradually more attention paid to Out-of-Distribution (OOD) performance prediction, whose goal is to predict the performance of trained models on unlabeled OOD test datasets, so that we could better leverage and deploy off-the-shelf trained models in risk-sensitive scenarios. Although progress has been made in this area, evaluation protocols of previous literature are not consistent, and most works cover only a limited number of real-world OOD datasets and types of distribution shifts. To provide convenient and fair comparisons for various algorithms, we propose Out-of-Distribution Performance Prediction Benchmark (ODP-Bench), a comprehensive benchmark that includes most commonly used OOD datasets and existing practical performance prediction algorithms. We will provide our trained models as a testbench for future researchers, thus guaranteeing the consistency of comparison and avoiding the burden of repeating the model training process. Furthermore, we also conduct in-depth experimental analyses to better understand their capability boundary.
Poster
Jingyi Zhang · Jiaxing Huang · Huanjin Yao · Shunyu Liu · Xikun ZHANG · Shijian Lu · Dacheng Tao

[ Exhibit Hall I ]

Abstract
Recent studies generally enhance MLLMs' reasoning capabilities via supervised fine-tuning on high-quality chain-of-thought reasoning data, which often leads models to merely imitate successful reasoning paths without understanding what the wrong reasoning paths are. In this work, we aim to enhance the MLLMs’ reasoning ability beyond passively imitating positive reasoning paths. To this end, we design Step-wise Group Relative Policy Optimization (StepGRPO), a new online reinforcement learning framework that enables MLLMs to self-improve reasoning ability via simple, effective and dense step-wise rewarding. Specifically, StepGRPO introduces two novel rule-based reasoning rewards: Step-wise Reasoning Accuracy Reward (StepRAR) and Step-wise Reasoning Validity Reward (StepRVR). StepRAR rewards the reasoning paths that contain necessary intermediate reasoning steps via a soft key-step matching technique, while StepRVR rewards reasoning paths that follow a well-structured and logically consistent reasoning process through a reasoning completeness and logic evaluation strategy. With the proposed step-wise reward mechanisms, StepGRPO effectively mitigates the sparse reward issue for MLLMs and encourages a more structured and logically consistent reasoning process. Extensive experiments over 8 benchmarks demonstrate the superiority of the proposed StepGRPO.
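
A soft key-step matching reward in the spirit of StepRAR can be sketched with plain string similarity; the matcher and threshold below are illustrative stand-ins, since the paper's matching rule is not reproduced here.

```python
from difflib import SequenceMatcher

def step_matching_reward(reasoning_steps, key_steps, threshold=0.6):
    """Soft reward: how many required key steps are (approximately) present
    in the generated reasoning path. The string similarity measure and the
    threshold are illustrative stand-ins for StepRAR's matcher."""
    def best_match(key):
        return max(SequenceMatcher(None, key.lower(), s.lower()).ratio()
                   for s in reasoning_steps) if reasoning_steps else 0.0

    hits = []
    for k in key_steps:
        m = best_match(k)
        hits.append(1.0 if m >= threshold else m)   # full credit above threshold
    return sum(hits) / max(len(key_steps), 1)

print(step_matching_reward(
    ["identify the two triangles", "apply the Pythagorean theorem", "so x = 5"],
    ["apply the Pythagorean theorem", "x = 5"]))    # 1.0
```
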
Poster
Zhongquan Jian · Yanhao Chen · Wangyancheng Wangyancheng · Junfeng Yao · Meihong Wang · Qingqiang Wu

[ Exhibit Hall I ]

Abstract
Long-tailed data poses a significant challenge for deep learning models, which tend to prioritize accurate classification of head classes while largely neglecting tail classes. Existing techniques, such as class re-balancing, logit adjustment, and data augmentation, aim to enlarge decision regions of tail classes or achieve clear decision boundaries, leaving the robustness of decision regions under-considered. This paper proposes a simple yet effective Supervised Exploratory Learning (SEL) framework to achieve these goals simultaneously from space exploration perspectives. SEL employs the adaptive Optimal Foraging Algorithm (OFA) to generate diverse exploratory examples, integrating Class-biased Complement (CbC) for balanced class distribution and Fitness-weighted Sampling (FwS) for space exploration. Both theoretical analysis and empirical results demonstrate that SEL enhances class balance, sharpens decision boundaries, and strengthens decision regions. SEL is a plug-and-play training framework that can be seamlessly integrated into model training or classifier adjustment stages, making it highly adaptable and compatible with existing methods and facilitating further performance improvements. Extensive experiments on various long-tailed benchmarks demonstrate SEL's superiority.
Poster
Zhihui Zhang · Luanyuan Dai · Qika Lin · Yunfeng Diao · Guangyin Jin · Yufei Guo · Jing Zhang · Xiaoshuai Hao

[ Exhibit Hall I ]

Abstract
Large-scale multi-modal models have demonstrated remarkable performance across various visual recognition tasks by leveraging extensive paired multi-modal training data. However, in real-world applications, the presence of missing or incomplete modality inputs often leads to significant performance degradation. Recent research has focused on prompt-based strategies to tackle this issue; however, existing methods are hindered by two major limitations: (1) static prompts lack the flexibility to adapt to varying missing-data conditions, and (2) basic prompt-tuning methods struggle to ensure reliable performance when critical modalities are missing. To address these challenges, we propose a novel Synergistic Prompting (SyP) framework for robust visual recognition with missing modalities. The proposed SyP introduces two key innovations: (I) a Dynamic Adapter, which computes adaptive scaling factors to dynamically generate prompts, replacing static parameters for flexible multi-modal adaptation, and (II) a Synergistic Prompting Strategy, which combines static and dynamic prompts to balance information across modalities, ensuring robust reasoning even when key modalities are missing. The proposed SyP achieves significant performance improvements over existing approaches across three widely-used visual recognition datasets, demonstrating robustness under diverse missing rates and conditions. Extensive experiments and ablation studies validate its effectiveness in handling missing modalities, highlighting its superior adaptability and reliability. The source …
Poster
Hao Zheng · Shunzhi Yang · Zhuoxin He · Jinfeng Yang · Zhenhua Huang

[ Exhibit Hall I ]

Abstract
Pre-trained Vision-Language Models (VLMs) such as CLIP have shown excellent generalization abilities. However, adapting these large-scale models to downstream tasks while preserving their generalization capabilities remains challenging. Although prompt learning methods have shown promise, they suffer from two fundamental bottlenecks that limit generalization: (a) modality isolation, and (b) hierarchical semantic decay. To address these limitations, we propose HiCroPL, a Hierarchical Cross-modal Prompt Learning framework that establishes bidirectional knowledge flow between text and vision modalities, enabling them to refine their semantics mutually. HiCroPL routes knowledge flows by leveraging the complementary strengths of text and vision. In early layers, text prompts inject relatively clear semantics into visual prompts through a hierarchical knowledge mapper, enhancing the representation of low-level visual semantics. In later layers, visual prompts encoding specific task-relevant objects flow back to refine text prompts, enabling deeper alignment. Crucially, our hierarchical knowledge mapper allows representations at multi-scales to be fused, ensuring that deeper representations retain transferable shallow semantics thereby enhancing generalization. We further introduce a lightweight layer-specific knowledge proxy to enable efficient cross-modal interactions. Extensive evaluations across four tasks demonstrate HiCroPL's superior performance, achieving state-of-the-art results on 11 benchmarks with significant improvements. Our code will be made publicly available.
Poster
Alexey Kravets · Da Chen · Vinay Namboodiri

[ Exhibit Hall I ]

Abstract
CLIP is a foundational model with transferable classification performance in the few-shot setting. Several methods have shown improved performance of CLIP using few-shot examples. However, so far all these techniques have been benchmarked using standard few-shot datasets. We argue that this mode of evaluation does not provide a true indication of the inductive generalization ability using few-shot examples. As most datasets have been seen by the CLIP model, the resultant setting can be termed as partially transductive. To solve this, we propose a pipeline that uses an unlearning technique to obtain true inductive baselines. In this new inductive setting, methods show a significant drop in performance (-55% on average among 13 baselines with multiple datasets). We validate the unlearning technique using oracle baselines. An improved few-shot classification technique is proposed that consistently obtains state-of-the-art performance over 13 other recent baseline methods on a comprehensive analysis with 5880 experiments - varying the datasets, differing number of few-shot examples, unlearning setting, and with different seeds. Thus, we identify the issue with the evaluation of CLIP-based few-shot classification, provide a solution using unlearning, propose new benchmarks, and provide an improved method. All the models, code and baselines will be released on acceptance of …
Poster
Boyong He · Yuxiang Ji · Zhuoyue Tan · Liaoni Wu

[ Exhibit Hall I ]

Abstract
Detectors often suffer from performance drops due to the domain gap between training and testing data. Recent methods explore diffusion models applied to domain generalization (DG) and adaptation (DA) tasks, but still struggle with large inference costs and have not yet fully leveraged the capabilities of diffusion models. We propose to tackle these problems by extracting intermediate features from a single-step diffusion process, improving feature collection and fusion to reduce inference time by 75% while enhancing performance on source domains (i.e., Fitness). Then, we construct an object-centered auxiliary branch by applying box-masked images with class prompts to extract robust and domain-invariant features that focus on objects. We also apply a consistency loss to align the auxiliary and ordinary branches, balancing fitness and generalization while preventing overfitting and improving performance on target domains (i.e., Generalization). Furthermore, within a unified framework, standard detectors are guided by diffusion detectors through feature-level and object-level alignment on source domains (for DG) and unlabeled target domains (for DA), thereby improving cross-domain detection performance (i.e., Transferability). Our method achieves competitive results on 3 DA benchmarks and 5 DG benchmarks. Additionally, experiments on the COCO generalization benchmark demonstrate that our method maintains significant advantages and shows remarkable efficiency in large domain …
Poster
Renye Yan · Jikang Cheng · Yaozhong Gan · Shikun Sun · You Wu · Yunfan Yang · Ling Liang · JinLong Lin · Yeshuang Zhu · Jie Zhou · Jinchao Zhang · Junliang Xing · Yimao Cai · Ru Huang

[ Exhibit Hall I ]

Abstract
While fine-tuning diffusion models with reinforcement learning (RL) has demonstrated effectiveness in directly optimizing downstream objectives, existing RL frameworks are prone to overfitting the rewards, leading to outputs that deviate from the true data distribution and exhibit reduced diversity. To address this issue, we introduce entropy as a quantitative measure to enhance the exploratory capacity of diffusion models' denoising policies. We propose an adaptive mechanism that dynamically adjusts the application and magnitude of entropy and regularization, guided by real-time quality estimation of intermediate noised states. Theoretically, we prove the convergence of our entropy-enhanced policy optimization and establish two critical properties: 1) global entropy increases through training, ensuring robust exploration capabilities, and 2) entropy systematically decreases during the denoising process, enabling a phase transition from early-stage diversity promotion to late-stage distributional fidelity. Building on this foundation, we propose a plug-and-play RL module that adaptively controls entropy and optimizes denoising steps. Extensive evaluations demonstrate our method's theoretical soundness and empirical robustness, achieving state-of-the-art quality-diversity trade-offs across benchmarks. Notably, our framework significantly improves the rewards and reduces denoising steps in training by up to 40\%. The code is available in the supplementary.
Poster
Taeuk Jang · Hoin Jung · Xiaoqian Wang

[ Exhibit Hall I ]

Abstract
Vision-Language Models (VLMs) like CLIP have shown remarkable zero-shot performance by aligning different modalities in the embedding space, enabling diverse applications from image editing to visual question answering (VQA). However, these models often inherit biases from their training data, resulting in performance disparities across specific subpopulations. Traditional debiasing methods for VLMs primarily focus on specific downstream tasks using labeled datasets, which we argue is insufficient given the broad applicability of VLMs. Specifically, these methods struggle with generalizability, transferability, and feasibility due to overfitting, limited task applicability, and regulatory constraints on the use of sensitive data, making them less practical in real-world scenarios. To address these challenges, we propose a novel task-agnostic method for learning debiased image embeddings in VLMs. Our approach does not require expensive annotated datasets or curated prompts for downstream tasks, while still preserving the inherent zero-shot capabilities of these models. Instead, we leverage easily accessible information: 1) a bias text corpus generated by a large language model, and 2) a generic unsupervised vision dataset. Our method disentangles the image embedding into bias and neutral components by applying centered kernel alignment (CKA) regularization to the text-vision representational similarity, using the bias text corpus over the generic vision dataset. …
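
The similarity measure underlying the CKA regularizer is standard and easy to state in code; the sketch below computes linear CKA between two representation matrices over the same batch of samples.

```python
import torch

def linear_cka(X, Y):
    """Linear Centered Kernel Alignment between representations X (n, d1) and
    Y (n, d2) computed over the same n samples."""
    X = X - X.mean(dim=0, keepdim=True)
    Y = Y - Y.mean(dim=0, keepdim=True)
    cross = torch.linalg.norm(X.T @ Y) ** 2        # ||X^T Y||_F^2
    self_x = torch.linalg.norm(X.T @ X)            # ||X^T X||_F
    self_y = torch.linalg.norm(Y.T @ Y)            # ||Y^T Y||_F
    return cross / (self_x * self_y + 1e-12)

print(linear_cka(torch.randn(128, 64), torch.randn(128, 32)))  # low for random features
```
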
Poster
Jialiang Wang · Xianming Liu · Xiong Zhou · Gangfeng Hu · Deming Zhai · Junjun Jiang · Xiangyang Ji

[ Exhibit Hall I ]

Abstract
Learning with noisy labels is an important and challenging task for training accurate deep neural networks. To mitigate label noise, prior studies have proposed various robust loss functions, particularly symmetric losses. Nevertheless, symmetric losses usually suffer from the underfitting issue due to the overly strict symmetric condition. To address this problem, the Active Passive Loss (APL) jointly optimizes an active and a passive loss to mutually enhance the overall fitting ability. Within APL, symmetric losses have been successfully extended, yielding advanced robust loss functions. Despite these advancements, emerging theoretical analyses indicate that asymmetric loss functions, a new class of robust loss functions, possess superior properties compared to symmetric losses. However, existing asymmetric losses are not compatible with advanced optimization frameworks such as APL, limiting their practical potential and applicability. Motivated by this theoretical gap and the promising properties of asymmetric losses, we extend the asymmetric loss function to the more complex passive loss scenario and propose the Asymmetric Mean Square Error (AMSE), a novel asymmetric loss function. We rigorously establish the necessary and sufficient condition under which AMSE satisfies the asymmetric condition. By substituting the traditional symmetric passive loss in APL with our proposed AMSE, we introduce a novel robust loss framework termed Joint …
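
The APL scaffolding that AMSE plugs into can be sketched as a weighted sum of an active and a passive term, following the generic form $\alpha \cdot \ell_{\text{active}} + \beta \cdot \ell_{\text{passive}}$. The normalized cross-entropy and MAE placeholders below are illustrative only; the paper's AMSE, which is not reproduced here, would take the passive slot.

```python
import torch
import torch.nn.functional as F

def active_passive_loss(logits, targets, alpha=1.0, beta=1.0):
    """Generic APL objective: alpha * active + beta * passive. Normalized
    cross-entropy (active) and MAE (passive) are placeholders; the paper's
    AMSE would replace the passive term."""
    log_p = F.log_softmax(logits, dim=1)
    n_cls = logits.size(1)
    one_hot = F.one_hot(targets, n_cls).float()
    per_class_ce = -log_p                                     # (B, C)
    # Normalized CE: CE on the true class divided by the sum over all classes
    nce = (per_class_ce * one_hot).sum(dim=1) / per_class_ce.sum(dim=1)
    # MAE between the one-hot target and the predicted distribution
    mae = (one_hot - log_p.exp()).abs().sum(dim=1)
    return (alpha * nce + beta * mae).mean()

print(active_passive_loss(torch.randn(8, 10), torch.randint(0, 10, (8,))))
```
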
Poster
qian feng · Jiahang Tu · Mintong Kang · Hanbin Zhao · Chao Zhang · Hui Qian

[ Exhibit Hall I ]

Abstract
Incremental unlearning (IU) is critical for pre-trained models to comply with sequential data deletion requests, yet existing methods primarily suppress parameters or confuse knowledge without explicit constraints at both the feature and gradient levels, resulting in \textit{superficial forgetting} where residual information remains recoverable. This incomplete forgetting risks security breaches and disrupts retention balance, especially in IU scenarios. We propose FG-OrIU (\textbf{F}eature-\textbf{G}radient \textbf{Or}thogonality for \textbf{I}ncremental \textbf{U}nlearning), the first framework unifying orthogonal constraints at both the feature and gradient levels to achieve deep forgetting, where the forgetting effect is irreversible. FG-OrIU decomposes feature spaces via Singular Value Decomposition (SVD), separating forgetting and remaining class features into distinct subspaces. It then enforces dual constraints: feature orthogonal projection on both the forgetting and remaining classes, and gradient orthogonal projection that prevents the reintroduction of forgotten knowledge and disruption to the remaining classes during updates. Additionally, dynamic subspace adaptation merges newly forgetting subspaces and contracts remaining subspaces, ensuring a stable balance between removal and retention across sequential unlearning tasks. Extensive experiments demonstrate the effectiveness of our method.
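
The gradient-orthogonality constraint can be illustrated with a small projection step: estimate a subspace basis from features via SVD and remove the component of a gradient lying in that subspace. The rank and toy shapes below are illustrative; FG-OrIU's full procedure (dual feature/gradient constraints and dynamic subspace adaptation) is more involved.

```python
import torch

def subspace_basis(features, rank):
    """Top-`rank` left-singular vectors of a (d, n) feature matrix: a basis
    for the subspace spanned by those features."""
    U, _, _ = torch.linalg.svd(features, full_matrices=False)
    return U[:, :rank]                           # (d, rank)

def project_out(grad, basis):
    """Remove the component of `grad` lying in span(basis): g <- g - U U^T g."""
    return grad - basis @ (basis.T @ grad)

# toy check: the projected gradient is orthogonal to the protected subspace
feats = torch.randn(256, 100)                    # d=256 features from n=100 samples
U = subspace_basis(feats, rank=20)
g = torch.randn(256)
print((U.T @ project_out(g, U)).abs().max())     # ~0
```
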
Poster
Shouwen Wang · Qian Wan · Junbin Gao · Zhigang Zeng

[ Exhibit Hall I ]

Abstract
Recent methods learn class-unified prompt contexts by image data to adapt CLIP to zero-shot multi-label image classification, which achieves impressive performance. However, simply tuning prompts is insufficient to deal with novel classes across different semantic granularity levels. This limitation arises due to the sparse semantic detail in prompt class names and the hierarchical granularity competition among class names caused by CLIP’s contrastive loss. We propose a language-driven zero-shot multi-label learning framework to bridge associations among classes across multiple granularity levels through class name reconstruction. To achieve this, we first leverage a language model to generate structured text descriptions for each class, which explicitly capture (1) visual attributes, (2) hierarchical relationships, and (3) co-occurrence scenes. With the enriched descriptions, we then learn class names by extracting and aligning semantic relationships and features from them in the CLIP’s shared image-text embedding space. Furthermore, we consider that similar text descriptions among different classes may introduce ambiguities. We mitigate these ambiguities by imposing a pair-based loss on learnable class names to enhance their distinctiveness. During inference, we aggregate semantic predictions from multiple image snippets to reinforce the identification of classes across different granularity levels. Comprehensive experiments demonstrate that our method surpasses state-of-the-art methods in …
Poster
Yifan Li · Xin Li · Tianqin Li · Wenbin He · Yu Kong · Liu Ren

[ Exhibit Hall I ]

Abstract
Vision foundation models (VFMs) have demonstrated remarkable performance across a wide range of downstream tasks. While several VFM adapters have shown promising results by leveraging the prior knowledge of VFMs, we identify two inefficiencies in these approaches. First, the interaction between convolutional neural network (CNN) and VFM backbone triggers early layer gradient backpropagation. Second, existing methods require tuning all components, adding complexity. Besides, these adapters alter VFM features, underutilizing the prior knowledge. To tackle these challenges, we propose a new approach called ViT-Split, based on a key observation: \textbf{the layers of several VFMs, like DINOv2, can be divided into two distinct components: an extractor for learning low-level features and an adapter for learning task-specific features}. Leveraging this insight, we eliminate the CNN branch and introduce two heads, task head and prior head, to the frozen VFM. The task head is designed to learn task-specific features, mitigating the early gradient propagation issue. The prior head is used to leverage the multi-scale prior features from the frozen VFM, reducing tuning parameters and overfitting. Extensive experiments on various tasks (e.g., segmentation, detection, depth estimation, and visual question answering) validate the effectiveness and efficiency of ViT-Split. Specifically, ViT-Split reduces training time up to $4\times$ …
Poster
Xihong Yang · Siwei Wang · Jiaqi Jin · Fangdi Wang · Tianrui Liu · Yueming Jin · Xinwang Liu · En Zhu · Kunlun He

[ Exhibit Hall I ]

Abstract
Multi-view clustering (MVC) aims to explore the common clustering structure across multiple views. Many existing MVC methods heavily rely on the assumption of view consistency, where alignments for corresponding samples across different views are ordered in advance. However, real-world scenarios often present a challenge as only partial data is consistently aligned across different views, restricting the overall clustering performance. In this work, we consider the model performance degradation caused by data order shift (i.e., from fully to partially aligned) as a generalized multi-view clustering problem. To tackle this problem, we design a causal multi-view clustering network, termed CauMVC. We adopt a causal modeling approach to understand the multi-view clustering procedure. To be specific, we formulate the partially aligned data as an intervention and multi-view clustering with partially aligned data as a post-intervention inference. However, obtaining invariant features directly can be challenging. Thus, we design a Variational Auto-Encoder for causal learning by incorporating an encoder from existing information to estimate the invariant features. Moreover, a decoder is designed to perform the post-intervention inference. Lastly, we design a contrastive regularizer to capture sample correlations. To the best of our knowledge, this paper is the first work to address generalized multi-view clustering via …
Poster
Hanwen Cao · Haobo Lu · Xiaosen Wang · Kun He

[ Exhibit Hall I ]

Abstract
Ensemble-based attacks have been proven to be effective in enhancing adversarial transferability by aggregating the output of models with various architectures. However, existing research primarily focuses on refining ensemble weights or optimizing the ensemble path, overlooking the exploration of ensemble models to enhance the transferability of adversarial attacks. In this study, we attempt to adversarially augment ensemble models by modifying inner modules to mitigate this gap. Moreover, observing that ensembles of Vision Transformers (ViTs) have received less attention, we propose ViT-EnsembleAttack, the first ensemble-based attack method tailored for ViTs to the best of our knowledge. Our approach generates augmented models for each surrogate ViT using three strategies: Multi-head dropping, Attention score scaling and MLP feature mixing, with the associated parameters optimized by Bayesian optimization. These adversarially augmented models are ensembled to generate adversarial examples. Furthermore, we introduce an automatic reweighting module that dynamically adjusts the influence of each surrogate model in the ensemble, while also enlarging the step size in each iteration to enhance convergence. Extensive experiments demonstrate that ViT-EnsembleAttack significantly enhances the adversarial transferability of ensemble-based attacks on ViTs, outperforming existing methods by a substantial margin.
Poster
Zitian Wang · Yue Liao · RONG KANG · Fengyun Rao · Yibo Yang · Si Liu

[ Exhibit Hall I ]

Abstract
Preference alignment has emerged as an effective strategy to enhance the performance of Multimodal Large Language Models (MLLMs) following supervised fine-tuning. While existing preference alignment methods predominantly target hallucination factors, they overlook the factors essential for multi-modal comprehension capabilities, often narrowing their improvements on hallucination mitigation. To bridge this gap, we propose Instruction-oriented Preference Alignment (IPA), a scalable framework designed to automatically construct alignment preferences grounded in instruction fulfillment efficacy. Our method involves an automated preference construction coupled with a dedicated verification process that identifies instruction-oriented factors, avoiding significant variability in response representations. Additionally, IPA incorporates a progressive preference collection pipeline, further recalling challenging samples through model self-evolution and reference-guided refinement. Experiments conducted on Qwen2VL-7B demonstrate IPA's effectiveness across multiple benchmarks, including hallucination evaluation, visual question answering, and text understanding tasks, highlighting its capability to enhance general comprehension.
Poster
Jiong Yin · Liang Li · Jiehua Zhang · Yuhan Gao · Chenggang Yan · Xichun Sheng

[ Exhibit Hall I ]

Abstract
Audio-visual multi-task incremental learning aims to continuously learn from multiple audio-visual tasks without the need for joint training on all tasks. The challenge of the problem is how to preserve the old task knowledge while facilitating the learning of new tasks with previous experience. To address these challenges, we introduce a three-stage Progressive Homeostatic and Plastic audio-visual prompt (PHP) method. In the shallow phase, we design the task-shared modality aggregating adapter to foster cross-task and cross-modal audio-visual representation learning to enhance shared understanding between tasks. In the middle phase, we propose the task-specific modality-shared dynamic generating adapter, which constructs prompts that are tailored to individual tasks while remaining general across modalities, balancing the model’s ability to retain knowledge against forgetting with its potential for versatile multi-task transferability. In the deep phase, we introduce the task-specific modality-independent prompts to further refine the understanding ability by targeting individual information for each task and modality. By incorporating these three phases, PHP retains task-specific prompts while adapting shared parameters for new tasks to effectively balance knowledge sharing and specificity. Our method achieves SOTA performance in different orders of three tasks (AVE, AVVP, and AVQA). We will release the source code on GitHub.
Poster
Ziyu Liu · Zeyi Sun · Yuhang Zang · Xiaoyi Dong · Yuhang Cao · Haodong Duan · Dahua Lin · Jiaqi Wang

[ Exhibit Hall I ]

Abstract
Reinforcement Fine-Tuning (RFT) in Large Reasoning Models like OpenAI o1 learns from feedback on its answers, which is especially useful in applications when fine-tuning data is scarce. Recent open-source work like DeepSeek-R1 demonstrates that reinforcement learning with verifiable reward is possibly one key direction in reproducing o1. While the R1-style model has demonstrated success in language models, its application in multi-modal domains remains under-explored. This work introduces Visual Reinforcement Fine-Tuning (Visual-RFT), which further extends the application areas of RFT on visual tasks. Specifically, Visual-RFT first uses Large Vision-Language Models (LVLMs) to generate multiple responses containing reasoning tokens and final answers for each input, and then uses our proposed visual perception verifiable reward functions to update the model via a policy optimization algorithm such as Group Relative Policy Optimization (GRPO). We design different verifiable reward functions for different perception tasks, such as the Intersection over Union (IoU) reward for object detection. Experimental results on fine-grained image classification, few-shot object detection, reasoning grounding, as well as open-vocabulary object detection benchmarks show the competitive performance and advanced generalization ability of Visual-RFT compared with Supervised Fine-tuning (SFT). For example, Visual-RFT improves accuracy by 24.3% over the baseline in one-shot fine-grained image classification with around 100 samples. In few-shot object detection, Visual-RFT also …
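
A verifiable IoU reward of the kind described above is straightforward to sketch; parsing boxes out of the model's text response and the GRPO update itself are omitted here.

```python
def iou_reward(pred_box, gt_box):
    """IoU between two axis-aligned boxes in (x1, y1, x2, y2) format, usable as
    a verifiable reward for detection-style responses. Sketch of the idea only."""
    x1 = max(pred_box[0], gt_box[0]); y1 = max(pred_box[1], gt_box[1])
    x2 = min(pred_box[2], gt_box[2]); y2 = min(pred_box[3], gt_box[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_p = (pred_box[2] - pred_box[0]) * (pred_box[3] - pred_box[1])
    area_g = (gt_box[2] - gt_box[0]) * (gt_box[3] - gt_box[1])
    union = area_p + area_g - inter
    return inter / union if union > 0 else 0.0

print(iou_reward((10, 10, 50, 50), (30, 30, 70, 70)))  # ~0.143
```
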
Poster
Shiji Zhao · Ranjie Duan · Fengxiang Wang · Chi Chen · Caixin KANG · Shouwei Ruan · Jialing Tao · YueFeng Chen · Hui Xue · Xingxing Wei

[ Exhibit Hall I ]

Abstract
Multimodal Large Language Models (MLLMs) have achieved impressive performance and have been put into practical use in commercial applications, but they still have potential safety mechanism vulnerabilities. Jailbreak attacks are red teaming methods that aim to bypass safety mechanisms and discover MLLMs' potential risks. Existing MLLMs' jailbreak methods often bypass the model's safety mechanism through complex optimization methods or carefully designed image and text prompts. Despite achieving some progress, they have a low attack success rate on commercial closed-source MLLMs. Unlike previous research, we empirically find that there exists a Shuffle Inconsistency between MLLMs' comprehension ability and safety ability for the shuffled harmful instruction. That is, from the perspective of comprehension ability, MLLMs can understand the shuffled harmful text-image instructions well. However, they can be easily bypassed by the shuffled harmful instructions from the perspective of safety ability, leading to harmful responses. Based on the exploration, we innovatively propose a text-image jailbreak attack named SI-Attack. Specifically, to fully utilize the Shuffle Inconsistency and overcome the shuffle randomness, we apply a query-based black-box optimization method to select the most harmful shuffled inputs based on the feedback of the toxic judge model. A series of experiments show that SI-Attack can improve the …
Poster
Melih Barsbey · Lucas Prieto · Stefanos Zafeiriou · Tolga Birdal

[ Exhibit Hall I ]

Abstract
Robustness and resource-efficiency are two highly desirable properties for modern machine learning models. However, achieving them jointly remains a challenge. In this paper, we position high learning rates as a facilitator for simultaneously achieving robustness to spurious correlations and network compressibility. We demonstrate that large learning rates also produce desirable representation properties such as invariant feature utilization, class separation, and activation sparsity. Importantly, our findings indicate that large learning rates compare favorably to other hyperparameters and regularization methods, in consistently satisfying these properties in tandem. In addition to demonstrating the positive effect of large learning rates across diverse spurious correlation datasets, models, and optimizers, we also present strong evidence that the previously documented success of large learning rates in standard classification tasks is likely due to its effect on addressing hidden/rare spurious correlations in the training dataset.
Poster
Qi He · Xiao Wu · Jun-Yan He · Shuai Li

[ Exhibit Hall I ]

Abstract
Source-Free Domain Adaptive Object Detection (SF-DAOD) transfers knowledge acquired from the labeled source domain to the unlabeled target domain while preserving data privacy by restricting access to source data during adaptation. Existing approaches predominantly leverage the Mean Teacher framework for self-training in the target domain. The Exponential Moving Average (EMA) mechanism in Mean Teacher stabilizes training by averaging the student weights over training steps. However, in domain adaptation, its inherent lag in responding to emerging knowledge can hinder the student's rapid adaptation to target-domain shifts. To address this challenge, we propose the Dual-rate Dynamic Teacher (DDT) with an Asynchronous EMA (AEMA), which implements group-wise parameter updates. Unlike conventional EMA, which synchronously updates all parameters, AEMA dynamically partitions teacher parameters into two functional groups based on the contribution to capture the target domain shift. By applying a distinct smoothing coefficient to these groups, AEMA enables simultaneous fast adaptation and historical knowledge retention. Comprehensive experiments conducted on three widely used traffic benchmarks have demonstrated that the proposed DDT achieves superior performance, outperforming the state-of-the-art methods by a clear margin. The codes are available at https://anonymous.4open.science/r/Dual-Rate-Dynamic-Teacher-for-Source-Free-Domain-Adaptive-Object-Detection-17BF.
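
The group-wise EMA idea can be sketched as two momentum coefficients applied to two parameter groups; how AEMA actually partitions parameters and chooses coefficients is specific to the paper, so the name-based split and values below are placeholders.

```python
import torch

@torch.no_grad()
def groupwise_ema_update(teacher, student, fast_names, m_fast=0.9, m_slow=0.999):
    """Group-wise EMA: parameters whose names are in `fast_names` track the
    student with a smaller momentum (faster adaptation to the target domain),
    while the rest keep the usual slow momentum (knowledge retention)."""
    s_params = dict(student.named_parameters())
    for name, t_param in teacher.named_parameters():
        m = m_fast if name in fast_names else m_slow
        t_param.mul_(m).add_(s_params[name], alpha=1.0 - m)   # t = m*t + (1-m)*s

# typical usage inside the self-training loop, after each student update:
# groupwise_ema_update(teacher, student, fast_names={"head.weight", "head.bias"})
```
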
Poster
Borui Kang · Lei Wang · Zhiping Wu · Tao Feng · Yawen Li · Yang Gao · Wenbin Li

[ Exhibit Hall I ]

Abstract
Vision-Language Models (VLM) have emerged as a highly promising approach for Continual Learning (CL) due to their powerful generalized features. While adapter-based VLMs can exploit both task-specific and task-agnostic features, current CL methods have largely overlooked the distinct and evolving parameter distributions in visual and language modalities, which are found crucial for effectively mitigating catastrophic forgetting. In this study, we find that the visual modality experiences a broader parameter distribution and greater variance during class increments than the textual modality, leading to higher vulnerability to forgetting. Consequently, we handle the branches of the two modalities asymmetrically. Specifically, we propose a Dynamic Multi-layer Null Space Projection (DMNSP) strategy and apply it only to the visual modality branch, while optimizing the language branch according to the original optimizer. DMNSP can restrict the update of visual parameters within the common subspace of multiple null spaces, further limiting the impact of non-zero residual terms. Simultaneously, combined with a dynamic projection coefficient, we can precisely control the magnitude of gradient projection to the null space, endowing the model with good stability and plasticity. Extensive experiments on TinyImageNet, CIFAR100 and ImageNet-R demonstrate that our method outperforms current approaches in accuracy and knowledge retention, setting a new standard for state-of-the-art …
Poster
Guowei Xu · Peng Jin · ZiangWu ZiangWu · Li Hao · Yibing Song · Lichao Sun · Li Yuan

[ Exhibit Hall I ]

Abstract
Large language models have demonstrated substantial advancements in reasoning capabilities. However, current Vision-Language Models (VLMs) often struggle to perform systematic and structured reasoning, especially when handling complex visual question-answering tasks. In this work, we introduce LLaVA-CoT, a large VLM designed to conduct autonomous multistage reasoning. Unlike chain-of-thought prompting, LLaVA-CoT independently engages in sequential stages of summarization, visual interpretation, logical reasoning, and conclusion generation. This structured approach enables LLaVA-CoT to achieve marked improvements on reasoning-intensive tasks. To accomplish this, we construct the LLaVA-CoT-100k dataset, integrating samples from various visual question answering sources and providing structured reasoning annotations. Besides, we propose a test-time stage-wise retracing search method (SWIRES), which enables effective and efficient test-time scaling. Remarkably, with only 100k training samples and test-time scaling, LLaVA-CoT not only outperforms its base model by 9.4% on a wide range of multimodal reasoning benchmarks, but also surpasses the performance of larger and even closed-source models, such as Gemini-1.5-pro, GPT-4o-mini, and Llama-3.2-90B-Vision-Instruct. The code, dataset, and pre-trained weights will be made publicly available.
Poster
Junjie Wu · Jiangtao Xie · Zhaolin Zhang · Qilong Wang · Qinghua Hu · Peihua Li · Sen Xu

[ Exhibit Hall I ]

Abstract
Recently, Contrastive Language-Image Pre-training (CLIP) has shown promising performance on domain-specific data (e.g., biology) and has attracted increasing research attention. Existing works generally focus on collecting extensive domain-specific data and directly tuning the original CLIP models. Intuitively, such a paradigm does not fully consider the characteristics of domain-specific data (e.g., the fine-grained nature of biological data), which limits model capability while largely losing CLIP's original ability in the general domain. In this paper, we propose a Distribution Alignment-based Language-Image Pre-Training (DALIP) method for biological data. Specifically, DALIP optimizes CLIP models by matching the similarity between feature distributions of image-text pairs instead of the original [cls] tokens, which captures rich yet effective information inherent in image-text pairs as powerful representations and so better copes with the fine-grained nature of biological data. Particularly, our DALIP efficiently approximates the feature distribution via its first- and second-order statistics, while presenting a Multi-head Brownian Distance Covariance (MBDC) module to acquire second-order statistics of token features efficiently. Furthermore, we collect a new dataset for the plant domain (i.e., a specific domain within biology) comprising 10M plant samples and 3M general-domain samples (namely PlantMix-13M) according to data mixing laws. Extensive experiments show that DALIP clearly …
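As a rough illustration of matching feature distributions rather than single [cls] tokens, the sketch below summarizes each token sequence by its mean and covariance and blends the two similarities. The blending weight and the plain covariance are simplifying assumptions; the paper uses a Multi-head Brownian Distance Covariance module for the second-order term.

```python
import torch
import torch.nn.functional as F

def distribution_stats(tokens):
    # tokens: (num_tokens, dim) image patch or text token features.
    mu = tokens.mean(dim=0)                               # first-order statistic
    centered = tokens - mu
    cov = centered.T @ centered / (tokens.shape[0] - 1)   # second-order statistic
    return mu, cov

def distribution_similarity(img_tokens, txt_tokens, alpha=0.5):
    # alpha is a hypothetical weight between first- and second-order similarity.
    mu_i, cov_i = distribution_stats(img_tokens)
    mu_t, cov_t = distribution_stats(txt_tokens)
    sim_mean = F.cosine_similarity(mu_i, mu_t, dim=0)
    sim_cov = F.cosine_similarity(cov_i.flatten(), cov_t.flatten(), dim=0)
    return alpha * sim_mean + (1.0 - alpha) * sim_cov
```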
Poster
Lijie Hu · Tianhao Huang · Huanyi Xie · Xilin Gong · Chenyang Ren · Zhengyu Hu · Lu Yu · Ping Ma · Di Wang

[ Exhibit Hall I ]

Abstract
Concept Bottleneck Models (CBMs) have garnered increasing attention due to their ability to provide concept-based explanations for black-box deep learning models while achieving high final prediction accuracy using human-like concepts. However, the training of current CBMs heavily relies on the accuracy and richness of annotated concepts in the dataset. These concept labels are typically provided by experts, which can be costly and require significant resources and effort. Additionally, concept saliency maps frequently misalign with input saliency maps, causing concept predictions to correspond to irrelevant input features - an issue related to annotation alignment. To address these limitations, we propose a new framework called SSCBM (Semi-supervised Concept Bottleneck Model). Our SSCBM is suitable for practical situations where annotated data is scarce. By leveraging joint training on both labeled and unlabeled data and aligning the unlabeled data at the concept level, we effectively solve these issues. We propose a strategy to generate pseudo labels and an alignment loss. Experiments demonstrate that our SSCBM is both effective and efficient. With only 10% labeled data, our model's concept and task accuracy, averaged across four datasets, are only 2.44% and 3.93% lower, respectively, compared to the best baseline in the fully supervised learning setting.
Poster
Linjing You · Jiabao Lu · Xiayuan Huang · Xiangli Nie

[ Exhibit Hall I ]

Abstract
Test-Time Adaptation (TTA) aims to enhance the generalization of deep learning models when faced with test data that exhibits distribution shifts from the training data. In this context, only a pre-trained model and unlabeled test data are available, making it particularly relevant for privacy-sensitive applications. In practice, we observe that feature redundancy in embeddings tends to increase as domain shifts intensify in TTA. However, existing TTA methods often overlook this redundancy, which can hinder the model’s adaptability to new data. To address this issue, we introduce Feature Redundancy Elimination for Test-time Adaptation (FRET), a novel perspective for TTA. A straightforward approach (S-FRET) is to directly minimize the feature redundancy score as an optimization objective to improve adaptation. Despite its simplicity and effectiveness, S-FRET struggles with label shifts, limiting its robustness in real-world scenarios. To mitigate this limitation, we further propose Graph-based FRET (G-FRET), which integrates a Graph Convolutional Network (GCN) with contrastive learning. This design not only reduces feature redundancy but also enhances feature discriminability in both the representation and prediction layers. Extensive experiments across multiple model architectures, tasks, and datasets demonstrate the effectiveness of S-FRET and show that G-FRET achieves state-of-the-art performance. Further analysis reveals that G-FRET enables the …
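A minimal sketch of the S-FRET idea of minimizing a feature redundancy score at test time, assuming PyTorch. The particular score (mean absolute off-diagonal correlation) and the choice of which parameters to update are illustrative; G-FRET's graph and contrastive components are not shown.

```python
import torch

def feature_redundancy(features, eps=1e-8):
    # features: (batch, dim) embeddings of a test batch. Redundancy is measured
    # here as the mean absolute off-diagonal entry of the dimension-wise
    # correlation matrix.
    z = (features - features.mean(0)) / (features.std(0) + eps)
    corr = z.T @ z / features.shape[0]
    off_diag = corr - torch.diag(torch.diag(corr))
    return off_diag.abs().mean()

def sfret_step(model, batch, optimizer):
    # One test-time adaptation step: minimize redundancy of the batch features.
    # `model(batch)` is assumed to return penultimate-layer features; typically
    # only normalization-layer parameters would be optimized.
    loss = feature_redundancy(model(batch))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```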
Poster
Hongcheng Gao · Tianyu Pang · Chao Du · Taihang Hu · Zhijie Deng · Min Lin

[ Exhibit Hall I ]

Abstract
With the rapid progress of diffusion models (DMs), significant efforts are being made to unlearn harmful or copyrighted concepts from pretrained DMs to prevent potential model misuse. However, it is observed that even when DMs are properly unlearned before release, malicious finetuning can compromise this process, causing DMs to *relearn the unlearned concepts*. This occurs partly because certain benign concepts (e.g., ''skin'') retained in DMs are related to the unlearned ones (e.g., ''nudity''), facilitating their relearning via finetuning. To address this, we propose **meta-unlearning** on DMs. Intuitively, a meta-unlearned DM should behave like an unlearned DM when used as is; moreover, if the meta-unlearned DM undergoes malicious finetuning on unlearned concepts, the related benign concepts retained within it will be triggered to *self-destruct*, hindering the relearning of unlearned concepts. Our meta-unlearning framework is compatible with most existing unlearning methods, requiring only the addition of an easy-to-implement meta objective. We validate our approach through empirical experiments on meta-unlearning concepts from Stable Diffusion models (SD-v1-4 and SDXL), supported by extensive ablation studies.
Poster
Xu Cheng · Xin Jiang · Zechao Li

[ Exhibit Hall I ]

Abstract
This paper explains training-time out-of-distribution (OOD) detection from a novel view, i.e., interactions between different input variables of deep neural networks (DNNs). Specifically, we provide a unified understanding of the effectiveness of current training-time OOD detection methods, i.e., DNNs trained with these methods all encode more complex interactions for inference than those trained without training-time methods, which contributes to their superior OOD detection performance. We further conduct thorough empirical analyses and verify that complex interactions play a primary role in OOD detection, by developing a simple-yet-efficient method to force the DNN to learn interactions of specific complexities and evaluate the change of OOD detection performances. Besides, we also use interactions to investigate why near-OOD samples are more difficult to distinguish from in-distribution (ID) samples than far-OOD samples, mainly because compared to far-OOD samples, the distribution of interactions in near-OOD samples is more similar to that of ID samples. Moreover, we discover that training-time OOD detection methods can effectively decrease such similarities. The code will be released when the paper is accepted.
Poster
Xu Chen · Yang Li · Yahong Han · Guangquan Xu · Jialie Shen

[ Exhibit Hall I ]

Abstract
Data-Free Knowledge Distillation (DFKD) avoids accessing the original training data during knowledge transferring from a large model to a smaller one, possessing significant potential in ensuring the widespread promotion of industry-level applications while safeguarding user privacy and data security. Unfortunately, due to the lack of precise estimation of the original data distribution, existing DFKD methods often rely on manually induced priors to constrain the generator to produce samples that comply with the rules as much as possible. In this paper, we propose a novel method dubbed \textbf{C}ou\textbf{P}ling \textbf{Net}work (\textbf{CPNet}) that constructs a generator to explicitly approximate the inverse transformation of the teacher model. Consequently, the two components can be integrated into an autoencoder specifically tailored for label information, where the generated images are treated as latent variables. Since real labels are typically uniformly distributed and the parameters of the teacher model are fixed, this enables our generator to produce images that closely approximate the true distribution. Besides, we transform real labels into feature-level constraints through the inverse transformation of a network classifier with fixed parameters, thereby converting the classification problem of generated images into an issue of distance measurement between features. We utilize this constraint for adversarial training and enhancing …
Poster
Junhyeog Yun · Minui Hong · Gunhee Kim

[ Exhibit Hall I ]

Abstract
Neural fields provide a memory-efficient representation of data, which can effectively handle diverse modalities and large-scale data. However, learning to map neural fields often requires large amounts of training data and computation, which can be limited on resource-constrained edge devices. One approach to tackle this limitation is to leverage Federated Meta-Learning (FML), but traditional FML approaches suffer from privacy leakage. To address these issues, we introduce a novel FML approach called FedMeNF. FedMeNF utilizes a new privacy-preserving loss function that regulates privacy leakage in the local meta-optimization. This enables the local meta-learner to optimize quickly and efficiently without retaining the client's private data. Our experiments demonstrate that FedMeNF achieves fast optimization speed and robust reconstruction performance, even with few-shot or non-IID data across diverse data modalities, while preserving client data privacy.
Poster
Heitor Rapela Medeiros · Atif Belal · Srikanth Muralidharan · Eric Granger · Marco Pedersoli

[ Exhibit Hall I ]

Abstract
The zero-shot performance of object detectors degrades when tested on different modalities, such as infrared and depth. While recent work has explored image translation techniques to adapt detectors to new modalities, these methods are limited to a single modality and apply only to traditional detectors. Recently, vision-language detectors, such as YOLO-World and Grounding DINO, have shown promising zero-shot capabilities; however, they have not yet been adapted to other visual modalities. Traditional fine-tuning approaches compromise the zero-shot capabilities of the detectors. The visual prompt strategies commonly used for classification with vision-language models apply the same linear prompt translation to each image, making them less effective. To address these limitations, we propose ModPrompt, a visual prompt strategy to adapt vision-language detectors to new modalities without degrading zero-shot performance. In particular, an encoder-decoder visual prompt strategy is proposed, further enhanced by the integration of an inference-friendly modality prompt decoupled residual, facilitating a more robust adaptation. Empirical benchmarking on two vision-language detectors, YOLO-World and Grounding DINO, and on challenging infrared (LLVIP, FLIR) and depth (NYUv2) datasets shows that our method achieves performance comparable to full fine-tuning while preserving the model's zero-shot capability.
Poster
Xavier Thomas · Deepti Ghadiyaram

[ Exhibit Hall I ]

Abstract
Domain Generalization aims to develop models that can generalize to novel and unseen data distributions. In this work, we study how model architectures and pre-training objectives impact feature richness and propose a method to effectively leverage them for domain generalization. Specifically, given a pre-trained feature space, we first discover latent domain structures, referred to as pseudo-domains, that capture domain-specific variations in an unsupervised manner. Next, we augment existing classifiers with these complementary pseudo-domain representations making them more amenable to diverse unseen test domains. We analyze how different pre-training feature spaces differ in the domain-specific variances they capture. Our empirical studies reveal that features from diffusion models excel at separating domains in the absence of explicit domain labels and capture nuanced domain-specific information. On 5 datasets, we show that our very simple framework improves generalization to unseen domains by a maximum test accuracy improvement of over **4%** compared to the standard baseline Empirical Risk Minimization (ERM). Crucially, our method outperforms most algorithms that access domain labels during training. Code is available at: https://anonymous.4open.science/r/GUIDE-B567/README.md.
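A rough sketch of discovering pseudo-domains by clustering frozen pre-trained features and appending soft domain assignments to the classifier input. The cluster count, k-means, and the softmax-over-distances encoding are illustrative choices, not necessarily the paper's exact procedure.

```python
import numpy as np
from sklearn.cluster import KMeans

def discover_pseudo_domains(features, k=4, seed=0):
    # features: (n, d) frozen pre-trained embeddings with no domain labels.
    km = KMeans(n_clusters=k, random_state=seed, n_init=10).fit(features)
    return km.labels_, km.cluster_centers_

def augment_with_pseudo_domains(features, centers):
    # Append a soft pseudo-domain assignment to each feature vector so a
    # standard classifier can condition on domain-specific variation.
    d2 = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(-1)  # (n, k)
    logits = -d2 - (-d2).max(axis=1, keepdims=True)                   # stabilize
    soft = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    return np.concatenate([features, soft], axis=1)
```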
Poster
Venkat Adithya Amula · Sunayana Samavedam · Saurabh Saini · Avani Gupta · P J Narayanan

[ Exhibit Hall I ]

Abstract
Deep learning models are susceptible to {\em backdoor attacks}, in which malicious attackers perturb a small subset of training data with a {\em trigger} to cause misclassifications. Various triggers have been used, including semantic triggers that are easily realizable without requiring the attacker to manipulate the image. The emergence of generative AI has eased the generation of varied poisoned samples. Robustness across types of triggers is crucial to effective defense. We propose Prototype Guided Backdoor Defense (PGBD), a robust post-hoc defense that scales across different trigger types, including previously unsolved semantic triggers. PGBD exploits displacements in the geometric spaces of activations to penalize movements towards the trigger. This is done using a novel sanitization loss in a post-hoc fine-tuning step. The geometric approach scales easily to all types of attacks. PGBD achieves better performance across all settings. We also present the first defense against a new semantic attack on celebrity face images.
Poster
Pegah KHAYATAN · Mustafa Shukor · Jayneel Parekh · Arnaud Dapogny · Matthieu Cord

[ Exhibit Hall I ]

Abstract
Multimodal LLMs have reached remarkable levels of proficiency in understanding multimodal inputs, driving extensive research to develop increasingly powerful models. However, far less attention has been given to understanding and explaining the underlying mechanisms of these models. Most existing explainability research examines these models only in their final states, overlooking the dynamic representational shifts that occur during training. In this work, we systematically analyze the evolution of hidden-state representations to reveal how fine-tuning alters a model’s internal structure to specialize on new multimodal tasks. Using a concept-based approach, we map hidden states to interpretable visual and textual concepts, enabling us to trace changes in encoded concepts across modalities as training progresses. We also demonstrate the use of shift vectors to capture these concept changes. These shift vectors allow us to recover fine-tuned concepts by shifting those in the original model. Finally, we explore the practical impact of our findings on model steering, showing that we can adjust multimodal LLM behaviors without any training, such as modifying answer types or caption style, or biasing the model toward specific responses. Our work sheds light on how multimodal representations evolve through fine-tuning and offers a new perspective for interpreting model adaptation in multimodal tasks. …
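A minimal sketch of the shift-vector idea, assuming hidden states are collected at the same layer for the same concept before and after fine-tuning; the scalar steering strength is a hypothetical knob.

```python
import torch

def concept_shift_vector(base_acts, finetuned_acts):
    # base_acts, finetuned_acts: (num_samples, dim) hidden states for one concept
    # from the original and the fine-tuned model, respectively.
    return finetuned_acts.mean(0) - base_acts.mean(0)

def steer_hidden_states(hidden, shift, strength=1.0):
    # Training-free steering: add the (scaled) shift vector to the hidden states
    # to push the model toward the fine-tuned concept, e.g. a different answer
    # type or caption style.
    return hidden + strength * shift
```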
Poster
Lukas Kuhn · sari sadiya · Jörg Schlötterer · Florian Buettner · Christin Seifert · Gemma Roig

[ Exhibit Hall I ]

Abstract
Shortcut learning, i.e., a model's reliance on undesired features not directly relevant to the task, is a major challenge that severely limits the applications of machine learning algorithms, particularly when deploying them to assist in making sensitive decisions, such as in medical diagnostics. In this work, we leverage recent advancements in machine learning to create an unsupervised framework that is capable of both detecting and mitigating shortcut learning in transformers. We validate our method on multiple datasets. Results demonstrate that our framework significantly improves both worst-group accuracy (samples misclassified due to shortcuts) and average accuracy, while minimizing human annotation effort. Moreover, we demonstrate that the detected shortcuts are meaningful and informative to human experts, and that our framework is computationally efficient, allowing it to be run on consumer hardware.
Poster
Ada-Astrid Balauca · Sanjana Garai · Stefan Balauca · Rasesh Shetty · Naitik Agrawal · Dhwanil Shah · Yuqian Fu · Xi Wang · Kristina Toutanova · Danda Pani Paudel · Luc Gool

[ Exhibit Hall I ]

Abstract
Museums serve as repositories of cultural heritage and historical artifacts from diverse epochs, civilizations, and regions, preserving well-documented collections that encapsulate vast knowledge, which, when systematically structured into large-scale datasets, can train specialized models. Visitors engage with exhibits through curiosity and questions, making expert domain-specific models essential for interactive query resolution and gaining historical insights. Understanding exhibits from images requires analyzing visual features and linking them to historical knowledge to derive meaningful correlations. We facilitate such reasoning by (a) collecting and curating a large-scale dataset of 65M images and 200M question-answer pairs for exhibits from all around the world; (b) training large vision-language models (VLMs) on the collected dataset; (c) benchmarking their ability on five visual question answering tasks, specifically designed to reflect real-world inquiries and challenges observed in museum settings. The complete dataset is labeled by museum experts, ensuring the quality and the practical significance of the labels. We train two VLMs from different categories: BLIP with vision-language aligned embeddings, but lacking the expressive power of large language models, and the LLaVA model, a powerful instruction-tuned LLM enriched with vision-language reasoning capabilities. Through extensive experiments, we find that while both model types effectively answer visually grounded questions, large vision-language models …
Poster
Townim Chowdhury · Vu Phan · Kewen Liao · Nanyu Dong · Minh-Son To · Anton Hengel · Johan Verjans · Zhibin Liao

[ Exhibit Hall I ]

Abstract
Counterfactual explanations (CFE) for deep image classifiers aim to reveal how minimal input changes lead to different model decisions, providing critical insights for model interpretation and improvement. However, existing CFE methods often rely on additional image encoders and generative models to create plausible images, neglecting the classifier's own feature space and decision boundaries. As such, they do not explain the intrinsic feature space and decision boundaries learned by the classifier. To address this limitation, we propose Mirror-CFE, a novel method that generates faithful counterfactual explanations by operating directly in the classifier's feature space, treating decision boundaries as mirrors that ``reflect'' feature representations. Mirror-CFE learns a mapping function from feature space to image space while preserving distance relationships, enabling smooth transitions between source images and their counterfactuals. Through extensive experiments on four image datasets, we demonstrate that Mirror-CFE achieves superior performance in validity while maintaining input resemblance compared to state-of-the-art explanation methods. Finally, Mirror-CFE provides interpretable visualization of the classifier's decision process by generating step-wise transitions that reveal how features evolve as classification confidence changes.
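The 'decision boundary as mirror' intuition can be written down for a locally linear boundary w·x + b = 0 in feature space. The reflection below is a simplification of Mirror-CFE, which additionally learns a distance-preserving map from features back to images.

```python
import torch

def mirror_reflect(feature, w, b):
    # Reflect a feature vector across the hyperplane {x : w.x + b = 0}:
    #   x' = x - 2 * (w.x + b) / ||w||^2 * w
    # The reflected point lies on the other side of the boundary at the same
    # distance, serving as the counterfactual feature.
    scale = (feature @ w + b) / (w @ w)
    return feature - 2.0 * scale * w
```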
Poster
Shicai Wei · Chunbo Luo · Yang Luo

[ Exhibit Hall I ]

Abstract
Multimodal learning often encounters the under-optimized problem and may perform worse than unimodal learning. Existing approaches attribute this issue to imbalanced learning across modalities and tend to address it through gradient balancing. However, this paper argues that balanced learning is not the optimal setting for multimodal learning. With bias-variance analysis, we prove that an imbalanced dependency on each modality, following the inverse ratio of their variances, contributes to optimal performance. To this end, we propose the Asymmetric Representation Learning (ARL) strategy to assist multimodal learning via imbalanced optimization. ARL introduces auxiliary regularizers for each modality encoder to calculate their prediction variance. ARL then calculates coefficients via the unimodal variance to re-weight the optimization of each modality, forcing the modality dependence ratio to be inversely proportional to the modality variance ratio. Moreover, to minimize the generalization error, ARL further introduces the prediction bias of each modality and jointly optimizes them with the multimodal loss. Notably, all auxiliary regularizers share parameters with the multimodal model and rely only on the modality representation. Thus, the proposed ARL strategy introduces no extra parameters and is independent of the structures and fusion methods of the multimodal model. Finally, extensive experiments on various datasets validate the effectiveness and versatility …
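A minimal sketch of inverse-variance re-weighting across modalities, assuming the per-modality prediction variances come from the auxiliary regularizers; the normalization below is an assumption.

```python
import torch

def arl_weights(variances, eps=1e-8):
    # variances: (num_modalities,) prediction variances; dependence on each
    # modality is made inversely proportional to its variance.
    inv = 1.0 / (variances + eps)
    return inv / inv.sum()

def weighted_multimodal_loss(unimodal_losses, variances):
    # unimodal_losses: list of per-modality scalar losses sharing parameters
    # with the multimodal model.
    w = arl_weights(variances)
    return (w * torch.stack(unimodal_losses)).sum()
```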
Poster
Junjie Nan · Jianing Li · Wei Chen · Mingkun Zhang · Xueqi Cheng

[ Exhibit Hall I ]

Abstract
Adversarial purification has achieved great success in combating adversarial image perturbations, which are usually assumed to be additive. However, non-additive adversarial perturbations such as blur, occlusion, and distortion are also common in the real world. Under such perturbations, existing adversarial purification methods are much less effective since they are designed to fit the additive nature. In this paper, we propose an extended adversarial purification framework named NAPPure, which can further handle non-additive perturbations. Specifically, we first establish the generation process of an adversarial image, and then disentangle the underlying clean image and perturbation parameters through likelihood maximization. Experiments on GTSRB and CIFAR-10 datasets show that NAPPure significantly boosts the robustness of image classification models against non-additive perturbations.
Poster
Qiucheng Wu · Handong Zhao · Michael Saxon · Trung Bui · William Yang Wang · Yang Zhang · Shiyu Chang

[ Exhibit Hall I ]

Abstract
Multimodal large language models are an exciting emerging class of language models (LMs) that have merged classic LM capabilities with those of image processing systems. However, how these capabilities integrate is often not intuitive and warrants direct investigation. One understudied capability in MLLMs is visual spatial planning---the ability to comprehend the spatial arrangements of objects and devise action plans to achieve desired outcomes in visual scenes. It is unclear why MLLMs fall short on these tasks generally considered easy for humans, given their successes across other diverse scenarios. To this end, we introduce VSP, a benchmark that 1) evaluates the spatial planning capability in MLLMs in general, and 2) diagnoses this capability via finer-grained sub-tasks, including perception and reasoning, measuring model capabilities through these sub-tasks. Our evaluation confirms that both open-source and private MLLMs fail to generate effective plans for even simple spatial planning tasks. Evaluations on the fine-grained analytical tasks further reveal fundamental deficiencies in the models’ visual perception and bottlenecks in reasoning abilities, explaining their worse performance in the general spatial planning tasks. Our work illuminates future directions for improving MLLMs' abilities in spatial planning.
Poster
Xinyu Chen · Haotian Zhai · Can Zhang · XIUPENG SHI · Ruirui Li

[ Exhibit Hall I ]

Abstract
In the zero-shot setting, test-time adaptation (TTA) adjusts pre-trained models using unlabeled data from the test phase to enhance performance on unknown test distributions. Existing cache-enhanced TTA methods rely on a low-entropy criterion to select samples for prototype construction, assuming intra-class compactness. However, low-entropy samples may be unreliable under distribution shifts, and the resulting prototypes may not ensure compact intra-class distributions. This study identifies a positive correlation between cache-enhanced performance and intra-class compactness. Based on this observation, we propose a Multi-Cache enhanced Prototype-based Test-Time Adaptation (MCP) featuring three caches: an entropy cache for initializing prototype representations with low-entropy samples, an align cache for integrating visual and textual information to achieve compact intra-class distributions, and a negative cache for prediction calibration using high-entropy samples. We further develop MCP++, a framework incorporating cross-modal prototype alignment and residual learning, introducing prototype residual fine-tuning. Comparative and ablation experiments across 15 downstream tasks demonstrate that the proposed method and framework achieve state-of-the-art generalization performance.
Poster
Kang Zeng · Guojin Zhong · Jintao Cheng · Jin Yuan · Zhiyong Li

[ Exhibit Hall I ]

Abstract
The advancement of Multimodal Large Language Models (MLLMs) has driven significant progress in Visual Question Answering (VQA), evolving from Single-Image VQA to Multi-Image VQA (MVQA). However, the increased number of images in MVQA inevitably introduces substantial visual redundancy that is irrelevant to question answering (QA), negatively impacting both accuracy and efficiency. To address this issue, existing methods often lack flexibility in controlling the number of compressed visual tokens and tend to produce discrete visual fragments, which hinder MLLMs' ability to comprehend images holistically. In this paper, we propose a straightforward yet universal Adaptive Visual Anchoring strategy, which can be seamlessly integrated into existing MLLMs, offering significant accuracy improvements through adaptive compression. Technically, our approach first constructs a response map that captures local relevance within an image concerning a given textual question by measuring cross-modal similarity. Next, a series of anchor boxes are generated around the gravity center of the response map, with the highest-confidence box selected and fed into MLLMs for question answering. To further enhance performance, we introduce a novel collaborative decoding mechanism that balances the answering results derived from both global and compressed images. Since compressed images effectively filter out irrelevant visual regions, they enable MLLMs to establish a more …
Poster
Kesen Zhao · Beier Zhu · Qianru Sun · Hanwang Zhang

[ Exhibit Hall I ]

Abstract
Chain-of-thought (CoT) reasoning greatly improves the interpretability and problem-solving abilities of multimodal large language models (MLLMs). However, existing approaches are focused on text CoT, limiting their ability to leverage visual cues. Visual CoT remains underexplored, and the only work is based on supervised fine-tuning (SFT) that relies on extensive labeled bounding-box data and is hard to generalize to unseen cases. In this paper, we introduce Unsupervised Visual CoT (UV-CoT), a novel framework for image-level CoT reasoning via preference optimization. UV-CoT performs preference comparisons between model-generated bounding boxes (one is preferred and the other is dis-preferred), eliminating the need for bounding-box annotations. We get such preference data by introducing an automatic data generation pipeline. Given an image, our target MLLM (e.g., LLaVA-1.5-7B) generates seed bounding boxes using a template prompt and then answers the question using each bounded region as input. An evaluator MLLM (e.g., OmniLLM-12B) ranks the responses, and these rankings serve as supervision to train the target MLLM with UV-CoT by minimizing negative log-likelihood losses. By emulating human perception--identifying key regions and reasoning based on them--UV-CoT can improve visual comprehension, particularly in spatial reasoning tasks where textual descriptions alone fall short. Our experiments on six datasets demonstrate the superiority …
Poster
Ömer Veysel Çağatan · Ömer TAL · M. Emre Gursoy

[ Exhibit Hall I ]

Abstract
Self-supervised learning (SSL) has advanced significantly in visual representation learning, yet comprehensive evaluations of its adversarial robustness remain limited. In this study, we evaluate the adversarial robustness of seven discriminative self-supervised models and one supervised model across diverse tasks, including ImageNet classification, transfer learning, segmentation, and detection. Our findings suggest that discriminative SSL models generally exhibit better robustness to adversarial attacks compared to their supervised counterpart on ImageNet, with this advantage extending to transfer learning when using linear evaluation. However, when fine-tuning is applied, the robustness gap between SSL and supervised models narrows considerably. Similarly, this robustness advantage diminishes in segmentation and detection tasks. We also investigate how various factors might influence adversarial robustness, including architectural choices, training duration, data augmentations, and batch sizes. Our analysis contributes to the ongoing exploration of adversarial robustness in visual self-supervised representation systems.
Poster
Zhaoyang Wu · Fang Liu · Licheng Jiao · Shuo Li · Lingling Li · Xu Liu · Puhua Chen · wenping ma

[ Exhibit Hall I ]

Abstract
Vision-language models like CLIP have demonstrated strong zero-shot generalization, making them valuable for various downstream tasks through prompt learning. However, existing test-time prompt tuning methods, such as entropy minimization, treat both text and visual prompts as fixed learnable parameters, limiting their adaptability to unseen domains. In contrast, we propose Hierarchical Variational Test-Time Prompt Generation, a novel approach where both text and visual prompts are dynamically generated via a HyperTransformer at inference time. This enables the model to produce data-specific prompts for each modality, significantly improving generalization. To further address template sensitivity and distribution shifts, we introduce variational prompt generation, leveraging variational inference to mitigate biases introduced by different prompt templates and data augmentations. Additionally, our hierarchical variational prompt generation conditions prompts at each layer on those from previous layers, allowing the model to capture deeper contextual dependencies and refine prompt interactions for robust adaptation. Extensive experiments on domain generalization benchmarks demonstrate that our method significantly outperforms existing prompt-learning techniques, achieving state-of-the-art zero-shot accuracy while maintaining efficiency.
Poster
Yeming Yang · Qingling Zhu · Jianping Luo · Ka-Chun Wong · Qiuzhen Lin · Jianqiang Li

[ Exhibit Hall I ]

Abstract
Deep Neural Networks (DNNs) have succeeded remarkably in various computer vision tasks. However, they remain vulnerable to adversarial attacks, which could lead to severe security risks. In recent years, robust neural architecture search (NAS) has gradually become an emerging direction for designing adversarially robust architectures. However, existing robust NAS methods rely on repeatedly training numerous DNNs to evaluate robustness, which makes the search process extremely expensive. In this paper, we propose a training-free robust NAS method (TRNAS) that significantly reduces search costs. First, we design a zero-cost proxy model (R-Score) that formalizes adversarial robustness evaluation by exploring the theory of DNN's linear activation capability and feature consistency. This proxy only requires initialized weights for evaluation, which avoids expensive adversarial training costs. Secondly, we introduce a multi-objective selection (MOS) strategy to save candidate architectures with robustness and compactness. Experimental results show that TRNAS only requires 0.02 GPU days to find a promising robust architecture in a vast search space including approximately 10$^{20}$ networks. TRNAS surpasses other state-of-the-art robust NAS methods under both white-box and black-box attacks. Finally, we summarize a few meaningful conclusions for designing robust architectures and promoting the development of the robust NAS field.
Poster
Oliver Sutton · Qinghua Zhou · George Leete · Alexander Gorban · Ivan Tyukin

[ Exhibit Hall I ]

Abstract
We introduce new methods of staining and locking computer vision models, to protect their owners' intellectual property. Staining, also known as watermarking, embeds secret behaviour into a model which can later be used to identify it, while locking aims to make a model unusable unless a secret trigger is inserted into input images. Unlike existing methods, our algorithms can be used to stain and lock pre-trained models without requiring fine-tuning or retraining, and come with provable, computable guarantees bounding their worst-case false positive rates. The stain and lock are implemented by directly modifying a small number of the model's weights and have minimal impact on the (unlocked) model's performance. Locked models are unlocked by inserting a small `trigger patch' into the corner of the input image. We present experimental results showing the efficacy of our methods and demonstrating their practical performance on a variety of computer vision models.
Poster
Dahee Kwon · Sehyun Lee · Jaesik Choi

[ Exhibit Hall I ]

Abstract
Deep vision models have achieved remarkable classification performance by leveraging a hierarchical architecture in which human-interpretable concepts emerge through the composition of individual neurons across layers. Given the distributed nature of representations, pinpointing where specific concepts are encoded within a model remains a crucial yet challenging task in computer vision. In this paper, we introduce an effective circuit discovery method, called $\textit{Granular Concept Circuits (GCCs)}$, in which each circuit represents a concept relevant to a given query. Our method iteratively assesses inter-neuron connectivity—focusing on dependencies and semantic alignment—to construct each GCC. By automatically discovering multiple GCCs, each capturing specific concepts within that query, our approach offers a profound, concept-wise interpretation of models and is the first to identify circuits tied to specific visual concepts at a fine-grained level. We validate the versatility and effectiveness of GCCs across various deep image classification models. The source code will be publicly available.
Poster
Jianhan Wu · Xiaoyang Qu · Zhangcheng Huang · Jianzong Wang

[ Exhibit Hall I ]

Abstract
Prompt learning has become an efficient paradigm for adapting CLIP to downstream tasks. Compared with traditional fine-tuning, prompt learning optimizes a few parameters yet yields highly competitive results, especially appealing in federated learning for computational efficiency. In federated learning scenarios, data across different clients is often non-IID, leading to domain shift among clients, which poses a formidable challenge to the adaptation of downstream tasks. Federated domain generalization (FDG) methods typically learn fixed or residual soft prompts from training samples, replacing manually designed prompts to enhance the generalization ability of federated models. However, these learned prompts lack diversity and tend to ignore information about unknown domains. We propose a novel and effective method from a generative perspective for handling FDG tasks, namely federated domain generalization with domain-specific soft prompts generation (FedDSPG). Specifically, in the training phase, we introduce domain-specific soft prompts (DSPs) for each domain and integrate domain and content knowledge into the generative model among clients. In the inference phase, the generator is utilized to obtain DSPs for unseen target domains, thus guiding downstream tasks in unknown domains. Extensive experiments on several public datasets show that our method achieves state-of-the-art performance compared with the strong baselines in FDG.
Poster
yi yang · Xiaoxuan He · Hongkun Pan · Xiyan Jiang · Yan Deng · Xingtao Yang · Haoyu Lu · Dacheng Yin · Fengyun Rao · Minfeng Zhu · Bo Zhang · Wei Chen

[ Exhibit Hall I ]

Abstract
Large Language Models have demonstrated remarkable reasoning capability in complex textual tasks. However, multimodal reasoning, which requires integrating visual and textual information, remains a significant challenge. Existing visual-language models often struggle to effectively analyze and reason about visual content, resulting in suboptimal performance on complex reasoning tasks. Moreover, the absence of comprehensive benchmarks hinders the accurate assessment of multimodal reasoning capabilities. In this paper, we introduce R1-Onevision, a multimodal reasoning model designed to bridge the gap between visual perception and deep reasoning. To achieve this, we propose a cross-modal reasoning pipeline that transforms images into formal textual representations, enabling precise language-based reasoning. Leveraging this pipeline, we construct the R1-Onevision dataset which provides detailed, step-by-step multimodal reasoning annotations across diverse domains. We further develop the R1-Onevision model through supervised fine-tuning and reinforcement learning to cultivate advanced reasoning and robust generalization abilities. To comprehensively evaluate multimodal reasoning performance across different grades, we introduce R1-Onevision-Bench, a benchmark aligned with human educational stages, covering exams from junior high school to university and beyond. Experimental results show that R1-Onevision achieves state-of-the-art performance, outperforming models such as GPT-4o and Qwen2.5-VL on multiple challenging multimodal reasoning benchmarks.
Poster
Yunfei Long · Zilin Tian · Liguo Zhang · Huosheng Xu

[ Exhibit Hall I ]

Abstract
Transferability makes black-box attacks practical. Recent studies demonstrate that adversarial examples situated at flat maxima on the loss landscape tend to exhibit higher transferability and propose effective strategies to optimize adversarial examples to converge toward that region. However, these works primarily consider first-order gradient regularization and have yet to explore higher-order geometric properties of the flat loss landscape, which may lead to suboptimal results. In this work, we propose leveraging the trace of the Hessian matrix of the loss function with respect to the adversarial example as a curvature-aware regularizer. For computational efficiency, we introduce an approximation method for the trace based on stochastic estimation and finite differences. We theoretically and empirically demonstrate that the trace of Hessian matrices for adversarial examples near local loss maxima is consistently negative. Following this insight, we propose **Negative Hessian Trace Regularization (NHTR)**, explicitly penalizing the negative Hessian trace to suppress curvature. Compared to existing first-order regularization methods, NHTR can generate adversarial examples in flatter local regions. Extensive experimental results on the ImageNet-compatible and CIFAR-10 datasets show that NHTR significantly improves adversarial transferability over state-of-the-art attacks.
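The 'stochastic estimation + finite difference' approximation of the Hessian trace can be sketched with Hutchinson-style Rademacher probes. The probe count, step size, and the way the regularizer enters the attack objective below are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def hessian_trace_estimate(loss_fn, x, num_probes=8, h=1e-3):
    # Estimate tr(H) = E_v[v^T H v] with Rademacher probes v, approximating the
    # Hessian-vector product H v by a central finite difference of gradients.
    est = 0.0
    for _ in range(num_probes):
        v = torch.randint_like(x, 0, 2) * 2 - 1              # entries in {-1, +1}
        xp = (x + h * v).detach().requires_grad_(True)
        xm = (x - h * v).detach().requires_grad_(True)
        gp = torch.autograd.grad(loss_fn(xp), xp)[0]
        gm = torch.autograd.grad(loss_fn(xm), xm)[0]
        est = est + torch.sum(v * (gp - gm)) / (2 * h)       # v^T H v
    return est / num_probes

# NHTR-style objective sketch: maximize loss_fn(x_adv) while pushing the
# (typically negative) trace toward zero to favor flatter local maxima, e.g.
#   objective = loss_fn(x_adv) + lam * hessian_trace_estimate(loss_fn, x_adv)
```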
Poster
Laura Niss · Kevin Vogt-Lowell · Theodoros Tsiligkaridis

[ Exhibit Hall I ]

Abstract
The fine-tuning of large vision-language foundation models remains an underexplored area, particularly regarding its impact on learning gains and catastrophic forgetting. Inspired by the significance of modality gaps in contrastive dual-encoders, we introduce the Inter-Intra Modal Measure (IIMM)—a predictive metric that quantifies the relationship between intra-modal image embedding similarity and inter-modal misalignment. Through extensive empirical analysis across four state-of-the-art vision-language models and five fine-tuning techniques, we establish a strong linear relationship: tasks with higher IIMM scores yield greater in-domain performance improvements but suffer from more pronounced out-of-domain degradation, with some parameter-efficient fine-tuning (PEFT) methods exhibiting severe forgetting. Compared to existing transferability measures, the IIMM demonstrates significantly stronger predictive power for accuracy changes post fine-tuning in dual-encoder models. Moreover, we provide a theoretical bound, proving that changes in IIMM are limited by the Wasserstein distance between pre- and post-fine-tuning embedding distributions, ensuring its stability and robustness as a predictive measure. With only a single forward pass of the target data, practitioners can leverage this key insight to evaluate the degree to which a model can be expected to improve following fine-tuning. When combined with prior knowledge of a model’s performance across diverse tasks, the IIMM further enhances transferability predictions for novel …
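One plausible reading of the measure (not necessarily the paper's exact formula) combines the average intra-modal similarity of image embeddings with the average inter-modal misalignment of paired image-text embeddings, computable from a single forward pass over the target data:

```python
import torch
import torch.nn.functional as F

def iimm(image_embs, text_embs):
    # image_embs: (n, d) image embeddings; text_embs: (n, d) paired text embeddings.
    img = F.normalize(image_embs, dim=-1)
    txt = F.normalize(text_embs, dim=-1)
    n = img.shape[0]
    intra = ((img @ img.T).sum() - n) / (n * (n - 1))        # mean pairwise image similarity
    inter_misalignment = 1.0 - (img * txt).sum(-1).mean()    # 1 - mean paired similarity
    return intra + inter_misalignment
```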
Poster
Byungchul Chae · Seonyeong Heo

[ Exhibit Hall I ]

Abstract
Knowledge Distillation (KD) has been established as an effective technique for reducing the resource requirements of models when tackling computer vision tasks. Prior work has studied how to distill the knowledge of a teacher model better, but it overlooks how data affects the distillation result. This work examines the impact of data in knowledge distillation from two perspectives: (i) quantity of knowledge and (ii) quality of knowledge. Our examination finds that faster knowledge distillation can be achieved by using data with a large amount of high-quality knowledge in distillation. Based on the findings, this work proposes an efficient adaptive sampling method called KDAS for faster knowledge distillation, which enhances the distillation efficiency by selecting and applying 'good' samples for the distillation. This work shows that our adaptive sampling methods can effectively accelerate the training efficiency of a student model when combined with existing KD methods.
Poster
Yuhao Sun · Yihua Zhang · Gaowen Liu · Hongtao Xie · Sijia Liu

[ Exhibit Hall I ]

Abstract
With the increasing demand for the right to be forgotten, machine unlearning (MU) has emerged as a vital tool for enhancing trust and regulatory compliance by enabling the removal of sensitive data influences from machine learning (ML) models. However, most MU algorithms primarily rely on in-training methods to adjust model weights, with limited exploration of the benefits that data-level adjustments could bring to the unlearning process. To address this gap, we propose a novel approach that leverages digital watermarking to facilitate MU by strategically modifying data content. By integrating watermarking, we establish a controlled unlearning mechanism that enables precise removal of specified data while maintaining model utility for unrelated tasks. We first examine the impact of watermarked data on MU, finding that MU effectively generalizes to watermarked data. Building on this, we introduce an unlearning-friendly watermarking framework, termed Water4MU, to enhance unlearning effectiveness. The core of Water4MU is a bi-level optimization (BLO) framework: at the upper level, the watermarking network is optimized to minimize unlearning difficulty, while at the lower level, the model itself is trained independently of watermarking. Experimental results demonstrate that Water4MU is effective in MU across both image classification and image generation tasks. Notably, it outperforms existing …
Poster
Lixu Wang · Chenxi Liu · Junfeng Guo · Qingqing Ye · Heng Huang · Haibo Hu · Wei Dong

[ Exhibit Hall I ]

Abstract
Federated Learning (FL) studies often assume a static data distribution, whereas real-world scenarios involve dynamic changes. To address this gap, we study Federated Continuous Category Discovery and Learning (FC^2DL)---an essential yet underexplored problem that enables FL models to evolve continuously by discovering and learning novel data categories. The key challenge in FC^2DL lies in merging and aligning categories discovered and learned by different clients, all while maintaining privacy. To tackle this, we propose the Global Prototype Alignment (GPA) framework. GPA first estimates the number of categories and constructs global prototypes by locating high-density regions in the representation space through bi-level clustering. To mitigate pseudo-label noise, GPA then employs a semantic-weighted loss to capture correlations between global prototypes and the novel data. This semantic weighting strategy is also used for contrastive loss, facilitating unsupervised novel-category learning. Besides, GPA incorporates a mixup-based mechanism for both data and models, effectively mitigating interference between known and novel categories while alleviating forgetting. Extensive experiments across multiple datasets demonstrate GPA’s superiority over state-of-the-art baseline approaches. Notably, GPA achieves absolute gains of 5.7\% to 13.1\% in novel category accuracy while preserving known category performance. Furthermore, GPA is highly adaptable, equipping various mainstream FL algorithms with category discovery …
Poster
Hoang Phan · Tung Lam Tran · Quyen Tran · Ngoc Tran · Tuan Truong · Qi Lei · Nhat Ho · Dinh Phung · Trung Le

[ Exhibit Hall I ]

Abstract
Multi-task learning (MTL) trains deep neural networks to optimize several objectives simultaneously using a shared backbone, which leads to reduced computational costs, improved data efficiency, and enhanced performance through cross-task knowledge sharing. Although recent gradient manipulation techniques seek a common descent direction to benefit all tasks, conventional empirical loss minimization still leaves models prone to overfitting and gradient conflicts. To address this, we introduce a novel MTL framework that leverages weight perturbation to regulate gradient norms and thus improve generalization. By carefully modulating weight perturbations, our approach harmonizes task-specific gradients, reducing conflicts and encouraging more robust learning across tasks. Theoretical insights reveal that controlling the gradient norm through weight perturbation directly contributes to better generalization. Extensive experiments across diverse applications demonstrate that our method significantly outperforms existing gradient-based MTL techniques in terms of task performance and overall model robustness.
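A SAM-style sketch of one weight-perturbation step for multi-task training, assuming PyTorch. The perturbation radius, the use of the summed task loss, and the two-pass structure are illustrative assumptions rather than the paper's exact algorithm.

```python
import torch

def perturbed_mtl_step(model, task_losses_fn, optimizer, rho=0.05):
    # task_losses_fn(model) -> list of per-task scalar losses on a shared batch.
    optimizer.zero_grad()
    sum(task_losses_fn(model)).backward()
    grads = [p.grad for p in model.parameters() if p.grad is not None]
    grad_norm = torch.sqrt(sum((g ** 2).sum() for g in grads)) + 1e-12
    eps = {}
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is None:
                continue
            eps[p] = rho * p.grad / grad_norm   # ascent direction, radius rho
            p.add_(eps[p])                      # move to the perturbed weights
    optimizer.zero_grad()
    sum(task_losses_fn(model)).backward()       # task gradients at the perturbed point
    with torch.no_grad():
        for p, e in eps.items():
            p.sub_(e)                           # restore the original weights
    optimizer.step()                            # descend using the perturbed gradients
```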
Poster
Wufei Xie · Yalin Wang · Chenliang Liu · Zhaohui Jiang · Xue Yang

[ Exhibit Hall I ]

Abstract
Few-Shot Class-Incremental Learning (FSCIL) is challenged by limited data and expanding class spaces, leading to overfitting and catastrophic forgetting. Existing methods, which often freeze feature extractors and use Nearest Class Mean classifiers, sacrifice adaptability to new feature distributions. To address these issues, we propose Flexi-FSCIL, a semi-supervised framework that integrates three novel strategies: Adaptive Gated Residual Fusion (AGRF), Attention-Guided Dynamic Hybrid Distillation (ADHD), and Prototype Offset Equilibrium (POE). Flexi-FSCIL effectively balances stability and plasticity in FSCIL. AGRF resolves the rigidity of frozen feature extractors by integrating both frozen and trainable components, enabling adaptive feature learning while retaining old-class knowledge. ADHD tackles the imbalance between old and new tasks by dynamically aligning features using cross-attention maps and direct matching, preserving old-class knowledge while facilitating new-class learning. POE addresses the issue of prototype drift in semi-supervised settings by selecting high-quality unlabeled samples, maintaining feature space separability and preventing overfitting. Evaluated on three benchmark datasets, Flexi-FSCIL achieves state-of-the-art performance, significantly outperforming existing FSCIL methods with only a 12.97 performance drop on CUB200.
Poster
Jiaqi Wu · Simin Chen · Jing Tang · Yuzhe YANG · Yiming Chen · Lixu Wang · Song Lin · Zehua Wang · Wei Chen · Zijian Tian

[ Exhibit Hall I ]

Abstract
General-purpose Vision-Language Models (VLMs) have driven major advancements in multimodal AI. Fine-tuning these models with task-specific data enhances adaptability to various downstream tasks but suffers from privacy risks. While potential solutions like federated learning can address user data privacy concerns, model protection is also essential. Other methods that rely on black-box VLM APIs usually require access to prediction logits, leaving them open to inversion attacks. Moreover, addressing the challenges of tuning complexity and data transmission efficiency in federated VLM scenarios is also crucial. To address these challenges, we propose FDPT—a federated discrete prompt tuning method utilizing black-box VLMs. During the client optimization stage, FDPT employs an agent-driven framework leveraging large language models (LLMs) with enhanced reasoning capacities to systematically optimize discrete prompt representations, and also utilizes feedback mechanisms and chain of thought to enhance prediction accuracy. Importantly, it performs optimization by relying not on the predicted logit vectors output by LLMs but on textual results, avoiding inversion attack risks. During the global aggregation stage, we mimic human electoral activities by employing evolutionary computation methods underpinned by semantic similarity computation to implement enhanced zero-order optimization for acquiring representative global tokens, thereby achieving knowledge aggregation. FDPT significantly outperforms nine state-of-the-art methods in image …
Poster
Muhammad Anwar Ma'sum · Mahardhika Pratama · Savitha Ramasamy · Lin Liu · H Habibullah · Ryszard Kowalczyk

[ Exhibit Hall I ]

Abstract
The data privacy constraint in online continual learning (OCL), where the data can be seen only once, complicates the catastrophic forgetting problem in streaming data. A common approach applied by the current SOTAs in OCL is to use memory to save exemplars or features from previous classes to be replayed in the current task. On the other hand, the prompt-based approach performs excellently in continual learning but at the cost of a growing number of trainable parameters. The first approach may not be applicable in practice due to data openness policy, while the second approach has the issue of throughput associated with the streaming data. In this study, we propose a novel prompt-based method for online continual learning that includes 4 main components: (1) a single light-weight prompt generator as general knowledge, (2) a trainable scaler-and-shifter as specific knowledge, (3) PTM generalization preserving, and (4) a hard-soft updates mechanism. Our proposed method achieves significantly higher performance than the current SOTAs on the CIFAR100, ImageNet-R, ImageNet-A, and CUB datasets. Our complexity analysis shows that our method requires a relatively smaller number of parameters and achieves moderate training time, inference time, and throughput. For further study, the source code of our method is available …
Poster
Lingyong Fang · Xinzhong Wang · Depeng depeng wang · Zongru Wu · Ya Guo · Huijia Zhu · Zhuosheng Zhang · Gongshen Liu

[ Exhibit Hall I ]

Abstract
Multimodal Large Language Models (MLLMs) contain a substantial amount of factual knowledge, which may become outdated or inaccurate over time. Consequently, various knowledge editing techniques have been proposed to update the knowledge encoded within these models. Previous approaches maintain modality consistency during both the editing and testing phases. However, in practical applications, it is desirable for knowledge to be transferable across different modalities, which can enhance the robustness of knowledge editing and potentially allow for cost-effective editing of multimodal knowledge using textual information. To address this, we introduce the concept of Transitivity of Multimodal Knowledge Editing (TMKE) and design corresponding evaluation criteria. Subsequently, we construct a corresponding TMKE Benchmark through an automated pipeline. We evaluate three MLLMs and five knowledge editing methods, uncovering limitations in the current models and methods concerning transitivity. Additionally, we analyze the intrinsic representations of the model during the editing process based on Knowledge Neurons to interpret the experimental phenomena.
Poster
Mustafa Shukor · Enrico Fini · Victor Guilherme Turrisi da Costa · Matthieu Cord · Joshua Susskind · Alaaeldin El-Nouby

[ Exhibit Hall I ]

Abstract
Building general-purpose models that can effectively perceive the world through multimodal signals has been a long-standing goal. Current approaches involve integrating separately pre-trained components, such as connecting vision encoders to LLMs and continuing training on multimodal data. While such approaches exhibit remarkable sample efficiency, it remains an open question whether such late-fusion architectures are inherently superior. In this work, we revisit the architectural design of native multimodal models (NMMs)—those trained from the ground up on all modalities—and conduct an extensive scaling laws study, spanning 457 trained models with different architectures and training mixtures. Our investigation reveals no inherent advantage to late-fusion architectures over early-fusion ones, which do not rely on pre-trained image encoders or tokenizers. On the contrary, early-fusion exhibits stronger performance at lower parameter count, is more efficient to train, and is easier to deploy. Motivated by the strong performance of the early-fusion architectures, we show that incorporating Mixture of Experts (MoEs) allows models to learn modality-specific weights, significantly benefiting performance.
Poster
Jerred Chen · Ronald Clark

[ Exhibit Hall I ]

Abstract
In many robotics and VR/AR applications, fast camera motions cause a high level of motion blur, causing existing camera pose estimation methods to fail. In this work, we propose a novel framework that leverages motion blur as a rich cue for motion estimation rather than treating it as an unwanted artifact. Our approach works by predicting a dense motion flow field and a monocular depth map directly from a single motion-blurred image. We then recover the instantaneous camera velocity by solving a linear least squares problem under the small motion assumption. In essence, our method produces an IMU-like measurement that robustly captures fast and aggressive camera movements. To train our model, we construct a large-scale dataset with realistic synthetic motion blur derived from ScanNet++v2 and further refine our model by training end-to-end on real data using our fully differentiable pipeline. Extensive evaluations on real-world benchmarks demonstrate that our method achieves state-of-the-art angular and translational velocity estimates, outperforming current methods like MASt3R and COLMAP.
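The 'linear least squares under the small motion assumption' step corresponds to the classical instantaneous motion-field model in normalized camera coordinates; the sketch below is a generic version of that step and omits any weighting or robustification the paper may use.

```python
import numpy as np

def velocity_from_flow_and_depth(xs, ys, flow, depth):
    # xs, ys : (N,) normalized pixel coordinates
    # flow   : (N, 2) predicted motion field (u, v) per pixel
    # depth  : (N,) predicted monocular depth per pixel
    # Solves flow = A(x, y, Z) @ [v; w] for linear velocity v and angular velocity w.
    invz = 1.0 / depth
    zeros = np.zeros_like(xs)
    Au = np.stack([-invz, zeros, xs * invz, xs * ys, -(1 + xs**2), ys], axis=1)
    Av = np.stack([zeros, -invz, ys * invz, 1 + ys**2, -xs * ys, -xs], axis=1)
    A = np.concatenate([Au, Av], axis=0)                  # (2N, 6)
    b = np.concatenate([flow[:, 0], flow[:, 1]], axis=0)  # (2N,)
    xi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return xi[:3], xi[3:]                                 # (v, w)
```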
Poster
Zhen Zeng · Leijiang Gu · Xun Yang · Zhangling Duan · Zenglin Shi · Meng Wang

[ Exhibit Hall I ]

Abstract
Existing knowledge editing works for MultiModal Large Language Models primarily focus on text-oriented, coarse-grained scenarios, where modifying textual content alone is sufficient. As a result, they fail to capture the unique challenges of multimodal editing, particularly when visual information is central to knowledge representation. In this paper, we introduce a visual-oriented, fine-grained multimodal knowledge editing task that targets precise modifications in images containing multiple interacting entities. To support this, we propose the Fine-Grained Visual Knowledge Editing (FGVEdit) benchmark, designed to evaluate the accuracy and effectiveness of multimodal editing at a granular level. To address this challenge, we present the Multimodal Scope Classifier-based Knowledge Editor (MSCKE), a new framework that leverages a multimodal scope classifier to integrate both textual and visual information. By accurately identifying and updating knowledge localized within images, MSCKE ensures precise editing while preserving unrelated content. Extensive experiments on the FGVEdit benchmark highlight the complexity of this new task and demonstrate that existing methods struggle with fine-grained multimodal editing. Our results highlight MSCKE as a scalable and promising framework for advancing multimodal knowledge editing.
Poster
Xingyu Zhu · Shuo Wang · Beier Zhu · Miaoge Li · Yunfan Li · Junfeng Fang · Zhicai Wang · Dongsheng Wang · Hanwang Zhang

[ Exhibit Hall I ]

Abstract
With the increasing attention to pre-trained vision-language models (VLMs), e.g., CLIP, substantial efforts have been devoted to many downstream tasks, especially in test-time adaptation (TTA). However, previous works focus on learning prototypes only in the textual modality while overlooking the ambiguous semantics in class names. These ambiguities lead to textual prototypes that are insufficient to capture visual concepts, resulting in limited performance. To address this issue, we introduce **ProtoMM**, a training-free framework that constructs multimodal prototypes to adapt VLMs during the test time. By viewing the prototype as a discrete distribution over the textual descriptions and visual particles, ProtoMM has the ability to combine the multimodal features for comprehensive prototype learning. More importantly, the visual particles are dynamically updated as the testing stream flows. This allows our multimodal prototypes to continually learn from the data, enhancing their generalizability in unseen scenarios. In addition, we quantify the importance of the prototypes and test images by formulating their semantic distance as an optimal transport problem. Extensive experiments on 15 zero-shot benchmarks demonstrate the effectiveness of our method, achieving a 1.03\% average accuracy improvement over state-of-the-art methods on ImageNet and its variant datasets.
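
As a rough illustration of casting the prototype-to-image semantic distance as an optimal transport problem, the sketch below computes an entropic (Sinkhorn) OT distance between a set of prototype "particles" and a set of image patch features. The solver choice, the cost normalization, and all names (`sinkhorn_distance`, `prototype`, `image_patches`) are assumptions for illustration, not ProtoMM's exact formulation.

```python
import numpy as np

def sinkhorn_distance(X, Y, eps=0.1, iters=200):
    """Entropic OT distance between two feature sets X (n, d) and Y (m, d)."""
    C = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1) ** 2
    C = C / C.max()                            # normalize the cost for stability
    a = np.full(X.shape[0], 1.0 / X.shape[0])  # uniform weights on both sides
    b = np.full(Y.shape[0], 1.0 / Y.shape[0])
    K = np.exp(-C / eps)
    u = np.ones_like(a)
    for _ in range(iters):                     # Sinkhorn scaling iterations
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]            # transport plan
    return float(np.sum(P * C))

# prototype "particles" (text + visual) vs. patch features of a test image
prototype = np.random.randn(8, 64)
image_patches = np.random.randn(16, 64)
print(sinkhorn_distance(prototype, image_patches))
```
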
Poster
Pengzhan Sun · Junbin Xiao · Tze Ho Elden Tse · Yicong Li · Arjun Akula · Angela Yao

[ Exhibit Hall I ]

Abstract
Visual grounding associates textual descriptions with objects in an image. Conventional methods target third-person image inputs and named object queries. In applications such as AI assistants, the perspective shifts -- inputs are egocentric, and objects may be referred to implicitly through needs and intentions. To bridge this gap, we introduce EgoIntention, the first dataset for egocentric visual intention grounding. EgoIntention challenges multimodal LLMs to 1) understand and ignore unintended contextual objects and 2) reason about uncommon object functionalities. Benchmark results show that current models misidentify context objects and lack affordance understanding in egocentric views. We also propose Reason-to-Ground (RoG) instruction tuning; it enables hybrid training with normal descriptions and egocentric intentions with a chained intention reasoning and object grounding mechanism. RoG significantly outperforms naive finetuning and hybrid training on EgoIntention, while maintaining or slightly improving naive description grounding. This advancement enables unified visual grounding for egocentric and exocentric visual inputs while handling explicit object queries and implicit human intentions.
Poster
Zeqiang Lai · Zhao Yunfei · Zibo Zhao · Haolin Liu · Fu-Yun Wang · Huiwen Shi · Xianghui Yang · Qingxiang Lin · Jingwei Huang · Lliu Yuhong · Jie Jiang · Chunchao Guo · Xiangyu Yue

[ Exhibit Hall I ]

Abstract
3D shape generation has greatly flourished through the development of so-called "native" 3D diffusion, particularly through the Vectset Diffusion Model (VDM). While recent advancements have shown promising results in generating high-resolution 3D shapes, VDM still struggles with high-speed generation. Challenges arise not only from the difficulty of accelerating diffusion sampling but also from VAE decoding in VDM -- areas under-explored in previous works. To address these challenges, we present FlashVDM, a systematic framework for accelerating both VAE and DiT in VDM. For DiT, FlashVDM enables flexible diffusion sampling with as few as 5 inference steps, while maintaining comparable quality, which is made possible by stabilizing consistency distillation with our newly introduced Progressive Flow Distillation technique. For VAE, we introduce a lightning vectset decoder equipped with Adaptive KV Selection, Hierarchical Volume Decoding, and Efficient Network Design. By exploiting the locality of the vectset and the sparsity of the shape surface in the volume, the proposed decoder drastically lowers FLOPs, minimizing the overall decoding overhead. We apply FlashVDM to the current state-of-the-art open-source shape generation model Hunyuan3D-2, resulting in Hunyuan3D-2 Turbo. Through systematic evaluation for both generation and reconstruction, we demonstrate that our model outperforms existing fast 3D generation methods by a significant margin, achieving …
Poster
Xin Dong · Shichao Dong · Jin Wang · Jing Huang · Li Zhou · Zenghui Sun · Lihua Jing · Jinsong Lan · Xiaoyong Zhu · Bo Zheng

[ Exhibit Hall I ]

Abstract
Hallucinations in large vision-language models (LVLMs) pose significant challenges for real-world applications, as LVLMs may generate responses that appear plausible yet remain inconsistent with the associated visual content. This issue rarely occurs in human cognition. We argue that this discrepancy arises from humans' ability to effectively leverage multimodal interaction information in data samples. Specifically, humans typically first gather multimodal information, analyze the interactions across modalities for understanding, and then express their understanding through language. Motivated by this observation, we conduct extensive experiments on popular LVLMs and obtain insights that surprisingly reveal human-like, though less pronounced, cognitive behavior of LVLMs on multimodal samples. Building on these findings, we further propose $\textbf{INTER}: \textbf{Inter}$action Guidance Sampling, a novel training-free algorithm that mitigates hallucinations without requiring additional data. Specifically, INTER explicitly guides LVLMs to effectively reapply their understanding of multimodal interaction information when generating responses, thereby reducing potential hallucinations. On six benchmarks including VQA and image captioning tasks, INTER achieves an average improvement of up to 3.4\% on five LVLMs compared to the state-of-the-art decoding strategy. The code will be released when the paper is accepted.
Poster
Zhangquan Chen · Xufang Luo · Dongsheng Li

[ Exhibit Hall I ]

Abstract
Visual understanding is inherently intention-driven—humans selectively focus on different regions of a scene based on their goals. Recent advances in large multimodal models (LMMs) enable flexible expression of such intentions through natural language, allowing queries to guide visual reasoning processes. Frameworks like Visual Chain-of-Thought have demonstrated the benefit of incorporating explicit reasoning steps, where the model predicts a focus region before answering a query. However, existing approaches rely heavily on supervised training with annotated intermediate bounding boxes, which severely limits scalability due to the combinatorial explosion of intention-region pairs. To overcome this limitation, we propose VisRL, the first framework that applies reinforcement learning (RL) to the problem of intention-driven visual perception. VisRL optimizes the entire visual reasoning process using only reward signals. By treating intermediate focus selection as an internal decision optimized through trial-and-error, our method eliminates the need for costly region annotations while aligning more closely with how humans learn to perceive the world. Extensive experiments across multiple benchmarks show that VisRL consistently outperforms strong baselines, demonstrating both its effectiveness and its strong generalization across different LMMs.
Poster
JiaKui Hu · Yuxiao Yang · Jialun Liu · Jinbo Wu · Chen Zhao · Yanye Lu

[ Exhibit Hall I ]

Abstract
Generating multi-view images from human instructions is crucial for 3D content creation. The primary challenges involve maintaining consistency across multiple views and effectively synthesizing shapes and textures under diverse conditions. In this paper, we propose the Multi-View Auto-Regressive (\textbf{MV-AR}) method, which leverages an auto-regressive model to progressively generate consistent multi-view images from arbitrary prompts. Firstly, the next-token-prediction capability of the AR model significantly enhances its effectiveness in facilitating progressive multi-view synthesis. When generating widely-separated views, MV-AR can utilize all its preceding views to extract effective reference information. Subsequently, we propose a unified model that accommodates various prompts via architecture design and training strategies. To address multiple conditions, we introduce condition injection modules for text, camera pose, image, and shape. To manage multi-modal conditions simultaneously, a progressive training strategy is employed. This strategy initially adopts the text-to-multi-view (t2mv) model as a baseline to enhance the development of a comprehensive X-to-multi-view (X2mv) model by randomly dropping and combining conditions. Finally, to alleviate the overfitting problem caused by limited high-quality data, we propose the "Shuffle View" data augmentation technique, thus significantly expanding the training data by several orders of magnitude. Experiments demonstrate the performance and versatility of our MV-AR, which consistently generates consistent multi-view …
Poster
Qi Li · Runpeng Yu · Xinchao Wang

[ Exhibit Hall I ]

Abstract
Parameter-level model merging is an emerging paradigm in multi-task learning with significant promise. Previous research has explored its connections with prediction-level model ensembling—commonly viewed as the upper bound for merging—to reveal the potential of achieving performance consistency between the two. However, this observation relies on certain preconditions, such as being limited to two models, using ViT-based architectures, and requiring all models to be fine-tuned from the same pre-trained checkpoint. To further understand the intrinsic connections between these two paradigms, this paper explores an interesting possibility: If these restrictions are removed, can performance consistency still be achieved between merging and ensembling? To answer this question, we first theoretically establish a performance correlation between merging and ensembling. We find that even when previous restrictions are not met, there is still a way for model merging to attain performance nearly identical, and in some cases superior, to that of ensembling. To verify whether our findings are practical, we introduce a validation framework termed \underline{Neu}ral \underline{Lig}and (NeuLig). The learning process of NeuLig is meticulously designed with a specialized loss function supported by theoretical foundations. Experimental results demonstrate the robust resilience of NeuLig in terms of both model scale and the number of collaborating models. For instance, for the …
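
A minimal numeric sketch of the two paradigms being compared: parameter-level merging averages the weights before predicting, while prediction-level ensembling averages the predictions. The toy linear heads and the simple 50/50 average are assumptions; NeuLig's learned combination is not reproduced here.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 16))                                   # a small batch of features
W1, W2 = rng.normal(size=(16, 10)), rng.normal(size=(16, 10))  # two task-specific heads

# parameter-level merging: average the weights, then predict once
merged_pred = softmax(x @ ((W1 + W2) / 2))

# prediction-level ensembling: predict with each model, then average
ensemble_pred = (softmax(x @ W1) + softmax(x @ W2)) / 2

print(np.abs(merged_pred - ensemble_pred).max())   # the consistency gap being studied
```
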
Poster
Yiming Cui · Liang Li · Haibing YIN · Yuhan Gao · Yaoqi Sun · Chenggang Yan

[ Exhibit Hall I ]

Abstract
Day-to-Night Domain Adaptive Object Detection (DN-DAOD) is a significant challenge due to the low visibility and signal-to-noise ratio at night. Although recent self-training approaches achieve promising results, they fail to address three critical biases: distribution bias, training bias, and confirmation bias. Therefore, we propose a Debiased Teacher to address the above biases from three aspects: domain transforming, representation compensating, and pseudo label calibrating. Concretely, the day-to-night domain transforming module (DNDT) leverages physical priors to model some key day-night domain differences, thus transforming daytime images into night-like images. Then, the cross-domain representation compensating module (CDRC) selectively mixes objects from nighttime and night-like images to compensate for the model’s general representation of nighttime objects. Further, to correct confirmation bias caused by learning from inaccurate pseudo labels, the pseudo label confirmation calibrating module (ConCal) is designed to obtain accurate pseudo labels for better nighttime knowledge learning. Experimental results on three benchmarks demonstrate that our method outperforms current SOTA methods by a large margin. Our code is released in supplementary materials.
Poster
Chuang Yu · Jinmiao Zhao · Yunpeng Liu · Sicheng Zhao · Yimian Dai · Xiangyu Yue

[ Exhibit Hall I ]

Abstract
Recently, single-frame infrared small target (SIRST) detection with single point supervision has drawn widespread attention. However, the latest label evolution with single point supervision (LESPS) framework suffers from instability, excessive label evolution, and difficulty in fully exploiting the performance of the embedded network. Inspired by organisms gradually adapting to their environment and continuously accumulating knowledge, we construct an innovative Progressive Active Learning (PAL) framework for single point supervision, which drives existing SIRST detection networks to progressively and actively recognize and learn more hard samples, achieving significant performance improvements. Specifically, to prevent the early low-performance model from wrongly selecting hard samples, we propose a model pre-start concept, which focuses on automatically selecting a portion of easy samples and helping the model acquire basic task-specific learning capabilities. Meanwhile, we propose a refined dual-update strategy, which can promote reasonable learning of harder samples and continuous refinement of pseudo-labels. In addition, to alleviate the risk of excessive label evolution, a decay factor is reasonably introduced, which helps to achieve a dynamic balance between the expansion and contraction of target annotations. Extensive experiments show that existing SIRST detection networks equipped with our PAL framework have achieved state-of-the-art (SOTA) results on multiple public datasets. Furthermore, our …
Poster
Qi Wang · Zhipeng Zhang · Baao Xie · Xin Jin · Yunbo Wang · Shiyu Wang · Liaomo Zheng · Xiaokang Yang · Wenjun Zeng

[ Exhibit Hall I ]

Abstract
Training visual reinforcement learning (RL) in practical scenarios presents a significant challenge, $\textit{i.e.,}$ RL agents suffer from low sample efficiency in environments with variations. While various approaches have attempted to alleviate this issue by disentanglement representation learning, these methods usually start learning from scratch without prior knowledge of the world. This paper, in contrast, tries to learn and understand underlying semantic variations from distracting videos via offline-to-online latent distillation and flexible disentanglement constraints. To enable effective cross-domain semantic knowledge transfer, we introduce an interpretable model-based RL framework, dubbed Disentangled World Models (DisWM). Specifically, we pretrain the action-free video prediction model offline with disentanglement regularization to extract semantic knowledge from distracting videos. The disentanglement capability of the pretrained model is then transferred to the world model through latent distillation. For finetuning in the online environment, we exploit the knowledge from the pretrained model and introduce a disentanglement constraint to the world model. During the adaptation phase, the incorporation of actions and rewards from online environment interactions enriches the diversity of the data, which in turn strengthens the disentangled representation learning. Experimental results validate the superiority of our approach on various benchmarks.
Poster
Mutian Xu · Chongjie Ye · Haolin Liu · Yushuang Wu · Jiahao Chang · Xiaoguang Han

[ Exhibit Hall I ]

Abstract
3D data simulation aims to bridge the gap between simulated and real-captured 3D data, which is a fundamental problem for real-world 3D visual tasks. Most 3D data simulation methods inject predefined physical priors but struggle to capture the full complexity of real data. An optimal approach involves learning an implicit mapping from synthetic to realistic data in a data-driven manner, but progress along this path has stagnated in recent studies. This work explores a new solution path of data-driven 3D simulation, called Stable-Sim2Real, based on a novel two-stage depth diffusion model. The initial stage finetunes StableDiffusion to generate the residual between the real and synthetic paired depth, producing a stable but coarse depth, where some local regions may deviate from realistic patterns. To enhance this, both the synthetic and initial output depth are fed into a second-stage diffusion, where the diffusion loss is adjusted to prioritize the distinct areas identified by a 3D discriminator. We provide a new benchmark scheme to evaluate 3D data simulation methods. Extensive experiments show that training the network with the 3D simulated data derived from our method significantly enhances performance in real-world 3D visual tasks. Moreover, the evaluation demonstrates the high similarity between our 3D …
Poster
Mingqi Yuan · Bo Li · Xin Jin · Wenjun Zeng

[ Exhibit Hall I ]

Abstract
Hyperparameter optimization (HPO) is a billion-dollar problem in machine learning, as it significantly impacts training efficiency and model performance. However, achieving efficient and robust HPO in deep reinforcement learning (RL) is consistently challenging due to its high non-stationarity and computational cost. To tackle this problem, existing approaches attempt to adapt common HPO techniques (e.g., population-based training or Bayesian optimization) to the RL scenario. However, they remain sample-inefficient and computationally expensive, which prevents their use in a wide range of applications. In this paper, we propose ULTHO, an ultra-lightweight yet powerful framework for fast HPO in deep RL within single runs. Specifically, we formulate the HPO process as a multi-armed bandit with clustered arms (MABC) and link it directly to long-term return optimization. ULTHO also provides a quantified and statistical perspective to filter the hyperparameters efficiently. We test ULTHO on benchmarks including ALE, Procgen, MiniGrid, and PyBullet. Extensive experiments demonstrate that ULTHO can achieve superior performance with a simple architecture, contributing to the development of advanced and automated RL systems.
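
For intuition, the sketch below runs a generic UCB-style bandit over hyperparameter arms laid out by cluster, which is roughly the structure a multi-armed bandit with clustered arms suggests; the reward function is a synthetic placeholder, and the clustered-arm update rule of ULTHO itself is not reproduced.

```python
import numpy as np

rng = np.random.default_rng(0)

# clustered arms: each cluster is one hyperparameter, each arm one candidate value
clusters = {"lr": [1e-4, 3e-4, 1e-3], "entropy_coef": [1e-3, 1e-2, 5e-2]}
arms = [(name, value) for name, values in clusters.items() for value in values]

counts = np.zeros(len(arms))
means = np.zeros(len(arms))

def episode_return(arm):
    # placeholder for the long-term return observed with this HP choice
    _, value = arm
    return rng.normal(loc=-abs(np.log10(value) + 3.5), scale=0.1)

for t in range(1, 201):
    ucb = means + np.sqrt(2.0 * np.log(t) / np.maximum(counts, 1e-8))
    ucb[counts == 0] = np.inf                   # try every arm at least once
    i = int(np.argmax(ucb))
    r = episode_return(arms[i])
    counts[i] += 1
    means[i] += (r - means[i]) / counts[i]      # incremental mean update

print("preferred arm:", arms[int(np.argmax(means))])
```
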
Poster
Hualong Ke · Yachao Zhang · Jiangming Shi · FangyongWang FangyongWang · Yuan Xie · Yanyun Qu

[ Exhibit Hall I ]

Abstract
Federated Continual Learning (FCL) has recently garnered significant attention due to its ability to continuously learn new tasks while protecting user privacy. However, existing Data-Free Knowledge Transfer (DFKT) methods require training the entire model, leading to high training and communication costs, while prompt pool-based methods, which access other task-specific prompts in the pool, may pose privacy leakage risks. To address these challenges, we propose a novel method: Task-aware Prompt gradient Projection and Replay (TPPR), which leverages visual prompts to build a parameter-efficient tuning architecture, thereby significantly reducing training and communication costs. Specifically, we propose the Task-Aware Prompt Gradient Projection (TAPGP) mechanism, from the perspective of protecting learned knowledge, to balance the learning of task-agnostic and task-specific knowledge in a pool-free manner. In practice, we make the gradient of the deep prompts orthogonal to the virtual data and prompts of preceding tasks, which prevents the erosion of old task knowledge while allowing the model to learn new information. Additionally, we introduce Dual-Level Prompt Replay (DLPR) based on exponential moving average to facilitate knowledge review at both inter-task and intra-task levels, effectively inheriting learned knowledge. Extensive experimental results demonstrate that our method effectively reduces model communication overhead and alleviates forgetting while fully …
Poster
Simon Reiß · Zdravko Marinov · Alexander Jaus · Constantin Seibold · M. Sarfraz · Erik Rodner · Rainer Stiefelhagen

[ Exhibit Hall I ]

Abstract
In this paper, we explore the potential of visual in-context learning to enable a single model to handle multiple tasks and adapt to new tasks during test time without re-training. Unlike previous approaches, our focus is on training in-context learners to adapt to sequences of tasks, rather than individual tasks. Our goal is to solve complex tasks that involve multiple intermediate steps using a single model, allowing users to define entire vision pipelines flexibly at test time. To achieve this, we first examine the properties and limitations of visual in-context learning architectures, with a particular focus on the role of codebooks. We then introduce a novel method for training in-context learners using a synthetic compositional task generation engine. This engine bootstraps task sequences from arbitrary segmentation datasets, enabling the training of visual in-context learners for compositional tasks. Additionally, we investigate different masking-based training objectives to gather insights into how to train models better for solving complex, compositional tasks. Our exploration not only provides important insights especially for multi-modal medical task sequences but also highlights challenges that need to be addressed.
Poster
Yuwei Yang · Zeyu Zhang · Yunzhong Hou · Zhuowan Li · Gaowen Liu · Ali Payani · Yuan-Sen Ting · Liang Zheng

[ Exhibit Hall I ]

Abstract
Being able to effectively read scientific plots, or chart understanding, is central to building effective agents for science. However, existing multimodal large language models (MLLMs), especially open-source ones, are still falling behind with a typical success rate of 30\%-50\% on challenging benchmarks. Previous studies on fine-tuning MLLMs with synthetic charts are often restricted by their inadequate similarity to the real charts, which could compromise model training and performance on complex real-world charts. In this study, we show that modularizing chart generation and diversifying visual details improves chart understanding capabilities. In particular, we design a five-step data synthesis pipeline, where we separate data and function creation for single plot generation, condition the generation of later subplots on earlier ones for multi-subplot figures, visually diversify the generated figures, filter out low-quality data, and finally generate the question-answer (QA) pairs with GPT-4o. This approach allows us to streamline the generation of fine-tuning datasets and introduce the Effective Chart Dataset (ECD), which contains 10k+ chart images and 300k+ QA pairs, covering 25 topics and featuring 250+ chart type combinations with high visual complexity. We show that ECD consistently improves the performance of various MLLMs on a range of real-world and synthetic test …
Poster
lee hyuck · Taemin Park · Heeyoung Kim

[ Exhibit Hall I ]

Abstract
In class-imbalanced learning (CIL), post-hoc logit adjustment (LA) effectively mitigates class imbalance by adjusting biased logits according to label frequencies. Given the success of LA in CIL, recent class-imbalanced semi-supervised learning (CISSL) algorithms incorporated LA, leading to improved performance when labeled and unlabeled datasets share the same class distribution. However, a common real-world scenario involves the unknown class distribution of the unlabeled set, which may mismatch that of the labeled set. In this case, LA may result in an inappropriate degree of logit adjustments, potentially degrading classification performance due to its inability to incorporate the unknown class distribution of the unlabeled set. To address this problem, we propose a novel CISSL algorithm named learnable logit adjustment (LLA). Unlike the original LA, LLA learns the appropriate degree of logit adjustment by minimizing the class-averaged loss computed for both the labeled and unlabeled sets. Based on the learned degree, LLA refines the biased pseudo-labels of base semi-supervised learning algorithms and adjusts the biased class predictions on the test set by adjusting the logits. Experimental results on benchmark datasets demonstrate that LLA achieves state-of-the-art performance in CISSL.
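
For reference, the sketch below shows the basic post-hoc logit adjustment that LLA generalizes: logits are shifted by a tunable multiple `tau` of the log class prior. In LLA the degree is learned from the labeled and unlabeled losses; here `tau` is simply swept over a few values, and all names are illustrative.

```python
import numpy as np

def adjust_logits(logits, class_prior, tau=1.0):
    """Post-hoc logit adjustment: subtract tau * log(prior) from each class logit."""
    return logits - tau * np.log(class_prior + 1e-12)

# toy example: a head-biased classifier on a long-tailed label distribution
class_prior = np.array([0.7, 0.2, 0.1])
logits = np.array([[2.0, 1.8, 1.7]])

for tau in (0.0, 0.5, 1.0, 2.0):
    pred = int(np.argmax(adjust_logits(logits, class_prior, tau), axis=1)[0])
    print(f"tau={tau}: predicted class {pred}")
```
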
Poster
Yuchen Zhou · Jiayu Tang · Xiaoyan Xiao · Yueyao Lin · Linkai Liu · Zipeng Guo · Hao Fei · Xiaobo Xia · Chao Gou

[ Exhibit Hall I ]

Abstract
Modeling task-driven attention in driving is a fundamental challenge for both autonomous vehicles and cognitive science. Existing methods primarily predict where drivers look by generating spatial heatmaps, but fail to capture the cognitive motivations behind attention allocation in specific contexts, which limits deeper understanding of attention mechanisms. To bridge this gap, we introduce Explainable Driver Attention Prediction, a novel task paradigm that jointly predicts spatial attention regions (where), parses attended semantics (what), and provides cognitive reasoning for attention allocation (why). To support this, we present W³DA, the first large-scale explainable driver attention dataset. It enriches existing benchmarks with detailed semantic and causal annotations across diverse driving scenarios, including normal conditions, safety-critical situations, and traffic accidents. We further propose LLada, a Large Language model-driven framework for driver attention prediction, which unifies pixel modeling, semantic parsing, and cognitive reasoning within an end-to-end architecture. Extensive experiments demonstrate the effectiveness of LLada, exhibiting robust generalization across datasets and driving conditions. This work serves as a key step toward a deeper understanding of driver attention mechanisms, with significant implications for autonomous driving, intelligent driver training, and human-computer interaction. The dataset, code, and models will be released.
Poster
Ma Teng · Xiaojun Jia · Ranjie Duan · Xinfeng Li · Yihao Huang · Xiaoshuang Jia · Zhixuan Chu · Wenqi Ren

[ Exhibit Hall I ]

Abstract
With the rapid advancement of multimodal large language models (MLLMs), concerns regarding their security have increasingly captured the attention of both academia and industry. Although MLLMs are vulnerable to jailbreak attacks, designing effective jailbreak attacks poses unique challenges, especially given the highly constrained adversarial capabilities in real-world deployment scenarios. Previous works concentrate risks into a single modality, resulting in limited jailbreak performance. In this paper, we propose a heuristic-induced multimodal risk distribution jailbreak attack method, called HIMRD, which is black-box and consists of two elements: multimodal risk distribution strategy and heuristic-induced search strategy. The multimodal risk distribution strategy is used to distribute harmful semantics into multiple modalities to effectively circumvent the single-modality protection mechanisms of MLLMs. The heuristic-induced search strategy identifies two types of prompts: the understanding-enhancing prompt, which helps MLLMs reconstruct the malicious prompt, and the inducing prompt, which increases the likelihood of affirmative outputs over refusals, enabling a successful jailbreak attack. HIMRD achieves an average attack success rate (ASR) of 90% across seven open-source MLLMs and an average ASR of around 68% in three closed-source MLLMs. HIMRD reveals cross-modal security vulnerabilities in current MLLMs and underscores the imperative for developing defensive strategies to mitigate such emerging risks.
Poster
Qianqian Wang · Bowen Zhao · Zhengming Ding · Wei Feng · Quanxue Gao

[ Exhibit Hall I ]

Abstract
Existing hypergraph clustering methods typically assume that node attributes are fully available. However, in real-world scenarios, missing node attributes are common due to factors such as data privacy concerns or failures in data collection devices. While some approaches attempt to handle missing attributes in traditional graphs, they are not designed for hypergraphs, which encode higher-order relationships and introduce additional challenges. To bridge this gap, we propose \textbf{H}ypergraph \textbf{C}lustering \textbf{N}etwork with \textbf{P}artial \textbf{A}ttribute \textbf{I}mputation (HCN-PAI). Specifically, we first leverage higher-order neighborhood propagation to impute missing node attributes by minimizing the Dirichlet energy, ensuring smooth feature propagation across the hypergraph. Next, we introduce a hypergraph smoothing preprocessing that efficiently captures structural information, replacing the hypergraph convolution operation, and significantly reducing computational costs. Finally, we design a dual-space projection contrast mechanism, which employs two independent MLPs to encode node representations into two distinct views and enforces consistency at both node and hyperedge levels. Extensive experiments on multiple benchmark datasets validate the effectiveness and superiority of our proposed method.
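
A hedged sketch of the imputation idea: iteratively replace missing node features with the average of their neighbors while clamping observed ones, which drives a Dirichlet-energy-style smoothness objective downward. Using a plain adjacency matrix (e.g. a clique expansion of the hypergraph) and the function and argument names below are assumptions, not the paper's exact propagation scheme.

```python
import numpy as np

def impute_by_propagation(X, observed_mask, adj, iters=50):
    """Fill missing node features by repeated neighborhood averaging.

    X            : (n, d) features; rows with observed_mask=False are arbitrary
    observed_mask: (n,) True where attributes are known
    adj          : (n, n) symmetric adjacency, e.g. a clique expansion of the hypergraph
    """
    deg = np.maximum(adj.sum(axis=1, keepdims=True), 1e-12)
    P = adj / deg                              # row-normalized propagation matrix
    Z = np.where(observed_mask[:, None], X, 0.0)
    for _ in range(iters):
        Z = P @ Z                              # smooth features over neighbors
        Z[observed_mask] = X[observed_mask]    # clamp the observed attributes
    return Z

# toy usage
rng = np.random.default_rng(0)
n, d = 6, 4
adj = (rng.random((n, n)) < 0.5).astype(float)
adj = np.maximum(adj, adj.T)
np.fill_diagonal(adj, 0.0)
X = rng.normal(size=(n, d))
mask = np.array([True, True, False, True, False, True])
X_filled = impute_by_propagation(X, mask, adj)
```
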
Poster
Weijia Zhang · Fei Xie · Weidong Cai · Chao Ma

[ Exhibit Hall I ]

Abstract
Knowledge distillation (KD) aims to transfer the knowledge of a more capable yet cumbersome teacher model to a lightweight student model. In recent years, relation-based KD methods have fallen behind, as instance-matching counterparts dominate in performance. In this paper, we revive relational KD by identifying and tackling several of its key issues, including its susceptibility to overfitting and spurious responses. Specifically, we transfer novelly constructed affinity graphs that compactly encapsulate a wealth of beneficial inter-sample, inter-class, and inter-view correlations by exploiting virtual views and relations as a new kind of knowledge. As a result, the student has access to rich guidance signals and stronger regularisation throughout the distillation process. To further mitigate the adverse impact of spurious responses, we prune the affinity graphs by dynamically detaching redundant and unreliable edges. Extensive experiments on CIFAR-100, ImageNet, and MS-COCO datasets demonstrate the superior performance of the proposed virtual relation matching (VRM) method over a range of tasks, architectures, and set-ups. For instance, VRM for the first time hits 74.0% accuracy for ResNet50-to-MobileNetV2 distillation on ImageNet, and improves DeiT-Ti by 14.44% on CIFAR-100 with a ResNet56 teacher. Code and models will be released.
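
The core relational idea can be sketched as matching inter-sample affinity graphs between teacher and student features, as below; the virtual views and relations and the dynamic edge pruning that define VRM are not reproduced, and the cosine affinity plus mean-squared matching loss is an assumed, generic choice.

```python
import numpy as np

def affinity(F):
    """Cosine-similarity affinity graph over a batch of features (n, d)."""
    F = F / np.maximum(np.linalg.norm(F, axis=1, keepdims=True), 1e-12)
    return F @ F.T

def relational_kd_loss(student_feats, teacher_feats):
    """Mean-squared error between student and teacher affinity graphs."""
    A_s, A_t = affinity(student_feats), affinity(teacher_feats)
    return float(np.mean((A_s - A_t) ** 2))

student_batch = np.random.randn(32, 128)   # student penultimate features
teacher_batch = np.random.randn(32, 256)   # teacher features may have a different width
print(relational_kd_loss(student_batch, teacher_batch))
```
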
Poster
Junjie Shan · Ziqi Zhao · Jialin Lu · Rui Zhang · SM Yiu · Ka-Ho Chow

[ Exhibit Hall I ]

Abstract
Foundation models that bridge vision and language have made significant progress. While they have inspired many life-enriching applications, their potential for abuse in creating new threats remains largely unexplored. In this paper, we reveal that vision-language models (VLMs) can be weaponized to enhance gradient inversion attacks (GIAs) in federated learning (FL), where an FL server attempts to reconstruct private data samples from gradients shared by victim clients. Despite recent advances, existing GIAs struggle to reconstruct high-resolution images when the victim has a large local data batch. One promising direction is to focus reconstruction on valuable samples rather than the entire batch, but current methods lack the flexibility to target specific data of interest. To address this gap, we propose Geminio, the first approach to transform GIAs into semantically meaningful, targeted attacks. It enables a brand new privacy attack experience: attackers can describe, in natural language, the data they consider valuable, and Geminio will prioritize reconstruction to focus on those high-value samples. This is achieved by leveraging a pretrained VLM to guide the optimization of a malicious global model that, when shared with and optimized by a victim, retains only gradients of samples that match the attacker-specified query. Geminio can be …
Poster
Pengkun Jiao · Bin Zhu · Jingjing Chen · Chong-Wah Ngo · Yu-Gang Jiang

[ Exhibit Hall I ]

Abstract
Efficient Visual Instruction Fine-Tuning (EVIT) seeks to adapt Multimodal Large Language Models (MLLMs) to downstream tasks with minimal computational overhead. However, as task diversity and complexity increase, EVIT faces significant challenges in resolving data conflicts. To address this limitation, we propose the Dual Low-Rank Adaptation (Dual-LoRA), a holistic-to-local framework that enhances the adapter's capacity to address data conflict through dual structural optimization. Specifically, we utilize two subspaces: a skill space for stable, holistic knowledge retention, and a rank-rectified task space that locally activates the holistic knowledge. Additionally, we introduce Visual Cue Enhancement (VCE), a multi-level local feature aggregation module designed to enrich the vision-language projection with local details. Our approach is both memory- and time-efficient, requiring only 1.16$\times$ the inference time of the standard LoRA method (with injection into the query and value projection layers), and just 73\% of the inference time of a 4-expert LoRA-MoE. Extensive experiments on various downstream tasks and general MLLM benchmarks validate the effectiveness of our proposed methods.
Poster
Sounak Mondal · Naveen Sendhilnathan · Ting Zhang · Yue Liu · Michael Proulx · Michael Iuzzolino · Chuan Qin · Tanya Jonker

[ Exhibit Hall I ]

Abstract
Decoding human intent from eye gaze during a visual search task has become an increasingly important capability within augmented and virtual reality systems. However, gaze target prediction models used within such systems are constrained by the predefined target categories found within available gaze data, limiting their generalizability to novel categories and their usefulness within real-world, interactive systems. In this work, we present the Gaze-Language Alignment Model (GLAM), a vision-language model that can generalize gaze target predictions to novel categories of search targets lacking gaze annotation. To do so, GLAM uses a novel gaze encoder to encode foveal and peripheral information of a gaze scanpath. The resultant gaze embeddings are aligned with language embeddings of large language model-generated search descriptions for associated target categories using a novel contrastive learning strategy called Gaze-Language Alignment Decomposition (GLAD). When used to train GLAM in a zero-shot setup, GLAD surpassed naive contrastive learning strategies by nearly one-third in target prediction accuracy, even outperforming a fully supervised baseline. Moreover, in a fully supervised setup, GLAM outperformed previous methods in target prediction accuracy, regardless of the training strategy used.
Poster
Shanshan Yan · Zexi Li · Chao Wu · Meng Pang · Yang Lu · Yan Yan · Hanzi Wang

[ Exhibit Hall I ]

Abstract
Data heterogeneity, stemming from local non-IID data and global long-tailed distributions, is a major challenge in federated learning (FL), leading to significant performance gaps compared to centralized learning. Previous research found that poor representations and biased classifiers are the main problems and proposed neural-collapse-inspired synthetic simplex ETF to help representations be closer to neural collapse optima. However, we find that the neural-collapse-inspired methods are not strong enough to reach neural collapse and still have huge gaps to centralized training. In this paper, we rethink this issue from a self-distillation perspective and propose FedYoYo (You Are Your Own Best Teacher), introducing Augmented Self-bootstrap Distillation (ASD) to improve representation learning by distilling knowledge between weakly and strongly augmented local samples, without needing extra datasets or models. We further introduce Distribution-aware Logit Adjustment (DLA) to balance the self-distillation process and correct biased feature representations. FedYoYo nearly eliminates the performance gap, achieving centralized-level performance even under mixed heterogeneity. It enhances local representation learning, reducing model drift and improving convergence, with feature prototypes closer to neural collapse optimality. Extensive experiments show FedYoYo achieves state-of-the-art results, even surpassing centralized logit adjustment methods by 5.4\% under global long-tailed settings. The code is available at https://anonymous.4open.science/r/FedYoYo-1F01.
Poster
Chancharik Mitra · Brandon Huang · Tianning Chai · Zhiqiu Lin · Assaf Arbelle · Rogerio Feris · Leonid Karlinsky · Trevor Darrell · Deva Ramanan · Roei Herzig

[ Exhibit Hall I ]

Abstract
Generative Large Multimodal Models (LMMs) like LLaVA and Qwen-VL excel at a wide variety of vision-language (VL) tasks. Despite strong performance, LMMs' generative outputs are not specialized for vision-language classification tasks (i.e., tasks with vision-language inputs and discrete labels) such as image classification and multiple-choice VQA. One key challenge in utilizing LMMs for these tasks is the extraction of useful features from generative LMMs. To overcome this, we propose an approach that leverages multimodal feature extraction from the LMM's latent space. Toward this end, we present Sparse Attention Vectors (SAVs)---a finetuning-free method that leverages sparse attention head activations (fewer than 5% of the heads) in LMMs as strong feature representations. With only few-shot examples, SAVs demonstrate state-of-the-art performance compared to a variety of few-shot and finetuned baselines on a collection of vision-language classification tasks. Our experiments also imply that SAVs can scale in performance with additional examples and generalize to similar tasks, establishing SAVs as both effective and robust multimodal feature representations.
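
A generic sketch of the recipe the abstract describes: treat per-head activations as candidate features, keep a sparse subset of heads that best separate the few-shot classes, and classify by nearest class centroid. The variance-ratio head score, the array shapes, and the function names are assumptions rather than the SAV procedure itself.

```python
import numpy as np

def select_heads(acts, labels, k=4):
    """acts: (n, n_heads, d) per-head activations; keep the k heads with the
    largest between/within-class variance ratio (an assumed scoring rule)."""
    scores = []
    for h in range(acts.shape[1]):
        Xh = acts[:, h, :]
        mu = Xh.mean(axis=0)
        between, within = 0.0, 0.0
        for c in np.unique(labels):
            Xc = Xh[labels == c]
            between += len(Xc) * np.sum((Xc.mean(axis=0) - mu) ** 2)
            within += np.sum((Xc - Xc.mean(axis=0)) ** 2)
        scores.append(between / (within + 1e-12))
    return np.argsort(scores)[-k:]

def nearest_centroid_predict(train_acts, train_labels, test_acts, heads):
    tr = train_acts[:, heads, :].reshape(len(train_acts), -1)
    te = test_acts[:, heads, :].reshape(len(test_acts), -1)
    classes = np.unique(train_labels)
    centroids = np.stack([tr[train_labels == c].mean(axis=0) for c in classes])
    dists = np.linalg.norm(te[:, None, :] - centroids[None, :, :], axis=-1)
    return classes[np.argmin(dists, axis=1)]

# toy usage with random activations standing in for LMM attention-head features
rng = np.random.default_rng(0)
train = rng.normal(size=(40, 32, 16))
labels = rng.integers(0, 4, size=40)
heads = select_heads(train, labels, k=4)
preds = nearest_centroid_predict(train, labels, rng.normal(size=(10, 32, 16)), heads)
```
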
Poster
Jiawei Wang · Yushen Zuo · Yuanjun Chai · Zhendong Liu · Yicheng Fu · Yichun Feng · Kin Man Lam

[ Exhibit Hall I ]

Abstract
Vision-Language Models (VLMs) extend the capabilities of Large Language Models (LLMs) by incorporating visual information, yet they remain vulnerable to jailbreak attacks, especially when processing noisy or corrupted images. Although existing VLMs adopt security measures during training to mitigate such attacks, vulnerabilities associated with noise-augmented visual inputs are overlooked. In this work, we identify that missing noise-augmented training causes critical security gaps: many VLMs are susceptible to even simple perturbations such as Gaussian noise. To address this challenge, we propose Robust-VLGuard, a multimodal safety dataset with aligned / misaligned image-text pairs, combined with noise-augmented fine-tuning that reduces attack success rates while preserving the functionality of the VLM. For stronger optimization-based visual perturbation attacks, we propose DiffPure-VLM, leveraging diffusion models to convert adversarial perturbations into Gaussian-like noise, which can be defended by VLMs with noise-augmented safety fine-tuning. Experimental results demonstrate that the distribution-shifting property of the diffusion model aligns well with our fine-tuned VLMs, significantly mitigating adversarial perturbations across varying intensities. The dataset and code will be open-sourced.
Poster
Kejia Zhang · Juanjuan Weng · Zhiming Luo · Shaozi Li

[ Exhibit Hall I ]

Abstract
Despite the remarkable progress of deep neural networks (DNNs) in various visual tasks, their vulnerability to adversarial examples raises significant security concerns. Recent adversarial training methods leverage inverse adversarial attacks to generate high-confidence examples, aiming to align adversarial distributions with high-confidence class regions. However, our investigation reveals that under inverse adversarial attacks, high-confidence outputs are influenced by biased feature activations, causing models to rely on background features that lack a causal relationship with the labels. This spurious correlation bias leads to overfitting irrelevant background features during adversarial training, thereby degrading the model's robust performance and generalization capabilities. To address this issue, we propose Debiased High-Confidence Adversarial Training (DHAT), a novel approach that aligns adversarial logits with debiased high-confidence logits and restores proper attention by enhancing foreground logit orthogonality. Extensive experiments demonstrate that DHAT achieves state-of-the-art robustness on both CIFAR and ImageNet-1K benchmarks, while significantly improving generalization by mitigating the feature bias inherent in inverse adversarial training approaches. Code is available at~\url{https://anonymous.4open.science/r/ICCV-7546}.
Poster
Shiyu Zhang · Cheng Yan · Yang Liu · Chenchen Jing · Lei Zhou · Wenjun Wang

[ Exhibit Hall I ]

Abstract
Compositional Zero-Shot Learning (CZSL) aims to recognize novel attribute-object compositions by leveraging knowledge from seen compositions. Existing methods align textual prototypes with visual features through Vision-Language Models (VLMs), but they face two key limitations: (1) modality gaps hinder the discrimination of semantically similar composition pairs, and (2) single-modal textual prototypes lack fine-grained visual cues, creating bottlenecks in VLM-based CZSL. In this paper, we introduce Visual Proxy Learning, a novel approach that facilitates the learning of distinct visual distributions, effectively reducing the modality gap and improving compositional generalization performance. Specifically, we initialize visual proxies for various attributes, objects, and their compositions using text representations. By optimizing the visual space, we capture fine-grained visual cues and guide the learning of more discriminative visual representations for attributes, objects and compositions. Furthermore, we propose an effective Cross-Modal Joint Learning (CMJL) strategy that imposes cross-modal constraints between the original text-image space and the fine-grained visual space. This approach not only boosts generalization for previously unseen composition pairs but also sharpens the discrimination of similar pairs, fostering more robust and precise learning. Extensive experiments demonstrate state-of-the-art performance in closed-world scenarios and competitive open-world results across four established CZSL benchmarks, validating the effectiveness of our approach in advancing compositional …
Poster
Zihan Cao · Yu Zhong · Liang-Jian Deng

[ Exhibit Hall I ]

Abstract
Pansharpening, a pivotal task in remote sensing for fusing high-resolution panchromatic and multispectral imagery, has garnered significant research interest. Recent advancements employing diffusion models based on stochastic differential equations (SDEs) have demonstrated state-of-the-art performance. However, the inherent multi-step sampling process of SDEs imposes substantial computational overhead, hindering practical deployment. While existing methods adopt efficient samplers, knowledge distillation, or retraining to reduce sampling steps (\textit{e.g.}, from 1,000 to fewer steps), such approaches often compromise fusion quality. In this work, we propose the Optimal Transport Flow Matching (OTFM) framework, which integrates the dual formulation of unbalanced optimal transport (UOT) to achieve one-step, high-quality pansharpening. Unlike conventional OT formulations that enforce rigid distribution alignment, UOT relaxes marginal constraints to enhance modeling flexibility, accommodating the intrinsic spectral and spatial disparities in remote sensing data. Furthermore, we incorporate task-specific regularization into the UOT objective, enhancing the robustness of the flow model. The OTFM framework enables simulation-free training and single-step inference while maintaining strict adherence to pansharpening constraints. Experimental evaluations across multiple datasets demonstrate that OTFM matches or exceeds the performance of previous regression-based models and leading diffusion-based methods while only needing one sampling step.
Poster
Liangyu Xiang · Junyu Gao · Changsheng Xu

[ Exhibit Hall I ]

Abstract
Existing logit-based knowledge distillation methods typically employ singularly deterministic categorical distributions, which eliminates the inherent uncertainty in network predictions and thereby limits the effective transfer of knowledge. To address this limitation, we introduce distribution-based probabilistic modeling as a more comprehensive representation of network knowledge. Specifically, we regard the categorical distribution as a random variable and leverage deep neural networks to predict its distribution, representing it as an evidential second-order distribution. Based on this second-order modeling, we propose Evidential Knowledge Distillation (EKD) which distills both the expectation of the teacher distribution and the distribution itself into the student. The expectation captures the macroscopic characteristics of the distribution, while the distribution itself conveys microscopic information about the classification boundaries. Additionally, we theoretically demonstrate that EKD's distillation objective provides an upper bound on the expected risk of the student when the teacher's predictions are treated as ground truth labels. Extensive experiments on several standard benchmarks across various teacher-student network pairs highlight the effectiveness and superior performance of EKD. Our code is available in the Supplementary Material.
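
To illustrate the second-order view, the sketch below maps logits to Dirichlet parameters via a softplus evidence link and distills the teacher's expected class probabilities into the student; EKD's additional distribution-level term and its exact parameterization are not reproduced, and the names here are illustrative.

```python
import numpy as np

def dirichlet_params(logits):
    """Evidential second-order parameters: alpha = softplus(logits) + 1."""
    evidence = np.log1p(np.exp(logits))   # softplus as an assumed evidence link
    return evidence + 1.0

def expected_probs(alpha):
    return alpha / alpha.sum(axis=-1, keepdims=True)

def expectation_distill_loss(student_logits, teacher_logits):
    """Cross-entropy from the teacher's expected distribution to the student's."""
    p_t = expected_probs(dirichlet_params(teacher_logits))
    p_s = expected_probs(dirichlet_params(student_logits))
    return float(-np.mean(np.sum(p_t * np.log(p_s + 1e-12), axis=-1)))

teacher_logits = np.random.randn(8, 10)   # a small batch, 10 classes
student_logits = np.random.randn(8, 10)
print(expectation_distill_loss(student_logits, teacher_logits))
```
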
Poster
Junsung Park · Jungbeom Lee · Jongyoon Song · Sangwon Yu · Dahuin Jung · Sungroh Yoon

[ Exhibit Hall I ]

Abstract
While CLIP has significantly advanced multimodal understanding by bridging vision and language, the inability to grasp negation — such as failing to differentiate concepts like "parking" from "no parking" — poses substantial challenges. By analyzing the data used in the public CLIP model's pre-training, we posit this limitation stems from a lack of negation-inclusive data. To address this, we introduce data generation pipelines that employ a large language model (LLM) and a multimodal LLM to produce negation-inclusive captions. Fine-tuning CLIP with data generated from our pipelines, we develop NegationCLIP, which enhances negation awareness while preserving generality. Moreover, to enable a comprehensive evaluation of negation understanding, we propose NegRefCOCOg—a benchmark tailored to test VLMs' ability to interpret negation across diverse expressions and positions within a sentence. Experiments on various CLIP architectures validate the effectiveness of our data generation pipelines in enhancing CLIP's ability to perceive negation accurately. Additionally, NegationCLIP's enhanced negation awareness has practical applications across various multimodal tasks, demonstrated by performance gains in text-to-image generation and referring image segmentation.
Poster
Nuoye Xiong · Anqi Dong · Ning Wang · Cong Hua · Guangming Zhu · Lin Mei · peiyi shen · zhang liang

[ Exhibit Hall I ]

Abstract
Recent advances in deep learning have led to increasingly complex models with deeper layers and more parameters, reducing interpretability and making their decisions harder to understand. While many methods explain black-box reasoning, most lack effective interventions or only operate at the sample level without modifying the model itself. To address this, we propose the Concept Bottleneck Model for Enhancing Human-Neural Network Mutual Understanding (CBM-HNMU). CBM-HNMU leverages the Concept Bottleneck Model (CBM) as an interpretable framework to approximate black-box reasoning and communicate conceptual understanding. Detrimental concepts are automatically identified and refined (removed/replaced) based on global gradient contributions. The modified CBM then distills corrected knowledge back into the black-box model, enhancing both interpretability and accuracy. We evaluate CBM-HNMU on various CNN and transformer-based models across Flower-102, CIFAR-10, CIFAR-100, FGVC-Aircraft, and CUB-200, achieving a maximum accuracy improvement of 2.64\% and a maximum increase in average accuracy of 1.03\%. Source code is available at: http://anonymous.com.
Poster
Marshall Thomas · Edward Fish · Richard Bowden

[ Exhibit Hall I ]

Abstract
Lip Reading, or Visual Automatic Speech Recognition (V-ASR), is a complex task requiring the interpretation of spoken language exclusively from visual cues, primarily lip movements and facial expressions. This task is especially challenging due to the absence of auditory information and the inherent ambiguity when visually distinguishing phonemes that have overlapping visemes—where different phonemes appear identical on the lips. Current methods typically attempt to predict words or characters directly from these visual cues, but this approach frequently encounters high error rates due to coarticulation effects and viseme ambiguity. We propose a novel two-stage, phoneme-centric framework for Visual Automatic Speech Recognition (V-ASR) that addresses these longstanding challenges. First, our model predicts a compact sequence of phonemes from visual inputs using a Video Transformer with a CTC head, thereby reducing the task complexity and achieving robust speaker invariance. This phoneme output then serves as the input to a fine-tuned Large Language Model (LLM), which reconstructs coherent words and sentences by leveraging broader linguistic context. Unlike existing methods that either predict words directly—often faltering on visually similar phonemes—or rely on large-scale multimodal pre-training, our approach explicitly encodes intermediate linguistic structure while remaining highly data efficient. We demonstrate state-of-the-art performance on two challenging datasets, …
Poster
Nicole Kim · Hwanjun Song

[ Exhibit Hall I ]

Abstract
Dataset condensation aims to compress a large dataset into a smaller synthetic set while preserving the essential representations needed for effective model training. However, existing condensation methods show severe performance degradation when applied to noisy datasets. To address this, we present robust dataset condensation (RDC), an end-to-end method that mitigates noise to generate a clean and robust synthetic set, without requiring separate noise-reduction preprocessing steps. RDC refines the condensation process by integrating contrastive learning tailored for robust condensation, named golden MixUp contrast. It uses synthetic samples to sharpen class boundaries and to mitigate noisy representations, while its augmentation strategy compensates for the limited size of the synthetic set by identifying clean samples from noisy training data, enriching synthetic images with real-data diversity. We evaluate RDC against existing condensation methods and a conventional approach that first applies noise cleaning algorithms to the dataset before performing condensation. Extensive experiments show that RDC outperforms other approaches on CIFAR-10/100 across different types of noise, including asymmetric, symmetric, and real-world noise.
Poster
Jinsoo Bae · Seoung Bum Kim · Hyungrok Do

[ Exhibit Hall I ]

Abstract
Semi-supervised learning (SSL) uses unlabeled data to improve the performance of machine learning models when labeled data is scarce. However, its real-world applications often face the label distribution mismatch problem, in which the unlabeled dataset includes instances whose ground-truth labels are absent from the labeled training dataset. Recent studies referred to as safe SSL have addressed this issue by using both classification and out-of-distribution (OOD) detection. However, the existing methods may suffer from overconfidence in deep neural networks, leading to increased SSL errors because of high confidence in incorrect pseudo-labels or OOD detection. To address this, we propose a novel method, CaliMatch, which calibrates both the classifier and the OOD detector to foster safe SSL. CaliMatch presents adaptive label smoothing and temperature scaling, which eliminates the need to manually tune the smoothing degree for effective calibration. We give a theoretical justification for why improving the calibration of both the classifier and the OOD detector is crucial in safe SSL. Extensive evaluations on CIFAR-10, CIFAR-100, SVHN, TinyImageNet, and ImageNet demonstrate that CaliMatch outperforms the existing methods in safe SSL tasks.
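
As background for the calibration component, the sketch below fits a single temperature by grid search on validation logits, the classic temperature-scaling recipe; CaliMatch's adaptive label smoothing and its joint calibration of the classifier and OOD detector are not shown, and all names are placeholders.

```python
import numpy as np

def nll(logits, labels, T):
    """Negative log-likelihood of labels under temperature-scaled logits."""
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def fit_temperature(val_logits, val_labels, grid=np.linspace(0.5, 5.0, 91)):
    """Pick the temperature that minimizes validation NLL (simple grid search)."""
    losses = [nll(val_logits, val_labels, T) for T in grid]
    return float(grid[int(np.argmin(losses))])

# toy over-confident logits on a held-out set
rng = np.random.default_rng(0)
val_logits = rng.normal(scale=4.0, size=(256, 10))
val_labels = rng.integers(0, 10, size=256)
T = fit_temperature(val_logits, val_labels)
print("fitted temperature:", T)
```
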
Poster
Hyewon Park · Hyejin Park · Jueun Ko · Dongbo Min

[ Exhibit Hall I ]

Abstract
Continual Test Time Adaptation (CTTA) has emerged as a critical approach to bridge the domain gap between controlled training environments and real-world scenarios. Since it is important to balance the trade-off between adaptation and stabilization, many studies have tried to accomplish it by either introducing a regulation to fully trainable models or updating a limited portion of the models. This paper proposes **Hybrid-TTA**, a holistic approach that dynamically selects the instance-wise tuning method for optimal adaptation. Our approach introduces Dynamic Domain Shift Detection (DDSD), which identifies domain shifts by leveraging temporal correlations in input sequences, and dynamically switches between Full or Efficient Tuning for effective adaptation toward varying domain shifts. To maintain model stability, Masked Image Modeling Adaptation (MIMA) leverages an auxiliary reconstruction task for enhanced generalization and robustness with minimal computational overhead. Hybrid-TTA achieves a 0.6\%p gain on the Cityscapes-to-ACDC benchmark dataset for semantic segmentation, surpassing previous state-of-the-art methods. It also delivers about a 20-fold increase in FPS compared to the recently proposed fastest methods, offering a robust solution for real-world continual adaptation challenges.
Poster
Wooseong Jeong · Kuk-Jin Yoon

[ Exhibit Hall I ]

Abstract
Multi-Task Learning (MTL) enables multiple tasks to be learned within a shared network, but differences in objectives across tasks can cause negative transfer, where the learning of one task degrades another task's performance. While pre-trained transformers significantly improve MTL performance, their fixed network capacity and rigid structure limit adaptability. Previous dynamic network architectures attempt to address this but are inefficient as they directly convert shared parameters into task-specific ones. We propose Dynamic Token Modulation and Expansion (DTME-MTL), a framework applicable to any transformer-based MTL architecture. DTME-MTL enhances adaptability and reduces overfitting by identifying gradient conflicts in token space and applying adaptive solutions based on conflict type. Unlike prior methods that mitigate negative transfer by duplicating network parameters, DTME-MTL operates entirely in token space, enabling efficient adaptation without excessive parameter growth. Extensive experiments demonstrate that DTME-MTL consistently improves multi-task performance with minimal computational overhead, offering a scalable and effective solution for enhancing transformer-based MTL models.
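
A small sketch of the kind of token-space gradient-conflict test the abstract refers to: compare two tasks' gradients for each shared token and flag tokens with negative cosine similarity. The threshold, shapes, and names are assumptions; DTME-MTL's modulation and expansion responses to a detected conflict are not reproduced.

```python
import numpy as np

def conflict_mask(grad_a, grad_b, thresh=0.0):
    """grad_a, grad_b: (n_tokens, d) gradients of two tasks w.r.t. shared tokens.
    Returns True for tokens whose task gradients point in conflicting directions."""
    na = np.maximum(np.linalg.norm(grad_a, axis=1), 1e-12)
    nb = np.maximum(np.linalg.norm(grad_b, axis=1), 1e-12)
    cos = np.sum(grad_a * grad_b, axis=1) / (na * nb)
    return cos < thresh

grad_task_a = np.random.randn(197, 768)   # e.g. gradients for every ViT token, task A
grad_task_b = np.random.randn(197, 768)   # task B
print("fraction of conflicting tokens:", conflict_mask(grad_task_a, grad_task_b).mean())
```
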
Poster
Francisco Caetano · Christiaan Viviers · Luis Zavala-Mondragón · Peter H.N. De With · Fons van der Sommen

[ Exhibit Hall I ]

Abstract
Out-of-distribution (OOD) detection holds significant importance across many applications. While semantic and domain-shift OOD problems are well-studied, this work focuses on covariate shifts - subtle variations in the data distribution that can degrade machine learning performance. We hypothesize that detecting these subtle shifts can improve our understanding of in-distribution boundaries, ultimately improving OOD detection. In adversarial discriminators trained with Batch Normalization (BN), real and adversarial samples form distinct domains with unique batch statistics - a property we exploit for OOD detection. We introduce DisCoPatch, an unsupervised Adversarial Variational Autoencoder (VAE) framework that harnesses this mechanism. During inference, batches consist of patches from the same image, ensuring a consistent data distribution that allows the model to rely on batch statistics. DisCoPatch uses the VAE's suboptimal outputs (generated and reconstructed) as negative samples to train the discriminator, thereby improving its ability to delineate the boundary between in-distribution samples and covariate shifts. By tightening this boundary, DisCoPatch achieves state-of-the-art results in public OOD detection benchmarks. The proposed model not only excels in detecting covariate shifts, achieving 95.5% AUROC on ImageNet-1K(-C), but also outperforms all prior methods on public Near-OOD (95.0%) benchmarks. With a compact model size of $\leq$ 25MB, it achieves high OOD …
Poster
XIEQUN WANG · Zhan Zhuang · Yu Zhang

[ Exhibit Hall I ]

Abstract
Continual learning (CL) requires models to continuously adapt to new tasks without forgetting past knowledge. In this work, we propose **P**roactive **L**ow-rank **A**llocatio**N** (PLAN), a framework that extends Low-Rank Adaptation (LoRA) to enable efficient and interference-aware fine-tuning of large pre-trained models in CL settings. PLAN proactively manages the allocation of task-specific subspaces by introducing orthogonal basis vectors for each task and optimizing them through a perturbation-based strategy that minimizes conflicts with previously learned parameters. Furthermore, PLAN incorporates a novel selection mechanism that identifies and assigns basis vectors with minimal sensitivity to interference, reducing the risk of degrading past knowledge while maintaining efficient adaptation to new tasks. Empirical results on standard CL benchmarks demonstrate that PLAN consistently outperforms existing methods, establishing a new state-of-the-art for continual learning with foundation models.
Poster
Shenyu Lu · Zhaoying Pan · Xiaoqian Wang

[ Exhibit Hall I ]

Abstract
Contrastive Language-Image Pre-training (CLIP) models exhibit intriguing properties, particularly in their zero-shot classification capability. However, the reliability of CLIP zero-shot classification is severely undermined by spurious correlations. Existing efforts to enhance the robustness of zero-shot CLIP models often rely on prior knowledge or annotations of spurious correlations, limiting real-world applicability due to the unavailability of such information. Alternative methods attempt to detect distribution shift at test time but require training statistics whose access is often restricted or computationally expensive. To address the challenges brought by spurious correlation under zero-shot settings, we propose a novel test-time reasoning approach. Our method, inspired by human recognition, localizes the object and refines the classification accordingly. The inherent capacity of CLIP for semantic understanding allows us to isolate the object of interest without auxiliary models. Zero-shot classification is then performed exclusively on the localized objects, effectively mitigating the influence of spurious correlation. The proposed approach is interpretable and flexible as it requires no spurious annotations or prior knowledge, making it widely applicable. Substantial improvements across multiple benchmark datasets validate the effectiveness of our approach.
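A hedged sketch of the general localize-then-classify pattern follows; `encode_image`, `encode_text`, `patch_similarity_map`, and `crop` are hypothetical placeholders for CLIP-style components, and the paper's actual localization procedure may differ.

```python
# Localize the object region first, then run zero-shot classification on the crop only.
import numpy as np

def localize_then_classify(image, class_names, encode_image, encode_text,
                           patch_similarity_map, crop):
    # 1) coarse localization: per-patch similarity to all class prompts
    sim_map = patch_similarity_map(image, class_names)      # (H_p, W_p)
    y, x = np.unravel_index(np.argmax(sim_map), sim_map.shape)
    # 2) crop around the most object-like region and re-classify on the crop
    region = crop(image, center=(y, x))
    img_emb = encode_image(region)                           # (d,)
    txt_emb = encode_text(class_names)                       # (K, d)
    logits = txt_emb @ img_emb
    return int(np.argmax(logits))
```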
Poster
Zihua Zhao · Feng Hong · Mengxi Chen · Pengyi Chen · Benyuan Liu · Jiangchao Yao · Ya Zhang · Yanfeng Wang

[ Exhibit Hall I ]

Abstract
The remarkable success of contrastive-learning-based multimodal models has been greatly driven by training on ever-larger datasets with expensive compute consumption. Sample selection offers an efficient alternative paradigm to accelerate the training process. However, recent advances in sample selection either rely on an oracle model to select a high-quality coreset offline, which is limited in cold-start scenarios, or focus on online selection based on real-time model predictions, which does not sufficiently or efficiently account for noisy correspondence. To address this dilemma, we propose a novel Differential-Informed Sample Selection (DISSect) method, which accurately and efficiently discriminates the noisy correspondence for training acceleration. Specifically, we rethink the impact of noisy correspondence on contrastive learning and propose that the differential between the predicted correlation of the current model and that of a historical model is more informative for characterizing sample quality. Based on this, we construct a robust differential-based sample selection and analyze its theoretical insights. Extensive experiments on three benchmark datasets and various downstream tasks demonstrate the consistent superiority of DISSect over current state-of-the-art methods.
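To make the differential criterion concrete, here is a minimal sketch: pairs whose image-text similarity rose the most under the current model relative to a historical model are treated as cleaner and kept for training. The feature inputs and the keep ratio are illustrative assumptions.

```python
# Differential-informed selection: score = current similarity - historical similarity.
import torch
import torch.nn.functional as F

@torch.no_grad()
def differential_select(img_cur, txt_cur, img_hist, txt_hist, keep_ratio=0.5):
    """All inputs: (N, d) features for N image-text pairs; returns indices to keep."""
    sim_cur = F.cosine_similarity(img_cur, txt_cur, dim=-1)     # (N,)
    sim_hist = F.cosine_similarity(img_hist, txt_hist, dim=-1)  # (N,)
    differential = sim_cur - sim_hist                            # higher = more informative
    k = int(keep_ratio * differential.numel())
    return torch.topk(differential, k).indices
```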
Poster
YAWEN ZOU · Guang Li · Duo Su · Zi Wang · Jun YU · Chao Zhang

[ Exhibit Hall I ]

Abstract
Dataset distillation (DD) condenses large datasets into compact yet informative substitutes, preserving performance comparable to the original dataset while reducing storage, transmission costs, and computational consumption. However, previous DD methods mainly focus on distilling information from images, often overlooking the semantic information inherent in the data. The disregard for context hinders the model's generalization ability, particularly in tasks involving complex datasets, which may result in illogical outputs or the omission of critical objects. In this study, we integrate vision-language methods into DD by introducing text prototypes to distill language information and collaboratively synthesize data with image prototypes, thereby enhancing dataset distillation performance. Notably, the text prototypes utilized in this study are derived from descriptive text information generated by an open-source large language model. This framework demonstrates broad applicability across datasets without pre-existing text descriptions, expanding the potential of dataset distillation beyond traditional image-based approaches. Compared to other methods, the proposed approach generates logically coherent images containing target objects, achieving state-of-the-art validation performance and demonstrating robust generalization. Source code and generated data are available at https://anonymous.4open.science/r/10575/.
Poster
Juntao Wu · Xianting Huang · Yu Chen · Shuai Pang · Ke Wang

[ Exhibit Hall I ]

Abstract
Despite the success of adversarial training on small datasets, applying it to large-scale datasets like ImageNet remains challenging. Previous attempts using synthetic data show limited improvements. This work investigates the impact of synthetic data scaling, model scaling, and training strategies on adversarial training with ImageNet, providing deeper insights into large-scale robustness. During the process, we observe a notable phenomenon of loss oscillation, leading to adversarial overfitting, and propose strategies to mitigate it. Experimental results show that, under AutoAttack on ImageNet-1K, our method achieves a robust accuracy of 71.54\%. Our findings highlight the crucial role of synthetic data and model scaling in enhancing adversarial robustness on large-scale benchmarks and provide a new direction for training robust visual representations at scale.
Poster
Jizong Peng · Tze Ho Elden Tse · Kai Xu · Wenchao Gao · Angela Yao

[ Exhibit Hall I ]

Abstract
3D Gaussian Splatting (3DGS) is a powerful reconstruction technique, but it needs to be initialized from accurate camera poses and high-fidelity point clouds. Typically, this initialization is taken from Structure-from-Motion (SfM) algorithms; however, SfM is time-consuming and restricts the application of 3DGS in real-world scenarios and large-scale scene reconstruction. We introduce a constrained optimization method for simultaneous camera pose estimation and 3D reconstruction that does not require SfM support. Core to our approach is decomposing a camera pose into a sequence of camera-to-(device-)center and (device-)center-to-world optimizations. To facilitate this, we propose two optimization constraints conditioned on the sensitivity of each parameter group, restricting each parameter’s search space. In addition, as we learn the scene geometry directly from the noisy point clouds, we propose geometric constraints to improve the reconstruction quality. Experiments demonstrate that the proposed method significantly outperforms the existing (multi-modal) 3DGS baseline and methods supplemented by COLMAP on both our collected dataset and two public benchmarks.
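The pose factorization itself is plain rigid-body geometry; the sketch below shows it with homogeneous transforms and does not reproduce the paper's sensitivity-conditioned constraints.

```python
# Camera-to-world pose as the product of (device-)center-to-world and camera-to-(device-)center.
import numpy as np

def se3(R: np.ndarray, t: np.ndarray) -> np.ndarray:
    """Build a 4x4 homogeneous transform from a 3x3 rotation and a 3-vector translation."""
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, t
    return T

def camera_to_world(T_center_world: np.ndarray, T_cam_center: np.ndarray) -> np.ndarray:
    # optimizing the two factors separately is the decomposition described above
    return T_center_world @ T_cam_center
```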
Poster
Hung-Chieh Fang · Hsuan-Tien Lin · Irwin King · Yifei Zhang

[ Exhibit Hall I ]

Abstract
Federated Unsupervised Learning (FUL) aims to learn expressive representations in federated and self-supervised settings. The quality of representations learned in FUL is usually determined by uniformity, a measure of how uniformly representations are distributed in the embedding space. However, existing solutions perform well in achieving intra-client (local) uniformity for local models while failing to achieve inter-client (global) uniformity after aggregation due to non-IID data distributions and the decentralized nature of FUL. To address this issue, we propose Soft Separation and Distillation (SSD), a novel approach that preserves inter-client uniformity by encouraging client representations to spread toward different directions. This design reduces interference during client model aggregation, thereby improving global uniformity while preserving local representation expressiveness. We further enhance this effect by introducing a projector distillation module to address the discrepancy between loss optimization and representation quality. We evaluate SSD in both cross-silo and cross-device federated settings, demonstrating consistent improvements in representation quality and task performance across various training scenarios. Our results highlight the importance of inter-client uniformity in FUL and establish SSD as an effective solution to this challenge.
Poster
Daqian Shi · Xiaolei Diao · Xu Chen · Cedric John

[ Exhibit Hall I ]

Abstract
Deep Neural Networks (DNNs) have significantly advanced the field of computer vision. To improve the DNN training process, knowledge distillation methods have demonstrated their effectiveness in accelerating network training by introducing a fixed learning direction from the teacher network to student networks. In this context, several distillation-based optimization strategies have been proposed, e.g., deep mutual learning and self-distillation, in an attempt to achieve generic improvements in training performance through the cooperative training of multiple networks. However, such strategies achieve limited improvements due to a poor understanding of the impact of learning directions among networks across different iterations. In this paper, we propose a novel competitive distillation strategy that allows each network in a group to potentially act as a teacher based on its performance, enhancing the overall learning performance. Competitive distillation organizes a group of networks to perform a shared task and engage in competition, where competitive optimization is proposed to improve the parameter updating process. We further introduce stochastic perturbation in competitive distillation, aiming to motivate networks to induce mutations that achieve better visual representations and a global optimum. The experimental results show that competitive distillation achieves promising performance in diverse tasks and datasets.
Poster
Chao Pan · Ke Tang · Li Qing · Xin Yao

[ Exhibit Hall I ]

Abstract
Fast Adversarial Training (FAT) employs the single-step Fast Gradient Sign Method (FGSM) to generate adversarial examples, reducing the computational costs of traditional adversarial training. However, FAT suffers from Catastrophic Overfitting (CO), where models' robust accuracy against multi-step attacks plummets to zero during training. Recent studies indicate that CO occurs because single-step adversarial perturbations contain label information that models exploit for prediction, leading to overfitting and diminished robustness against more complex attacks. In this paper, we discover that after CO occurs, the label information of certain samples can transfer across different samples, significantly increasing the likelihood of modified images being classified as the intended label. This discovery offers a new perspective on why various adversarial initialization strategies are effective. To address this issue, we introduce an innovative FAT strategy that leverages special samples to capture transferable label information and proactively removes potential label information during training, complemented by a non-uniform label smoothing technique to further eliminate label information. Experimental results across three datasets demonstrate that our method maintains competitive robustness against several attacks compared to other FAT approaches, with ablation studies confirming the effectiveness of our methodology.
Poster
Zhenbang Du · Yonggan Fu · Lifu Wang · Jiayi Qian · Xiao Luo · Yingyan Celine Lin

[ Exhibit Hall I ]

Abstract
Diffusion models have shown remarkable success across generative tasks, yet their high computational demands challenge deployment on resource-limited platforms. This paper investigates a critical question for compute-optimal diffusion model deployment: Under a post-training setting without fine-tuning, is it more effective to reduce the number of denoising steps or to use a cheaper per-step inference? Intuitively, reducing the denoising steps increases the variability of the characteristics between the steps, making the model more sensitive to compression. In contrast, keeping more denoising steps makes the differences smaller, preserving redundancy, and making post-training compression more feasible. To systematically examine this, we propose PostDiff, a training-free framework for accelerating pre-trained diffusion models by reducing redundancy at both the input level and module level in a post-training manner. At the input level, we propose a mixed-resolution denoising scheme based on the insight that reducing generation resolution in early denoising steps can enhance low-frequency components and improve final generation fidelity. At the module level, we employ a hybrid module caching strategy to reuse computations across denoising steps. Extensive experiments and ablation studies demonstrate that (1) PostDiff can significantly improve the fidelity-efficiency trade-off of state-of-the-art diffusion models, and (2) to boost efficiency while maintaining decent generation fidelity, …
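As a toy illustration of module-level caching across denoising steps, the wrapper below recomputes an expensive block only every few steps and reuses its output in between; the block interface and refresh interval are assumptions, not PostDiff's exact caching policy.

```python
# Reuse an expensive sub-module's output between periodic refreshes during sampling.
class CachedBlock:
    """Wraps a denoiser sub-module so its output is reused across nearby steps."""
    def __init__(self, block, refresh: int = 4):
        self.block, self.refresh = block, refresh
        self._cache = None

    def __call__(self, x, step: int):
        if self._cache is None or step % self.refresh == 0:
            self._cache = self.block(x)   # recompute every `refresh` denoising steps
        return self._cache                # otherwise return the cached output
```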
Poster
Tianshuo Peng · Mingsheng Li · Jiakang Yuan · Hongbin Zhou · Renqiu Xia · Renrui Zhang · LEI BAI · Song Mao · Bin Wang · Aojun Zhou · Botian Shi · Tao Chen · Bo Zhang · Xiangyu Yue

[ Exhibit Hall I ]

Abstract
Large Multi-modal Models (LMMs), trained on web-scale datasets predominantly composed of natural images, have demonstrated remarkable performance on general tasks. However, these models often exhibit limited specialized capabilities for domain-specific tasks that require extensive domain prior knowledge. An intuitive solution is to post-train LMMs on a specific domain, but this often suffers from the labor-intensive annotation process and the inaccessibility of private training data. Directly integrating expert models tailored for those tasks is also challenging due to representational gaps and imbalanced optimization. To address these challenges, we introduce \textbf{Chimera}, a scalable and low-cost multi-modal pipeline designed to boost the ability of existing LMMs with domain-specific experts. Specifically, we design a progressive training strategy to integrate features from expert models into the input of a generalist LMM. To address the imbalanced optimization caused by the well-aligned general visual encoder, we introduce a novel Generalist-Specialist Collaboration Masking (GSCM) mechanism. This results in a versatile model that excels across the chart, table, math, and document domains, achieving state-of-the-art performance on multi-modal reasoning and visual content extraction tasks, both of which are challenging tasks for assessing existing LMMs. We will release model weights, along with the data used for training and evaluation, to facilitate future …
Poster
Jiahui Geng · Qing Li

[ Exhibit Hall I ]

Abstract
Unlearning methods for vision-language models (VLMs) have primarily adapted techniques from large language models (LLMs), relying on weight updates that demand extensive annotated forget sets. Moreover, these methods perform unlearning at a coarse granularity, often leading to excessive forgetting and reduced model utility. To address this issue, we introduce SAUCE, a novel method that leverages sparse autoencoders (SAEs) for fine-grained and selective concept unlearning in VLMs. Briefly, SAUCE first trains SAEs to capture high-dimensional, semantically rich sparse features. It then identifies the features most relevant to the target concept for unlearning. During inference, it selectively modifies these features to suppress specific concepts while preserving unrelated information. We evaluate SAUCE on two distinct VLMs, LLaVA-v1.5-7B and LLaMA-3.2-11B-Vision-Instruct, across two types of tasks: concrete concept unlearning (objects and sports scenes) and abstract concept unlearning (emotions, colors, and materials), encompassing a total of 60 concepts. Extensive experiments demonstrate that SAUCE outperforms state-of-the-art methods by 18.04\% in unlearning quality while maintaining comparable model utility. Furthermore, we investigate SAUCE's robustness against widely used adversarial attacks, its transferability across models, and its scalability in handling multiple simultaneous unlearning requests. Our findings establish SAUCE as an effective and scalable solution for selective concept unlearning in VLMs.
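A minimal sketch of the inference-time intervention pattern follows: an activation is encoded with a sparse autoencoder, the latents previously identified for the target concept are damped, and the result is decoded back. The SAE weights and the concept-feature indices are assumed to be given and are not SAUCE's released artifacts.

```python
# SAE-based selective concept suppression on a single activation vector.
import torch.nn.functional as F

def suppress_concept(h, W_enc, b_enc, W_dec, b_dec, concept_idx, scale=0.0):
    """h: (d,) activation; W_enc: (m, d); W_dec: (d, m); concept_idx: latent ids to damp."""
    z = F.relu(W_enc @ h + b_enc)            # sparse feature vector (m,)
    z[concept_idx] = z[concept_idx] * scale  # suppress concept-aligned features
    return W_dec @ z + b_dec                 # edited activation fed back into the VLM
```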
Poster
Jingjing Jiang · Chao Ma · Xurui Song · Hanwang Zhang · Jun Luo

[ Exhibit Hall I ]

Abstract
Recent advancements in multimodal large language models (MLLMs) have demonstrated exceptional performance in multimodal perception and understanding. However, leading open-source MLLMs exhibit significant limitations in complex and structured reasoning, particularly in tasks requiring deep reasoning for decision-making and problem-solving. In this work, we present Corvid, an MLLM with enhanced chain-of-thought (CoT) reasoning capabilities. Architecturally, Corvid incorporates a hybrid vision encoder for informative visual representation and a meticulously designed connector (GateMixer) to facilitate cross-modal alignment. To enhance Corvid's CoT reasoning capabilities, we introduce MCoT-Instruct-287K, a high-quality multimodal CoT instruction-following dataset, refined and standardized from diverse public reasoning sources. Leveraging this dataset, we fine-tune Corvid with a two-stage CoT-formatted training approach to progressively enhance its step-by-step reasoning abilities. Furthermore, we propose an effective inference-time scaling strategy that enables Corvid to mitigate over-reasoning and under-reasoning through self-verification. Extensive experiments demonstrate that Corvid outperforms existing o1-like MLLMs and state-of-the-art MLLMs with similar parameter scales, with notable strengths in mathematical reasoning and science problem-solving.
Poster
Kai Tong · Kang Pan · Xiao Zhang · Erli Meng · Run He · Yawen Cui · Nuoyan Guo · Huiping Zhuang

[ Exhibit Hall I ]

Abstract
Large Language Models (LLMs) possess encompassing capabilities that can process diverse language-related tasks. However, fine-tuning LLMs diminishes these general skills, and continual fine-tuning further causes severe degradation of accumulated knowledge. Recently, Continual Learning (CL) for LLMs has emerged, which aims to continually adapt the LLMs to new tasks while maintaining previously learned knowledge and inheriting general skills. Existing techniques either leverage previous data to replay, leading to extra computational costs, or utilize a single parameter-efficient module to learn the downstream task, constraining new knowledge absorption with interference between different tasks. To address these issues, this paper proposes Analytic Subspace Routing (ASR). For each task, we isolate the learning within a subspace of deep layers' features via low-rank adaptation, eliminating knowledge interference between different tasks. Additionally, we propose an analytic routing mechanism to properly utilize knowledge learned in different subspaces. Our approach employs Recursive Least Squares to train a multi-task router model, allowing the router to dynamically adapt to incoming data without requiring access to historical data. Also, the router effectively assigns the current task to an appropriate subspace and has a non-forgetting property for previously learned tasks with a solid theoretical guarantee. Experimental …
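For reference, a textbook recursive-least-squares update of the kind such an analytic router can use to absorb new data without revisiting old data is sketched below; the dimensions and initialization constant are illustrative assumptions, not the paper's exact formulation.

```python
# Streaming RLS update: new (feature, subspace-label) pairs refine the router analytically.
import numpy as np

class RLSRouter:
    def __init__(self, feat_dim: int, num_subspaces: int, reg: float = 1.0):
        self.W = np.zeros((feat_dim, num_subspaces))  # routing weights
        self.P = np.eye(feat_dim) / reg               # inverse correlation matrix

    def update(self, x: np.ndarray, y: np.ndarray):
        """x: (feat_dim,) feature; y: (num_subspaces,) one-hot target subspace."""
        Px = self.P @ x
        k = Px / (1.0 + x @ Px)                       # gain vector
        self.W += np.outer(k, y - self.W.T @ x)       # error-driven weight update
        self.P -= np.outer(k, Px)                     # rank-1 downdate, no stored history

    def route(self, x: np.ndarray) -> int:
        return int(np.argmax(self.W.T @ x))
```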
Poster
Yaxin Xiao · Qingqing Ye · Li Hu · Huadi Zheng · Haibo Hu · Zi Liang · Haoyang LI · JIAOYIJIE JIAOYIJIE

[ Exhibit Hall I ]

Abstract
Machine unlearning enables the removal of specific data from ML models to uphold the *right to be forgotten*. While approximate unlearning algorithms offer efficient alternatives to full retraining, this work reveals that they fail to adequately protect the privacy of unlearned data. In particular, these algorithms introduce implicit residuals which facilitate privacy attacks targeting unlearned data. We observe that these residuals persist regardless of model architectures, parameters, and unlearning algorithms, exposing a new attack surface beyond conventional output-based leakage. Based on this insight, we propose the *Reminiscence Attack (ReA)*, which amplifies the correlation between residuals and membership privacy through targeted fine-tuning processes. ReA achieves up to 1.90x and 1.12x higher accuracy than prior attacks when inferring class-wise and sample-wise membership, respectively. To mitigate such residual-induced privacy risk, we develop a dual-phase approximate unlearning framework that first eliminates deep-layer unlearned data traces and then enforces convergence stability to prevent models from "pseudo-convergence", where their outputs are similar to retrained models but still preserve unlearned residuals. Our framework works for both classification and generation tasks. Experimental evaluations confirm that our approach maintains high unlearning efficacy, while reducing the adaptive privacy attack accuracy to nearly random guess, at the computational cost of …
Poster
Tahira Shehzadi · Khurram Azeem Hashmi · Shalini Sarode · Didier Stricker · Muhammad Zeshan Afzal

[ Exhibit Hall I ]

Abstract
This paper addresses key limitations in current Semi-Supervised Object Detection (SSOD) frameworks, focusing on issues related to pseudo-label quality, confidence bias, and inefficient query generation. Traditional methods, including CNN-based and DETR-based architectures, often face challenges such as noisy pseudo-labels, overfitting to common object categories, and consequently face difficulty detecting rare objects. Specifically, recent DETR-based SSOD approaches struggle with the one-to-many assignment strategy, which produces noisy pseudo-labels and overlapping predictions, resulting in suboptimal performance. To address these challenges, we propose STEP-DETR, a transformer-based SSOD framework. STEP-DETR introduces Super Teacher to generate higher-quality pseudo-labels and improve the student’s learning process. Furthermore, STEP-DETR proposes Pseudo-Label Text Queries, which incorporate text embeddings from Super Teacher, balancing the student’s confidence across common and rare categories, thereby mitigating confidence bias and enhancing generalization. Moreover, Denoising Text Guided Object Queries synthesizes query-label pairs for foreground and background using contrastive learning, enabling the model to better distinguish objects from background noise. To further boost performance and training efficiency, a Query Refinement Module is incorporated to filter out redundant denoising queries. On MS-COCO and Pascal VOC benchmarks, STEP-DETR outperforms state-of-the-art methods, demonstrating its effectiveness in improving semi-supervised object detection. Notably, with just 10% labeled data, it achieves 45.4 mAP, …
Poster
Shijie Wang · Jian Shi · Haojie Li

[ Exhibit Hall I ]

Abstract
Existing fine-grained image retrieval (FGIR) methods predominantly rely on supervision from predefined categories to learn discriminative representations for retrieving fine-grained objects. However, they inadvertently introduce category-specific semantics into the retrieval representation, creating semantic dependencies on predefined classes that critically hinder generalization to unseen categories. To tackle this, we propose AdvRF, a novel adversarial reconstruction feedback framework aimed at learning category-agnostic discrepancy representations. Specifically, AdvRF reformulates FGIR as a visual discrepancy reconstruction task via synergizing category-aware discrepancy localization from retrieval models with category-agnostic feature learning from reconstruction models. The reconstruction model exposes residual discrepancies overlooked by the retrieval model, forcing it to improve localization accuracy, while the refined signals from the retrieval model guide the reconstruction model to improve its reconstruction ability. Consequently, the retrieval model localizes visual differences, while the reconstruction model encodes these differences into category-agnostic representations. This representation is then transferred to the retrieval model through knowledge distillation for efficient deployment. Quantitative and qualitative evaluations demonstrate that our AdvRF achieves impressive performance on both widely-used fine-grained and coarse-grained datasets.
Poster
Zhaoxin Yuan · Shuang Yang · Shiguang Shan · Xilin Chen

[ Exhibit Hall I ]

Abstract
Visual Speech Recognition (VSR) aims to infer spoken content by analyzing the speaker’s facial dynamics. While this technology has shown promise, a question naturally arises: Is it sufficient to rely solely on such visual information in complex real-world scenarios? Humans, on the other hand, excel at lip-reading by leveraging information beyond lip movements, such as speech-related background and prior knowledge about the task. Despite this well-recognized human capability, existing approaches have not explored incorporating such \textbf{Peripheral Information} into automatic frameworks. We categorize peripheral information into a hierarchical structure based on its relevance to the spoken content: (1) Content Anchors (e.g., speech topic or description), (2) Task Expertise (task-related background, e.g., human prior lip-reading experiences), and (3) Linguistic Perturbation (irrelevant information that VSR systems should process alongside meaningful signals). To unlock the valuable clues embedded in peripheral information, we propose a novel multi-modal framework that utilizes a large language model (LLM) to decode spoken content while seamlessly integrating peripheral information. Central to our framework is a new adaptation method, Synergy LoRA, which enables a coordinated adaptation of visual and textual inputs. Visual features are processed by an independent module while being guided by semantic cues from peripheral information through a MoE textual adaptation module. It preserves the …
Poster
Chaoyong Yang · Jia-Li Yin · Bin Chen · Zhaozhe Hu · Xiaolei Liu · Wei Lin

[ Exhibit Hall I ]

Abstract
Data-free black-box attacks aim to attack a model without access to either the model parameters or training data. Existing methods use a generator to synthesize training samples and then train a substitute model to imitate the victim model. The adversarial examples (AEs) are finally generated using the substitute model and transferred to the victim model. To this end, how to generate diverse training samples for substitute model training and how to improve the transferability of AEs from the substitute model to the victim model become the core challenges. In this paper, we propose a Knowledge-Orthogonalized Ensemble Attack, dubbed KOEnsAttack, to accomplish these two goals. We first use dual networks as the ensemble substitute model, and then propose a sample hardness enhancement to transform the samples from the generator into hard samples that lie in the controversial regions of the dual models, promoting sample diversity. Next, during the substitute model training, we design a knowledge orthogonalization module to guide the dual networks in learning complementary and useful information from the black-box, thereby enhancing the transferability of adversarial samples generated on the final ensemble model. Extensive experiments on several datasets are conducted to evaluate the effectiveness of our method. The results show that …
Poster
yong zhang · Feng Liang · Guanghu Yuan · Min Yang · Chengming Li · Xiping Hu

[ Exhibit Hall I ]

Abstract
Federated learning (FL) enables collaborative training of a global model in the centralized server with data from multiple parties while preserving privacy. However, data heterogeneity can significantly degrade the performance of the global model when each party uses datasets from different sources to train a local model. Among various cases of data heterogeneity, feature drift, i.e., differences in feature space among parties, is prevalent in real-life data but remains largely unexplored. Feature drift can distract feature extraction learning in clients and thus lead to poor feature extraction and classification performance. To tackle the problem of feature drift in FL, we propose FedPall, an FL framework that utilizes prototype-based adversarial learning to unify feature spaces and collaborative learning to reinforce class information within the features. Moreover, FedPall leverages mixed features generated from global prototypes and local features to enhance the global classifier with classification-relevant information from a global perspective. Evaluation results on three representative feature-drifted datasets demonstrate FedPall's consistently superior performance in classification with feature-drifted data in the FL scenario.
Poster
Munish Monga · Vishal Chudasama · Pankaj Wasnik · Biplab Banerjee

[ Exhibit Hall I ]

Abstract
Real-world object detection systems, such as those in autonomous driving and surveillance, must continuously learn new object categories and simultaneously adapt to changing environmental conditions. Existing approaches, Class Incremental Object Detection (CIOD) and Domain Incremental Object Detection (DIOD)—only address one aspect of this challenge. CIOD struggles in unseen domains, while DIOD suffers from catastrophic forgetting when learning new classes, limiting their real-world applicability. To overcome these limitations, we introduce Dual Incremental Object Detection (DuIOD), a more practical setting that simultaneously handles class and domain shifts in an exemplar-free manner. We propose DuET, a Task Arithmetic-based model merging framework that enables stable incremental learning while mitigating sign conflicts through a novel Directional Consistency Loss. Unlike prior methods, DuET is detector-agnostic, allowing models like YOLO11 and RT-DETR to function as real-time incremental object detectors. To comprehensively evaluate both retention and adaptation, we introduce the Retention-Adaptability Index (RAI), which combines the Average Retention Index (Avg RI) for catastrophic forgetting and the Average Generalization Index for domain adaptability into a common ground. Extensive experiments on the Pascal Series and Diverse Weather Series demonstrate DuET’s effectiveness, achieving a +13.12\% RAI improvement while preserving 89.3\% Avg RI on the Pascal Series (4 tasks), as well as …
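For orientation, plain task-arithmetic merging (the starting point of such Task Arithmetic-based frameworks) is sketched below: task vectors are the differences between fine-tuned and base weights, and the merged model adds their scaled sum back to the base. DuET's Directional Consistency Loss for resolving sign conflicts is not reproduced here; the dictionary layout and scaling factor are assumptions.

```python
# Baseline task-arithmetic merge of several fine-tuned checkpoints into one model.
import numpy as np

def merge_task_vectors(base: dict, finetuned: list, alpha: float = 0.5) -> dict:
    """base / finetuned[i]: {param_name: np.ndarray}; returns merged weights."""
    merged = {}
    for name, w0 in base.items():
        taus = np.stack([ft[name] - w0 for ft in finetuned])  # (num_tasks, ...) task vectors
        merged[name] = w0 + alpha * taus.sum(axis=0)          # scaled sum added to the base
    return merged
```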
Poster
Yuechen Xie · Jie Song · Yicheng Shan · Xiaoyan Zhang · Yuanyu Wan · Shengxuming Zhang · Jiarui Duan · Mingli Song

[ Exhibit Hall I ]

Abstract
High-quality open-source datasets have emerged as a pivotal catalyst driving the swift advancement of deep learning, while facing the looming threat of potential exploitation. Protecting these datasets is of paramount importance for the interests of their owners. The verification of dataset ownership has evolved into a crucial approach in this domain; however, existing verification techniques are predominantly tailored to supervised models and contrastive pre-trained models, rendering them ill-suited for direct application to the increasingly prevalent masked models. In this work, we introduce the inaugural methodology addressing this critical, yet unresolved challenge, termed Dataset Ownership Verification for Masked Modeling (DOV4MM). The central objective is to ascertain whether a suspicious black-box model has been pre-trained on a particular unlabeled dataset, thereby assisting dataset owners in safeguarding their rights. DOV4MM is grounded in our empirical observation that when a model is pre-trained on the target dataset, the difficulty of reconstructing masked information within the embedding space exhibits a marked contrast to models not pre-trained on that dataset. We validated the efficacy of DOV4MM through ten masked image models on ImageNet-1K and four masked language models on WikiText-103. The results demonstrate that DOV4MM rejects the null hypothesis, with a $p$-value considerably below 0.05, surpassing …
Poster
Nairouz Mrabah · Nicolas Richet · Ismail Ayed · Eric Granger

[ Exhibit Hall I ]

Abstract
Adapting Vision-Language Models (VLMs) to new domains with few labeled samples remains a significant challenge due to severe overfitting and computational constraints. State-of-the-art solutions, such as low-rank reparameterization, mitigate these issues but often struggle with generalization and require extensive hyperparameter tuning. In this paper, a novel Sparse Optimization (SO) framework is proposed. Unlike low-rank approaches that typically constrain updates to a fixed subspace, our SO method leverages high sparsity to dynamically adjust very few parameters. We introduce two key paradigms. First, we advocate for \textit{local sparsity and global density}, which updates a minimal subset of parameters per iteration while maintaining overall model expressiveness. As a second paradigm, we advocate for \textit{local randomness and global importance}, which sparsifies the gradient using random selection while pruning the first moment based on importance. This combination significantly mitigates overfitting and ensures stable adaptation in low-data regimes. Extensive experiments on 11 diverse datasets show that SO achieves state-of-the-art few-shot adaptation performance while reducing memory overhead.
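The two paradigms can be caricatured in a simplified optimizer step, shown below: per iteration only a random subset of entries receives a gradient update (local sparsity and randomness), and the running first moment is pruned to its largest-magnitude entries (global importance). The rates and the momentum rule are assumptions, not the paper's exact update.

```python
# Simplified sparse update: random entry-wise gradient mask + importance-pruned momentum.
import torch

@torch.no_grad()
def sparse_step(param, grad, moment, lr=1e-3, beta=0.9, update_rate=0.01, keep_rate=0.05):
    # local randomness: only a random fraction of entries contributes a gradient update
    mask = (torch.rand_like(grad) < update_rate).float()
    moment.mul_(beta).add_((1 - beta) * grad * mask)
    # global importance: keep only the largest-magnitude entries of the first moment
    k = max(1, int(keep_rate * moment.numel()))
    thresh = moment.abs().flatten().topk(k).values.min()
    moment.mul_((moment.abs() >= thresh).float())
    param.add_(-lr * moment)
```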
Poster
Zongyao Xue · Meina Kan · Shiguang Shan · Xilin Chen

[ Exhibit Hall I ]

Abstract
Few-Shot Class-Incremental Learning (FSCIL) focuses on incrementally learning novel classes using only a limited number of samples from novel classes, which faces dual challenges: catastrophic forgetting of previously learned classes and over-fitting to novel classes with few available samples. Recent advances in large pre-trained vision-language models (VLMs), such as CLIP, provide rich feature representations that generalize well across diverse classes. Therefore, freezing the pre-trained backbone and aggregating class features as prototypes becomes an intuitive and effective way to mitigate catastrophic forgetting. However, this strategy fails to address the overfitting challenge, and the prototypes of novel classes exhibit semantic bias due to the few samples per class. To address these limitations, we propose a semantic $\textbf{Feature Decomposition-Recomposition (FDR)}$ method based on VLMs. Firstly, we decompose the CLIP features into semantically distinct segments guided by text keywords from base classes. Then, these segments are adaptively recomposed at the attribute level given text descriptions, forming calibrated prototypes for novel classes. The recomposition process operates linearly at the attribute level but induces nonlinear adjustments across the entire prototype. This fine-grained and non-linear recomposition inherits the generalization capabilities of VLMs and the adaptive recomposition ability of base classes, leading to enhanced performance in FSCIL. Extensive …
Poster
JIACHENG RUAN · Wenzhen Yuan · Xian Gao · Ye Guo · Daoxin Zhang · Zhe Xu · Yao Hu · Ting Liu · yuzhuo fu

[ Exhibit Hall I ]

Abstract
Although large visual-language models (LVLMs) have demonstrated strong performance in multimodal tasks, errors may occasionally arise due to biases during the reasoning process. Recently, reward models (RMs) have become increasingly pivotal in the reasoning process. Specifically, process RMs evaluate each reasoning step, outcome RMs focus on the assessment of reasoning results, and critique RMs perform error analysis on the entire reasoning process, followed by corrections. However, existing benchmarks for vision-language RMs (VLRMs) typically assess only a single aspect of their capabilities (e.g., distinguishing between two answers), thus limiting the all-round evaluation and restricting the development of RMs in the visual-language domain. To address this gap, we propose a comprehensive and challenging benchmark, dubbed VLRMBench, encompassing 12,634 questions. VLRMBench is constructed based on three distinct types of datasets, covering mathematical reasoning, hallucination understanding, and multi-image understanding. We design 12 tasks across three major categories, focusing on evaluating VLRMs in the aspects of process understanding, outcome judgment, and critique generation. Extensive experiments are conducted on 21 open-source models and 5 advanced closed-source models, highlighting the challenges posed by VLRMBench. For instance, in `Forecasting Future', a binary classification task, the advanced GPT-4o achieves only 76.0\% accuracy. Additionally, we perform comprehensive …
Poster
Nam Duong Tran · Nam Nguyen Phuong · Hieu Pham · Phi Le Nguyen · My Thai

[ Exhibit Hall I ]

Abstract
Deep neural networks often suffer performance drops when test data distribution differs from training data. Domain Generalization (DG) aims to address this by focusing on domain-invariant features or augmenting data for greater diversity. However, these methods often struggle with limited training domains or significant gaps between seen (training) and unseen (test) domains. To enhance DG robustness, we hypothesize that it is essential for the model to be trained on data from domains that closely resemble unseen test domains—an inherently difficult task due to the absence of prior knowledge about the unseen domains. Accordingly, we propose ConstStyle, a novel approach that leverages a unified domain to capture domain-invariant features and bridge the domain gap with theoretical analysis. During training, all samples are mapped onto this unified domain, optimized for seen domains. During testing, unseen domain samples are projected similarly before predictions. By aligning both training and testing data within this unified domain, ConstStyle effectively reduces the impact of domain shifts, even with large domain gaps or few seen domains. Extensive experiments demonstrate that ConstStyle consistently outperforms existing methods across diverse scenarios. Notably, when only a limited number of seen domains are available, ConstStyle can boost accuracy up to 19.82\% compared to …
Poster
Zachary Yahn · Selim Tekin · Fatih Ilhan · Sihao Hu · Tiansheng Huang · Yichang Xu · Margaret Loper · Ling Liu

[ Exhibit Hall I ]

Abstract
Adversarial perturbations are useful tools for exposing vulnerabilities in neural networks. Existing adversarial perturbation methods for object detection are either limited to attacking regression-based detectors or weak against transformer-based detectors. This paper presents an Attention-Focused Offensive Gradient (AFOG) attack against object detection transformers. By design, AFOG is neural-architecture agnostic and effective for attacking both large transformer-based object detectors and conventional regression-based detectors with a unified adversarial attention framework. This paper makes three original contributions. First, AFOG utilizes a learnable attention mechanism that focuses perturbations on vulnerable image regions in multi-box detection tasks, increasing performance over non-attention baselines by up to 30.6%. Second, AFOG's attack loss is formulated by integrating two types of feature loss through learnable attention updates with iterative injection of adversarial perturbations. Finally, AFOG is an efficient and stealthy adversarial perturbation method. It probes the weak spots of detection transformers by adding strategically generated and visually imperceptible perturbations which can cause well-trained object detection models to fail. Extensive experiments conducted with twelve large detection transformers on COCO demonstrate the efficacy of AFOG. Our empirical results also show that AFOG outperforms existing attacks on transformer-based and regression-based object detectors by up to 83% with superior speed and imperceptibility. Code …
Poster
Hossein Mirzaei · Zeinab Taghavi · Sepehr Rezaee · Masoud Hadi · Moein Madadi · Mackenzie Mathis

[ Exhibit Hall I ]

Abstract
Deep neural networks have demonstrated remarkable success across numerous tasks, yet they remain vulnerable to trojan (backdoor) attacks, raising serious concerns about their safety in real-world mission-critical applications. A common countermeasure is trigger inversion -- reconstructing malicious "shortcut" patterns (triggers) inserted by an adversary during training. Current trigger-inversion methods typically search the full pixel space under specific assumptions but offer no assurances that the estimated trigger is more than an adversarial perturbation that flips the model output. Here, we propose DISTIL, a data-free, zero-shot trigger-inversion strategy that restricts the search space while avoiding strong assumptions on trigger appearance. Specifically, we incorporate a diffusion-based generator guided by the target classifier; through iterative generation, we produce candidate triggers that align with the internal representations the model relies on for malicious behavior. Empirical evaluations, both quantitative and qualitative, show that our approach reconstructs triggers that effectively distinguish clean versus trojaned models. DISTIL surpasses alternative methods by high margins, achieving up to **7.1%** higher accuracy on the BackdoorBench dataset and a **9.4%** improvement on trojaned object detection model scanning, offering a promising new direction for reliable backdoor defense **without** reliance on extensive data or strong prior assumptions about triggers.
Poster
Jens U. Kreber · Joerg Stueckler

[ Exhibit Hall I ]

Abstract
Articulated objects are an important type of interactable objects in everyday environments. In this paper, we propose a novel diffusion model-based approach for generating articulated objects that aligns them with partial point clouds and improves their physical plausibility. The model represents part shapes by signed distance functions (SDFs). We guide the reverse diffusion process using a point cloud alignment loss computed using the predicted SDFs. Additionally, we impose non-penetration and mobility constraints based on the part SDFs for guiding the model to generate more physically plausible objects. We also make our diffusion approach category-aware to further improve point cloud alignment if category information is available. We evaluate the generative ability and constraint consistency of samples generated with our approach using the PartNet-Mobility dataset. We also compare our approach with an unguided baseline diffusion model and demonstrate that our method can improve constraint consistency and provides a tradeoff with generative ability.
Poster
Vittorio Pipoli · Alessia Saporita · Federico Bolelli · Marcella Cornia · Lorenzo Baraldi · Costantino Grana · Rita Cucchiara · Elisa Ficarra

[ Exhibit Hall I ]

Abstract
Recently, Multimodal Large Language Models (MLLMs) have emerged as a leading framework for enhancing the ability of Large Language Models (LLMs) to interpret non-linguistic modalities. Despite their impressive capabilities, the robustness of MLLMs under conditions where one or more modalities are missing remains largely unexplored. In this paper, we investigate the extent to which MLLMs can maintain performance when faced with missing modality inputs. Moreover, we propose a novel framework to mitigate the aforementioned issue called Retrieval-Augmented Generation for missing modalities (MissRAG). It consists of a novel multimodal RAG technique alongside a tailored prompt engineering strategy designed to enhance model robustness by mitigating the impact of absent modalities while preventing the burden of additional instruction tuning. To demonstrate the effectiveness of our techniques, we conducted comprehensive evaluations across five diverse datasets, covering tasks such as audio-visual question answering, audio-visual captioning, and multimodal sentiment analysis. Our source code is available at https://anonymous.4open.science/r/MM_MLLM-1536
Poster
Zifu Wan · Ce Zhang · Silong Yong · Martin Ma · Simon Stepputtis · Louis-Philippe Morency · Deva Ramanan · Katia Sycara · Yaqi Xie

[ Exhibit Hall I ]

Abstract
Recent Large Vision-Language Models (LVLMs) have introduced a new paradigm for understanding and reasoning about image input through textual responses. Although they have achieved remarkable performance across a range of multi-modal tasks, they face the persistent challenge of hallucination, which introduces practical weaknesses and raises concerns about their reliable deployment in real-world applications. Existing work has explored contrastive decoding approaches to mitigate this issue, where the output of the original LVLM is compared and contrasted with that of a perturbed version. However, these methods require two or more queries that slow down LVLM response generation, making them less suitable for real-time applications. To overcome this limitation, we propose ONLY, a training-free decoding approach that requires only a single query and a one-layer intervention during decoding, enabling efficient real-time deployment. Specifically, we enhance textual outputs by selectively amplifying crucial textual information using a text-to-visual entropy ratio for each token. Extensive experimental results demonstrate that our ONLY approach consistently outperforms state-of-the-art methods across various benchmarks while requiring minimal implementation effort and computational cost.
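A simplified sketch of scaling a token's logits by a text-to-visual entropy ratio is given below; how the two next-token distributions are obtained inside the LVLM, and the direction and strength of the amplification, are assumptions rather than the paper's exact single-query intervention.

```python
# Per-token logit adjustment driven by the ratio of text-only to vision-conditioned entropy.
import torch
import torch.nn.functional as F

def entropy(logits: torch.Tensor) -> torch.Tensor:
    p = F.softmax(logits, dim=-1)
    return -(p * torch.log(p + 1e-9)).sum(dim=-1)

def amplify_by_entropy_ratio(logits_full, logits_text_only, alpha=1.0):
    """Both inputs: (vocab,) next-token logits; returns adjusted logits."""
    ratio = entropy(logits_text_only) / (entropy(logits_full) + 1e-9)
    return logits_full + alpha * ratio * (logits_text_only - logits_full)
```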
Poster
Yongkang Zhang · Dongyu She · Zhong Zhou

[ Exhibit Hall I ]

Abstract
Out-of-distribution (OOD) detection aims to distinguish whether detected objects belong to known categories or not. Existing methods extract OOD samples from In-distribution (ID) data to regularize the model’s decision boundaries. However, the decision boundaries are not adequately regularized due to the model's lack of knowledge about the distribution of OOD data. To address the above issue, we propose an Adaptive Prompt Learning framework via Gaussian Outlier Synthesis (APLGOS) for OOD detection. Specifically, we leverage the Vision-Language Model (VLM) to initialize learnable ID prompts by sampling standardized results from pre-defined Q\&A pairs. Region-level prompts are synthesised in low-likelihood regions of class-conditional Gaussian distributions. These prompts are then utilized to initialize learnable OOD prompts and optimized with adaptive prompt learning. Also, OOD pseudo-samples are synthesised via Gaussian outlier synthesis. The similarity score between prompts and images is utilized to calculate the contrastive learning loss in high-dimensional hidden space. The aforementioned methodology regularizes the model to learn more compact decision boundaries for ID and OOD categories. Extensive experiments show that our proposed method achieves state-of-the-art performance with less ID data on four mainstream datasets.
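The general recipe of Gaussian outlier synthesis can be sketched as follows: fit a class-conditional Gaussian to ID features, draw candidates, and keep only those falling in the low-likelihood tail as synthetic outliers. The likelihood quantile and regularization term are illustrative assumptions.

```python
# Sample synthetic outliers from the low-likelihood region of a class-conditional Gaussian.
import numpy as np
from scipy.stats import multivariate_normal

def synthesize_outliers(class_feats: np.ndarray, n_candidates=1000, low_quantile=0.05):
    """class_feats: (N, d) ID features of one class -> synthetic outliers (subset of candidates)."""
    mu = class_feats.mean(axis=0)
    cov = np.cov(class_feats, rowvar=False) + 1e-4 * np.eye(class_feats.shape[1])
    dist = multivariate_normal(mean=mu, cov=cov)
    candidates = dist.rvs(size=n_candidates)
    logp = dist.logpdf(candidates)
    cutoff = np.quantile(logp, low_quantile)
    return candidates[logp <= cutoff]   # keep only the low-likelihood tail as outliers
```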
Poster
Zichen Tang · Haihong E · Jiacheng Liu · Zhongjun Yang · Rongjin Li · Zihua Rong · Haoyang He · Zhuodi Hao · Xinyang Hu · Kun Ji · Ziyan Ma · Mengyuan Ji · Jun Zhang · Chenghao Ma · Qianhe Zheng · Yang Liu · Yiling Huang · Xinyi Hu · Qing Huang · Zijian Xie · Shiyao Peng

[ Exhibit Hall I ]

Abstract
We present FinMMR, a novel bilingual multimodal benchmark tailored to evaluate the reasoning capabilities of multimodal large language models (MLLMs) in financial numerical reasoning tasks. Compared to existing benchmarks, our work introduces three significant advancements. (1) Multimodality: We meticulously transform existing financial reasoning datasets, and construct novel questions from the latest Chinese financial research reports. The dataset comprises 4.3K questions and 8.7K images spanning 14 categories, including tables, bar charts, and ownership structure charts. (2) Comprehensiveness: FinMMR encompasses 14 financial subdomains, including corporate finance, banking, and industry analysis, significantly exceeding existing benchmarks in financial domain knowledge breadth. (3) Challenge: Models are required to perform multi-step precise numerical reasoning by integrating financial knowledge with the understanding of complex financial images and text. The best-performing MLLM achieves only 51.4\% accuracy on Hard problems. We believe that FinMMR will drive advancements in enhancing the reasoning capabilities of MLLMs in real-world scenarios.
Poster
Viraj Prabhu · Senthil Purushwalkam · An Yan · Caiming Xiong · Ran Xu

[ Exhibit Hall I ]

Abstract
Vision-Language Models (VLMs) frequently hallucinate responses to visual queries, undermining their reliability for critical applications. However, quantifying the effect of such hallucinations in free-form responses to open-ended queries requires visually verifying each claim within the response, which is highly challenging. We propose Programmatic VLM Evaluation (PROVE), a new benchmarking paradigm for evaluating VLM responses to open-ended queries. To construct PROVE, we provide a large language model with a high-fidelity scene-graph representation constructed from a detailed image caption, and prompt it to generate i) diverse and challenging question-answer (QA) pairs that test a range of image understanding capabilities, and ii) programs that can be executed over the scene graph object to verify each QA pair. We thus construct a benchmark of 10.6k challenging but grounded visual QA pairs. Next, we propose a scene graph-based evaluation framework to programmatically measure both the helpfulness and truthfulness of a free-form model response without relying on subjective LLM judgments. We extensively benchmark a range of VLMs on PROVE, and uncover a concerning tradeoff where models that provide more helpful responses often hallucinate more, whereas truthful models tend to be less informative. PROVE serves as a foundation for developing next-generation VLMs that balance helpfulness with truthfulness. …
Poster
Shuai Tan · Bill Gong · Bin Ji · Ye Pan

[ Exhibit Hall I ]

Abstract
Talking head generation is gaining significant importance across various domains, with a growing demand for high-quality rendering. However, existing methods often suffer from identity leakage (IL) and rendering artifacts (RA), particularly in extreme cases. Through an in-depth analysis of previous approaches, we identify two key insights: (1) IL arises from identity information embedded within motion features, and (2) this identity information can be leveraged to address RA. Building on these findings, this paper introduces FixTalk, a novel framework designed to simultaneously resolve both issues for high-quality talking head generation. Firstly, we propose an **Enhanced Motion Indicator (EMI)** to effectively decouple identity information from motion features, mitigating the impact of IL on generated talking heads. To address RA, we introduce an **Enhanced Detail Indicator (EDI)**, which utilizes the leaked identity information to supplement missing details, thus fixing the artifacts. Extensive experiments demonstrate that FixTalk effectively mitigates IL and RA, achieving superior performance compared to state-of-the-art methods.
Poster
Jonathan Ventura · Viktor Larsson · Fredrik Kahl

[ Exhibit Hall I ]

Abstract
Spherical motion is a special case of camera motion where the camera moves on the imaginary surface of a sphere with the optical axis normal to the surface. Common sources of spherical motion are a person capturing a stereo panorama with a phone held in an outstretched hand, or a hemi-spherical camera rig used for multi-view scene capture. However, traditional structure-from-motion pipelines tend to fail on spherical camera motion sequences, especially when the camera is facing outward. Building upon prior work addressing the calibrated case, we explore uncalibrated reconstruction from spherical motion, assuming a fixed but unknown focal length parameter. We show that, although two-view spherical motion is always a critical case, self-calibration is possible from three or more views. Through analysis of the relationship between focal length and spherical relative pose, we devise a global structure-from-motion approach for uncalibrated reconstruction. We demonstrate the effectiveness of our approach on real-world captures in various settings, even when the camera motion deviates from perfect spherical motion.
Poster
Hang Xu · Jie Huang · Linjiang Huang · Dong Li · Yidi Liu · Feng Zhao

[ Exhibit Hall I ]

Abstract
Domain Adaptation (DA) for dense prediction tasks is an important topic, which enhances the dense prediction model's performance when tested on its unseen domain. Recently, with the development of Diffusion-based Dense Prediction (DDP) models, DA designs tailored to this framework are worth exploring, since the diffusion model is effective in modeling the distribution transformation that comprises domain information. In this work, we propose a training-free mechanism for DDP frameworks, endowing them with DA capabilities. Our motivation arises from the observation that the exposure bias (e.g., noise statistics bias) in diffusion brings domain shift, and different domains in conditions of DDP models can also be effectively captured by the noise prediction statistics. Based on this, we propose a training-free Domain Noise Alignment (DNA) approach, which alleviates the variations of noise statistics to domain changes during the diffusion sampling process, thereby achieving domain adaptation. Specifically, when the source domain is available, we directly adopt the DNA method to achieve domain adaptation by aligning the noise statistics of the target domain with those of the source domain. For the more challenging source-free DA, inspired by the observation that regions closer to the source domain exhibit higher confidence meeting variations of sampling …
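A minimal sketch of the alignment step is shown below: the predicted noise for a target-domain sample is re-standardized to match reference per-channel statistics collected on the source domain. The per-channel granularity and the reference statistics are assumptions for illustration.

```python
# Re-standardize predicted noise so its statistics match source-domain references.
import torch

def align_noise_stats(eps_pred: torch.Tensor, ref_mean: torch.Tensor, ref_std: torch.Tensor):
    """eps_pred: (B, C, H, W) predicted noise; ref_mean / ref_std: (C,) source-domain stats."""
    mean = eps_pred.mean(dim=(0, 2, 3), keepdim=True)
    std = eps_pred.std(dim=(0, 2, 3), keepdim=True) + 1e-6
    eps_norm = (eps_pred - mean) / std
    return eps_norm * ref_std.view(1, -1, 1, 1) + ref_mean.view(1, -1, 1, 1)
```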
Poster
Weinan He · Yixin Zhang · Zilei Wang

[ Exhibit Hall I ]

Abstract
Large-scale pre-trained Vision-Language Models (VLMs) like CLIP have demonstrated promising zero-shot transfer capabilities to downstream tasks. However, their performance deteriorates when facing significant domain shifts. In this paper, we focus on cost-effective adaptation of large-scale pre-trained VLMs to unlabeled target domains. In this context, two prevalent paradigms show inherent limitations: Unsupervised Fine-Tuning (UFT) struggles with poor initial model performance, while Unsupervised Domain Adaptation (UDA) may suffer from adverse effects of inappropriate auxiliary source domain. To alleviate these limitations, we propose to adaptively construct more suitable auxiliary data from large-scale image-text pairs to facilitate unsupervised adaptation without any human annotations. Specifically, we introduce Progressive Distribution Bridging (PDB), which decomposes the challenging adaptation task into multiple simple steps through the construction of auxiliary data. To obtain such data, we design an efficient and controllable retrieval algorithm incorporating cascaded semantic filters and style controller to regulate the semantic category and domain style of retrieved data, respectively. Experimental results across 11 different domains from three standard UDA benchmarks demonstrate the effectiveness of our auxiliary data. Notably, on Office-Home, our method outperforms state-of-the-art UDA methods that rely on labeled source domains. The proposed method offers a more universal and cost-effective solution for adapting VLMs to …
Poster
Xinzi Cao · Ke Chen · Feidiao Yang · Xiawu Zheng · Yutong Lu · Yonghong Tian

[ Exhibit Hall I ]

Abstract
Generalized Category Discovery (GCD) aims to identify both known and novel categories in unlabeled data by leveraging knowledge from labeled datasets. Current methods employ contrastive learning on labeled data to capture known category structures but neglect unlabeled data, limiting their effectiveness in classifying novel classes, especially in fine-grained open-set detection where subtle class differences are crucial. To address this issue, we propose a novel learning approach, **AllGCD**, which seamlessly integrates \textbf{all} unlabeled data into contrastive learning to enhance the discrimination of novel classes. Specifically, we introduce two key techniques: Intra-class Contrast in Labeled Data (Intra-CL) and Inter-class Contrast in Unlabeled Data (Inter-CU). Intra-CL first refines intra-class compactness within known categories by integrating potential known samples into labeled data. This process refines the decision boundaries of known categories, reducing ambiguity when distinguishing novel categories. Building on this, Inter-CU further strengthens inter-class separation between known and novel categories by applying global contrastive learning to the class distribution in the unlabeled data. By jointly leveraging Intra-CL and Inter-CU, AllGCD effectively improves both intra-class compactness and inter-class separation, effectively enhancing the discriminability between known and novel classes. Experiments demonstrate that AllGCD significantly improves novel class accuracy, e.g., achieving increases of 7.4% on CUB and …
Poster
Yuyang Yang · Wen Li · Sheng Ao · Qingshan Xu · Shangshu Yu · guo yu · Yin Zhou · Siqi Shen · Cheng Wang

[ Exhibit Hall I ]

Abstract
LiDAR localization is a fundamental task in autonomous driving and robotics. Scene Coordinate Regression (SCR) exhibits leading pose accuracy, achieving impressive results in learning-based localization. We observe that real-world LiDAR scans captured from different viewpoints usually result in the catastrophic collapse of SCR. However, existing LiDAR localization methods have largely overlooked the issue of rotation sensitivity in SCR. In this paper, we present RALoc, an outdoor LiDAR localization method with rotation awareness to achieve accurate localization. The key to our approach is to design a Point Cloud Canonicalization module, which leverages a powerful equivariant key feature aggregation to transform the input LiDAR scan towards a consistent orientation, effectively eliminating the adverse effects of rotation. The proposed module has promising scalability and can be seamlessly integrated with existing LiDAR localization networks. Moreover, we propose the **Bi**directional **Li**DAR **Lo**calization (BiLiLo) dataset as a benchmark to evaluate the performance of various methods in large outdoor scenes with significant rotation changes. Extensive experiments show that RALoc significantly improves localization performance in scenarios with large rotation changes, and also achieves competitive performance on the Oxford Radar RobotCar dataset. Our code and dataset will be released upon acceptance.
Poster
Da-Wei Zhou · Kai-Wen Li · Jingyi Ning · Han-Jia Ye · Lijun Zhang · De-Chuan Zhan

[ Exhibit Hall I ]

Abstract
Class-Incremental Learning (CIL) enables learning systems to continuously adapt to evolving data streams. With the advancement of pre-training, leveraging pre-trained vision-language models (e.g., CLIP) offers a promising starting point for CIL. However, CLIP makes decisions by matching visual embeddings to class names, overlooking the rich contextual information conveyed through language. For instance, the concept of "cat" can be decomposed into features like tail, fur, and face for recognition. Besides, since the model is continually updated, these detailed features are overwritten in CIL, requiring external knowledge for compensation. In this paper, we introduce ExterNal knowledGe INjEction (ENGINE) for CLIP-based CIL. To enhance knowledge transfer from outside the dataset, we propose a dual-branch injection tuning framework that encodes informative knowledge from both visual and textual modalities. The visual branch is enhanced with data augmentation to enrich the visual features, while the textual branch leverages GPT-4 to rewrite discriminative descriptors. In addition to this on-the-fly knowledge injection, we also implement post-tuning knowledge by re-ranking the prediction results during inference. With the injected knowledge, the model can better capture informative features for downstream tasks as data evolves. Extensive experiments demonstrate the state-of-the-art performance of ENGINE.
Poster
Kuangpu Guo · Lijun Sheng · Yongcan Yu · Jian Liang · Zilei Wang · Ran He

[ Exhibit Hall I ]

Abstract
Unsupervised federated learning (UFL) aims to collaboratively train a global model across distributed clients without data sharing and label information.Previous UFL works have predominantly focused on representation learning and clustering tasks.Recently, vision language models (e.g., CLIP) have gained significant attention for their attractive zero-shot prediction capabilities.Leveraging this advancement, classification problems that were previously infeasible under the UFL paradigm now present new opportunities but remain largely unexplored.In this paper, we extend UFL to the classification problem with CLIP for the first time and propose a novel method, **Fed**erated **Co**operative **P**seudo **L**abeling (**FedCoPL**). Specifically, clients estimate and upload their pseudo label distribution, and the server adjusts and redistributes them to avoid global imbalance among categories.Moreover, we introduce a partial prompt aggregation protocol for effective collaboration and personalization.In particular, visual prompts containing general image features are aggregated at the server, while text prompts encoding personalized knowledge are retained locally.Extensive experiments on six datasets demonstrate the superior performance of our FedCoPL compared to baseline methods.Our code is available in the supplementary materials.
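A rough sketch of the partial prompt aggregation protocol described above, assuming each client holds a visual prompt tensor and a text prompt tensor; the names and shapes are illustrative assumptions, not FedCoPL's actual code.

```python
# Sketch only: aggregate visual prompts at the server, keep text prompts local.
import torch

def server_aggregate_visual_prompts(client_visual_prompts: list) -> torch.Tensor:
    """Average the visual prompt tensors uploaded by all clients."""
    return torch.stack(client_visual_prompts, dim=0).mean(dim=0)

def client_update(local_prompts: dict, global_visual_prompt: torch.Tensor) -> dict:
    """Adopt the aggregated visual prompt; leave the personalized text prompt untouched."""
    updated = dict(local_prompts)
    updated["visual"] = global_visual_prompt.clone()
    # updated["text"] stays local (personalization)
    return updated
```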
Poster
Chuyan Zhang · Kefan Wang · Yun Gu

[ Exhibit Hall I ]

Abstract
Low-Rank Adaptation (LoRA) has proven effective in reducing computational costs while maintaining performance comparable to fully fine-tuned foundation models across various tasks. However, its fixed low-rank structure restricts its adaptability in scenarios with substantial domain gaps, where higher ranks are often required to capture domain-specific complexities. Current adaptive LoRA methods attempt to overcome this limitation by dynamically expanding or selectively allocating ranks, but these approaches frequently depend on computationally intensive techniques such as iterative pruning, rank searches, or additional regularization. To address these challenges, we introduce Stable Rank-Guided Low-Rank Adaptation (SR-LoRA), a novel framework that utilizes the stable rank of pre-trained weight matrices as a natural prior for layer-wise rank allocation. By leveraging the stable rank, which reflects the intrinsic dimensionality of the weights, SR-LoRA enables a principled and efficient redistribution of ranks across layers, enhancing adaptability without incurring additional search costs. Empirical evaluations on few-shot tasks with significant domain gaps show that SR-LoRA consistently outperforms recent adaptive LoRA variants, achieving a superior trade-off between performance and efficiency. Our code is available at https://anonymous.4open.science/r/SR-LoRA-A18F.
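A brief sketch of how a stable-rank prior could drive layer-wise LoRA rank allocation in the spirit of SR-LoRA; the proportional allocation rule and the layer dictionary are illustrative assumptions, not the released implementation.

```python
# Sketch only: stable rank = ||W||_F^2 / ||W||_2^2, used to split a rank budget.
import torch

def stable_rank(W: torch.Tensor) -> float:
    """Squared Frobenius norm over squared spectral norm of a weight matrix."""
    fro_sq = (W ** 2).sum()
    spec = torch.linalg.matrix_norm(W, ord=2)  # largest singular value
    return float(fro_sq / (spec ** 2))

def allocate_ranks(layer_weights: dict, total_rank_budget: int, min_rank: int = 1) -> dict:
    """Distribute a total LoRA rank budget proportionally to each layer's stable rank."""
    srs = {name: stable_rank(W) for name, W in layer_weights.items()}
    total = sum(srs.values())
    return {name: max(min_rank, round(total_rank_budget * sr / total))
            for name, sr in srs.items()}

# Example with random "pre-trained" matrices sharing a budget of 16 ranks:
layers = {"attn.q_proj": torch.randn(768, 768), "mlp.fc1": torch.randn(3072, 768)}
print(allocate_ranks(layers, total_rank_budget=16))
```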
Poster
Zhi Chen · Zecheng Zhao · Jingcai Guo · Jingjing Li · Zi Huang

[ Exhibit Hall I ]

Abstract
Zero-shot learning (ZSL) aims to recognize unseen classes without labeled training examples by leveraging class-level semantic descriptors such as attributes. A fundamental challenge in ZSL is semantic misalignment, where semantic-unrelated information involved in visual features introduces ambiguity into visual-semantic interaction. Unlike existing methods that suppress semantic-unrelated information post hoc either in the feature space or the model space, we propose addressing this issue at the input stage, preventing semantic-unrelated patches from propagating through the network. To this end, we introduce Semantically contextualized VIsual Patches (SVIP) for ZSL, a transformer-based framework designed to enhance visual-semantic alignment. Specifically, we propose a self-supervised patch selection mechanism that preemptively learns to identify semantic-unrelated patches in the input space. This is trained with supervision from aggregated attention scores across all transformer layers, which estimate each patch’s semantic score. As removing semantic-unrelated patches from the input sequence may disrupt object structure, we replace them with learnable patch embeddings. With initialization from word embeddings, we can ensure they remain semantically meaningful throughout feature extraction. Extensive experiments on ZSL benchmarks demonstrate that SVIP achieves state-of-the-art performance results while providing more interpretable and semantically rich feature representations.
Poster
Junjia Huang · Pengxiang Yan · Jinhang Cai · Jiyang Liu · Zhao Wang · Yitong Wang · Xinglong Wu · Guanbin Li

[ Exhibit Hall I ]

Abstract
Text-driven image generation using diffusion models has recently gained significant attention. To enable more flexible image manipulation and editing, recent research has expanded from single image generation to transparent layer generation and multi-layer compositions. However, existing approaches often fail to provide a thorough exploration of multi-layer structures, leading to inconsistent inter-layer interactions, such as occlusion relationships, spatial layout, and shadowing. In this paper, we introduce DreamLayer, a novel framework that enables coherent text-driven generation of multiple image layers, by explicitly modeling the relationship between transparent foreground and background layers. DreamLayer incorporates three key components, i.e., Context-Aware Cross-Attention (CACA) for global-local information exchange, Layer-Shared Self-Attention (LSSA) for establishing robust inter-layer connections, and Information Retained Harmonization (IRH) for refining fusion details at the latent level. By leveraging a coherent full-image context, DreamLayer builds inter-layer connections through attention mechanisms and applies a harmonization step to achieve seamless layer fusion. To facilitate research in multi-layer generation, we construct a high-quality, diverse multi-layer dataset including 400k samples. Extensive experiments and user studies demonstrate that DreamLayer generates more coherent and well-aligned layers, with broad applicability, including latent-space image editing and image-to-layer decomposition.
Poster
Xiaoyu Zhou · Jingqi Wang · Yongtao Wang · Yufei Wei · Nan Dong · Ming-Hsuan Yang

[ Exhibit Hall I ]

Abstract
Obtaining high-quality 3D semantic occupancy from raw sensor data remains an essential yet challenging task, often requiring extensive manual labeling. In this work, we propose AutoOcc, a vision-centric automated pipeline for open-ended semantic occupancy annotation that integrates differentiable Gaussian splatting guided by vision-language models. We formulate the open-ended semantic occupancy reconstruction task to automatically generate scene occupancy by combining attention maps from vision-language models and foundation vision models. We devise semantic-aware Gaussians as intermediate geometric descriptors and propose a cumulative Gaussian-to-voxel splatting algorithm that enables effective and efficient occupancy annotation. Our framework outperforms existing automated occupancy annotation methods without human labels. AutoOcc also enables open-ended semantic occupancy auto-labeling, achieving robust performance in both static and dynamically complex scenarios. All the source codes and trained models will be released.
Poster
Fucai Ke · Vijay Kumar b g · Xingjian Leng · Zhixi Cai · Zaid Khan · Weiqing Wang · Pari Delir Haghighi · Hamid Rezatofighi · Manmohan Chandraker

[ Exhibit Hall I ]

Abstract
Visual reasoning (VR), which is crucial in many fields for enabling human-like visual understanding, remains highly challenging. Recently, compositional visual reasoning approaches, which leverage the reasoning abilities of large language models (LLMs) with integrated tools to solve problems, have shown promise as more effective strategies than end-to-end VR methods. However, these approaches face limitations, as frozen LLMs lack tool awareness in VR, leading to performance bottlenecks. While leveraging LLMs for reasoning is widely used in other domains, they are not directly applicable to VR due to limited training data, imperfect tools that introduce errors and reduce data collection efficiency in VR, and challenges in fine-tuning on noisy workflows. To address these challenges, we propose DWIM: i) Discrepancy-aware training Workflow generation, which assesses tool usage and extracts more viable workflows for training; and ii) Instruct-Masking fine-tuning, which guides the model to only clone effective actions, enabling the generation of more practical solutions. Our experiments demonstrate that DWIM achieves state-of-the-art performance across various VR tasks, exhibiting strong generalization on multiple widely-used datasets.
Poster
Hao Chen · Shell Xu Hu · Wayne Luk · Timothy Hospedales · Hongxiang Fan

[ Exhibit Hall I ]

Abstract
Model merging has emerged as a promising approach for multi-task learning (MTL) in large language models (LLMs), providing a training- and data-efficient alternative to conventional fine-tuning. However, with the rapid development of the open-source AI ecosystem and the increasing availability of fine-tuned LLMs, existing model merging methods face two key limitations: (i) they are primarily designed for in-house fine-tuned models, making them less adaptable to diverse model sources with partially unknown model and task information, (ii) they struggle to scale effectively when merging numerous model checkpoints. To address these challenges, we formulate model merging as a constrained optimization problem and introduce a novel approach: Frank-Wolfe Merging (FW-Merging). Inspired by Frank-Wolfe optimization, our approach iteratively selects the most relevant model parameters to minimize a linear approximation of the objective function, merging them through a predefined merging function. The objective function is designed to capture the desired behavior of the target merged model, while the fine-tuned candidate models define the constraint set. More importantly, FW-Merging serves as an orthogonal technique to existing merging methods, seamlessly integrating with them to further enhance performance. Our experiments show that FW-Merging scales across diverse model sources, remaining stable with 16 irrelevant models and improving by 15.3% with 16 …
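A compact sketch of a Frank-Wolfe-style merging loop as described above, assuming flattened parameter vectors and a calibration loss `loss_fn`; the classic step-size schedule and the simple linear minimization oracle shown here are illustrative, not the authors' exact formulation.

```python
# Sketch only: Frank-Wolfe iterations over a set of candidate model parameter vectors.
import torch

def fw_merge(candidates: list, loss_fn, num_iters: int = 10) -> torch.Tensor:
    """candidates: flattened parameter tensors; loss_fn: params -> scalar loss tensor."""
    merged = candidates[0].clone()
    for t in range(num_iters):
        merged.requires_grad_(True)
        grad = torch.autograd.grad(loss_fn(merged), merged)[0]
        merged = merged.detach()
        # Linear minimization oracle: pick the candidate most aligned with -grad.
        scores = [torch.dot(grad, c).item() for c in candidates]
        best = candidates[min(range(len(candidates)), key=lambda i: scores[i])]
        gamma = 2.0 / (t + 2.0)  # classic Frank-Wolfe step size
        merged = (1 - gamma) * merged + gamma * best
    return merged
```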
Poster
Zhengxuan Wei · Jiajin Tang · Sibei Yang

[ Exhibit Hall I ]

Abstract
Existing Moment Retrieval methods face three critical bottlenecks: (1) data scarcity forces models into shallow keyword-feature associations; (2) boundary ambiguity in transition regions between adjacent events; (3) insufficient discrimination of fine-grained semantics (e.g., distinguishing "kicking" vs. "throwing" a ball). In this paper, we propose a zero-external-dependency Augmented Moment Retrieval framework, AMR, designed to overcome local optima caused by insufficient data annotations and the lack of robust boundary and semantic discrimination capabilities. AMR is built upon two key insights: (1) it resolves ambiguous boundary information and semantic confusion in existing annotations without additional data (avoiding costly manual labeling), and (2) it preserves boundary and semantic discriminative capabilities enhanced by training while generalizing to real-world scenarios, significantly improving performance. Furthermore, we propose a two-stage training framework with cold-start and distillation adaptation. The cold-start stage employs curriculum learning on augmented data to build foundational boundary/semantic awareness. The distillation stage introduces dual query sets: Original Queries maintain DETR-based localization using frozen Base Queries from the cold-start model, while Active Queries dynamically adapt to real-data distributions. A cross-stage distillation loss enforces consistency between Original and Base Queries, preventing knowledge forgetting while enabling real-world generalization. Experiments on multiple benchmarks show that AMR achieves improved performance over …
Poster
Zedong Wang · Siyuan Li · Dan Xu

[ Exhibit Hall I ]

Abstract
Despite the promise of Multi-Task Learning (MTL) in leveraging complementary knowledge across tasks, existing multi-task optimization (MTO) techniques remain fixated on resolving conflicts through optimizer-centric loss scaling and gradient manipulation, yet fail to deliver consistent gains. In this paper, we argue that the shared representation space, where task interactions naturally occur, offers rich information and potential for operations complementary to existing optimizer designs, especially for facilitating the inter-task complementarity, which is rarely explored in MTO. This intuition leads to Rep-MTL, which exploits the representation-level task saliency to quantify interactions between task-specific optimization and shared representation learning. By steering these saliencies through entropy-based penalization and sample-wise cross-task alignment, Rep-MTL aims to mitigate negative transfer by maintaining the effective training of individual tasks instead of pure conflict-solving, while explicitly promoting complementary information sharing. Experiments are conducted on four challenging MTL benchmarks covering both task-shift and domain-shift scenarios. The results show that Rep-MTL, even paired with the basic equal weighting (EW) policy, achieves competitive performance gains with favorable efficiency. Beyond standard performance metrics, Power Law (PL) exponent analysis demonstrates Rep-MTL’s efficacy in balancing task-specific learning and cross-task sharing.
Poster
Pei Wang · Zhaowei Cai · Hao Yang · Davide Modolo · Ashwin Swaminathan

[ Exhibit Hall I ]

Abstract
The optimality of using the de facto cross-entropy loss with one-hot target distribution (hard labeling) is questioned when training (Multimodal) Large Language Models (LLMs/MLLMs). Although it is reasonable for language token prediction, which is a typical multi-class classification problem in discrete space, it is suboptimal for tasks like numerical prediction, which is a typical regression problem in continuous space. However, enabling regression in LLMs/MLLMs will complicate the training and next-token prediction paradigm at inference. Instead, to address this challenge, we propose a novel loss design, called soft labeling, which smooths the target probability distribution, enabling predictions to be penalized according to their distance to the target. This is similar to a regression loss, which penalizes farther predictions more heavily in the continuous space, but will not change the model architecture or the next-token prediction paradigm of LLMs/MLLMs. We demonstrate the efficacy of soft labeling through extensive experiments on visual grounding, object counting, and chart understanding, achieving state-of-the-art performance on multiple benchmarks without bells and whistles. Soft labeling can be applied in any LLM/MLLM.
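A small sketch of the soft-labeling idea for numerical tokens: the one-hot target is replaced by a distribution that decays with the distance between each candidate token's value and the ground-truth number. The token-to-value mapping and the temperature `tau` are illustrative choices, not the paper's exact loss.

```python
# Sketch only: distance-aware soft targets plus a soft cross-entropy.
import torch
import torch.nn.functional as F

def soft_label_loss(logits: torch.Tensor,        # (B, V) logits over numeric tokens
                    token_values: torch.Tensor,  # (V,) numeric value of each token
                    target_values: torch.Tensor, # (B,) ground-truth numbers
                    tau: float = 1.0) -> torch.Tensor:
    # distance of every candidate token's value to the target number
    dist = (token_values.unsqueeze(0) - target_values.unsqueeze(1)).abs()  # (B, V)
    soft_targets = F.softmax(-dist / tau, dim=-1)      # closer tokens get more mass
    log_probs = F.log_softmax(logits, dim=-1)
    return -(soft_targets * log_probs).sum(dim=-1).mean()  # soft cross-entropy
```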
Poster
Shuren Qi · Yushu Zhang · CHAO WANG · Zhihua Xia · Xiaochun Cao · FENGLEI FAN

[ Exhibit Hall I ]

Abstract
Developing robust and interpretable vision systems is a crucial step towards trustworthy artificial intelligence. One promising paradigm is to design transparent structures, e.g., geometric invariance, for fundamental representations. However, such invariants exhibit limited discriminability, limiting their applications in larger-scale tasks. For this open problem, we conduct a systematic investigation of hierarchical invariance, exploring this topic from theoretical, practical, and application perspectives. At the theoretical level, we show how to construct discriminative invariants with a Convolutional Neural Network (CNN)-like hierarchical architecture, yet in a fully transparent manner. The general blueprint, specific definitions, invariant properties, and numerical implementations are provided. At the practical level, we discuss how to customize this transparent framework into a given task. With the over-completeness, discriminative features w.r.t. the task can be adaptively formed in a Neural Architecture Search (NAS)-like manner. We demonstrate the above arguments with accuracy, invariance, and efficiency results on laboratory-style classification experiments. Furthermore, at the application level, our representations are explored in real-world forensic tasks on adversarial perturbations and generated content. Such applications reveal that our invariants exhibit competitive discriminability even in the era of deep learning. For robust and interpretable vision tasks at larger scales, hierarchical invariant representations can be considered as an …
Poster
Guopeng Li · Qiang Wang · Ke Yan · Shouhong Ding · Yuan Gao · Gui-Song Xia

[ Exhibit Hall I ]

Abstract
Most knowledge distillation (KD) methods focus on teacher-student pairs with similar architectures, such as both being CNN models. The potential and flexibility of KD can be greatly improved by expanding it to Cross-Architecture KD (CAKD), where the knowledge of homogeneous and heterogeneous teachers can be transferred selectively to given students. However, this makes CAKD extremely challenging because of the substantial feature gaps between heterogeneous models (e.g., a ViT teacher and a CNN student), originating from the distinction of their inherent inductive biases and module functions. To this end, we fuse heterogeneous knowledge before transferring it from teacher to student. This fusion combines the advantages of cross-architecture inductive biases and module functions by merging directly from different combinations of convolution, attention, and MLP modules derived from both student and teacher module functions. Furthermore, we observe that heterogeneous features exhibit diverse spatial distributions, hindering the effectiveness of conventional pixel-wise MSE loss. Therefore, we leverage a spatial-agnostic InfoNCE loss to align features after spatial smoothing. Our method is evaluated across various homogeneous models and arbitrary heterogeneous combinations of CNNs, ViTs, and MLPs, yielding promising performance for distilled models with a maximum gain of 11.47% on CIFAR-100 and 3.67% on ImageNet-1K. Our codes will be …
Poster
Yuedong Tan · Jiawei Shao · Eduard Zamfir · Ruanjun Li · Zhaochong An · Chao Ma · Danda Pani Paudel · Luc Gool · Radu Timofte · Zongwei Wu

[ Exhibit Hall I ]

Abstract
Multimodal data is known to be helpful for visual tracking by improving robustness to appearance variations. However, sensor synchronization challenges often compromise data availability, particularly in video settings where shortages can be temporal. Despite its importance, this area remains underexplored. In this paper, we present the first comprehensive study on tracker performance with temporally incomplete multimodal data. Unsurprisingly, under such a circumstance, existing trackers exhibit significant performance degradation, as their rigid architectures lack the adaptability needed to effectively handle missing modalities. To address these limitations, we propose a flexible framework for robust multimodal tracking. We venture that a tracker should dynamically activate computational units based on missing data rates. This is achieved through a novel Heterogeneous Mixture-of-Experts fusion mechanism with adaptive complexity, coupled with a video-level masking strategy that ensures both temporal consistency and spatial completeness — critical for effective video tracking. Surprisingly, our model not only adapts to varying missing rates but also adjusts to scene complexity. Extensive experiments show that our model achieves SOTA performance across 9 benchmarks, excelling in both conventional complete and missing modality settings. The code and benchmark will be made publicly available.
Poster
Hyundong Jin · Hyung Jin Chang · Eunwoo Kim

[ Exhibit Hall I ]

Abstract
Continual learning enables pre-trained generative vision-language models (VLMs) to incorporate knowledge from new tasks without retraining data from previous ones. Recent methods update a visual projector to translate visual information for new tasks, connecting pre-trained vision encoders with large language models. However, such adjustments may cause the models to prioritize visual inputs over language instructions, particularly when learning tasks with repetitive types of textual instructions. To address the neglect of language instructions, we propose a novel framework that grounds the translation of visual information on instructions for language models. We introduce a mixture of visual projectors, each serving as a specialized visual-to-language translation expert based on the given instruction context to adapt to new tasks. To avoid using experts for irrelevant instruction contexts, we propose an expert recommendation strategy that reuses experts for tasks similar to those previously learned. Additionally, we introduce expert pruning to alleviate interference from the use of experts that were cumulatively activated in previous tasks. Extensive experiments on diverse vision-language tasks demonstrate that our method outperforms existing continual learning approaches by generating instruction-following responses.
Poster
Jinhong Wang · Shuo Tong · Jintai CHEN · Jian liu · Dongqi Tang · Weiqiang Wang · Wentong Li · Hongxia Xu · Danny Chen · Jian Wu

[ Exhibit Hall I ]

Abstract
Despite the remarkable progress of multimodal large language models (MLLMs), they continue to face challenges in achieving competitive performance on ordinal regression (OR; a.k.a. ordinal classification). To address this issue, this paper presents OrderChain, a novel and general prompting paradigm that augments the ordinal understanding ability of MLLMs by specificity and commonality modeling. Specifically, our OrderChain consists of a set of task-aware prompts to facilitate the specificity modeling of diverse OR tasks and a new range optimization Chain-of-Thought (RO-CoT), which learns a common way of thinking about OR tasks by uniformly decomposing them into multiple small-range optimization subtasks. Further, we propose a category recursive division (CRD) method to generate instruction candidate category prompts to support RO-CoT automatic optimization. Comprehensive experiments show that a Large Language and Vision Assistant (LLaVA) model with our OrderChain improves baseline LLaVA significantly on diverse OR datasets, e.g., from 47.5% to 93.2% accuracy on the Adience dataset for age estimation, and from 30.0% to 85.7% accuracy on the Diabetic Retinopathy dataset. Notably, LLaVA with our OrderChain also remarkably outperforms state-of-the-art methods by 27% on accuracy and 0.24 on MAE on the Adience dataset. To the best of our knowledge, our OrderChain is the first work that augments MLLMs …
Poster
Ziqi Gao · Qiufu Li · Linlin Shen

[ Exhibit Hall I ]

Abstract
Compared to 2D data, the scale of point cloud data in different domains available for training, is quite limited. Researchers have been trying to combine these data of different domains for masked autoencoder (MAE) pre-training to alleviate this data scarcity issue. However, the prior knowledge learned from mixed domains may not align well with the downstream 3D point cloud analysis tasks, leading to degraded performance. To address such an issue, we propose the Domain-Adaptive Point Cloud Masked Autoencoder (DAP-MAE), an MAE pre-training method, to adaptively integrate the knowledge of cross-domain datasets for general point cloud analysis. In DAP-MAE, we design a heterogeneous domain adapter that utilizes an adaptation mode during the pre-training, enabling the model to comprehensively learn information from point clouds across different domains, while employing a fusion mode in the fine-tuning to enhance point cloud features. Meanwhile, DAP-MAE incorporates a domain feature generator to guide the adaptation of point cloud features to various downstream tasks. With only one pre-training, DAP-MAE achieves excellent performance across four different point cloud analysis tasks, reaching 95.18% in object classification on ScanObjectNN and 88.45% in facial expression recognition on Bosphorus.
Poster
Keon-Hee Park · Seun-An Choe · Gyeong-Moon Park

[ Exhibit Hall I ]

Abstract
Source-free object detection adapts a detector pre-trained on a source domain to an unlabeled target domain without requiring access to labeled source data. While this setting is practical as it eliminates the need for the source dataset during domain adaptation, it operates under the restrictive assumption that only pre-defined objects from the source domain exist in the target domain. This closed-set setting prevents the detector from detecting undefined objects. To ease this assumption, we propose **S**ource-**F**ree **U**nknown **O**bject **D**etection (**SFUOD**), a novel scenario which enables the detector to not only recognize known objects but also detect undefined objects as unknown objects. To this end, we propose **CollaPAUL** (**Colla**borative tuning and **P**rincipal **A**xis-based **U**nknown **L**abeling), a novel framework for SFUOD. Collaborative tuning enhances knowledge adaptation by integrating target-dependent knowledge from the auxiliary encoder with source-dependent knowledge from the pre-trained detector through a cross-domain attention mechanism. Additionally, principal axis-based unknown labeling assigns pseudo-labels to unknown objects by estimating objectness via principal axes projection and confidence scores from model predictions. The proposed CollaPAUL achieves state-of-the-art performance on SFUOD benchmarks, and extensive experiments validate its effectiveness. The code will be released after the review.
Poster
Barış Zöngür · Robin Hesse · Stefan Roth

[ Exhibit Hall I ]

Abstract
To ensure the reliability of deep models in real-world applications, out-of-distribution (OOD) detection methods aim to distinguish samples close to the training distribution (in-distribution, ID) from those farther away (OOD). In this work, we propose a novel OOD detection method that utilizes singular value decomposition of the weight matrix of the classification head to decompose the model's feature activations into decisive and insignificant components, which contribute maximally and minimally, respectively, to the final classifier output. We find that the subspace of insignificant components more effectively distinguishes ID from OOD data than raw activations. This occurs because the classification objective leaves the indecisive subspace largely unaffected, yielding features that are "untainted" by the target classification task. Conversely, we find that activation shaping methods profit from only considering the decisive subspace, as the insignificant component can cause interference in the activation space. By combining these two findings into a single method, we achieve state-of-the-art results in various standard OOD benchmarks.
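A minimal sketch of splitting penultimate features via SVD of the classification-head weight into decisive and insignificant subspaces, as described above; the cutoff `k` and the sign convention of the score are assumptions for illustration, not the paper's exact recipe.

```python
# Sketch only: subspace split from the classifier weight, and a simple OOD score.
import torch

def build_subspaces(W: torch.Tensor, k: int):
    """W: (num_classes, feat_dim). Right singular vectors span the classifier's directions."""
    _, _, Vh = torch.linalg.svd(W, full_matrices=True)  # Vh: (feat_dim, feat_dim)
    V_decisive = Vh[:k].T   # top-k directions (decisive)
    V_insig = Vh[k:].T      # remaining directions (insignificant)
    return V_decisive, V_insig

def ood_score(features: torch.Tensor, V_insig: torch.Tensor) -> torch.Tensor:
    # One possible convention: more energy in the insignificant subspace -> more OOD-like.
    return (features @ V_insig).norm(dim=-1)
```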
Poster
Katja Schwarz · Denis Rozumny · Samuel Rota Bulò · Lorenzo Porzi · Peter Kontschieder

[ Exhibit Hall I ]

Abstract
We introduce a recipe for generating immersive 3D worlds from a single image by framing the task as an in-context learning problem for 2D inpainting models. This approach requires minimal training and uses existing generative models. Our process involves two steps: generating coherent panoramas using a pre-trained diffusion model and lifting these into 3D with a metric depth estimator. We then fill unobserved regions by conditioning the inpainting model on rendered point clouds, requiring minimal fine-tuning. Tested on both synthetic and real images, our method produces high-quality 3D environments suitable for VR display. By explicitly modeling the 3D structure of the generated environment from the start, our approach consistently outperforms state-of-the-art, video synthesis-based methods along multiple quantitative image quality metrics.
Poster
Amir Mehrpanah · Matteo Gamba · Kevin Smith · Hossein Azizpour

[ Exhibit Hall I ]

Abstract
ReLU networks, while prevalent for visual data, have sharp transitions, sometimes relying on individual pixels for predictions, making vanilla gradient-based explanations noisy and difficult to interpret. Existing methods, such as GradCAM, smooth these explanations by producing surrogate models at the cost of faithfulness. We introduce a unifying spectral framework to systematically analyze and quantify smoothness, faithfulness, and their trade-off in explanations. Using this framework, we quantify and reduce the contribution of ReLU networks to high-frequency information, providing a principled approach to identifying this trade-off. Our analysis characterizes how surrogate-based smoothing distorts explanations, leading to an "explanation gap" that we formally define and measure for different post-hoc methods. Finally, we validate our theoretical findings across different design choices, datasets, and ablations.
Poster
Letian Zhang · Quan Cui · Bingchen Zhao · Cheng Yang

[ Exhibit Hall I ]

Abstract
The success of multi-modal large language models (MLLMs) has been largely attributed to the large-scale training data. However, the training data of many MLLMs is unavailable due to privacy concerns. The expensive and labor-intensive process of collecting multi-modal data further exacerbates the problem. Is it possible to synthesize multi-modal training data automatically without compromising diversity and quality? In this paper, we propose a new method, Oasis, to synthesize high-quality multi-modal data with only images. Oasis breaks through traditional methods by prompting only images to the MLLMs, thus extending the data diversity by a large margin. Our method features a delicate quality control method which ensures the data quality. We collected over 500k data samples and conducted incremental experiments on LLaVA-NeXT. Extensive experiments demonstrate that our method can significantly improve the performance of MLLMs. The image-based synthesis also allows us to focus on the domain-specific abilities of MLLMs. Code and data will be publicly available.
Poster
Zhuoyan Xu · Khoi Nguyen · Preeti Mukherjee · Saurabh Bagchi · Somali Chaterji · Yingyu Liang · Yin Li

[ Exhibit Hall I ]

Abstract
Multimodal Large Language Models (MLLMs) have shown impressive capabilities in reasoning, yet come with substantial computational cost, limiting their deployment in resource-constrained settings. Despite recent efforts on improving the efficiency of MLLMs, prior solutions fall short in responding to varying runtime conditions, in particular changing resource availability (e.g., contention due to the execution of other programs on the device). To bridge this gap, we introduce AdaLLaVA, an adaptive inference framework that learns to dynamically reconfigure operations in an MLLM during inference, accounting for the input data and a latency budget. We conduct extensive experiments across benchmarks involving question-answering, reasoning, and hallucination. Our results show that AdaLLaVA effectively adheres to input latency budget, achieving varying accuracy and latency tradeoffs at runtime. Further, we demonstrate that AdaLLaVA adapts to both input latency and content, can be integrated with token selection for enhanced efficiency, and generalizes across MLLMs.
Poster
Kunlun Xu · Fan Zhuo · Jiangmeng Li · Xu Zou · Jiahuan Zhou

[ Exhibit Hall I ]

Abstract
Current lifelong person re-identification (LReID) methods predominantly rely on fully labeled data streams. However, in real-world scenarios where annotation resources are limited, a vast amount of unlabeled data coexists with scarce labeled samples, leading to the Semi-Supervised LReID (Semi-LReID) problem and making LReID methods suffer severe performance degradation. Despite the practical significance of Semi-LReID, it remains unexplored due to its inherent challenges. Existing LReID methods, even when combined with semi-supervised strategies, achieve limited long-term adaptation performance because they struggle with the noisy knowledge that arises during unlabeled data utilization, which hinders new knowledge acquisition and exacerbates catastrophic forgetting. In this paper, we pioneer the investigation of Semi-LReID, introducing a novel Self-Reinforcing PRototype Evolution with Dual-Knowledge Cooperation framework (SPRED). Our key innovation lies in establishing a self-reinforcing cycle between dynamic prototype-guided pseudo-label generation and new-old knowledge collaborative purification to enhance the utilization of unlabeled data. Specifically, learnable identity prototypes are introduced to dynamically capture the identity distributions as the pseudo-labels evolve and then generate high-quality pseudo-labels, while dual-knowledge cooperation, which integrates current model specialization and historical model generalization, refines pseudo-labels by filtering out noisy information. Through this cyclic design, reliable pseudo-labels are progressively mined to improve current-stage learning and ensure positive knowledge propagation …
Poster
Ziqian Lu · Yunlong Yu · Qinyue Tong · Jun Liu

[ Exhibit Hall I ]

Abstract
Existing adaptation methods of pre-trained vision-language models like CLIP often rely on base-class samples during fine-tuning, introducing systematic biases that distort decision boundaries and degrade performance on novel classes. In this work, we break new ground by proposing a hierarchical divide-and-conquer framework that addresses classification bias at its root. Our method first segregates the label space into base and novel subspaces, ensuring domain separation. Subsequently, it employs text-embedding clustering within each subspace to decompose ambiguous intra-domain classes into disentangled, fine-grained clusters. This two-stage grouping strategy not only alleviates class confusion but also enables domain-specific model training in isolated subspaces, fostering specialized learning without overfitting base categories. Experiments on three classification benchmarks reveal that our approach achieves state-of-the-art performance, surpassing the second-best competitor by 10% in average accuracy.
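A rough sketch of the two-stage grouping step, assuming pre-computed class text embeddings and a boolean base/novel mask; the clustering backend (scikit-learn KMeans) and the cluster count are illustrative choices, not the paper's.

```python
# Sketch only: split the label space, then cluster text embeddings within each subspace.
import numpy as np
from sklearn.cluster import KMeans

def group_classes(text_embeddings: np.ndarray,  # (num_classes, dim), e.g. CLIP text features
                  is_base: np.ndarray,          # (num_classes,) boolean mask
                  n_clusters: int = 4) -> dict:
    groups = {}
    for name, mask in [("base", is_base), ("novel", ~is_base)]:
        idx = np.where(mask)[0]
        labels = KMeans(n_clusters=min(n_clusters, len(idx)),
                        n_init=10, random_state=0).fit_predict(text_embeddings[idx])
        # map cluster id -> list of class indices inside this subspace
        groups[name] = {int(c): idx[labels == c].tolist() for c in set(labels)}
    return groups
```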
Poster
Yilin Gao · Kangyi Chen · Zhongxing Peng · Hengjie Lu · Shugong Xu

[ Exhibit Hall I ]

Abstract
Current visual foundation models (VFMs) face a fundamental limitation in transferring knowledge from vision language models (VLMs): while VLMs excel at modeling cross-modal interactions through unified representation spaces, existing VFMs predominantly adopt *result-oriented* paradigms that neglect the underlying interaction processes. This representational discrepancy leads to suboptimal knowledge transfer and limited generalization capabilities across vision tasks. We propose Learning from Interactions, a cognitive-inspired framework that bridges this gap by explicitly modeling interactions during visual understanding. Our key insight is that preserving the interaction dynamics captured by VLMs -- rather than just their final representations -- enables more effective knowledge transfer to downstream VFMs. The technical core involves two innovations: (1) *Interaction Queries* that maintain persistent relationships across network layers, and (2) interaction-based supervision derived from pre-trained VLMs' cross-modal attention patterns. Comprehensive experiments demonstrate consistent improvements across multiple benchmarks: achieving ~3.3% and +1.6 mAP/+2.4 mask AP absolute gains on TinyImageNet classification and COCO detection/segmentation respectively, with minimal parameter overhead and faster convergence (7× speedup). The framework particularly excels in cross-domain scenarios, delivering ~2.4% and ~9.3% zero-shot improvements on PACS and VLCS. Human evaluations confirm our approach's cognitive alignment, outperforming result-oriented methods by 2.7× in semantic consistency metrics.
Poster
Yinan Zhou · Yuxin Chen · Haokun Lin · Yichen Wu · Shuyu Yang · Zhongang Qi · Chen Ma · Li Zhu

[ Exhibit Hall I ]

Abstract
With recent advances in Multimodal Large Language Models (MLLMs), grounding and referring capabilities have gained increasing attention for achieving detailed understanding and flexible user interaction. However, these capabilities still remain underdeveloped in visual document understanding due to the scarcity of fine-grained datasets and comprehensive benchmarks. To fill this gap, we propose the **DO**cument **G**rounding and r**E**ferring data engine (**DOGE-Engine**), which generates two types of high-quality fine-grained document data: (1) multi-granular parsing data to improve text localization and recognition, and (2) instruction-tuning data to activate MLLMs' grounding and referring capabilities in dialogue and reasoning. Using the DOGE-Engine, we construct **DOGE-Bench**, a benchmark covering seven grounding and referring tasks across three document types (chart, poster, and PDF document), offering a comprehensive evaluation of fine-grained document understanding. Leveraging the generated data, we further develop **DOGE**, a strong baseline model that excels in text localization and recognition, while precisely grounding and referring to key textual information during conversation and reasoning, thereby advancing document understanding to a finer granularity and enabling flexible interaction paradigms. Our code, data, and model will be open-sourced to support community development.
Poster
Jinxin Shi · Jiabao Zhao · Yifan Yang · Xingjiao Wu · Jiawen Li · Liang He

[ Exhibit Hall I ]

Abstract
For Few-Shot Class-Incremental Learning (FSCIL), direct fine-tuning causes significant parameter shifts, resulting in catastrophic forgetting and increased resource consumption. Meanwhile, freezing the pre-trained backbone exacerbates the inconsistency between the backbone and the evolving classifier. To overcome these challenges, we introduce a method called Low-Rank updates after knowledge localization (Lark). In the knowledge localization phase, the Fisher Information Matrix is calculated to measure the sensitivity of parameters in different layers to previously acquired knowledge. This phase ultimately identifies the parameters within the model that are most suitable for learning new knowledge. In the subsequent incremental editing phase, a low-rank incremental update strategy is applied. This strategy ensures that the model parameter updates adhere to a Rank-One matrix structure. By doing so, it minimizes alterations to the original parameters, thereby enabling the model to integrate new knowledge while retaining as much of the previous knowledge as possible. Extensive experimental results demonstrate that the Lark method achieves significant performance improvements on the CIFAR100, mini-ImageNet, and CUB200 datasets, surpassing current state-of-the-art methods.
Poster
Zheng Li · Yibing Song · Ming-Ming Cheng · Xiang Li · jian Yang

[ Exhibit Hall I ]

Abstract
Textual-based prompt learning methods primarily employ multiple learnable soft prompts and hard class tokens in a cascading manner as text inputs, aiming to align image and text (category) spaces for downstream tasks. However, current training is restricted to aligning images with predefined known categories and cannot be associated with unknown categories. In this work, we propose utilizing universal attributes as a bridge to enhance the alignment between images and unknown categories. Specifically, we introduce an Attribute-anchored Textual Prompt learning method for vision-language models, named ATPrompt. This approach expands the learning space of soft prompts from the original one-dimensional category level into the multi-dimensional attribute level by incorporating multiple attribute tokens into the learnable soft prompts. Through this modification, we transform the text prompt from a category-centric form to an attribute-category hybrid form. Additionally, we introduce a straightforward differentiable attribute search method to identify representative and suitable attributes for downstream tasks. As an easy-to-use plug-in technique, ATPrompt can seamlessly replace the existing prompt format in textual-based methods, providing general improvements at a negligible computational cost. Extensive experiments across 11 datasets validate the effectiveness of our method.
Poster
Liwei Luo · Shuaitengyuan Li · Dongwei Ren · Qilong Wang · Pengfei Zhu · Qinghua Hu

[ Exhibit Hall I ]

Abstract
Recently, remarkable progress has been made in large-scale pre-trained model tuning, and inference efficiency is becoming more crucial for practical deployment. Early exiting in conjunction with multi-stage predictors, when combined with a parameter-efficient fine-tuning strategy, offers a straightforward way to achieve an inference-efficient model. However, a key challenge remains unresolved: How can early stages provide low-level fundamental features to deep stages while simultaneously supplying high-level discriminative features to early-stage predictors? To address this problem, we propose a Decoupled Multi-Predictor Optimization (DMPO) method to effectively decouple the low-level representative ability and high-level discriminative ability in early stages. First, in terms of architecture, we introduce a lightweight bypass module into multi-stage predictors for functional decomposition of shallow features from early stages, while a high-order statistics-based predictor is developed for early stages to effectively enhance their discriminative ability. To reasonably train our multi-predictor architecture, a decoupled optimization is proposed to allocate two-phase loss weights for multi-stage predictors during model tuning, where the initial training phase enables the model to prioritize the acquisition of discriminative ability of deep stages via emphasizing representative ability of early stages, and the latter training phase drives discriminative ability towards earlier stages as much as possible. As such, our …
Poster
Sudong Wang · Yunjian Zhang · Yao Zhu · Enci Liu · Jianing Li · Yanwei Liu · Xiangyang Ji

[ Exhibit Hall I ]

Abstract
Despite the remarkable progress of Multimodal Large Language Models (MLLMs) in recent years, the persistent challenge of "hallucination" has surfaced as a major barrier, sharply constraining their practical applicability and reliability in real-world systems. In this paper, we provide a novel perspective on the causes of and mitigations for hallucinations by tracking the information flow within MLLMs. We find that information in MLLMs does not flow in a strictly continuous manner; instead, it may mutate abruptly in deep layers. The mutated information does not originate from shallow layers; on the contrary, it is directly injected into the model, which may cause the model's outputs to deviate from the input, leading to hallucinations. Inspired by this observation, we propose a hallucination mitigation method that directly operates on the mutated information, named **S**moothing **H**allucinations by **I**nformation **F**low **T**uning (SHIFT). In this method, the differences of feature encodings between adjacent layers are monitored, and once mutated information is detected, the knowledge from shallow layers is used to tune it. This process filters out hallucinated knowledge, aligning features more faithfully with the input and effectively reducing hallucinations. Extensive experiments on multiple benchmarks have demonstrated the superior performance in terms of accuracy and efficiency of …
Poster
Kaichen Zhang · Yifei Shen · Bo Li · Ziwei Liu

[ Exhibit Hall I ]

Abstract
Recent advances in Large Multimodal Models (LMMs) have led to significant breakthroughs in both academia and industry. One question that arises is how we, as humans, can understand their internal neural representations. This paper takes an initial step towards addressing this question by presenting a versatile framework to identify and interpret the semantics within LMMs. Specifically, 1) we first apply a Sparse Autoencoder (SAE) to disentangle the representations into human-understandable features. 2) We then present an automatic interpretation framework in which the LMMs themselves interpret the open-semantic features learned in the SAE. We employ this framework to analyze the LLaVA-NeXT-8B model using the LLaVA-OV-72B model, demonstrating that these features can effectively steer the model's behavior. Our results contribute to a deeper understanding of why LMMs excel in specific tasks, including EQ tests, and illuminate the nature of their mistakes along with potential strategies for their rectification. These findings offer new insights into the internal mechanisms of LMMs and suggest parallels with the cognitive processes of the human brain.
Poster
Yangyang Guo · Mohan Kankanhalli

[ Exhibit Hall I ]

Abstract
While contrastive pre-training is widely employed, its data efficiency problem has remained relatively under-explored thus far. Existing methods often rely on static coreset selection algorithms to pre-identify important data for training. However, this static nature renders them unable to dynamically track the data usefulness throughout pre-training, leading to subpar pre-trained models. To address this challenge, our paper introduces a novel dynamic bootstrapping dataset pruning method. It involves pruning data preparation followed by dataset mutation operations, both of which undergo iterative and dynamic updates. We apply this method to two prevalent contrastive pre-training frameworks: **CLIP** and **MoCo**, representing vision-language and vision-centric domains, respectively. In particular, we individually pre-train seven CLIP models on two large-scale image-text pair datasets, and two MoCo models on the ImageNet dataset, resulting in a total of 16 pre-trained models. With a data pruning rate of 30-35% across all 16 models, our method exhibits only marginal performance degradation (less than **1%** on average) compared to corresponding models trained on the full dataset counterparts across various downstream datasets, and also surpasses several baselines with a large performance margin. Additionally, the byproduct from our method, i.e., coresets derived from the original datasets after pre-training, also demonstrates significant superiority in terms of downstream performance over other static coreset selection …
Poster
Peng Wu · Qiuxia Lai · Hao Fang · Guo-Sen Xie · Yilong Yin · Xiankai Lu · Wenguan Wang

[ Exhibit Hall I ]

Abstract
Compositional Zero-Shot Learning (CZSL) aims to recognize unseen combinations of known objects and attributes by leveraging knowledge from previously seen compositions. Traditional approaches primarily focus on disentangling attributes and objects, treating them as independent entities during learning. However, this assumption overlooks the semantic constraints and contextual dependencies inside a composition. For example, certain attributes naturally pair with specific objects (e.g., "striped" applies to "zebra" or "shirts" but not "sky" or "water"), while the same attribute can manifest differently depending on context (e.g., "young" in "young tree" vs. "young dog"). Thus, capturing attribute-object interdependence remains a fundamental yet long-ignored challenge in CZSL. In this paper, we adopt a Conditional Probability Framework (CPF) to explicitly model attribute-object dependencies. We decompose the probability of a composition into two components: the likelihood of an object and the conditional likelihood of its attribute. To enhance object feature learning, we incorporate textual descriptors to highlight semantically relevant image regions. These enhanced object features then guide attribute learning through a cross-attention mechanism, ensuring better contextual alignment. By jointly optimizing object likelihood and conditional attribute likelihood, our method effectively captures compositional dependencies and generalizes well to unseen compositions. Extensive experiments on multiple CZSL benchmarks demonstrate the superiority of our …
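A minimal sketch of the conditional decomposition described above, assuming one head that scores objects and another that scores attributes conditioned on the object; the tensor shapes are illustrative, not the paper's architecture.

```python
# Sketch only: log p(attribute, object | image) = log p(object | image) + log p(attribute | object, image).
import torch
import torch.nn.functional as F

def compose_scores(obj_logits: torch.Tensor,            # (B, num_objects)
                   attr_logits_given_obj: torch.Tensor  # (B, num_objects, num_attrs)
                   ) -> torch.Tensor:
    log_p_obj = F.log_softmax(obj_logits, dim=-1)              # log p(o | x)
    log_p_attr = F.log_softmax(attr_logits_given_obj, dim=-1)  # log p(a | o, x)
    # joint log-probability over all (object, attribute) compositions, shape (B, num_objects, num_attrs)
    return log_p_obj.unsqueeze(-1) + log_p_attr
```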
Poster
Tuo Chen · Jie Gui · Minjing Dong · Ju Jia · Lanting Fang · Jian liu

[ Exhibit Hall I ]

Abstract
Self-supervised contrastive learning (CL) effectively learns transferable representations from unlabeled data containing images or image-text pairs but is vulnerable to data poisoning backdoor attacks (DPCLs). An adversary can inject poisoned images into pretraining datasets, causing compromised CL encoders to exhibit targeted misbehavior in downstream tasks. Existing DPCLs, however, achieve limited efficacy due to their dependence on fragile implicit co-occurrence between the backdoor and the target object and inadequate suppression of discriminative features in backdoored images. We propose Noisy Alignment (NA), a DPCL method that explicitly suppresses noise components in poisoned images. Inspired by powerful training-controllable CL attacks, we identify and extract the critical objective of noisy alignment, adapting it effectively into data-poisoning scenarios. Our method implements noisy alignment by strategically manipulating contrastive learning's random cropping mechanism, formulating this process as an image layout optimization problem with theoretically derived optimal parameters. The resulting method is simple yet effective, achieving state-of-the-art performance with a +45.9% attack success rate improvement over existing DPCLs on ImageNet-100 while maintaining clean-data accuracy. Furthermore, Noisy Alignment demonstrates robustness against common backdoor defenses.
Poster
Zhisheng Zhong · Chengyao Wang · Yuqi Liu · Senqiao Yang · Longxiang Tang · Yuechen Zhang · Jingyao Li · Tianyuan Qu · Yanwei Li · Yukang Chen · Shaozuo Yu · WU Sitong · Eric Lo · Shu Liu · Jiaya Jia

[ Exhibit Hall I ]

Abstract
As Multi-modal Large Language Models (MLLMs) evolve, expanding beyond single-domain capabilities is essential to meet the demands for more versatile and efficient AI. However, previous omni-models have insufficiently explored speech, neglecting its integration with multi-modality. We introduce Lyra, an efficient MLLM that enhances multi-modal abilities, including advanced long speech comprehension, sound understanding, cross-modality efficiency, and seamless speech interaction. To achieve efficiency and speech-centric capabilities, Lyra employs three strategies: (1) leveraging existing open-source large models and a proposed multi-modality LoRA to reduce training costs and data requirements; (2) using a latent multi-modality regularizer and extractor to strengthen the relationship between speech and other modalities, thereby enhancing model performance; and (3) constructing a high-quality, extensive dataset that includes 1.5M multi-modal (language, vision, audio) data samples and 12K long speech samples, enabling Lyra to handle complex long speech inputs and achieve more robust omni-cognition. Compared to other omni-methods, Lyra achieves state-of-the-art performance on various vision-language, vision-speech, and speech-language benchmarks, while also using fewer computational resources and less training data. All code, data, and models will be available to the public.
Poster
Jun Zhang · Desen Meng · Zhengming Zhang · Zhenpeng Huang · Tao Wu · Limin Wang

[ Exhibit Hall I ]

Abstract
Despite the remarkable performance of multimodal large language models (MLLMs) across diverse tasks, the substantial training and inference costs impede their advancement. In this paper, we propose p-MoD, an efficient MLLM architecture that significantly reduces training and inference costs while maintaining model performance. The majority of computation in MLLMs stems from the overwhelming volume of vision tokens processed by the transformer-based LLM. Accordingly, we leverage the Mixture-of-Depths (MoD) mechanism, where each LLM layer selects essential vision tokens to process while skipping redundant ones. However, integrating MoD into MLLMs is non-trivial. To address the challenges of training and inference stability as well as limited training data, we adapt the MoD module with two novel designs: tanh-gated weight normalization (TanhNorm) and symmetric token reweighting (STRing). Moreover, we observe that vision tokens exhibit higher redundancy in deeper layers and thus design a progressive ratio decay (PRD) strategy, which gradually reduces the token retention ratio layer by layer, employing a shifted cosine schedule. This crucial design fully unleashes the potential of MoD, significantly boosting the efficiency and performance of our models. Extensive experiments on two baseline models across 15 benchmarks show that our model matches or even surpasses the performance of corresponding baselines, while …
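A small sketch of what a shifted-cosine progressive ratio decay (PRD) schedule could look like, where deeper layers retain fewer vision tokens; the bounds and shift value are illustrative assumptions, not the paper's exact hyperparameters.

```python
# Sketch only: per-layer vision-token retention ratio following a shifted cosine.
import math

def retention_ratio(layer_idx: int, num_layers: int,
                    r_max: float = 1.0, r_min: float = 0.3,
                    shift: float = 0.1) -> float:
    """Retention ratio for vision tokens at a given LLM layer (deeper -> smaller)."""
    # progress in [0, 1], shifted so the earliest layers keep (almost) all tokens
    t = min(1.0, max(0.0, layer_idx / (num_layers - 1) - shift))
    cos_term = 0.5 * (1 + math.cos(math.pi * t))  # decays from 1 to 0 as t goes 0 -> 1
    return r_min + (r_max - r_min) * cos_term

# Example: ratios sampled across a 32-layer LLM.
print([round(retention_ratio(l, 32), 2) for l in range(0, 32, 8)])
```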
Poster
Zhankai Li · Weiping Wang · jie li · Shigeng Zhang · Yunan Hu · Song Guo

[ Exhibit Hall I ]

Abstract
In the field of AI security, the vulnerability of deep neural networks has garnered widespread attention. Specifically, the sensitivity of DNNs to adversarial examples (AEs) can lead to severe consequences: even small perturbations in input data can result in incorrect predictions. AEs demonstrate transferability across models; however, targeted attack success rates (TASRs) remain low due to significant differences in feature dimensions and decision boundaries. To enhance the transferability of targeted AEs, we propose a novel approach by introducing Inverse Target Gradient Competition (ITC) and Spatial Distance Stretching (SDS) in the optimization process. Specifically, we utilize a twin-network-like framework to generate both non-targeted and targeted AEs, introducing a new competition mechanism ITC where non-targeted adversarial gradients are applied each epoch to hinder the optimization of targeted adversarial perturbations, thus enhancing robustness in targeted attacks. Additionally, a top-k SDS strategy is employed, guiding AEs to penetrate target class regions in the latent multi-dimensional space while globally distancing from multiple closest non-targeted regions, ultimately achieving optimal adversarial transferability. Compared with state-of-the-art competition-based attacks, our method demonstrates significant transferability advantages, with average transferable TASRs improved by 16.1% and 21.4% on mainstream CNNs and ViTs, respectively, while also achieving an unmatched ability to break through defenses.
Poster
Huan Wang · Haoran Li · Huaming Chen · Jun Yan · Jiahua Shi · Jun Shen

[ Exhibit Hall I ]

Abstract
Federated learning aims at training models collaboratively across participants while protecting privacy. However, one major challenge for this paradigm is the data heterogeneity issue, where biased data preferences across multiple clients harm the model's convergence and performance. In this paper, we first introduce powerful diffusion models into the federated learning paradigm and show that diffusion representations are effective steers during federated training. To explore the possibility of using diffusion representations in handling data heterogeneity, we propose a novel diffusion-inspired Federated paradigm with Diffusion Representation Collaboration, termed FedDifRC, leveraging meaningful guidance of diffusion models to mitigate data heterogeneity. The key idea is to construct text-driven diffusion contrasting and noise-driven diffusion regularization, aiming to provide abundant class-related semantic information and consistent convergence signals. On the one hand, we exploit the conditional feedback from the diffusion model for different text prompts to build a text-driven contrastive learning strategy. On the other hand, we introduce a noise-driven consistency regularization to align local instances with diffusion denoising representations, constraining the optimization region in the feature space. In addition, FedDifRC can be extended to a self-supervised scheme without relying on any labeled data. We also provide a theoretical analysis for FedDifRC to ensure convergence under non-convex …
Poster
Jieming Bian · Lei Wang · Letian Zhang · Jie Xu

[ Exhibit Hall I ]

Abstract
Foundation models (FMs) achieve strong performance across diverse tasks with task-specific fine-tuning, yet full parameter fine-tuning is often computationally prohibitive for large models. Parameter-efficient fine-tuning (PEFT) methods like Low-Rank Adaptation (LoRA) reduce this cost by introducing low-rank matrices for tuning fewer parameters. While LoRA allows for efficient fine-tuning, it requires significant data for adaptation, making Federated Learning (FL) an appealing solution due to its privacy-preserving collaborative framework. However, combining LoRA with FL introduces two key challenges: the Server-Side Aggregation Bias, where server-side averaging of LoRA matrices diverges from the ideal global update, and the Client-Side Initialization Lag, emphasizing the need for consistent initialization across rounds. Existing approaches address these challenges individually, limiting their effectiveness. We propose LoRA-FAIR, a novel method that tackles both issues by introducing a correction term on the server, enhancing aggregation efficiency and accuracy. LoRA-FAIR maintains computational and communication efficiency, yielding superior performance over state-of-the-art methods. Experimental results on ViT and MLP-Mixer models across large-scale datasets demonstrate that LoRA-FAIR consistently achieves performance improvements in FL settings.
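The server-side aggregation bias mentioned above can be seen in a few lines of numpy: averaging each client's LoRA factors A and B separately is not the same as averaging the products BA that form the actual weight updates. The toy dimensions and random values below are arbitrary; the snippet only illustrates the bias that LoRA-FAIR's server-side correction term is designed to counteract, not the correction itself.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, k = 8, 2, 3  # feature dim, LoRA rank, number of clients (toy sizes)

# hypothetical per-client LoRA factors after local fine-tuning
As = [rng.normal(size=(r, d)) for _ in range(k)]
Bs = [rng.normal(size=(d, r)) for _ in range(k)]

ideal = sum(B @ A for B, A in zip(Bs, As)) / k  # average of the BA updates
naive = (sum(Bs) / k) @ (sum(As) / k)           # product of averaged factors

print("aggregation bias (Frobenius norm):", np.linalg.norm(ideal - naive))
```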
Poster
Hongcheng Li · Yucan Zhou · Xiaoyan Gu · Bo Li · Weiping Wang

[ Exhibit Hall I ]

Abstract
Dataset distillation, which compresses large-scale datasets into compact synthetic representations (i.e., distilled datasets), has become crucial for the efficient training of modern deep learning architectures. While existing large-scale dataset distillation methods leverage a pre-trained model through batch normalization statistics alignment, they neglect the essential role of covariance matrices in preserving inter-feature correlations, resulting in reduced diversity in the distilled datasets. In this paper, we propose a simple yet effective approach, Diversity-Enhanced Distribution Alignment (DEDA), which enhances the diversity of distilled data by leveraging inter-feature relationships. Our method first establishes Gaussian distribution alignment by matching the means and covariances of each class in the original dataset with those of the distilled dataset in the feature space of a pre-trained model. Since features within the last layer of a pre-trained model are often highly similar within each class, aligning distributions in this layer cannot obtain diversified distilled data, resulting in gradient starvation during downstream training tasks. To overcome this limitation, we introduce a regularizer that constrains the covariance matrix of the distilled data in the last layer to maximize diagonal elements while minimizing non-diagonal elements. Extensive evaluations across CIFAR-10/100, Tiny-ImageNet, and ImageNet-1K demonstrate state-of-the-art performance without additional computational overhead.
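A covariance regularizer of the kind described above is easy to sketch; the exact weighting and normalization used by DEDA are not given in the abstract, so the loss below is just one plausible instantiation: it penalizes off-diagonal energy of the last-layer covariance of the distilled data while rewarding diagonal (per-feature variance) energy.

```python
import torch

def diversity_cov_regularizer(feats: torch.Tensor) -> torch.Tensor:
    """Illustrative regularizer on last-layer features of distilled samples:
    encourage large diagonal entries (per-feature variance) and small
    off-diagonal entries (inter-feature correlation) of the covariance."""
    centered = feats - feats.mean(dim=0, keepdim=True)
    cov = centered.T @ centered / max(feats.shape[0] - 1, 1)
    diag = torch.diagonal(cov)
    off_diag = cov - torch.diag(diag)
    return off_diag.pow(2).sum() - diag.pow(2).sum()

# toy usage: 32 distilled samples with 16-dimensional last-layer features
feats = torch.randn(32, 16, requires_grad=True)
diversity_cov_regularizer(feats).backward()
```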
Poster
Ruiqi Du · Xu Tang · Xiangrong Zhang · Jingjing Ma

[ Exhibit Hall I ]

Abstract
Since real-world multi-label data often exhibit significant label imbalance, long-tailed multi-label image classification has emerged as a prominent research area in computer vision. Traditionally, it is considered that deep neural networks' classifiers are vulnerable to long-tailed distributions, whereas the feature extraction backbone remains relatively robust. However, our analysis from the feature learning perspective reveals that the backbone struggles to maintain high sensitivity to sample-scarce categories but retains the ability to localize specific areas effectively. Based on this observation, we propose a new model for long-tailed multi-label image classification named category-specific selective feature enhancement (CSSFE). First, it utilizes the retained localization capability of the backbone to capture label-dependent class activation maps. Then, a progressive attention enhancement mechanism, updating from head to medium to tail categories, is introduced to address the low-confidence issue in medium and tail categories. Finally, visual features are extracted according to the optimized class activation maps and combined with semantic information to perform the classification task. Extensive experiments on two benchmark datasets highlight our findings' generalizability and the proposed CSSFE's superior performance.
Poster
Jaeho Shin · Hyeonjae Gil · Junwoo Jang · Maani Ghaffari · Ayoung Kim

[ Exhibit Hall I ]

Abstract
Affine Grassmannian has been favored for expressing proximity between lines and planes due to its theoretical exactness in measuring distances among features. Despite this advantage, the existing method can only measure the proximity without yielding the distance as an explicit function of rigid body transformation. Thus, an optimizable distance function on the manifold has remained underdeveloped, stifling its application in registration problems. This paper is the first to explicitly derive an optimizable cost function between two Grassmannian features with respect to rigid body transformation ($\mathbf{R}$ and $\mathbf{t}$). Specifically, we present a rigorous mathematical proof demonstrating that the bases of high-dimensional linear subspaces can serve as an explicit representation of the cost. Finally, we propose an optimizable cost function based on the transformed bases that can be applied to the registration problem of any affine subspace. Compared to vector parameter-based approaches, our method is able to find a globally optimal solution by directly minimizing the geodesic distance which is agnostic to representation ambiguity. The resulting cost function and its extension to the inlier-set maximizing Branch-and-Bound (BnB) solver have been demonstrated to improve the convergence of existing solutions or outperform them in various computer vision tasks. The code will be made available …
Poster
Linlan Huang · Xusheng Cao · Haori Lu · Yifan Meng · Fei Yang · Xialei Liu

[ Exhibit Hall I ]

Abstract
Continual learning aims to enable models to learn sequentially from continuously incoming data while retaining performance on previously learned tasks. With the Contrastive Language-Image Pre-trained model (CLIP) exhibiting strong capabilities across various downstream tasks, there has been growing interest in leveraging CLIP for continual learning in such scenarios. Most existing works overlook the inherent modality gap in CLIP, a key factor in its generalization and adaptability. In this paper, we analyze the variations in the modality gap during the fine-tuning of vision-language pre-trained models. Our observations reveal that the modality gap effectively reflects the extent to which pre-trained knowledge is preserved. Based on these insights, we propose a simple yet effective method that improves CLIP’s performance in class-incremental learning. Our approach leverages modality gap preservation to mitigate forgetting and modality gap compensation to enhance the capacity for new data, introducing a novel modality-gap-based perspective for continual learning. Extensive experiments on multiple benchmarks demonstrate that our method outperforms existing approaches without requiring additional replay data.
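The modality gap itself is commonly measured as the distance between the centroids of L2-normalized image and text embeddings; a generic helper of that form (not the authors' code) is sketched below and could be logged during fine-tuning to reproduce the kind of analysis the abstract describes.

```python
import torch

def modality_gap(image_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
    """Distance between the centroids of L2-normalized image and text
    embeddings -- a common way to quantify the CLIP modality gap."""
    img_center = torch.nn.functional.normalize(image_emb, dim=-1).mean(dim=0)
    txt_center = torch.nn.functional.normalize(text_emb, dim=-1).mean(dim=0)
    return torch.norm(img_center - txt_center)

# toy usage with random 768-d embeddings standing in for CLIP outputs
print(modality_gap(torch.randn(512, 768), torch.randn(512, 768)).item())
```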
Poster
Melanie Götz · Torsten Krauß · Alexandra Dmitrienko

[ Exhibit Hall I ]

Abstract
Federated Learning (FL) enables collaborative machine learning across decentralized clients without sharing raw data, which offers enhanced privacy and improved performance. However, FL is vulnerable to poisoning attacks, compromising model integrity through both untargeted performance degradation and targeted backdoor attacks. Detecting backdoors in FL is challenging due to their stealthy nature and variability in local datasets. Existing defenses struggle against adaptive adversaries and distinguishing between poisoning and genuine dataset anomalies. This paper introduces the Siamese Backdoor Inspector (Sibai), a novel meta-classifier-based poisoning defense for FL. Leveraging the staple few-shot learning technique of Siamese networks, Sibai effectively detects malicious contributions in various scenarios, including settings with strong variations between clients' datasets and encounters with adaptive adversaries. Sibai achieves high detection rates, prevents backdoors, minimizes performance impact, and outperforms eight recent defenses regarding F1 score, poisoning prevention, and consistency across complex scenarios.
Poster
Xueyi Zhang · Peiyin Zhu · Chengwei Zhang · Zhiyuan Yan · Jikang Cheng · Mingrui Lao · Siqi Cai · Yanming Guo

[ Exhibit Hall I ]

Abstract
Existing continual deepfake detection methods typically treat stability (retaining previously learned forgery knowledge) and plasticity (adapting to novel forgeries) as conflicting properties, emphasizing an inherent trade-off between them, while regarding generalization to unseen forgeries as secondary. In contrast, we reframe the problem: stability and plasticity can coexist and be jointly improved through the model’s inherent generalization. Specifically, we propose Generalization-Preserved Learning (GPL), a novel framework consisting of two key components: (1) Hyperbolic Visual Alignment, which introduces learnable watermarks to align incremental data with the base set in hyperbolic space, alleviating inter-task distribution shifts; (2) Generalized Gradient Projection, which prevents parameter updates that conflict with generalization constraints, ensuring new knowledge learning does not interfere with previously acquired knowledge. Notably, GPL requires neither backbone retraining nor historical data storage. Experiments conducted on four mainstream datasets (FF++, Celeb-DF v2, DFD, and DFDCP) demonstrate that GPL achieves an accuracy of 92.14%, outperforming replay-based state-of-the-art methods by 2.15%, while reducing forgetting by 2.66%. Moreover, GPL achieves an 18.38% improvement on unseen forgeries using only 1% of baseline parameters, thus presenting an efficient adaptation to continuously evolving forgery techniques.
Poster
Jiachen Sun · De Cheng · Xi Yang · Nannan Wang

[ Exhibit Hall I ]

Abstract
Domain incremental object detection in remote sensing addresses the challenge of adapting to continuously emerging domains with distinct characteristics. Unlike natural images, remote sensing data vary significantly due to differences in sensors, altitudes, and geographic locations, leading to data distribution shifts and feature misalignments. These challenges make it difficult for models to generalize across domains while retaining knowledge from previous tasks, requiring effective adaptation strategies to mitigate catastrophic forgetting. To address these challenges, we propose the Dual Domain Control via Active Learning (Active-DDC) method, which integrates active learning strategies to handle data distribution and model feature shifts. The first component, the Data-based Active Learning Example Replay (ALER) module, combines a high-information sample selection strategy from active learning with the characteristic extreme foreground-background ratio in remote sensing images, enabling the selection of highly representative samples for storage in a memory bank. The second component, the Query-based Active Domain Shift Control (ADSC) module, leverages the query vector, a key element for DETR-based detectors, to implement query active preselection and optimal transport matching, thus facilitating effective cross-domain knowledge transfer. Our method achieves optimal performance in domain incremental tasks across four remote sensing datasets, and ablation studies further validate the effectiveness of both components.
Poster
Ihab Asaad · Maha Shadaydeh · Joachim Denzler

[ Exhibit Hall I ]

Abstract
Machine learning classification models trained with empirical risk minimization (ERM) often inadvertently rely on spurious correlations. When absent in the test data, these unintended associations between non-target attributes and target labels lead to poor generalization. This paper addresses this problem from a model optimization perspective and proposes a novel method, Gradient Extrapolation for Debiased Representation Learning (GERNE), designed to learn debiased representations in both known and unknown attribute training cases. GERNE uses two distinct batches with different amounts of spurious correlations to define the target gradient as the linear extrapolation of two gradients computed from each batch’s loss. It is demonstrated that the extrapolated gradient, if directed toward the gradient of the batch with the smaller amount of spurious correlation, can guide the training process toward learning a debiased model. GERNE can serve as a general debiasing framework, with methods such as ERM, reweighting, and resampling shown to be special cases. The theoretical upper and lower bounds of the extrapolation factor are derived to ensure convergence. By adjusting this factor, GERNE can be adapted to maximize the Group-Balanced Accuracy (GBA) or the Worst-Group Accuracy. The proposed approach is validated on five vision benchmarks and one NLP benchmark, demonstrating competitive and often …
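A minimal sketch of the extrapolation step follows, assuming the form g_target = g_b + c(g_b - g_a), where batch b carries the smaller amount of spurious correlation and c is the extrapolation factor; the admissible range of c and the batch construction follow the paper and are not reproduced here.

```python
import torch

def extrapolated_grads(params, loss_biased, loss_less_biased, c: float):
    """Target gradient as a linear extrapolation of two batch gradients,
    directed toward the gradient of the less spuriously correlated batch.
    c = 0 recovers plain ERM on the less-biased batch."""
    g_a = torch.autograd.grad(loss_biased, params, retain_graph=True)
    g_b = torch.autograd.grad(loss_less_biased, params)
    return [gb + c * (gb - ga) for ga, gb in zip(g_a, g_b)]

# toy usage with a linear model and two hypothetical batches
model = torch.nn.Linear(4, 2)
ce = torch.nn.CrossEntropyLoss()
xa, ya = torch.randn(8, 4), torch.randint(0, 2, (8,))
xb, yb = torch.randn(8, 4), torch.randint(0, 2, (8,))
grads = extrapolated_grads(list(model.parameters()),
                           ce(model(xa), ya), ce(model(xb), yb), c=1.5)
for p, g in zip(model.parameters(), grads):
    p.grad = g  # an optimizer.step() would then apply the debiased update
```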
Poster
Tengjin Weng · Jingyi Wang · Wenhao Jiang · Zhong Ming

[ Exhibit Hall I ]

Abstract
Can Multimodal Large Language Models (MLLMs) develop an intuitive number sense similar to humans? Targeting this problem, we introduce Visual Number Benchmark (VisNumBench) to evaluate the number sense abilities of MLLMs across a wide range of visual numerical tasks. VisNumBench consists of about 1,900 multiple-choice question-answer pairs derived from both synthetic and real-world visual data, covering seven visual numerical attributes and four types of visual numerical estimation tasks. Our experiments on VisNumBench led to the following key findings: (i) The 17 MLLMs we tested—including open-source models such as Qwen2.5-VL and InternVL2.5, as well as proprietary models like GPT-4o and Gemini 2.0 Flash—perform significantly below human levels in number sense-related tasks. (ii) Multimodal mathematical models and multimodal chain-of-thought (CoT) models did not exhibit significant improvements in number sense abilities. (iii) Stronger MLLMs with larger parameter sizes and broader general abilities demonstrate modest gains in number sense abilities. We believe VisNumBench will serve as a valuable resource for the research community, encouraging further advancements in enhancing LVLMs' number sense abilities. All benchmark resources, including code and datasets, will be publicly released upon the paper’s acceptance.
Poster
Chengchao Zhang · Fanhua Shang · Hongying Liu · Liang Wan · Wei Feng

[ Exhibit Hall I ]

Abstract
Federated Continual Learning (FCL) has emerged as a prominent distributed learning paradigm and aims at addressing model learning challenges in both federated and continual learning settings. Efficient personalization in FCL remains a major challenge, as it must handle not only conflicts between old and new knowledge within parallel task streams but also heterogeneous knowledge conflicts from different clients. Recent approaches attempt to mitigate these issues through gradient correction. However, they often overlook the combined impact of gradient magnitude and direction, leading to unsatisfactory gradient solutions. To address these issues, we propose a novel federated continual learning method (called FedAGC) with asymmetric gradient correction, which performs memory rehearsal using representative samples selected via a centroid-based approach from historical tasks. By formulating the problem as a multi-objective optimization paradigm, FedAGC derives more effective gradients while incorporating group-level personalization to facilitate useful knowledge integration and irrelevant knowledge isolation, effectively mitigating both temporal and spatial catastrophic forgetting. Extensive experiments confirm the effectiveness of FedAGC.
Poster
Minkyun Seo · Hyungtae Lim · Kanghee Lee · Luca Carlone · Jaesik Park

[ Exhibit Hall I ]

Abstract
Recent advances in deep learning-based point cloud registration have improved generalization, yet most methods still require retraining or manual parameter tuning for each new environment. In this paper, we identify three key factors limiting generalization: (a) reliance on environment-specific voxel size and search radius, (b) poor out-of-domain robustness of learning-based keypoint detectors, and (c) raw coordinate usage, which exacerbates scale discrepancies. To address these issues, we present a zero-shot registration pipeline called BUFFER-X by (a) adaptively determining voxel size/search radii, (b) using farthest point sampling to bypass learned detectors, and (c) leveraging patch-wise scale normalization for consistent coordinate bounds. In particular, we present a multi-scale patch-based descriptor generation and a hierarchical inlier search across scales to improve robustness in diverse scenes. We also propose a novel generalizability benchmark using 11 datasets that cover various indoor/outdoor scenarios and sensor modalities, demonstrating that BUFFER-X achieves substantial generalization without prior information or manual parameter tuning for the test datasets. Our code will be made publicly available.
Poster
Shenghe Zheng · Hongzhi Wang

[ Exhibit Hall I ]

Abstract
With the rapid growth of deep learning, there is an increasing availability of open-source models for various tasks. However, single fine-tuned models often fall short of meeting the diverse needs of users. Model merging has thus emerged as an efficient method to integrate the capabilities of existing models into a unified model. Nevertheless, existing model merging methods face challenging trade-offs between performance and deployment costs, primarily due to task interference. For the first time, we reveal that task interference is evident in the frequency domain of model parameters, yet current efforts only focus on spatial domain solutions, which are largely ineffective in addressing frequency domain interference. To mitigate the impact of frequency domain interference, we propose **FR-Merging**, an innovative method that effectively filters harmful frequency domain interference on the backbone with minimal computational overhead. Since performance loss is inevitable with cost-free methods, we propose a lightweight task-specific expert module that dynamically compensates for information loss during merging. This proposed framework, **FREE-Merging** (FR-Merging with experts), strikes a balanced trade-off between training cost, inference latency, storage requirements, and performance. We demonstrate the effectiveness of both FR-Merging and FREE-Merging on multiple tasks across CV, NLP, and Multi-Modal domains and show that they can …
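To make the frequency-domain view concrete, the sketch below low-pass filters a 2-D task vector (fine-tuned weights minus base weights) with a plain FFT before merging. This generic filter is only an illustration of operating on parameters in the frequency domain; FR-Merging's actual filter design and the FREE-Merging expert modules are described in the paper.

```python
import numpy as np

def low_pass_filter_task_vector(delta: np.ndarray, keep: float = 0.5) -> np.ndarray:
    """Filter a 2-D task vector in the frequency domain, keeping only the
    lowest-frequency fraction `keep` along each axis (illustrative only)."""
    spec = np.fft.fftshift(np.fft.fft2(delta))
    h, w = spec.shape
    mask = np.zeros_like(spec, dtype=bool)
    kh, kw = int(h * keep / 2), int(w * keep / 2)
    mask[h // 2 - kh: h // 2 + kh, w // 2 - kw: w // 2 + kw] = True
    spec = np.where(mask, spec, 0)
    return np.real(np.fft.ifft2(np.fft.ifftshift(spec)))

# toy merge of two filtered task vectors onto a shared base weight matrix
base = np.random.randn(16, 16)
deltas = [np.random.randn(16, 16) * 0.01 for _ in range(2)]
merged = base + sum(low_pass_filter_task_vector(d) for d in deltas) / len(deltas)
```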
Poster
Yiming Zhang · Zhuokai Zhao · Zhaorun Chen · Zhili Feng · Zenghui Ding · Yining Sun

[ Exhibit Hall I ]

Abstract
Self-supervised contrastive learning models, such as CLIP, have set new benchmarks for vision-language models in many downstream tasks. However, their dependency on rigid one-to-one mappings overlooks the complex and often multifaceted relationships between and within texts and images. To this end, we introduce RankCLIP, a novel pretraining method that extends beyond the rigid one-to-one matching framework of CLIP and its variants. By extending the traditional pair-wise loss to list-wise, and leveraging both in-modal and cross-modal ranking consistency, RankCLIP improves the alignment process, enabling it to capture the nuanced many-to-many relationships between and within each modality. Through comprehensive experiments, we demonstrate the effectiveness of RankCLIP in various downstream tasks, notably achieving significant gains in zero-shot classifications over state-of-the-art methods, underscoring the importance of this enhanced learning process.
Poster
Zenghao Niu · Weicheng Xie · Siyang Song · Zitong YU · Feng Liu · Linlin Shen

[ Exhibit Hall I ]

Abstract
Adversarial attacks present a critical challenge to deep neural networks' robustness, particularly in transfer scenarios across different model architectures. However, the transferability of adversarial attacks faces a fundamental dilemma between Exploitation (maximizing attack potency) and Exploration (enhancing cross-model generalization). Traditional momentum-based methods over-prioritize Exploitation, i.e., higher loss maxima for attack potency but weakened generalization (narrow loss surface). Conversely, recent methods with inner-iteration sampling over-prioritize Exploration, i.e., flatter loss surfaces for cross-model generalization but weakened attack potency (suboptimal local maxima). To resolve this dilemma, we propose a simple yet effective Gradient-Guided Sampling (GGS), which harmonizes both objectives through guiding sampling along the gradient ascent direction to improve both sampling efficiency and stability. Specifically, based on MI-FGSM, GGS introduces inner-iteration random sampling and guides the sampling direction using the gradient from the previous inner-iteration (the sampling's magnitude is determined by a random distribution). This mechanism encourages adversarial examples to reside in balanced regions with both flatness for cross-model generalization and higher local maxima for strong attack potency. Comprehensive experiments across multiple DNN architectures and multimodal large language models (MLLMs) demonstrate the superiority of our method over state-of-the-art transfer attacks. Code will be made publicly available.
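A rough sketch of how gradient-guided sampling could sit on top of MI-FGSM is given below: each inner iteration samples a point along the previous inner iteration's gradient direction with a random magnitude, and the averaged gradients feed the usual momentum update. Step sizes, the number of samples, and the sampling distribution are assumptions for illustration, not the paper's settings.

```python
import torch

def ggs_attack(model, x, y, eps=8 / 255, alpha=2 / 255, steps=10,
               n_samples=5, mu=1.0, beta=1.5):
    """Sketch of Gradient-Guided Sampling layered on MI-FGSM."""
    loss_fn = torch.nn.CrossEntropyLoss()
    x_adv = x.clone().detach()
    momentum = torch.zeros_like(x)
    guide = torch.zeros_like(x)  # gradient from the previous inner iteration
    for _ in range(steps):
        grad_sum = torch.zeros_like(x)
        for _ in range(n_samples):
            # sample around x_adv along the guiding direction, random magnitude
            r = torch.rand(1, device=x.device) * beta * alpha
            x_s = (x_adv + r * guide.sign()).detach().requires_grad_(True)
            g = torch.autograd.grad(loss_fn(model(x_s), y), x_s)[0]
            grad_sum, guide = grad_sum + g, g.detach()
        g_avg = grad_sum / n_samples
        momentum = mu * momentum + g_avg / (g_avg.abs().mean() + 1e-12)
        x_adv = x_adv + alpha * momentum.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1).detach()
    return x_adv
```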
Poster
Jaeseok Byun · Seokhyeon Jeong · Wonjae Kim · Sanghyuk Chun · Taesup Moon

[ Exhibit Hall I ]

Abstract
Composed Image Retrieval (CIR) aims to retrieve a target image based on a reference image and conditioning text, enabling controllable image searches. The mainstream Zero-Shot (ZS) CIR methods bypass the need for expensive CIR training triplets by projecting image embeddings into the text token embedding space, forming a composed query for retrieval. However, we highlight an inherent limitation of these projection-based CIR methods: a task discrepancy of text encoders between the original pre-training task of the encoders (text $\leftrightarrow$ image) and the target CIR task (image + text $\leftrightarrow$ image), which potentially negatively impacts CIR performance. To reduce such a discrepancy, a naive solution would be to train both image and text encoders with CIR triplets in a supervised manner. Instead, we introduce Reducing Task Discrepancy of Text Encoders (RTD), an efficient text-only post-hoc framework that complements projection-based CIR methods. We devise a novel target-anchored text contrastive learning designed to enhance the capability of the text encoder for CIR. We also propose two key enhancements: (1) a hard negative-based refined batch sampling strategy and (2) a refined concatenation scheme to further mitigate training-inference discrepancy. Integrating RTD into state-of-the-art projection-based methods achieves performance comparable to, or even surpassing, resource-intensive state-of-the-art synthetic CIR triplet-based approaches only …
Poster
Weitai Kang · Haifeng Huang · Yuzhang Shang · Mubarak Shah · Yan Yan

[ Exhibit Hall I ]

Abstract
Recent advancements in 3D Large Language Models (3DLLMs) show their potential to build general-purpose agents in the 3D real world, yet challenges remain due to the lack of high-quality robust instruction-following data, leading to limited discriminative power and generalization of 3DLLMs. In this paper, we introduce Robin3D, a powerful 3DLLM trained on large-scale instruction-following data generated by our novel data engine, the Robust Instruction Generation (RIG) engine. RIG generates two key types of instruction data: 1) the Adversarial Instruction-following data, which features mixed negative and positive samples to enhance the model's discriminative understanding; 2) the Diverse Instruction-following data, which contains various instruction styles to enhance the model's generalization. As a result, we construct 1 million instruction-following data samples, consisting of 344K Adversarial samples, 508K Diverse samples, and 165K benchmark training set samples. To better handle these complex instructions, Robin3D further integrates an improved vision projector and enhanced sequence organization. Notably, we achieve a 7.8% improvement in the grounding task (Multi3DRefer) and a 6.9% improvement in the captioning task (Scan2Cap).
Poster
Zichun Su · Zhi Lu · Yutong Wu · Renfei Shen · Songfeng Lu

[ Exhibit Hall I ]

Abstract
Federated Learning (FL) enables collaborative global model training without data sharing but faces critical challenges from privacy leakage and Byzantine attacks. Existing privacy-preserving robust FL frameworks suffer from three main limitations: high computational costs, restricted RAR usage, and inadequate handling of data heterogeneity. To address these limitations, we propose the FLSeg framework, which leverages Segment Exchange and Segment Aggregation to avoid excessive encryption computations while allowing unrestricted use of any RAR. Additionally, a regularization term in local training balances personalization with global model performance, effectively adapting to heterogeneous data. Our theoretical and experimental analyses demonstrate FLSeg’s semi-honest security and computational efficiency. FLSeg achieves client and server time complexities of $O(\ell)$ and $O(n\ell)$, with empirical results showing significantly reduced computational times, e.g., 233 ms for clients and 78 ms per client on the server, compared to ACORN (USENIX 23) at 1696 ms and 181 ms. Extensive experiments confirm FLSeg’s robustness across diverse heterogeneous and adversarial scenarios, e.g., achieving 64.59% accuracy on non-IID CIFAR-10 with 20% Min-Max attackers, compared to ACORN's 48.21%.
Poster
Hyun Jun Yook · Ga San Jhun · Cho Hyun · Min Jeon · Donghyun Kim · Tae Hyung Kim · Youn Lee

[ Exhibit Hall I ]

Abstract
Machine unlearning (MU) removes specific data points or concepts from deep learning models to enhance privacy and prevent sensitive content generation. Adversarial prompts can exploit unlearned models to generate content containing removed concepts, posing a significant security risk. However, existing adversarial attack methods still face challenges in generating content that aligns with an attacker’s intent while incurring high computational costs to identify successful prompts. To address these challenges, we propose ZIUM, a Zero-shot Intent-aware adversarial attack on Unlearned Models, which enables the flexible customization of target attack images to reflect an attacker’s intent. Additionally, ZIUM supports zero-shot adversarial attacks without requiring further optimization for previously attacked unlearned concepts. The evaluation across various MU scenarios demonstrated ZIUM's effectiveness in successfully customizing content based on user-intent prompts while achieving a superior attack success rate compared to existing methods. Moreover, its zero-shot adversarial attack significantly reduces the attack time for previously attacked unlearned concepts.
Poster
Hang Phung · Manh Nguyen · Thanh Huynh · Quoc Viet Hung Nguyen · Trong Nghia Hoang · Phi Le Nguyen

[ Exhibit Hall I ]

Abstract
This paper develops a generalized federated prompt-tuning framework under practical scenarios where local datasets are multi-modal and have different distributional patterns of missing features at the input level. The proposed framework helps bridge the gap between federated learning and multi-modal prompt-tuning, which previously focused on either uni-modal or centralized data. A key challenge in bridging this gap is the inherent lack of semantic alignment between prompt instructions that encode the same distributional patterns of missing data across different clients. To address this challenge, our proposed framework introduces specific client-tuning and server-aggregation designs that learn to simultaneously optimize, align, and aggregate prompt-tuning instructions across clients and data modalities, enabling them to complement one another and be combined effectively. A thorough evaluation of our framework on a variety of multimodal benchmark datasets demonstrates consistent and significant performance improvement over existing state-of-the-art (SOTA) baselines.
Poster
Stefan Kolek · Aditya Chattopadhyay · Kwan Ho Ryan Chan · Hector Andrade Loarca · Gitta Kutyniok · Rene Vidal

[ Exhibit Hall I ]

Abstract
Building image classification models that are both highly accurate and interpretable remains a challenge in computer vision. Information Pursuit (IP) is an information-theoretic framework for interpretable-by-design sequential prediction. Given a set of task-relevant and semantic data queries, IP selects a sequence of queries in order of information gain and updates the posterior at each step based on the gathered query-answer pairs. To carry out IP, previous methods construct hand-crafted dictionaries of potential data queries, curated either by a domain expert or by prompting large language models. However, in practice, such hand-crafted dictionaries are limited by the expertise of the curator and the heuristics of prompt engineering, resulting in a gap between the predictive performance of IP versus non-interpretable black-box predictors. In this work, we propose to parameterize the IP queries as a learnable dictionary defined in the latent space of vision-language models such as CLIP. Drawing inspiration from sparse dictionary learning, we propose an alternating optimization algorithm that iterates between solving IP's optimization problem for a fixed query dictionary and optimizing the dictionary to maximize classification accuracy. Empirically, our experiments show that our method learns a query dictionary that reduces the accuracy gap between explainable image classification with IP and …
Poster
Qizhen Lan · Qing Tian

[ Exhibit Hall I ]

Abstract
Dense visual prediction tasks, such as detection and segmentation, are crucial for time-critical applications (e.g., autonomous driving and video surveillance). While deep models achieve strong performance, their efficiency remains a challenge. Knowledge distillation (KD) is an effective model compression technique, but existing feature-based KD methods rely on static, teacher-driven feature selection, failing to adapt to the student's evolving learning state or leverage dynamic student-teacher interactions. To address these limitations, we propose Adaptive student-teacher Cooperative Attention Masking for Knowledge Distillation (ACAM-KD), which introduces two key components: (1) Student-Teacher Cross-Attention Feature Fusion (STCA-FF), which adaptively integrates features from both models for a more interactive distillation process, and (2) Adaptive Spatial-Channel Masking (ASCM), which dynamically generates importance masks to enhance both spatial and channel-wise feature selection. Unlike conventional KD methods, ACAM-KD adapts to the student's evolving needs throughout the entire distillation process. Extensive experiments on multiple benchmarks validate its effectiveness. For instance, on COCO2017, ACAM-KD improves object detection performance by up to 1.4 mAP over the state-of-the-art when distilling a ResNet-50 student from a ResNet-101 teacher. For semantic segmentation on Cityscapes, it boosts mIoU by 3.09 over the baseline with DeepLabV3-MobileNetV2 as the student model.
Poster
Wenjin Mo · Zhiyuan Li · Minghong Fang · Mingwei Fang

[ Exhibit Hall I ]

Abstract
Federated learning (FL) allows multiple clients to collaboratively train a global machine learning model with coordination from a central server, without needing to share their raw data. This approach is particularly appealing in the era of privacy regulations like the GDPR, leading many prominent companies to adopt it. However, FL's distributed nature makes it susceptible to poisoning attacks, where malicious clients, controlled by an attacker, send harmful data to compromise the model. Most existing poisoning attacks in FL aim to degrade the model’s integrity, such as reducing its accuracy, with limited attention to privacy concerns from these attacks. In this study, we introduce FedPoisonMIA, a novel poisoning membership inference attack targeting FL. FedPoisonMIA involves malicious clients crafting local model updates to infer membership information. Additionally, we propose a robust defense mechanism to mitigate the impact of FedPoisonMIA attacks. Extensive experiments across various datasets demonstrate the attack's effectiveness, while our defense approach reduces its impact to a degree.
Poster
Xianhang Li · Yanqing Liu · Haoqin Tu · Cihang Xie

[ Exhibit Hall I ]

Abstract
OpenAI's CLIP models, released in early 2021, have long been the only viable choice for the research community in building multimodal foundation models. This dominance has only recently been challenged by a few alternatives like SigLIP. However, to the best of our knowledge, all these solutions are still not fully open, e.g., their training data remains proprietary and/or their training frameworks are unreleased. In this paper, we address this challenge by introducing a family of fully open vision encoders that are as competitive as, or even surpass, OpenAI's CLIP in building multimodal foundation models like LLaVA. Moreover, due to their fully open nature, we offer these vision encoders in a wide range of sizes, from as few as 5.9 million parameters to 632.1 million parameters. We further demonstrate that these variable-sized vision encoders provide significant flexibility: larger models deliver enhanced multimodal performance, while smaller models enable efficient and portable multimodal foundation models suitable for edge device deployment. The training data, code and trained models will be released soon.
Poster
Zixian Guo · Ming Liu · Qilong Wang · Zhilong Ji · Jinfeng Bai · Lei Zhang · Wangmeng Zuo

[ Exhibit Hall I ]

Abstract
In addressing geometric problems, the reasoning capabilities demonstrated by existing large vision-language models (LVLMs) are significantly inferior to those of their corresponding large language model (LLM) backbones. We attribute this issue to the inadequate alignment and joint comprehension of visual and linguistic features. Furthermore, the imprecise information extracted from images by LVLMs further impairs their reasoning abilities. To this end, we propose a dual-mind architecture that can capture detailed visual information from images and facilitate effective linguistic reasoning through joint optimization. Different from the existing supervised fine-tuning pipeline, which makes LVLMs conduct problem-solving directly, we let the LVLMs interpret the visual content first. LVLMs extract key elements like precise geometric primitives and spatial relationships as natural language conditions. Then, the LLM serves as a linguistic reasoner for deriving the answer through step-by-step reasoning. The visual interpreting module and the linguistic reasoning module can effectively collaborate via an outcome-rewarded joint tuning strategy. By solving the multimodal question using the dual mind of LVLM and LLM, we achieve significant improvements in visually intensive geometric math problems. This work advances multimodal reasoning with a new coupled architecture featuring explicit visual perception and linguistic reasoning, which can overcome the limitations of current LVLMs. The code will be …
Poster
Bhavna Gopal · Huanrui Yang · Mark Horton · Yiran Chen

[ Exhibit Hall I ]

Abstract
Vision transformers (ViTs) have become essential backbones in advanced computer vision applications and multi-modal foundation models. Despite their strengths, ViTs remain vulnerable to adversarial perturbations, comparable to or even exceeding the vulnerability of convolutional neural networks (CNNs). Furthermore, the large parameter count and complex architecture of ViTs make them particularly prone to adversarial overfitting, often compromising both clean and adversarial accuracy. This paper mitigates adversarial overfitting in ViTs through a novel, layer-selective fine-tuning approach: SAFER. Instead of optimizing the entire model, we identify and selectively fine-tune a small subset of layers most susceptible to overfitting, applying sharpness-aware minimization to these layers while freezing the rest of the model. Our method consistently enhances both clean and adversarial accuracy over baseline approaches. Typical improvements are around 5%, with some cases achieving gains as high as 20% across various ViT architectures and datasets.
Poster
Yihong Luo · Tianyang Hu · Yifan Song · Jiacheng Sun · Zhenguo Li · Jing Tang

[ Exhibit Hall I ]

Abstract
While diffusion distillation has enabled one-step generation through methods like Variational Score Distillation, adapting distilled models to emerging new controls -- such as novel structural constraints or latest user preferences -- remains challenging. Conventional approaches typically require modifying the base diffusion model and redistilling it -- a process that is both computationally intensive and time-consuming. To address these challenges, we introduce Joint Distribution Matching (JDM), a novel approach that minimizes the reverse KL divergence between image-condition joint distributions. By deriving a tractable upper bound, JDM decouples fidelity learning from condition learning. This asymmetric distillation scheme enables our one-step student to handle controls unknown to the teacher model and facilitates improved classifier-free guidance (CFG) usage and seamless integration of human feedback learning (HFL). Experimental results demonstrate that JDM surpasses baseline methods such as multi-step ControlNet with merely one step in most cases, while achieving state-of-the-art performance in one-step text-to-image synthesis through improved usage of CFG or HFL integration.
Poster
Qihang Fan · Huaibo Huang · Mingrui Chen · Ran He

[ Exhibit Hall I ]

Abstract
The Vision Transformer (ViT) has gained prominence for its superior relational modeling prowess. However, its global attention mechanism's quadratic complexity poses substantial computational burdens. A common remedy spatially groups tokens for self-attention, reducing computational requirements. Nonetheless, this strategy neglects semantic information in tokens, possibly scattering semantically-linked tokens across distinct groups, thus compromising the efficacy of self-attention intended for modeling inter-token dependencies. Motivated by these insights, we introduce a fast and balanced clustering method, named Semantic Equitable Clustering (SEC). SEC clusters tokens based on their global semantic relevance in an efficient, straightforward manner. In contrast to traditional clustering methods requiring multiple iterations, our method achieves token clustering in a single pass. Additionally, SEC regulates the number of tokens per cluster, ensuring a balanced distribution for effective parallel processing on current computational platforms without necessitating further optimization. Capitalizing on SEC, we propose a versatile vision backbone, SECViT. Comprehensive experiments in image classification, object detection, instance segmentation, and semantic segmentation validate the effectiveness of SECViT. Remarkably, SECViT attains an impressive 84.3% image classification accuracy with only 27M parameters and 4.6G FLOPs, without the need for additional supervision or data. Moreover, SEC can be conveniently and swiftly applied to multimodal large language …
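The single-pass, size-balanced grouping can be illustrated with a few lines of PyTorch: score every token's global semantic relevance, sort once, and chunk the sorted tokens into equally sized clusters. The scoring rule below (cosine similarity to the mean token) is an assumption for illustration; SEC's actual relevance measure is defined in the paper.

```python
import torch

def balanced_token_clustering(tokens: torch.Tensor, num_clusters: int):
    """One-pass, size-balanced token clustering sketch: score tokens against a
    pooled 'semantic' vector, sort once, and split into equal-sized chunks."""
    n, _ = tokens.shape
    assert n % num_clusters == 0, "toy version assumes a divisible token count"
    semantic = tokens.mean(dim=0)  # global semantic proxy
    scores = torch.nn.functional.cosine_similarity(tokens, semantic[None, :], dim=-1)
    order = scores.argsort(descending=True)  # the single sorting pass
    cluster_size = n // num_clusters
    assignment = torch.empty(n, dtype=torch.long)
    for c in range(num_clusters):
        assignment[order[c * cluster_size:(c + 1) * cluster_size]] = c
    return assignment

# toy usage: 196 patch tokens of dimension 64 grouped into 4 equal clusters
tok = torch.randn(196, 64)
print(torch.bincount(balanced_token_clustering(tok, 4)))
```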
Poster
Shunjie Yuan · Xinghua Li · Xuelin Cao · Haiyan Zhang · Mengyao Zhu · Robert Deng

[ Exhibit Hall I ]

Abstract
Backdoor attacks have revealed the vulnerability of deep neural networks (DNNs), which motivates the development of secure deep learning systems. However, existing backdoor attacks often fail to bypass backdoor detection and human visual inspection, resulting in the exposure of the backdoor implanted in DNNs, which can subsequently be significantly mitigated through pruning or fine-tuning on benign data. To address this issue, in this paper, we propose a novel backdoor attack called SPD (Shallow Protecting Deep), which consists of a deep backdoor in the frequency domain and a shallow backdoor in the pixel domain, where the shallow backdoor acts as a firewall to protect the deep backdoor from being detected. Specifically, the deep backdoor in SPD samples from a specific Gaussian distribution and encodes the sampled results into the intensity of the image's amplitude component in the frequency domain using an autoencoder, which serves as the backdoor trigger, thereby ensuring the invisibility of the backdoor attack. The shallow backdoor leverages traditional patch-based triggers, which cover all classes and attract the defender's attention, thereby preserving the deep backdoor's resistance to existing backdoor detection techniques. Experimental results demonstrate that SPD not only can resist existing backdoor detection techniques, but also, due to the …
Poster
Julia Machnio · Mads Nielsen · Mostafa Mehdipour Ghazi

[ Exhibit Hall I ]

Abstract
Active learning (AL) seeks to reduce annotation costs by selecting the most informative samples for labeling, making it particularly valuable in resource-constrained settings. However, traditional evaluation methods, which focus solely on final accuracy, fail to capture the full dynamics of the learning process. To address this gap, we propose PALM (Performance Analysis of Active Learning Models), a unified and interpretable mathematical model that characterizes AL trajectories through four key parameters: achievable accuracy, coverage efficiency, early-stage performance, and scalability. PALM provides a predictive description of AL behavior from partial observations, enabling the estimation of future performance and facilitating principled comparisons across different strategies. We validate PALM through extensive experiments on CIFAR-10/100 and ImageNet-50/100/200, covering a wide range of AL methods and self-supervised embeddings. Our results demonstrate that PALM generalizes effectively across datasets, budgets, and strategies, accurately predicting full learning curves from limited labeled data. Importantly, PALM reveals crucial insights into learning efficiency, data space coverage, and the scalability of AL methods. By enabling the selection of cost-effective strategies and predicting performance under tight budget constraints, PALM lays the basis for more systematic, reproducible, and data-efficient evaluation of AL in both research and real-world applications.
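The exact four-parameter model is defined in the paper; purely as an illustration of the idea, one plausible family maps a label budget to accuracy through an achievable ceiling, an early-stage level, a coverage rate, and a scalability exponent, and can then be extrapolated from partial observations.

```python
import numpy as np

def al_curve(n_labels, a_max, coverage, early, scale):
    """One plausible four-parameter learning-curve family (an assumption; PALM's
    exact parameterization differs): accuracy rises from an early-stage level
    toward an achievable ceiling, with a coverage rate and a scalability
    exponent shaping how quickly it gets there."""
    return a_max - (a_max - early) * np.exp(-coverage * n_labels ** scale)

# extrapolated accuracies for increasing label budgets (hypothetical values)
budgets = np.array([100, 200, 400, 800, 1600, 3200], dtype=float)
print(np.round(al_curve(budgets, a_max=0.85, coverage=0.08, early=0.35, scale=0.5), 3))
```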
Poster
Geon Yeong Park · Sang Wan Lee · Jong Ye

[ Exhibit Hall I ]

Abstract
Diffusion distillation models effectively accelerate reverse sampling by compressing the process into fewer steps. However, these models still exhibit a performance gap compared to their pre-trained diffusion model counterparts, exacerbated by distribution shifts and accumulated errors during multi-step sampling. To address this, we introduce Distillation++, a novel inference-time distillation framework that reduces this gap by incorporating teacher-guided refinement during sampling. Inspired by recent advances in conditional sampling, our approach recasts student model sampling as a proximal optimization problem with a score distillation sampling loss (SDS). To this end, we integrate distillation optimization during reverse sampling, which can be viewed as teacher guidance that drives student sampling trajectory towards the clean manifold using pre-trained diffusion models. Thus, Distillation++ improves the denoising process in real-time without additional source data or fine-tuning. Distillation++ demonstrates substantial improvements over state-of-the-art distillation baselines, particularly in early sampling stages, positioning itself as a robust guided sampling process crafted for diffusion distillation models.
Poster
Mohammed Rakib · Arunkumar Bagavathi

[ Exhibit Hall I ]

Abstract
Multimodal learning aims to leverage information from diverse data modalities to achieve more comprehensive performance. However, conventional multimodal models often suffer from modality imbalance, where one or a few modalities dominate model optimization, leading to suboptimal feature representation and underutilization of weak modalities. To address this challenge, we introduce Gradient-Guided Distillation ($G^{2}D$), a knowledge distillation framework that optimizes the multimodal model with a custom-built loss function that fuses both unimodal and multimodal objectives. $G^{2}D$ further incorporates a dynamic sequential modality prioritization (SMP) technique in the learning process to ensure each modality leads the learning process, avoiding the pitfall of stronger modalities overshadowing weaker ones. We validate $G^{2}D$ on multiple real-world datasets and show that $G^{2}D$ amplifies the significance of weak modalities while training and outperforms state-of-the-art methods in classification and regression tasks. The source code is available with the supplementary materials.
Poster
Qiqi Liu · Jiaqiang Li · Yuchen Liu · Yaochu Jin · Lingjuan Lyu · Xiaohu Wu · Han Yu

[ Exhibit Hall I ]

Abstract
A crucial issue in federated learning is the heterogeneity of data across clients, which may lead to model divergence, eventually deteriorating the model performance. Personalized federated learning (pFL) has been shown to be an effective approach to addressing data heterogeneity in federated learning. However, many existing pFL studies rely on directly using the global model for local training without fully assessing its impact on the performance of the local model, resulting in a potential conflict between personalization and generalization. To address this issue, we propose a parallel structure of a local supervisor and an inter-learning model for the local model and introduce a novel pFL method called federated learning by considering data similarity across clients assisted by a local supervisor (FedSimSup). Specifically, FedSimSup maintains an inter-learning model for each client and refines the inter-learning model using a local supervisor for each client. The local supervisor monitors the aggregated global information and ensures that the inter-learning model aligns with the local heterogeneous data to enhance local model performance. Additionally, the similarity between clients is measured based on differences in local data distributions, and this similarity is used to adjust the weights of the inter-learning models. Experimental results show that FedSimSup outperforms eight …
Poster
Yi Li · Hualiang Wang · Xinpeng Ding · Haonan Wang · Xiaomeng Li

[ Exhibit Hall I ]

Abstract
Multimodal large language models (MLLMs) are broadly empowering various fields. Despite their advancements, the explainability of MLLMs remains less explored, hindering deeper understanding, model credibility, and effective visualization. Unlike conventional vision models (e.g., CNNs, ViTs, CLIP) that produce a single output, MLLMs generate sequences of tokens progressively, where each generated token depends on the previous context. Therefore, earlier context tokens can introduce redundant activations that interfere with the explanation of later tokens beyond their original information. Existing studies often overlook this issue, but our observations reveal that these redundant correlations can significantly hurt the reliability of explanations. To address this, we propose an estimated causal inference method to mitigate the interference of context to achieve high-quality MLLM explanation, with a novel rank Gaussian filter to further reduce activation noises. We term this method Token Activation Map (TAM) to highlight the consideration of interactions between tokens. TAM also indicates that it excels at explaining multiple tokens of MLLM, which is different from the Class Activation Map (CAM) for a single prediction. Our TAM method significantly outperforms existing SoTA methods, showcasing high-quality visualization results that can be utilized for various scenarios, such as object localization, failure case analysis, video visualization, MLLMs visual …
Poster
Derong Jin · Ruohan Gao

[ Exhibit Hall I ]

Abstract
An immersive acoustic experience enabled by spatial audio is just as crucial as the visual aspect in creating realistic virtual environments. However, existing methods for room impulse response estimation rely either on data-demanding learning-based models or computationally expensive physics-based modeling. In this work, we introduce Audio-Visual Differentiable Room Acoustic Rendering (AV-DAR), a framework that leverages visual cues extracted from multi-view images and acoustic beam tracing for physics-based room acoustic rendering. Experiments across six real-world environments from two datasets demonstrate that our multimodal, physics-based approach is efficient, interpretable, and accurate, significantly outperforming a series of prior methods. Notably, on the Real Acoustic Field dataset, AV-DAR achieves comparable performance to models trained on 10 times more data while delivering relative gains ranging from 16.6% to 50.9% when trained at the same scale.
Poster
Ruonan Liu · Lin Zhu · Xijie Xiang · Lizhi Wang · Hua Huang

[ Exhibit Hall I ]

Abstract
Spike-based imaging, inspired by the human visual system, offers several advantages, including high temporal resolution and low power consumption, but suffers from significant image degradation in low-light conditions due to noise interference. Restoring spike images under such conditions poses a significant challenge, as traditional frame-based or spike-based techniques are ill-suited to handle such severe noise and unique noise characteristics. This paper proposes a novel approach for restoring low-light spike images using noise-modeled diffusion models. By establishing a noise-embedded spike imaging model under low light, we model the forward diffusion process as the degradation of spike images with proportional and residual terms and incorporate deterministic and non-deterministic components with reverse shifting, enabling the model to capture the distinctive spike noise structure. Additionally, we utilize a region mask image, dark current map, and spike density value as conditions to further guide the restoration process by providing prompts for degradation regions, deterministic parameters, and noise intensity. Experimental results demonstrate that our method significantly outperforms existing spike-based reconstruction and diffusion-based image restoration methods in both quantitative performance and visual quality. This work opens new possibilities for spike-based imaging systems, particularly in low-light environments, and lays the groundwork for future developments in spike image restoration using advanced …
Poster
Hao Fang · Jiawei Kong · Wenbo Yu · Bin Chen · Jiawei Li · Hao Wu · Shu-Tao Xia · Ke Xu

[ Exhibit Hall I ]

Abstract
Vision-Language Pre-training (VLP) models have exhibited unprecedented capability in many applications by taking full advantage of the learned multimodal alignment. However, previous studies have shown they are vulnerable to maliciously crafted adversarial samples. Despite recent success, these attacks are generally instance-specific and require generating perturbations for each input sample. In this paper, we reveal that VLP models are also susceptible to the instance-agnostic universal adversarial perturbation (UAP). Specifically, we design a novel Contrastive-training Perturbation Generator with Cross-modal conditions (C-PGC). Given that the pivotal multimodal alignment in VLP models is achieved via contrastive learning, we turn this powerful weapon against VLP models themselves. That is, we employ a malicious version of contrastive learning to train the proposed generator using our carefully crafted positive and negative image-text pairs. Once training is complete, the generator is able to produce universal perturbations that can essentially destroy the established alignment relationship in VLP models. Besides, C-PGC fully utilizes the characteristics of Vision-and-Language (V+L) scenarios by incorporating both unimodal and cross-modal information as effective guidance. Extensive experiments show that C-PGC successfully forces adversarial samples to move away from their original area in the VLP model's feature space, thus fundamentally enhancing attack performance across various …
Poster
Ge Zheng · Jiaye Qian · Jiajin Tang · Sibei Yang

[ Exhibit Hall I ]

Abstract
Large Vision-Language Models (LVLMs) have made significant progress in recent years but are also prone to hallucination issues. They exhibit more hallucinations in longer, free-form responses, often attributed to accumulated uncertainties. In this paper, we ask: Does increased hallucination result solely from length-induced errors, or is there a deeper underlying mechanism? After a series of preliminary experiments and findings, we suggest that the risk of hallucinations is not caused by length itself but by the increased reliance on context for coherence and completeness in longer responses. Building on these insights, we propose a novel ``induce-detect-suppress" framework that actively induces hallucinations through deliberately designed contexts, leverages induced instances for early detection of high-risk cases, and ultimately suppresses potential hallucinations during actual decoding. Our approach achieves consistent, significant improvements across all benchmarks, demonstrating its efficacy. The strong detection and improved hallucination mitigation not only validate our framework but, more importantly, re-validate our hypothesis on context. Rather than solely pursuing performance gains, this study aims to provide new insights and serves as a first step toward a deeper exploration of hallucinations in LVLMs' longer responses. Code will be released.
Poster
Xinhang Wan · Jiyuan Liu · Qian Qu · Suyuan Liu · Chuyu Zhang · Fangdi Wang · Xinwang Liu · En Zhu · Kunlun He

[ Exhibit Hall I ]

Abstract
In this paper, we address the problem of novel class discovery (NCD), which aims to cluster novel classes by leveraging knowledge from disjoint known classes. While recent advances have made significant progress in this area, existing NCD methods face two major limitations. First, they primarily focus on single-view data (e.g., images), overlooking the increasingly common multi-view data, such as multi-omics datasets used in disease diagnosis. Second, their reliance on pseudo-labels to supervise novel class clustering often results in unstable performance, as pseudo-label quality is highly sensitive to factors such as data noise and feature dimensionality. To address these challenges, we propose a novel framework named Intra-view and Inter-view Correlation Guided Multi-view Novel Class Discovery (IICMVNCD), which is the first attempt to explore NCD in the multi-view setting. Specifically, at the intra-view level, leveraging the distributional similarity between known and novel classes, we employ matrix factorization to decompose features into view-specific shared base matrices and factor matrices. The base matrices capture distributional consistency across the two datasets, while the factor matrices model pairwise relationships between samples. At the inter-view level, we utilize view relationships among known classes to guide the clustering of novel classes. This includes generating predicted labels through …
Poster
BaoFeng Tan · Xiu-Shen Wei · Lin Zhao

[ Exhibit Hall I ]

Abstract
In this paper, we address the problem of Self-Supervised Learning (SSL) for fine-grained representation learning, which aims to distinguish subtle differences within highly similar subordinate categories. Our preliminary analysis shows that SSL, especially the multi-stage alignment strategy, performs well on generic categories but struggles with fine-grained distinctions. To overcome this limitation, we propose a prototype-based contrastive learning module with stage-wise progressive augmentation. Unlike previous methods, our stage-wise progressive augmentation adapts data augmentation across stages to better suit SSL on fine-grained datasets. The prototype-based contrastive learning module captures both holistic and partial patterns, extracting global and local image representations to enhance feature discriminability. Experiments on popular fine-grained benchmarks for classification and retrieval tasks demonstrate the effectiveness of our method, and extensive ablation studies confirm the superiority of our proposals.
Poster
Shrisudhan Govindarajan · Daniel Rebain · Kwang Moo Yi · Andrea Tagliasacchi

[ Exhibit Hall I ]

Abstract
Research on differentiable scene representations is consistently moving towards more efficient, real-time models. Recently, this has led to the popularization of splatting methods, which eschew the traditional ray-based rendering of radiance fields in favor of rasterization. This has yielded a significant improvement in rendering speeds due to the efficiency of rasterization algorithms and hardware, but has come at a cost: the approximations that make rasterization efficient also make implementation of light transport phenomena like reflection and refraction much more difficult. We propose a novel scene representation which avoids these approximations, but keeps the efficiency and reconstruction quality of splatting by leveraging a decades-old efficient volumetric mesh ray tracing algorithm which has been largely overlooked in recent computer vision research. The resulting model, which we name Radiant Foam, achieves rendering speed and quality comparable to Gaussian Splatting, without the constraints of rasterization. Unlike ray traced Gaussian models that use hardware ray tracing acceleration, our method requires no special hardware or APIs beyond the standard features of a programmable GPU.
Poster
Ryan Rabinowitz · Steve Cruz · Walter Scheirer · Terrance Boult

[ Exhibit Hall I ]

Abstract
Handling novelty is a common challenge in visual recognition systems. Existing open-set methods rely on the familiarity hypothesis, detecting novelty by the absence of familiar features. We introduce a novel attenuation hypothesis, arguing that small weights learned during training, which attenuate features, play a dual role: they differentiate known classes but also discard information valuable for distinguishing known and unknown classes. How to effectively leverage this attenuation information to enhance open-set recognition remains unclear, so we present COSTARR, a novel approach that combines the requirement of familiar features and the lack of unfamiliar ones. We provide a probabilistic interpretation of the COSTARR score, linking it to the likelihood of correct classification and belonging to a known class. To determine the individual contributions of the pre- and post-attenuated features to COSTARR's performance, we conduct ablation studies that demonstrate both pre-attenuated deep features and the underutilized post-attenuated Hadamard product features are essential for improving open-set recognition. Also, to validate generalizability and efficacy across diverse architectures and datasets, we evaluate COSTARR in a large-scale setting, using ImageNet2012-1K as known data and NINCO, iNaturalist, OpenImage-O and other datasets as unknowns, across multiple modern pre-trained architectures (ViTs, ConvNeXts, and ResNet). The experiments demonstrate that COSTARR …
Poster
Dubing Chen · Jin Fang · Wencheng Han · Xinjing Cheng · Junbo Yin · Cheng-zhong Xu · Fahad Khan · Jianbing Shen

[ Exhibit Hall I ]

Abstract
Vision-based semantic occupancy and flow prediction provide critical spatiotemporal cues for real-world tasks like autonomous driving and robotics. In this work, we strive to improve performance by introducing a series of targeted improvements for 3D semantic occupancy prediction and flow estimation. First, we propose an occlusion-aware adaptive lifting mechanism with depth denoising to improve the robustness of 2D-to-3D feature transformation and reduce reliance on depth priors. Second, we enhance semantic consistency between 3D and 2D features using shared semantic prototypes to jointly constrain both modalities. This is supported by confidence- and category-based sampling to tackle long-tail challenges in 3D space. Third, to ease the feature encoding burden in joint semantics and flow prediction, we introduce a BEV cost volume-based method. It connects flow and semantic features via the cost volume and applies a classification-regression supervision scheme to manage varying flow scales in dynamic scenes. Our purely convolutional framework achieves SOTA results across multiple benchmarks for 3D semantic occupancy prediction and joint semantic occupancy-flow prediction. It is also the 2nd-place solution in the Occupancy and Flow in Autonomous Driving Challenge. We provide multiple model variants that optimally balance efficiency and performance. Our real-time version exceeds all existing real-time methods in speed …
Poster
Chunyi Li · Xiaozhe Li · Zicheng Zhang · Yuan Tian · Ziheng Jia · Xiaohong Liu · Xiongkuo Min · Jia Wang · Haodong Duan · Kai Chen · Guangtao Zhai

[ Exhibit Hall I ]

Abstract
With the emergence of Multimodal Large Language Models (MLLMs), hundreds of benchmarks have been developed to ensure the reliability of MLLMs in downstream tasks. However, the evaluation mechanism itself may not be reliable. For developers of MLLMs, questions remain about which benchmark to use and whether the test results meet their requirements. Therefore, we propose a critical principle of Information Density, which examines **how much insight a benchmark can provide for the development of MLLMs.** We characterize it from four key dimensions: (1) Fallacy, (2) Difficulty, (3) Redundancy, (4) Diversity. Through a comprehensive analysis of more than 10,000 samples, we measured the information density of 19 MLLM benchmarks. Experiments show that using the latest benchmarks in testing can provide more insight compared to previous ones, but there is still room for improvement in their information density. We hope this principle can promote the development and application of future MLLM benchmarks.
Poster
Jhe-Hao Lin · Yi Yao · Chan-Feng Hsu · Hongxia Xie · Hong-Han Shuai · Wen-Huang Cheng

[ Exhibit Hall I ]

Abstract
Knowledge distillation (KD) involves transferring knowledge from a pre-trained heavy teacher model to a lighter student model, thereby reducing the inference cost while maintaining comparable effectiveness. Prior KD techniques typically assume homogeneity between the teacher and student models. However, as technology advances, a wide variety of architectures have emerged, ranging from early Convolutional Neural Networks (CNNs) to Vision Transformers (ViTs) and Multi-Layer Perceptrons (MLPs). Consequently, developing a universal KD framework compatible with any architecture has become an important research topic. In this paper, we introduce a perspective-aware teaching (PAT) KD framework to enable feature distillation across diverse architectures. Our framework comprises two key components. First, we design prompt tuning blocks that incorporate student feedback, allowing teacher features to adapt to the student model's learning process. Second, we propose region-aware attention to mitigate the view mismatch problem between heterogeneous architectures. By leveraging these two modules, effective distillation of intermediate features can be achieved across heterogeneous architectures. Extensive experiments on CIFAR, ImageNet, and COCO demonstrate the superiority of the proposed method.
Poster
Yunchuan Guan · Yu Liu · Ke Zhou · Zhiqi Shen · Jenq-Newng Hwang · Serge Belongie · Lei Li

[ Exhibit Hall I ]

Abstract
Meta-learning is a powerful paradigm for tackling few-shot tasks. However, recent studies indicate that models trained with the whole-class training strategy can achieve comparable performance to those trained with meta-learning in few-shot classification tasks. To demonstrate the value of meta-learning, we establish an entropy-limited supervised setting for fair comparisons. Through both theoretical analysis and experimental validation, we show that meta-learning has a tighter generalization bound compared to whole-class training. We further find that meta-learning is more efficient with limited entropy and is more robust to label noise and heterogeneous tasks, making it well-suited for unsupervised tasks. Based on these insights, we propose MINO, a meta-learning framework designed to enhance unsupervised performance. MINO utilizes the adaptive clustering algorithm DBSCAN with a dynamic head for unsupervised task construction and a stability-based meta-scaler for robustness against label noise. Extensive experiments confirm its effectiveness in multiple unsupervised few-shot and zero-shot tasks.
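As a concrete illustration of the unsupervised task-construction step, the sketch below clusters precomputed embeddings with DBSCAN and treats cluster IDs as pseudo-classes from which N-way K-shot episodes are sampled. This is a minimal sketch under assumptions (scikit-learn's DBSCAN, at least `n_way` sufficiently large clusters); the paper's dynamic head and stability-based meta-scaler are not shown.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def build_episode(features, n_way=5, k_shot=1, q_query=5, eps=0.5, min_samples=5):
    """Cluster embeddings with DBSCAN and sample one N-way K-shot episode.

    features: (N, D) array of precomputed embeddings.
    Returns (support, query) as lists of (sample_index, episode_label) pairs.
    Assumes DBSCAN yields at least `n_way` clusters with enough members.
    """
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(features)
    clusters = [np.where(labels == c)[0] for c in set(labels) if c != -1]   # -1 = noise
    clusters = [idx for idx in clusters if len(idx) >= k_shot + q_query]
    chosen = np.random.choice(len(clusters), size=n_way, replace=False)
    support, query = [], []
    for way, c in enumerate(chosen):
        idx = np.random.permutation(clusters[c])[:k_shot + q_query]
        support += [(int(i), way) for i in idx[:k_shot]]
        query += [(int(i), way) for i in idx[k_shot:]]
    return support, query
```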
Poster
Xudong LU · Yinghao Chen · Renshou Wu · Haohao Gao · Xi Chen · Xue Yang · Xiangyu Zhao · Aojun Zhou · Fangyuan Li · Yafei Wen · Xiaoxin Chen · shuai ren · Hongsheng Li

[ Exhibit Hall I ]

Abstract
Recent advancements in Multimodal Large Language Models (MLLMs) have enabled their deployment on mobile devices. However, challenges persist in maintaining strong language capabilities and ensuring hardware compatibility, both of which are crucial for user experience and practical deployment efficiency. In our deployment process, we observe that existing MLLMs often face performance degradation on pure language tasks, and the current NPU platforms on smartphones do not support the MoE architecture, which is commonly used to preserve pure language capabilities during multimodal training. To address these issues, we systematically analyze methods to maintain pure language capabilities during the training of MLLMs, focusing on both training data and model architecture aspects. Based on these analyses, we propose **GenieBlue**, an efficient MLLM structural design that integrates both linguistic and multimodal capabilities for LLMs on mobile devices. GenieBlue freezes the original LLM parameters during MLLM training to maintain pure language capabilities. It acquires multimodal capabilities by duplicating specific transformer blocks for full fine-tuning and integrating lightweight LoRA modules. This approach preserves language capabilities while achieving comparable multimodal performance through extensive training. Deployed on smartphone NPUs, GenieBlue demonstrates efficiency and practicality for applications on mobile devices.
Poster
Gaurav Patel · Qiang Qiu

[ Exhibit Hall I ]

Abstract
Machine Unlearning has recently garnered significant attention, aiming to selectively remove knowledge associated with specific data while preserving the model’s performance on the remaining data. A fundamental challenge in this process is balancing effective unlearning with knowledge retention, as naive optimization of these competing objectives can lead to conflicting gradients, hindering convergence and degrading overall performance. To address this issue, we propose Learning to Unlearn while Retaining, which aims to mitigate gradient conflicts between unlearning and retention objectives. Our approach strategically avoids conflicts through an implicit gradient regularization mechanism that emerges naturally within the proposed framework. This prevents conflicting gradients between unlearning and retention, leading to effective unlearning while preserving the model’s utility. We validate our approach across both discriminative and generative tasks, demonstrating its effectiveness in achieving unlearning without compromising performance on remaining data. Our results highlight the advantages of avoiding such gradient conflicts, outperforming existing methods that fail to account for these interactions.
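For intuition, the conflict the abstract refers to can be made explicit with flattened gradients of the two objectives. The sketch below uses a generic PCGrad-style projection to resolve such conflicts; the paper instead relies on an implicit gradient regularization mechanism, so this is only an illustration of the underlying issue, not the proposed method.

```python
import torch

def combine_gradients(g_unlearn: torch.Tensor, g_retain: torch.Tensor) -> torch.Tensor:
    """Combine flattened gradients of the unlearning and retention losses.

    If they conflict (negative dot product), project the unlearning gradient
    onto the normal plane of the retention gradient before summing.
    """
    dot = torch.dot(g_unlearn, g_retain)
    if dot < 0:   # the two objectives pull the weights in opposing directions
        g_unlearn = g_unlearn - (dot / g_retain.norm().pow(2)) * g_retain
    return g_unlearn + g_retain

# Usage: flatten each loss's per-parameter gradients into one vector, combine
# them as above, then scatter the result back into p.grad before optimizer.step().
```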
Poster
PENG LIAO · Xilu Wang · Yaochu Jin · WenLi Du · Han Hu

[ Exhibit Hall I ]

Abstract
Neural Architecture Search (NAS) has gained significant attention in personalized federated learning (PFL) due to its ability to automatically design tailored models for individual clients. While most existing NAS approaches for PFL perform architecture searches on the server side, client-side NAS—where architectures are optimized locally on clients—offers stronger privacy protection by eliminating the need to transmit sensitive model information. However, this paradigm remains underexplored and often suffers from suboptimal average client performance, primarily due to two limitations: (1) Inefficient client-side search strategies caused by data isolation and restricted access to local architectures across clients, and (2) slow supernet convergence arising from server aggregation and local supernet training. To address these challenges, we propose a Personalized Federated Stochastic Differential Equation-based NAS (PerFedSDE-NAS). To achieve effective local search, each client employs a guided diffusion model to generate promising personalized architectures tailored to local data characteristics, while a performance predictor based on radial basis functions is used to select only the most promising subset of architectures for evaluation. To accelerate supernet convergence, each client maintains a supernet with an archive-driven training mechanism, and a novel model aggregation strategy is proposed to further enhance weight convergence during FL rounds. We validate PerFedSDE-NAS across three …
Poster
Jie Xu · Na Zhao · Gang Niu · Masashi Sugiyama · Xiaofeng Zhu

[ Exhibit Hall I ]

Abstract
Recently, multi-view learning (MVL) has garnered significant attention due to its ability to fuse discriminative information from multiple views. However, real-world multi-view datasets are often heterogeneous and imperfect, which usually makes MVL methods designed for specific combinations of views lack application potential and limits their effectiveness. To address this issue, we propose a novel robust MVL method (namely RML) with simultaneous representation fusion and alignment. Specifically, we introduce a simple yet effective multi-view transformer fusion network where we transform heterogeneous multi-view data into homogeneous word embeddings, and then integrate multiple views by the sample-level attention mechanism to obtain a fused representation. Furthermore, we propose a simulated perturbation based multi-view contrastive learning framework that dynamically generates the noise and unusable perturbations for simulating imperfect data conditions. The simulated noisy and unusable data obtain two distinct fused representations, and we utilize contrastive learning to align them for learning discriminative and robust representations. Our RML is self-supervised and can also be applied for downstream tasks as a regularization. In experiments, we employ it in unsupervised multi-view clustering, noisy-label classification, and as a plug-and-play module for cross-modal hashing retrieval. Extensive comparison experiments and ablation studies validate the effectiveness of RML.
Poster
Jeonghyeok Do · Sungpyo Kim · Geunhyuk Youk · Jaehyup Lee · Munchurl Kim

[ Exhibit Hall I ]

Abstract
PAN-sharpening aims to fuse high-resolution panchromatic (PAN) images with low-resolution multi-spectral (MS) images to generate high-resolution multi-spectral (HRMS) outputs. However, cross-modality misalignment, caused by sensor placement, acquisition timing, and resolution disparity, induces a fundamental challenge. Conventional deep learning methods assume perfect pixel-wise alignment and rely on per-pixel reconstruction losses, leading to spectral distortion, double edges, and blurring when misalignment is present. To address this, we propose PAN-Crafter, a modality-consistent alignment framework that explicitly mitigates the misalignment gap between PAN and MS modalities. At its core, Modality-Adaptive Reconstruction (MARs) enables a single network to jointly reconstruct HRMS and PAN images, leveraging PAN’s high-frequency details as auxiliary self-supervision. Additionally, we introduce Cross-Modality Alignment-Aware Attention (CM3A), a novel mechanism that bidirectionally aligns MS texture to PAN structure and vice versa, enabling adaptive feature refinement across modalities. Extensive experiments on multiple benchmark datasets demonstrate that our PAN-Crafter outperforms the most recent state-of-the-art method in all metrics, even with 50.11× faster inference time and 0.63× the memory size. Furthermore, it demonstrates strong generalization performance on unseen satellite datasets, showing its robustness across different conditions.
Poster
Yuzhu Wang · Manni Duan · Shu Kong

[ Exhibit Hall I ]

Abstract
Visual Prompt Tuning (VPT) is a parameter-efficient finetuning technique that adapts a pre-trained vision Transformer (ViT) by learning a small set of parameters in the input space, known as prompts. In VPT, we uncover "burstiness" in the values arising from the interaction of image patch embeddings, and the key and query projectors within Transformer's self-attention module. Interestingly, the values of patch embeddings and the key and query projectors exhibit Laplacian and hyper-Laplacian distribution, respectively. Intuitively, these non-Gaussian distributions pose challenges for learning prompts. To address this, we propose whitening these data, de-correlating them and equalizing their variance towards more Gaussian before learning prompts. We derive the whitening matrix over random image patch embeddings and ViT's key and query projectors, and multiply it with the prompt to be learned in a bilinear manner. Surprisingly, this method significantly accelerates prompt tuning and boosts accuracy, e.g., >25 points on the CUB dataset; interestingly, it learns "bursty prompts". As bilinear models are known to introduce burstiness, we present a compact method by learning two small sets of parameters whose multiplication yields the final prompts. We call the proposed methods Bilinear Prompt Tuning (BPT). Extensive experiments demonstrate that BPT methods not only outperform various VPT methods …
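A minimal sketch of the whitening step follows, assuming ZCA-style whitening computed from an (N, D) matrix of random patch embeddings; the paper also incorporates the ViT's key and query projectors, and the compact two-factor parameterization of BPT is omitted.

```python
import torch

def zca_whitening_matrix(x: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """ZCA whitening matrix for an (N, D) matrix of embeddings."""
    x = x - x.mean(dim=0, keepdim=True)
    cov = x.t() @ x / (x.shape[0] - 1)                 # (D, D) covariance
    eigvals, eigvecs = torch.linalg.eigh(cov)          # symmetric eigendecomposition
    return eigvecs @ torch.diag((eigvals + eps).rsqrt()) @ eigvecs.t()

# Toy usage: learn the prompt in whitened coordinates and map it back bilinearly.
patch_embeddings = torch.randn(4096, 768)              # stand-in for real ViT patch embeddings
W = zca_whitening_matrix(patch_embeddings)             # computed once, then frozen
raw_prompt = torch.nn.Parameter(torch.zeros(10, 768))  # 10 prompt tokens
whitened_prompt = raw_prompt @ W.t()                   # fed to the ViT instead of raw_prompt
```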
Poster
Sophia Sirko-Galouchenko · Spyros Gidaris · Antonin Vobecky · Andrei Bursuc · Nicolas THOME

[ Exhibit Hall I ]

Abstract
We introduce DIP, a novel unsupervised post-training method designed to enhance dense image representations in large-scale pretrained vision encoders for in-context scene understanding. Unlike prior approaches that rely on complex self-distillation architectures, our method trains the vision encoder using pseudo-tasks that explicitly simulate downstream in-context scenarios, inspired by meta-learning principles. To enable post-training on unlabeled data, we propose an automatic mechanism for generating in-context tasks that combines a pretrained diffusion model and the vision encoder itself. DIP is simple, unsupervised, and computationally efficient, requiring less than 9 hours on a single A100 GPU. By learning dense representations through pseudo in-context tasks, it achieves strong performance across a wide variety of downstream real-world in-context scene understanding tasks. It outperforms both the initial vision encoder and prior methods, offering a practical and effective solution for improving dense representations.
Poster
Kaiyu Yue · Vasu Singla · Menglin Jia · John Kirchenbauer · Rifaa Qadri · Zikui Cai · Abhinav Bhatele · Furong Huang · Tom Goldstein

[ Exhibit Hall I ]

Abstract
Vision language models (VLMs) typically pair a modestly sized vision encoder with a large language model (LLM), e.g., Llama-70B, making the decoder the primary computational burden during training. To reduce costs, a promising strategy is to first train the vision encoder using a small language model before transferring it to the large one. We construct small "surrogate models" that share the same embedding space and representation language as the large target LLM by directly inheriting its shallow layers. Vision encoders trained on the surrogate can then be directly transferred to the larger model, a process we call zero-shot grafting: when plugged directly into the full-size target LLM, the grafted pair surpasses the encoder-surrogate pair and, on some benchmarks, even performs on par with full decoder training with the target LLM. Furthermore, our surrogate training approach reduces overall VLM training costs by ~45% when using Llama-70B as the decoder.
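A minimal sketch of how a shallow-layer surrogate could be carved out of a large decoder, assuming a Hugging Face Llama-style module layout (`model.layers`, `config.num_hidden_layers`); these attribute names and the deep-copy approach are assumptions for illustration, not the paper's code.

```python
import copy
import torch.nn as nn

def build_surrogate(target_llm, num_shallow_layers: int = 8):
    """Keep only the first K transformer blocks of the target decoder.

    Embeddings, final norm, and LM head are inherited unchanged, so the
    surrogate shares the target's embedding space. (In practice one would
    load just the needed layers instead of deep-copying a very large model.)
    """
    surrogate = copy.deepcopy(target_llm)
    surrogate.model.layers = nn.ModuleList(surrogate.model.layers[:num_shallow_layers])
    surrogate.config.num_hidden_layers = num_shallow_layers
    return surrogate

# A vision encoder trained against the surrogate can then be plugged directly
# into the full target LLM ("zero-shot grafting"), since both decoders share
# the same embedding space and shallow representations.
```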
Poster
Yuting Liu · Liu Yang · Yu Wang

[ Exhibit Hall I ]

Abstract
Real-world data often exhibit long-tailed distributions, which degrade data quality and pose challenges for deep learning. To address this issue, knowledge transfer from head classes to tail classes has been shown to effectively mitigate feature sparsity. However, existing methods often overlook class differences, leading to suboptimal knowledge transfer. While the class space exhibits a label hierarchy, similarity relationships beyond hierarchically related categories remain underexplored. Considering the human ability to process visual perception problems in a multi-granularity manner guided by semantics, this paper presents a novel semantic knowledge-driven contrastive learning method. Inspired by the implicit knowledge embedded in large language models, the proposed LLM-based label semantic generation method overcomes the limitations of the label hierarchy. Additionally, a semantic knowledge graph is constructed based on the extended label information to guide representation learning. This enables the model to dynamically identify relevant classes for learning and facilitates multi-granularity knowledge transfer between similar categories. Experiments on long-tail benchmark datasets, including CIFAR-10-LT, CIFAR-100-LT, and ImageNet-LT, demonstrate that the proposed method significantly improves the accuracy of tail classes and enhances overall performance without compromising the accuracy of head classes.
Poster
CHENMING ZHU · Tai Wang · Wenwei Zhang · Jiangmiao Pang · Xihui Liu

[ Exhibit Hall I ]

Abstract
Recent advancements in Large Multimodal Models (LMMs) have greatly enhanced their proficiency in 2D visual understanding tasks, enabling them to effectively process and understand images and videos. However, the development of LMMs with 3D scene understanding capabilities has been hindered by the lack of large-scale 3D vision-language datasets and powerful 3D encoders. In this paper, we introduce a simple yet effective framework called LLaVA-3D. Leveraging the strong 2D visual understanding priors from LLaVA, our LLaVA-3D efficiently adapts LLaVA for 3D scene understanding without compromising 2D understanding capabilities. To achieve this, we utilize the 3D position embeddings to enhance the 2D CLIP Patches with 3D spatial context information and construct 3D patches. By integrating the 3D position embeddings into 2D LMMs and employing joint 2D and 3D vision-language instruction tuning, we establish a unified architecture for both 2D visual understanding and 3D scene understanding. In contrast to previous 3D LMMs, LLaVA-3D supports decoding accurate 3D spatial perception outputs, e.g., 3D bounding boxes, directly from these 3D patches, without relying on the time-consuming off-the-shelf 3D segmentors. Experimental results show that LLaVA-3D converges 3.5x faster than existing 3D LMMs when trained on 3D vision-language datasets. Moreover, LLaVA-3D not only achieves state-of-the-art performance across …
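A minimal sketch of one way 3D position embeddings could be added to 2D patch features to form "3D patches": a small MLP maps each patch's back-projected (x, y, z) center to the feature dimension and is added to the CLIP patch embedding. Module and argument names are assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class Patch3DLifter(nn.Module):
    """Add an MLP embedding of each patch's 3D center to its 2D CLIP feature."""

    def __init__(self, dim: int = 1024):
        super().__init__()
        self.pos_mlp = nn.Sequential(nn.Linear(3, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, patch_feats: torch.Tensor, patch_xyz: torch.Tensor) -> torch.Tensor:
        # patch_feats: (B, N, dim) 2D patch features from the CLIP encoder
        # patch_xyz:   (B, N, 3)   3D coordinates of each patch (from depth + camera pose)
        return patch_feats + self.pos_mlp(patch_xyz)   # "3D patches"
```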
Poster
Dongli Tan · Xingyi He · Sida Peng · Yiqing Gong · Xing Zhu · Jiaming Sun · Ruizhen Hu · Yujun Shen · Hujun Bao · Xiaowei Zhou

[ Exhibit Hall I ]

Abstract
This paper aims to establish correspondences for a set of 2D query points across a video sequence in an online manner. Recent methods leverage future frames to achieve smooth point tracking at the current frame, but they still struggle to find points with significant viewpoint changes after long-term occlusions and inherently cannot achieve online tracking. To overcome these challenges, we develop a novel online tracking framework, named ReTracker, that integrates two advances in image matching with tracking-specific designs. First, a decoder network with a global receptive field is incorporated with a temporal attention module to robustly track points undergoing large location changes. Second, the decoder network is adapted to pretrain on large-scale two-view matching data, which offers significantly greater diversity and volume than tracking data, to learn general matching priors. This pretraining strategy effectively enhances our tracker's ability to handle viewpoint and appearance variations after long-term occlusions. Experiments demonstrate that our method outperforms recent online trackers across multiple benchmarks and achieves competitive or superior performance compared to offline methods. Furthermore, we collect an ego-centric, occlusion-heavy dataset to illustrate the retracking capabilities of our approach. The code and dataset will be released for reproducibility.
Poster
JIAHE ZHAO · RuiBing Hou · zejie tian · Hong Chang · Shiguang Shan

[ Exhibit Hall I ]

Abstract
We propose a new task to benchmark human-in-scene understanding for embodied agents: Human-In-Scene Question Answering (HIS-QA). Given a human motion within a 3D scene, HIS-QA requires the agent to comprehend human states and behaviors, reason about its surrounding environment, and answer human-related questions within the scene. To support this new task, we present HIS-Bench, a multimodal benchmark that systematically evaluates HIS understanding across a broad spectrum, from basic perception to commonsense reasoning and planning. Our evaluation of various vision-language models on HIS-Bench reveals significant limitations in their ability to handle HIS-QA tasks. To this end, we propose HIS-GPT, the first foundation model for HIS understanding. HIS-GPT integrates 3D scene context and human motion dynamics into large language models while incorporating specialized mechanisms to capture human-scene interactions. Extensive experiments demonstrate that HIS-GPT sets a new state-of-the-art on HIS-QA tasks. We hope this work inspires future research of human behavior analysis in 3D scenes, advancing embodied AI and world models.
Poster
Xiang Li · Lannan Luo · Qiang Zeng

[ Exhibit Hall I ]

Abstract
Conventional backdoor attacks on deep neural networks (DNNs) typically assume that an attacker can manipulate the training data or process. However, recent research introduces a more practical threat model by injecting backdoors at the inference stage. These approaches leverage bit flip attacks to modify model weights using memory fault injection techniques such as Rowhammer. Despite their effectiveness, they suffer from a significant limitation: the need to flip a relatively large number of bits simultaneously, which is highly challenging in practice. To overcome this constraint, we propose SOLEFLIP, the first one-bit-flip backdoor attack. Unlike prior methods that rely on optimization-based bit searches and require flipping multiple bits, our algorithm identifies a promising weight for the attack and flips a single bit to insert a backdoor. We evaluate SOLEFLIP on CIFAR-10, SVHN, and ImageNet across various DNN architectures, including a vision transformer. The results show that SOLEFLIP achieves high attack success rates (up to 99.9%, with an average of 98.9%) while causing minimal degradation to benign accuracy (0.0% on average). Furthermore, SOLEFLIP is resilient to backdoor defenses. Our findings reveal a critical threat to DNNs: flipping just one bit is sufficient to execute a successful backdoor attack.
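For context, the physical primitive behind such attacks is flipping a single bit of a stored float32 weight. The sketch below shows the bit-level manipulation on a NumPy array; choosing which weight and which bit to flip is the actual contribution of the attack algorithm and is not shown.

```python
import numpy as np

def flip_bit(weight: float, bit: int) -> float:
    """Flip one bit of a float32 value via its raw uint32 representation."""
    buf = np.array([weight], dtype=np.float32)
    buf.view(np.uint32)[0] ^= np.uint32(1 << bit)
    return float(buf[0])

w = 0.015625                   # 2**-6
print(flip_bit(w, 30))         # flipping a high exponent bit changes the value enormously
print(flip_bit(w, 0))          # flipping the lowest mantissa bit barely changes it
```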
Poster
Hang Su · Yunlong Feng · Daniel Gehrig · Panfeng Jiang · Ling Gao · Xavier Lagorce · Laurent Kneip

[ Exhibit Hall I ]

Abstract
Structure and continuous motion estimation from point correspondences is a fundamental problem in computer vision that has been powered by well-known algorithms such as the familiar 5-point or 8-point algorithm. However, despite their acclaim, these algorithms are limited to processing point correspondences originating from a pair of views each one representing an instantaneous capture of the scene. Yet, in the case of rolling shutter cameras, or more recently, event cameras, this synchronization breaks down. In this work, we present a unified approach for structure and linear motion estimation from 2D point correspondences with arbitrary timestamps, from an arbitrary set of views. By formulating the problem in terms of first-order dynamics and leveraging a constant velocity motion model, we derive a novel, linear point incidence relation allowing for the efficient recovery of both linear velocity and 3D points with predictable degeneracies and solution multiplicities. Owing to its general formulation, it can handle correspondences from a wide range of sensing modalities such as global shutter, rolling shutter, and event cameras, and can even combine correspondences from different collocated sensors. We validate the effectiveness of our solver on both simulated and real-world data, where we show consistent improvement across all modalities when compared …
Poster
Chen Ziwen · Hao Tan · Kai Zhang · Sai Bi · Fujun Luan · Yicong Hong · Li Fuxin · Zexiang Xu

[ Exhibit Hall I ]

Abstract
We propose Long-LRM, a feed-forward 3D Gaussian reconstruction model for instant, high-resolution, 360° wide-coverage, scene-level reconstruction. Specifically, it takes in 32 input images at a resolution of 960×540 and produces the Gaussian reconstruction in just 1 second on a single A100 GPU. To handle the long sequence of **250K** tokens brought by the large input size, Long-LRM features a mixture of the recent Mamba2 blocks and the classical transformer blocks, enhanced by a light-weight token merging module and Gaussian pruning steps that balance between quality and efficiency. We evaluate Long-LRM on the large-scale DL3DV benchmark and Tanks&Temples, demonstrating reconstruction quality comparable to the optimization-based methods while achieving an **800**× speedup w.r.t. the optimization-based approaches and an input size at least **60**× larger than the previous feed-forward approaches. We conduct extensive ablation studies on our model design choices for both rendering quality and computation efficiency. We also explore Long-LRM's compatibility with other Gaussian variants such as 2D GS, which enhances Long-LRM's ability in geometry reconstruction. Project page: https://longgggglrm.github.io
Poster
Siqi Luo · Haoran Yang · Yi Xin · Mingyang Yi · Guangyang Wu · Guangtao Zhai · Xiaohong Liu

[ Exhibit Hall I ]

Abstract
Large pre-trained models achieve remarkable performance in vision tasks but are impractical for fine-tuning due to high computational and storage costs. Parameter-Efficient Fine-Tuning (PEFT) methods mitigate this issue by updating only a subset of parameters; however, most existing approaches are task-agnostic, failing to fully exploit task-specific adaptations, which leads to suboptimal efficiency and performance. To address this limitation, we propose Task-Relevant Parameter and Token Selection (TR-PTS), a task-driven framework that enhances both computational efficiency and accuracy. Specifically, we introduce Task-Relevant Parameter Selection, which utilizes the Fisher Information Matrix (FIM) to identify and fine-tune only the most informative parameters in a layer-wise manner, while keeping the remaining parameters frozen. Simultaneously, Task-Relevant Token Selection dynamically preserves the most informative tokens and merges redundant ones, reducing computational overhead. By jointly optimizing parameters and tokens, TR-PTS enables the model to concentrate on task-discriminative information. We evaluate TR-PTS on benchmark datasets, including FGVC and VTAB-1k, where it achieves state-of-the-art performance, surpassing full fine-tuning by 3.40% and 10.35%, respectively.
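A minimal sketch of Fisher-guided parameter selection, using the common diagonal approximation (averaged squared gradients) and, for simplicity, freezing whole tensors rather than individual parameters within each layer as the paper describes; the token-selection branch is omitted.

```python
import torch

def diagonal_fisher(model, loss_fn, loader, n_batches=10):
    """Average squared gradients over a few batches as a diagonal Fisher proxy."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    for i, (x, y) in enumerate(loader):
        if i == n_batches:
            break
        model.zero_grad()
        loss_fn(model(x), y).backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                fisher[n] += p.grad.detach() ** 2
    return {n: f / n_batches for n, f in fisher.items()}

def keep_top_fisher_tensors(model, fisher, keep_fraction=0.1):
    """Coarse variant: leave only the tensors with the highest mean Fisher trainable."""
    scores = {n: f.mean().item() for n, f in fisher.items()}
    n_keep = max(1, int(keep_fraction * len(scores)))
    keep = set(sorted(scores, key=scores.get, reverse=True)[:n_keep])
    for n, p in model.named_parameters():
        p.requires_grad = n in keep
```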
Poster
Weihao Xia · Cengiz Oztireli

[ Exhibit Hall I ]

Abstract
The intrication of brain signals drives research that leverages multimodal AI to align brain modalities with visual and textual data for explainable descriptions. However, most existing studies are limited to coarse interpretations, lacking essential details on object descriptions, locations, attributes, and their relationships. This leads to imprecise and ambiguous reconstructions when using such cues for visual decoding. To address this, we analyze different choices of vision feature spaces from pre-trained visual components within Multimodal Large Language Models (MLLMs) and introduce a zero-shot multimodal brain decoding method that interacts with these models to decode across multiple levels of granularities. To assess a model's ability to decode fine details from brain signals, we propose the Multi-Granularity Brain Detail Understanding Benchmark (MG-BrainDub). This benchmark includes two key tasks: detailed descriptions and salient question-answering, with metrics highlighting key visual elements like objects, attributes, and relationships. Our approach enhances neural decoding precision and supports more accurate neuro-decoding applications.
Poster
Paweł Skierś · Kamil Deja

[ Exhibit Hall I ]

Abstract
In this work, we introduce JDCL - a new method for continual learning with generative rehearsal based on joint diffusion models. Neural networks suffer from catastrophic forgetting defined as abrupt loss in the model's performance when retrained with additional data coming from a different distribution. Generative-replay-based continual learning methods try to mitigate this issue by retraining a model with a combination of new and rehearsal data sampled from a generative model. In this work, we propose to extend this idea by combining a continually trained classifier with a diffusion-based generative model into a single - jointly optimized neural network. We show that such shared parametrization, combined with the knowledge distillation technique allows for stable adaptation to new tasks without catastrophic forgetting. We evaluate our approach on several benchmarks, where it outperforms recent state-of-the-art generative replay techniques. Additionally, we extend our method to the semi-supervised continual learning setup, where it outperforms competing buffer-based replay techniques, and evaluate, in a self-supervised manner, the quality of trained representations.
Poster
Zewei Zhou · Zhihao Zhao · Tianhui Cai · Zhiyu Huang · Bolei Zhou · Jiaqi Ma

[ Exhibit Hall I ]

Abstract
End-to-end training of multi-agent systems offers significant advantages in improving multi-task performance. However, training such models remains challenging and requires extensive manual design and monitoring. In this work, we introduce TurboTrain, a novel and efficient training framework for multi-agent perception and prediction. TurboTrain comprises two key components: a multi-agent spatiotemporal pretraining scheme based on masked reconstruction learning and a balanced multi-task learning strategy based on gradient conflict suppression. By streamlining the training process, our framework eliminates the need for manually designing and tuning complex multi-stage training pipelines, substantially reducing training time and improving performance. We evaluate TurboTrain on a real-world cooperative driving dataset and demonstrate that it further improves the performance of state-of-the-art multi-agent perception and prediction models by nearly 9%. Our results highlight that pretraining effectively captures spatiotemporal multi-agent features and significantly benefits downstream tasks. Moreover, the proposed balanced multi-task learning strategy enhances cooperative detection and prediction. The codebase will be released to facilitate future multi-agent multi-task research.
Poster
Ada Görgün · Bernt Schiele · Jonas Fischer

[ Exhibit Hall I ]

Abstract
Neural networks are widely adopted to solve complex and challenging tasks. Especially in high-stakes decision-making, understanding their reasoning process is crucial, yet proves challenging for modern deep networks. Feature visualization (FV) is a powerful tool to decode what information neurons are responding to and hence to better understand the reasoning behind such networks. In particular, in FV we generate human-understandable images that reflect the information detected by neurons of interest. However, current methods often yield unrecognizable visualizations, exhibiting repetitive patterns and visual artifacts that are hard to understand for a human. To address these problems, we propose to guide FV through **statistics of real image features** combined with measures of **relevant network flow** to generate prototypical images. Our approach yields human-understandable visualizations that both qualitatively and quantitatively improve over state-of-the-art FVs across various architectures. As such, it can be used to decode **which** information is used by the network's reasoning process, complementing the methodology of mechanistic circuits that identify **where** relevant information is encoded.
Poster
Feifei Zhang · Zhihao Wang · Xi Zhang · Changsheng Xu

[ Exhibit Hall I ]

Abstract
Visual Question Answering (VQA) is a widely explored multimodal task aimed at answering questions based on images. Recently, a few studies have started to investigate continual learning in VQA to cope with evolving multimodal data streams. However, these studies fall short of tackling another critical issue in real-world VQA applications: the long-tailed distribution of data. In this paper, we introduce Continual Long-Tailed Visual Question Answering (CLT-VQA) and identify two critical challenges: **inner-task prototype drift**, where classifier prototypes become biased toward majority classes due to imbalanced data, and **inter-task feature drift**, where learned features shift over time, causing forgetting of previously learned knowledge. To address these challenges, we propose a unified dual-balance approach that integrates a Balanced Classifier Prototype (BCP) learning module and a Multi-modal Feature Alignment (MFA) module. The BCP optimizes classifier prototypes to achieve balanced class representation, while the MFA aligns features consistently across tasks, preventing catastrophic forgetting. Extensive experimental results demonstrate that our method outperforms existing models, validating the effectiveness of the proposed approach. Code is available in the supplementary materials.
Poster
Shaofeng Yin · Ting Lei · Yang Liu

[ Exhibit Hall I ]

Abstract
Integrating external tools into Large Foundation Models (LFMs) has emerged as a promising approach to enhance their problem-solving capabilities. While existing studies have demonstrated strong performance in tool-augmented Visual Question Answering (VQA), recent benchmarks reveal significant gaps in real-world tool-use proficiency, particularly in functionally diverse multimodal settings requiring multi-step reasoning. In this work, we introduce ToolVQA, a large-scale multimodal dataset comprising 23K instances, designed to bridge this gap. Unlike previous datasets that rely on synthetic scenarios and simplified queries, ToolVQA features real-world visual contexts and challenging implicit multi-step reasoning tasks, better aligning with real user interactions. To construct this dataset, we propose ToolEngine, a novel data generation pipeline that employs Depth-First Search (DFS) with a dynamic in-context example matching mechanism to simulate human-like tool-use reasoning. ToolVQA encompasses 10 multimodal tools across 7 diverse task domains, with an average inference length of 2.78 reasoning steps per instance. The 7B LFMs fine-tuned on ToolVQA not only achieve impressive performance on our test set but also surpass the large closed-source model GPT-3.5-turbo on various out-of-distribution (OOD) datasets, demonstrating strong generalizability to real-world tool-use scenarios.
Poster
Deng Li · Aming WU · Yang Li · Yaowei Wang · Yahong Han

[ Exhibit Hall I ]

Abstract
In practice, environments constantly change over time and space, posing significant challenges for object detectors trained based on a closed-set assumption, i.e., training and test data share the same distribution. To this end, continual test-time adaptation has attracted much attention, aiming to improve detectors' generalization by fine-tuning a few specific parameters, e.g., BatchNorm layers. However, based on a small number of test images, fine-tuning certain parameters may affect the representation ability of other fixed parameters, leading to performance degradation. Instead, we explore a new mechanism, i.e., converting the fine-tuning process to a specific-parameter generation. Particularly, we first design a dual-path LoRA-based domain-aware adapter that disentangles features into domain-invariant and domain-specific components, enabling efficient adaptation. Additionally, a conditional diffusion-based parameter generation mechanism is presented to synthesize the adapter’s parameters based on the current environment, preventing the optimization from getting stuck in local optima. Finally, we propose a class-centered optimal transport alignment method to mitigate catastrophic forgetting. Extensive experiments conducted on various continuous domain adaptive object detection tasks demonstrate the effectiveness. Meanwhile, visualization results show that representation extracted by the generated parameters can capture more object-related information and strengthen the generalization ability.
Poster
Wen Yang · Guodong Liu · Di Ming

[ Exhibit Hall I ]

Abstract
Transfer-based attacks pose a significant security threat to deep neural networks (DNNs), due to their strong performance on unseen models in real-world black-box scenarios. Building on this, feature importance-based attacks further improve the transferability of adversarial examples by effectively suppressing model-specific feature patterns. However, existing methods primarily focus on single-granularity patches and single-stage training, leading to suboptimal solutions. To address these limitations, we propose a general multi-stage optimization framework based on Semantics-aware Multi-granularity Patchout, dubbed SMP-Attack. Compared to the non-deformable/regular patch definition, we incorporate multi-granularity into the generation process of deformable/irregular patches, thereby enhancing the quality of the computed aggregate gradient. In contrast to conventional joint optimization of multi-layer losses, we introduce an effective multi-stage training strategy that systematically explores significant model-agnostic features from shallow to intermediate layers. Employing the ImageNet dataset, we conduct extensive experiments on undefended/defended CNNs and ViTs, which unequivocally demonstrate the superior performance of our proposed SMP attack over current state-of-the-art methods in black-box scenarios. Furthermore, we assess the compatibility of our multi-stage optimization, which supersedes single-stage training employed in existing feature-based methods, culminating in substantial performance improvement.
Poster
Enming Zhang · Yuzhe Li · Yuliang Liu · Yingying Zhu · Xiang Bai

[ Exhibit Hall I ]

Abstract
Online education has become widespread in universities and educational institutions worldwide. Lecture slides, a fundamental component of online education, contain a wealth of information, playing a crucial role in learning. However, previous works have not yet paid sufficient attention to understanding lecture slides, reflected in the absence of large-scale datasets and comprehensive understanding tasks. To facilitate research on lecture slide understanding, we establish LecSlides-370K, which consists of 25,542 lectures with 370,078 slides across 15 areas. We also introduce two comprehensive tasks, Lecture Summary and Lecture Question Answering (QA), providing different perspectives on slide understanding. Furthermore, complex and flexible text relations can hinder the understanding of the internal logic of slides. To address this challenge, we propose a novel method, named SlideParser, which includes an auxiliary branch to predict text relations within slides and enhance attention between related texts, thereby improving slide understanding. With extensive experiments, we show the superiority of our proposed method on both LecSlides-370K and SlideVQA. Dataset and code will be released soon.
Poster
Shaokui Wei · Jiayin Liu · Hongyuan Zha

[ Exhibit Hall I ]

Abstract
Backdoor attacks undermine the integrity of machine learning models by allowing attackers to manipulate predictions using poisoned training data. Such attacks lead to targeted misclassification when specific triggers are present, while the model behaves normally under other conditions. This paper considers a post-training backdoor defense task, aiming to detoxify the backdoors in pre-trained models. We begin by analyzing the underlying issues of vanilla fine-tuning and observe that it is often trapped in regions with low loss for both clean and poisoned samples. Motivated by such observations, we propose Distance-Driven Detoxification (D3), an innovative approach that reformulates backdoor defense as a constrained optimization problem. Specifically, D3 promotes the model's departure from the vicinity of its initial weights, effectively reducing the influence of backdoors. Extensive experiments on state-of-the-art (SOTA) backdoor attacks across various model architectures and datasets demonstrate that D3 not only matches but often surpasses the performance of existing SOTA post-training defense techniques.
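The "move away from the initial weights" idea can be illustrated with ordinary fine-tuning on a small clean set plus a hinge penalty on the L2 distance to the initial (possibly backdoored) parameters. This is a simplified sketch; the paper formulates and solves a constrained optimization problem rather than adding a fixed penalty, and `model`, `clean_loader`, and the hyperparameters are assumptions.

```python
import torch
import torch.nn.functional as F

def distance_hinge(model, init_params, radius=1.0):
    """Positive penalty while the weights stay within an L2 ball around the initial model."""
    dist_sq = sum(((p - p0) ** 2).sum() for p, p0 in zip(model.parameters(), init_params))
    return F.relu(radius - dist_sq.sqrt())

def detox_finetune(model, clean_loader, optimizer, lam=1.0, radius=1.0, epochs=1):
    init_params = [p.detach().clone() for p in model.parameters()]   # snapshot of the suspect model
    for _ in range(epochs):
        for x, y in clean_loader:
            loss = F.cross_entropy(model(x), y) + lam * distance_hinge(model, init_params, radius)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```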
Poster
Ziming Yu · Pan Zhou · Sike Wang · Jia Li · Mi Tian · Hua Huang

[ Exhibit Hall I ]

Abstract
Fine-tuning Large Language Models (LLMs) has proven effective for a variety of downstream tasks. However, as LLMs grow in size, the memory demands for backpropagation become increasingly prohibitive. Zeroth-order (ZO) optimization methods offer a memory-efficient alternative by using forward passes to estimate gradients, but the variance of gradient estimates typically scales linearly with the model's parameter dimension—a significant issue for LLMs. In this paper, we propose the random Subspace Zeroth-order (SubZero) optimization to address the challenges posed by LLMs' high dimensionality. We introduce a low-rank perturbation tailored for LLMs that significantly reduces memory consumption while improving performance. Additionally, we prove that our gradient estimation closely approximates the backpropagation gradient, exhibits lower variance than traditional ZO methods, and ensures convergence when combined with SGD. Experimental results show that SubZero enhances fine-tuning performance and achieves faster convergence compared to standard ZO approaches like MeZO across various language modeling tasks. The source code is in the supplementary and will be publicly released.
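A minimal sketch of a two-point zeroth-order estimate restricted to a random low-rank subspace, the general idea behind subspace ZO methods. The exact perturbation construction, normalization, and scheduling in SubZero differ; `closure` is any function that runs a forward pass and returns the scalar loss.

```python
import torch

def zo_subspace_step(param, closure, rank=4, mu=1e-3, lr=1e-4):
    """One zeroth-order step on a 2D weight matrix using a random low-rank direction.

    param: (m, n) weight tensor; closure(): forward pass returning the scalar loss.
    """
    U = torch.randn(param.shape[0], rank, device=param.device)
    V = torch.randn(param.shape[1], rank, device=param.device)
    Z = (U @ V.t()) / rank ** 0.5            # low-rank perturbation direction
    with torch.no_grad():
        param += mu * Z
        loss_plus = closure()
        param -= 2 * mu * Z
        loss_minus = closure()
        param += mu * Z                       # restore the original weights
        grad_est = (loss_plus - loss_minus) / (2 * mu) * Z   # directional estimate
        param -= lr * grad_est                # SGD-style update with the ZO estimate
    return loss_plus
```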
Poster
Wenlong Luo · Shizhou Zhang · De Cheng · Yinghui Xing · Guoqiang Liang · PENG WANG · Yanning Zhang

[ Exhibit Hall I ]

Abstract
Incremental object detection (IOD) is crucial for enabling AI systems to continuously learn new object classes over time while retaining knowledge of previously learned categories, allowing models to adapt to dynamic environments without forgetting prior information. Existing IOD methods primarily employ knowledge distillation to mitigate catastrophic forgetting, yet these approaches overlook class overlap issues, often resulting in suboptimal performance. In this paper, we propose a novel framework for IOD that leverages a decoupled gradient alignment technique on top of a specially proposed pseudo-labeling strategy. Our method employs a Gaussian Mixture Model to accurately estimate pseudo-labels of previously learned objects in current training images, effectively functioning as a knowledge-replay mechanism. This strategy reinforces prior knowledge retention and prevents the misclassification of unannotated foreground objects from earlier classes as background. Furthermore, we introduce an adaptive gradient decomposition and alignment method to maintain model stability while facilitating positive knowledge transfer. By aligning gradients from both old and new classes, our approach preserves previously learned knowledge while enhancing plasticity for new tasks. Extensive experiments on two IOD benchmarks demonstrate the effectiveness of the proposed method, achieving superior performance compared to state-of-the-art methods.
Poster
Jinglun Li · Kaixun Jiang · Zhaoyu Chen · Bo Lin · Yao Tang · Weifeng Ge · Wenqiang Zhang

[ Exhibit Hall I ]

Abstract
Pre-trained vision-language models have exhibited remarkable abilities in detecting out-of-distribution (OOD) samples. However, some challenging OOD samples, which lie close to in-distribution (InD) data in image feature space, can still lead to misclassification. The emergence of foundation models like diffusion models and multimodal large language models (MLLMs) offers a potential solution to this issue. In this work, we propose SynOOD, a novel approach that harnesses foundation models to generate synthetic, challenging OOD data for fine-tuning CLIP models, thereby enhancing boundary-level discrimination between InD and OOD samples. Our method uses an iterative in-painting process guided by contextual prompts from MLLMs to produce nuanced, boundary-aligned OOD samples. These samples are refined through noise adjustments based on gradients from OOD scores like the energy score, effectively sampling from the InD/OOD boundary. With these carefully synthesized images, we fine-tune the CLIP image encoder and negative label features derived from the text encoder to strengthen connections between near-boundary OOD samples and a set of negative labels. Finally, SynOOD achieves state-of-the-art performance on the large-scale ImageNet benchmark, with minimal increases in parameters and runtime. Our approach significantly surpasses existing methods, improving AUROC by 2.80% and reducing FPR95 by 11.13%. The code for SynOOD will be made …
Poster
Seunghun Lee · Jiwan Seo · Kiljoon Han · Minwoo Choi · Sunghoon Im

[ Exhibit Hall I ]

Abstract
In this paper, we introduce the Context-Aware Video Instance Segmentation (CAVIS), a novel framework designed to enhance instance association by integrating contextual information adjacent to each object. To efficiently extract and leverage this information, we propose the Context-Aware Instance Tracker (CAIT), which merges contextual data surrounding the instances with the core instance features to improve tracking accuracy. Additionally, we design the Prototypical Cross-frame Contrastive (PCC) loss, which ensures consistency in object-level features across frames, thereby significantly enhancing instance matching accuracy. CAVIS demonstrates superior performance over state-of-the-art methods on all benchmark datasets in video instance segmentation (VIS) and video panoptic segmentation (VPS). Notably, our method excels on the OVIS dataset, which is known for its particularly challenging videos.
Poster
Haochen Han · Alex Jinpeng Wang · Peijun Ye · Fangming Liu

[ Exhibit Hall I ]

Abstract
The data appetite for Vision-Language Models (VLMs) has continuously scaled up from the early millions to billions today, which faces an untenable trade-off with data quality and inevitably introduces Noisy Correspondence (NC) samples. Undoubtedly, such semantically unrelated data significantly impairs the performance of VLMs. Previous efforts mainly address this challenge by estimating refined alignment for more precise guidance. However, such resource-intensive pipelines that train VLMs from scratch struggle to meet realistic data demands. In this paper, we present a brand new perspective that seeks to directly eliminate the harmful effects of NC in pre-trained VLMs. Specifically, we propose NCU, a Noisy Correspondence Unlearning fine-tuning framework that efficiently enhances VLMs' robustness by forgetting learned noisy knowledge. The key to NCU is learning the hardest negative information, which can provide explicit unlearning direction for both false positives and false negatives. Such twin goals unlearning process can be formalized into one unified optimal transport objective for fast fine-tuning. We validate our approach with the prevailing CLIP model over various downstream tasks. Remarkably, NCU surpasses the robust pre-trained method on zero-shot transfer while with lower computational overhead. The code will be released upon acceptance.
Poster
Brian Isaac-Medina · Mauricio Che · Yona Falinie A. Gaus · Samet Akcay · Toby Breckon

[ Exhibit Hall I ]

Abstract
Modern machine learning models, which excel on computer vision tasks such as classification and object detection, are often overconfident in their predictions for Out-of-Distribution (OOD) examples, resulting in unpredictable behaviour for open-set environments. Recent works have demonstrated that the free energy score is an effective measure of uncertainty for OOD detection given its close relationship to the data distribution. However, despite free energy-based methods representing a significant empirical advance in OOD detection, our theoretical analysis reveals previously unexplored and inherent vulnerabilities within the free energy score formulation such that in-distribution and OOD instances can have distinct feature representations yet identical free energy scores. This phenomenon occurs when the vector direction representing the feature space difference between the in-distribution and OOD sample lies within the null space of the last layer of a neural-based classifier. To mitigate these issues, we explore lower-dimensional feature spaces to reduce the null space footprint and introduce novel regularisation to maximize the least singular value of the final linear layer, hence enhancing inter-sample free energy separation. We refer to these techniques as Free Energy Vulnerability Elimination for Robust Out-of-Distribution Detection (FEVER-OOD). Our experiments show that FEVER-OOD techniques achieve state-of-the-art OOD detection on ImageNet-100, …
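For reference, the free energy score and a regularizer that enlarges the smallest singular value of the final linear layer (and hence shrinks its null space) can be written in a few lines; the exact training objective used by FEVER-OOD may differ.

```python
import torch

def free_energy_score(logits: torch.Tensor, T: float = 1.0) -> torch.Tensor:
    """E(x) = -T * logsumexp(f(x) / T); lower energy indicates more in-distribution."""
    return -T * torch.logsumexp(logits / T, dim=-1)

def min_singular_value_penalty(last_linear: torch.nn.Linear) -> torch.Tensor:
    """Encourage a large smallest singular value of the final linear layer, shrinking
    the null-space directions along which feature differences leave the energy unchanged."""
    sigma = torch.linalg.svdvals(last_linear.weight)
    return -sigma.min()

# e.g. total_loss = classification_loss + lam * min_singular_value_penalty(model.fc)
```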
Poster
Liuchi Xu · Kang Liu · Jinshuai Liu · Lu Wang · Lisheng XU · Jun Cheng

[ Exhibit Hall I ]

Abstract
State-of-the-art logit distillation methods exhibit versatility, simplicity, and efficiency. Despite the advances, existing studies have yet to delve thoroughly into fine-grained relationships within logit knowledge. In this paper, we propose Local Dense Relational Logit Distillation (LDRLD), a novel method that captures inter-class relationships through recursively decoupling and recombining logit information, thereby providing more detailed and clearer insights for student learning. To further optimize performance, we introduce an Adaptive Decay Weight (ADW) strategy, which can dynamically adjust the weights for critical category pairs using Inverse Rank Weighting (IRW) and Exponential Rank Decay (ERD). Specifically, IRW assigns weights inversely proportional to the rank differences between pairs, while ERD adaptively controls weight decay based on total ranking scores of category pairs. Furthermore, after the recursive decoupling, we distill the remaining non-target knowledge to ensure knowledge completeness and enhance performance. Ultimately, our method improves the student's performance by transferring fine-grained knowledge and emphasizing the most critical relationships. Extensive experiments on datasets such as CIFAR-100, ImageNet-1K, and Tiny-ImageNet demonstrate that our method compares favorably with state-of-the-art logit-based knowledge distillation methods. The code will be made publicly available.
Poster
Yooshin Cho · Hanbyel Cho · Janghyeon Lee · HyeongGwon Hong · Jaesung Ahn · Junmo Kim

[ Exhibit Hall I ]

Abstract
As the use of artificial intelligence rapidly increases, the development of trustworthy artificial intelligence has become important. However, recent studies have shown that deep neural networks are susceptible to learn spurious correlations present in datasets. To improve fairness, we propose a simple yet effective framework called controllable feature whitening. We quantify the linear correlation between the target and bias features by the covariance matrix, and eliminate it through the whitening module. Our results systematically demonstrate that removing the linear correlations between features that are passed to the last linear classifier significantly improves fairness. A particular advantage of the proposed method is that it does not require regularization terms or adversarial learning, which often leads to unstable optimization in practice. Furthermore, we show that two fairness criteria, demographic parity and equalized odds, can be effectively handled by whitening with the re-weighted covariance matrix. Consequently, our method optimizes the trade-off between the utility and fairness of algorithms by adjusting the re-weighting coefficient. Finally, we validate that our method outperforms existing approaches on four benchmark datasets: Corrupted CIFAR-10, Biased FFHQ, WaterBirds, and Celeb-A.
Poster
Yu-Lin Tsai · Yizhe Li · Zekai Chen · Po-Yu Chen · Francois Buet-Golfouse · Chia-Mu Yu · Xuebin Ren

[ Exhibit Hall I ]

Abstract
The integration of Differential Privacy (DP) with diffusion models (DMs) presents a promising yet challenging frontier, particularly due to the substantial memorization capabilities of DMs that pose significant privacy risks. Differential privacy offers a rigorous framework for safeguarding individual data points during model training, with Differentially Private Stochastic Gradient Descent (DP-SGD) being a prominent implementation. The diffusion process decomposes image generation into iterative steps, theoretically aligning well with DP's incremental noise addition. Despite the natural fit, the unique architecture of DMs necessitates tailored approaches to effectively balance the privacy-utility trade-off. Recent developments in this field have highlighted the potential for generating high-quality synthetic data by pre-training on public data (i.e., ImageNet) and fine-tuning on private data; however, there is a pronounced gap in research on optimizing the trade-offs involved in DP settings, particularly concerning parameter efficiency and model scalability. Our work addresses this by proposing a parameter-efficient fine-tuning strategy optimized for private diffusion models, which minimizes the number of trainable parameters to enhance the privacy-utility trade-off. We empirically demonstrate that our method achieves state-of-the-art performance in DP synthesis, significantly surpassing previous benchmarks on widely studied datasets (e.g., with only 0.47M trainable parameters, achieving a more than 35% improvement over the previous state-of-the-art …
Poster
Maximilian Hoefler · Karsten Mueller · Wojciech Samek

[ Exhibit Hall I ]

Abstract
Explainable AI (XAI) methods have demonstrated significant success in recent years at identifying relevant features in input data that drive deep learning model decisions, enhancing interpretability for users. However, the potential of XAI beyond providing model transparency has remained largely unexplored in adjacent machine learning domains. In this paper, we show for the first time how XAI can be utilized in the context of federated learning. Specifically, while federated learning enables collaborative model training without raw data sharing, it suffers from performance degradation when client data distributions exhibit statistical heterogeneity. We introduce FedXDS (Federated Learning via XAI-guided Data Sharing), the first approach to utilize feature attribution techniques to identify precisely which data elements should be selectively shared between clients to mitigate heterogeneity. By employing propagation-based attribution, our method identifies task-relevant features through a single backward pass, enabling selective data sharing that aligns client contributions. To protect sensitive information, we incorporate metric differential privacy techniques that provide formal privacy guarantees while preserving utility. Experimental results demonstrate that our approach consistently achieves higher accuracy and faster convergence compared to existing methods across varying client numbers and heterogeneity settings. We provide theoretical privacy guarantees and empirically demonstrate robustness against both membership inference and …
Poster
Fei Zhou · Peng Wang · Lei Zhang · Wei Wei · Chen Ding · Guosheng Lin · Yanning Zhang

[ Exhibit Hall I ]

Abstract
Large-scale pre-trained foundation models have demonstrated remarkable generalization capabilities across diverse computer vision tasks through fine-tuning. However, existing fine-tuning approaches often encounter challenges in extreme cross-domain few-shot learning scenarios, primarily due to the significant domain shift between pre-training data and target tasks, as well as the scarcity of annotated target samples. To mitigate this issue, we propose a novel absorption adaptation learning framework which meticulously regularizes the fine-tuning procedure of the foundation model, using an expert model with the same architecture but trained from scratch on the target data, in two aspects. On one hand, we first design a masked cross-model unidirectional reconstruction scheme, which forces the foundation model to recover the intermediate features of the expert model in a randomly masked manner. On the other hand, a decision graph association loss is developed to encourage consistency of the token similarity matrices between the two models. By doing so, the task-relevant semantic knowledge in the expert model, at both the intermediate-feature and final-decision levels, is appropriately extracted and absorbed by the foundation model during fine-tuning, thus mitigating the performance drop caused by the domain gap and limited annotation. Sufficient experiments with further observations and analyses underpin our observation and …
Poster
Justin Kay · Grant Horn · Subhransu Maji · Daniel Sheldon · Sara Beery

[ Exhibit Hall I ]

Abstract
The widespread availability of off-the-shelf machine learning models poses a challenge: which model, of the many available candidates, should be chosen for a given data analysis task? This question of model selection is traditionally answered by collecting and annotating a validation dataset---a costly and time-intensive process. We propose a method for active model selection, using predictions from candidate models to prioritize the labeling of test data points that efficiently differentiate the best candidate. Our method, CODA, performs consensus-driven active model selection by modeling relationships between classifiers, categories, and data points within a probabilistic framework. The framework uses the consensus and disagreement between models in the candidate pool to guide the label acquisition process, and Bayesian inference to update beliefs about which model is best as more information is collected. We validate our approach by curating a collection of 25 benchmark tasks capturing a range of model selection scenarios. CODA outperforms existing methods for active model selection significantly, reducing the annotation effort required to discover the best model by upwards of 50% compared to the previous state-of-the-art. We will make our code and data public.
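CODA's full probabilistic treatment is beyond a short snippet, but the consensus/disagreement intuition can be illustrated with a deliberately simplified toy acquisition rule (an assumption-laden stand-in, not the paper's algorithm): label next the test point on which the candidate models disagree the most.

```python
import numpy as np

def pick_next(pred_matrix):
    """pred_matrix: (num_models, num_points) array of predicted class ids."""
    disagreement = []
    for col in pred_matrix.T:
        _, counts = np.unique(col, return_counts=True)
        disagreement.append(1.0 - counts.max() / len(col))   # 0 means full consensus
    return int(np.argmax(disagreement))

preds = np.array([[0, 1, 2, 1],
                  [0, 1, 0, 2],
                  [0, 2, 1, 2]])
print(pick_next(preds))   # index of the most contested point (here: 2)
```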
Poster
Mincheol Park · Cheonjun Park · Seungseop Lim · Mijin Koo · Hyunwuk Lee · Won Woo Ro · Suhyun Kim

[ Exhibit Hall I ]

Abstract
Deep neural networks are widely used in various computer vision tasks, but their vulnerability to adversarial perturbations remains a significant challenge for reliable decision-making. Adversarial purification, a test-time defense strategy, has shown potential in countering these threats by removing noise through diffusion models. This plug-and-play method, using off-the-shelf models, appears highly effective. However, the purified data from diffusion often deviates more from the original data than the adversarial examples, leading to missing critical information and causing misclassification. In this study, we propose that upsampling with Super-Resolution (SR), followed by downsampling, can also aid in eliminating adversarial noise, similar to the noise addition and removal process in diffusion models. While SR alone is not as effective as the diffusion process, it better restores the original features typically associated with the early layers of networks. By combining SR, which initially mitigates damage to early-layer information from adversarial attacks, with diffusion, we observe a synergistic effect, leading to enhanced performance over diffusion models alone. Our comprehensive evaluations demonstrate that this combined approach, PuriFlow, significantly improves accuracy and robustness, working synergistically with state-of-the-art methods.
Poster
Seung-Wook Kim · Seongyeol Kim · Jiah Kim · Seowon Ji · Se-Ho Lee

[ Exhibit Hall I ]

Abstract
Federated learning (FL) often suffers from performance degradation due to key challenges such as data heterogeneity and communication constraints. To address these limitations, we present a novel FL framework called FedWSQ, which integrates weight standardization (WS) and the proposed distribution-aware non-uniform quantization (DANUQ). WS enhances FL performance by filtering out biased components in local updates during training, thereby improving the robustness of the model against data heterogeneity and unstable client participation. In addition, DANUQ minimizes quantization errors by leveraging the statistical properties of local model updates. As a result, FedWSQ significantly reduces communication overhead while maintaining superior model accuracy. Extensive experiments on FL benchmark datasets demonstrate that FedWSQ consistently outperforms existing FL methods across various challenging FL settings, including extreme data heterogeneity and ultra-low-bit communication scenarios.
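Weight standardization itself is a well-known operation; a minimal sketch of standardizing a layer's weights (illustrative of the WS component only, not of DANUQ or the full FedWSQ pipeline) is:

```python
import torch

def weight_standardize(w, eps=1e-6):
    # Zero-mean, unit-variance weights per output neuron (row-wise).
    mean = w.mean(dim=1, keepdim=True)
    std = w.std(dim=1, keepdim=True)
    return (w - mean) / (std + eps)

w = torch.randn(128, 64) * 3.0 + 0.5
ws = weight_standardize(w)
print(ws.mean(dim=1).abs().max(), ws.std(dim=1).mean())   # ≈ 0 and ≈ 1 per row
```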
Poster
Yan Zhuang · Minhao Liu · Wei Bai · Yanru Zhang · Xiaoyue Zhang · Jiawen Deng · Fuji Ren

[ Exhibit Hall I ]

Abstract
Multimodal Sentiment Analysis (MSA) enhances emotion recognition by integrating information from multiple modalities. However, multimodal learning with missing modalities suffers from representation inconsistency and optimization instability, leading to suboptimal performance. In this paper, we introduce Correlation-Aware and Modalities-Aware Distillation (CMAD), a unified framework designed for MSA under varying missing-modality conditions. Specifically, CMAD comprises two key components: (1) Correlation-Aware Feature Distillation (CAFD), which enforces multi-level representation alignment by preserving both feature similarities and high-order correlation structures between teacher and student models, and (2) Modality-Aware Regularization (MAR), which employs an adaptive weighting strategy guided by modality difficulty, enabling a curriculum learning paradigm to stabilize the training process. Extensive evaluations on five datasets show that CMAD consistently outperforms existing methods, achieving average performance improvements of 1.0% on MOSEI, 4.4% on IEMOCAP, 1.9% on MUStARD, 0.5% on UR-FUNNY, and 1.9% on CHERMA.
Poster
Xianfu Cheng · Wei Zhang · Shiwei Zhang · Jian Yang · Xiangyuan Guan · Xianjie Wu · Xiang Li · Ge Zhang · Jiaheng Liu · Yuying Mai · Yutao Zeng · Zhoufutu Wen · JinKe JinKe · Baorui Wang · Weixiao Zhou · Lu Yunhong · Hangyuan Ji · Tongliang Li · Wenhao Huang · Zhoujun Li

[ Exhibit Hall I ]

Abstract
The increasing application of multi-modal large language models (MLLMs) across various sectors has spotlighted the importance of their output reliability and accuracy, particularly their ability to produce content grounded in factual information (e.g., common and domain-specific knowledge). In this work, we introduce SimpleVQA, the first comprehensive multi-modal benchmark to evaluate the ability of MLLMs to answer short natural-language questions factually. SimpleVQA is characterized by 7 key features: it is bilingual, covers multiple tasks and multiple scenarios, ensures high-quality and challenging queries, maintains static and timeless reference answers, and is straightforward to evaluate. Our approach involves categorizing visual question-answering items into 9 different tasks around objective events or common knowledge and situating these within 9 scenario domains. Rigorous quality control processes are implemented to guarantee high-quality, concise, and clear answers, facilitating evaluation with minimal variance via an LLM-as-a-judge scoring system. Using SimpleVQA, we perform a comprehensive assessment of 18 leading MLLMs and 8 text-only LLMs, delving into their image comprehension and text generation abilities by identifying and analyzing error cases.
Poster
Wenqi Zhang · Hang Zhang · Xin Li · Jiashuo Sun · Yongliang Shen · Weiming Lu · Deli Zhao · Yueting Zhuang · Lidong Bing

[ Exhibit Hall I ]

Abstract
Compared to image-text pair data, interleaved corpora enable Vision-Language Models (VLMs) to understand the world more naturally, like humans. However, existing datasets of this kind are crawled from webpages, facing challenges like low knowledge density, loose image-text relations, and poor logical coherence between images. On the other hand, the internet hosts vast instructional videos (e.g., online geometry courses) that are widely used by humans to learn foundational subjects, yet these valuable resources remain underexplored in VLM training. In this paper, we introduce a high-quality **multimodal textbook** corpus with richer foundational knowledge for VLM pretraining. It collects over 2.5 years of instructional videos, totaling 22,000 class hours. We first use an LLM-proposed taxonomy to systematically gather instructional videos. Then we progressively extract and refine visual (keyframes), audio (ASR), and textual knowledge (OCR) from the videos, and organize them as an image-text interleaved corpus based on temporal order. Compared to its counterparts, our video-centric textbook offers more coherent context, richer knowledge, and better image-text alignment. Experiments demonstrate its superb pretraining performance, particularly in knowledge- and reasoning-intensive tasks like ScienceQA and MathVista. Moreover, VLMs pre-trained on our textbook exhibit outstanding interleaved context awareness, leveraging visual and textual cues in their few-shot context for task solving.
Poster
Dahye Kim · Xavier Thomas · Deepti Ghadiyaram

[ Exhibit Hall I ]

Abstract
We study *how* rich visual semantic information is represented within various layers and denoising timesteps of different diffusion architectures. We uncover monosemantic interpretable features by leveraging k-sparse autoencoders (k-SAE). We substantiate our mechanistic interpretations via transfer learning using light-weight classifiers on off-the-shelf diffusion models' features. On 4 datasets, we demonstrate the effectiveness of diffusion features for representation learning. We provide an in-depth analysis of how different diffusion architectures, pre-training datasets, and language model conditioning impact visual representation granularity, inductive biases, and transfer learning capabilities. Our work is a critical step towards deepening the interpretability of black-box diffusion models. Code and visualizations available at: https://github.com/revelio-diffusion/revelio
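A k-sparse autoencoder of the kind the abstract leverages can be sketched as follows (a generic k-SAE under assumed layer sizes, not the authors' exact configuration): only the k largest hidden activations are kept per sample, which is what yields the sparse, interpretable units.

```python
import torch
import torch.nn as nn

class KSparseAutoencoder(nn.Module):
    def __init__(self, d_in=1024, d_hidden=8192, k=32):
        super().__init__()
        self.enc, self.dec, self.k = nn.Linear(d_in, d_hidden), nn.Linear(d_hidden, d_in), k

    def forward(self, x):
        h = torch.relu(self.enc(x))
        topk = torch.topk(h, self.k, dim=-1)
        mask = torch.zeros_like(h).scatter_(-1, topk.indices, 1.0)
        h_sparse = h * mask                      # keep only the k largest activations
        return self.dec(h_sparse), h_sparse

sae = KSparseAutoencoder()
feats = torch.randn(4, 1024)                     # stand-in for extracted diffusion features
recon, codes = sae(feats)
print(nn.functional.mse_loss(recon, feats).item(), (codes != 0).sum(dim=-1))
```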
Poster
Siyu Jiao · Haoye Dong · Yuyang Yin · ZEQUN JIE · Yinlong Qian · Yao Zhao · Humphrey Shi · Yunchao Wei

[ Exhibit Hall I ]

Abstract
Recent works in 3D representation learning and multimodal pre-training have made remarkable progress. However, typical multimodal 3D models are only capable of handling point clouds. Compared to the emerging 3D representation technique, 3D Gaussian Splatting (3DGS), the spatially sparse point cloud cannot depict the texture information of 3D objects, resulting in inferior reconstruction capabilities. This limitation constrains the potential of point cloud-based 3D multimodal representation learning. In this paper, we present CLIP-GS, a novel multimodal representation learning framework grounded in 3DGS. We introduce the GS Tokenizer to generate serialized Gaussian tokens, which are then processed through a series of transformer layers pre-initialized with weights from point cloud models, resulting in the 3DGS embeddings. CLIP-GS leverages a contrastive loss between 3DGS and the visual-text embeddings of CLIP, and we introduce an image voting loss to guide the directionality and convergence of gradient optimization. Furthermore, we develop an efficient way to generate triplets of 3DGS, images, and text, facilitating CLIP-GS in learning unified multimodal representations. Leveraging the well-aligned multimodal representations, CLIP-GS demonstrates versatility and outperforms point cloud-based models on various 3D tasks, including multimodal retrieval, zero-shot, and few-shot classification.
Poster
Jiaxin Ai · Pengfei Zhou · xu Pan · Ming Li · Fanrui Zhang · Zizhen Li · Jianwen Sun · Yukang Feng · Baojin Huang · Zhongyuan Wang · Kaipeng Zhang

[ Exhibit Hall I ]

Abstract
As multi-modal large language models (MLLMs) frequently exhibit errors when solving scientific problems, evaluating the validity of their reasoning processes is critical for ensuring reliability and uncovering fine-grained model weaknesses. Since human evaluation is laborious and costly, prompting MLLMs as automated process judges has become a common practice. However, the reliability of these model-based judges remains uncertain. To address this, we introduce ProJudgeBench, the first comprehensive benchmark specifically designed for evaluating the abilities of MLLM-based process judges. ProJudgeBench comprises 2,400 test cases and 50,118 step-level labels, spanning four scientific disciplines with diverse difficulty levels and multi-modal content. In ProJudgeBench, each step is meticulously annotated by human experts for correctness, error type, and explanation, enabling a systematic evaluation of judges' capabilities to detect, classify, and diagnose errors. Evaluation on ProJudgeBench reveals a significant performance gap between open-source and proprietary models. To bridge this gap, we further propose ProJudge-173k, a large-scale instruction-tuning dataset, and a Dynamic Dual-Phase fine-tuning strategy that encourages models to explicitly reason through problem-solving before assessing solutions. Both contributions significantly enhance the process evaluation capabilities of open-source models. All the resources will be released to foster future research on reliable multi-modal process evaluation.
Poster
Yanyun Wang · Li Liu

[ Exhibit Hall I ]

Abstract
Adversarial Training (AT) is one of the most effective methods to train robust Deep Neural Networks (DNNs). However, AT creates an inherent trade-off between clean accuracy and adversarial robustness, which is commonly attributed to the more complicated decision boundary caused by the insufficient learning of hard adversarial samples. In this work, we reveal a counterintuitive fact for the first time: **from the perspective of perception consistency, hard adversarial samples that can still attack the robust model after AT are already learned better than those successfully defended**. Thus, different from previous views, we argue that it is rather the over-sufficient learning of hard adversarial samples that degrades the decision boundary and contributes to the trade-off problem. Specifically, the excessive pursuit of perception consistency would force the model to view the perturbations as noise and ignore the information within them, which should have been utilized to induce a smoother perception transition towards the decision boundary to support its establishment at an appropriate location. In response, we define a new AT objective named **Robust Perception**, encouraging the model perception to change smoothly with input perturbations, based on which we propose a novel **R**obust **P**erception **A**dversarial **T**raining (**RPAT**) method, effectively mitigating the current accuracy-robustness …
Poster
Qi Wang · Zeyu Zhang · Dong Wang · Di Gai · Xin Xiong · Jiyang Xu · Ruihua Zhou

[ Exhibit Hall I ]

Abstract
Large-scale pre-training technology has achieved remarkable performance in diversified object re-identification (Re-ID) downstream tasks. Nevertheless, to the best of our knowledge, a pre-training model designed specifically for vehicle Re-ID, which focuses on tackling the challenge of multi-view variations, has not been fully investigated. In this paper, we first leverage a diffusion model to build a large-scale vehicle Re-ID benchmark dataset, dubbed “DiffVERI”, containing over 1700K images with abundant multi-view annotations. On top of this dataset, we further present VehicleMAE, a novel masked image modeling pre-training paradigm that learns view-invariant representations by performing mutual-distillation in a self-supervised manner. To be specific, the pipeline of VehicleMAE comprises two core modules, i.e., view-asymmetry masked image modeling (VMIM) and past-to-present mutual-distillation (PPMD). Technically, VMIM consists of two homogeneous masked autoencoders (MAE) that simultaneously reconstruct the RGB pixels and multi-view semantic information of the specific vehicle body region via paired asymmetric mask sampling strategies. To progressively distill the knowledge of the model itself, PPMD treats the two MAEs in the current epoch and the previous one as the student models and the teacher models, respectively, and leverages the knowledge learned by the current student and the historical teacher for mutual feature-level distillation. Extensive experimental results have verified that …
Poster
Anh Thai · Kyle Genova · Songyou Peng · Leonidas Guibas · Thomas Funkhouser

[ Exhibit Hall I ]

Abstract
Language-guided 3D scene understanding is important for advancing applications in robotics, AR/VR, and human-computer interaction, enabling models to comprehend and interact with 3D environments through natural language. While 2D vision-language models (VLMs) have achieved remarkable success in 2D VQA tasks, progress in the 3D domain has been significantly slower due to the complexity of 3D data and the high cost of manual annotations. In this work, we introduce SplatTalk, a novel method that uses a generalizable 3D Gaussian Splatting (3DGS) framework to produce 3D tokens suitable for direct input into a pretrained LLM, enabling effective zero-shot 3D visual question answering (3D VQA) for scenes with only posed images. During experiments on multiple benchmarks, our approach outperforms both 3D models trained specifically for the task and previous 2D-LMM-based models utilizing only images (our setting), while achieving competitive performance with state-of-the-art 3D LMMs that additionally utilize 3D inputs.
Poster
Haoxuan Wang · Zhenghao Zhao · Junyi Wu · Yuzhang Shang · Gaowen Liu · Yan Yan

[ Exhibit Hall I ]

Abstract
The recent introduction of diffusion models in dataset distillation has shown promising potential in creating compact surrogate datasets for large, high-resolution target datasets, offering improved efficiency and performance over traditional bi-level/uni-level optimization methods. However, current diffusion-based dataset distillation approaches overlook the evaluation process and exhibit two critical inconsistencies in the distillation process: (1) Objective Inconsistency, where the distillation process diverges from the evaluation objective, and (2) Condition Inconsistency, leading to mismatches between generated images and their corresponding conditions. To resolve these issues, we introduce Condition-aware Optimization with Objective-guided Sampling (CaO₂), a two-stage diffusion-based framework that aligns the distillation process with the evaluation objective. The first stage employs a probability-informed sample selection pipeline, while the second stage refines the corresponding latent representations to improve conditional likelihood. CaO₂ achieves state-of-the-art performance on ImageNet and its subsets, surpassing the best-performing baselines by an average of 2.3% accuracy.
Poster
Bowen Wang · Zhouqiang Jiang · Yasuaki Susumu · Shotaro Miwa · Tianwei Chen · Yuta Nakashima

[ Exhibit Hall I ]

Abstract
The real value of knowledge lies not just in its accumulation, but in its potential to be harnessed effectively to conquer the unknown. Although recent multimodal large language models (MLLMs) exhibit impressive multimodal capabilities, they often fail in rarely encountered domain-specific tasks due to limited relevant knowledge. To explore this, we adopt visual game cognition as a testbed and select "Monster Hunter: World" as the target to construct a multimodal knowledge graph (MH-MMKG), which incorporates multiple modalities and intricate entity relations. We also design a series of challenging queries based on MH-MMKG to evaluate the models’ ability for complex knowledge retrieval and reasoning. Furthermore, we propose a multi-agent retriever that enables a model to autonomously search for relevant knowledge without additional training. Experimental results show that our approach significantly enhances the performance of MLLMs, providing a new perspective on multimodal knowledge-augmented reasoning and laying a solid foundation for future research.
Poster
Xuelin Zhu · Jian liu · Jiuxin Cao · Bing WANG

[ Exhibit Hall I ]

Abstract
Mamba, a selective state-space model, has recently been widely applied to various visual tasks due to its powerful capability to capture long-range dependencies. Although promising performance has been achieved on image classification, the effectiveness of Mamba on multi-label image classification has not been explored yet. In this work, we develop a novel MambaML framework for multi-label image classification, which incorporates a Mamba-based decoder to aggregate visual information from image features into label embeddings, yielding label-specific visual representations for classification. Building upon this, MambaML further employs Mamba to model both the image feature sequence and the label embedding sequence. In this way, MambaML is capable of exploring the spatial relationships of image features, the semantic dependencies between label embeddings, as well as their cross-correlations, thereby resulting in robust label-specific visual representations and training binary classifiers for high-performance multi-label image classification. Extensive experimental results demonstrate that our MambaML achieves state-of-the-art performance on multiple benchmarks in the multi-label image classification task.
Poster
Jing Wu · Mehrtash Harandi

[ Exhibit Hall I ]

Abstract
Machine Unlearning (MU) aims to selectively erase harmful behaviors from models while retaining the overall utility of the model. As a multi-task learning problem, MU involves balancing objectives related to forgetting specific concepts/data and preserving general performance. A naive integration of these forgetting and preserving objectives can lead to gradient conflicts and dominance, impeding MU algorithms from reaching optimal solutions. To address the gradient conflict and dominance issue, we reformulate MU as a two-player cooperative game, where the two players, namely, the forgetting player and the preservation player, contribute via their gradient proposals to maximize their overall gain and balance their contributions. To this end, inspired by Nash bargaining theory, we derive a closed-form solution to guide the model toward the Pareto stationary point. Our formulation of MU guarantees an equilibrium solution, where any deviation from the final state would lead to a reduction in the overall objectives for both players, ensuring optimality in each objective. We evaluate our algorithm's effectiveness on a diverse set of tasks across image classification and image generation. Extensive experiments with ResNet, the vision-language model CLIP, and text-to-image diffusion models demonstrate that our method outperforms state-of-the-art MU algorithms, achieving a better trade-off between forgetting and preserving. Our results also highlight improvements …
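The paper's Nash-bargaining closed form is its own contribution; purely to illustrate how a forgetting gradient and a preservation gradient can be balanced toward a Pareto-stationary direction, here is a standard two-task min-norm combination (MGDA-style), which is not the method proposed above.

```python
import torch

def min_norm_combination(g_forget, g_preserve):
    # alpha minimizing ||alpha*g_f + (1-alpha)*g_p||^2 over [0, 1].
    diff = g_preserve - g_forget
    alpha = (torch.dot(diff, g_preserve) / (diff.norm() ** 2 + 1e-12)).clamp(0.0, 1.0)
    return alpha * g_forget + (1 - alpha) * g_preserve

g_f = torch.tensor([1.0, -0.5])   # toy forgetting-player gradient
g_p = torch.tensor([-0.2, 1.0])   # toy preservation-player gradient
g = min_norm_combination(g_f, g_p)
print(g, torch.dot(g, g_f), torch.dot(g, g_p))   # both dot products non-negative here
```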
Poster
Sixian Chan · Zedong Li · Xiaoqin Zhang · Wenhao Li · Shijian Lu · Chunhua Shen

[ Exhibit Hall I ]

Abstract
Multi-modal object tracking has emerged as a significant research focus in computer vision due to its robustness in complex environments, such as exposure variations, blur, and occlusions. Although existing studies integrate supplementary modal information into pre-trained RGB trackers through visual prompt mechanisms, they exhibit a critical limitation: they inherently prioritize RGB information as the dominant modality, thereby underutilizing the complementary information of the alternative modalities. To address this fundamental limitation, we present SMSTracker, an innovative tri-path score mask sigma fusion framework for multi-modal tracking, including three key modules. Firstly, we design a tri-path Score Mask Fusion (SMF) module to evaluate and quantify the reliability of each modality, allowing optimal exploitation of complementary features between modalities. Secondly, we introduce a pioneering Sigma Interaction (SGI) module to facilitate a sophisticated fusion of modal features across the three branches, representing the first application of Sigma point-based feature interaction in object tracking tasks. Furthermore, we advance a Drop Key Fine-tuning (DKF) strategy to address the inherent challenge of unequal data contribution in multi-modal learning scenarios, thereby enhancing the model's capacity for comprehensive multi-modal information processing. Finally, extensive experiments on RGB+Thermal, RGB+Depth, and RGB+Event datasets demonstrate the significant performance improvements achieved by SMSTracker over existing state-of-the-art methods. …
Poster
Wenjun Miao · Guansong Pang · Zihan Wang · Jin Zheng · Xiao Bai

[ Exhibit Hall I ]

Abstract
Recent advancements in CLIP-based out-of-distribution (OOD) detection have shown promising results via regularization on prompt tuning, leveraging background features extracted from a few in-distribution (ID) samples as proxies for OOD features. However, these methods suffer from an inherent limitation: a lack of diversity in the extracted OOD features from the few-shot ID data. To address this issue, we propose to leverage external datasets as auxiliary outlier data (i.e., pseudo OOD samples) to extract rich, diverse OOD features, with the features from not only background regions but also foreground object regions, thereby supporting more discriminative prompt tuning for OOD detection. We further introduce Auxiliary Prompt Tuning (APT), a novel framework that can be used as a plug-in module to enable existing prompt tuning-based methods to utilize the auxiliary data for more accurate OOD detection. There are two key challenges of utilizing those auxiliary data in prompt tuning, including I) foreground-background decomposition of unlabeled auxiliary data with diverse outlying objects and II) optimization of foreground OOD features. APT tackles challenge I with an adaptive logit-based Kullback–Leibler divergence method and challenge II by constructing foreground-background pairs for each foreground region to enable effective exploitation of foreground OOD features. Extensive experiments on standard and hard OOD benchmarks …
Poster
Hemanth Saratchandran · Simon Lucey

[ Exhibit Hall I ]

Abstract
Transformers have transformed modern machine learning, driving breakthroughs in computer vision, natural language processing, and robotics. At the core of their success lies the attention mechanism, which enables the modeling of global dependencies among input tokens. However, we reveal that the attention block in transformers suffers from inherent ill-conditioning, which hampers gradient-based optimization and leads to inefficient training. To address this, we develop a theoretical framework that establishes a direct relationship between the conditioning of the attention block and that of the embedded tokenized data. Building on this insight, we introduce conditioned embedded tokens, a method that systematically modifies the embedded tokens to improve the conditioning of the attention mechanism. Our analysis demonstrates that this approach significantly mitigates ill-conditioning, leading to more stable and efficient training. We validate our methodology across various transformer architectures, achieving consistent improvements in image classification, object detection, instance segmentation, and natural language processing, highlighting its broad applicability and effectiveness.
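The link between token statistics and attention conditioning can be made tangible with a toy measurement (an illustration under simplifying assumptions, not the paper's construction): an embedded-token matrix with one dominant shared component has a large Gram-matrix condition number, and removing that component shrinks it.

```python
import torch

def token_condition_number(X):
    gram = X.T @ X                                   # Gram matrix of the embedded tokens
    eig = torch.linalg.eigvalsh(gram)
    return (eig.max() / eig.min().clamp_min(1e-12)).item()

torch.manual_seed(0)
tokens = torch.randn(128, 32) + 10.0 * torch.randn(1, 32)    # shared offset -> ill-conditioned
conditioned = tokens - tokens.mean(dim=0, keepdim=True)      # toy re-conditioning step
print(token_condition_number(tokens), token_condition_number(conditioned))
```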
Poster
Tiankai Hang · Shuyang Gu · Jianmin Bao · Fangyun Wei · Dong Chen · Xin Geng · Baining Guo

[ Exhibit Hall I ]

Abstract
Diffusion models have emerged as the de facto choice for generating high-quality visual signals across various domains. However, training a single model to predict noise across various levels poses significant challenges, necessitating numerous iterations and incurring significant computational costs. Various approaches, such as loss weighting strategy design and architectural refinements, have been introduced to expedite convergence and improve model performance. In this study, we propose a novel approach to design the noise schedule for enhancing the training of diffusion models. Our key insight is that the importance sampling of the logarithm of the Signal-to-Noise ratio ($\log \text{SNR}$), theoretically equivalent to a modified noise schedule, is particularly beneficial for training efficiency when increasing the sample frequency around $\log \text{SNR}=0$. This strategic sampling allows the model to focus on the critical transition point between signal dominance and noise dominance, potentially leading to more robust and accurate predictions. We empirically demonstrate the superiority of our noise schedule over the standard cosine schedule. Furthermore, we highlight the advantages of our noise schedule design on the ImageNet benchmark, showing that the designed schedule consistently benefits different prediction targets. Our findings contribute to the ongoing efforts to optimize diffusion models, potentially paving the way for more efficient and effective training paradigms in …
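The general recipe can be sketched as follows (a hedged illustration, not the paper's exact schedule): draw log-SNR values from a distribution concentrated around zero, e.g. a Laplace with an assumed scale, then map them to variance-preserving noise levels.

```python
import torch

def sample_logsnr(batch, loc=0.0, scale=1.5):
    # Inverse-CDF sampling from Laplace(loc, scale), concentrated around log SNR = loc.
    u = torch.rand(batch) - 0.5
    return loc - scale * u.sign() * torch.log1p(-2 * u.abs())

lam = sample_logsnr(8)
alpha = torch.sigmoid(lam).sqrt()      # signal scale: alpha^2 = sigmoid(log SNR)
sigma = torch.sigmoid(-lam).sqrt()     # noise scale:  sigma^2 = 1 - alpha^2
print(lam, alpha**2 + sigma**2)        # samples cluster near 0; VP constraint alpha^2 + sigma^2 = 1
```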
Poster
Ziyue Wang · Yurui Dong · Fuwen Luo · Minyuan Ruan · Zhili Cheng · Chi Chen · Peng Li · Yang Liu

[ Exhibit Hall I ]

Abstract
The rapid advancement of Multimodal Large Language Models (MLLMs) has spurred interest in complex multimodal reasoning tasks in real-world and virtual environments, which require coordinating multiple abilities, including visual perception, visual reasoning, spatial awareness, and target deduction. However, existing evaluations primarily assess the final task completion, often degrading assessments to isolated abilities such as visual grounding and visual question answering. Less attention is given to comprehensively and quantitatively analyzing the reasoning process in multimodal environments, which is crucial for understanding model behaviors and underlying reasoning mechanisms beyond merely task success. To address this, we introduce MM-Escape, an extensible benchmark for investigating multimodal reasoning, inspired by real-world escape games. MM-Escape emphasizes intermediate model behaviors alongside final task completion. To achieve this, we develop EscapeCraft, a customizable and open environment that enables models to engage in free-form exploration for assessing multimodal reasoning. Extensive experiments show that MLLMs, regardless of scale, can successfully complete the simplest room escape tasks, with some exhibiting human-like exploration strategies. Yet, performance dramatically drops as task difficulty increases. Moreover, we observe that performance bottlenecks vary across models, revealing distinct failure modes and limitations in their multimodal reasoning abilities, such as repetitive trajectories without adaptive exploration, getting stuck in …
Poster
Parag Dutta · Mohd Ayyoob · Shalabh Bhatnagar · Ambedkar Dukkipati

[ Exhibit Hall I ]

Abstract
Representation learning lies at the core of deep reinforcement learning. While CNNs have been the default models for encoding image observations so far, modifying the encoder architecture presents challenges, particularly due to the necessity of identifying a new set of hyper-parameters that align with each modification. To address this problem, we propose a powerful representation learning technique for visual reinforcement learning using Fourier Neural Operators (FNO). Our findings demonstrate that the proposed FNO encoder effectively learns representations from images that encapsulate the underlying differential equations (PDEs) governing the dynamics of the environment in an online model-free RL framework. The FNO encoder with the Efficient Rainbow algorithm achieves a median Human Normalized Score (HNS) of $26.1\%$ on the Atari100k benchmark across 26 environments, delivering a $10$-point enhancement over the CNN-based Efficient Rainbow algorithm. In the context of offline reinforcement learning on Atari games, we achieve a remarkable $2.89\times$ improvement compared to state-of-the-art transformer-based models. Additionally, upon using our FNO encoder with the A2C algorithm on the ViZDoom environment, we achieve a $\sim38\%$ improvement in rewards in the first $200$ episodes. Further, we match the vanilla A2C performance after just $\sim100$ episodes. We also achieve an $81\%$ mean normalized score on the CARLA Autonomous Driving …
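For readers unfamiliar with FNOs, a single 2D spectral-convolution layer of the kind such an encoder could stack looks roughly like this (channel counts and retained modes are assumptions, not the paper's architecture):

```python
import torch
import torch.nn as nn

class SpectralConv2d(nn.Module):
    def __init__(self, in_ch, out_ch, modes=12):
        super().__init__()
        self.modes = modes
        self.weight = nn.Parameter(
            torch.randn(in_ch, out_ch, modes, modes, dtype=torch.cfloat) / (in_ch * out_ch))

    def forward(self, x):                            # x: (B, C, H, W)
        B, _, H, W = x.shape
        x_ft = torch.fft.rfft2(x)                    # to the Fourier domain
        out_ft = torch.zeros(B, self.weight.shape[1], H, W // 2 + 1,
                             dtype=torch.cfloat, device=x.device)
        m = self.modes                               # keep (and mix) only the lowest modes
        out_ft[:, :, :m, :m] = torch.einsum("bixy,ioxy->boxy", x_ft[:, :, :m, :m], self.weight)
        return torch.fft.irfft2(out_ft, s=(H, W))    # back to the spatial domain

layer = SpectralConv2d(3, 32)
obs = torch.randn(2, 3, 84, 84)                      # e.g. Atari-style image observations
print(layer(obs).shape)                              # torch.Size([2, 32, 84, 84])
```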
Poster
Amirhossein Kazerouni · Soroush Mehraban · Michael Brudno · Babak Taati

[ Exhibit Hall I ]

Abstract
Implicit Neural Representations (INRs) are proving to be a powerful paradigm in unifying task modeling across diverse data domains, offering key advantages such as memory efficiency and resolution independence. Conventional deep learning models are typically modality-dependent, often requiring custom architectures and objectives for different types of signals. However, existing INR frameworks frequently rely on global latent vectors or exhibit computational inefficiencies that limit their broader applicability. We introduce **LIFT**, a novel, high-performance framework that addresses these challenges by capturing multiscale information through meta-learning. LIFT leverages multiple parallel localized implicit functions alongside a hierarchical latent generator to produce unified latent representations that span local, intermediate, and global features. This architecture facilitates smooth transitions across local regions, enhancing expressivity while maintaining inference efficiency. Additionally, we introduce ReLIFT, an enhanced variant of LIFT that incorporates residual connections and expressive frequency encodings. With this straightforward approach, ReLIFT effectively addresses the convergence-capacity gap found in comparable methods, providing an efficient yet powerful solution to improve capacity and speed up convergence. Empirical results show that LIFT achieves state-of-the-art (SOTA) performance in generative modeling and classification tasks, with notable reductions in computational costs. Moreover, in single-task settings, the streamlined ReLIFT architecture proves effective in signal representations and …
Poster
Runkai Zheng · Vishnu Dasu · Yinong Wang · Haohan Wang · Fernando De la Torre

[ Exhibit Hall I ]

Abstract
Modern machine learning models heavily rely on large datasets that often include sensitive and private information, raising serious privacy concerns. Differentially private (DP) data generation offers a solution by creating synthetic datasets that limit the leakage of private information within a predefined privacy budget; however, it requires a substantial amount of data to achieve performance comparable to models trained on the original data. To mitigate the significant expense incurred with synthetic data generation, Dataset Distillation (DD) stands out for its remarkable training and storage efficiency. This efficiency is particularly advantageous when integrated with DP mechanisms, curating compact yet informative synthetic datasets without compromising privacy. However, current state-of-the-art private DD methods suffer from a synchronized sampling-optimization process and the dependency on noisy training signals from randomly initialized networks. This results in the inefficient utilization of private information due to the addition of excessive noise. To address these issues, we introduce a novel framework that decouples sampling from optimization and utilizes auxiliary datasets to identify informative subspaces of the signal. Our approach decouples sampling from optimization for better convergence and improves signal quality by mitigating the impact of DP noise through matching in an informative subspace, all without incurring additional privacy …
Poster
Qihan Huang · Weilong Dai · Jinlong Liu · Wanggui He · Hao Jiang · Mingli Song · Jingyuan CHEN · Chang Yao · Jie Song

[ Exhibit Hall I ]

Abstract
MLLM reasoning has drawn widespread research interest for its excellent problem-solving capability. Current reasoning methods fall into two types: PRM, which supervises the intermediate reasoning steps, and ORM, which supervises the final results. Recently, DeepSeek-R1 has challenged the traditional view that PRM outperforms ORM, demonstrating strong generalization performance using an ORM method (i.e., GRPO). However, current GRPO algorithms for MLLMs still struggle to handle challenging and complex multimodal reasoning tasks (e.g., mathematical reasoning). In this work, we reveal two problems that impede the performance of GRPO on MLLMs: low data utilization and text-bias. Low data utilization refers to the fact that GRPO cannot acquire positive rewards to update the MLLM on difficult samples, while text-bias is a phenomenon in which the MLLM bypasses the image condition and relies solely on the text condition for generation after GRPO training. To tackle these problems, this work proposes Hint-GRPO, which improves data utilization by adaptively providing hints for samples of varying difficulty, and text-bias calibration, which mitigates text-bias by calibrating the token prediction logits with the image condition at test time. Experiment results on three base MLLMs across eleven datasets demonstrate that our proposed methods advance the reasoning capability of the original MLLM by a large margin, exhibiting superior performance to …
Poster
Muhammad Aqeel · Shakiba Sharifi · Marco Cristani · Francesco Setti

[ Exhibit Hall I ]

Abstract
So-called unsupervised anomaly detection is better described as semi-supervised, as it assumes all training data are nominal. This assumption simplifies training but requires manual data curation, introducing bias and limiting adaptability. We propose Confident Meta-learning (CoMet), a novel training strategy that enables deep anomaly detection models to learn from uncurated datasets where nominal and anomalous samples coexist, eliminating the need for explicit filtering. Our approach integrates Confident Learning, which assigns lower weights to low-confidence samples, and Meta-Learning, which stabilizes training by regularizing updates based on training-validation loss covariance. This prevents overfitting and enhances robustness to noisy data. CoMet is model-agnostic and can be applied to any anomaly detection method trainable via gradient descent. Experiments on MVTec-AD, VIADUCT, and KSDD2 with two state-of-the-art models demonstrate the effectiveness of our approach, consistently improving over the baseline methods, remaining insensitive to anomalies in the training set, and setting a new state-of-the-art across all datasets. Code will be made available upon acceptance.
Poster
Johannes Künzel · Anna Hilsmann · Peter Eisert

[ Exhibit Hall I ]

Abstract
We introduce RIPE, an innovative reinforcement learning-based framework for weakly-supervised training of a keypoint extractor that excels in both detection and description tasks. In contrast to conventional training regimes that depend heavily on artificial transformations, pre-generated models, or 3D data, RIPE requires only a binary label indicating whether paired images represent the same scene. This minimal supervision significantly expands the pool of training data, enabling the creation of a highly generalized and robust keypoint extractor. RIPE utilizes the encoder's intermediate layers for the description of the keypoints with a hyper-column approach to integrate information from different scales. Additionally, we propose an auxiliary loss to enhance the discriminative capability of the learned descriptors. Comprehensive evaluations on standard benchmarks demonstrate that RIPE simplifies data preparation while achieving competitive performance compared to state-of-the-art techniques, marking a significant advancement in robust keypoint extraction and description. Code and data will be made available for research purposes.
Poster
Yiting Yang · Hao Luo · Yuan Sun · Qingsen Yan · Haokui Zhang · Wei Dong · Guoqing Wang · Peng Wang · Yang Yang · Heng Tao Shen

[ Exhibit Hall I ]

Abstract
A prevalent approach in Parameter-Efficient Fine-Tuning (PEFT) of pre-trained Vision Transformers (ViT) involves freezing the majority of the backbone parameters and solely learning low-rank adaptation weight matrices to accommodate downstream tasks. These low-rank matrices are commonly derived through the multiplication structure of down-projection and up-projection matrices, exemplified by methods such as LoRA and Adapter. In this study, we observe an approximate orthogonality between any two row or column vectors within any weight matrix of the backbone parameters; however, this property is absent in the vectors of the down/up-projection matrices. Approximate orthogonality implies a reduction in the upper bound of the model's generalization error, signifying that the model possesses enhanced generalization capability. If the fine-tuned down/up-projection matrices were to exhibit this same property as the pre-trained backbone matrices, could the generalization capability of fine-tuned ViTs be further augmented? To address this question, we propose an Approximately Orthogonal Fine-Tuning (AOFT) strategy for representing the low-rank weight matrices. This strategy employs a single learnable vector to generate a set of approximately orthogonal vectors, which form the down/up-projection matrices, thereby aligning the properties of these matrices with those of the backbone. Extensive experimental results demonstrate that our method achieves competitive performance across a range …
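One simple way a single vector can parameterize (exactly) orthogonal directions is via a Householder reflection; the sketch below is an illustration of that general idea, not necessarily the construction used by AOFT.

```python
import torch

def householder_columns(v, r):
    v = v / v.norm()
    H = torch.eye(v.numel()) - 2.0 * torch.outer(v, v)   # orthogonal d x d reflection matrix
    return H[:, :r]                                       # d x r block with orthonormal columns

v = torch.randn(768, requires_grad=True)     # the single learnable vector
down = householder_columns(v, 16)            # hypothetical LoRA-style down-projection
print((down.T @ down - torch.eye(16)).abs().max())   # ≈ 0: columns are orthonormal
```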
Poster
Sangyun Shin · Yuhang He · Xinyu Hou · Samuel Hodgson · Andrew Markham · Niki Trigoni

[ Exhibit Hall I ]

Abstract
The robustness of 3D object detection in large-scale outdoor point clouds degrades significantly when deployed in an unseen environment due to domain shifts. To minimize the domain gap, existing works on domain adaptive detection focus on several factors, including point density and object shapes and sizes, to reduce false negative detections. However, the adaptation results indicate that there are still remaining challenges. We argue that this is due to the difficulty of recognizing comparably less distinctive regions on the object surface caused by sparsity, occlusion, etc. In this work, we aim to reinforce those features by generating points on the object surface to make them straightforwardly recognizable. We draw our motivation from a common observation that detection proposals already contain accurate bounding boxes, but with relatively low objectness score predictions, which lead to false negatives. Given these box proposals, we densify sparse object points with a diffusion approach. As a result, our model DiffRefine can act as a simple additional module before second-stage refinement, which most existing two-stage detection models can use. Experimental results on domain adaptive detection show competitive performance, especially on points vanishing due to distance, across various detection architectures.
Poster
Kunyang Li · Jean-Charles Noirot Ferrand · Ryan Sheatsley · Blaine Hoak · Yohan Beugin · Eric Pauley · Patrick McDaniel

[ Exhibit Hall I ]

Abstract
Fine-tuning has become the standard practice for adapting pre-trained (upstream) models to downstream tasks. However, the impact on model robustness is not well understood. In this work, we characterize the robustness-accuracy trade-off in fine-tuning. We evaluate the robustness and accuracy of fine-tuned models over 6 benchmark datasets and 7 different fine-tuning strategies. We observe a consistent trade-off between adversarial robustness and accuracy. Peripheral updates such as BitFit are more effective for simple tasks---over 75% above the average measured with area under the Pareto frontiers on CIFAR-10 and CIFAR-100. In contrast, fine-tuning information-heavy layers, such as attention layers via Compacter, achieves a better Pareto frontier on more complex tasks---57.5% and 34.6% above the average on Caltech-256 and CUB-200, respectively. Lastly, we observe that robustness of fine-tuning against out-of-distribution data closely tracks accuracy. These insights emphasize the need for robustness-aware fine-tuning to ensure reliable real-world deployments.
Poster
Taehwan Lee · Kyeongkook Seo · Jaejun Yoo · Sung Whan Yoon

[ Exhibit Hall I ]

Abstract
Flat minima, known to enhance generalization and robustness in supervised learning, remain largely unexplored in generative models. In this work, we systematically investigate the role of loss surface flatness in generative models, both theoretically and empirically, with a particular focus on diffusion models. We establish a theoretical claim that flatter minima improve robustness against perturbations in target prior distributions, leading to benefits such as reduced exposure bias---where errors in noise estimation accumulate over iterations---and significantly improved resilience to model quantization, preserving generative performance even under strong quantization constraints. We further observe that Sharpness-Aware Minimization (SAM), which explicitly controls the degree of flatness, effectively enhances flatness in diffusion models, whereas other well-known methods such as Stochastic Weight Averaging (SWA) and Exponential Moving Average (EMA), which promote flatness indirectly via ensembling, are less effective. Through extensive experiments on CIFAR-10, LSUN Tower, and FFHQ, we demonstrate that flat minima in diffusion models indeed improve not only generative performance but also robustness.
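Generic SAM (as used in the study above, though this sketch is not the authors' training loop) perturbs the weights toward the locally worst case before taking the actual update; for a diffusion model, `loss_fn` would be the usual noise-prediction loss on a sampled timestep.

```python
import torch

def sam_step(model, loss_fn, batch, optimizer, rho=0.05):
    # Assumes every trainable parameter receives a gradient from loss_fn.
    params = [p for p in model.parameters() if p.requires_grad]
    loss = loss_fn(model, batch)
    loss.backward()
    with torch.no_grad():
        grad_norm = torch.norm(torch.stack([p.grad.norm() for p in params])) + 1e-12
        eps = [rho * p.grad / grad_norm for p in params]
        for p, e in zip(params, eps):
            p.add_(e)                        # ascend to a worst-case neighbour
    optimizer.zero_grad()
    loss_fn(model, batch).backward()         # gradient at the perturbed weights
    with torch.no_grad():
        for p, e in zip(params, eps):
            p.sub_(e)                        # restore the original weights
    optimizer.step()                         # apply the sharpness-aware gradient
    optimizer.zero_grad()
    return loss.item()
```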

Oral 2A: View Synthesis and Scene Reconstruction Tue 21 Oct 01:30 p.m.  

Oral
Hanwen Jiang · Hao Tan · Peng Wang · Haian Jin · Yue Zhao · Sai Bi · Kai Zhang · Fujun Luan · Kalyan Sunkavalli · Qixing Huang · Georgios Pavlakos

[ Exhibit Hall III ]

Abstract
We present RayZer, a self-supervised multi-view 3D Vision model trained without any 3D supervision, i.e., camera poses and scene geometry, while exhibiting emerging 3D awareness. Concretely, RayZer takes unposed and uncalibrated images as input, recovers camera parameters, reconstructs a scene representation, and synthesizes novel views. During training, RayZer relies solely on its self-predicted camera poses to render target views, eliminating the need for any ground-truth camera annotations and allowing RayZer to be trained with 2D image supervision. The emerging 3D awareness of RayZer is attributed to two key factors. First, we design a self-supervised framework, which achieves 3D-aware auto-encoding of input images by disentangling camera and scene representations. Second, we design a transformer-based model in which the only 3D prior is the ray structure, connecting camera, pixel, and scene simultaneously. RayZer demonstrates comparable or even superior novel view synthesis performance than ``oracle'' methods that rely on pose annotations in both training and testing.
Oral
Alexander Mai · Peter Hedman · George Kopanas · Dor Verbin · David Futschik · Qiangeng Xu · Falko Kuester · Jonathan Barron · Yinda Zhang

[ Exhibit Hall III ]

Abstract
We present Exact Volumetric Ellipsoid Rendering (EVER), a method for real-time 3D reconstruction. EVER accurately blends an unlimited number of overlapping primitives together in 3D space, eliminating the popping artifacts that 3D Gaussian Splatting (3DGS) and other related methods exhibit. EVER represents a radiance field as a set of constant-density volumetric ellipsoids, which are raytraced by intersecting each primitive twice (once upon ray entrance and again on ray exit) and accumulating the derivatives of the densities and colors along the ray. Because EVER is built around ray tracing, it also enables effects such as defocus blur and fish-eye camera distortion, while still achieving frame rates of ~30 FPS at 720p on an NVIDIA RTX4090. We show that our method is more accurate on the challenging large-scale scenes from the Zip-NeRF dataset, where it achieves state-of-the-art SSIM, even higher than Zip-NeRF.
Oral
Chen Zhao · Xuan Wang · Tong Zhang · Saqib Javed · Mathieu Salzmann

[ Exhibit Hall III ]

Abstract
3D Gaussian Splatting (3DGS) has demonstrated remarkable effectiveness in novel view synthesis (NVS). However, 3DGS tends to overfit when trained with sparse views, limiting its generalization to novel viewpoints. In this paper, we address this overfitting issue by introducing Self-Ensembling Gaussian Splatting (SE-GS). We achieve self-ensembling by incorporating an uncertainty-aware perturbation strategy during training. A $\mathbf{\Delta}$-model and a $\mathbf{\Sigma}$-model are jointly trained on the available images. The $\mathbf{\Delta}$-model is dynamically perturbed based on rendering uncertainty across training steps, generating diverse perturbed models with negligible computational overhead. Discrepancies between the $\mathbf{\Sigma}$-model and these perturbed models are minimized throughout training, forming a robust ensemble of 3DGS models. This ensemble, represented by the $\mathbf{\Sigma}$-model, is then used to generate novel-view images during inference. Experimental results on the LLFF, Mip-NeRF360, DTU, and MVImgNet datasets demonstrate that our approach enhances NVS quality under few-shot training conditions, outperforming existing state-of-the-art methods.
Oral
Weirong Chen · Ganlin Zhang · Felix Wimbauer · Rui Wang · Nikita Araslanov · Andrea Vedaldi · Daniel Cremers

[ Exhibit Hall III ]

Abstract
Traditional SLAM systems, which rely on bundle adjustment, often struggle with highly dynamic scenes commonly found in casual videos. Such videos entangle the motion of dynamic elements, undermining the assumption of static environments required by traditional systems. Existing techniques either filter out dynamic elements or model their motion independently. However, the former often results in incomplete reconstructions, whereas the latter can lead to inconsistent motion estimates. This work proposes a novel approach that leverages a 3D point tracker to decouple the static and dynamic motion, effectively separating the camera-induced motion from the motion of dynamic objects. Bundle adjustment can therefore operate reliably considering only the camera-induced component of the observed motion. We further ensure depth consistency across video frames with lightweight post-processing based on scale maps. Our framework combines the core of traditional SLAM -- bundle adjustment -- with a robust learning-based 3D tracker front-end. By integrating motion decomposition, bundle adjustment, and depth refinement into a unified framework, our method accurately tracks the camera motion and produces temporally coherent and scale-consistent dense reconstructions, accommodating both static and dynamic elements. Our experiments on challenging datasets reveal significant improvements in camera pose estimation and 3D reconstruction accuracy.
Oral
Yue Li · Qi Ma · Runyi Yang · Huapeng Li · Mengjiao Ma · Bin Ren · Nikola Popovic · Nicu Sebe · Ender Konukoglu · Theo Gevers · Luc Gool · Martin R. Oswald · Danda Pani Paudel

[ Exhibit Hall III ]

Abstract
Recognizing arbitrary or previously unseen categories is essential for comprehensive real-world 3D scene understanding. Currently, all existing methods rely on 2D or textual modalities during training, or together at inference. This highlights a clear absence of a model capable of processing 3D data alone for learning semantics end-to-end, along with the necessary data to train such a model. Meanwhile, 3D Gaussian Splatting (3DGS) has emerged as the de facto standard for 3D scene representation across various vision tasks. However, effectively integrating semantic reasoning into 3DGS in a generalizable fashion remains an open challenge. To address these limitations, we introduce SceneSplat, to our knowledge the first large-scale 3D indoor scene understanding approach that operates natively on 3DGS. Furthermore, we propose a self-supervised learning scheme that unlocks rich 3D feature learning from unlabeled scenes. In order to power the proposed methods, we introduce SceneSplat-7K, the first large-scale 3DGS dataset for indoor scenes, comprising 6868 scenes derived from 7 established datasets like ScanNet, Matterport3D, etc. Generating SceneSplat-7K required computational resources equivalent to 119 GPU-days on an L4 GPU, enabling standardized benchmarking for 3DGS-based reasoning for indoor scenes. Our exhaustive experiments on SceneSplat-7K demonstrate the significant benefit of the proposed methods over the established baselines. …

Oral 2B: Efficient Learning Tue 21 Oct 01:30 p.m.  

Oral
Uranik Berisha · Jens Mehnert · Alexandru Condurache

[ Kalakaua Ballroom ]

Abstract
The increasingly expensive training of ever larger models such as Vision Transformers motivates reusing the vast library of already trained state-of-the-art networks. However, their latency, high computational costs, and memory demands pose significant challenges for deployment, especially on resource-constrained hardware. While structured pruning methods can reduce these factors, they often require costly retraining, sometimes for up to hundreds of epochs, or even training from scratch to recover the accuracy lost through the structural modifications. Maintaining the performance of trained models after structured pruning, and thereby avoiding extensive retraining, remains a challenge. To solve this, we introduce Variance-Based Pruning, a simple and structured one-shot pruning technique for efficiently compressing networks with minimal finetuning. Our approach first gathers activation statistics, which are then used to select neurons for pruning. Simultaneously, the mean activations are integrated back into the model to preserve a high degree of performance. On ImageNet-1k recognition tasks, we demonstrate that directly after pruning DeiT-Base retains over 70% of its original performance and requires only 10 epochs of fine-tuning to regain 99% of the original accuracy, while simultaneously reducing MACs by 35% and model size by 36%, thus speeding up the model by 1.44 times.
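As a concrete illustration of the general recipe (select neurons by activation variance, then fold the pruned neurons' mean activations back into the next layer), here is a minimal sketch on a toy two-layer MLP; the layer sizes, the 25% pruning ratio, and the random calibration data are assumptions, not the paper's setup.

```python
# Hypothetical sketch of variance-based one-shot pruning with mean compensation.
import torch
import torch.nn as nn

torch.manual_seed(0)
fc1, fc2, act = nn.Linear(64, 128), nn.Linear(128, 10), nn.ReLU()
X = torch.randn(1000, 64)                     # stand-in calibration batch

with torch.no_grad():
    H = act(fc1(X))                           # hidden activations
    var, mean = H.var(dim=0), H.mean(dim=0)
    keep = var.argsort(descending=True)[:96]  # keep the highest-variance neurons

    # Build the pruned layers by slicing out the kept neurons.
    fc1_p, fc2_p = nn.Linear(64, 96), nn.Linear(96, 10)
    fc1_p.weight.copy_(fc1.weight[keep]); fc1_p.bias.copy_(fc1.bias[keep])
    fc2_p.weight.copy_(fc2.weight[:, keep]); fc2_p.bias.copy_(fc2.bias)

    # Fold the pruned neurons' mean activation into the next layer's bias,
    # so the expected output is approximately preserved without retraining.
    pruned = torch.ones(128, dtype=torch.bool); pruned[keep] = False
    fc2_p.bias += fc2.weight[:, pruned] @ mean[pruned]

    err = (fc2(act(fc1(X))) - fc2_p(act(fc1_p(X)))).abs().mean()
    print(f"mean output deviation after pruning: {err:.4f}")
```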
Oral
Haoyu Wu · Jingyi Xu · Hieu Le · Dimitris Samaras

[ Kalakaua Ballroom ]

Abstract
Token merging can effectively accelerate various vision systems by processing groups of similar tokens only once and sharing the results across them. However, existing token grouping methods are often ad hoc and random, disregarding the actual content of the samples. We show that preserving high-information tokens during merging—those essential for semantic fidelity and structural details—significantly improves sample quality, producing finer details and more coherent, realistic generations. Despite being simple and intuitive, this approach remains underexplored. To this end, we propose an importance-based token merging method that prioritizes the most critical tokens in computational resource allocation, leveraging readily available importance scores, such as those from classifier-free guidance in diffusion models. Experiments show that our approach significantly outperforms baseline methods across multiple applications, including text-to-image synthesis, multi-view image generation, and video generation, with various model architectures such as Stable Diffusion, Zero123++, AnimateDiff, and PixArt-$\alpha$.
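A hypothetical sketch of importance-based merging: keep the highest-importance tokens and average the rest into their most similar kept token. The importance scores here are random stand-ins (the abstract suggests, e.g., classifier-free-guidance signals), and the merge operator is a generic nearest-kept-token average, not necessarily the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

def importance_token_merge(tokens, importance, keep_ratio=0.5):
    N, _ = tokens.shape
    k = max(1, int(N * keep_ratio))
    keep_idx = importance.topk(k).indices               # high-importance tokens survive
    kept = tokens[keep_idx].clone()
    merge_idx = [i for i in range(N) if i not in set(keep_idx.tolist())]
    if not merge_idx:
        return kept, keep_idx
    sim = F.normalize(tokens[merge_idx], dim=-1) @ F.normalize(kept, dim=-1).T
    assign = sim.argmax(dim=-1)                         # nearest kept token for each merged one
    counts = torch.ones(k)
    for src, dst in zip(merge_idx, assign.tolist()):
        kept[dst] += tokens[src]
        counts[dst] += 1
    return kept / counts[:, None], keep_idx

tokens = torch.randn(16, 8)
importance = torch.rand(16)      # e.g. derived from classifier-free guidance (assumed)
merged, keep_idx = importance_token_merge(tokens, importance)
print(merged.shape)              # torch.Size([8, 8])
```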
Oral
Yunuo Chen · Zezheng Lyu · Bing He · Ning Cao · Gang chen · Guo Lu · Wenjun Zhang

[ Kalakaua Ballroom ]

Abstract
Recent learned image compression (LIC) models have achieved remarkable rate-distortion (RD) performance, yet their high computational complexity severely limits practical deployment. To overcome this challenge, we propose a novel Stage-wise Modular Distillation framework, SMoDi, which efficiently compresses LIC models while preserving RD performance. This framework treats each stage of LIC models as an independent sub-task, mirroring the teacher model’s task decomposition to the student and thereby simplifying knowledge transfer. We identify two crucial factors determining the effectiveness of knowledge distillation: student model construction and loss function design. Specifically, we first propose Teacher-Guided Student Model Construction, a pruning-like method ensuring architectural consistency between teacher and student models. Next, we introduce Implicit End-to-end Supervision, facilitating adaptive energy compaction and bitrate regularization. Based on these insights, we develop KDIC, a lightweight student model derived from the state-of-the-art S2CFormer model. Experimental results demonstrate that KDIC achieves top-tier RD performance with significantly reduced computational complexity. To our knowledge, this work is among the first successful applications of knowledge distillation to learned image compression.
Oral
shengyuan zhang · An Zhao · Ling Yang · Zejian Li · Chenye Meng · Haoran Xu · Tianrun Chen · AnYang Wei · Perry GU · Lingyun Sun

[ Kalakaua Ballroom ]

Abstract
Diffusion models have been applied to 3D LiDAR scene completion due to their strong training stability and high completion quality. However, the slow sampling speed limits the practical application of diffusion-based scene completion models, since autonomous vehicles require efficient perception of surrounding environments. This paper proposes a novel distillation method tailored for 3D LiDAR scene completion models, dubbed $\textbf{ScoreLiDAR}$, which achieves efficient yet high-quality scene completion. ScoreLiDAR enables the distilled model to sample in significantly fewer steps after distillation. To improve completion quality, we also introduce a novel $\textbf{Structural Loss}$, which encourages the distilled model to capture the geometric structure of the 3D LiDAR scene. The loss contains a scene-wise term constraining the holistic structure and a point-wise term constraining the key landmark points and their relative configuration. Extensive experiments demonstrate that ScoreLiDAR significantly accelerates the completion time from 30.55 to 5.37 seconds per frame ($>$5$\times$) on SemanticKITTI and achieves superior performance compared to state-of-the-art 3D LiDAR scene completion models.
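For illustration, a minimal structural-loss sketch with a scene-wise term on holistic statistics and a point-wise term on pairwise distances between landmark points; the specific terms, weights, and the assumption of known point correspondence are illustrative choices, not the paper's exact loss.

```python
import torch

def structural_loss(pred_pts, gt_pts, landmark_idx, w_scene=1.0, w_point=1.0):
    # Scene-wise term: match holistic structure via centroid and spread.
    scene = ((pred_pts.mean(0) - gt_pts.mean(0)).pow(2).sum()
             + (pred_pts.std(0) - gt_pts.std(0)).pow(2).sum())
    # Point-wise term: preserve the relative configuration of key landmarks.
    p, g = pred_pts[landmark_idx], gt_pts[landmark_idx]
    point = (torch.cdist(p, p) - torch.cdist(g, g)).abs().mean()
    return w_scene * scene + w_point * point

pred = torch.randn(2048, 3, requires_grad=True)   # distilled model output (stand-in)
gt = torch.randn(2048, 3)                         # reference scene (stand-in)
loss = structural_loss(pred, gt, landmark_idx=torch.arange(0, 2048, 64))
loss.backward()
print(loss.item())
```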
Oral
Ruonan Yu · Songhua Liu · Zigeng Chen · Jingwen Ye · Xinchao Wang

[ Kalakaua Ballroom ]

Abstract
Dataset distillation or condensation aims to condense a large-scale training dataset into a much smaller synthetic one such that the training performance of the distilled and original sets on neural networks is similar. Although the number of training samples can be reduced substantially, current state-of-the-art methods heavily rely on enormous soft labels to achieve satisfactory performance. As a result, the required storage can be comparable even to the original datasets, especially for large-scale ones. To solve this problem, instead of storing these heavy labels, we propose a novel label-lightening framework termed HeLlO, which aims at effective image-to-label projectors with which synthetic labels can be directly generated online from synthetic images. Specifically, to construct such projectors, we leverage prior knowledge in open-source foundation models, e.g., CLIP, and introduce a LoRA-like fine-tuning strategy to mitigate the gap between pre-trained and target distributions, so that original models for soft-label generation can be distilled into a group of low-rank matrices. Moreover, an effective image optimization method is proposed to further mitigate the potential error between the original and distilled label generators. Extensive experiments show that our method significantly reduces the storage cost to merely 0.001% compared to full soft-label storage methods while achieving comparable performance to state-of-the-art …
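The "LoRA-like" ingredient can be sketched generically as a frozen linear projector corrected by a trainable low-rank update; the dimensions and the projector's role (mapping CLIP-style image features to label logits) are assumptions for illustration, not the paper's architecture.

```python
# Generic LoRA-style adapter sketch: frozen base projection plus a low-rank correction.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 4):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)              # only the low-rank factors train
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
    def forward(self, x):
        return self.base(x) + x @ self.A.T @ self.B.T

proj = LoRALinear(nn.Linear(512, 256))           # hypothetical image-to-label projector
feats = torch.randn(8, 512)                      # stand-in for CLIP-style image features
print(proj(feats).shape)                         # torch.Size([8, 256])
```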

Demonstration: Demos 2 Tue 21 Oct 03:00 p.m.  

  • Simular Agents: Agentic AI that Uses Computers Like a Human, Xin Eric Wang, Hao Liu, Ang Li
  • Controllable 3D Object Generation (From SuperDec to SuperGen), Elisabetta Fedele, Francis Engelmann
  • DINOv3, Patrick Labatut, Daniel Haziza, Akinniyi Akinyemi, Spaso Ilievski, Vasil Khalidov, Francisco Massa, Somya Jain, Michaël Ramamojisoa, Seungeun Yi, Marc Szafraniec, Caleb Ho, Piotr Bojanowski, Dominic Burt, Claire Roberts
  • Virtual Try-Off: Your Garment, Re-imagined, Riza Velioglu

Poster Session 2 & Exhibit Hall with Coffee Break Tue 21 Oct 03:00 p.m.  

Poster
Sara Rojas Martinez · Matthieu Armando · Bernard Ghanem · Philippe Weinzaepfel · Vincent Leroy · Grégory Rogez

[ Exhibit Hall I ]

Abstract
Recovering the 3D geometry of a scene from a sparse set of uncalibrated images is a long-standing problem in computer vision. While recent learning-based approaches such as DUSt3R and MASt3R have demonstrated impressive results by directly predicting dense scene geometry, they are primarily trained on outdoor scenes with static environments and struggle to handle human-centric scenarios. In this work, we introduce HAMSt3R, an extension of MASt3R for joint human and scene 3D reconstruction from sparse, uncalibrated multi-view images. First, we use a strong image encoder by distilling the ones from MASt3R and from a state-of-the-art Human Mesh Recovery (HMR) model, multi-HMR, for a better understanding of scene geometry and human bodies. Our method then incorporates additional network heads to segment humans, estimate dense correspondences via DensePose, and predict depth in human-centric environments, enabling a more holistic 3D reconstruction. By leveraging the outputs of our different heads, HAMSt3R produces a dense point map enriched with human semantic information in 3D. Unlike existing methods that rely on complex optimization pipelines, our approach is fully feed-forward and efficient, making it suitable for real-world applications. We evaluate our model on EgoHumans and EgoExo4D, two challenging benchmarks containing diverse human-centric scenarios. Additionally, we validate its generalization to …
Poster
Mahmoud Afifi · Luxi Zhao · Abhijith Punnappurath · Mohamed Abdelsalam · Ran Zhang · Michael Brown

[ Exhibit Hall I ]

Abstract
Cameras rely on auto white balance (AWB) to correct undesirable color casts caused by scene illumination and the camera’s spectral sensitivity. This is typically achieved using an illuminant estimator that determines the global color cast solely from the color information in the camera's raw sensor image. Mobile devices provide valuable additional metadata---such as capture timestamp and geolocation---that offers strong contextual clues to help narrow down the possible illumination solutions. This paper proposes a lightweight illuminant estimation method that incorporates such contextual metadata, along with additional capture information and image colors, into a compact model ($\sim$5K parameters), achieving promising results that match or surpass those of larger models. To validate our method, we introduce a dataset of 3,224 smartphone images with contextual metadata collected at various times of day and under diverse lighting conditions. The dataset includes ground-truth illuminant colors, determined using a color chart, and user-preferred illuminants validated through a user study, providing a comprehensive benchmark for AWB evaluation.
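To illustrate the scale of such a model, a hypothetical tiny estimator that concatenates image color statistics with capture metadata lands in the few-thousand-parameter range; the feature choices, sizes, and chromaticity output are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class TinyAWB(nn.Module):
    # Hypothetical compact illuminant estimator combining color statistics and metadata.
    def __init__(self, hist_bins=32, meta_dim=4, hidden=48):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hist_bins + meta_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2),                 # e.g. (r/g, b/g) illuminant chromaticity
        )
    def forward(self, hist, meta):
        return self.net(torch.cat([hist, meta], dim=-1))

model = TinyAWB()
print(sum(p.numel() for p in model.parameters()))  # a few thousand parameters
hist = torch.rand(1, 32)                           # image color histogram (stand-in)
meta = torch.rand(1, 4)                            # e.g. sin/cos of hour, lat, lon (stand-in)
print(model(hist, meta))
```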
Poster
Hanxue Zhang · Haoran Jiang · Qingsong Yao · Yanan SUN · Renrui Zhang · Hao Zhao · Hongyang Li · Hongzi Zhu · Zetong Yang

[ Exhibit Hall I ]

Abstract
Despite the success of deep learning in closed-set 3D object detection, existing approaches struggle with zero-shot generalization to novel objects and camera configurations. We introduce DetAny3D, a promptable 3D detection foundation model capable of detecting any novel object under arbitrary camera configurations using only monocular inputs. Training a foundation model for 3D detection is fundamentally constrained by the limited availability of annotated 3D data, which motivates DetAny3D to leverage the rich prior knowledge embedded in extensively pre-trained 2D foundation models to compensate for this scarcity. To effectively transfer 2D knowledge to 3D, DetAny3D incorporates two core modules: the 2D Aggregator, which aligns features from different 2D foundation models, and the 3D Interpreter with Zero-Embedding Mapping, which mitigates catastrophic forgetting in 2D-to-3D knowledge transfer. Experimental results validate the strong generalization of our DetAny3D, which not only achieves state-of-the-art performance on unseen categories and novel camera configurations, but also surpasses most competitors on in-domain data. DetAny3D sheds light on the potential of the 3D foundation model for diverse applications in real-world scenarios, e.g., rare object detection in autonomous driving, and demonstrates promise for further exploration of 3D-centric tasks in open-world settings.
Poster
Marko Mihajlovic · Siwei Zhang · Gen Li · KAIFENG ZHAO · Lea Müller · Siyu Tang

[ Exhibit Hall I ]

Abstract
Parametric human body models play a crucial role in computer graphics and vision, enabling applications ranging from human motion analysis to understanding human-environment interactions. Traditionally, these models use surface meshes, which pose challenges in efficiently handling interactions with other geometric entities, such as objects and scenes, typically represented as meshes or point clouds. To address this limitation, recent research has explored volumetric neural implicit body models. However, existing works are either insufficiently robust for complex human articulations or impose high computational and memory costs, limiting their widespread use. To this end, we introduce VolumetricSMPL, a neural volumetric body model that leverages Neural Blend Weights (NBW) to generate compact, yet efficient MLP decoders. Unlike prior approaches that rely on large MLPs, NBW dynamically blends a small set of learned weight matrices using predicted shape- and pose-dependent coefficients, significantly improving computational efficiency while preserving expressiveness. VolumetricSMPL outperforms the prior volumetric occupancy model COAP with 10× faster inference, 6× lower GPU memory usage, enhanced accuracy, and a Signed Distance Function (SDF) for efficient and differentiable contact modeling. We demonstrate VolumetricSMPL’s strengths across four challenging tasks: (1) reconstructing human-object interactions from in-the-wild images, (2) recovering human meshes in 3D scenes from egocentric views, (3) scene-constrained …
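A minimal sketch of the blend-weights idea: the effective weight of a small decoder layer is a mixture of a few learned weight matrices, with mixing coefficients predicted from a conditioning code (e.g., shape/pose). The sizes, softmax mixing, and conditioning are assumptions for illustration rather than VolumetricSMPL's exact design.

```python
import torch
import torch.nn as nn

class BlendedLinear(nn.Module):
    # Toy "blend weights" layer: the weight is a per-sample mixture of K learned matrices.
    def __init__(self, in_dim, out_dim, cond_dim, K=4):
        super().__init__()
        self.banks = nn.Parameter(torch.randn(K, out_dim, in_dim) * 0.02)
        self.bias = nn.Parameter(torch.zeros(out_dim))
        self.coef = nn.Linear(cond_dim, K)
    def forward(self, x, cond):
        alpha = torch.softmax(self.coef(cond), dim=-1)       # (B, K) blending weights
        W = torch.einsum('bk,koi->boi', alpha, self.banks)   # per-sample weight matrix
        return torch.einsum('boi,bi->bo', W, x) + self.bias

layer = BlendedLinear(in_dim=3, out_dim=64, cond_dim=16)
x = torch.randn(8, 3)        # e.g. query points
cond = torch.randn(8, 16)    # e.g. shape/pose code (stand-in)
print(layer(x, cond).shape)  # torch.Size([8, 64])
```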
Poster
Maria-Paola Forte · Nikos Athanasiou · Giulia Ballardini · Jan Bartels · Katherine J. Kuchenbecker · Michael Black

[ Exhibit Hall I ]

Abstract
Capturing accurate 3D human pose in the wild would provide valuable data for training motion-generation and pose-estimation methods. While video-based capture methods are increasingly accurate, we observe that they often fail in cases involving self-contact, such as a hand touching the face. Natural human behavior frequently includes self-contact, but determining when it occurs is challenging from video alone. In contrast, wearable bioimpedance sensing can cheaply and unobtrusively measure ground-truth skin-to-skin contact. Consequently, we propose a novel approach that combines visual pose estimators with bioimpedance sensing to capture the 3D pose of people by taking self-contact into account. Our method, BioTUCH, initializes the pose using an off-the-shelf estimator and introduces contact-aware pose optimization that minimizes reprojection error and deviations from the input estimate while enforcing vertex proximity constraints based on the measured start and end of self-touch. We validate our approach using a new dataset of synchronized RGB video, bioimpedance measurements, and 3D motion capture, demonstrating an average of 18.5% improvement in reconstruction accuracy. Our framework enables efficient large-scale collection of contact-aware training data for improving pose estimation and generation. Code and data will be shared publicly.
Poster
Lin Bie · Siqi Li · Yifan Feng · Yue Gao

[ Exhibit Hall I ]

Abstract
Monocular depth estimation (MDE) is a fundamental problem in computer vision with wide-ranging applications in various downstream tasks. While multi-scale features are perceptually critical for MDE, existing transformer-based methods have yet to leverage them explicitly. To address this limitation, we propose a hypergraph-based multi-scale representation fusion framework, Hyper-Depth. The proposed Hyper-Depth incorporates two key components: a Semantic Consistency Enhancement (SCE) module and a Geometric Consistency Constraint (GCC) module. The SCE module, designed based on hypergraph convolution, aggregates global information and enhances the representation of multi-scale patch features. Meanwhile, the GCC module provides geometric guidance to reduce over-fitting errors caused by excessive reliance on local features. In addition, we introduce a Correlation-based Conditional Random Fields (C-CRFs) module as the decoder to filter correlated patches and compute attention weights more effectively. Extensive experiments demonstrate that our method significantly outperforms state-of-the-art approaches across all evaluation metrics on the KITTI and NYU-Depth-v2 datasets, achieving improvements of 6.21% and 3.32% on the main metric RMSE, respectively. Furthermore, zero-shot evaluations on the nuScenes and SUN-RGBD datasets validate the generalizability of our approach.
Poster
Varun Sundar · Tianyi Zhang · Sacha Jungerman · Mohit Gupta

[ Exhibit Hall I ]

Abstract
Quanta image sensors record individual photons, enabling capabilities like imaging in near-complete darkness and ultra-high-speed videography. Yet, most research on quanta sensors is limited to recovering image intensities. Can we go beyond just imaging, and develop algorithms that can extract high-level scene information from quanta sensors? This could unlock new possibilities in vision systems, offering reliable operation in extreme conditions. The challenge: raw photon streams captured by quanta sensors have fundamentally different characteristics than conventional images, making them incompatible with vision models. One approach is to first transform raw photon streams to conventional-like images, but this is prohibitively expensive in terms of compute, memory, and latency, making it impractical for most vision and robotics systems. We propose quanta neural networks (QNNs) that directly produce downstream task objectives from raw photon streams. Our core proposal is a trainable QNN layer that can seamlessly integrate with existing image- and video-based neural networks, producing quanta counterparts. By avoiding image reconstruction and allocating computational resources on a scene-adaptive basis, QNNs achieve $1$--$2$ orders of magnitude improvements across all efficiency metrics (compute, latency, readout bandwidth) as compared to reconstruction-based quanta vision, while maintaining high task accuracy across a wide gamut of challenging scenarios including low …
Poster
Haokai Zhu · Bo Qu · Si-Yuan Cao · Runmin Zhang · Shujie Chen · Bailin Yang · Hui-liang Shen

[ Exhibit Hall I ]

Abstract
Previous deep image registration methods that employ single homography, multi-grid homography, or thin-plate spline often struggle with real scenes containing depth disparities due to their inherent limitations. To address this, we propose an Exponential-Decay Free-Form Deformation Network (EDFFDNet), which employs free-form deformation with an exponential-decay basis function. This design achieves higher efficiency and performs well in scenes with depth disparities, benefiting from its inherent locality. We also introduce an Adaptive Sparse Motion Aggregator (ASMA), which replaces the MLP motion aggregator used in previous methods. By transforming dense interactions into sparse ones, ASMA reduces parameters and improves accuracy. Additionally, we propose a progressive correlation refinement strategy that leverages global-local correlation patterns for coarse-to-fine motion estimation, further enhancing efficiency and accuracy. Experiments demonstrate that EDFFDNet reduces parameters, memory, and total runtime by 70.5%, 32.6%, and 33.7%, respectively, while achieving a 0.5 dB PSNR gain over the state-of-the-art method. With an additional local refinement stage, EDFFDNet-2 further improves PSNR by 1.06 dB while maintaining lower computational costs. Our method also demonstrates strong generalization ability across datasets, outperforming previous deep learning methods.
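A toy sketch of free-form deformation with an exponential-decay basis, where each control point's influence falls off as exp(-alpha * distance), giving the locality the abstract highlights; the normalization and parameter values are illustrative assumptions, not EDFFDNet's exact basis.

```python
import numpy as np

def exp_decay_ffd(points, ctrl_pts, ctrl_disp, alpha=2.0):
    # Weight each control point's displacement by an exponentially decaying
    # function of its distance to the query point (local influence).
    d = np.linalg.norm(points[:, None, :] - ctrl_pts[None, :, :], axis=-1)  # (N, C)
    w = np.exp(-alpha * d)
    w /= w.sum(axis=1, keepdims=True)        # normalized influence per point
    return points + w @ ctrl_disp            # displaced points

pts = np.stack(np.meshgrid(np.linspace(0, 1, 5), np.linspace(0, 1, 5)), -1).reshape(-1, 2)
ctrl = np.array([[0.25, 0.25], [0.75, 0.75]])          # control point locations
disp = np.array([[0.05, 0.0], [0.0, -0.05]])           # control point displacements
print(exp_decay_ffd(pts, ctrl, disp)[:3])
```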
Poster
Ruifei Zhang · Junlin Xie · Wei Zhang · Weikai Chen · Xiao Tan · Xiang Wan · Guanbin Li

[ Exhibit Hall I ]

Abstract
Effectively integrating Large Language Models (LLMs) into autonomous driving requires a balance between leveraging high-level reasoning and maintaining real-time efficiency. Existing approaches either activate LLMs too frequently, causing excessive computational overhead, or use fixed schedules, failing to adapt to dynamic driving conditions. To address these challenges, we propose AdaDrive, an adaptively collaborative slow-fast framework that optimally determines when and how LLMs contribute to decision-making. (1) \textbf{When} to activate the LLM: AdaDrive employs a novel adaptive activation loss that dynamically determines LLM invocation based on a comparative learning mechanism, ensuring activation only in complex or critical scenarios. (2) \textbf{How} to integrate LLM assistance: Instead of rigid binary activation, AdaDrive introduces an adaptive fusion strategy that modulates a continuous, scaled LLM influence based on scene complexity and prediction confidence, ensuring seamless collaboration with conventional planners. Through these strategies, AdaDrive provides a flexible, context-aware framework that maximizes decision accuracy without compromising real-time performance. Extensive experiments on language-grounded autonomous driving benchmarks demonstrate that AdaDrive achieves state-of-the-art performance in terms of both driving accuracy and computational efficiency.
Poster
Yuhang Yang · Fengqi Liu · Yixing Lu · Qin Zhao · Pingyu Wu · Wei Zhai · Ran Yi · Yang Cao · Lizhuang Ma · Zheng-Jun Zha · Junting Dong

[ Exhibit Hall I ]

Abstract
3D human digitization has long been a highly pursued yet challenging task. Existing methods aim to generate high-quality 3D digital humans from single or multiple views, but remain primarily constrained by current paradigms and the scarcity of 3D human assets. Specifically, recent approaches fall into several paradigms: optimization-based and feed-forward (both single-view regression and multi-view generation with reconstruction). However, they are limited, respectively, by slow speed, low quality, cascade reasoning, and ambiguity in mapping low-dimensional planes to high-dimensional space due to occlusion and invisibility. Furthermore, existing 3D human assets remain small-scale, insufficient for large-scale training. To address these challenges, we propose a latent space generation paradigm for 3D human digitization: multi-view images are compressed into Gaussians via a UV-structured VAE, and DiT-based conditional generation transforms the ill-posed low-to-high-dimensional mapping problem into a learnable distribution shift, which also supports end-to-end inference. In addition, we employ a multi-view optimization approach combined with synthetic data to construct the HGS-1M dataset, which contains $1$ million 3D Gaussian assets to support large-scale training. Experimental results demonstrate that our paradigm, powered by large-scale training, produces high-quality 3D human Gaussians with intricate textures, facial details, and loose clothing deformation. All training code, models, …
Poster
Xuying Zhang · Yutong Liu · Yangguang Li · Renrui Zhang · Yufei Liu · Kai Wang · Wanli Ouyang · Zhiwei Xiong · Peng Gao · Qibin Hou · Ming-Ming Cheng

[ Exhibit Hall I ]

Abstract
We present TAR3D, a novel framework that consists of a 3D-aware Vector Quantized-Variational AutoEncoder (VQVAE) and a Generative Pre-trained Transformer (GPT) to generate high-quality 3D assets. The core insight of this work is to migrate the multimodal unification and promising learning capabilities of the next-token prediction paradigm to conditional 3D object generation. To achieve this, the 3D VQ-VAE first encodes a wide range of 3D shapes into a compact triplane latent space and utilizes a set of discrete representations from a trainable codebook to reconstruct fine-grained geometries under the supervision of query point occupancy. Then, the 3D GPT, equipped with a custom triplane position embedding called TriPE, predicts the codebook index sequence with prefilling prompt tokens in an autoregressive manner so that the composition of 3D geometries can be modeled part by part. Extensive experiments on ShapeNet and Objaverse demonstrate that TAR3D can achieve superior generation quality over existing methods in text-to-3D and image-to-3D tasks.
Poster
Jinjing Zhu · Tianbo Pan · Zidong Cao · Yexin Liu · James Kwok · Hui Xiong

[ Exhibit Hall I ]

Abstract
With the superior sensitivity of event cameras to high-speed motion and extreme lighting conditions, event-based monocular depth estimation has gained popularity to predict structural information about surrounding scenes in challenging environments. However, the scarcity of labeled event data constrains prior supervised learning methods. Unleashing the promising potential of the existing RGB-based depth foundation model, DAM~\cite{yang2024depth}, we propose Depth Any Event stream (EventDAM) to achieve high-performance event-based monocular depth estimation in an annotation-free manner. EventDAM effectively combines paired dense RGB images with sparse event data by incorporating three key cross-modality components: Sparsity-aware Feature Mixture (SFM), Sparsity-aware Feature Distillation (SFD), and Sparsity-invariant Consistency Module (SCM). With the proposed sparsity metric, SFM mixes features from RGB images and event data to generate auxiliary depth predictions, while SFD facilitates adaptive feature distillation. Furthermore, SCM ensures output consistency across varying sparsity levels in event data, thereby endowing EventDAM with zero-shot capabilities across diverse scenes. Extensive experiments across a variety of benchmark datasets, compared to approaches using diverse input modalities, robustly substantiate the generalization and zero-shot capabilities of EventDAM. Project page: \url{http://}.
Poster
Bing Fan · Yunhe Feng · Yapeng Tian · James Liang · Yuewei Lin · Yan Huang · Heng Fan

[ Exhibit Hall I ]

Abstract
Egocentric visual query localization (EgoVQL) focuses on localizing the target of interest in space and time from first-person videos, given a visual query. Despite recent progress, existing methods often struggle to handle severe object appearance changes and cluttered backgrounds in the video due to the lack of sufficient target cues, leading to degradation. Addressing this, we introduce PRVQL, a novel Progressive knowledge-guided Refinement framework for EgoVQL. The core is to continuously exploit target-relevant knowledge directly from videos and utilize it as guidance to refine both query and video features for improving target localization. Our PRVQL contains multiple processing stages. The target knowledge from one stage, comprising appearance and spatial knowledge extracted via two specially designed knowledge learning modules, is utilized as guidance to refine the query and video features for the next stage, which are used to generate more accurate knowledge for further feature refinement. With such a progressive process, target knowledge in PRVQL can be gradually improved, which, in turn, leads to better refined query and video features for localization in the final stage. Compared to previous methods, our PRVQL, besides the given object cues, enjoys additional crucial target information from a video as guidance to refine features, and hence enhances …
Poster
Yansong Guo · Jie Hu · Yansong Qu · Liujuan Cao

[ Exhibit Hall I ]

Abstract
Recent advances in interactive 3D segmentation from 2D images have demonstrated impressive performance. However, current models typically require extensive scene-specific training to accurately reconstruct and segment objects, which limits their applicability in real-time scenarios. In this paper, we introduce WildSeg3D, an efficient approach that enables the segmentation of arbitrary 3D objects across diverse environments using a feed-forward mechanism. A key challenge of this feed-forward approach lies in the accumulation of 3D alignment errors across multiple 2D views, which can lead to inaccurate 3D segmentation results. To address this issue, we propose Dynamic Global Aligning (DGA), a technique that improves the accuracy of global multi-view alignment by focusing on difficult-to-match 3D points across images, using a dynamic adjustment function. Additionally, for real-time interactive segmentation, we introduce Multi-view Group Mapping (MGM), a method that utilizes an object mask cache to integrate multi-view segmentations and respond rapidly to user prompts. WildSeg3D demonstrates robust generalization across arbitrary scenes, thereby eliminating the need for scene-specific training. Specifically, WildSeg3D not only attains the accuracy of state-of-the-art (SOTA) methods but also achieves a 40$\times$ speedup compared to existing SOTA models. Our code will be publicly available.
Poster
Junli Liu · Qizhi Chen · Zhigang Wang · Yiwen Tang · Yiting Zhang · Chi Yan · Dong Wang · Xuelong Li · Bin Zhao

[ Exhibit Hall I ]

Abstract
Visual grounding (VG) aims to localize target objects in an image based on natural language descriptions. In this paper, we propose AerialVG, a new task focusing on visual grounding from aerial views. Compared to traditional VG, AerialVG poses new challenges, \emph{e.g.}, appearance-based grounding is insufficient to distinguish among multiple visually similar objects, and positional relations should be emphasized. Besides, existing VG models struggle when applied to aerial imagery, where high-resolution images cause significant difficulties. To address these challenges, we introduce the first AerialVG dataset, consisting of 5K real-world aerial images, 50K manually annotated descriptions, and 103K objects. Particularly, each annotation in the AerialVG dataset contains multiple target objects annotated with relative spatial relations, requiring models to perform comprehensive spatial reasoning. Furthermore, we propose an innovative model especially for the AerialVG task, where a Hierarchical Cross-Attention is devised to focus on target regions, and a Relation-Aware Grounding module is designed to infer positional relations. Experimental results validate the effectiveness of our dataset and method, highlighting the importance of spatial reasoning in aerial visual grounding. The code and dataset will be released.
Poster
Weida Wang · Changyong He · Jin Zeng · Di Qiu

[ Exhibit Hall I ]

Abstract
Depth images captured by Time-of-Flight (ToF) sensors are prone to noise, requiring denoising for reliable downstream applications. Previous works either focus on single-frame processing, or perform multi-frame processing without considering depth variations at corresponding pixels across frames, leading to undesirable temporal inconsistency and spatial ambiguity. In this paper, we propose a novel ToF depth denoising network leveraging motion-invariant graph fusion to simultaneously enhance temporal stability and spatial sharpness. Specifically, despite depth shifts across frames, graph structures exhibit temporal self-similarity, enabling cross-frame geometric attention for graph fusion. Then, by incorporating an image smoothness prior on the fused graph and a data fidelity term derived from the ToF noise distribution, we formulate a maximum a posteriori problem for ToF denoising. Finally, the solution is unrolled into iterative filters whose weights are adaptively learned from the graph-informed geometric attention, producing a high-performance yet interpretable network. Experimental results demonstrate that the proposed scheme achieves state-of-the-art performance in terms of accuracy and consistency on the synthetic DVToF dataset and exhibits robust generalization on the real Kinectv2 dataset.
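A generic instance of such a graph-regularized MAP objective (not the paper's exact fidelity term or learned graph weights) reads, with noisy depth $\mathbf{y}$, fused-graph Laplacian $\mathbf{L}$, and trade-off $\lambda$:

$$
\hat{\mathbf{x}} \;=\; \arg\min_{\mathbf{x}} \;\|\mathbf{x}-\mathbf{y}\|_2^2 \;+\; \lambda\,\mathbf{x}^{\top}\mathbf{L}\,\mathbf{x}
\;\;\Longleftrightarrow\;\; (\mathbf{I}+\lambda\mathbf{L})\,\hat{\mathbf{x}} = \mathbf{y}.
$$

Unrolling a fixed number of iterations of a solver for this linear system, with the graph weights supplied by the learned geometric attention, is what yields the interpretable iterative filters described above.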
Poster
Suchisrit Gangopadhyay · Jung Hee Kim · Xien Chen · Patrick Rim · Hyoungseob Park · Alex Wong

[ Exhibit Hall I ]

Abstract
Monocular depth estimation (MDE) has advanced significantly with the introduction of transformer-based foundational vision models. However, their application to fisheye images, widely used in robotics, security systems, autonomous vehicles, and augmented reality due to their wide field of view, remains challenging due to severe radial distortions and calibration differences. Standard transformer-based models trained on perspective images fail to generalize effectively to fisheye inputs, resulting in poor depth predictions. To address this, we introduce \emph{calibration tokens}, a lightweight, token-based adaptation method that allows perspective-trained foundational models to handle fisheye distortions without retraining or fine-tuning the entire network. Calibration tokens learn to realign distorted fisheye features with the perspective latent distribution in a self-supervised manner using a novel inverse warping consistency loss. This training approach leverages existing perspective image datasets and pre-trained foundational models without requiring labeled fisheye images. Experiments demonstrate that our calibration tokens improve performance on real-world fisheye datasets for monocular depth estimation tasks, surpassing baselines while maintaining computational efficiency and inference-time simplicity.
Poster
WonJun Moon · Hyun Seok Seong · Jae-Pil Heo

[ Exhibit Hall I ]

Abstract
Facilitating an entity's interaction with objects requires accurately identifying parts that afford specific actions. Weakly supervised affordance grounding (WSAG) seeks to imitate human learning from third-person demonstrations, where humans intuitively grasp functional parts without needing pixel-level annotations. To achieve this, grounding is typically learned using a shared classifier across images from different perspectives, along with distillation strategies incorporating part discovery process. However, since affordance-relevant parts are not always easily distinguishable, models primarily rely on classification, often focusing on common but unaffordable features. To address this limitation, we move beyond isolated part-level learning by introducing selective prototypical and pixel contrastive objectives that adaptively learn affordance-relevant cues at both the part and object levels, depending on the granularity of the available information. Initially, we find the action-associated objects in both egocentric (object-focused) and exocentric (third-person example) images by leveraging CLIP. Then, cross-referencing the discovered objects of complementary views, we excavate the precise part-level affordance clues in each perspective. By consistently learning to distinguish affordance-relevant regions from affordance-irrelevant background context, our approach effectively shifts activation from irrelevant areas toward meaningful affordance cues. Experimental results demonstrate the effectiveness of our method.
Poster
Fan Pei · jinchen bai · Xiang Feng · Zoubin Bi · Kun Zhou · Hongzhi Wu

[ Exhibit Hall I ]

Abstract
We present OpenSubstance, a high-quality measured dataset with 1.8 million high-dynamic-range images of 151 objects with a wide variety in shape and appearance, captured under 270 camera views and 1,637 lighting conditions, including 1,620 one-light-at-a-time, 8 environment, 8 linear and 1 full-on illumination. For each image, the corresponding lighting condition, camera parameters and foreground segmentation mask are provided. High-precision 3D geometry is also acquired for rigid objects. It takes 1 hour on average to capture one object with our custom-built high-performance lightstage and a top-grade commercial 3D scanner. We perform comprehensive quantitative evaluation on state-of-the-art techniques across different tasks, including single- and multi-view photometric stereo, as well as relighting. The project is publicly available at ***anonymous link***.
Poster
peilin Tao · Hainan Cui · Diantao Tu · Shuhan Shen

[ Exhibit Hall I ]

Abstract
Multi-camera systems are increasingly vital in the environmental perception of autonomous vehicles and robotics. Their physical configuration offers inherent fixed relative pose constraints that benefit Structure-from-Motion (SfM). However, traditional global SfM systems struggle with robustness due to their optimization framework. We propose a novel global motion averaging framework for multi-camera systems, featuring two core components: a decoupled rotation averaging module and a hybrid translation averaging module. Our rotation averaging employs a hierarchical strategy by first estimating relative rotations within rigid camera units and then computing global rigid unit rotations. To enhance the robustness of translation averaging, we incorporate both camera-to-camera and camera-to-point constraints to initialize camera positions and 3D points with a convex distance-based objective function and refine them with an unbiased non-bilinear angle-based objective function. Experiments on large-scale datasets show that our system matches or exceeds incremental SfM accuracy while significantly improving efficiency. Our framework outperforms existing global SfM methods, establishing itself as a robust solution for real-world multi-camera SfM applications. We will share our system as an open-source implementation.
Poster
Haoyu Zhao · Hao Wang · Xingyue Zhao · Hao Fei · Hongqiu Wang · Chengjiang Long · Hua Zou

[ Exhibit Hall I ]

Abstract
Recent advancements in 3D generation models have opened new possibilities for simulating dynamic 3D object movements and customizing behaviors, yet creating this content remains challenging. Current methods often require manual assignment of precise physical properties for simulations or rely on video generation models to predict them, which is computationally intensive. In this paper, we rethink the usage of multi-modal large language models (MLLMs) in physics-based simulation, and present PhysSplat, a physics-based approach that efficiently endows static 3D objects with interactive dynamics. We begin with detailed scene reconstruction and object-level 3D open-vocabulary segmentation, progressing to multi-view image in-painting. Inspired by human visual reasoning, we propose MLLM-based Physical Property Perception (MLLM-P3) to predict the mean physical properties of objects in a zero-shot manner. The Material Property Distribution Prediction model (MPDP) then estimates physical property distributions via geometry-conditioned probabilistic sampling of MLLM-P3 outputs, reformulating the problem as probability distribution estimation to reduce computational costs. Finally, we simulate objects in 3D scenes with particles sampled via the Physical-Geometric Adaptive Sampling (PGAS) strategy, efficiently capturing complex deformations and significantly reducing computational costs. Extensive experiments and user studies demonstrate that our PhysSplat achieves more realistic motion than state-of-the-art methods within 2 minutes on a single GPU.
Poster
Adam Harley · Yang You · Yang Zheng · Xinglong Sun · Nikhil Raghuraman · Sheldon Liang · Yunqi Gu · Wen-Hsuan Chu · Suya You · Achal Dave · Rares Ambrus · Katerina Fragkiadaki · Leonidas Guibas

[ Exhibit Hall I ]

Abstract
We introduce AllTracker: a method that estimates long-range point tracks by way of estimating the flow field between a query frame and every other frame of a video. Unlike existing point tracking methods, our approach delivers high-resolution and dense (all-pixel) correspondence fields, which can be visualized as flow maps. Unlike existing optical flow methods, our approach corresponds one frame to hundreds of subsequent frames, rather than just the next frame. We develop a new architecture for this task, blending techniques from existing work in optical flow and point tracking: the model performs iterative inference on low-resolution grids of correspondence estimates, propagating information spatially via 2D convolution layers, and propagating information temporally via pixel-aligned attention layers. The model is fast and parameter-efficient (16 million parameters), and delivers state-of-the-art point tracking accuracy at high resolution (i.e., tracking 768x1024 pixels, on a 40G GPU). A benefit of our design is that we can train jointly on flow datasets and point tracking datasets, and we find that doing so is crucial for top performance. We provide an extensive ablation study on our architecture details and training recipe, making it clear which details matter most. We will publicly release our code and model weights.
Poster
Christian Löwens · Thorben Funke · Jingchao Xie · Alexandru Condurache

[ Exhibit Hall I ]

Abstract
Online mapping approaches show remarkable results in predicting vectorized maps from multi-view camera images only. However, all existing approaches still rely on ground-truth high-definition maps during training, which are expensive to obtain and often not geographically diverse enough for reliable generalization. In this work, we propose PseudoMapTrainer, a novel approach to online mapping that uses pseudo labels generated from unlabeled sensor data. We derive those pseudo labels by reconstructing the road surface from multi-camera imagery using Gaussian splatting and semantics of a pre-trained 2D segmentation network. In addition, we introduce a mask-aware assignment algorithm and loss function to handle partially masked pseudo labels, allowing for the first time the training of online mapping models without any ground-truth maps. Furthermore, our pseudo labels can be effectively used to pre-train an online model in a semi-supervised manner to leverage large-scale unlabeled crowdsourced data. The code will be made publicly available.
Poster
Zhuoguang Chen · Minghui Qin · Tianyuan Yuan · Zhe Liu · Hang Zhao

[ Exhibit Hall I ]

Abstract
Recent advancements in sparse multi-view scene reconstruction have been significant, yet existing methods face limitations when processing streams of input images. These methods either rely on time-consuming offline optimization or are restricted to shorter sequences, hindering their applicability in real-time scenarios. In this work, we propose LONG3R (LOng sequence streamiNG 3D Reconstruction), a novel model designed for streaming multi-view 3D scene reconstruction over longer sequences. Our model achieves real-time processing by operating recurrently, maintaining and updating memory with each new observation. We introduce a refined decoder that facilitates coarse-to-fine interaction between memory and new observations using memory gating and a dual-source attention structure. To effectively capture long-sequence memory, we propose a 3D spatio-temporal memory that dynamically prunes redundant spatial information while adaptively adjusting resolution along the scene. To enhance our model’s performance on long sequences while maintaining training efficiency, we employ a two-stage curriculum training strategy, each stage targeting specific capabilities. Experiments on multiple multi-view reconstruction datasets demonstrate that LONG3R outperforms state-of-the-art streaming methods, particularly for longer sequences, while maintaining real-time inference speed.
Poster
JINPENG DONG · Chen Li · Yutong Lin · Jingwen Fu · Sanping Zhou · Nanning Zheng

[ Exhibit Hall I ]

Abstract
High-definition (HD) maps are an important component to support navigation and planning for autonomous driving vehicles. Predicting map elements with high quality (high classification and localization scores) is crucial to the safety of autonomous driving vehicles. However, current methods perform poorly at producing high-quality predictions. Two main factors are responsible for this: 1) inappropriate classification labels, because one-to-many matched queries share labels, and 2) sub-optimal task features, because tasks share sampling features. In this paper, we reveal two inherent defects in current methods and develop a novel HD map construction method named DAMap to address these problems. Specifically, DAMap consists of three components: Distance-aware Focal Loss (DAFL), Hybrid Loss Scheme (HLS), and Task Modulated Deformable Attention (TMDA). The DAFL is introduced to assign appropriate classification labels for one-to-many matching samples. The TMDA is proposed to obtain discriminative task-specific features. Furthermore, the HLS is proposed to better utilize the advantages of the proposed DAFL. We perform extensive experiments and consistently achieve performance improvements on the NuScenes and Argoverse2 benchmarks under different metrics, baselines, splits, backbones, and schedules.
Poster
Zhu Yihang · Jinhao Zhang · Yuxuan Wang · Aming WU · Cheng Deng

[ Exhibit Hall I ]

Abstract
As an important direction of embodied intelligence, 3D Visual Grounding has attracted much attention, aiming to identify 3D objects matching the given language description. Most existing methods follow a two-stage process, i.e., first detecting proposal objects and then identifying the right objects based on their relevance to the given query. However, when the query is complex, it is difficult to leverage an abstract language representation to lock onto the corresponding objects accurately, affecting the grounding performance. In general, given a specific object, humans usually follow two clues to finish the corresponding grounding, i.e., attribute and location clues. To this end, we explore a new mechanism, attribute-to-location clue reasoning, to conduct accurate grounding. Particularly, we propose a VGMamba network that consists of an SVD-based attribute mamba, location mamba, and multi-modal fusion mamba. Taking a 3D point cloud scene and language query as the input, we first exploit SVD to make a decomposition of the extracted features. Then, a sliding-window operation is conducted to capture attribute characteristics. Next, a location mamba is presented to obtain the corresponding location information. Finally, by means of multi-modal mamba fusion, the model could effectively localize the object that matches the given query. In the experiment, our method …
Poster
Mingquan Zhou · Chen He · Ruiping Wang · Xilin Chen

[ Exhibit Hall I ]

Abstract
Open-vocabulary 3D instance segmentation (OV-3DIS), which aims to segment and classify objects beyond predefined categories, is a critical capability for embodied AI applications. Existing methods rely on pre-trained 2D foundation models, focusing on instance-level features while overlooking contextual relationships, limiting their ability to generalize to rare or ambiguous objects. To address these limitations, we propose an OV-3DIS framework guided by contextual information. First, we employ a Class-agnostic Proposal Module, integrating a pre-trained 3D segmentation model with a SAM-guided segmenter to extract robust 3D instance masks. Subsequently, we design a Semantic Reasoning Module, which selects the best viewpoint for each instance and constructs three 2D context-aware representations. The representations are processed using Multimodal Large Language Models with Chain-of-Thought prompting to enhance semantic inference. Notably, our method outperforms state-of-the-art methods on the ScanNet200 and Replica datasets, demonstrating superior open-vocabulary segmentation capabilities. Moreover, preliminary implementation in real-world scenarios verifies our method's robustness and accuracy, highlighting its potential for embodied AI tasks such as object-driven navigation.
Poster
Boyi Sun · Yuhang Liu · Houxin He · Yonglin Tian · Fei-Yue Wang

[ Exhibit Hall I ]

Abstract
Manual annotation of 3D bounding boxes in large-scale 3D scenes is expensive and time-consuming. This motivates the exploration of annotation-free 3D object detection using unlabeled point cloud data. Existing unsupervised 3D detection frameworks predominantly identify moving objects via scene flow, which has significant limitations: (1) limited detection classes ($<3$), (2) difficulty in detecting stationary objects, and (3) reliance on high frame rates. To address these limitations, we propose AnnofreeOD, a novel Annotation-free Object Detection framework based on 2D-to-3D knowledge distillation. First, we explore an effective strategy to generate high-quality pseudo boxes using single-frame 2D knowledge. Second, we observe the noise from the previous step and introduce Noise-Resistant Regression (NRR) based on Box Augmentation (BA). AnnofreeOD achieves state-of-the-art performance across multiple experiments. On the nuScenes dataset, we established the first annotation-free 10-class object detection baseline, achieving 40\% of fully supervised performance. Furthermore, in 3-class and class-agnostic object detection tasks, our approach surpasses prior state-of-the-art methods by +9.3\% mAP (+12.2\% NDS) and +6.0\% AP (+7.2\% NDS), significantly improving precision. The code and model weights are provided in the supplementary material.
Poster
Longliang Liu · Miaojie Feng · Junda Cheng · Jijun Xiang · Xuan Zhu · Xin Yang

[ Exhibit Hall I ]

Abstract
Panoramic optical flow enables a comprehensive understanding of temporal dynamics across wide fields of view. However, severe distortions caused by sphere-to-plane projections, such as the equirectangular projection (ERP), significantly degrade the performance of conventional perspective-based optical flow methods, especially in polar regions. To address this challenge, we propose PriOr-Flow, a novel dual-branch framework that leverages the low-distortion nature of the orthogonal view to enhance optical flow estimation in these regions. Specifically, we introduce the Dual-Cost Collaborative Lookup (DCCL) operator, which jointly retrieves correlation information from both the primitive and orthogonal cost volumes, effectively mitigating distortion noise during cost volume construction. Furthermore, our Ortho-Driven Distortion Compensation (ODDC) module iteratively refines motion features from both branches, further suppressing polar distortions. Extensive experiments demonstrate that PriOr-Flow is compatible with various perspective-based iterative optical flow methods and consistently achieves state-of-the-art performance on publicly available panoramic optical flow datasets, setting a new benchmark for wide-field motion estimation.
Poster
Haoyu Zhen · Qiao Sun · Hongxin Zhang · Junyan Li · Siyuan Zhou · Yilun Du · Chuang Gan

[ Exhibit Hall I ]

Abstract
This paper presents an effective approach for learning novel 4D embodied world models, which predict the dynamic evolution of 3D scenes over time in response to an embodied agent's actions, providing both spatial and temporal consistency. We propose to learn a 4D world model by training on RGB-DN (RGB, Depth, and Normal) videos. This not only surpasses traditional 2D models by incorporating detailed shape, configuration, and temporal changes into their predictions, but also allows us to effectively learn accurate inverse dynamic models for an embodied agent. Specifically, we first extend existing robotic manipulation video datasets with depth and normal information leveraging off-the-shelf models. Next, we fine-tune a video generation model on this annotated dataset, which jointly predicts RGB-DN (RGB, Depth, and Normal) for each frame. We then present an algorithm to directly convert generated RGB, Depth, and Normal videos into a high-quality 4D scene of the world. Our method ensures temporal and spatial coherence in 4D scene predictions from embodied scenarios, enables novel view synthesis for embodied environments, and facilitates policy learning that significantly outperforms those derived from prior video-based world models.
Poster
Fatemeh Saleh · Sadegh Aliakbarian · Charlie Hewitt · Lohit Petikam · Xiao Xian · Antonio Criminisi · Thomas J. Cashman · Tadas Baltrusaitis

[ Exhibit Hall I ]

Abstract
The state of the art in human-centric computer vision achieves high accuracy and robustness across a diverse range of tasks. The most effective models in this domain have billions of parameters, thus requiring extremely large datasets, expensive training regimes, and compute-intensive inference. In this paper, we demonstrate that it is possible to train models on much smaller but high-fidelity synthetic datasets, with no loss in accuracy and higher efficiency. Using synthetic training data provides us with excellent levels of detail and perfect labels, while providing strong guarantees for data provenance, usage rights, and user consent. Procedural data synthesis also provides us with explicit control on data diversity, that we can use to address unfairness in the models we train. Extensive quantitative assessment on real input images demonstrates accuracy of our models on three dense prediction tasks: depth estimation, surface normal estimation, and soft foreground segmentation. Our models require only a fraction of the cost of training and inference when compared with foundational models of similar accuracy. We release our annotated synthetic dataset, SynthHuman, as well as our models, upon publication.
Poster
Massimiliano Viola · Kevin Qu · Nando Metzger · Bingxin Ke · Alexander Becker · Konrad Schindler · Anton Obukhov

[ Exhibit Hall I ]

Abstract
Depth completion upgrades sparse depth measurements into dense depth maps guided by a conventional image. Existing methods for this highly ill-posed task operate in tightly constrained settings, and tend to struggle when applied to images outside the training domain, as well as when the available depth measurements are sparse, irregularly distributed, or of varying density. Inspired by recent advances in monocular depth estimation, we reframe depth completion as image-conditional depth map generation guided by a sparse set of measurements. Our method, Marigold-DC, builds on a pretrained latent diffusion model (LDM) for depth estimation and injects the depth observations as test-time guidance, via an optimization scheme that runs in tandem with the iterative inference of denoising diffusion. The method exhibits excellent zero-shot generalization across a diverse range of environments and handles even extremely sparse guidance effectively. Our results suggest that contemporary monodepth priors greatly robustify depth completion: it may be better to view the task as recovering dense depth from (dense) image pixels, guided by sparse depth; rather than as inpainting (sparse) depth, guided by an image.
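Schematically, such guidance interleaves a measurement loss with the denoising loop. The sketch below uses toy stand-ins for the denoiser and depth decoder (not a real LDM API) and an arbitrary step size, purely to illustrate the guidance step.

```python
import torch

def guided_denoising_step(z_t, t, denoiser, decode_depth, sparse_idx, sparse_depth,
                          step_size=0.1):
    # After an ordinary reverse-diffusion update, nudge the latent so the decoded
    # depth agrees with the sparse measurements (toy guidance, assumed form).
    z_t = z_t.detach().requires_grad_(True)
    z_prev = denoiser(z_t, t)                        # ordinary denoising update
    depth = decode_depth(z_prev)                     # dense depth from the latent
    loss = (depth.flatten()[sparse_idx] - sparse_depth).pow(2).mean()
    grad, = torch.autograd.grad(loss, z_t)
    return (z_prev - step_size * grad).detach()

# Tiny stand-ins so the sketch runs end to end (not a real diffusion model).
denoiser = lambda z, t: 0.9 * z
decode_depth = lambda z: z.abs().mean(dim=1)         # (B, H, W) "depth"
z = torch.randn(1, 4, 8, 8)
idx = torch.tensor([3, 17, 42])                      # sparse measurement locations
meas = torch.rand(3)                                 # sparse depth values
z = guided_denoising_step(z, t=10, denoiser=denoiser, decode_depth=decode_depth,
                          sparse_idx=idx, sparse_depth=meas)
print(z.shape)
```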
Poster
Qiaomu Miao · Vivek Golani · Jingyi Xu · Progga Paromita Dutta · Minh Hoai · Dimitris Samaras

[ Exhibit Hall I ]

Abstract
This paper presents a method that utilizes multiple camera views for the gaze target estimation (GTE) task. The approach integrates information from different camera views to improve accuracy and expand applicability, addressing limitations in existing single-view methods that face challenges such as face occlusion, target ambiguity, and out-of-view targets. Our method processes a pair of camera views as input, incorporating a Head Information Aggregation (HIA) module for leveraging head information from both views for more accurate gaze estimation, an Uncertainty-based Gaze Selection (UGS) for identifying the most reliable gaze output, and an Epipolar-based Scene Attention (ESA) module for cross-view background information sharing. This approach significantly outperforms single-view baselines, especially when the second camera provides a clear view of the person's face. Additionally, our method can estimate the gaze target in the first view using the image of the person in the second view only, a capability not possessed by single-view GTE methods. The paper also introduces a multi-view dataset for developing and evaluating multi-view GTE. Data and code will be made available.
Poster
Kent Gauen · Stanley Chan

[ Exhibit Hall I ]

Abstract
This paper presents an efficient method to compute space-time superpixels and an application of the superpixels called superpixel convolution. The space-time superpixel method extends a single-image Bayesian method named BASS. Our approach, named Bayesian-inspired Space-Time Superpixels (BIST), is inspired by hill-climbing to a local mode of a Dirichlet-Process Gaussian Mixture Model conditioned on the previous frame's superpixel information. The method is only Bayesian-inspired, rather than actually Bayesian, because the split/merge steps are treated as a classification problem rather than derived from a Gibbs sampling update step. However, this heuristic reduces the number of split/merge steps from several hundred per frame to only a few. BIST is over twice as fast as BASS and over 10 times faster than other space-time superpixel methods with favorable (and sometimes superior) quality. Additionally, to garner interest in superpixels, this paper demonstrates their use within deep neural networks. We present a superpixel-weighted convolution layer for single-image denoising that outperforms standard convolution by 1 dB PSNR.
Poster
Wen Jiang · BOSHU LEI · Katrina Ashton · Kostas Daniilidis

[ Exhibit Hall I ]

Abstract
We present an active mapping system that can plan for long-horizon exploration goals and short-term actions with a 3D Gaussian Splatting (3DGS) representation. Existing methods either do not take advantage of recent developments in multimodal Large Language Models (LLMs) or do not account for localization uncertainty, which is critical for embodied agents. We propose employing multimodal LLMs for long-horizon planning in conjunction with detailed motion planning using our information-based algorithm. By leveraging high-quality view synthesis from our 3DGS representation, our method employs a multimodal LLM as a zero-shot planner for long-horizon exploration goals from the semantic perspective. We also introduce an uncertainty-aware path proposal and selection algorithm that balances the dual objectives of maximizing information gain about the environment while minimizing the cost of localization errors. Experiments conducted on the Gibson and Habitat-Matterport 3D datasets demonstrate state-of-the-art results for the proposed method.
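The uncertainty-aware path selection can be pictured as a simple trade-off between expected information gain and localization risk. The sketch below is a toy illustration of such a scoring rule, with a hypothetical weighting parameter `lam`; the paper's actual formulation is more involved.

```python
# Illustrative scoring of candidate paths that trades off expected information
# gain against localization risk (both terms and the lambda weight are toy
# stand-ins for the paper's uncertainty-aware formulation).
from dataclasses import dataclass

@dataclass
class Path:
    name: str
    info_gain: float          # expected map information gained along the path
    loc_uncertainty: float    # accumulated localization uncertainty / cost

def select_path(paths, lam: float = 0.5) -> Path:
    return max(paths, key=lambda p: p.info_gain - lam * p.loc_uncertainty)

if __name__ == "__main__":
    candidates = [Path("corridor", 3.0, 4.0), Path("loop_closure", 2.2, 0.5)]
    print(select_path(candidates).name)   # "loop_closure": safer, nearly as informative
```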
Poster
SaiKiran Tedla · Junyong Lee · Beixuan Yang · Mahmoud Afifi · Michael Brown

[ Exhibit Hall I ]

Abstract
Multispectral (MS) images capture detailed scene information across a wide range of spectral bands, making them invaluable for applications requiring rich spectral data. Integrating MS imaging into multi-camera devices, such as smartphones, has the potential to enhance both spectral applications and RGB image quality. A critical step in processing MS data is demosaicing, which reconstructs color information from the mosaic MS images captured by the camera. This paper proposes a method for MS image demosaicing specifically designed for dual-camera setups where both RGB and MS cameras capture the same scene. Our approach leverages co-captured RGB images, which typically have higher spatial fidelity, to guide the demosaicing of lower-fidelity MS images. We introduce the Dual-camera RGB-MS Dataset -- a large collection of paired RGB and MS mosaiced images with ground-truth demosaiced outputs -- that enables training and evaluation of our method. Experimental results demonstrate that our method achieves state-of-the-art accuracy compared to existing techniques.
Poster
Yue-Jiang Dong · Wang Zhao · Jiale Xu · Ying Shan · Song-Hai Zhang

[ Exhibit Hall I ]

Abstract
Diffusion-based video depth estimation methods have achieved remarkable success with strong generalization ability. However, predicting depth for long videos remains challenging. Existing methods typically split videos into overlapping sliding windows, leading to accumulated scale discrepancies across different windows, particularly as the number of windows increases. Additionally, these methods rely solely on 2D diffusion priors, overlooking the inherent 3D geometric structure of video depths, which results in geometrically inconsistent predictions. In this paper, we propose DepthSync, a novel, training-free framework using diffusion guidance to achieve scale- and geometry-consistent depth predictions for long videos. Specifically, we introduce **scale guidance** to synchronize the depth scale **across windows** and **geometry guidance** to enforce geometric alignment **within windows** based on the inherent 3D constraints in video depths. These two terms work synergistically, steering the denoising process toward consistent depth predictions. Experiments on various datasets validate the effectiveness of our method in producing depth estimates with improved scale and geometry consistency, particularly for long videos.
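As a rough illustration of synchronizing depth scale across overlapping windows, the snippet below fits a single least-squares scale factor on the overlapping frames. It is a simplified stand-in, not the diffusion-guidance formulation described above.

```python
# Hypothetical illustration of aligning per-window depth scales via the
# overlapping frames (not the authors' exact guidance formulation).
import numpy as np

def align_scale(prev_window: np.ndarray, next_window: np.ndarray, overlap: int) -> np.ndarray:
    """Scale `next_window` so its overlapping frames match `prev_window`
    in a least-squares sense, then return the rescaled window."""
    a = prev_window[-overlap:].reshape(-1)
    b = next_window[:overlap].reshape(-1)
    scale = float(a @ b) / float(b @ b + 1e-8)  # closed-form least-squares scale
    return next_window * scale

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    gt = rng.uniform(1.0, 5.0, size=(16, 8, 8))     # "true" depth video
    w1, w2 = gt[:10], gt[6:] * 1.7                  # second window mis-scaled
    w2_aligned = align_scale(w1, w2, overlap=4)
    print(np.abs(w2_aligned - gt[6:]).mean())       # ~0 after alignment
```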
Poster
Takehiko Ohkawa · Jihyun Lee · Shunsuke Saito · Jason Saragih · Fabian Prada · Yichen Xu · Shoou-I Yu · Ryosuke Furuta · Yoichi Sato · Takaaki Shiratori

[ Exhibit Hall I ]

Abstract
One can hardly model self-contact of human poses without considering underlying body shapes. For example, the pose of rubbing a belly for a person with a low BMI leads to penetration of the hand into the belly for a person with a high BMI. Despite its importance, existing self-contact datasets lack the variety of self-contact poses and precise body shapes, limiting conclusive analysis between self-contact and shapes. To address this, we begin by introducing the first extensive self-contact dataset with precise body shape registration, Goliath-SC, consisting of 383K self-contact poses across 130 subjects. Using this dataset, we propose generative modeling of a self-contact prior conditioned by body shape parameters, based on a body-part-wise latent diffusion with self-attention. We further incorporate this prior into single-view human pose estimation, while refining estimated poses to be in contact. Our experiments suggest that shape conditioning is vital to the successful modeling of self-contact pose distribution, hence improving pose estimation in self-contact from a single image.
Poster
Haihao Zhang · Yunjian Zhang · Jianing Li · Lin Zhu · Meng Lv · Yao Zhu · Yanwei Liu · Xiangyang Ji

[ Exhibit Hall I ]

Abstract
Accurate stereo matching under fast motion and extreme lighting conditions is a challenge for many vision applications. Event cameras have the advantages of low latency and high dynamic range, thus providing a reliable solution to this challenge. However, since events are sparse, obtaining dense disparity from events alone is an ill-posed problem. In this work, we propose a novel framework for event-based dense stereo via cross-sensor knowledge distillation. Specifically, a multi-level intensity-to-event distillation strategy is designed to maximize the potential of the long-range information, local texture details, and task-related knowledge of the intensity images. Simultaneously, to enforce cross-view consistency, an intensity-event joint left-right consistency module is proposed. With our framework, the extensive dense and structural information contained in intensity images is distilled to the event branch, so that the event branch alone can predict dense disparities during inference, preserving the low-latency characteristics of events. Extensive experiments conducted on the MVSEC and DSEC datasets demonstrate that our method achieves superior stereo matching performance compared to baselines, both quantitatively and qualitatively.
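A multi-level intensity-to-event distillation objective can be sketched as matching event-branch features to detached image-branch features at several scales. The snippet below is a generic illustration with hypothetical level weights, not the paper's exact strategy.

```python
# Minimal sketch of a multi-level feature distillation loss from an intensity
# (image) branch to an event branch; level weights and shapes are illustrative.
import torch
import torch.nn.functional as F

def distillation_loss(event_feats, image_feats, weights=(1.0, 1.0, 1.0)):
    """Match event-branch features to (detached) image-branch features at
    several pyramid levels so the event branch inherits dense cues."""
    loss = 0.0
    for w, fe, fi in zip(weights, event_feats, image_feats):
        loss = loss + w * F.mse_loss(fe, fi.detach())
    return loss

if __name__ == "__main__":
    event_feats = [torch.randn(2, c, s, s, requires_grad=True)
                   for c, s in [(32, 64), (64, 32), (128, 16)]]
    image_feats = [torch.randn_like(f) for f in event_feats]
    print(distillation_loss(event_feats, image_feats).item())
```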
Poster
Jie Zhu · Sungkil Lee

[ Exhibit Hall I ]

Abstract
Flare and glare are common nighttime artifacts that degrade image quality and hinder computer vision tasks. Existing synthetic datasets lack physical realism and diversity, while deep learning-based removal methods struggle in complex scenes, posing significant challenges. To address these issues, we introduce the high-quality annotated Physically-Based Flare and Glare (PBFG) dataset and a Flare and Glare Removal Network (FGRNet). PBFG comprises 2,600 flares and 4,000 glares, generated using our computational rendering scheme with diverse lens systems and optical configurations. Our advanced streak synthesis enhances template fidelity and improves streak removal accuracy. FGRNet leverages spatial-frequency features for comprehensive local and global feature extraction. It introduces a Spatial-Frequency Enhanced Module with a Spatial Reconstruction Unit and a Frequency-Enhanced Unit to extract multi-scale spatial information and enhance frequency representation. This design effectively removes complex artifacts, including large-area glares, diverse flares, and multiple or off-screen-induced streaks. Additionally, a histogram-matching module ensures stylistic and visual consistency with ground truth. Extensive experiments confirm that PBFG accurately replicates real-world patterns, and FGRNet outperforms state-of-the-art methods both quantitatively and qualitatively, with significant PSNR gains (up to 2.3 dB on full images and 3.14 dB in glare regions).
Poster
Shaohan Li · Hao Yang · Min Chen · Xiaolin Qin

[ Exhibit Hall I ]

Abstract
The increasing frequency of extreme weather events due to global climate change calls for accurate weather prediction. Recently, great advances have been made by **end-to-end methods**, thanks to deep learning techniques, but they face limitations of *representation inconsistency* in multivariable integration and struggle to effectively capture the dependencies between variables that complex weather systems require. Treating different variables as distinct modalities and applying a **two-stage training approach** from multimodal models can partially alleviate this issue, but because the training tasks of the two stages do not match, the results are often suboptimal. To address these challenges, we propose an implicit two-stage training method, configuring separate encoders and decoders for each variable. In detail, in the first stage the Translator is frozen while the Encoders and Decoders learn a shared latent space; in the second stage, the Encoders and Decoders are frozen and the Translator captures inter-variable interactions for prediction. In addition, introducing a self-attention mechanism for multivariable fusion in the latent space further improves performance. Empirically, extensive experiments show state-of-the-art performance of our method. Specifically, it reduces the MSE for near-surface air temperature and relative humidity predictions by 28.82% and 23.39%, respectively.
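The implicit two-stage schedule amounts to alternating which modules are trainable. Below is a minimal sketch of that freezing logic, assuming placeholder `encoders`, `decoders`, and `translator` modules.

```python
# Hedged sketch of the described two-stage freezing scheme (module names are
# placeholders): stage 1 trains encoders/decoders with the Translator frozen,
# stage 2 trains the Translator with encoders/decoders frozen.
import torch.nn as nn

def set_trainable(module: nn.Module, flag: bool) -> None:
    for p in module.parameters():
        p.requires_grad = flag

def configure_stage(encoders, decoders, translator, stage: int):
    if stage == 1:     # learn a shared latent space
        set_trainable(translator, False)
        set_trainable(encoders, True)
        set_trainable(decoders, True)
    elif stage == 2:   # learn inter-variable interactions for prediction
        set_trainable(translator, True)
        set_trainable(encoders, False)
        set_trainable(decoders, False)

if __name__ == "__main__":
    enc, dec, trans = nn.Linear(8, 4), nn.Linear(4, 8), nn.Linear(4, 4)
    configure_stage(enc, dec, trans, stage=2)
    print([p.requires_grad for p in trans.parameters()])   # [True, True]
```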
Poster
Yifan Jiao · Yunhao Li · Junhua Ding · Qing Yang · Song Fu · Heng Fan · Libo Zhang

[ Exhibit Hall I ]

Abstract
In this paper, we present a novel benchmark, GSOT3D, that aims at facilitating development of generic 3D single object tracking (SOT) in the wild. Specifically, GSOT3D offers 620 sequences with 123K frames, and covers a wide selection of 54 object categories. Each sequence is offered with multiple modalities, including the point cloud (PC), RGB image, and depth. This allows GSOT3D to support various 3D tracking tasks, such as single-modal 3D SOT on PC and multi-modal 3D SOT on RGB-PC or RGB-D, and thus greatly broadens research directions for 3D object tracking. To provide high-quality per-frame 3D annotations, all sequences are labeled manually with multiple rounds of meticulous inspection and refinement. To the best of our knowledge, GSOT3D is the largest benchmark dedicated to various generic 3D object tracking tasks. To understand how existing 3D trackers perform and to provide comparisons for future research on GSOT3D, we assess eight representative point cloud-based tracking models. Our evaluation results show that these models heavily degrade on GSOT3D, and more efforts are required for robust and generic 3D object tracking. Besides, to encourage future research, we present a simple yet effective generic 3D tracker, named PROT3D, that localizes the target object via a progressive spatial-temporal network …
Poster
Tongtong Cheng · Rongzhen Li · Yixin Xiong · Tao Zhang · Jing Wang · Kai Liu

[ Exhibit Hall I ]

Abstract
Accurate driving behavior recognition and reasoning are critical for autonomous driving video understanding. However, existing methods often capture only shallow causal relationships, fail to address spurious correlations across modalities, and ignore ego-vehicle-level causality modeling. To overcome these limitations, we propose a novel Multimodal Causal Analysis Model (MCAM) that constructs latent causal structures between visual and language modalities. Firstly, we design a multi-level feature extractor to capture long-range dependencies. Secondly, we design a causal analysis module that dynamically models driving scenarios using a directed acyclic graph (DAG) of driving states. Thirdly, we utilize a vision-language transformer to align critical visual features with their corresponding linguistic expressions. Extensive experiments on the BDD-X and CoVLA datasets demonstrate that MCAM achieves SOTA performance in visual-language causal relationship learning. Furthermore, the model exhibits superior capability in capturing causal characteristics within video sequences, showcasing its effectiveness for autonomous driving applications.
Poster
Xuejian Gou · Fang Liu · Licheng Jiao · Shuo Li · Lingling Li · Hao Wang · Xu Liu · Puhua Chen · wenping ma

[ Exhibit Hall I ]

Abstract
In real-world scenarios, objects and their parts inherently possess both coarse-grained differences and intricate fine-grained structural relationships. These characteristics can be formalized as knowledge, leveraged for fine-grained part comprehension. However, existing part segmentation models consistently fail to capture these complex inter-part relationships, treating parts as independent entities and disregarding object-level distinctions. To address these limitations, we propose a novel Knowledge-Guided Part Segmentation (KPS) framework. Our approach automatically extracts structural relationships between parts using a large language model (LLM) and integrates them into a knowledge graph. Subsequently, a structural knowledge guidance module employs a graph convolutional network (GCN) to model these relationships. Furthermore, a coarse-grained object guidance module captures object-specific distinctions and integrates them as visual guidance. The integrated insights from the part structure and object differentiation guide the fine-grained part segmentation. Our KPS achieves notable improvements in segmentation performance, with a 4.96% mIoU gain on PartImageNet and a 3.73% gain on Pascal-Part. Moreover, in the open-vocabulary setting on Pascal-Part-116, it improves hIoU by 3.25%, highlighting the effectiveness of knowledge guidance in enhancing fine-grained part segmentation.
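The structural knowledge guidance can be illustrated with a single graph-convolution step over an LLM-derived part graph. The snippet below shows the standard normalized-adjacency GCN update on toy part nodes; it is a generic illustration, not the authors' module.

```python
# Illustrative sketch (not the authors' code) of propagating part-relationship
# knowledge with one graph-convolution step over an LLM-derived graph.
import numpy as np

def gcn_layer(node_feats: np.ndarray, adj: np.ndarray, weight: np.ndarray) -> np.ndarray:
    """One GCN step: symmetric-normalized adjacency times features times weights."""
    a_hat = adj + np.eye(adj.shape[0])                    # add self-loops
    d_inv_sqrt = np.diag(1.0 / np.sqrt(a_hat.sum(1)))
    return np.maximum(d_inv_sqrt @ a_hat @ d_inv_sqrt @ node_feats @ weight, 0.0)

if __name__ == "__main__":
    # Toy graph: 4 part nodes (e.g., head/torso/leg/tail) with a few relations.
    adj = np.array([[0, 1, 0, 0], [1, 0, 1, 1], [0, 1, 0, 0], [0, 1, 0, 0]], float)
    feats = np.random.randn(4, 16)
    print(gcn_layer(feats, adj, np.random.randn(16, 16) * 0.1).shape)   # (4, 16)
```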
Poster
Hayeon Kim · Ji Ha Jang · Se Young Chun

[ Exhibit Hall I ]

Abstract
Due to limited 3D data, recent prior arts in 3D editing rely mainly on the Score Distillation Sampling (SDS) loss that edits and segments in 2D rendered views using pre-trained diffusion priors and then projects back onto 3D space to update the model. While these approaches are effective for 3D instance-level editing, they struggle with 3D part-level editing especially for Gaussian splatting due to inconsistent multi-view 2D part segmentations and inherently ambiguous SDS loss with localized nature of Gaussians. To address these limitations, we propose RoMaP, a novel local 3D Gaussian editing that enables drastic part-level changes. Firstly, we propose 3D-geometry aware label prediction (3D-GALP) exploiting the uncertainty in soft-label segmentations. Secondly, we propose a regularized SDS loss with masks that consists of a usual SDS loss with the predicted 3D mask and an L1 regularizer as an anchor loss for high-quality part-edited 2D images using our proposed scheduled latent mixing and part editing (SLaMP) method. Our SDS loss improves flexibility in local editing by removing 3D masked regions, allowing changes beyond existing context. SLaMP uses the projected 2D mask of the predicted 3D mask to confine modifications to the target region while preserving contextual coherence. Experimental results demonstrate that …
Poster
Siqi Zhang · Yanyuan Qiao · Qunbo Wang · Zike Yan · Qi Wu · Zhihua Wei · Jing Liu

[ Exhibit Hall I ]

Abstract
Vision-and-Language Navigation (VLN) tasks have gained prominence within artificial intelligence research due to their potential application in fields like home assistants. Many contemporary VLN approaches, while based on transformer architectures, have increasingly incorporated additional components such as external knowledge bases or map information to enhance performance. These additions, while boosting performance, also lead to larger models and increased computational costs. In this paper, to achieve both high performance and low computational costs, we propose a novel architecture with the **co**mbination of **s**elective **m**em**o**rization (COSMO). Specifically, COSMO integrates state-space modules and transformer modules, and incorporates two VLN-customized selective state space modules: the Round Selective Scan (RSS) and the Cross-modal Selective State Space Module (CS3). RSS facilitates comprehensive inter-modal interactions within a single scan, while the CS3 module adapts the selective state space module into a dual-stream architecture, thereby enhancing the acquisition of cross-modal interactions. Experimental validations on three mainstream VLN benchmarks, REVERIE, R2R, and R2R-CE, not only demonstrate competitive navigation performance of our model but also show a significant reduction in computational costs.
Poster
Ilya A. Petrov · Riccardo Marin · Julian Chibane · Gerard Pons-Moll

[ Exhibit Hall I ]

Abstract
Modeling 3D human-object interaction (HOI) is a problem of great interest for computer vision and a key enabler for virtual and mixed-reality applications. Existing methods work in a one-way direction: some recover plausible human interactions conditioned on a 3D object; others recover the object pose conditioned on a human pose. Instead, we provide the first unified model - TriDi which works in any direction. Concretely, we generate Human, Object, and Interaction modalities simultaneously with a new three-way diffusion process, allowing us to model seven distributions with one network. We implement TriDi as a transformer attending to the various modalities' tokens, thereby discovering conditional relations between them. The user can control the interaction either as a text description of HOI or a contact map. We embed these two representations onto a shared latent space, combining the practicality of text descriptions with the expressiveness of contact maps. Using a single network, TriDi unifies all the special cases of prior work and extends to new ones modeling a family of seven distributions. Remarkably, despite using a single model, TriDi-generated samples surpass one-way specialized baselines on GRAB and BEHAVE in terms of both qualitative and quantitative metrics, and demonstrate better diversity. We show applicability …
Poster
Xuan Yao · Junyu Gao · Changsheng Xu

[ Exhibit Hall I ]

Abstract
Vision-and-Language Navigation in Continuous Environments (VLN-CE) requires agents to execute sequential navigation actions in complex environments guided by natural language instructions. Current approaches often struggle with generalizing to novel environments and adapting to ongoing changes during navigation. Inspired by human cognition, we present NavMorph, a self-evolving world model framework that enhances environmental understanding and decision-making in VLN-CE tasks. NavMorph employs compact latent representations to model environmental dynamics, equipping agents with foresight for adaptive planning and policy refinement. By integrating a novel Contextual Evolution Memory, NavMorph leverages scene-contextual information to support effective navigation while maintaining online adaptability. Extensive experiments demonstrate that our method achieves notable performance improvements on popular VLN-CE benchmarks. Code is available in the Supplementary Material.
Poster
Shani Gamrian · Hila Barel · Feiran Li · Masakazu Yoshimura · Daisuke Iso

[ Exhibit Hall I ]

Abstract
Object detection models are typically applied to standard RGB images processed through Image Signal Processing (ISP) pipelines, which are designed to enhance sensor-captured RAW images for human vision. However, these ISP functions can lead to a loss of critical information that may be essential in optimizing for computer vision tasks, such as object detection. In this work, we introduce Raw Adaptation Module (RAM), a module designed to replace the traditional ISP, with parameters optimized specifically for RAW object detection. Inspired by the parallel processing mechanisms of the human visual system, RAM departs from existing learned ISP methods by applying multiple ISP functions in parallel rather than sequentially, allowing for a more comprehensive capture of image features. These processed representations are then fused in a specialized module, which dynamically integrates and optimizes the information for the target task. This novel approach not only leverages the full potential of RAW sensor data but also enables task-specific pre-processing, resulting in superior object detection performance. Our approach outperforms RGB-based methods and achieves state-of-the-art results across diverse RAW image datasets under varying lighting conditions and dynamic ranges.
Poster
Xiao Chen · Tai Wang · Quanyi Li · Tao Huang · Jiangmiao Pang · Tianfan Xue

[ Exhibit Hall I ]

Abstract
Generalizable active mapping in complex unknown environments remains a critical challenge for mobile robots. Existing methods, constrained by limited training data and conservative exploration strategies, struggle to generalize across scenes with diverse layouts and complex connectivity. To enable scalable training and reliable evaluation, we present GLEAM-Bench, the first large-scale benchmark with 1,152 diverse 3D scenes from synthetic and real datasets. In this work, we propose GLEAM, a generalizable exploration policy for active mapping. Its superior generalizability comes from our semantic representations, long-term goal, and randomized strategies. It significantly outperforms state-of-the-art methods, achieving 68.16% coverage (+11.41%) with efficient trajectories, and improved mapping accuracy on 128 unseen complex scenes.
Poster
Gencer Sumbul · Chang Xu · Emanuele Dalsasso · Devis Tuia

[ Exhibit Hall I ]

Abstract
From optical sensors to microwave radars, leveraging the complementary strengths of remote sensing (RS) sensors is of great importance for achieving dense spatio-temporal monitoring of our planet. In contrast, recent deep learning models—task-specific or foundational—are often specific to single sensors or to fixed combinations: adapting such models to different sensory inputs requires both architectural changes and re-training, limiting scalability and generalization across multiple RS sensors. On the contrary, a single model able to modulate its feature representations to accept diverse sensors as input would pave the way to agile and flexible multi-sensor RS data processing. To address this, we introduce SA-MAE, a generic and versatile foundation model lifting sensor-specific/dependent efforts and enabling scalability and generalization to diverse RS sensors: SA-MAE projects data from heterogeneous sensors into a shared spectrum-aware space, enabling the usage of arbitrary combinations of bands—a key discriminative property for RS—both for training and inference. To obtain sensor-agnostic representations, we train a single, unified transformer model reconstructing masked multi-sensor data with cross-sensor token mixup. On both single- and multi-modal tasks across diverse sensors, SA-MAE outperforms previous models that rely on sensor-specific pretraining.
Poster
Björn Braun · Rayan Armani · Manuel Meier · Max Moebus · Christian Holz

[ Exhibit Hall I ]

Abstract
Egocentric vision systems aim to understand the spatial surroundings and the wearer's behavior inside it, including motions, activities, and interaction with objects. Meta's Project Aria 2 recently added a heart rate (HR) contact sensor to additionally capture the wearer's cardiac activity, which can impact the person's attention and situational responses. In this paper, we propose egoPPG, a novel non-contact-based method to recover cardiac activity from the eye-tracking cameras in previous egocentric vision systems. Our method continuously estimates the person's photoplethysmogram (PPG) from areas around the eyes and fuses motion cues from the headset's inertial measurement unit to track HR values. We demonstrate egoPPG's downstream benefit for existing egocentric datasets on EgoExo4D, where we find that augmenting existing models with tracked HR values improves proficiency estimation by 14%. To train and validate egoPPG, we collected a dataset of 13+ hours of eye-tracking videos from Project Aria and contact-based blood volume pulse signals as well as an electrocardiogram (ECG) for ground-truth HR values. 25 participants performed diverse everyday activities such as office work, cooking, dancing, and exercising, which induced significant natural motion and HR variation (44 - 164 bpm). Our model robustly estimates HR (MAE=7.67 bpm) and captures patterns (r=0.85). Our results …
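As a back-of-the-envelope illustration of going from a PPG-like waveform to a heart-rate value (much simpler than the learned estimation and IMU fusion described above), one can band-pass the signal to the plausible HR range and count peaks:

```python
# Toy illustration (far simpler than the learned model in the paper) of turning
# a PPG-like waveform into a heart-rate estimate by band-pass filtering to the
# plausible HR range and counting peaks.
import numpy as np
from scipy.signal import butter, filtfilt, find_peaks

def estimate_hr(ppg: np.ndarray, fs: float) -> float:
    lo, hi = 0.7, 3.0                      # 42-180 bpm band
    b, a = butter(3, [lo / (fs / 2), hi / (fs / 2)], btype="band")
    filtered = filtfilt(b, a, ppg)
    peaks, _ = find_peaks(filtered, distance=fs / hi)
    return 60.0 * len(peaks) / (len(ppg) / fs)

if __name__ == "__main__":
    fs, seconds, hr_true = 30.0, 60, 72       # 30 fps eye-camera-like sampling
    t = np.arange(fs * seconds) / fs
    ppg = np.sin(2 * np.pi * (hr_true / 60.0) * t) + 0.3 * np.random.randn(len(t))
    print(round(estimate_hr(ppg, fs)))        # approximately 72
```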
Poster
Fu-Jen Tsai · Yan-Tsung Peng · Yen-Yu Lin · Chia-Wen Lin

[ Exhibit Hall I ]

Abstract
Image dehazing aims to remove unwanted hazy artifacts in images. Although previous research has collected paired real-world hazy and haze-free images to improve dehazing models' performance in real-world scenarios, these models often experience significant performance drops when handling unseen real-world hazy images due to limited training data. This issue motivates us to develop a flexible domain adaptation method to enhance dehazing performance during testing. Observing that predicting haze patterns is generally easier than recovering clean content, we propose the Physics-guided Haze Transfer Network (PHATNet) which transfers haze patterns from unseen target domains to source-domain haze-free images, creating domain-specific fine-tuning sets to update dehazing models for effective domain adaptation. Additionally, we introduce a Haze-Transfer-Consistency loss and a Content-Leakage Loss to enhance PHATNet's disentanglement ability. Experimental results demonstrate that PHATNet significantly boosts state-of-the-art dehazing models on benchmark real-world image dehazing datasets.
Poster
Sitao Zhang · Hongda Mao · Qingshuang Chen · Yelin Kim

[ Exhibit Hall I ]

Abstract
Visual place recognition is crucial for autonomous navigation and robotic mapping. Current methods struggle with perceptual aliasing and computational inefficiency. We present SemVPR, a novel approach integrating multimodal semantic knowledge into VPR. By leveraging a pre-trained vision-language model as a teacher during the training phase, SemVPR learns local visual and semantic descriptors simultaneously, effectively mitigating perceptual aliasing through semantic-aware aggregation without extra inference cost. The proposed nested descriptor learning strategy generates a series of ultra-compact global descriptors, reduced by approximately compared to state-of-the-art methods, in a coarse-to-fine manner, eliminating the need for offline dimensionality reduction or training multiple models. Extensive experiments across various VPR benchmarks demonstrate that SemVPR consistently outperforms state-of-the-art methods with significantly lower computational costs, rendering its feasibility for latency-sensitive scenarios in real-world applications.
Poster
CHANGHEE YANG · Hyeonseop Song · Seokhun Choi · Seungwoo Lee · Jaechul Kim · Hoseok Do

[ Exhibit Hall I ]

Abstract
Despite considerable efforts to enhance the generalization of 3D pose estimators without costly 3D annotations, existing data augmentation methods struggle in real-world scenarios with diverse human appearances and complex poses. We propose PoseSyn, a novel data synthesis framework that transforms abundant in-the-wild 2D pose datasets into diverse 3D pose–image pairs. PoseSyn comprises two key components: Error Extraction Module (EEM), which identifies challenging poses from the 2D pose datasets, and Motion Synthesis Module (MSM), which synthesizes motion sequences around the challenging poses. Then, by generating realistic 3D training data via a human animation model--aligned with challenging poses and appearances--PoseSyn boosts the accuracy of various 3D pose estimators by up to 14% across real-world benchmarks including various backgrounds and occlusions, challenging poses, and multi-view scenarios. Extensive experiments further confirm that PoseSyn is a scalable and effective approach for improving generalization without relying on expensive 3D annotations, regardless of the pose estimator's model size or design.
Poster
Yun Li · Yiming Zhang · Tao Lin · Xiangrui Liu · Wenxiao Cai · Zheng Liu · Bo Zhao

[ Exhibit Hall I ]

Abstract
The use of Multimodal Large Language Models (MLLMs) as an end-to-end solution for Embodied AI and Autonomous Driving has become a prevailing trend. While MLLMs have been extensively studied for visual semantic understanding tasks, their ability to perform precise and quantitative spatial-temporal understanding in real-world applications remains largely unexamined, leading to uncertain prospects. To address this gap, we introduce ST-Bench, a benchmark designed to evaluate MLLMs' spatial-temporal understanding through challenging tasks such as estimating and predicting the appearance, pose, displacement, and motion of objects. Our benchmark encompasses a wide range of robot and vehicle operations across desktop, indoor, and outdoor scenarios. Extensive experiments reveal that state-of-the-art MLLMs still struggle in real-world spatial-temporal understanding, especially in tasks requiring precise distance estimation and motion analysis.
Poster
Anna-Maria Halacheva · Yang Miao · Jan-Nico Zaech · Xi Wang · Luc Gool · Danda Pani Paudel

[ Exhibit Hall I ]

Abstract
3D scene understanding is a long-standing challenge in computer vision and a key component in enabling mixed reality, wearable computing, and embodied AI. Providing a solution to these applications requires a multifaceted approach that covers scene-centric, object-centric, as well as interaction-centric capabilities. While there exist numerous datasets approaching the former two problems, the task of understanding interactable and articulated objects is underrepresented and only partly covered in the research field. In this work, we address this shortcoming by introducing: (1) Articulate3D, an expertly curated 3D dataset featuring high-quality manual annotations on 280 indoor scenes. Articulate3D provides 8 types of annotations for articulated objects, covering parts and detailed motion information, all stored in a standardized scene representation format designed for scalable 3D content creation, exchange and seamless integration into simulation environments. (2) USDNet, a novel unified framework capable of simultaneously predicting part segmentation along with a full specification of motion attributes for articulated objects. We evaluate USDNet on Articulate3D as well as two existing datasets, demonstrating the advantage of our unified dense prediction approach. Furthermore, we highlight the value of Articulate3D through cross-dataset and cross-domain evaluations and showcase its applicability in downstream tasks such as scene editing through LLM prompting and robotic …
Poster
Khurram Azeem Hashmi · Karthik Suresh · Didier Stricker · Muhammad Zeshan Afzal

[ Exhibit Hall I ]

Abstract
Low-light conditions significantly degrade the performance of high-level vision tasks. Existing approaches either enhance low-light images without considering normal illumination scenarios, leading to poor generalization or are tailored to specific tasks. We propose **TorchAdapt**, a real-time adaptive feature enhancement framework that generalizes robustly across varying illumination conditions without degrading performance in well-lit scenarios. TorchAdapt consists of two complementary modules: the **Torch** module enhances semantic features beneficial for downstream tasks, while the **Adapt** module dynamically modulates these enhancements based on input content. Leveraging a novel light-agnostic learning strategy, TorchAdapt aligns feature representations of enhanced and well-lit images to produce powerful illumination-invariant features. Extensive experiments on multiple high-level vision tasks, including object detection, face detection, instance segmentation, semantic segmentation, and video object detection, demonstrate that TorchAdapt consistently outperforms state-of-the-art low-light enhancement and task-specific methods in both low-light and light-agnostic settings. TorchAdapt thus provides a unified, flexible solution for robust visual perception across diverse lighting conditions. Code and models are provided as supplementary.
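The enhance-then-modulate idea can be sketched as an enhancement branch whose output is gated by a lightweight, input-dependent modulation. The module below is an illustrative analogue of the described Torch/Adapt split, not the released code; all layer shapes are assumptions.

```python
# Hedged sketch of the enhance-then-modulate idea: an enhancement branch
# proposes a feature correction and a lightweight gate, predicted from the
# input itself, decides how much of it to apply (module shapes are illustrative).
import torch
import torch.nn as nn

class EnhanceAndAdapt(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.enhance = nn.Conv2d(channels, channels, 3, padding=1)   # "Torch"-like branch
        self.gate = nn.Sequential(                                   # "Adapt"-like branch
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        correction = self.enhance(feats)
        alpha = self.gate(feats)              # per-channel modulation in [0, 1]
        return feats + alpha * correction     # identity-preserving in well-lit scenes

if __name__ == "__main__":
    x = torch.randn(2, 16, 32, 32)
    print(EnhanceAndAdapt(16)(x).shape)
```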
Poster
Christopher Xie · Armen Avetisyan · Henry Howard-Jenkins · Yawar Siddiqui · Julian Straub · Richard Newcombe · Vasileios Balntas · Jakob Engel

[ Exhibit Hall I ]

Abstract
We present a novel human-in-the-loop approach to estimate 3D scene layout that uses human feedback from an egocentric standpoint. We study this approach through introduction of a novel local correction task, where users identify local errors and prompt a model to automatically correct them. Building on SceneScript, a state-of-the-art framework for 3D scene layout estimation that leverages structured language, we propose a solution that structures this problem as "infilling", a task studied in natural language processing. We train a multi-task version of SceneScript that maintains performance on global predictions while significantly improving its local correction ability. We integrate this into a human-in-the-loop system, enabling a user to iteratively refine scene layout estimates via a low-friction "one-click fix" workflow. Our system enables the final refined layout to diverge from the training distribution, allowing for more accurate modelling of complex layouts.
Poster
Matteo Poggi · Fabio Tosi

[ Exhibit Hall I ]

Abstract
We present FlowSeek, a novel framework for optical flow requiring minimal hardware resources for training. FlowSeek marries the latest advances in the design space of optical flow networks with cutting-edge single-image depth foundation models and classical low-dimensional motion parametrization, implementing a compact, yet accurate architecture. FlowSeek is trained on a single consumer-grade GPU, a hardware budget about 8× lower than most recent methods, and still achieves the best cross-dataset generalization on Sintel Final and KITTI, with relative improvements of 10% and 15% over the previous state-of-the-art, as well as on the Spring and LayeredFlow datasets, representing a solid step towards more responsible hardware use.
Poster
Songyan Zhang · Yongtao Ge · Jinyuan Tian · Guangkai Xu · Hao Chen · Chen Lv · Chunhua Shen

[ Exhibit Hall I ]

Abstract
3D reconstruction in dynamic scenes primarily relies on the combination of geometry estimation and matching modules where the latter task is pivotal for distinguishing dynamic regions which can help to mitigate the interference introduced by moving objects. Furthermore, the matching module explicitly models object motion, enabling the tracking of specific targets and advancing motion understanding in complex scenarios. Recently, the proposed representation of pointmap in DUSt3R suggests a potential solution to unify both geometry estimation and matching in 3D space, but it still struggles with ambiguous matching in dynamic regions, which may hamper further improvement. In this work, we present POMATO, a unified framework for dynamic 3D reconstruction by marrying pointmap matching with temporal motion. Specifically, our method first learns an explicit matching relationship by mapping corresponding RGB pixels across different views to 3D pointmaps within a shared coordinate system. Furthermore, we introduce a temporal motion module for dynamic motions that ensures scale consistency across different frames and enhances performance in tasks requiring both precise geometry and reliable matching, most notably 3D point tracking. We show the effectiveness of the proposed pointmap matching and temporal fusion paradigm by demonstrating the remarkable performance across multiple downstream tasks, including video depth estimation, …
Poster
Yongsheng Yuan · Jie Zhao · Dong Wang · Huchuan Lu

[ Exhibit Hall I ]

Abstract
Modern visual trackers have achieved robust performance with precisely initialized target bounding boxes. However, providing high-precision initial annotations is a process both labor-intensive and error-prone in real-world scenarios. Interactive initialization (e.g., click-based, scribble-based) presents a more practical alternative. In this paper, we introduce a unified Click-and-Track (CAT) framework for full-process tracking, eliminating the need for auxiliary models or complex initializing pipelines. We present a novel fine-tuning paradigm that bridges the information gap inherent in click-based initialization through two key innovations: 1) The proposed click-based location and joint spatial-visual prompt refinement are sequentially performed to remedy the geometric information loss (e.g., boundary ambiguity, shape uncertainty) inherent in click-based initialization. 2) We design a parameter-efficient module called CTMoE to leverage the tracker's inherent capabilities when fine-tuning. The proposed CTMoE enables the foundation model to learn different matching patterns, unifying click-based initialization and tracking within a unified architecture. Extensive experimental results demonstrate state-of-the-art performance of our click-based tracking method on the LaSOT benchmark (70.5% AUC) while maintaining parameter efficiency, surpassing existing click-based tracking frameworks by a large margin and even outperforming some bounding-box-initialized trackers.
Poster
Yapeng Meng · Yihan Lin · Taoyi Wang · Yuguo Chen · Lijian Wang · Rong Zhao

[ Exhibit Hall I ]

Abstract
Recording and reconstructing high-speed scenes poses a significant challenge. The high bandwidth of high-speed cameras makes continuous recording unsustainable, while the frame interpolation methods using traditional RGB cameras (typically 30 fps) introduce artifacts and are affected by motion blur. Leveraging sensors inspired by the human visual system, such as event cameras, provides high-speed, sparse temporal- or spatial-variation data that alleviates the ill-conditioned problem of high-speed reconstruction with traditional RGB cameras. However, existing methods still suffer from RGB blur, temporal aliasing, and loss of event information. To overcome the above challenges, we leverage a novel dual-pathway complementary vision sensor, which outputs high-speed, sparse spatio-temporal differential frames between two RGB frames as reconstruction conditions. Further, we propose a cascaded bi-directional recurrent diffusion model (CBRDM) that can achieve accurate, sharp, color-rich video frame reconstruction results. Our method improves the LPIPS metric by 37.6% over state-of-the-art RGB interpolation algorithms and achieves superior performance in real-world comparisons with event cameras. Our code and dataset will be publicly available.
Poster
Sihang Li · Zeyu Jiang · Grace Chen · Chenyang Xu · Siqi Tan · Xue Wang · Irving Fang · Kristof Zyskowski · Shannon McPherron · Radu Iovita · Chen Feng · Jing Zhang

[ Exhibit Hall I ]

Abstract
3D reassembly is a challenging spatial intelligence task with broad applications across scientific domains. While large-scale synthetic datasets have fueled promising learning-based approaches, their generalizability to different domains is limited. Critically, it remains uncertain whether models trained on synthetic datasets can generalize to real-world fractures where breakage patterns are more complex. To bridge this gap, we propose \acronym{}, a **g**eneralizable 3D re**a**ssembly framework for **r**eal-world **f**ractures. \acronym{} leverages fracture-aware pretraining to learn fracture features from individual fragments, while flow matching enables precise 6-DoF alignments. At inference time, we introduce one-step preassembly, improving robustness to unseen objects and varying numbers of fractures. In collaboration with archaeologists, paleoanthropologists, and ornithologists, we curate \dataset{}, a diverse dataset for vision and learning communities, featuring real-world fracture types across ceramics, bones, eggshells, and lithics. Comprehensive experiments have demonstrated our approach consistently outperforms state-of-the-art methods on both synthetic and real-world datasets, achieving 82.87% lower rotation error and 25.15% higher part accuracy. This work sheds light on training on synthetic data to advance real-world 3D puzzle solving, showcasing its strong generalization across unseen object shapes and diverse fracture types.
Poster
Qingcheng Zhao · Xiang Zhang · Haiyang Xu · Zeyuan Chen · Jianwen Xie · Yuan Gao · Zhuowen Tu

[ Exhibit Hall I ]

Abstract
We propose DePR, a novel depth-guided single-view scene reconstruction framework that integrates instance-level diffusion priors. Our approach follows a compositional reconstruction paradigm, where individual objects are first generated before being arranged into a coherent scene. Unlike previous methods that solely use depth for object layout estimation during inference—thus underutilizing its rich geometric information—DePR leverages depth throughout both training and inference. Specifically, we introduce depth-guided conditioning to effectively encode shape priors into image-conditioned diffusion models. During inference, depth further aids in layout optimization and guided DDIM sampling, ensuring better alignment between reconstructed objects and the input image. Despite being trained on limited synthetic data, DePR achieves state-of-the-art performance and strong generalizability in single-view scene reconstruction, as demonstrated through evaluations on both synthetic and real-world datasets.
Poster
Yuedong Tan · Zongwei Wu · Yuqian Fu · Zhuyun Zhou · Guolei Sun · Eduard Zamfir · Chao Ma · Danda Pani Paudel · Luc Gool · Radu Timofte

[ Exhibit Hall I ]

Abstract
Multimodal sensing has proven valuable for visual tracking, as different sensor types offer unique strengths in handling specific challenging scenes where object appearance varies. While a generalist model capable of leveraging all modalities would be ideal, development is hindered by data sparsity: in practice, typically only one modality is available at a time. Therefore, it is crucial to ensure that knowledge gained from multimodal sensing -- such as identifying relevant features and regions -- is effectively shared, even when certain modalities are unavailable at inference. We venture with a simple assumption: similar samples across different modalities have more knowledge to share than otherwise. To implement this, we employ a classifier with a weak loss tasked with distinguishing between modalities. More specifically, if the classifier "fails" to accurately identify the modality of the given sample, this signals an opportunity for cross-modal knowledge sharing. Intuitively, knowledge transfer is facilitated whenever a sample from one modality is sufficiently close and aligned with another. Technically, we achieve this by routing samples from one modality to the expert of the others, within a mixture-of-experts framework designed for multimodal video object tracking. During the inference, the expert of the respective modality is chosen, which …
Poster
Emery Pierson · Lei Li · Angela Dai · Maks Ovsjanikov

[ Exhibit Hall I ]

Abstract
Deep functional maps have recently emerged as a powerful tool for solving non-rigid shape correspondence tasks. Methods that use this approach combine the power and flexibility of the functional map framework, with data-driven learning for improved accuracy and generality. However, most existing methods in this area restrict the learning aspect only to the feature functions and still rely on axiomatic modeling for formulating the training loss or for functional map regularization inside the networks. This limits both the accuracy and the applicability of the resulting approaches only to scenarios where assumptions of the axiomatic models hold. In this work, we show, for the first time, that both in-network regularization and functional map training can be replaced with data-driven methods. For this, we first train a generative model of functional maps in the spectral domain using score-based generative modeling, built from a large collection of high-quality maps. We then exploit the resulting model to promote the structural properties of ground truth functional maps on new shape collections. Remarkably, we demonstrate that the learned models are category-agnostic, and can fully replace commonly used strategies such as enforcing Laplacian commutativity or orthogonality of functional maps. Our key technical contribution is a novel distillation …
Poster
Seunghyun Lee · Tae-Kyun Kim

[ Exhibit Hall I ]

Abstract
Latest diffusion models have shown promising results in category-level 6D object pose estimation by modeling the conditional pose distribution with depth image input. Existing methods, however, suffer from slow training convergence, as the encoder is learned with the diffusion denoising network in an end-to-end fashion, and require an additional network that evaluates sampled pose hypotheses to filter out low-quality pose candidates. In this paper, we propose a novel pipeline that tackles these limitations with two key components. First, the proposed method pre-trains the encoder with a direct pose regression head, and jointly learns the networks via the regression head and the denoising diffusion head, significantly accelerating training convergence while achieving higher accuracy. Second, we propose sampling guidance via time-dependent score scaling, so that the exploration-exploitation trade-off is handled effectively, eliminating the need for the additional evaluation network. The sampling guidance maintains the multi-modal characteristics of symmetric objects at early denoising steps while ensuring high-quality pose generation at the final steps. Extensive experiments on multiple benchmarks, including REAL275, HouseCat6D, and ROPE, demonstrate that the proposed method, simple yet effective, achieves state-of-the-art accuracies even with single-pose inference, while being more efficient in both training and inference.
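Time-dependent score scaling can be pictured as a guidance weight that grows as denoising progresses, keeping early steps exploratory and late steps exploitative. The schedule below is a hypothetical cosine ramp, not the paper's exact function.

```python
# Hedged sketch of time-dependent score scaling during reverse diffusion:
# weaker guidance early (keep multi-modal pose hypotheses alive), stronger
# guidance late (commit to a high-quality pose). The schedule is illustrative.
import math

def score_scale(t: float, t_max: float, s_min: float = 0.2, s_max: float = 2.0) -> float:
    """Monotonically increase the score scale as denoising progresses (t -> 0)."""
    progress = 1.0 - t / t_max          # 0 at the start, 1 at the final step
    return s_min + (s_max - s_min) * 0.5 * (1 - math.cos(math.pi * progress))

if __name__ == "__main__":
    for t in (1000, 500, 100, 1):
        print(t, round(score_scale(t, 1000), 3))
```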
Poster
Fangwei Zhong · Kui Wu · Churan Wang · Hao Chen · Hai Ci · Zhoujun Li · Yizhou Wang

[ Exhibit Hall I ]

Abstract
We introduce UnrealZoo, a rich collection of 100 photo-realistic 3D virtual worlds built on Unreal Engine, designed to reflect the complexity and variability of open worlds, with landscapes of up to 16 km². Additionally, we offer a rich variety of playable entities including humans, animals, robots, and vehicles for embodied AI. We extend UnrealCV with optimized Python APIs and tools for data collection, environment augmentation, distributed training, and benchmarking, achieving significant improvements in the efficiency of rendering and communication, to support advanced applications, such as multi-agent interactions. Our experimental evaluation across complex navigation and tracking tasks reveals two key insights: first, the substantial benefits of the diversity of environments for developing generalizable reinforcement learning (RL) agents; second, the persistent challenges that current embodied agents face in open-world settings. These challenges include transferring to a new embodiment at test time, managing latency in closed-loop control systems for dynamic environments, and effectively reasoning about complex 3D spatial structures in unstructured terrain. UnrealZoo thus provides both a powerful testing ground and a pathway toward more capable embodied AI systems for real-world deployment.
Poster
Valter Piedade · Chitturi Sidhartha · José Gaspar · Venu Madhav Govindu · Pedro Miraldo

[ Exhibit Hall I ]

Abstract
Outliers are ubiquitous in geometric vision contexts such as pose estimation and mapping, leading to inaccurate estimates. While robust loss functions tackle outliers, it is challenging to make the estimation robust to the choice of initialization and estimate the appropriate robust loss shape parameter that allows distinguishing inliers from outliers. Graduated non-convexity (GNC) often mitigates these issues. However, typical GNC uses a fixed annealing factor to update the shape parameter, which can lead to low-quality or inefficient estimates. This paper proposes a novel approach to adaptively anneal the shape parameter within a GNC framework. We developed a search strategy that incorporates a sampling of annealing choices and model scorings to select the most promising shape parameter at each GNC iteration. Additionally, we propose new stopping criteria and an initialization technique that improves performance for diverse data, and we show the benefits of combining discrete and continuous robust estimation strategies. We evaluate our method using synthetic and real-world data in two problems: 3D registration and pose graph optimization in SLAM sequences. Our results demonstrate greater efficiency and robustness compared to previous GNC schemes.
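The adaptive annealing idea can be illustrated on a toy robust line-fitting problem: at each GNC iteration, several candidate annealing factors are tried, the model is re-fit with the induced robust weights, and the best-scoring candidate is kept. Everything below (the Geman-McClure weights, the truncated scoring rule, the candidate factors) is a simplified stand-in for the paper's search strategy, not its implementation.

```python
# Illustrative sketch (simplified 1D line fitting, not the authors' pipeline)
# of adaptive GNC annealing with candidate sampling and model scoring.
import numpy as np

def weights(res, mu):
    # Geman-McClure IRLS weights for the current shape parameter mu.
    return (mu / (mu + res**2))**2

def refit(x, y, w):
    # Weighted least-squares fit of y ~ a*x + b.
    A = np.stack([x, np.ones_like(x)], axis=1)
    W = np.diag(w)
    return np.linalg.solve(A.T @ W @ A, A.T @ W @ y)

def truncated_score(res, tau=0.1):
    # Fixed reference score: truncated squared error (lower is better).
    return np.minimum(res**2, tau**2).sum()

def adaptive_gnc(x, y, mu=10.0, iters=8, factors=(0.3, 0.5, 0.8)):
    theta = refit(x, y, np.ones_like(x))
    for _ in range(iters):
        res = y - (theta[0] * x + theta[1])
        candidates = []
        for f in factors:
            th = refit(x, y, weights(res, mu * f))
            candidates.append((truncated_score(y - (th[0] * x + th[1])), mu * f, th))
        _, mu, theta = min(candidates, key=lambda c: c[0])
    return theta

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.uniform(0, 1, 200)
    y = 2 * x + 1 + rng.normal(0, 0.02, 200)
    y[:40] += rng.uniform(2, 5, 40)          # 20% outliers
    print(adaptive_gnc(x, y))                # approximately [2, 1]
```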
Poster
Zhengbo Zhang · Lin Geng Foo · Hossein Rahmani · Jun Liu · De Wen Soh

[ Exhibit Hall I ]

Abstract
Single image defocus deblurring (SIDD) is a challenging task that aims to recover an all-in-focus image from a defocused one. In this paper, we make the observation that a defocused image can be viewed as a blend of illuminated blobs based on fundamental imaging principles, and the defocus blur in the defocused image is caused by large illuminated blobs intermingling with each other. Thus, from a novel perspective, we perform SIDD by adjusting the shape and opacity of the illuminated blobs that compose the defocused image. With this aim, we design a novel 2D Gaussian blob representation for illuminated blobs and a differentiable rasterization method to obtain the parameters of the 2D Gaussian blobs that compose the defocused image. Additionally, we propose a blob deblurrer to adjust the parameters of the 2D Gaussian blobs corresponding to the defocused image, thereby obtaining a sharp image. We also explore incorporating prior depth information via our depth-based regularization loss to regularize the size of Gaussian blobs, further improving the performance of our method. Extensive experiments on five widely-used datasets validate the effectiveness of our proposed method.
Poster
Guowei Shi · Zian Mao · Peisen Huang

[ Exhibit Hall I ]

Abstract
Ultra-precision measurement of 6DoF pose is essential in applications such as semiconductor manufacturing and nanoscale manipulation. Conventional vision-based techniques are often hampered by sensitivity to defocus and by the limited number of periods available when imaging periodic patterns. In this paper, we propose a novel two-dimensional interpolated Discrete Fourier Transform (2D-IpDFT) method for robust 6DoF pose estimation using periodic patterns. We further develop a mathematical framework that links image parameters—phase and frequency—to 6DoF pose, which is applicable to both orthographic and quasi-orthographic imaging systems. Extensive experiments on a low-cost setup, featuring an industrial camera and etched periodic patterns, demonstrate nanometer-level translational accuracy and microradian-level rotational precision.
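The core signal-processing step, estimating frequency and phase from a DFT peak with sub-bin interpolation, can be sketched in 1D as below. The paper's method is two-dimensional and additionally maps these quantities to 6DoF pose, which is not reproduced here; the windowing and interpolation choices are generic assumptions.

```python
# Minimal 1D sketch of interpolated-DFT frequency/phase estimation for a
# periodic signal (illustrative, not the paper's 2D-IpDFT formulation).
import numpy as np

def interpolated_dft_peak(signal: np.ndarray):
    """Return (frequency in cycles/sample, phase in radians) of the dominant
    component, refining the DFT peak via parabolic interpolation."""
    n = len(signal)
    spec = np.fft.rfft(signal * np.hanning(n))
    k = int(np.argmax(np.abs(spec[1:])) + 1)
    # Parabolic interpolation on log-magnitude around the peak bin.
    a, b, c = np.log(np.abs(spec[k - 1:k + 2]) + 1e-12)
    delta = 0.5 * (a - c) / (a - 2 * b + c)
    freq = (k + delta) / n
    phase = float(np.angle(spec[k]))   # relative to a window-dependent reference
    return freq, phase

if __name__ == "__main__":
    n = 512
    t = np.arange(n)
    sig = np.cos(2 * np.pi * 0.0317 * t + 0.7)
    print(interpolated_dft_peak(sig))   # frequency close to 0.0317
```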
Poster
Gabriele Berton · Alex Stoken · Carlo Masone

[ Exhibit Hall I ]

Abstract
Thousands of photos of Earth are taken every day by astronauts from the International Space Station. The localization of these photos, which has been performed manually for decades, has recently been approached through image retrieval solutions: given an astronaut photo, the goal is to find its most similar match among a large database of geo-tagged satellite images, in a task called Astronaut Photography Localization (APL). Yet, existing APL approaches are trained only using satellite images, without taking advantage of the millions of open-source astronaut photos. In this work, we present the first APL pipeline capable of leveraging astronaut photos for training. We first produce full localization information for 300,000 manually weakly labeled astronaut photos through an automated pipeline, and then use these images to train a model, called AstroLoc. AstroLoc learns a robust representation of Earth's surface features through two objective functions: pairing astronaut photos with their matching satellite counterparts in a pairwise loss, and a second loss on clusters of satellite imagery weighted by their relevance to astronaut photography through unsupervised mining. AstroLoc achieves a staggering 35% average improvement in recall@1 over previous SOTA, reaching a recall@100 consistently over 99% for existing datasets. Moreover, without fine-tuning, AstroLoc provides excellent results …
Poster
Hanwen Jiang · Hao Tan · Peng Wang · Haian Jin · Yue Zhao · Sai Bi · Kai Zhang · Fujun Luan · Kalyan Sunkavalli · Qixing Huang · Georgios Pavlakos

[ Exhibit Hall I ]

Abstract
We present RayZer, a self-supervised multi-view 3D Vision model trained without any 3D supervision, i.e., camera poses and scene geometry, while exhibiting emerging 3D awareness. Concretely, RayZer takes unposed and uncalibrated images as input, recovers camera parameters, reconstructs a scene representation, and synthesizes novel views. During training, RayZer relies solely on its self-predicted camera poses to render target views, eliminating the need for any ground-truth camera annotations and allowing RayZer to be trained with 2D image supervision. The emerging 3D awareness of RayZer is attributed to two key factors. First, we design a self-supervised framework, which achieves 3D-aware auto-encoding of input images by disentangling camera and scene representations. Second, we design a transformer-based model in which the only 3D prior is the ray structure, connecting camera, pixel, and scene simultaneously. RayZer demonstrates novel view synthesis performance comparable or even superior to "oracle" methods that rely on pose annotations in both training and testing.
Poster
Weirong Chen · Ganlin Zhang · Felix Wimbauer · Rui Wang · Nikita Araslanov · Andrea Vedaldi · Daniel Cremers

[ Exhibit Hall I ]

Abstract
Traditional SLAM systems, which rely on bundle adjustment, often struggle with highly dynamic scenes commonly found in casual videos. Such videos entangle the motion of dynamic elements, undermining the assumption of static environments required by traditional systems. Existing techniques either filter out dynamic elements or model their motion independently. However, the former often results in incomplete reconstructions, whereas the latter can lead to inconsistent motion estimates. This work proposes a novel approach that leverages a 3D point tracker to decouple the static and dynamic motion, effectively separating the camera-induced motion from the motion of dynamic objects. Bundle adjustment can therefore operate reliably considering only the camera-induced component of the observed motion. We further ensure depth consistency across video frames with lightweight post-processing based on scale maps. Our framework combines the core of traditional SLAM -- bundle adjustment -- with a robust learning-based 3D tracker front-end. By integrating motion decomposition, bundle adjustment, and depth refinement into a unified framework, our method accurately tracks the camera motion and produces temporally coherent and scale-consistent dense reconstructions, accommodating both static and dynamic elements. Our experiments on challenging datasets reveal significant improvements in camera pose estimation and 3D reconstruction accuracy.
Poster
Hanwen Jiang · Qixing Huang · Georgios Pavlakos

[ Exhibit Hall I ]

Abstract
Training single-view Large Reconstruction Models (LRMs) follows the fully supervised route, requiring multi-view supervision. However, the multi-view data typically comes from synthetic 3D assets, which are hard to scale further and are not representative of the distribution of real-world object shapes. To address these limitations, we introduce Real3D, the first LRM that uses single-view real images for training, benefiting from their scalability and capturing the real-world shape distribution. Real3D introduces a novel self-training framework, including unsupervised losses at the pixel- and semantic-level, enabling LRMs to learn from these single-view images without multi-view supervision. Simultaneously, to deal with the noise of real data, Real3D also presents an automatic data curation approach to gather high-quality examples that have positive impact on training. Our experiments show that Real3D consistently outperforms prior work in diverse evaluation settings that include real and synthetic data, as well as both in-domain and out-of-domain shapes.
Poster
Olaf Dünkel · Thomas Wimmer · Christian Theobalt · Christian Rupprecht · Adam Kortylewski

[ Exhibit Hall I ]

Abstract
Finding correspondences between semantically similar points across images and object instances is one of the everlasting challenges in computer vision. While large pre-trained vision models have recently been demonstrated as effective priors for semantic matching, they still suffer from ambiguities for symmetric objects or repeated object parts. We propose to improve semantic correspondence estimation via 3D-aware pseudo-labeling. Specifically, we refine off-the-shelf features using pseudo ground truth obtained via 3D-aware chaining, filtering wrong labels through relaxed cyclic consistency, and 3D spherical prototype mapping constraints. While reducing the need for dataset specific annotations compared to prior work, we set a new state-of-the-art on SPair-71k by over 4% absolute gain and by over 7% against methods with similar supervision requirements. The generality of our proposed approach simplifies extension of training to other data sources, which we demonstrate in our experiments.
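Relaxed cyclic-consistency filtering can be sketched as keeping only those matches that approximately return to their starting point after a forward-backward chain. The snippet below is a generic nearest-neighbour illustration with a hypothetical pixel threshold, not the authors' 3D-aware pipeline.

```python
# Hedged sketch of relaxed cyclic-consistency filtering for pseudo-labels:
# a correspondence A->B is kept only if chaining it with the reverse map
# B->A returns close to the starting point (threshold is illustrative).
import numpy as np

def nearest_neighbor(src: np.ndarray, dst: np.ndarray) -> np.ndarray:
    """For each feature in `src`, index of its nearest neighbor in `dst`."""
    d = ((src[:, None, :] - dst[None, :, :]) ** 2).sum(-1)
    return d.argmin(1)

def cycle_consistent(feats_a, feats_b, pts_a, max_px=8.0):
    fwd = nearest_neighbor(feats_a, feats_b)      # A -> B
    bwd = nearest_neighbor(feats_b, feats_a)      # B -> A
    back = pts_a[bwd[fwd]]                        # where each A point lands after the cycle
    keep = np.linalg.norm(back - pts_a, axis=1) <= max_px
    return fwd, keep

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    feats_a = rng.normal(size=(50, 64))
    feats_b = feats_a + 0.05 * rng.normal(size=(50, 64))   # near-duplicate features
    pts_a = rng.uniform(0, 256, size=(50, 2))
    fwd, keep = cycle_consistent(feats_a, feats_b, pts_a)
    print(keep.mean())   # fraction of matches surviving the cycle check
```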
Poster
Pengjie Zhang · Lin Zhu · Xiao Wang · Lizhi Wang · Hua Huang

[ Exhibit Hall I ]

Abstract
Event cameras have shown promise in vision applications like optical flow estimation and stereo matching with many specialized architectures. However, existing works focus only on event data within the confines of task-specific domains, overlooking the correlations between tasks across the temporal and spatial domains. In this paper, we propose a novel matching-based framework for event cameras to estimate flow and disparity simultaneously in a shared representation space, reformulating them as a unified pixel-wise correspondence matching problem. Specifically, our method utilizes a Temporal Recurrent Network to aggregate asynchronous event features across temporal or spatial domains, and a Spatial Contextual Attention to enhance knowledge transfer across event flows via temporal or spatial interactions. By utilizing a shared pixel-wise feature-similarity module, our network performs optical flow estimation from temporal event segments and stereo matching from spatial event segments simultaneously. Our unified model inherently supports multi-task unification and cross-task transfer, which facilitate training and streamline deployment. Without the need for retraining on specific tasks, our model can effectively handle both event-based flow and stereo estimation, achieving state-of-the-art performance on both tasks. Our code will be released upon acceptance.
Poster
Lojze Zust · Yohann Cabon · Juliette Marrie · Leonid Antsfeld · Boris Chidlovskii · Jerome Revaud · Gabriela Csurka

[ Exhibit Hall I ]

Abstract
Panoptic segmentation of 3D scenes, which consists of isolating object instances in a dense 3D reconstruction of a scene, is challenging given only unposed images. Existing approaches typically extract 2D panoptic segmentations for each image using an off-the-shelf model, before optimizing an implicit geometric representation (often NeRF-based) that integrates and fuses the 2D panoptic constraints. Not only does this require camera parameters and costly test-time optimization for each scene, but we argue that performing 2D panoptic segmentation, despite the problem at hand being fundamentally 3D and multi-view, is likely suboptimal. In this work, we instead propose a simple, integrated, and unified approach. Our novel network, named PanSt3R, jointly predicts the 3D geometry and panoptic segmentation without any test-time optimization in a single forward pass. PanSt3R builds upon recent advances in 3D reconstruction, specifically upon MUSt3R, a scalable multi-view version of DUSt3R, which we endow with semantic knowledge and panoptic segmentation capabilities. We additionally revisit the standard post-processing mask merging procedure and introduce a more principled approach. Overall, the proposed PanSt3R is simple, fast and scalable. We conduct extensive experiments on multiple benchmarks and show that our method yields state-of-the-art results while being orders of magnitude faster.
Poster
Pingchuan Ma · Ming Gui · Johannes Schusterbauer · Xiaopei Yang · Olga Grebenkova · Vincent Tao Hu · Björn Ommer

[ Exhibit Hall I ]

Abstract
Generative probabilistic models have rapidly advanced and are now widely used in content creation. They have achieved impressive results in generating artwork and demonstrated an understanding of different styles. However, their understanding of art primarily remains at the level of individual pieces, limiting their ability to reveal broader stylistic trends and transitions over time. To analyze how art evolves, a distributional perspective is required, as single-instance observations do not capture the relation between them, which is essential for such a study. In this work, we introduce a diverse and high-quality dataset of over $656{,}536$ artworks spanning various genres, including paintings, illustrations, and other art forms, along with relevant metadata and annotations. Building on this dataset, we present a method that models the evolution of art as an optimal transport problem with stochastic interpolant to examine stylistic changes over time without requiring paired data. This approach allows us to study and understand the historical progression of art, uncovering the transitions and stylistic shifts that have occurred over centuries. Our code and dataset will be released upon publication.
Poster
Matteo Dunnhofer · Zaira Manigrasso · Christian Micheloni

[ Exhibit Hall I ]

Abstract
Visual object tracking and segmentation are becoming fundamental tasks for understanding human activities in egocentric vision. Recent research has benchmarked state-of-the-art methods and concluded that first-person egocentric vision presents challenges compared to previously studied domains. However, these claims are based on evaluations conducted across significantly different scenarios. Many of the challenging characteristics attributed to egocentric vision are also present in third-person videos of human-object activities. This raises a critical question: how much of the observed performance drop stems from the unique first-person viewpoint inherent to egocentric vision versus the domain of human-object activities? To address this question, we introduce a new benchmark study designed to disentangle such factors. Our evaluation strategy enables a more precise separation of challenges related to the first-person perspective from those linked to the broader domain of human-object activity understanding. By doing so, we provide deeper insights into the true sources of difficulty in egocentric tracking and segmentation, facilitating more targeted advancements in this field.
Poster
Junyan Ye · Honglin Lin · Leyan Ou · Dairong Chen · Zihao Wang · Qi Zhu · Conghui He · Weijia Li

[ Exhibit Hall I ]

Abstract
Cross-view geo-localization identifies the locations of street-view images by matching them with geo-tagged satellite images or OSM. However, most existing studies focus on image-to-image retrieval, with fewer addressing text-guided retrieval, a task vital for applications like pedestrian navigation and emergency response. In this work, we introduce a novel task for cross-view geo-localization with natural language descriptions, which aims to retrieve the corresponding satellite images or OSM data based on scene text descriptions. To support this task, we construct the CVG-Text dataset by collecting cross-view data from multiple cities and employing a scene text generation approach that leverages the annotation capabilities of Large Multimodal Models to produce high-quality scene text descriptions with localization details. Additionally, we propose a novel text-based retrieval localization method, CrossText2Loc, which improves recall by 10\% and demonstrates excellent long-text retrieval capabilities. In terms of explainability, it not only provides similarity scores but also offers retrieval reasons. More information can be found at https://cvg-text.github.io/CVG-Text/.
Poster
Gongwei Chen · Xurui Zhou · Rui Shao · Yibo Lyu · Kaiwen Zhou · Shuai Wang · WenTao Li · Yinchuan Li · Zhongang Qi · Liqiang Nie

[ Exhibit Hall I ]

Abstract
The research focus of GUI agents is shifting from text-dependent to pure-vision-based approaches, which, though promising, prioritize comprehensive pre-training data collection while neglecting contextual modeling challenges. We probe the characteristics of element and history contextual modeling in GUI agents and summarize: **1) the high density and loose relations of element context** highlight the existence of many unrelated elements and their negative influence; **2) the high redundancy of history context** reveals the inefficient history modeling in current GUI agents. In this work, we propose a context-aware simplification framework for building an efficient and effective GUI agent, termed **SimpAgent**. To mitigate potential interference from numerous unrelated elements, we introduce a **masking-based element pruning** method that circumvents the intractable relation modeling through an efficient masking mechanism. To reduce the redundancy in historical information, we devise a **consistency-guided history compression** module, which enhances implicit LLM-based compression through innovative explicit guidance, achieving an optimal balance between performance and efficiency. With the above components, SimpAgent reduces FLOPs by 27\% and achieves superior GUI navigation performance. Comprehensive navigation experiments across diverse web and mobile environments demonstrate the effectiveness and potential of our agent.
Poster
Jungdae Lee · Taiki Miyanishi · Shuhei Kurita · Koya Sakamoto · Daichi Azuma · Yutaka Matsuo · Nakamasa Inoue

[ Exhibit Hall I ]

Abstract
Vision-and-language navigation (VLN) aims to develop agents capable of navigating in realistic environments. While recent cross-modal training approaches have significantly improved navigation performance in both indoor and outdoor scenarios, aerial navigation over real-world cities remains underexplored, primarily due to limited datasets and the difficulty of integrating visual and geographic information. To fill this gap, we introduce CityNav, the first large-scale real-world dataset for aerial VLN. Our dataset consists of 32,637 human demonstration trajectories, each paired with a natural language description, covering 4.65 km^2 across two real cities: Cambridge and Birmingham. In contrast to existing datasets composed of synthetic scenes such as AerialVLN, our dataset presents a unique challenge because agents must interpret spatial relationships between real-world landmarks and the navigation destination, making CityNav an essential benchmark for advancing aerial VLN. Furthermore, as an initial step toward addressing this challenge, we provide a methodology for creating geographic semantic maps that can be used as an auxiliary modality input during navigation. In our experiments, we compare the performance of three representative aerial VLN agents (Seq2seq, CMA, and AerialVLN) and demonstrate that the semantic map representation significantly improves their navigation performance.
Poster
Ruifei Zhang · Wei Zhang · Xiao Tan · Sibei Yang · Xiang Wan · Xiaonan Luo · Guanbin Li

[ Exhibit Hall I ]

Abstract
Recent advancements in language-grounded autonomous driving have been significantly promoted by the sophisticated cognition and reasoning capabilities of large language models (LLMs). However, current LLM-based approaches encounter critical challenges: (1) Failure analysis reveals that frequent collisions and obstructions, stemming from limitations in visual representations, remain primary obstacles to robust driving performance. (2) The substantial parameter count of LLMs poses considerable deployment hurdles. To address these limitations, we introduce VLDrive, a novel approach featuring a lightweight MLLM architecture with enhanced vision components. VLDrive achieves compact visual tokens through innovative strategies, including cycle-consistent dynamic visual pruning and memory-enhanced feature aggregation. Furthermore, we propose a distance-decoupled instruction attention mechanism to improve joint visual-linguistic feature learning, particularly for long-range visual tokens. Extensive experiments conducted in the CARLA simulator demonstrate VLDrive's effectiveness. Notably, VLDrive achieves state-of-the-art driving performance while reducing parameters by 81\% (from 7B to 1.3B), yielding substantial driving score improvements of \textbf{15.4}\%, \textbf{16.8}\%, and \textbf{7.6}\% at tiny, short, and long distances, respectively, in closed-loop evaluations.
Poster
Zhengyao Lyu · Tianlin Pan · Chenyang Si · Zhaoxi Chen · Wangmeng Zuo · Ziwei Liu · Kwan-Yee K. Wong

[ Exhibit Hall I ]

Abstract
Multimodal Diffusion Transformers (MM-DiTs) have achieved remarkable progress in text-driven visual generation. However, even state-of-the-art MM-DiT models like FLUX struggle with achieving precise alignment between text prompts and generated content. We identify two key issues in the attention mechanism of MM-DiT, namely 1) the suppression of cross-modal attention due to token imbalance between visual and textual modalities and 2) the lack of timestep-aware attention weighting, which hinder the alignment. To address these issues, we propose \textbf{Temperature-Adjusted Cross-modal Attention (TACA)}, a parameter-efficient method that dynamically rebalances multimodal interactions through temperature scaling and timestep-dependent adjustment. When combined with LoRA fine-tuning, TACA significantly enhances text-image alignment on the T2I-CompBench benchmark with minimal computational overhead. We tested TACA on state-of-the-art models like FLUX and SD3.5, demonstrating its ability to improve image-text alignment in terms of object appearance, attribute binding, and spatial relationships. Our findings highlight the importance of balancing cross-modal attention in improving semantic fidelity in text-to-image diffusion models. Our code will be made publicly available.
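The abstract does not spell out the exact formulation, but the core idea of temperature-scaled cross-modal attention can be pictured with the hypothetical PyTorch sketch below; the schedule, the `base_temp`/`decay` parameters, and the block-wise masking are assumptions for illustration, not the paper's implementation.

```python
# Hypothetical sketch of temperature-adjusted cross-modal attention (not the paper's code).
# Idea: rescale text<->image attention logits with a timestep-dependent temperature so that
# text tokens are not drowned out by the much larger number of visual tokens.
import torch
import torch.nn.functional as F

def taca_attention(q, k, v, n_text, t, base_temp=1.2, decay=0.5):
    """q, k, v: (batch, heads, tokens, dim); the first n_text tokens are textual.
    t in [0, 1] is the diffusion timestep; the boost fades as t approaches 0 (assumed schedule)."""
    d = q.shape[-1]
    logits = q @ k.transpose(-2, -1) / d**0.5            # (B, H, N, N)

    # Timestep-dependent temperature (illustrative only).
    temp = 1.0 + (base_temp - 1.0) * (t ** decay)

    # Rescale only the cross-modal blocks (text queries on image keys and vice versa).
    n = logits.shape[-1]
    is_text = torch.zeros(n, dtype=torch.bool, device=q.device)
    is_text[:n_text] = True
    cross = is_text[None, :] ^ is_text[:, None]          # True where query/key modalities differ
    logits = torch.where(cross, logits * temp, logits)

    return F.softmax(logits, dim=-1) @ v

# Toy usage: 4 text tokens followed by 60 visual tokens.
q = k = v = torch.randn(1, 8, 64, 32)
print(taca_attention(q, k, v, n_text=4, t=0.8).shape)    # torch.Size([1, 8, 64, 32])
```

The sketch simply emphasizes text-image logits more strongly at noisier timesteps; the paper's actual adjustment rule and its LoRA integration may differ.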
Poster
Yizhou Zhao · Haoyu Chen · Chunjiang Liu · Zhenyang Li · Charles Herrmann · Junhwa Hur · Yinxiao Li · Ming-Hsuan Yang · Bhiksha Raj · Min Xu

[ Exhibit Hall I ]

Abstract
System identification from videos aims to recover object geometry and governing physical laws. Existing methods integrate differentiable rendering with simulation but rely on predefined material priors, limiting their ability to handle unknown ones. We introduce MASIV, the first vision-based framework for material-agnostic system identification. Unlike existing approaches that depend on hand-crafted constitutive laws, MASIV employs learnable neural constitutive models, inferring object dynamics without assuming a scene-specific material prior. However, the absence of full particle state information imposes unique challenges, leading to unstable optimization and physically implausible behaviors. To address this, we introduce dense geometric guidance by reconstructing continuum particle trajectories, providing temporally rich motion constraints beyond sparse visual cues. Comprehensive experiments show that MASIV achieves state-of-the-art performance in geometric accuracy, rendering quality, and generalization ability.
Poster
SHIBO WANG · Haonan He · Maria Parelli · Christoph Gebhardt · Zicong Fan · Jie Song

[ Exhibit Hall I ]

Abstract
We interact with objects everyday, making the holistic 3D reconstruction of hands and objects from videos essential for applications like robotic in-hand manipulation. While most RGB-based methods rely on object templates, existing template-free approaches depend heavily on image observations, assuming full visibility of the object in the video. However, this assumption often does not hold in real-world scenarios, where cameras are fixed and objects are held in a static grip. As a result, parts of the object may remain unobserved, leading to unrealistic reconstructions when the object is under-observed. To this end, we introduce MagicHOI, a method for reconstructing hands and objects from short monocular interaction videos, even under limited views. Our key insight is that, although paired 3D hand-object data is extremely scarce, large-scale diffusion models like image-to-3D models offer abundant object supervision. This additional supervision can act as a prior to help regularize unseen object regions during hand interactions. Leveraging this insight, MagicHOI incorporates an existing image-to-3D diffusion model into a hand-object reconstruction framework. We then refine hand poses by incorporating hand-object interaction constraints. Our results demonstrate that MagicHOI significantly outperforms existing state-of-the-art template-free hand-object reconstruction methods. We also show that image-to-3D diffusion priors effectively regularize unseen object …
Poster
Meiqi Cao · Xiangbo Shu · Xin Jiang · Rui Yan · Yazhou Yao · Jinhui Tang

[ Exhibit Hall I ]

Abstract
While event cameras excel in capturing microsecond temporal dynamics, they suffer from sparse spatial representations compared to traditional RGB data. Thus, multi-modal event-based action recognition approaches aim to synergize complementary strengths by independently extracting and integrating paired RGB-Event features. However, this paradigm inevitably introduces additional data acquisition costs, while eroding the inherent privacy advantages of event-based sensing. Drawing inspiration from event-to-image reconstruction, texture-enriched visual representation directly reconstructed from asynchronous event streams is a promising solution. In response, we propose an Enhanced Multimodal Perceptual (EMP) framework that hierarchically explores multimodal cues (e.g., edges and textures) from raw event streams through two synergistic innovations spanning representation to feature levels. Specifically, we introduce a Cross-Modal Frequency Enhancer (CFE) that leverages complementary frequency characteristics between reconstructed frames and stacked frames to refine event representations. Furthermore, to achieve unified feature encoding across modalities, we develop a High-Frequency Guided Selector (HGS) for semantic-consistency token selection guided by dynamic edge features while adaptively suppressing redundant multimodal information interference. Extensive experiments on four benchmark datasets demonstrate the superior effectiveness of our proposed framework.
Poster
Runmin Zhang · Zhu Yu · Si-Yuan Cao · Lingyu Zhu · Guangyi Zhang · Xiaokai Bai · Hui-liang Shen

[ Exhibit Hall I ]

Abstract
This work presents SGCDet, a novel multi-view indoor object detection framework based on adaptive 3D volume construction. Unlike previous approaches that restrict the receptive field of voxels to fixed locations on images, we introduce a geometry- and context-aware aggregation module to integrate geometric and contextual information within an adaptive region, enhancing the representation capability of voxel features. Furthermore, we propose a sparse volume construction strategy that adaptively identifies and selects voxels with a high occupancy probability for feature refinement, minimizing redundant computation in free space. Benefiting from the above designs, our framework achieves effective and efficient volume construction in an adaptive way. Better still, our network can be supervised using only 3D bounding boxes, eliminating the dependence on ground-truth scene geometry. Experimental results demonstrate that SGCDet achieves state-of-the-art performance on the ScanNet and ARKitScenes datasets. Compared to the previous state-of-the-art approach, our SGCDet reduces training memory, training time, inference memory, and inference time by 42.9\%, 47.2\%, 50\%, and 40.8\%, respectively, while achieving notable improvements in mAP@0.50 of 3.9 on ScanNet and 3.3 on ARKitScenes.
Poster
Xiao Lin · Yun Peng · Liuyi Wang · xianyou zhong · Minghao Zhu · Jingwei Yang · Yi Feng · Chengju Liu · Qijun Chen

[ Exhibit Hall I ]

Abstract
In the effort to achieve robust and generalizable category-level object pose estimation, recent methods primarily focus on learning fundamental representations from data. However, the inherent biases within the data are often overlooked: repeated training samples and similar environments may mislead the models to over-rely on specific patterns, hindering their performance on novel instances. In this paper, we present CleanPose, a novel method that mitigates data biases to enhance category-level pose estimation by integrating causal learning and knowledge distillation. By incorporating key causal variables (structural information and hidden confounders) into causal modeling, we propose a causal inference module based on front-door adjustment, which promotes unbiased estimation by reducing potential spurious correlations. Additionally, to further confront the data bias at the feature level, we devise a residual-based knowledge distillation approach to transfer unbiased semantic knowledge from a 3D foundation model, providing comprehensive causal supervision. Extensive experiments across multiple benchmarks (REAL275, CAMERA25 and HouseCat6D) highlight the superiority of the proposed CleanPose over state-of-the-art methods. Code will be released.
Poster
Younjoon Chung · Hyoungseob Park · Patrick Rim · Xiaoran Zhang · Jihe He · Ziyao Zeng · Safa Cicek · Byung-Woo Hong · James Duncan · Alex Wong

[ Exhibit Hall I ]

Abstract
We propose a method for adapting pretrained depth completion models to test-time data in an unsupervised manner. Depth completion models are (pre)trained to produce dense depth maps from pairs of RGB images and sparse depth maps in ideal capture conditions (source domain), e.g., well-illuminated, high signal-to-noise. When models are transferred to capture conditions outside the ideal case (target domain), they produce erroneous output dense depth maps due to the covariate shift. To identify cases of out-of-distribution errors, we propose to learn an energy model in the source domain that assigns scalars representing the likelihood of the output dense depth maps. This energy model is further used to adapt the pretrained depth completion models at test time, leading to our method: Energy-based Test-time Adaptation (ETA). ETA can localize regions of errors as high energy; test-time adaptation involves updating the model weights to minimize the energy, which in turn mitigates the covariate shift. We evaluate ETA across 3 indoor and 3 outdoor datasets, where ETA improves over the previous state of the art by an average of 6.94% in outdoor and 10.23% in indoor settings.
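A minimal sketch of such an energy-guided test-time adaptation loop is given below, using toy stand-in modules; the interfaces, architectures, and learning rate are assumptions for illustration, not the authors' code.

```python
# Illustrative sketch: at test time, update the depth model so that its dense predictions
# receive low energy from a source-trained energy model (assumed interfaces, toy networks).
import torch
import torch.nn as nn

class ToyDepthModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(4, 1, 3, padding=1)        # rgb (3) + sparse depth (1) -> dense depth
    def forward(self, rgb, sparse):
        return self.net(torch.cat([rgb, sparse], dim=1))

class ToyEnergyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(1, 1, 3, padding=1)
    def forward(self, dense):
        return self.net(dense).mean(dim=(1, 2, 3))      # one scalar energy per image

def eta_adapt(depth_model, energy_model, batches, steps=1, lr=1e-5):
    opt = torch.optim.Adam(depth_model.parameters(), lr=lr)
    energy_model.eval()
    for rgb, sparse in batches:                         # unlabeled target-domain data
        for _ in range(steps):
            energy = energy_model(depth_model(rgb, sparse)).mean()
            opt.zero_grad(); energy.backward(); opt.step()   # pull predictions toward low energy
    return depth_model

batches = [(torch.randn(2, 3, 32, 32), torch.randn(2, 1, 32, 32))]
eta_adapt(ToyDepthModel(), ToyEnergyModel(), batches)
```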
Poster
Nikita Karaev · Iurii Makarov · Jianyuan Wang · Natalia Neverova · Andrea Vedaldi · Christian Rupprecht

[ Exhibit Hall I ]

Abstract
We introduce CoTracker3, a new state-of-the-art point tracker. With CoTracker3, we revisit the design of recent trackers, removing components and reducing the number of parameters while also improving performance. We also explore the interplay of synthetic and real data. Recent trackers are trained on synthetic videos due to the difficulty of collecting tracking annotations for real data. However, this can result in suboptimal performance due to the statistical gap between synthetic and real videos. We thus suggest using off-the-shelf trackers as teachers, annotating real videos with pseudo-labels. Compared to other recent attempts at using real data for learning trackers, this scheme is much simpler and achieves better results using 1,000 times less data. CoTracker3 is available in online (causal) and offline variants and is particularly robust to occlusions.
Poster
Yunpeng Bai · Qixing Huang

[ Exhibit Hall I ]

Abstract
Monocular Depth Estimation (MDE) is a fundamental 3D vision problem with numerous applications such as 3D scene reconstruction, autonomous navigation, and AI content creation. However, robust and generalizable MDE remains challenging due to limited real-world labeled data and distribution gaps between synthetic datasets and real data. Existing methods often struggle on real-world test data with low efficiency, reduced accuracy, and lack of detail. To address these issues, we propose an efficient MDE approach named FiffDepth. The key feature of FiffDepth is its use of diffusion priors. It transforms diffusion-based image generators into a feed-forward architecture for detailed depth estimation. FiffDepth preserves key generative features and integrates the strong generalization capabilities of models like DINOv2. Through benchmark evaluations, we demonstrate that FiffDepth achieves exceptional accuracy, stability, and fine-grained detail, offering significant improvements in MDE performance against state-of-the-art MDE approaches.
Poster
Inwoo Hwang · Bing Zhou · Young Min Kim · Jian Wang · chuan guo

[ Exhibit Hall I ]

Abstract
Modeling human-scene interactions (HSI) is essential for understanding and simulating everyday human behaviors. Recent approaches utilizing generative modeling have made progress in this domain; however, they are limited in controllability and flexibility for real-world applications. To address these challenges, we propose reformulating the HSI modeling problem as Scene-aware Motion In-betweening---a more tractable and practical task. We introduce SceneMI, a framework that supports several practical applications, including keyframe-guided character animation in 3D scenes and enhancing the motion quality of imperfect HSI data. SceneMI employs dual scene descriptors to comprehensively encode global and local scene context. Furthermore, our framework leverages the inherent denoising nature of diffusion models to generalize on noisy keyframes. Experimental results demonstrate SceneMI's effectiveness in scene-aware keyframe in-betweening and generalization to the real-world GIMO dataset, where motions and scenes are acquired by noisy IMU sensors and smartphones. We further showcase SceneMI's applicability in HSI reconstruction from monocular videos.
Poster
Zerui Chen · Rolandos Alexandros Potamias · Shizhe Chen · Cordelia Schmid

[ Exhibit Hall I ]

Abstract
Reconstructing hand-held objects in 3D from monocular images remains a significant challenge in computer vision. Most existing approaches rely on implicit 3D representations, which produce overly smooth reconstructions and are time-consuming when generating explicit 3D shapes. While more recent methods directly reconstruct point clouds with diffusion models, the multi-step denoising makes high-resolution reconstruction inefficient. To address these limitations, we propose a transformer-based model to efficiently reconstruct dense 3D point clouds of hand-held objects. Our method follows a coarse-to-fine strategy, first generating a sparse point cloud from the image and progressively refining it into a dense representation using pixel-aligned image features. To enhance reconstruction accuracy, we integrate image features with 3D hand geometry to jointly predict the object point cloud and its pose relative to the hand. Our model is trained end-to-end for optimal performance. Experimental results on both synthetic and real datasets demonstrate that our method achieves state-of-the-art accuracy with much faster inference speed, while generalizing well to in-the-wild images.
Poster
Marvin Burges · Philipe Dias · Dalton Lunga · Carson Woody · Sarah Walters

[ Exhibit Hall I ]

Abstract
Object detection in remote sensing demands extensive, high-quality annotations—a process that is both labor-intensive and time-consuming. In this work, we introduce a real-time active learning and semi-automated labeling framework that leverages foundation models to streamline dataset annotation for object detection in remote sensing imagery. For example, by integrating a Segment Anything Model (SAM), our approach generates mask-based bounding boxes that serve as the basis for dual sampling: (a) uncertainty estimation to pinpoint challenging samples, and (b) diversity assessment to ensure broad data coverage. Furthermore, our Dynamic Box Switching Module (DBS) addresses the well-known cold-start problem for object detection models by replacing their suboptimal initial predictions with SAM-derived masks, thereby enhancing early-stage localization accuracy. Extensive evaluations on multiple remote sensing datasets, plus a real-world user study, demonstrate that our framework not only reduces annotation effort but also significantly boosts detection performance compared to traditional active learning sampling methods. The code for training and the user interface will be made available.
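As a rough illustration of the dual sampling idea (uncertainty plus diversity), the hypothetical sketch below scores a pool of images; the 50/50 budget split, the confidence-based uncertainty score, and the farthest-point diversity criterion are assumptions, not the paper's exact recipe.

```python
# Hedged sketch of dual sampling for active learning: half the budget goes to the least
# confident images, the other half to images that are far apart in feature space.
import numpy as np

def select_samples(confidences, features, budget):
    """confidences: (N,) max detection confidence per image; features: (N, D) image embeddings."""
    n = len(confidences)
    half = budget // 2

    # (a) Uncertainty: prefer images whose most confident detection is still uncertain.
    uncertain = np.argsort(confidences)[:half]

    # (b) Diversity: greedy farthest-point selection on the remaining pool.
    remaining = [i for i in range(n) if i not in set(uncertain)]
    chosen = [remaining[0]]
    while len(chosen) < budget - half and len(chosen) < len(remaining):
        dists = np.min(
            np.linalg.norm(features[remaining][:, None] - features[chosen][None], axis=-1), axis=1
        )
        chosen.append(remaining[int(np.argmax(dists))])
    return list(uncertain) + chosen

scores = np.random.rand(100)
feats = np.random.rand(100, 16)
print(select_samples(scores, feats, budget=10))
```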
Poster
Fengyuan Yang · Kerui Gu · Ha Linh Nguyen · Tze Ho Elden Tse · Angela Yao

[ Exhibit Hall I ]

Abstract
Accurate camera motion estimation is essential for recovering global human motion in world coordinates from RGB video inputs. While SLAM is widely used for estimating camera trajectory and point cloud, monocular SLAM does so only up to an unknown scale factor. Previous works estimate the scale factor through optimization, but this is unreliable and time-consuming. This paper presents an optimization-free scale calibration framework, Human as Checkerboard (HAC). HAC explicitly leverages the human body predicted by a human mesh recovery model as a calibration reference. Specifically, it innovatively uses the absolute depth of human-scene contact joints as references to calibrate the corresponding relative scene depth from SLAM. HAC benefits from geometric priors encoded in human mesh recovery models to estimate the SLAM scale and achieves precise global human motion estimation. Simple yet powerful, our method sets a new state-of-the-art performance for global human mesh estimation tasks. It reduces motion errors by 50\% over prior local-to-global methods while using 100$\times$ less post-SLAM inference time than optimization-based methods. Our code will be made public.
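The core scale-calibration step can be pictured with a short sketch (assumed variable names and a median-based robust estimate; not the authors' implementation): the metric depths of human contact joints from mesh recovery are compared with the up-to-scale SLAM depth sampled at the same pixels.

```python
# Rough sketch of calibrating the monocular-SLAM scale against human contact joints.
import numpy as np

def calibrate_slam_scale(joint_depth_metric, joint_pixels, slam_depth_map):
    """joint_depth_metric: (J,) absolute joint depths in meters from human mesh recovery.
    joint_pixels: (J, 2) integer (u, v) projections of those joints.
    slam_depth_map: (H, W) up-to-scale depth from monocular SLAM."""
    u, v = joint_pixels[:, 0], joint_pixels[:, 1]
    slam_depth = slam_depth_map[v, u]                    # sample SLAM depth at the joint pixels
    valid = slam_depth > 1e-6
    ratios = joint_depth_metric[valid] / slam_depth[valid]
    return float(np.median(ratios))                      # robust scale estimate

# Toy usage with random inputs.
depth_map = np.random.rand(480, 640) + 0.5
joints_px = np.stack([np.random.randint(0, 640, 8), np.random.randint(0, 480, 8)], axis=1)
joints_z = np.random.rand(8) * 2 + 2.0
print(calibrate_slam_scale(joints_z, joints_px, depth_map))
```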
Poster
Xin Qiao · Matteo Poggi · Xing Wei · Pengchao Deng · Yanhui Zhou · Stefano Mattoccia

[ Exhibit Hall I ]

Abstract
Under-display ToF imaging aims to both achieve precise depth sensing and maximize user experience by embedding a ToF camera beneath a screen panel. However, multiple complex degradations may occur during the imaging process, resulting in significant degradation of depth quality. To alleviate this drawback, we introduce a hybrid framework, named Learnable Fractional Reaction-Diffusion Dynamics (LFRD$^2$), which integrates the robust feature representation capabilities of neural networks with the interpretability of physical models. Specifically, we design a neural module implementing the time-fractional reaction-diffusion equation, which allows for iterative refinement to enhance depth quality, whose differential orders are generated dynamically. This module can correlate the current state of the predicted depth with any preceding states, keeping track of the long-term memory of the system itself. Furthermore, we propose a novel approach to construct an efficient continuous convolution operator based on coefficient prediction and repeated differentiation, further enhancing the final quality. Experimental results illustrate the effectiveness of our framework on four benchmark datasets. The code will be made available upon acceptance.
Poster
Lena Wild · Rafael Valencia · Patric Jensfelt

[ Exhibit Hall I ]

Abstract
Reliable integration of prior information is crucial for self-verifying and self-updating HD maps. However, no public dataset includes the required triplet of prior maps, current maps, and sensor data. As a result, existing methods must rely on synthetic priors, which create inconsistencies and lead to a significant sim2real gap. To address this, we introduce ArgoTweak, the first dataset to complete the triplet with realistic map priors. At its core, ArgoTweak employs a bijective mapping framework, breaking down large-scale modifications into fine-grained atomic changes at the map element level, thus ensuring interpretability. This paradigm shift enables accurate change detection and integration while preserving unchanged elements with high fidelity. Experiments show that training models on ArgoTweak significantly reduces the sim2real gap compared to synthetic priors. Extensive ablations further highlight the impact of structured priors and detailed change annotations. By establishing a benchmark for explainable, prior-aided HD mapping, ArgoTweak advances scalable, self-improving mapping solutions. Code, dataset, and our map modification toolbox will be made available at [URL].
Poster
Jijun Xiang · Xuan Zhu · Xianqi Wang · Yu Wang · Hong Zhang · Fei Guo · Xin Yang

[ Exhibit Hall I ]

Abstract
Depth enhancement, which uses RGB images as guidance to convert raw signals from dToF into high-precision, dense depth maps, is a critical task in computer vision. Although existing super-resolution-based methods show promising results on public datasets, they often rely on idealized assumptions like accurate region correspondences and reliable dToF inputs, overlooking calibration errors that cause misalignment and anomaly signals inherent to dToF imaging, limiting real-world applicability. To address these challenges, we propose a novel completion-based method, named DEPTHOR, featuring advances in both the training strategy and model architecture. First, we propose a method to simulate real-world dToF data from the accurate ground truth in synthetic datasets to enable noise-robust training. Second, we design a novel network that incorporates monocular depth estimation (MDE), leveraging global depth relationships and contextual information to improve prediction in challenging regions. On the ZJU-L5 dataset, our training strategy significantly enhances depth completion models, achieving results comparable to depth super-resolution methods, while our model achieves state-of-the-art results, improving Rel and RMSE by 27\% and 18\%, respectively. On a more challenging set of dToF samples we collected, our method outperforms SOTA methods on preliminary stereo-based GT, improving Rel and RMSE by 23\% and 22\%, respectively. Our code, trained …
Poster
Jingxi Liao · Shijie Hao · Richang Hong · Meng Wang

[ Exhibit Hall I ]

Abstract
Low-light image enhancement (LLIE) aims to improve the visual quality of images captured under poor lighting conditions. In supervised LLIE tasks, there exists a significant yet often overlooked inconsistency between the overall brightness of an enhanced image and its ground truth counterpart, referred to as $\textit{brightness mismatch}$ in this study. Brightness mismatch negatively impacts supervised LLIE models by misleading model training. However, this issue is largely neglected in current research. In this context, we propose the $\textit{GT-mean loss}$, a simple yet effective loss function that directly models the mean values of images from a probabilistic perspective. The GT-mean loss is flexible, as it extends existing supervised LLIE loss functions into the GT-mean form with minimal additional computational costs. Extensive experiments demonstrate that the incorporation of the GT-mean loss results in consistent performance improvements across various methods and datasets.
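One plausible reading of extending an existing supervised loss into a "GT-mean form" is sketched below as a hypothetical wrapper; the brightness-matching step and the `alpha` weighting are assumptions for illustration, not the paper's derivation.

```python
# Hedged sketch: combine the base loss with the same loss evaluated after rescaling the
# prediction so its global mean matches the ground truth's mean, separating brightness
# mismatch from structural error.
import torch
import torch.nn.functional as F

def gt_mean_loss(pred, target, base_loss=F.l1_loss, alpha=0.5):
    """pred, target: (B, C, H, W) images in [0, 1]."""
    scale = target.mean(dim=(1, 2, 3), keepdim=True) / pred.mean(dim=(1, 2, 3), keepdim=True).clamp(min=1e-6)
    pred_matched = (pred * scale).clamp(0, 1)            # brightness-matched copy of the prediction
    return alpha * base_loss(pred, target) + (1 - alpha) * base_loss(pred_matched, target)

# Toy usage.
pred = torch.rand(2, 3, 64, 64, requires_grad=True)
gt = torch.rand(2, 3, 64, 64)
gt_mean_loss(pred, gt).backward()
```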
Poster
Li Mi · Manon Béchaz · Zeming Chen · Antoine Bosselut · Devis Tuia

[ Exhibit Hall I ]

Abstract
Active Geo-localization (AGL) is the task of localizing a goal, represented in various modalities (e.g., aerial images, ground-level images, or text), within a predefined search area. Current methods approach AGL as a goal-reaching reinforcement learning (RL) problem with a distance-based reward. They localize the goal by learning to estimate the relative distance from it. However, when distance estimation becomes challenging or when encountering unseen targets and environments, the agent exhibits reduced robustness and generalization ability due to the less reliable exploration strategy learned during training. In this paper, we propose GeoExplorer, an AGL agent that incorporates curiosity-driven exploration through intrinsic rewards. Unlike distance-based rewards, our curiosity-driven reward is goal-agnostic, enabling robust, diverse, and contextually relevant exploration based on effective environment modeling. These capabilities have been proven through extensive experiments across four AGL benchmarks, demonstrating the effectiveness and generalization ability of GeoExplorer in diverse settings, particularly in localizing unfamiliar targets and environments.
Poster
Anurag Ghosh · Shen Zheng · Robert Tamburo · Khiem Vuong · Juan Alvarez-Padilla · Hailiang Zhu · Nicholas Dunn · Michael Cardei · Christoph Mertz · Srinivasa Narasimhan

[ Exhibit Hall I ]

Abstract
Perceiving and navigating autonomously through work zones is a challenging and underexplored problem. Open datasets for developing algorithms for this long-tailed scenario are scarce. We propose the ROADWork dataset to learn to recognize, observe, analyze, and drive through work zones. State-of-the-art foundation models perform poorly when applied to work zones. Fine-tuning models on our dataset significantly improves perception and navigation in work zones. With ROADWork, we discover new work zone images with higher precision (+32.5%) at a much higher rate (12.8×) around the world. Open-vocabulary methods fail on work zones, whereas detectors fine-tuned on our data improve performance (+32.2 AP). Vision-Language Models (VLMs) struggle to describe work zones, but fine-tuning substantially improves performance (+36.7 SPICE). Beyond fine-tuning, we show the value of simple techniques: video label propagation provides additional gains (+2.6 AP). For reading work zone signs, composing a work zone detector and a text spotter through crop-scaling improves performance (+14.2% 1-NED). Composing work zone detections to provide context further reduces hallucinations (+3.9 SPICE) in VLMs. We compute drivable paths from work zone navigation videos and predict navigational goals and pathways. Incorporating road work semantics ensures 53.6% of goals have angular error (AE) < 0.5 degrees (+9.9%) and 75.3% of pathways have AE < 0.5 …
Poster
Rangel Daroya · Elijah Cole · Oisin Mac Aodha · Grant Horn · Subhransu Maji

[ Exhibit Hall I ]

Abstract
Species distributions encode valuable ecological and environmental information, yet their potential for guiding representation learning in remote sensing remains underexplored. We introduce WildSAT, which pairs satellite images with millions of geo-tagged wildlife observations readily-available on citizen science platforms. WildSAT employs a contrastive learning approach that jointly leverages satellite images, species occurrence maps, and textual habitat descriptions to train or fine-tune models. This approach significantly improves performance on diverse satellite image recognition tasks, outperforming both ImageNet-pretrained models and satellite-specific baselines. Additionally, by aligning visual and textual information, WildSAT enables zero-shot retrieval, allowing users to search geographic locations based on textual descriptions. WildSAT surpasses recent cross-modal learning methods, including approaches that align satellite images with ground imagery or wildlife photos, demonstrating the advantages of our approach. Finally, we analyze the impact of key design choices and highlight the broad applicability of WildSAT to remote sensing and biodiversity monitoring.
Poster
Lily Goli · Sara Sabour · Mark Matthews · Marcus Brubaker · Dmitry Lagun · Alec Jacobson · David Fleet · Saurabh Saxena · Andrea Tagliasacchi

[ Exhibit Hall I ]

Abstract
There has been extensive progress in the reconstruction and generation of 4D scenes from monocular casually-captured video. Estimating accurate camera poses from videos through structure-from-motion (SfM) relies on robustly separating static and dynamic parts of a video. We propose a novel approach to video-based motion segmentation to identify the components of a scene that are moving w.r.t. a fixed world frame. Our simple but effective iterative method, RoMo, combines optical flow and epipolar cues with a pre-trained video segmentation model. It outperforms unsupervised baselines for motion segmentation as well as supervised baselines trained from synthetic data. More importantly, the combination of an off-the-shelf SfM pipeline with our segmentation masks establishes a new state-of-the-art on camera calibration for scenes with dynamic content, outperforming existing methods by a substantial margin.
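The epipolar-cue component of such flow-based motion segmentation can be approximated with a short, simplified stand-in (not the RoMo pipeline): correspondences induced by optical flow are checked against a robustly fitted fundamental matrix, and pixels with a large Sampson error are flagged as dynamic.

```python
# Simplified illustration of epipolar-consistency motion masking from optical flow
# (assumed thresholds; a sketch, not the paper's method).
import numpy as np
import cv2

def motion_mask_from_flow(flow, thresh=2.0):
    """flow: (H, W, 2) forward optical flow between two frames."""
    h, w = flow.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    pts1 = np.stack([xs, ys], axis=-1).reshape(-1, 2).astype(np.float32)
    pts2 = pts1 + flow.reshape(-1, 2)

    # Robust epipolar geometry from the (mostly static) correspondences.
    F, _ = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 1.0, 0.999)
    if F is None or F.shape != (3, 3):
        return np.zeros((h, w), dtype=bool)              # degenerate geometry: fall back to "all static"

    # Sampson error of every correspondence w.r.t. F.
    p1 = np.concatenate([pts1, np.ones((len(pts1), 1), np.float32)], axis=1)
    p2 = np.concatenate([pts2, np.ones((len(pts2), 1), np.float32)], axis=1)
    Fp1, Ftp2 = p1 @ F.T, p2 @ F
    num = np.square(np.sum(p2 * Fp1, axis=1))
    den = Fp1[:, 0]**2 + Fp1[:, 1]**2 + Ftp2[:, 0]**2 + Ftp2[:, 1]**2
    sampson = num / np.maximum(den, 1e-9)

    return (sampson > thresh).reshape(h, w)              # True where motion violates the epipolar constraint

mask = motion_mask_from_flow(np.random.randn(120, 160, 2).astype(np.float32))
print(mask.mean())
```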
Poster
Hao Zhou · Zhanning Gao · Zhili Chen · Maosheng Ye · Qifeng Chen · Tongyi Cao · Honggang Qi

[ Exhibit Hall I ]

Abstract
In light of the dynamic nature of autonomous driving environments and stringent safety requirements, general MLLMs combined with CLIP alone often struggle to accurately represent driving-specific scenarios, particularly in complex interactions and long-tail cases. To address this, we propose the Hints of Prompt (HoP) framework, which introduces three key enhancements: Affinity hint to emphasize instance-level structure by strengthening token-wise connections, Semantic hint to incorporate high-level information relevant to driving-specific cases, such as complex interactions among vehicles and traffic signs, and Question hint to align visual features with the query context, focusing on question-relevant regions. These hints are fused through a Hint Fusion module, enriching visual representations by capturing driving-related representations with limited domain data, ensuring faster adaptation to driving scenarios. Extensive experiments confirm the effectiveness of the HoP framework, showing that it significantly outperforms previous state-of-the-art methods in all key metrics.
Poster
Hoonhee Cho · Yuhwan Jeong · Kuk-Jin Yoon

[ Exhibit Hall I ]

Abstract
With advancements in sensor and display technologies, high-resolution imagery is becoming increasingly prevalent in diverse applications. As a result, optical flow estimation needs to adapt to larger image resolutions, where even moderate movements lead to substantial pixel displacements, making long-range motion estimation more critical than ever. However, existing datasets primarily focus on short-range flow in low-resolution settings, limiting the generalization of models to high-resolution scenarios with large displacements. Additionally, there is a lack of suitable datasets for evaluating model capacity in long-range motion estimation, further hindering progress in this area. To address this, we introduce RelayFlow-4K, a high-resolution 4K optical flow dataset designed to capture diverse motion patterns, including long-range intermediate frame flows. While such datasets provide valuable training resources, long-range estimation remains challenging due to increased matching ambiguity. Simply incorporating these datasets does not inherently improve performance. To this end, we propose a novel training framework that integrates matching cost distillation and incremental time-step learning to refine cost volume estimation and stabilize training. Additionally, we leverage the distance map, which measures the distance from unmatched regions to their nearest matched pixels, improving occlusion handling. Our approach significantly enhances long-range optical flow estimation in high-resolution settings. Our datasets and code will …
Poster
Ze Li · Feng Zhang · Xiatian Zhu · Zhang Meng · Yanghong Zhou · P.Y. Mok

[ Exhibit Hall I ]

Abstract
Synthesizing normal-light novel views from low-light multiview images remains a challenging yet practical task due to low visibility and high ISO noise. Existing low-light enhancement methods often struggle to preprocess these images effectively due to their inability to structurally correlate multiple views. While state-of-the-art approaches have advanced by manipulating illumination-related components during rendering, they often introduce color distortions and artifacts. Moreover, they rely solely on NeRF’s multi-view optimization, which offers limited denoising effectiveness. In this paper, we propose a novel Robust Low-light Scene Restoration framework, termed RoSe, which enables novel-view synthesis under normal lighting from low-light multiview images. Inspired by the 2D Retinex theory, we frame this task as an illuminance transition estimation problem in 3D space, further conceptualizing it as a specialized rendering task. This multiview-consistent illuminance transition field establishes a robust connection between low-light and normal-light conditions. By further exploiting the inherent low-rank property of illumination to constrain the transition representation, we achieve more effective denoising without complex 2D techniques or explicit noise modeling. To this end, we design a concise dual-branch architecture and propose a low-rank denoising module. Experiments demonstrate that RoSe significantly outperforms state-of-the-art models in both rendering quality and multiview consistency on standard …
Poster
Dongyoung Kim · Mahmoud Afifi · Dongyun Kim · Michael Brown · Seon Joo Kim

[ Exhibit Hall I ]

Abstract
Computational color constancy, or white balancing, is a key module in a camera’s image signal processor (ISP) that corrects color casts from scene lighting. Because this operation occurs in the camera-specific raw color space, white balance algorithms must adapt to different cameras. This paper introduces a learning-based method for cross-camera color constancy that generalizes to new cameras without retraining. Our method leverages pre-calibrated color correction matrices (CCMs) available on ISPs that map the camera’s raw color space to a standard space (e.g., CIE XYZ). Our method uses these CCMs to transform predefined illumination colors (i.e., along the Planckian locus) into the test camera's raw space. The mapped illuminants are encoded into a compact _camera fingerprint embedding_ (CFE) that enables the network to adapt to unseen cameras. To prevent overfitting due to limited cameras and CCMs during training, we introduce a data augmentation technique that interpolates between cameras and their CCMs. Experimental results across multiple datasets and backbones show that our method achieves state-of-the-art cross-camera color constancy while remaining lightweight and relying only on data readily available in camera ISPs.
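A hypothetical sketch of turning a pre-calibrated CCM into a camera fingerprint embedding is given below; the Planckian samples, the chromaticity normalization, and the embedding layout are assumptions for illustration, not the paper's construction.

```python
# Hedged sketch: map a fixed set of illuminant colors through the inverse CCM into the
# camera's raw space and flatten their chromaticities into a compact per-camera embedding.
import numpy as np

def camera_fingerprint(ccm, planckian_xyz):
    """ccm: (3, 3) matrix mapping camera raw RGB -> CIE XYZ (as calibrated on the ISP).
    planckian_xyz: (N, 3) XYZ colors of illuminants sampled along the Planckian locus."""
    raw = planckian_xyz @ np.linalg.inv(ccm).T           # illuminants expressed in camera raw space
    raw /= raw.sum(axis=1, keepdims=True)                # normalize to chromaticity (scale-invariant)
    return raw[:, [0, 2]].reshape(-1)                    # two channels suffice; the third is redundant

# Toy usage with an identity CCM and three made-up illuminant colors.
ccm = np.eye(3)
illuminants = np.array([[0.95, 1.0, 1.09], [1.0, 1.0, 1.0], [1.1, 1.0, 0.35]])
print(camera_fingerprint(ccm, illuminants))
```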
Poster
Jie Feng · Shengyuan Wang · Tianhui Liu · Yanxin Xi · Yong Li

[ Exhibit Hall I ]

Abstract
Urban research involves a wide range of scenarios and tasks that require the understanding of multi-modal data, such as structured geospatial data, trajectory data, satellite image data, and street view image data. Current methods often focus on specific data types and lack a unified framework in the urban field for processing them comprehensively. The recent success of multi-modal large language models (MLLMs) presents a promising opportunity to overcome this limitation. In this paper, we introduce UrbanLLaVA, a multi-modal large language model designed to process these four types of data simultaneously and achieve strong performance across diverse urban tasks compared with general MLLMs. In UrbanLLaVA, we first curate a diverse urban instruction dataset encompassing both single-modal and cross-modal urban data, spanning from the location view to the global view of the urban environment. Additionally, we design an effective multi-stage training pipeline to ensure training stability and compatibility across various urban tasks. We also extend the existing benchmark for urban research to assess the performance of MLLMs across a wide range of urban tasks. Experimental results from three cities demonstrate that UrbanLLaVA outperforms open-source and commercial MLLMs in both single-modal tasks and complex cross-modal tasks and shows robust generalization abilities across cities. UrbanLLaVA sheds light …
Poster
Hao Zheng · Yuting Zheng · Hanbo Huang · Chaofan Sun · Enhui Liao · Lin Liu · Yi Han · Hao Zhou · Shiyu Liang

[ Exhibit Hall I ]

Abstract
Reconstructing atmospheric surface $\text{CO}_2$ is crucial for understanding climate dynamics and informing global mitigation strategies. Traditional inversion models achieve precise global $\text{CO}_2$ reconstruction but rely heavily on uncertain prior estimates of fluxes and emissions. Inspired by recent advances in data-driven weather forecasting, we explore whether data-driven models can reduce reliance on these priors. However, $\text{CO}_2$ reconstruction presents unique challenges, including complex spatio-temporal dynamics, periodic patterns and sparse observations. We propose $\text{CO}_2$-Net, a data-driven model that addresses these challenges without requiring extensive prior data. We formulate $\text{CO}_2$ reconstruction as solving a constrained advection-diffusion equation and derive three key components: physics-informed spatio-temporal factorization for capturing complex transport dynamics, wind-based embeddings for modeling periodic variations and a semi-supervised loss for integrating sparse $\text{CO}_2$ observations with dense meteorological data. $\text{CO}_2$-Net is designed in three sizes---small (S), base (B) and large (L)---to balance performance and efficiency. On CMIP6 reanalysis data, $\text{CO}_2$-Net (S) and (L) reduce RMSE by 11\% and 71\%, respectively, when compared to the best data-driven baseline. On real observations, $\text{CO}_2$-Net (L) achieves RMSE comparable to inversion models. The ablation study shows that the effectiveness of wind-based embedding and semi-supervised loss stems from their compatibility with our spatio-temporal factorization.
Poster
Pattaramanee Arsomngern · Sasikarn Khwanmuang · Matthias Nießner · Supasorn Suwajanakorn

[ Exhibit Hall I ]

Abstract
One practical approach to infer 3D scene structure from a single image is to retrieve a closely matching 3D model from a database and align it with the object in the image. Existing retrieve-and-align methods rely on supervised training with images and pose annotations, which limits them to a narrow set of object categories. To address this, we propose an unsupervised 9-DoF alignment method for inexact 3D models that requires no pose annotations and generalizes to unseen categories. Our approach derives a novel feature space based on foundation features that ensures multi-view consistency and overcomes the symmetry ambiguities inherent in foundation features using a self-supervised triplet loss. Additionally, we introduce a texture-invariant pose refinement technique that performs dense alignment in normalized object coordinates, estimated through the enhanced feature space. We conduct extensive evaluations on the real-world ScanNet25k dataset, where our method outperforms SOTA unsupervised baselines by +4.3% mean alignment accuracy and is the only unsupervised approach to surpass the supervised ROCA by +2.7%. To assess generalization, we introduce SUN2CAD, a real-world test set with 20 novel object categories, where our method achieves SOTA results without prior training on them.
Poster
Yahao Liu · Qin Wang · Lixin Duan · Wen Li

[ Exhibit Hall I ]

Abstract
Regression is fundamental in computer vision and is widely used in various tasks including age estimation, depth estimation, and target localization. However, real-world data often exhibit imbalanced distributions, making regression models perform poorly, especially for target values with rare observations (known as the imbalanced regression problem). In this paper, we reframe imbalanced regression as an imbalanced generalization problem. To tackle this, we look into the loss sharpness property for measuring the generalization ability of regression models in the observation space. Namely, given a certain perturbation on the model parameters, we check how model performance changes according to the loss values of different target observations. We propose a simple yet effective approach called Balanced Sharpness-Aware Minimization (BSAM) to enforce a uniform generalization ability of regression models over the entire observation space. In particular, we start from traditional sharpness-aware minimization and then introduce a novel targeted reweighting strategy to homogenize the generalization ability across the observation space, which guarantees a theoretical generalization bound. Extensive experiments on multiple vision regression tasks, including age and depth estimation, demonstrate that our BSAM method consistently outperforms existing approaches. The code will be available soon.
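An illustrative sketch of pairing sharpness-aware minimization with target-frequency reweighting is given below; the reweighting rule, the toy model, and the hyperparameters are assumptions, not the paper's exact BSAM procedure.

```python
# Hedged sketch: a standard SAM step whose descent loss is reweighted per sample so that
# rare target values also contribute to finding flat minima.
import torch
import torch.nn as nn

def bsam_step(model, opt, x, y, bin_weight, rho=0.05):
    """bin_weight: callable mapping targets y to per-sample weights (e.g., inverse bin frequency)."""
    loss = nn.functional.mse_loss(model(x).squeeze(-1), y)
    loss.backward()

    # 1) Ascent step: perturb weights toward the locally sharpest direction.
    grad_norm = torch.sqrt(sum((p.grad ** 2).sum() for p in model.parameters() if p.grad is not None))
    eps = []
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is None:
                eps.append(None); continue
            e = rho * p.grad / (grad_norm + 1e-12)
            p.add_(e); eps.append(e)
    opt.zero_grad()

    # 2) Descent step: reweighted loss evaluated at the perturbed point.
    w = bin_weight(y)
    per_sample = (model(x).squeeze(-1) - y) ** 2
    (w * per_sample).mean().backward()

    # 3) Restore the original weights and apply the optimizer update.
    with torch.no_grad():
        for p, e in zip(model.parameters(), eps):
            if e is not None:
                p.sub_(e)
    opt.step(); opt.zero_grad()

# Toy usage: upweight a rare target range (e.g., old ages).
model = nn.Linear(8, 1)
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
x, y = torch.randn(16, 8), torch.rand(16) * 80
bsam_step(model, opt, x, y, bin_weight=lambda t: 1.0 + (t > 60).float())
```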
Poster
CHEN LIANG · Zhicheng Shi · Wenguan Wang · Yi Yang

[ Exhibit Hall I ]

Abstract
Language-based human motion understanding focuses on describing human motions using natural language descriptions. Conversely, human motion generation aims to generate human motions from textual inputs. Despite significant progress in both fields, further advancements are hindered by two primary challenges: (i) Both tasks heavily rely on vast amounts of paired motion-language data for model training. However, human labeling is costly, making it increasingly unsustainable as model scales increase. (ii) Existing models often learn the two tasks in parallel. The strong reciprocity between them has not been fully explored. In response, this work proposes Dual Reciprocal Learning (DRL) for language-based human motion understanding and generation. DRL establishes a symmetric learning framework where both tasks collaboratively evolve in a closed-loop, bootstrapping manner, effectively leveraging the reciprocity between them. In DRL, the tasks serve as evaluators for each other, enabling the generation of informative feedback signals even with easily acquired unidirectional motion or language data. Furthermore, to mitigate dataset-specific bias in existing evaluations, we propose a generalized protocol that extends evaluation to a general-domain cross-modal feature space. Experimental results on standard benchmarks demonstrate that DRL achieves remarkable performance boosts over state-of-the-art models in both tasks. Our code will be made publicly available.
Poster
Junseong Shin · Seungwoo Chung · Yunjeong Yang · Tae Hyun Kim

[ Exhibit Hall I ]

Abstract
Dehazing involves removing haze or fog from images to restore clarity and improve visibility by estimating atmospheric scattering effects. While deep learning methods show promise, the lack of paired real-world training data and the resulting domain gap hinder generalization to real-world scenarios. In this context, physics-grounded learning becomes crucial; however, traditional methods based on the Atmospheric Scattering Model (ASM) often fall short in handling real-world complexities and diverse haze patterns. To solve this problem, we propose HazeFlow, a novel ODE-based framework that reformulates ASM as an ordinary differential equation (ODE). Inspired by Rectified Flow (RF), HazeFlow learns an optimal ODE trajectory to map hazy images to clean ones, enhancing real-world dehazing performance with only a single inference step. Additionally, we introduce a non-homogeneous haze generation method using Markov Chain Brownian Motion (MCBM) to address the scarcity of paired real-world data. By simulating realistic haze patterns through MCBM, we enhance the adaptability of HazeFlow to diverse real-world scenarios. Through extensive experiments, we demonstrate that HazeFlow achieves state-of-the-art performance across various real-world dehazing benchmark datasets.
Poster
Yuan Wang · Yuxin Chen · Zhongang Qi · Lijun Liu · Jile Jiao · Xuetao Feng · Yujia Liang · Ying Shan · Zhipeng Zhang

[ Exhibit Hall I ]

Abstract
3D vision-language (3D-VL) reasoning, connecting natural language with the 3D physical world, represents a milestone in advancing spatial intelligence. While transformer-based methods dominate 3D-VL research, their quadratic complexity and simplistic positional embedding mechanisms severely limit effective modeling of long-range 3D-VL dependencies and spatial relationships in 3D-VL tasks. State Space Models (SSMs) have emerged as promising linear-complexity alternatives for sequential data processing, and their inherent selection mechanism offers notable capability for spatial modeling. Despite this potential, the straightforward adoption of Mamba for 3D-VL tasks encounters two obstacles: (1) how to perceive the position of 3D objects and understand complex spatial relationships, and (2) how to achieve thorough synergies of multi-modal features. In this paper, we propose Mamba-3VL, a pioneering 3D-VL framework to model complex intra- and inter-modality correlations and enhance spatial relation reasoning, while guaranteeing top-tier performance, high efficiency, and generalization potential for 3D-VL tasks. Specifically, the Mamba Mixer explicitly models 3D-VL interaction via channel twisting and a relation-prioritized spatial scanning policy. It maximally retains the spatial relations of object-centric features. To further provide precise spatial encoding for Mamba, we develop an Instance-aware Dynamic Position Adapter (IDPA) to dynamically adjust instance-specific positional embeddings and enhance the local spatial relations of 3D objects. Extensive results validate that Mamba-3VL trumps other competitors …
Poster
Hai Wu · Hongwei Lin · Xusheng Guo · Xin Li · Mingming Wang · Cheng Wang · Chenglu Wen

[ Exhibit Hall I ]

Abstract
The performance of unsupervised 3D object classification and bounding box regression relies heavily on the quality of initial pseudo-labels. Traditionally, the labels of classification and regression are represented by \textbf{a single set} of candidate boxes generated by motion or geometry heuristics. However, due to the similarity of many objects to the background in shape or lack of motion, the labels often fail to achieve high accuracy in two tasks simultaneously. Using these labels to directly train the network results in decreased detection performance. To address this challenge, we introduce Motal that performs unsupervised 3D object detection by Modality and task-specific knowledge transfer. Motal decouples the pseudo-labels into two sets of candidates, from which Motal discovers classification knowledge by motion and image appearance prior, and discovers box regression knowledge by geometry prior, respectively. Motal finally transfers all knowledge to a single student network by a TMT (Task-specific Masked Training) scheme, attaining high performance in both classification and regression. Motal can greatly enhance various unsupervised methods by about 2x mAP. For example, on the WOD test set, Motal improves the state-of-the-art CPD by 21.56% mAP L1 (from 20.54% to 42.10%) and 19.90% mAP L2 (from 18.18% to 38.08%). These achievements highlight the …
Poster
Rui Wang · Quentin Lohmeyer · Mirko Meboldt · Siyu Tang

[ Exhibit Hall I ]

Abstract
Reconstructing clean, distractor-free 3D scenes from real-world captures remains a significant challenge, particularly in highly dynamic and cluttered settings such as egocentric videos. To tackle this problem, we introduce DeGauss, a simple and robust self-supervised framework for dynamic scene reconstruction based on a decoupled dynamic-static Gaussian Splatting design. DeGauss models dynamic elements with foreground Gaussians and static content with background Gaussians, using a probabilistic mask to coordinate their composition and enable independent yet complementary optimization. DeGauss generalizes robustly across a wide range of real-world scenarios, from casual image collections to long, dynamic egocentric videos, without relying on complex heuristics or extensive supervision. Experiments on benchmarks including NeRF-on-the-go, ADT, AEA, Hot3D, and EPIC-Fields demonstrate that DeGauss consistently outperforms existing methods, establishing a strong baseline for generalizable, distractor-free 3D reconstruction in highly dynamic, interaction-rich environments.
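A minimal sketch of the decoupled composition described above, assuming the foreground (dynamic) and background (static) Gaussians have already been rendered separately; the function and tensor names are illustrative, not the paper's implementation.

```python
import torch

def compose_dynamic_static(fg_rgb, bg_rgb, fg_mask_logits):
    """Blend separately rendered dynamic (foreground) and static (background)
    images with a probabilistic mask. Shapes: (B, 3, H, W) for the renders,
    (B, 1, H, W) for the mask logits."""
    m = torch.sigmoid(fg_mask_logits)       # per-pixel probability of "dynamic"
    return m * fg_rgb + (1.0 - m) * bg_rgb  # complementary composition

fg, bg = torch.rand(1, 3, 32, 32), torch.rand(1, 3, 32, 32)
print(compose_dynamic_static(fg, bg, torch.randn(1, 1, 32, 32)).shape)
```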
Poster
Jiahao Zhang · Anoop Cherian · Cristian Rodriguez-Opazo · Weijian Deng · Stephen Gould

[ Exhibit Hall I ]

Abstract
Assembling furniture amounts to solving the discrete-continuous optimization task of selecting the furniture parts to assemble and estimating their connecting poses in a physically realistic manner. The problem is hampered by its combinatorially large yet sparse solution space, making learning to assemble a challenging task for current machine learning models. In this paper, we attempt to solve this task by leveraging the assembly instructions provided in diagrammatic manuals that typically accompany the furniture parts. Our key insight is to use the cues in these diagrams to split the problem into discrete and continuous phases. Specifically, we present Manual-PA, a transformer-based instruction Manual-guided 3D Part Assembly framework that learns to semantically align 3D parts with their illustrations in the manuals using a contrastive learning backbone toward predicting the assembly order, and infers the 6D pose of each part by relating it to the final furniture depicted in the manual. To validate the efficacy of our method, we conduct experiments on the benchmark PartNet dataset. Our results show that using the diagrams and the order of the parts leads to significant improvements in assembly performance against the state of the art. Further, Manual-PA demonstrates strong generalization to real-world IKEA furniture assembly …
Poster
Pingrui Zhang · Xianqiang Gao · Yuhan Wu · Kehui Liu · Dong Wang · Zhigang Wang · Bin Zhao · Yan Ding · Xuelong Li

[ Exhibit Hall I ]

Abstract
In mobile manipulation, navigation and manipulation are often treated as separate problems, resulting in a significant gap between merely approaching an object and engaging with it effectively. Many navigation approaches primarily define success by proximity to the target, often overlooking the necessity for optimal positioning that facilitates subsequent manipulation. To address this, we introduce \ours, a benchmark dataset comprising over 100k samples that provide training data for models to learn optimal final navigation positions for seamless transition to manipulation. Our dataset includes affordance-grounded floor labels collected from diverse kitchen environments, in which robotic mobile manipulators of different models attempt to grasp target objects amidst clutter. Using a fully automated pipeline, we simulate diverse real-world scenarios and generate affordance labels for optimal manipulation positions. Visual data are collected from RGB-D inputs captured by a first-person view camera mounted on the robotic arm, ensuring consistency in viewpoint during data collection. We also develop a lightweight baseline model, \ourmodel, for navigation affordance grounding that demonstrates promising performance on the \ours benchmark. Our approach enables models to learn affordance-based final positioning that accommodates different arm types and platform heights, thereby paving the way for more robust and generalizable integration of navigation and manipulation in …
Poster
Peiran Xu · Xicheng Gong · Yadong Mu

[ Exhibit Hall I ]

Abstract
In this work we concentrate on the task of goal-oriented Vision-and-Language Navigation (VLN). Existing methods often make decisions based on historical information, overlooking the future implications and long-term outcomes of their actions. In contrast, we aim to develop a foresighted agent. Specifically, we draw upon Q-learning to train a Q-model using large-scale unlabeled trajectory data, in order to learn general knowledge regarding the layout and object relations within indoor scenes. This model can generate a Q-feature, analogous to the Q-value in a traditional Q-network, for each candidate action, which describes the potential future information that may be observed after taking the specific action. Subsequently, a cross-modal future encoder integrates the task-agnostic Q-feature with navigation instructions to produce a set of action scores reflecting future prospects. These scores, when combined with the original scores based on history, facilitate an A*-style searching strategy to effectively explore the regions that are more likely to lead to the destination. Extensive experiments conducted on widely used goal-oriented VLN datasets validate the effectiveness of the proposed method.
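The sketch below shows one way history-based and future-prospect scores could drive an A*-style frontier search; `history_score` and `future_score` are hypothetical callables standing in for the paper's two score sources, and the toy graph at the end is only a usage example.

```python
import heapq

def a_star_style_search(start, neighbors, history_score, future_score, is_goal, max_steps=100):
    """Expand the candidate with the best combined (history + future) score,
    higher is better. All scoring callables are hypothetical stand-ins."""
    frontier = [(-(history_score(start) + future_score(start)), start, [start])]
    visited = set()
    for _ in range(max_steps):
        if not frontier:
            break
        neg_score, node, path = heapq.heappop(frontier)
        if is_goal(node):
            return path
        if node in visited:
            continue
        visited.add(node)
        for nxt in neighbors(node):
            if nxt not in visited:
                combined = history_score(nxt) + future_score(nxt)
                heapq.heappush(frontier, (-combined, nxt, path + [nxt]))
    return None

# Toy demo on a 4-node chain 0-1-2-3 with goal node 3.
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(a_star_style_search(
    0, lambda n: graph[n],
    history_score=lambda n: 0.0,
    future_score=lambda n: float(n),   # pretend a higher node id means closer to the goal
    is_goal=lambda n: n == 3))         # -> [0, 1, 2, 3]
```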
Poster
Yue Fan · Xiaojian Ma · Rongpeng Su · Jun Guo · Rujie Wu · Xi Chen · Qing Li

[ Exhibit Hall I ]

Abstract
This paper investigates the problem of understanding dynamic 3D scenes from egocentric observations, a key challenge in robotics and embodied AI. Unlike prior studies that explored this as long-form video understanding and utilized egocentric video only, we instead propose an LLM-based agent, Embodied VideoAgent, which constructs scene memory from both egocentric video and embodied sensory inputs (e.g. depth and pose sensing). We further introduce a VLM-based approach to automatically update the memory when actions or activities over objects are perceived. Embodied VideoAgent attains significant advantages over counterparts in challenging reasoning and planning tasks in 3D scenes, achieving gains of 6.5% on Ego4D-VQ3D, 2.6% on OpenEQA, and 15.3% on EnvQA. We have also demonstrated its potential in various embodied AI tasks including generating embodied interactions and perception for robot manipulation. The code and demo will be made public.
Poster
Shr-Ruei Tsai · Wei-Cheng Chang · Jie-Ying Lee · Chih-Hai Su · Yu-Lun Liu

[ Exhibit Hall I ]

Abstract
Lens flare significantly degrades image quality, impacting critical computer vision tasks like object detection and autonomous driving. Recent Single Image Flare Removal (SIFR) methods perform poorly when off-frame light sources are incomplete or absent. We propose LightsOut, a diffusion-based outpainting framework tailored to enhance SIFR by reconstructing off-frame light sources. Our method leverages a multitask regression module and LoRA fine-tuned diffusion model to ensure realistic and physically consistent outpainting results. Comprehensive experiments demonstrate LightsOut consistently boosts the performance of existing SIFR methods across challenging scenarios without additional retraining, serving as a universally applicable plug-and-play preprocessing solution.
Poster
Tianyi Zhao · Boyang Liu · Yanglei Gao · Yiming Sun · Maoxun Yuan · Xingxing Wei

[ Exhibit Hall I ]

Abstract
Multi-Modal Object Detection (MMOD), due to its stronger adaptability to various complex environments, has been widely applied in various applications. Extensive research is dedicated to RGB-IR object detection, primarily focusing on how to integrate complementary features from the RGB and IR modalities. However, these works neglect the insufficient mono-modality learning problem, i.e., the decreased feature extraction capability of each modality during multi-modal joint learning. This leads to an unreasonable but prevalent phenomenon--Fusion Degradation, which hinders the performance improvement of the MMOD model. Motivated by this, in this paper, we introduce linear probing evaluation to multi-modal detectors and rethink the multi-modal object detection task from the mono-modality learning perspective. Therefore, we construct a novel framework called M$^2$D-LIF, which consists of the Mono-Modality Distillation (M$^2$D) method and the Local Illumination-aware Fusion (LIF) module. The M$^2$D-LIF framework facilitates sufficient learning of each mono-modality during multi-modal joint training and explores a lightweight yet effective feature fusion manner to achieve superior object detection performance. Extensive experiments conducted on three MMOD datasets demonstrate that our M$^2$D-LIF effectively mitigates the Fusion Degradation phenomenon and outperforms the previous SOTA detectors.
Poster
Phillip Mueller · Talip Ünlü · Sebastian Schmidt · Marcel Kollovieh · Jiajie Fan · Stephan Günnemann · Lars Mikelsons

[ Exhibit Hall I ]

Abstract
Precise geometric control in image generation is essential for fields like engineering \& product design and the creative industries to control 3D object features accurately in 2D image space. Traditional 3D editing approaches are time-consuming and demand specialized skills, while current image-based generative methods lack accuracy in geometric conditioning. To address these challenges, we propose GeoDiffusion, a training-free framework for accurate and efficient geometric conditioning of 3D features in image generation. GeoDiffusion employs a class-specific 3D object as a geometric prior to define keypoints and parametric correlations in 3D space. We ensure viewpoint consistency through a rendered image of a reference 3D object, followed by style transfer to meet user-defined appearance specifications. At the core of our framework is GeoDrag, which improves the accuracy and speed of drag-based image editing on geometry-guidance tasks and on general DragBench instructions. Our results demonstrate that GeoDiffusion enables precise geometric modifications across various iterative design workflows.
Poster
Heng Su · Mengying Xie · Nieqing Cao · Yan Ding · Beichen Shao · Xianlei Long · Fuqiang Gu · Chao Chen

[ Exhibit Hall I ]

Abstract
In recent years, affordance detection has become essential for robotic manipulation in real-world scenes, where robots must autonomously interpret commands and perform actions. Current methods often focus on individual point cloud objects or simple semantic queries, limiting their effectiveness in diverse scenes and under complex instructions. To address this, we introduce OVA-Fields, a framework for affordance detection in 3D scenes with complex semantics. By integrating multilevel geometric encoding and enhanced semantic affordance embeddings, OVA-Fields maps user commands directly to operational parts, embedding enriched affordance information into the 3D scene. Experimental results demonstrate that OVA-Fields achieves 52.4\% mIoU on complex semantic real-world scenes and a 90\% success rate in real-world robot manipulation tasks (e.g., "take out some food from the refrigerator") using RGB-D sensing. Our approach enables the precise identification of operational parts, transforming natural language queries into targeted manipulations in real-world environments.
Poster
Jianhua Sun · Yuxuan Li · Jiude Wei · Xu Longfei · Wang Nange · Yining Zhang · Cewu Lu

[ Exhibit Hall I ]

Abstract
The acquisition of substantial volumes of 3D articulated object data is expensive and time-consuming, and consequently the scarcity of 3D articulated object data becomes an obstacle for deep learning methods to achieve remarkable performance in various articulated object understanding tasks. Meanwhile, pairing these object data with detailed annotations to enable training for various tasks is also difficult and labor-intensive to achieve. In order to expeditiously gather a significant number of 3D articulated objects with comprehensive and detailed annotations for training, we propose Articulated Object Procedural Generation toolbox, a.k.a. Arti-PG toolbox. Arti-PG toolbox consists of i) descriptions of articulated objects by means of a generalized structure program along with their analytic correspondence to the objects’ point cloud, ii) procedural rules about manipulations on the structure program to synthesize large-scale and diverse new articulated objects, and iii) mathematical descriptions of knowledge (e.g. affordance, semantics, etc.) to provide annotations to the synthesized object. Arti-PG has two appealing properties for providing training data for articulated object understanding tasks: i) objects are created with unlimited variations in shape through program-oriented structure manipulation, ii) Arti-PG is widely applicable to diverse tasks by easily providing comprehensive and detailed annotations. Arti-PG now supports the procedural generation of 26 …
Poster
Xiaoding Yuan · Prakhar Kaushik · Guofeng Zhang · Artur Jesslen · Adam Kortylewski · Alan Yuille

[ Exhibit Hall I ]

Abstract
Deep learning algorithms for object classification and 3D object pose estimation lack robustness to out-of-distribution factors such as synthetic stimuli, changes in weather conditions, and partial occlusion. Recently, a class of Neural Mesh Models has been developed in which objects are represented in terms of 3D meshes with learned features at the vertices. These models have shown robustness in small-scale settings involving 10 objects, but it is unclear whether they can be scaled up to hundreds of object classes. The main problem is that their training involves contrastive learning among the vertices of all object classes, which scales quadratically with the number of classes. We present a strategy that exploits the compositionality of the objects, i.e. the independence of the feature vectors of the vertices, which greatly reduces the training time while also improving the performance of the algorithms. We first restructure the per-vertex contrastive learning into contrasting within classes and between classes. Then we propose a process that dynamically decouples the contrast between classes which are rarely confused, and enhances the contrast between the vertices of classes that are most confused. Our large-scale 3D compositional model not only achieves state-of-the-art performance on the task of predicting classification and pose estimation …
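A toy sketch of the restructuring idea: the per-vertex contrast is split into a within-class term and a between-class term whose weight depends on how often two classes are confused, so rarely confused pairs contribute little. Every name here is illustrative; this is not the authors' training objective.

```python
import torch
import torch.nn.functional as F

def split_vertex_contrast(vertex_feats, class_ids, confusion, tau=0.07):
    """Illustrative decomposition of per-vertex contrast.
    Shapes: vertex_feats (N, D), class_ids (N,), confusion (C, C)."""
    feats = F.normalize(vertex_feats, dim=1)
    sim = feats @ feats.t() / tau
    same = class_ids[:, None] == class_ids[None, :]
    off_diag = ~torch.eye(len(feats), dtype=torch.bool)

    within = sim[same & off_diag].mean()                  # vertices of the same class
    weight = confusion[class_ids[:, None], class_ids[None, :]]
    between = (sim * weight)[~same].mean()                # only confused class pairs matter

    # Pushing both terms down keeps vertex features distinct, while the confusion
    # weight lets rarely-confused class pairs drop out of the between-class contrast.
    return within + between

feats = torch.randn(8, 16)
ids = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])
conf = torch.rand(4, 4)
print(split_vertex_contrast(feats, ids, conf))
```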
Poster
Yufeng Zhong · Chengjian Feng · Feng yan · Fanfan Liu · Liming Zheng · Lin Ma

[ Exhibit Hall I ]

Abstract
In language-guided visual navigation, agents locate target objects in unseen environments using natural language instructions. For reliable navigation in unfamiliar scenes, agents must possess strong perception, planning, and prediction capabilities. Additionally, when agents revisit previously explored areas during long-term navigation, they may retain irrelevant and redundant historical perceptions, leading to suboptimal results. In this work, we introduce \textbf{P3Nav}, a unified framework that integrates \textbf{P}erception, \textbf{P}lanning, and \textbf{P}rediction capabilities through \textbf{Multitask Collaboration} on navigation and embodied question answering (EQA) tasks, thereby enhancing navigation performance. Furthermore, P3Nav employs an \textbf{Adaptive 3D-aware History Sampling} strategy to effectively and efficiently utilize historical observations. By leveraging the large language models (LLM), P3Nav comprehends diverse commands and complex visual scenes, resulting in appropriate navigation actions. P3Nav achieves a 75\% success rate in object goal navigation on the $\mathrm{CHORES}$-$\mathbb{S}$ benchmark, setting a new state-of-the-art performance.
Poster
Chen-Liang Fan · Mingpei Cao · Chih-Chien Hung · Yuesheng Zhu

[ Exhibit Hall I ]

Abstract
Autofocus (AF) is essential for imaging systems, particularly in industrial applications such as automated optical inspection (AOI), where achieving precise focus is critical. Conventional AF methods rely on peak-searching algorithms that require dense focal sampling, making them inefficient in small depth-of-field (DoF) scenarios. Deep learning (DL)-based AF methods, while effective in general imaging, have a limited working range in small DoF conditions due to defocus uncertainty. In this work, we propose a novel AF framework that integrates an optical model-based sharpness indicator with a deep learning approach to predict sharpness from defocused images. We leverage sharpness estimation as a reliable focus measure and apply an adaptive adjustment algorithm to adjust the focus position based on the sharpness-to-distance mapping. This method effectively addresses defocus uncertainty and enables robust autofocus across a 35× DoF range. Experimental results on an AOI system demonstrate that our approach achieves reliable autofocus even from highly defocused starting points and remains robust across different textures and illumination conditions. Compared to conventional and existing DL-based approaches, our method offers improved precision, efficiency, and adaptability, making it suitable for industrial applications and small DoF scenarios.
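A schematic focus loop under the stated idea, assuming a learned sharpness predictor and a calibrated sharpness-to-distance mapping; all callables (`capture`, `predict_sharpness`, `sharpness_to_distance`, `move_lens`) are hypothetical.

```python
def autofocus(capture, predict_sharpness, sharpness_to_distance, move_lens,
              target_sharpness=0.95, max_iters=10):
    """Hedged sketch of a sharpness-driven focus loop, not the paper's algorithm."""
    s = 0.0
    for _ in range(max_iters):
        frame = capture()
        s = predict_sharpness(frame)      # learned focus measure in [0, 1]
        if s >= target_sharpness:
            return s                      # in focus, stop early
        # Map predicted sharpness to an estimated defocus step and move the lens.
        step = sharpness_to_distance(s)
        move_lens(step)
    return s
```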
Poster
Samuel Clarke · Suzannah Wistreich · Yanjie Ze · Jiajun Wu

[ Exhibit Hall I ]

Abstract
Understanding objects through multiple sensory modalities is fundamental to human perception, enabling cross-sensory integration and richer comprehension. For AI and robotic systems to replicate this ability, access to diverse, high-quality multi-sensory data is critical. Existing datasets are often limited by their focus on controlled environments, simulated objects, or restricted modality pairings. We introduce X-Capture, an open-source, portable, and cost-effective device for real-world multi-sensory data collection, capable of capturing correlated RGBD images, tactile readings, and impact audio. With a build cost under $1,000, X-Capture democratizes the creation of multi-sensory datasets, requiring only consumer-grade tools for assembly. Using X-Capture, we curate a sample dataset from 3,000 total points on 500 everyday objects from diverse, real-world environments, offering both richness and breadth. Our experiments demonstrate the value of both the quantity and the sensory breadth of our data for both pretraining and fine-tuning multi-modal representations for object-centric tasks such as cross-sensory retrieval and reconstruction. X-Capture lays the groundwork for advancing human-like sensory representations in AI, emphasizing scalability, accessibility, and real-world applicability.
Poster
Chen Lin · Weizhi Du · Zhixiang Min · Baochen She · Enrique Dunn · Sonya Hanson

[ Exhibit Hall I ]

Abstract
We explore a quaternion adjugate matrix-based representation for rotational motion in the Perspective-n-Point (PnP) problem. Leveraging quadratic quaternion terms within a Determinant Ratio Matrix (DRaM) estimation framework, we extend its application to perspective scenarios, providing a robust and efficient initialization for iterative PnP pose estimation. Notably, by solving the orthographic projection least-squares problem, DRaM provides a reliable initialization that enhances the accuracy and stability of iterative PnP solvers. Experiments on synthetic and real data demonstrate its efficiency, accuracy, and robustness, particularly under high noise conditions. Furthermore, our non-minimal formulation ensures numerical stability, making it effective for real-world applications.
Poster
Minchao Jiang · Shunyu Jia · Jiaming Gu · Xiaoyuan Lu · Guangming Zhu · Anqi Dong · zhang liang

[ Exhibit Hall I ]

Abstract
3D Gaussian Splatting (3DGS) has become the workhorse for high-quality, real-time rendering in novel view synthesis of 3D scenes. However, existing methods focus primarily on geometric and appearance modeling, lacking deeper scene understanding while also incurring high training costs that complicate the originally streamlined differentiable rendering pipeline. To this end, we propose VoteSplat, a novel 3D scene understanding framework that integrates Hough voting with 3DGS. Specifically, the Segment Anything Model (SAM) is utilized for instance segmentation, extracting objects and generating 2D vote maps. We then embed spatial offset vectors into Gaussian primitives. These offsets construct 3D spatial votes by associating them with 2D image votes, while depth distortion constraints refine localization along the depth axis. For open-vocabulary object localization, VoteSplat maps 2D image semantics to 3D point clouds via voting points, reducing the training costs associated with high-dimensional CLIP features while avoiding semantic ambiguity. Extensive experiments and ablation studies demonstrate VoteSplat's effectiveness in open-vocabulary 3D instance localization, 3D point cloud understanding, click-based 3D object localization, and hierarchical segmentation.
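As a rough sketch of the Hough-voting step, each Gaussian center can cast a 3D vote through its learned offset, with votes accumulated in a coarse voxel grid; the voxel accumulation and all names are illustrative simplifications, not the paper's exact procedure.

```python
import numpy as np

def hough_votes_to_centers(gaussian_xyz, offsets, grid_size=0.25):
    """Accumulate 3D votes (Gaussian centers + learned offsets) in a voxel grid
    and return the busiest cells as candidate object centers."""
    votes = gaussian_xyz + offsets                       # (N, 3) spatial votes
    cells = np.floor(votes / grid_size).astype(int)
    uniq, counts = np.unique(cells, axis=0, return_counts=True)
    order = np.argsort(-counts)
    # Cell centers ranked by how many Gaussians voted for them.
    return (uniq[order] + 0.5) * grid_size, counts[order]

xyz = np.random.rand(1000, 3)
off = np.zeros_like(xyz)
centers, support = hough_votes_to_centers(xyz, off)
print(centers[:3], support[:3])
```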
Poster
Siyuan Yao · Rui Zhu · Ziqi Wang · Wenqi Ren · Yanyang Yan · Xiaochun Cao

[ Exhibit Hall I ]

Abstract
Visual object tracking has made promising progress in past decades. Most existing approaches focus on learning target representations from well-conditioned daytime data, while for unconstrained real-world scenarios with adverse weather conditions, e.g. nighttime or foggy environments, the tremendous domain shift leads to significant performance degradation. In this paper, we propose UMDATrack, which is capable of maintaining high-quality target state prediction under various adverse weather conditions within a unified domain adaptation framework. Specifically, we first use a controllable scenario generator to synthesize a small amount of unlabeled videos (less than 2% of the frames in the source daytime datasets) in multiple weather conditions under the guidance of different text prompts. Afterwards, we design a simple yet effective domain-customized adapter (DCA), allowing the target objects' representation to rapidly adapt to various weather conditions without redundant model updating. Furthermore, to enhance the localization consistency between source and target domains, we propose a target-aware confidence alignment module (TCA) following the optimal transport theorem. Extensive experiments demonstrate that UMDATrack surpasses existing advanced visual trackers and sets new state-of-the-art performance by a significant margin.
Poster
Pengfei Ren · Jingyu Wang · Haifeng Sun · Qi Qi · Xingyu Liu · Menghao Zhang · Lei Zhang · Jing Wang · Jianxin Liao

[ Exhibit Hall I ]

Abstract
3D hand pose estimation plays a critical role in various human-computer interaction tasks. Single-frame 3D hand pose estimation methods have poor temporal smoothness and are easily affected by self-occlusion, which severely impacts their practical applicability. Traditional joint-based sequential pose estimation methods primarily focus on the human body and struggle to handle the complex hand structure, high degrees of freedom in hand motion, and rapidly changing hand motion trends. To address these challenges, we propose a prior-aware dynamic temporal modeling framework for sequential 3D hand pose estimation. We introduce a flexible memory mechanism to model hand prior information, which alleviates the scale and depth ambiguity in single-frame hand pose estimation. Additionally, we propose a dynamic temporal convolution module that adjusts the receptive field size and feature aggregation weights based on the motion information at each moment, effectively capturing rapid motion trends. By decoupling dynamic temporal modeling at the joint and hand levels, our method captures both subtle short-term variations and long-term motion trends, significantly improving the smoothness and accuracy of hand pose estimation. Experiments on four public datasets demonstrate that our method achieves the state-of-the-art results in terms of hand pose estimation accuracy and temporal smoothness.
Poster
Chen Gao · Shuo Zhang · Youfang Lin

[ Exhibit Hall I ]

Abstract
Disparity estimation is an essential step in processing and analyzing Light Field (LF) images. Recent methods construct a cost volume to exploit the correspondence of the LFs over a preset maximum disparity, limiting their ability to process large-parallax scenes. Different from constructing a cost volume, the self-attention mechanism calculates the parallax attention between epipolar lines to find matching points. However, since different LF views have different baselines with respect to the central view, their disparity scales in parallax attention also differ. Moreover, if the matching information is occluded in one view, the disparity information can be explored through other views. Therefore, mapping these attentions to the same scale and selecting effective matching information are key points for disparity estimation from parallax attention. In this paper, we explore parallax attention for LFs and design an unsupervised method, named Epipolar Consistent Attention Aggregation Network (ECAAN). We first introduce an epipolar consistent scale unification block that considers the consistency relationships to standardize the disparity scales of the parallax attention maps. Based on the intra-properties and inter-relationships of parallax attention, we further propose a consistent occlusion-free aggregation block to integrate information from occlusion-free areas. Besides, we design an improved …
Poster
Tianma Shen · Aditya Shrish Puranik · James Vong · Vrushabh Deogirikar · Ryan Fell · Julianna Dietrich · Maria Kyrarini · Christopher Kitts · David Jeong

[ Exhibit Hall I ]

Abstract
Egocentric human body estimation allows for the inference of user body pose and shape from a wearable camera's first-person perspective. Although research has used pose estimation techniques to overcome self-occlusions and image distortions caused by head-mounted fisheye images, similar advances in 3D human mesh recovery (HMR) techniques have been limited. We introduce $\textbf{Fish2Mesh}$, a fisheye-aware transformer-based model designed for 3D egocentric human mesh recovery. We propose an egocentric position embedding block to generate an ego-specific position table for the Swin Transformer to reduce fisheye image distortion. Our model utilizes multi-task heads for SMPL parametric regression and camera translations, estimating 3D and 2D joints as auxiliary loss to support model training. To address the scarcity of egocentric camera data, we create a training dataset by employing the pre-trained 4D-Human model and third-person cameras for weak supervision. Our experiments demonstrate that Fish2Mesh outperforms previous state-of-the-art 3D HMR models.
Poster
Jinhyung Park · Javier Romero · Shunsuke Saito · Fabian Prada · Takaaki Shiratori · Yichen Xu · Federica Bogo · Shoou-I Yu · Kris Kitani · Rawal Khirodkar

[ Exhibit Hall I ]

Abstract
Parametric body models offer expressive 3D representation of humans across a wide range of poses, shapes, and facial expressions, typically derived by learning a basis over registered 3D meshes. However, existing human mesh modeling approaches struggle to capture detailed variations across diverse body poses and shapes, largely due to limited training data diversity and restrictive modeling assumptions. Moreover, the common paradigm first optimizes the external body surface using a linear basis, then regresses internal skeletal joints from surface vertices. This approach introduces problematic dependencies between internal skeleton and outer soft tissue, limiting direct control over body height and bone lengths. To address these issues, we present ATLAS, a high-fidelity body model learned from 600k high-resolution scans captured using 240 synchronized cameras. Unlike previous methods, we explicitly decouple the shape and skeleton bases by grounding our mesh representation in the human skeleton. This decoupling enables enhanced shape expressivity, fine-grained customization of body attributes, and keypoint fitting independent of external soft-tissue characteristics. ATLAS outperforms existing methods by fitting unseen subjects in diverse poses more accurately, and quantitative evaluations show that our non-linear pose correctives more effectively capture complex poses compared to linear models. The code and model will be made publicly available.
Poster
Bowen Chen · Yun Sing Koh · Gillian Dobbie

[ Exhibit Hall I ]

Abstract
Traditional image segmentation methods struggle with fine-grained pattern extraction, especially in an unsupervised setting without labeled data. Shallow and deep learning approaches either lack structural coherence or focus on object-level segmentation rather than internal textures. Additionally, existing methods often fail to generalize across diverse animal species due to variations in pattern complexity and lighting. We introduce GloPER, an unsupervised segmentation framework that extracts fine-grained animal patterns without labeled supervision. By enforcing local image reconstruction with only two colors per region, GloPER captures structured patterns while mitigating the effects of shadows and lighting inconsistencies. Given the lack of finely labeled data, we construct a dataset of 10 animal species, each with at least 100 well-labeled images, enabling direct segmentation assessment. Experimental results show that GloPER outperforms both shallow and deep segmentation baselines, with a 42.44\% higher DICE score on average across all 10 animal species. We also assess its effectiveness through animal re-identification (ReID), where GloPER's extracted binary patterns achieve superior accuracy, in some cases exceeding full-image ReID performance, underscoring the discriminative power of structured segmentation.
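The two-colors-per-region constraint can be pictured with a tiny k-means (k=2) on a local patch: the two cluster colors reconstruct the patch and the binary assignment acts as the extracted pattern. This is an illustrative stand-in, not GloPER's learned objective.

```python
import numpy as np

def two_color_patch(patch, iters=10):
    """Quantize a local RGB patch to two colors with a small k-means (k=2)."""
    pixels = patch.reshape(-1, 3).astype(np.float64)
    centers = pixels[[0, -1]].copy()                     # crude initialization
    for _ in range(iters):
        d = ((pixels[:, None] - centers[None]) ** 2).sum(-1)
        labels = d.argmin(1)
        for k in range(2):
            if (labels == k).any():
                centers[k] = pixels[labels == k].mean(0)
    recon = centers[labels].reshape(patch.shape)         # two-color reconstruction
    return recon, labels.reshape(patch.shape[:2])        # binary pattern

patch = np.random.rand(16, 16, 3)
recon, pattern = two_color_patch(patch)
print(recon.shape, pattern.shape)
```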
Poster
Yuqian Fu · Runze Wang · Bin Ren · Guolei Sun · Biao Gong · Yanwei Fu · Danda Pani Paudel · Xuanjing Huang · Luc Gool

[ Exhibit Hall I ]

Abstract
Bridging the gap between ego-centric and exo-centric views has been a long-standing question in computer vision. In this paper, we focus on the emerging Ego-Exo object correspondence task, which aims to understand object relations across ego-exo perspectives through segmentation. While numerous segmentation models have been proposed, most operate on a single image (view), making them impractical for cross-view scenarios. PSALM, a recently proposed segmentation method, stands out as a notable exception with its demonstrated zero-shot ability on this task. However, due to the drastic viewpoint change between ego and exo, PSALM fails to accurately locate and segment objects, especially in complex backgrounds or when object appearances change significantly. To address these issues, we propose ObjectRelator, a novel approach featuring two key modules: Multimodal Condition Fusion (MCFuse) and SSL-based Cross-View Object Alignment (XObjAlign). MCFuse introduces language as an additional cue, integrating both visual masks and textual descriptions to improve object localization and prevent incorrect associations. XObjAlign enforces cross-view consistency through self-supervised alignment, enhancing robustness to object appearance variations. Extensive experiments demonstrate ObjectRelator’s effectiveness on the large-scale Ego-Exo4D benchmark and HANDAL-X (an adapted dataset for cross-view segmentation) with state-of-the-art performance. Codes and models will be released.
Poster
Mohammadreza Salehi · Shashanka Venkataramanan · Ioana Simion · Stratis Gavves · Cees Snoek · Yuki Asano

[ Exhibit Hall I ]

Abstract
Dense self-supervised learning has shown great promise for learning pixel- and patch-level representations, but extending it to videos remains challenging due to the complexity of motion dynamics. Existing approaches struggle as they rely on static augmentations that fail under object deformations, occlusions, and camera movement, leading to inconsistent feature learning over time. We propose a motion-guided self-supervised learning framework that clusters dense point tracks to learn spatiotemporally consistent representations. By leveraging an off-the-shelf point tracker, we extract long-range motion trajectories and optimize feature clustering through a momentum-encoder-based optimal transport mechanism. To ensure temporal coherence, we propagate cluster assignments along tracked points, enforcing feature consistency across views despite viewpoint changes. Integrating motion as an implicit supervisory signal, our method learns representations that generalize across frames, improving robustness in dynamic scenes and challenging occlusion scenarios. By initializing from strong image-pretrained models and leveraging video data for training, we improve state-of-the-art by 1\% to 6\% on six image and video datasets and four evaluation benchmarks.
Poster
Spyros Kondylatos · Nikolaos Ioannis Bountos · Dimitrios Michail · Xiao Xiang Zhu · Gustau Camps-Valls · Ioannis Papoutsis

[ Exhibit Hall I ]

Abstract
Recent advances in Computer Vision have introduced the concept of pretrained representation uncertainty, enabling zero-shot uncertainty estimation. This holds significant potential for Earth Observation (EO), where trustworthiness is critical, yet the complexity of EO data poses challenges to uncertainty-aware methods. In this work, we investigate the generalization of representation uncertainty in EO, considering the domain's unique semantic characteristics. We pretrain uncertainties on large EO datasets and propose an evaluation framework to assess their zero-shot performance in multi-label classification and segmentation EO tasks. Our findings reveal that, unlike uncertainties pretrained on natural images, EO-pretraining exhibits strong generalization across unseen EO domains, geographic locations, and target granularities, while maintaining sensitivity to variations in ground sampling distance. We demonstrate the practical utility of pretrained uncertainties showcasing their alignment with task-specific uncertainties in downstream tasks, their sensitivity to real-world EO image noise, and their ability to generate spatial uncertainty estimates out-of-the-box. In this study, we explore representation uncertainty in EO, highlighting its strengths and limitations, laying the groundwork for future research in the field. Code and model checkpoints will be publicly released.
Poster
Yusuke Yoshiyasu · Leyuan Sun · Ryusuke Sagawa

[ Exhibit Hall I ]

Abstract
In this paper, we introduce MeshMamba, a neural network model for learning 3D articulated mesh models by employing the recently proposed Mamba State Space Models (SSMs). MeshMamba is efficient and scalable in handling a large number of input tokens, enabling the generation and reconstruction of body mesh models with approximately 10,000 vertices. The key to effectively learning MeshMamba is the serialization technique that converts mesh vertices into orderings that are easily processed by Mamba. This is achieved by sorting the vertices based on body part annotations or the 3D vertex locations of a template mesh, such that the ordering respects the structure of articulated shapes. Based on MeshMamba, we design 1) MambaDiff3D, a denoising diffusion model for generating 3D articulated meshes, and 2) Mamba-HMR, a 3D human mesh recovery model which reconstructs human body shape and pose from a single image. Experimental results show that MambaDiff3D can generate dense 3D human meshes in clothes, with grasping hands, etc., and outperforms previous approaches on the 3D human shape generation task. Also, Mamba-HMR extends the ability of previous non-parametric human mesh recovery approaches, which were limited to handling body-only poses using around 500 vertex tokens, to the whole-body setting with face …
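A minimal sketch of the serialization step, assuming per-vertex body-part labels and a template mesh are available: vertices are ordered by part first and by template coordinates second, yielding a structure-respecting 1D token sequence. Names are illustrative, not the released implementation.

```python
import numpy as np

def serialize_vertices(template_xyz, part_ids):
    """Order mesh vertices into a 1D sequence for an SSM:
    primary key = body-part label, then template z, y, x coordinates."""
    x, y, z = template_xyz[:, 0], template_xyz[:, 1], template_xyz[:, 2]
    # np.lexsort sorts by the last key first, so part_ids is the primary key.
    order = np.lexsort((x, y, z, part_ids))
    return order  # indices used to reorder vertex tokens before the SSM

verts = np.random.rand(10000, 3).astype(np.float32)
parts = np.random.randint(0, 24, size=10000)
print(serialize_vertices(verts, parts)[:10])
```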
Poster
Mingxuan Wu · Huang Huang · Justin Kerr · Chung Min Kim · Anthony Zhang · Brent Yi · Angjoo Kanazawa

[ Exhibit Hall I ]

Abstract
Whether snipping with scissors or opening a box, humans can quickly understand the 3D configurations of familiar objects. For novel objects, we can resort to long-form inspection to build intuition. The more we observe the object, the better we get at predicting its 3D state immediately. Existing systems, however, are limited to either optimizing underlying representations from multi-view observations or training a feed-forward predictor from supervised datasets. We introduce Predict-Optimize-Distill (POD), a self-improving framework that interleaves prediction and optimization in a mutually reinforcing cycle to achieve better 4D object understanding with increasing observation time. Given a multi-view object scan and a long-form monocular video of human-object interaction, POD iteratively trains a neural network to predict local part poses from RGB frames, uses this predictor to initialize a global optimization which refines output poses through inverse rendering, then finally distills the results of optimization back into the model by generating synthetic self-labeled training data from novel viewpoints. Each iteration improves both the predictive model and the optimized motion trajectory, creating a virtuous cycle that bootstraps its own training data to learn about the pose configurations of an object. We also introduce a quasi-multiview mining strategy for reducing depth ambiguity by leveraging …
Poster
Yue Li · Qi Ma · Runyi Yang · Huapeng Li · Mengjiao Ma · Bin Ren · Nikola Popovic · Nicu Sebe · Ender Konukoglu · Theo Gevers · Luc Gool · Martin R. Oswald · Danda Pani Paudel

[ Exhibit Hall I ]

Abstract
Recognizing arbitrary or previously unseen categories is essential for comprehensive real-world 3D scene understanding. Currently, all existing methods rely on 2D or textual modalities during training, or together at inference. This highlights a clear absence of a model capable of processing 3D data alone for learning semantics end-to-end, along with the necessary data to train such a model. Meanwhile, 3D Gaussian Splatting (3DGS) has emerged as the de facto standard for 3D scene representation across various vision tasks. However, effectively integrating semantic reasoning into 3DGS in a generalizable fashion remains an open challenge. To address these limitations, we introduce SceneSplat, to our knowledge the first large-scale 3D indoor scene understanding approach that operates natively on 3DGS. Furthermore, we propose a self-supervised learning scheme that unlocks rich 3D feature learning from unlabeled scenes. In order to power the proposed methods, we introduce SceneSplat-7K, the first large-scale 3DGS dataset for indoor scenes, comprising 6868 scenes derived from 7 established datasets like ScanNet, Matterport3D, etc. Generating SceneSplat-7K required computational resources equivalent to 119 GPU-days on an L4 GPU, enabling standardized benchmarking for 3DGS-based reasoning for indoor scenes. Our exhaustive experiments on SceneSplat-7K demonstrate the significant benefit of the proposed methods over the established baselines. …
Poster
Chen Zhao · Xuan Wang · Tong Zhang · Saqib Javed · Mathieu Salzmann

[ Exhibit Hall I ]

Abstract
3D Gaussian Splatting (3DGS) has demonstrated remarkable effectiveness in novel view synthesis (NVS). However, 3DGS tends to overfit when trained with sparse views, limiting its generalization to novel viewpoints. In this paper, we address this overfitting issue by introducing Self-Ensembling Gaussian Splatting (SE-GS). We achieve self-ensembling by incorporating an uncertainty-aware perturbation strategy during training. A $\mathbf{\Delta}$-model and a $\mathbf{\Sigma}$-model are jointly trained on the available images. The $\mathbf{\Delta}$-model is dynamically perturbed based on rendering uncertainty across training steps, generating diverse perturbed models with negligible computational overhead. Discrepancies between the $\mathbf{\Sigma}$-model and these perturbed models are minimized throughout training, forming a robust ensemble of 3DGS models. This ensemble, represented by the $\mathbf{\Sigma}$-model, is then used to generate novel-view images during inference. Experimental results on the LLFF, Mip-NeRF360, DTU, and MVImgNet datasets demonstrate that our approach enhances NVS quality under few-shot training conditions, outperforming existing state-of-the-art methods.
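A heavily simplified sketch of one training step under the described scheme, with a generic differentiable `render` callable standing in for the 3DGS rasterizer: the Δ-model is perturbed in proportion to an uncertainty estimate, and the Σ-model is pulled toward the perturbed render. This is an assumption-laden illustration, not the actual pipeline.

```python
import torch

def se_training_step(delta_params, sigma_params, render, gt_image, uncertainty, lr=1e-3):
    """Schematic self-ensembling step over two parameter dicts of leaf tensors
    (requires_grad=True). `render` is a hypothetical differentiable renderer."""
    # 1) Uncertainty-aware perturbation of the Delta-model's parameters.
    perturbed = {k: v + uncertainty * torch.randn_like(v) for k, v in delta_params.items()}

    # 2) Photometric loss for the Delta-model plus a discrepancy term for the Sigma-model.
    img_delta = render(delta_params)
    img_pert = render(perturbed)
    img_sigma = render(sigma_params)
    loss = (img_delta - gt_image).abs().mean() + (img_sigma - img_pert.detach()).abs().mean()

    loss.backward()
    with torch.no_grad():
        for p in list(delta_params.values()) + list(sigma_params.values()):
            if p.grad is not None:
                p -= lr * p.grad   # plain SGD update for illustration only
                p.grad = None
    return loss.item()
```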
Poster
Shaoyuan Xie · Lingdong Kong · Yuhao Dong · Chonghao Sima · Wenwei Zhang · Qi Alfred Chen · Ziwei Liu · Liang Pan

[ Exhibit Hall I ]

Abstract
Recent advancements in Vision-Language Models (VLMs) have fueled interest in autonomous driving applications, particularly for interpretable decision-making. However, the assumption that VLMs provide visually grounded and reliable driving explanations remains unexamined. To address this, we introduce DriveBench, a benchmark evaluating 12 VLMs across 17 settings, covering 19,200 images, 20,498 QA pairs, and four key driving tasks. Our findings reveal that VLMs often generate plausible responses from general knowledge or textual cues rather than true visual grounding, especially under degraded or missing visual inputs. This behavior, concealed by dataset imbalances and insufficient evaluation metrics, poses significant risks in safety-critical scenarios like autonomous driving. We further observe that VLMs possess inherent corruption-awareness but only explicitly acknowledge these issues when directly prompted. Given the challenges and inspired by the inherent corruption awareness, we propose Robust Agentic Utilization (RAU), leveraging VLMs’ corruption awareness and agentic planning with external tools to enhance perception reliability for downstream tasks. Our study challenges existing evaluation paradigms and provides a roadmap toward more robust and interpretable autonomous driving systems.
Poster
Changwoon Choi · Jeongjun Kim · Geonho Cha · Minkwan Kim · Dongyoon Wee · Young Min Kim

[ Exhibit Hall I ]

Abstract
Recent works on dynamic 3D neural field reconstruction assume the input from synchronized multi-view videos whose poses are known. The input constraints are often not satisfied in real-world setups, making the approach impractical. We show that unsynchronized videos from unknown poses can generate dynamic neural fields as long as the videos capture human motion. Humans are one of the most common dynamic subjects captured in videos, and their shapes and poses can be estimated using state-of-the-art libraries. While noisy, the estimated human shape and pose parameters provide a decent initialization point to start the highly non-convex and under-constrained problem of training a consistent dynamic neural representation. Given the shape and pose parameters of humans in individual frames, we formulate methods to calculate the time offsets between videos, followed by camera pose estimations that analyze the 3D joint positions. Then, we train the dynamic neural fields employing multiresolution grids while we concurrently refine both time offsets and camera poses. The setup still involves optimizing many parameters; therefore, we introduce a robust progressive learning strategy to stabilize the process. Experiments show that our approach achieves accurate spatio-temporal calibration and high-quality scene reconstruction in challenging conditions.
Poster
Hao Zhang · Haolan Xu · Chun Feng · Varun Jampani · Narendra Ahuja

[ Exhibit Hall I ]

Abstract
Skinning and rigging are fundamental components in animation, articulated object reconstruction, motion transfer, and 4D generation. Existing approaches predominantly rely on Linear Blend Skinning (LBS), due to its simplicity and differentiability. However, LBS introduces artifacts such as volume loss and unnatural deformations, and it fails to model elastic materials like soft tissues, fur, and flexible appendages (e.g., elephant trunks, ears, and fatty tissues). In this work, we propose \textbf{PhysRig}: a differentiable physics-based skinning and rigging framework that overcomes these limitations by embedding the rigid skeleton into a volumetric representation (e.g., a tetrahedral mesh), which is simulated as a deformable soft-body structure driven by the animated skeleton. Our method leverages continuum mechanics and discretizes the object as particles embedded in an Eulerian background grid to ensure differentiability with respect to both material properties and skeletal motion. Additionally, we introduce material prototypes, significantly reducing the learning space while maintaining high expressiveness. To evaluate our framework, we construct a comprehensive synthetic dataset using meshes from Objaverse, The Amazing Animals Zoo, and MixaMo, covering diverse object categories and motion patterns. Our method consistently outperforms traditional LBS-based approaches, generating more realistic and physically plausible results. Furthermore, we demonstrate the applicability of our framework in the …
Poster
Yingping Liang · Yutao Hu · Wenqi Shao · Ying Fu

[ Exhibit Hall I ]

Abstract
Feature matching plays a fundamental role in many computer vision tasks, yet existing methods rely heavily on scarce and clean multi-view image collections, which constrains their generalization to diverse and challenging scenarios. Moreover, conventional feature encoders are typically trained on single-view 2D images, limiting their capacity to capture 3D-aware correspondences. In this paper, we propose a novel two-stage framework that lifts 2D images to 3D space, named Lift to Match (L2M), taking full advantage of large-scale and diverse single-view images. Specifically, in the first stage, we learn a 3D-aware feature encoder using a combination of multi-view image synthesis and a 3D feature Gaussian representation, which injects 3D geometry knowledge into the encoder. In the second stage, a novel-view rendering strategy, combined with large-scale synthetic data generation from single-view images, is employed to learn a feature decoder for robust feature matching, thus achieving generalization across diverse domains. Extensive experiments demonstrate that our method achieves superior generalization across zero-shot evaluation benchmarks, highlighting the effectiveness of the proposed framework for robust feature matching.
Poster
Tian-Xing Xu · Xiangjun Gao · Wenbo Hu · Xiaoyu Li · Song-Hai Zhang · Ying Shan

[ Exhibit Hall I ]

Abstract
Despite remarkable advancements in video depth estimation, existing methods exhibit inherent limitations in achieving geometric fidelity through affine-invariant predictions, limiting their applicability in reconstruction and other metrically grounded downstream tasks. We propose GeometryCrafter, a novel framework that recovers high-fidelity point map sequences with temporal coherence from open-world videos, enabling accurate 3D/4D reconstruction, camera parameter estimation, and other depth-based applications. At the core of our approach lies a point map Variational Autoencoder (VAE) that learns a latent space agnostic to video latent distributions for effective point map encoding and decoding. Leveraging the VAE, we train a video diffusion model to model the distribution of point map sequences conditioned on the input videos. Extensive evaluations on diverse datasets demonstrate that GeometryCrafter achieves state-of-the-art 3D accuracy, temporal consistency, and generalization capability. Code and models will be publicly released.
Poster
Ahmed Abdelreheem · Filippo Aleotti · Jamie Watson · Zawar Qureshi · Abdelrahman Eldesokey · Peter Wonka · Gabriel Brostow · Sara Vicente · Guillermo Garcia-Hernando

[ Exhibit Hall I ]

Abstract
We introduce the novel task of Language-Guided Object Placement in 3D scenes. Our model is given a 3D scene's point cloud, a 3D asset, and a textual prompt broadly describing where the 3D asset should be placed. The task is to find a valid placement for the 3D asset that respects the prompt. Compared with other language-guided localization tasks in 3D scenes such as grounding, this task has specific challenges: it is ambiguous because it has multiple valid solutions, and it requires reasoning about 3D geometric relationships and free space. We inaugurate this task by proposing a new benchmark and evaluation protocol. We also introduce a new dataset for training 3D LLMs on this task, as well as the first method to serve as a non-trivial baseline. We believe that this challenging task and our new benchmark could become part of the suite of benchmarks used to evaluate and compare generalist 3D LLM models. We will release the dataset, benchmark, and baseline code on acceptance.
Poster
Hongyu Shen · Junfeng Ni · Weishuo Li · Mingtao Pei · Yixin Chen · Siyuan Huang

[ Exhibit Hall I ]

Abstract
We address the challenge of lifting 2D visual segmentation to 3D in Gaussian Splatting. Existing methods often suffer from inconsistent 2D masks across viewpoints and produce noisy segmentation boundaries as they neglect these semantic cues to refine the learned Gaussians. To overcome this, we introduce Gaussian Instance Tracing (GIT), which augments the standard Gaussian representation with an instance weight matrix across input views. Leveraging the inherent consistency of Gaussians in 3D, we use this matrix to identify and correct 2D segmentation inconsistencies. Furthermore, since each Gaussian ideally corresponds to a single object, we propose a GIT-guided adaptive density control mechanism to split and prune ambiguous Gaussians during training, resulting in sharper and more coherent 2D and 3D segmentation boundaries. Experimental results show that our method extracts clean 3D assets and consistently improves 3D segmentation in both online (e.g., self-prompting) and offline (e.g., contrastive lifting) settings, enabling applications such as hierarchical segmentation, object extraction, and scene editing.
Poster
Lei Sun · Yuhan Bao · Jiajun Zhai · Jingyun Liang · YULUN ZHANG · Kaiwei Wang · Danda Pani Paudel · Luc Gool

[ Exhibit Hall I ]

Abstract
Low-light image enhancement (LLIE) aims to improve the visibility of images captured in poorly lit environments. Prevalent event-based solutions primarily utilize events triggered by motion, i.e., "motion events", to strengthen only the edge texture, while leaving the high dynamic range and excellent low-light responsiveness of event cameras largely unexplored. This paper instead opens a new avenue from the perspective of estimating the illumination using "temporal-mapping" events, i.e., by converting the timestamps of events triggered by a transmittance modulation into brightness values. The resulting fine-grained illumination cues facilitate a more effective decomposition and enhancement of the reflectance component in low-light images through the proposed Illumination-aided Reflectance Enhancement module. Furthermore, the degradation model of temporal-mapping events under low-light conditions is investigated for synthesizing realistic training data. To address the lack of datasets under this regime, we construct a beam-splitter setup and collect the EvLowLight dataset that includes images, temporal-mapping events, and motion events. Extensive experiments across 5 synthetic datasets and our real-world EvLowLight dataset substantiate that the devised pipeline, dubbed RetinEV, excels in producing well-illuminated, high dynamic range images, outperforming previous state-of-the-art event-based methods by up to 6.62 dB, while maintaining an efficient inference speed of 35.6 frames per second on a $640\times480$ image.
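To make the temporal-mapping idea concrete, the toy function below converts event timestamps into relative brightness under an assumed linear transmittance ramp (brighter pixels trigger earlier); the linear ramp is an illustrative assumption, not the paper's calibrated degradation model.

```python
import numpy as np

def timestamps_to_illumination(event_t, t0, period):
    """Map timestamps of temporal-mapping events to relative brightness in [0, 1],
    assuming a linear transmittance ramp over one modulation period."""
    phase = np.clip((event_t - t0) / period, 0.0, 1.0)   # 0 = ramp start, 1 = ramp end
    return 1.0 - phase                                   # earlier event -> brighter pixel

t = np.array([0.001, 0.004, 0.009])
print(timestamps_to_illumination(t, t0=0.0, period=0.01))  # [0.9, 0.6, 0.1]
```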
Poster
Wenyao Zhang · Hongsi Liu · Bohan Li · Jiawei He · Zekun Qi · Yunnan Wang · Eastern Institute of Technology Shengyang · Ningbo Institute Of Digital Twin XinQiang · Galbot Wenjun · Eastern Institute for Advanced Study Xin

[ Exhibit Hall I ]

Abstract
Current self-supervised monocular depth estimation (MDE) approaches encounter performance limitations due to insufficient semantic-spatial knowledge extraction. To address this challenge, we propose Hybrid-depth, a novel framework that systematically integrates foundation models (e.g., CLIP and DINO) to extract visual priors and acquire sufficient contextual information for MDE. Our approach introduces a coarse-to-fine progressive learning framework: 1) Firstly, we aggregate multi-grained features from CLIP (global semantics) and DINO (local spatial details) under contrastive language guidance. A proxy task comparing close-distant image patches is designed to enforce depth-aware feature alignment using text prompts; 2) Next, building on the coarse features, we integrate camera pose information and pixel-wise language alignment to refine depth predictions. This module seamlessly integrates with existing self-supervised MDE pipelines (e.g., Monodepth2, ManyDepth) as a plug-and-play depth encoder, enhancing continuous depth estimation. By aggregating CLIP's semantic context and DINO's spatial details through language guidance, our method effectively addresses feature granularity mismatches. Extensive experiments on the KITTI benchmark demonstrate that our method significantly outperforms SOTA methods across all metrics, which also indeed benefits downstream tasks like BEV perception.
Poster
Jeong Hun Yeo · Minsu Kim · Chae Won Kim · Stavros Petridis · Yong Man Ro

[ Exhibit Hall I ]

Abstract
We explore a novel zero-shot Audio-Visual Speech Recognition (AVSR) framework, dubbed Zero-AVSR, which enables speech recognition in target languages without requiring any audio-visual speech data in those languages. Specifically, we introduce the Audio-Visual Speech Romanizer (AV-Romanizer), which learns language-agnostic speech representations by predicting Roman text. Then, by leveraging the strong multilingual modeling capabilities of Large Language Models (LLMs), we propose converting the predicted Roman text into language-specific graphemes, forming the proposed Cascaded Zero-AVSR. Taking it a step further, we explore a unified Zero-AVSR approach by directly integrating the audio-visual speech representations encoded by the AV-Romanizer into the LLM. This is achieved through finetuning the adapter and the LLM using our proposed multi-task learning scheme. To capture the wide spectrum of phonetic and linguistic diversity, we also introduce a Multilingual Audio-Visual Romanized Corpus (MARC) consisting of 2,916 hours of audio-visual speech data across 82 languages, along with transcriptions in both language-specific graphemes and Roman text. Extensive analysis and experiments confirm that the proposed Zero-AVSR framework has the potential to expand language support beyond the languages seen during the training of the AV-Romanizer.
Poster
Jiaxin Lu · Gang Hua · Qixing Huang

[ Exhibit Hall I ]

Abstract
The automatic assembly problem has attracted increasing interest due to its complex challenges that involve 3D representation. This paper introduces Jigsaw++, a novel generative method designed to tackle the multifaceted challenges of reconstructing complete shapes for the reassembly problem. Existing approaches focus primarily on piecewise information for both part and fracture assembly, often overlooking the integration of a complete-object prior. Jigsaw++ distinguishes itself by learning a category-agnostic shape prior of complete objects. It employs the proposed ``retargeting'' strategy that effectively leverages the output of any existing assembly method to generate complete shape reconstructions. This capability allows it to function orthogonally to current methods. Through extensive evaluations on the Breaking Bad dataset and PartNet, Jigsaw++ has demonstrated its effectiveness, reducing reconstruction errors and enhancing the precision of shape reconstruction, which sets a new direction for future reassembly model developments.
Poster
Hongyu Wen · Yiming Zuo · Venkat Subramanian · Patrick Chen · Jia Deng

[ Exhibit Hall I ]

Abstract
Transparent objects are common in daily life, and understanding their multi-layer depth information—perceiving both the transparent surface and the objects behind it—is crucial for real-world applications that interact with transparent materials. In this paper, we introduce LayeredDepth, the first dataset with multi-layer depth annotations, including a real-world benchmark and a synthetic data generator, to support the task of multi-layer depth estimation. Our real-world benchmark consists of 1,500 images from diverse scenes, and evaluating state-of-the-art depth estimation methods on it reveals that they struggle with transparent objects. The synthetic data generator is fully procedural and capable of providing training data for this task with an unlimited variety of objects and scene compositions. Using this generator, we create a synthetic dataset with 15,300 images. Baseline models trained solely on this synthetic dataset produce good cross-domain multi-layer depth estimation. Fine-tuning state-of-the-art single-layer depth models on it substantially improves their performance on transparent objects, with quadruplet accuracy on our benchmark increasing from 55.59% to 75.16%.
Poster
Yuxi Xiao · Jianyuan Wang · Nan Xue · Nikita Karaev · Iurii Makarov · Bingyi Kang · Xing Zhu · Hujun Bao · Yujun Shen · Xiaowei Zhou

[ Exhibit Hall I ]

Abstract
3D point tracking from monocular videos has recently shown promising results, attracting increasing attention from the community. However, existing methods typically struggle with two key challenges: (a) significant background motion caused by camera movement, and (b) frequent occlusions that necessitate re-identifying previously observed objects. Monocular egocentric videos are prime examples where these challenges prominently arise. In this work, we introduce SpatialTrackerV2, a novel 3D point tracking approach capable of computing accurate 3D trajectories for arbitrary 2D pixels, excelling not only in common video scenarios but also in challenging contexts with substantial camera motion and frequent occlusions. Our method separates camera motion from object motion, explicitly modeling the camera movement and its interplay with depth maps to significantly enhance 3D point tracking. Additionally, we propose a joint refinement module that simultaneously improves depth estimation, camera motion, and 3D tracking accuracy in a unified manner. Benefiting from large-scale training on a mixture of synthetic and real-world data, SpatialTrackerV2 demonstrates strong robustness and generalization capabilities. Extensive experiments across different benchmarks validate its effectiveness and substantial performance improvements over existing approaches.
Poster
Peiming Li · Ziyi Wang · Yulin Yuan · Hong Liu · Xiangming Meng · Junsong Yuan · Mengyuan Liu

[ Exhibit Hall I ]

Abstract
Point cloud videos capture dynamic 3D motion while reducing the effects of lighting and viewpoint variations, making them highly effective for recognizing subtle and continuous human actions. Although Selective State Space Models (SSMs) have shown good performance in sequence modeling with linear complexity, the spatio-temporal disorder of point cloud videos hinders their unidirectional modeling when directly unfolding the point cloud video into a 1D sequence through temporally sequential scanning. To address this challenge, we propose the Unified Spatio-Temporal State Space Model (UST-SSM), which extends the latest advancements in SSMs to point cloud videos. Specifically, we introduce Spatial-Temporal Selection Scanning (STSS), which reorganizes unordered points into semantic-aware sequences through prompt-guided clustering, thereby enabling the effective utilization of points that are spatially and temporally distant yet similar within the sequence. To recover missing 4D geometric and motion details, Spatio-Temporal Structure Aggregation (STSA) aggregates spatio-temporal features to compensate for them. To improve temporal interaction within the sampled sequence, Temporal Interaction Sampling (TIS) enhances fine-grained temporal dependencies through non-anchor frame utilization and expanded receptive fields. Experimental results on the MSR-Action3D, NTU RGB+D, and Synthia 4D datasets validate the effectiveness of our method. Our code is available at https://anonymous.4open.science/r/UST-SSM.
Poster
Qi Bi · Jingjun Yi · Huimin Huang · Hao Zheng · Haolan Zhan · Wei Ji · Yawen Huang · Yuexiang Li · Yefeng Zheng

[ Exhibit Hall I ]

Abstract
Diffusion models have demonstrated powerful capabilities as generalists for dense vision tasks, yet their ability to generalize to unseen domains remains rarely explored. In light of this issue, we focus on investigating generalizable paradigms for diffusion-based dense prediction and propose an efficient frequency learning scheme, dubbed HarDiff, alleviating the domain gap across various scenes. Interestingly, the low-frequency features, obtained via the Discrete Hartley Transform, activate the broader content of an image, while the high-frequency features maintain sufficient details for dense pixels. Hence, HarDiff is driven by two compelling designs: (1) a Low-Frequency Training Process, which extracts structural priors from the source domain to enhance understanding of task-related content; (2) a High-Frequency Sampling Process, which utilizes detail-oriented guidance from the unseen target domain to infer precise dense predictions with target-related details. Extensive empirical evidence shows that HarDiff can be easily plugged into various dense vision tasks, e.g., semantic segmentation, depth estimation, and haze removal, yielding improvements over state-of-the-art methods on twelve public benchmarks. We will release our code.
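For orientation, the Discrete Hartley Transform referenced above can be computed directly from a standard FFT, and a low/high-frequency split then amounts to masking the spectrum. The NumPy sketch below illustrates this under stated assumptions (the cutoff radius `r` is a hypothetical parameter, not taken from the paper).

```python
import numpy as np

def hartley_2d(x):
    """2D Discrete Hartley Transform: DHT(x) = Re(FFT(x)) - Im(FFT(x))."""
    f = np.fft.fft2(x)
    return f.real - f.imag

def split_frequencies(x, r=0.1):
    """Split an image into low- and high-frequency parts with a centered
    radial mask; `r` is an illustrative cutoff (fraction of the spectrum)."""
    h = np.fft.fftshift(hartley_2d(x))
    H, W = x.shape
    yy, xx = np.mgrid[:H, :W]
    dist = np.hypot(yy - H / 2, xx - W / 2)
    low_mask = dist <= r * min(H, W)
    low, high = h * low_mask, h * (~low_mask)
    # The Hartley transform is its own inverse up to a 1/(H*W) factor.
    inv = lambda u: hartley_2d(np.fft.ifftshift(u)) / (H * W)
    return inv(low), inv(high)

low_part, high_part = split_frequencies(np.random.rand(64, 64))
```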
Poster
Tuo Xiang · Xuemiao Xu · Bangzhen Liu · Jinyi Li · Yong Li · Shengfeng He

[ Exhibit Hall I ]

Abstract
The rapid growth of 3D digital content necessitates expandable recognition systems for open-world scenarios. However, existing 3D class-incremental learning methods struggle under extreme data scarcity due to geometric misalignment and texture bias. While recent approaches integrate 3D data with 2D foundation models (e.g., CLIP), they suffer from semantic blurring caused by texture-biased projections and indiscriminate fusion of geometric-textural cues, leading to unstable decision prototypes and catastrophic forgetting. To address these issues, we propose Cross-Modal Geometric Rectification (CMGR), a framework that enhances 3D geometric fidelity by leveraging CLIP’s hierarchical spatial semantics. Specifically, we introduce a Structure-Aware Geometric Rectification module to hierarchically align 3D part structures with CLIP’s intermediate spatial priors via attention-driven geometric fusion. Additionally, a Texture Amplification Module synthesizes minimal yet discriminative textures to suppress noise and reinforce cross-modal consistency. To further stabilize incremental prototypes, we employ a Base-Novel Discriminator that isolates geometric variations. Extensive experiments demonstrate that our method significantly improves 3D few-shot class-incremental learning, achieving superior geometric coherence and robustness to texture bias across cross-domain and within-domain settings.
Poster
Ruining Li · Chuanxia Zheng · Christian Rupprecht · Andrea Vedaldi

[ Exhibit Hall I ]

Abstract
Most 3D object generators focus on aesthetic quality, often neglecting physical constraints necessary in applications. One such constraint is that the 3D object should be self-supporting, i.e., it remains balanced under gravity. Prior approaches to generating stable 3D objects used differentiable physics simulators to optimize geometry at test time, which is slow, unstable, and prone to local optima. Inspired by the literature on aligning generative models to external feedback, we propose Direct Simulation Optimization (DSO), a framework that uses feedback from a (non-differentiable) simulator to increase the likelihood that the 3D generator directly outputs stable 3D objects. We construct a dataset of 3D objects labeled with a stability score obtained from the physics simulator. We can then fine-tune the 3D generator using the stability score as the alignment metric, via direct preference optimization (DPO) or direct reward optimization (DRO), a novel objective we introduce to align diffusion models without requiring pairwise preferences. Our experiments show that the fine-tuned feed-forward generator, using either the DPO or DRO objective, is much faster and more likely to produce stable objects than test-time optimization. Notably, the DSO framework works even without any ground-truth 3D objects for training, allowing the 3D generator to self-improve by automatically collecting simulation …
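DPO itself is a published, standard objective; the sketch below shows that standard loss on (preferred, rejected) pairs, not the paper's DRO variant, whose exact form is not given here. The tensor names and `beta` value are illustrative.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss: push the policy to prefer the 'winning' (e.g., stable)
    sample over the 'losing' one, relative to a frozen reference model.
    Inputs are per-sample log-probabilities under the policy / reference."""
    policy_margin = logp_w - logp_l
    ref_margin = ref_logp_w - ref_logp_l
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()

# Toy usage with random log-probabilities for a batch of 4 pairs.
loss = dpo_loss(torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4))
```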
Poster
Aleksandar Jevtić · Christoph Reich · Felix Wimbauer · Oliver Hahn · Christian Rupprecht · Stefan Roth · Daniel Cremers

[ Exhibit Hall I ]

Abstract
Semantic scene completion (SSC) aims to infer both the 3D geometry and semantics of a scene from single images. In contrast to prior work on SSC that heavily relies on expensive ground-truth annotations, we approach SSC in an unsupervised setting. Our novel method, SceneDINO, adapts techniques from self-supervised representation learning and 2D unsupervised scene understanding to SSC. Our training exclusively utilizes multi-view consistency self-supervision without any form of semantic or geometric ground truth. Given a single input image, SceneDINO infers the 3D geometry and expressive 3D DINO features in a feed-forward manner. Through a novel 3D feature distillation approach, we obtain unsupervised 3D semantics. In both 3D and 2D unsupervised scene understanding, SceneDINO reaches state-of-the-art segmentation accuracy. Linear probing our 3D features matches the segmentation accuracy of a current supervised SSC approach. Additionally, we showcase the domain generalization and multi-view consistency of SceneDINO, taking the first steps towards a strong foundation for single image 3D scene understanding.
Poster
Xinhao Cai · Qiuxia Lai · Gensheng Pei · Xiangbo Shu · Yazhou Yao · Wenguan Wang

[ Exhibit Hall I ]

Abstract
In this paper, we propose a generation-detection cycle consistent (GDCC) learning framework that jointly optimizes both layout-to-image (L2I) generation and object detection (OD) tasks in an end-to-end manner. The key to GDCC lies in the inherent duality between the two tasks, where L2I takes all object boxes and labels as input conditions to generate images, and OD maps images back to these layout conditions. Specifically, in GDCC, L2I generation is guided by a layout translation cycle loss, ensuring that the layouts used to generate images align with those predicted from the synthesized images. Similarly, OD benefits from an image translation cycle loss, which enforces consistency between the synthesized images fed into the detector and those generated from predicted layouts. While current L2I and OD tasks benefit from large-scale annotated layout-image pairs, our GDCC enables more efficient use of annotation-free synthetic data, thereby further enhancing data efficiency. It is worth noting that our GDCC framework is computationally efficient thanks to the perturbative single-step sampling strategy and a priority timestep re-sampling strategy during training. Besides, GDCC preserves the architectures of L2I, OD models, and the generation pipeline within the framework, thus maintaining the original inference speed. Extensive experiments demonstrate that GDCC significantly …
Poster
Wenxuan Guo · Xiuwei Xu · Hang Yin · Ziwei Wang · Jianjiang Feng · Jie Zhou · Jiwen Lu

[ Exhibit Hall I ]

Abstract
Visual navigation with an image as the goal is a fundamental and challenging problem. Conventional methods either rely on end-to-end RL learning or a modular policy with a topological graph or BEV map as memory, which cannot fully model the geometric relationship between the explored 3D environment and the goal image. In order to efficiently and accurately localize the goal image in 3D space, we build our navigation system upon the renderable 3D Gaussian Splatting (3DGS) representation. However, due to the computational intensity of 3DGS optimization and the large search space of 6-DoF camera pose, directly leveraging 3DGS for image localization during the agent's exploration process is prohibitively inefficient. To this end, we propose IGL-Nav, an Incremental 3D Gaussian Localization framework for efficient and 3D-aware image-goal navigation. Specifically, we incrementally update the scene representation as new images arrive with feed-forward monocular prediction. Then we coarsely localize the goal by leveraging the geometric information for discrete space matching, which can be equivalent to efficient 3D convolution. When the agent is close to the goal, we finally solve the fine target pose with optimization via differentiable rendering. The proposed IGL-Nav outperforms existing state-of-the-art methods by a large margin across diverse experimental configurations. It can also handle the …
Poster
Dejie Yang · Zijing Zhao · Yang Liu

[ Exhibit Hall I ]

Abstract
Visual Robot Manipulation (VRM) aims to enable a robot to follow natural language instructions based on robot states and visual observations, and therefore requires costly multi-modal data. To compensate for the deficiency of robot data, existing approaches have employed vision-language pretraining with large-scale data. However, they either utilize web data that differs from robotic tasks, or train the model in an implicit way (e.g., predicting future frames at the pixel level), and thus show limited generalization ability under insufficient robot data. In this paper, we propose to learn from large-scale human action video datasets in an explicit way (i.e., imitating human actions from hand keypoints), introducing Visual Robot Manipulation with Analogical Reasoning (AR-VRM). To acquire action knowledge explicitly from human action videos, we propose a keypoint Vision-Language Model (VLM) pretraining scheme, enabling the VLM to learn human action knowledge and directly predict human hand keypoints. During fine-tuning on robot data, to facilitate the robotic arm in imitating the action patterns of human motions, we first retrieve human action videos that perform similar manipulation tasks and have similar historical observations, and then learn the Analogical Reasoning (AR) map between human hand keypoints and robot components. Taking advantage of focusing on action …
Poster
Donghyeon Kwon · Youngseok Yoon · Hyeongseok Son · Suha Kwak

[ Exhibit Hall I ]

Abstract
Camera-based 3D object detection has gained attention for its cost-effectiveness, but it generally lags behind LiDAR-based approaches due to its lack of explicit 3D spatial cues. To take the best of both camera- and LiDAR-based detectors, we propose MemDistill, a novel cross-modal knowledge distillation framework for 3D object detection. MemDistill transfers rich 3D knowledge from a LiDAR-based teacher model to a camera-based student model through a dedicated memory unit and a scene-dependent memory retrieval module. To be specific, our framework distills the teacher's 3D knowledge, optimizes the memory to store that knowledge compactly, and learns the retriever that searches the memory to produce 3D features relevant to the input scene, compensating for the missing LiDAR modality. Experiments on the nuScenes dataset demonstrate that MemDistill significantly improves the performance of its camera-only baseline, achieving the state of the art in camera-based 3D object detection.
Poster
Daixun Li · Yusi Zhang · Mingxiang Cao · donglai Liu · Weiying Xie · Tianlin Hui · Lunkai Lin · Zhiqiang Xie · Yunsong Li

[ Exhibit Hall I ]

Abstract
Vision-Language-Action (VLA) is crucial for autonomous decision-making in embodied systems. While current methods have advanced single-skill abilities, their short-horizon capability limits applicability in real-world scenarios. To address this challenge, we propose **MindExplore**, a general hierarchical VLA system with cross-skill capability for long-horizon tasks in highly dynamic sandy environments. The key insight is to iteratively align the knowledge domain of task planning and action execution. Thus, this task-oriented action enables outstanding generalization across a wide range of real-world scenarios. In the reasoning layer, task-specific chains of thought (CoT) are designed for planning long-horizon task sequences and providing meta-action signals. In the acting layer, a simple but powerful Mixture of Policy Experts strategy, inspired by these signals and multimodal inputs, adaptively selects skill experts and generates closed-loop action sequences. It also integrates a lightweight Multimodal Diffusion Policy (MMDP) to enhance spatial perception by fusing multi-visual modality features. Besides, a pioneering memory mechanism establishes feedback between the reasoning and acting layers, facilitating adaptive execution of long-horizon tasks and real-time replanning. Notably, we create **SandGo-1k** and **SandThink-21k**, the first expert-level multimodal CoT dataset and embodied dataset tailored for sandy environments. At a high execution frequency of 30 FPS, MindExplore is 3.01× more …
Poster
Minh Tran · Hongda Mao · Qingshuang Chen · Yelin Kim

[ Exhibit Hall I ]

Abstract
Generating body pose from head-mounted, egocentric inputs is essential for immersive VR/AR and assistive technologies, as it supports more natural interactions. However, the task is challenging due to the limited visibility of body parts in first-person views and the sparseness of sensory data, with only a single device placed on the head. To address these challenges, we introduce Head2Body, a novel framework for body pose estimation that effectively combines IMU and visual data. First, we introduce a pre-trained IMU encoder, trained on over 1,700 hours of head-IMU data from wearable eyeglasses, to better capture detailed temporal motion cues given limited labeled egocentric pose data. For visual processing, we leverage large vision-language models (LVLMs) to segment body parts that appear sporadically in video frames to improve visual feature extraction. To better guide the pose generation process with sparse signals from only head-mounted devices, we incorporate a Vector Quantized Variational Autoencoder (VQ-VAE) to represent poses as discrete tokens, which capture high-frequency motion patterns and provide a more structured representation of body pose. Our experiments demonstrate the effectiveness of the proposed approach, yielding 8–13% gains over state-of-the-art baselines on four datasets: AMASS, KinPoly, GIMO, and EgoExo4D. By capturing subtle temporal dynamics and leveraging complementary …
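The VQ-VAE tokenization mentioned above boils down to a nearest-codebook lookup with a straight-through gradient. A generic sketch follows; the codebook size and feature dimension are illustrative and not taken from the paper.

```python
import torch

def vq_quantize(z, codebook):
    """Map continuous pose features z (B, D) to their nearest codebook
    entries (K, D), using the straight-through estimator for gradients."""
    dists = torch.cdist(z, codebook)            # (B, K) pairwise distances
    idx = dists.argmin(dim=1)                   # nearest code per sample
    z_q = codebook[idx]                         # quantized vectors
    # Straight-through: forward uses z_q, backward passes gradients to z.
    z_st = z + (z_q - z).detach()
    commit_loss = torch.mean((z - z_q.detach()) ** 2)
    return z_st, idx, commit_loss

codebook = torch.randn(512, 64, requires_grad=True)   # K=512 codes, D=64
z = torch.randn(8, 64, requires_grad=True)
z_q, tokens, commit = vq_quantize(z, codebook)
```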
Poster
Xingxiang Zhou · Xiangdong Su · Haoran Zhang · Wei Chen · Guanglai Gao

[ Exhibit Hall I ]

Abstract
Low-light image enhancement (LLIE) is a fundamental task in computer vision. Its goal is to extract more useful information from dark regions. Many existing methods have made excellent strides in improving image brightness and enhancing texture details. However, these approaches often lead to overexposure in certain regions when dealing with unevenly illuminated images, resulting in the loss of original information within the images. To address this issue, we propose a Bézier surface constraint network (BSCNet) based on task decoupling to enhance low-light images with uneven brightness. Specifically, we design a diffusion model with a branch structure that separates the enhancement process into brightness adjustment and color restoration, enabling independent control over brightness uniformity. Additionally, we use Bézier surfaces as a learning target to impose smoothness constraints on the image, thereby addressing the issue of uneven brightness in the enhanced image. To counteract potential detail loss introduced by Bézier surfaces, we introduce a spatial-frequency reconstruction module based on the Fourier transform to enhance fine-grained texture information. Experimental comparisons on six general LLIE datasets demonstrate the outstanding effectiveness of the proposed method.
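A Bézier surface is a standard construction; the sketch below evaluates one from a grid of control values with Bernstein polynomials, which is the kind of smooth target the abstract describes. The 4x4 grid size is illustrative only.

```python
import numpy as np
from math import comb

def bernstein(n, i, t):
    """Bernstein basis polynomial B_{i,n}(t)."""
    return comb(n, i) * t**i * (1 - t)**(n - i)

def bezier_surface(control, u, v):
    """Evaluate a Bezier surface S(u, v) from an (n+1, m+1) control grid.
    `u` and `v` are 1D arrays of parameters in [0, 1]."""
    n, m = control.shape[0] - 1, control.shape[1] - 1
    Bu = np.stack([bernstein(n, i, u) for i in range(n + 1)])   # (n+1, U)
    Bv = np.stack([bernstein(m, j, v) for j in range(m + 1)])   # (m+1, V)
    return Bu.T @ control @ Bv                                   # (U, V)

# Example: a 4x4 control grid used as a smooth brightness target.
ctrl = np.random.rand(4, 4)
surface = bezier_surface(ctrl, np.linspace(0, 1, 32), np.linspace(0, 1, 32))
```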
Poster
Jae Young Kang · Hoonhee Cho · Kuk-Jin Yoon

[ Exhibit Hall I ]

Abstract
3D object detection is essential for autonomous systems, enabling precise localization and dimension estimation. While LiDAR and RGB cameras are widely used, their fixed frame rates create perception gaps in high-speed scenarios. Event cameras, with their asynchronous nature and high temporal resolution, offer a solution by capturing motion continuously. The recent approach, which integrates event cameras with conventional sensors for continuous-time detection, struggles in fast-motion scenarios due to its dependency on synchronized sensors. We propose a novel 3D object detection framework that relies solely on event cameras, eliminating the need for conventional 3D sensors. To compensate for the lack of semantic and geometric information in event data, we introduce a dual filter mechanism that extracts both. Additionally, we enhance regression by aligning bounding boxes with object-centric information. Experiments show that our method outperforms prior approaches in dynamic environments, demonstrating the potential of event cameras for robust, continuous-time 3D perception. Our project code will be publicly available.
Poster
Fang Zhang · Wenzhao Zheng · Linqing Zhao · Zelan Zhu · Jiwen Lu · Xiuzhuang Zhou

[ Exhibit Hall I ]

Abstract
3D plane recovery from monocular images constitutes a fundamental task in indoor scene understanding. Recent methods formulate this problem as 2D pixel-level segmentation through convolutional networks or query-based architectures, which purely rely on 2D pixel features while neglecting the inherent 3D spatial nature of planar surfaces. To address this limitation, we propose an end-to-end Plane Reconstruction, Aggregation, and Splatting (PlaneRAS) framework that explicitly leverages 3D geometric reasoning combined with online planar primitive reconstruction. Our framework introduces two core components: 1) a reconstruction module utilizing customized planar primitives to compactly represent the 3D scene, and 2) a recovery module that aggregates local primitives to derive globally consistent plane instances. The proposed 3D-aware representation enables direct integration of pretrained geometric priors, significantly enhancing performance beyond conventional 2D-centric approaches. Extensive experiments on ScanNet and NYUv2 datasets demonstrate state-of-the-art results across various evaluation metrics, resulting from our explicit 3D geometric modeling and effective fusion of cross-dimensional features.
Poster
Lorenzo Mur-Labadia · Maria Santos-Villafranca · Jesus Bermudez-cameo · Alejandro Perez-Yus · Ruben Martinez-Cantin · Jose Guerrero

[ Exhibit Hall I ]

Abstract
Understanding the world from multiple perspectives is essential for intelligent systems operating together, where segmenting common objects across different views remains an open problem. We introduce a new approach that re-defines cross-image segmentation by treating it as a mask matching task. Our method consists of: (1) a Mask-Context Encoder that pools dense DINOv2 semantic features to obtain discriminative object-level representations from FastSAM mask candidates, (2) an Ego↔Exo Cross-Attention that fuses multi-perspective observations, (3) a Mask Matching contrastive loss that aligns cross-view features in a shared latent space, and (4) a Hard Negative Adjacent Mining strategy to encourage the model to better differentiate between nearby objects. O-MaMa achieves the state of the art on the Ego-Exo4D Correspondences benchmark, obtaining relative gains of +31% and +94% in Ego2Exo and Exo2Ego IoU against the official challenge baselines, and +13% and +6% compared with the SOTA while using 1% of the training parameters.
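A mask-matching contrastive loss of this kind can be illustrated with a standard InfoNCE-style objective over paired ego/exo mask embeddings; this is a generic sketch under that assumption, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def mask_matching_loss(ego_feats, exo_feats, temperature=0.07):
    """InfoNCE-style loss: the i-th ego mask embedding should match the
    i-th exo mask embedding and repel all other masks in the batch."""
    ego = F.normalize(ego_feats, dim=-1)
    exo = F.normalize(exo_feats, dim=-1)
    logits = ego @ exo.t() / temperature            # (N, N) similarity matrix
    targets = torch.arange(ego.size(0), device=ego.device)
    # Symmetric cross-entropy over both matching directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

loss = mask_matching_loss(torch.randn(16, 256), torch.randn(16, 256))
```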
Poster
Cui Miao · Tao Chang · meihan wu · Hongbin Xu · Chun Li · Ming Li · Xiaodong Wang

[ Exhibit Hall I ]

Abstract
Vision-Language-Action (VLA) models have significantly advanced robotic manipulation by enabling robots to interpret language instructions for task execution. However, training these models often relies on large-scale user-specific data, raising concerns about privacy and security, which in turn limits their broader adoption. To address this, we propose FedVLA, the first federated VLA learning framework, enabling distributed model training that preserves data privacy without compromising performance. Our framework integrates task-aware representation learning, adaptive expert selection, and expert-driven federated aggregation, enabling efficient and privacy-preserving training of VLA models. Specifically, we introduce an Instruction-Oriented Scene-Parsing mechanism, which decomposes and enhances object-level features based on task instructions, improving contextual understanding. To effectively learn diverse task patterns, we design a Dual Gating Mixture-of-Experts (DGMoE) mechanism, where not only input tokens but also self-aware experts adaptively decide their activation. Finally, we propose an Expert-Driven Aggregation strategy at the federated server, where model aggregation is guided by activated experts, ensuring effective cross-client knowledge transfer. Extensive simulations and real-world robotic experiments demonstrate the effectiveness of our proposals. Notably, DGMoE significantly improves computational efficiency compared to its vanilla counterpart, while FedVLA achieves task success rates comparable to centralized training, effectively preserving data privacy.
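For context, a plain top-k token-choice Mixture-of-Experts layer is sketched below; the paper's DGMoE additionally lets experts self-select their activation, which is not reproduced here. Dimensions and expert counts are illustrative.

```python
import torch
import torch.nn as nn

class TopKMoE(nn.Module):
    """Minimal token-choice MoE: a router picks the top-k experts per token
    and mixes their outputs with softmax weights."""
    def __init__(self, dim=256, num_experts=8, k=2):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts))
        self.k = k

    def forward(self, x):                           # x: (tokens, dim)
        scores = self.router(x)                     # (tokens, num_experts)
        topv, topi = scores.topk(self.k, dim=-1)
        weights = topv.softmax(dim=-1)              # (tokens, k)
        out = torch.zeros_like(x)
        for slot in range(self.k):
            idx = topi[:, slot]
            w = weights[:, slot].unsqueeze(-1)      # (tokens, 1)
            for e, expert in enumerate(self.experts):
                sel = idx == e
                if sel.any():
                    out[sel] = out[sel] + w[sel] * expert(x[sel])
        return out

y = TopKMoE()(torch.randn(32, 256))
```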
Poster
Richard D Paul · Johannes Seiffarth · David Rügamer · Hanno Scharr · Katharina Nöh

[ Exhibit Hall I ]

Abstract
Cell tracking is a key computational task in live-cell microscopy, but fully automated analysis of high-throughput imaging requires reliable and, thus, uncertainty-aware data analysis tools, as the amount of data recorded within a single experiment exceeds what humans are able to review. We here propose and benchmark various methods to reason about and quantify uncertainty in linear assignment-based cell tracking algorithms. Our methods take inspiration from statistics and machine learning, leveraging two perspectives on the cell tracking problem explored throughout this work: considering it as a Bayesian inference problem and as a classification problem. Our methods have a framework-like character in that they equip any frame-to-frame tracking method with uncertainty quantification. We demonstrate this by applying them to various existing tracking algorithms, including the recently presented Transformer-based trackers. We demonstrate empirically that our methods yield useful and well-calibrated tracking uncertainties.
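Linear assignment-based frame-to-frame tracking is a standard building block; the sketch below links detections between two frames with the Hungarian algorithm on a cost matrix (a simple centroid distance here, purely as an illustration of the setup the abstract builds on).

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def link_frames(cells_t, cells_t1, max_dist=20.0):
    """Match cell centroids between consecutive frames.
    cells_t: (N, 2) and cells_t1: (M, 2) arrays of (x, y) positions."""
    cost = np.linalg.norm(cells_t[:, None, :] - cells_t1[None, :, :], axis=-1)
    rows, cols = linear_sum_assignment(cost)
    # Drop assignments that are too far apart (e.g., divisions or exits).
    keep = cost[rows, cols] <= max_dist
    return list(zip(rows[keep], cols[keep]))

links = link_frames(np.random.rand(5, 2) * 100, np.random.rand(6, 2) * 100)
```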
Poster
Wufei Ma · Haoyu Chen · Guofeng Zhang · Yu-Cheng Chou · Celso de Melo · Alan Yuille · Jieneng Chen

[ Exhibit Hall I ]

Abstract
3D spatial reasoning is the ability to analyze and interpret the positions, orientations, and spatial relationships of objects within 3D space. This allows models to develop a comprehensive understanding of the 3D scene, enabling their applicability to a broader range of applications, such as autonomous navigation, robotics, and AR/VR. Despite the remarkable improvements achieved by large multi-modal models (LMMs) in a wide range of image and video understanding tasks, their abilities to perform 3D spatial reasoning are less studied. In this work we present the first comprehensive 3D spatial reasoning benchmark, 3DSRBench, with 3,000 annotated image question answering triplets from 12 question types. We balance the data distribution by collecting complementary images that lead to opposite answers given the same question. We also adopt a novel FlipEval for robust evaluation of 3D spatial reasoning capabilities. Moreover, to study the robustness of 3D spatial reasoning w.r.t. camera 3D viewpoints, our 3DSRBench involves two subsets with 3D spatial reasoning questions on images from the same scene with common and uncommon viewpoints. We benchmark a wide range of open-sourced and proprietary LMMs, revealing their limitations in different types of 3D awareness, i.e., height, orientation, location, and multi-object reasoning. Our 3DSRBench also allows …
Poster
Fengxiang Wang · Hongzhen Wang · Di Wang · Zonghao Guo · Zhenyu Zhong · Long Lan · Wenjing Yang · Jing Zhang

[ Exhibit Hall I ]

Abstract
Masked Image Modeling (MIM) has become an essential method for building foundational visual models in remote sensing (RS). However, the limitations in size and diversity of existing RS datasets restrict the ability of MIM methods to learn generalizable representations. Additionally, conventional MIM techniques, which require reconstructing all tokens, introduce unnecessary computational overhead. To address these issues, we present a new pre-training pipeline for RS models, featuring the creation of a large-scale RS dataset and an efficient MIM approach. We curated a high-quality dataset named **OpticalRS-13M** by collecting publicly available RS datasets and processing them through exclusion, slicing, and deduplication. OpticalRS-13M comprises 13 million optical images covering various RS tasks, such as object detection and pixel segmentation. To enhance efficiency, we propose **SelectiveMAE**, a pre-training method that dynamically encodes and reconstructs semantically rich patch tokens, thereby reducing the inefficiencies of traditional MIM models caused by redundant background pixels in RS images. Extensive experiments show that OpticalRS-13M significantly improves classification, detection, and segmentation performance, while SelectiveMAE speeds up training by over 2×. This highlights the effectiveness and scalability of our pipeline in developing RS foundational models.
Poster
Taowen Wang · Cheng Han · James Liang · Wenhao Yang · Dongfang Liu · Luna Zhang · Qifan Wang · Jiebo Luo · Ruixiang Tang

[ Exhibit Hall I ]

Abstract
Recently in robotics, Vision-Language-Action (VLA) models have emerged as a transformative approach, enabling robots to execute complex tasks by integrating visual and linguistic inputs within an end-to-end learning framework. While VLA models offer significant capabilities, they also introduce new attack surfaces, making them vulnerable to adversarial attacks. With these vulnerabilities largely unexplored, this paper systematically quantifies the robustness of VLA-based robotic systems. Recognizing the unique demands of robotic execution, our attack objectives target the inherent spatial and functional characteristics of robotic systems. In particular, we introduce two untargeted attack objectives that leverage spatial foundations to destabilize robotic actions, and a targeted attack objective that manipulates the robotic trajectory. Additionally, we design an adversarial patch generation approach that places a small, colorful patch within the camera's view, effectively executing the attack in both digital and physical environments. Our evaluation reveals a marked degradation in task success rates, with up to a 100% reduction across a suite of simulated robotic tasks, highlighting critical security gaps in current VLA architectures. By unveiling these vulnerabilities and proposing actionable evaluation metrics, we advance both the understanding and enhancement of safety for VLA-based robotic systems, underscoring the necessity for continuously developing robust defense strategies prior to …
Poster
Shintaro Shiba · Yoshimitsu Aoki · Guillermo Gallego

[ Exhibit Hall I ]

Abstract
Event cameras are emerging vision sensors, whose noise is challenging to characterize. Existing denoising methods for event cameras consider other tasks such as motion estimation separately (i.e., sequentially after denoising). However, motion is an intrinsic part of event data, since scene edges cannot be sensed without motion. This work proposes, to the best of our knowledge, the first method that simultaneously estimates motion in its various forms (e.g., ego-motion, optical flow) and noise. The method is flexible, as it allows replacing the 1-step motion estimation of the widely-used Contrast Maximization framework with any other motion estimator, such as deep neural networks. The experiments show that the proposed method achieves state-of-the-art results on the E-MLB denoising benchmark and competitive results on the DND21 benchmark, while showing its efficacy on motion estimation and intensity reconstruction tasks. We believe that the proposed approach contributes to strengthening the theory of event-data denoising, as well as impacting practical denoising use-cases, as we release the code upon acceptance.
Poster
Lu Chen · Yizhou Wang · SHIXIANG TANG · Qianhong Ma · Tong He · Wanli Ouyang · Xiaowei Zhou · Hujun Bao · Sida Peng

[ Exhibit Hall I ]

Abstract
This paper addresses the task of learning an agent model behaving like humans, which can jointly perceive, predict, and act in egocentric worlds. Previous methods usually train separate models for these three abilities, which prevents them from learning from each other. In this paper, we propose a joint predictive agent model, named EgoAgent, that simultaneously learns to represent the world, predict future states, and take reasonable actions within a single transformer. EgoAgent introduces two innovations to learn from the causal and temporally intertwined nature of these abilities: (1) Interleaved sequential modeling of states and actions with the causal attention mechanism, and (2) A joint embedding-action-prediction architecture featuring temporal asymmetric predictor-observer branches. Integrating these designs based on JEPA, EgoAgent unifies these capabilities in a cohesive learning framework. Comprehensive evaluations of EgoAgent on representative tasks such as image classification, egocentric future state prediction, and 3D human motion prediction tasks demonstrate the superiority of our method. The code and trained model will be released for reproducibility.
Poster
Stefan Stojanov · Linan Zhao · Yunzhi Zhang · Daniel Yamins · Jiajun Wu

[ Exhibit Hall I ]

Abstract
Establishing dense correspondences across image pairs is essential for tasks such as shape reconstruction and robot manipulation. In the challenging setting of matching across different categories, the function of an object, i.e., the effect that an object can cause on other objects, can guide how correspondences should be established. This is because object parts that enable specific functions often share similarities in shape and appearance. We derive the definition of dense functional correspondence based on this observation and propose a weakly-supervised learning paradigm to tackle the prediction task. The main insight behind our approach is that we can leverage vision-language models to pseudo-label multi-view images to obtain functional parts. We then integrate this with dense contrastive learning from pixel correspondences to distill both functional and spatial knowledge into a new model that can establish dense functional correspondence. Further, we curate synthetic and real evaluation datasets as task benchmarks. Our results demonstrate the advantages of our approach over baseline solutions consisting of off-the-shelf self-supervised image representations and grounded vision language models.
Poster
Jin-Hee Lee · Jae-keun Lee · Jeseok Kim · Kwon Soon

[ Exhibit Hall I ]

Abstract
To ensure safe autonomous driving in complex urban environments, it is essential not only to develop high-performance object detection models but also to establish a diverse and representative dataset that captures a wide range of urban scenarios and object characteristics. To address these challenges, we introduce a new multi-class 3D LiDAR dataset that comprehensively reflects various urban environments and object types, along with a robust semi-supervised 3D object detection (SSOD) framework. Our SSOD framework leverages a novel multi-teacher model, where similar object classes are grouped and supervised by category-specialized teacher networks. This category-specific collaborative guidance enables the student network to learn more effectively, leading to improved object detection performance. Additionally, we propose the Pseudo-points Generator (PointGen), a simple yet effective technique designed to enhance the generation of high-quality pseudo-labels for the teacher network, mitigating the impact of sparse LiDAR point clouds. Extensive experiments on the Waymo Open Dataset, KITTI, and our newly introduced dataset validate the effectiveness of both our dataset and SSOD framework. Experimental results demonstrate that our approach consistently outperforms state-of-the-art 3D SSOD methods across all evaluated datasets. To encourage further research in this domain, we will publicly release our multi-class LiDAR dataset and source code on …
Poster
Xuange Zhang · Dengjie Li · Bo Liu · Zenghao Bao · Yao Zhou · Baisong Yang · liuzhongying liuzhongying · Yujie Zhong · Tongtong Yuan

[ Exhibit Hall I ]

Abstract
Benefiting from recent advancements in large language models and modality alignment techniques, existing Large Vision-Language Models (LVLMs) have achieved prominent performance across a wide range of scenarios. However, the excessive computational complexity limits the widespread use of these models in practical applications. We argue that one main bottleneck in computational complexity is caused by the involvement of redundant vision sequences in model computation. This is inspired by a reassessment of the efficiency of vision and language information transmission in the language decoder of LVLMs. Then, we propose a novel vision-language interaction mechanism called **L**ayer-wise **V**ision **I**njection with **D**isentangled **A**ttention (LVIDA). In LVIDA, only the language sequence undergoes full forward propagation, while the vision sequence interacts with the language at specific stages within each language decoder layer. It is striking that our approach significantly reduces computational complexity with minimal performance loss. Specifically, LVIDA achieves approximately a 10× reduction in the computational cost of the language decoder across multiple LVLM models while maintaining comparable performance. Our code will be made publicly available soon.
Poster
Jianyu Wu · Yizhou Wang · Xiangyu Yue · Xinzhu Ma · Jinyang Guo · Dongzhan Zhou · Wanli Ouyang · SHIXIANG TANG

[ Exhibit Hall I ]

Abstract
While accurate and user-friendly Computer-Aided Design (CAD) is crucial for industrial design and manufacturing, existing methods still struggle to achieve this due to their over-simplified representations or architectures incapable of supporting multimodal design requirements. In this paper, we attempt to tackle this problem from both the method and dataset aspects. First, we propose a cascade MAR with topology predictor (CMT), the first multimodal framework for CAD generation based on Boundary Representation (B-Rep). Specifically, the cascade MAR can effectively capture the "edge-counters-surface" priors that are essential in B-Reps, while the topology predictor directly estimates topology in B-Reps from the compact tokens in MAR. Second, to facilitate large-scale training, we develop a large-scale multimodal CAD dataset, mmABC, which includes over 1.3 million B-Rep models with multimodal annotations, including point clouds, text descriptions, and multi-view images. Extensive experiments show the superiority of CMT in both conditional and unconditional CAD generation tasks. For example, we improve Coverage and Valid ratio by +10.68% and +10.3%, respectively, compared to state-of-the-art methods on ABC in unconditional generation. CMT also improves Chamfer distance by +4.01 on image-conditioned CAD generation on mmABC. The dataset, code, and pretrained networks will be released.
Poster
Haru Kondoh · Asako Kanezaki

[ Exhibit Hall I ]

Abstract
The field of multimodal robot navigation in indoor environments has garnered significant attention in recent years. However, as tasks and methods become more advanced, the action decision systems tend to become more complex and operate as black-boxes. For a reliable system, the ability to explain or describe its decisions is crucial; however, there tends to be a trade-off in that explainable systems cannot outperform non-explainable systems in terms of performance. In this paper, we propose incorporating the task of describing actions in language into the reinforcement learning of navigation as an auxiliary task. Existing studies have found it difficult to incorporate describing actions into reinforcement learning due to the absence of ground-truth data. We address this issue by leveraging knowledge distillation from pre-trained description generation models, such as vision-language models. We comprehensively evaluate our approach across various navigation tasks, demonstrating that it can describe actions while attaining high navigation performance. Furthermore, it achieves state-of-the-art performance in the particularly challenging multimodal navigation task of semantic audio-visual navigation.
Poster
Zhigang Wang · Yifei Su · Chenhui Li · Dong Wang · Yan Huang · Xuelong Li · Bin Zhao

[ Exhibit Hall I ]

Abstract
Open-vocabulary 3D scene understanding is indispensable for embodied agents. Recent works leverage pretrained vision-language models (VLMs) for object segmentation and project them to point clouds to build 3D maps. Despite progress, a point cloud is a set of unordered coordinates that requires substantial storage space and cannot directly convey occupancy information or spatial relations, making existing methods inefficient for downstream tasks, e.g., path planning and complex text-based object retrieval. To address these issues, we propose Octree-Graph, a novel scene representation for open-vocabulary 3D scene understanding. Specifically, a Chronological Group-wise Segment Merging (CGSM) strategy and an Instance Feature Aggregation (IFA) algorithm are first designed to obtain 3D instances and corresponding semantic features. Subsequently, an adaptive-octree structure is developed that stores semantics and depicts the occupancy of an object adjustably according to its shape. Finally, the Octree-Graph is constructed, where each adaptive-octree acts as a graph node and edges describe the spatial relations among nodes. Extensive experiments on various tasks are conducted on several widely-used datasets, demonstrating the versatility and effectiveness of our method. Codes will be publicly available.
Poster
Emily Jia · Jiageng Mao · Zhiyuan Gao · Yajie Zhao · Yue Wang

[ Exhibit Hall I ]

Abstract
Humans possess an exceptional ability to imagine 4D scenes, encompassing both motion and 3D geometry, from a single still image. This ability is rooted in our accumulated observations of similar scenes and an intuitive understanding of physics. In this paper, we aim to replicate this capacity in neural networks, specifically focusing on natural fluid imagery. Existing methods for this task typically employ simplistic 2D motion estimators to animate the image, leading to motion predictions that often defy physical principles, resulting in unrealistic animations. Our approach introduces a novel method for generating 4D scenes with physics-consistent animation from a single image. We propose the use of a physics-informed neural network that predicts motion for each point, guided by a loss term derived from fundamental physical principles, including the Navier-Stokes equations. To reconstruct the 3D geometry, we predict feature-based 3D Gaussians from the input image, which are then animated using the predicted motions and rendered from any desired camera perspective. Experimental results highlight the effectiveness of our method in producing physically plausible animations, showcasing significant performance improvements over existing methods.
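For reference, the incompressible Navier-Stokes residuals that a physics-informed loss of this kind typically penalizes are shown below in their textbook form; this is not necessarily the exact loss used in the paper.

```latex
% Incompressible Navier-Stokes residuals as physics-informed penalties.
% u: velocity field, p: pressure, rho: density, nu: kinematic viscosity.
\begin{aligned}
\mathcal{R}_{\mathrm{mom}} &= \frac{\partial \mathbf{u}}{\partial t}
  + (\mathbf{u}\cdot\nabla)\mathbf{u} + \tfrac{1}{\rho}\nabla p
  - \nu \nabla^{2}\mathbf{u}, \\
\mathcal{R}_{\mathrm{div}} &= \nabla\cdot\mathbf{u}, \qquad
\mathcal{L}_{\mathrm{phys}} = \big\lVert \mathcal{R}_{\mathrm{mom}} \big\rVert^{2}
  + \big\lVert \mathcal{R}_{\mathrm{div}} \big\rVert^{2}.
\end{aligned}
```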
Poster
Ming Dai · Wenxuan Cheng · Jiedong Zhuang · Jiang-Jiang Liu · Hongshen Zhao · Zhenhua Feng · Wankou Yang

[ Exhibit Hall I ]

Abstract
Recent advances in visual grounding have largely shifted away from traditional proposal-based two-stage frameworks due to their inefficiency and high computational complexity, favoring end-to-end direct reference paradigms. However, these methods rely exclusively on the referred target for supervision, overlooking the potential benefits of prominent prospective targets. Moreover, existing approaches often fail to incorporate multi-granularity discrimination, which is crucial for robust object identification in complex scenarios. To address these limitations, we propose PropVG, an end-to-end proposal-based framework that, to the best of our knowledge, is the first to seamlessly integrate foreground object proposal generation with referential object comprehension without requiring additional detectors. Furthermore, we introduce a Contrastive-based Refer Scoring (CRS) module, which employs contrastive learning at both sentence and word levels to enhance the model’s capability in understanding and distinguishing referred objects. Additionally, we design a Multi-granularity Target Discrimination (MTD) module that fuses object- and semantic-level information to improve the recognition of absent targets. Extensive experiments on gRefCOCO (GREC/GRES), Ref-ZOM, R-RefCOCO/+/g, and RefCOCO/+/g (REC/RES) benchmarks demonstrate the effectiveness of PropVG.
Poster
Zheng Zhang · Lihe Yang · Tianyu Yang · Chaohui Yu · Xiaoyang Guo · Yixing Lao · Hengshuang Zhao

[ Exhibit Hall I ]

Abstract
Recent advances in monocular depth estimation have significantly improved its robustness and accuracy. Despite these improvements, relative depth models, which offer strong generalization capability, fail to provide real-world depth measurements. Notably, these models exhibit severe flickering and 3D inconsistency when applied to video data, limiting their application for 3D reconstruction. To address these challenges, we introduce StableDepth, a scene-consistent and scale-invariant depth estimation method that achieves stable predictions with scene-level 3D consistency. We propose a dual decoder structure to learn smooth depth supervised by large-scale unlabeled video data. Our approach not only enhances the generalization capability but also reduces flickering during video depth estimation. Leveraging the vast amount of unlabeled video data, our method offers extensive stability and is easy to scale up with low cost. Unlike previous methods requiring full video sequences, StableDepth enables online inference at 13× faster speed, while achieving significant accuracy improvements (6.4%–86.8%) across multiple benchmarks and delivering comparable temporal consistency to video diffusion based depth estimators. We highly encourage viewing the supplementary video materials to gain a better understanding of the effectiveness of our approach.
Poster
Jun-Hee Kim · Jumin Han · Seong-Whan Lee

[ Exhibit Hall I ]

Abstract
Standard 3D human pose estimation (HPE) benchmarks employ root-centering, which normalizes poses relative to the pelvis but discards absolute root position information. While effective for evaluation, this approach limits real-world applications such as motion tracking, AR/VR, and human-computer interaction, where absolute root position is essential. Moreover, incorporating root position into these models often leads to performance degradation. To address these limitations, we introduce PoseAnchor, a unified framework that seamlessly integrates root position estimation while improving overall pose accuracy. PoseAnchor leverages Iterative Hard Thresholding Robust Least Squares Regression (ITRR), a novel robust regression approach introduced to 3D HPE for the first time. ITRR effectively mitigates the impact of noisy 2D detections, enabling more accurate root position estimation. With ITRR, PoseAnchor enables zero-shot root localization, allowing existing models to estimate absolute root positions without retraining or architectural modifications. ITRR identifies a support set of reliable joints based on their spatial relationships to achieve robust root estimation, effectively filtering out unreliable joints. Beyond zero-shot localization, PoseAnchor incorporates ITRR into a Data-Driven Training framework that selectively utilizes the support set to optimize pose learning. By dynamically filtering high-confidence joint data, PoseAnchor mitigates noise while improving robustness. Experiments demonstrate that PoseAnchor achieves state-of-the-art results, surpassing both root-centered and root-aware methods in fully …
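Iterative hard thresholding for robust least squares can be illustrated generically: refit on the observations with the smallest residuals and repeat. The sketch below follows that idea under illustrative sizes; it is not the paper's exact ITRR algorithm.

```python
import numpy as np

def hard_threshold_lsq(A, b, support_size, n_iters=10):
    """Robust least squares via iterative hard thresholding: keep only the
    `support_size` rows with the smallest residuals and refit."""
    x = np.linalg.lstsq(A, b, rcond=None)[0]
    for _ in range(n_iters):
        residuals = np.abs(A @ x - b)
        keep = np.argsort(residuals)[:support_size]      # current support set
        x = np.linalg.lstsq(A[keep], b[keep], rcond=None)[0]
    return x, keep

# Toy example: 17 noisy joint observations, 3 of them grossly corrupted.
A = np.random.randn(17, 3)
x_true = np.array([0.5, -1.0, 2.0])
b = A @ x_true + 0.01 * np.random.randn(17)
b[:3] += 5.0                                             # outliers
x_est, support = hard_threshold_lsq(A, b, support_size=12)
```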
Poster
Ling Liu · Jun Tian · Li Yi

[ Exhibit Hall I ]

Abstract
4D panoptic segmentation in a streaming setting is critical for highly dynamic environments, such as evacuating dense crowds and autonomous driving in complex scenarios, where real-time, fine-grained perception within a constrained time budget is essential. In this paper, we introduce 4DSegStreamer, a novel framework that employs a Dual-Thread System to efficiently process streaming frames. Our method consists of a predictive thread and an inference thread. The predictive thread leverages historical motion and geometric information to extract features and forecast future dynamics. The inference thread ensures timely prediction for incoming frames by aligning with the latest memory and compensating for ego-motion and dynamic object movements. We evaluate 4DSegStreamer on the indoor HOI4D dataset and the outdoor SemanticKITTI and nuScenes datasets. Comprehensive experiments demonstrate the effectiveness of our approach, particularly in accurately predicting dynamic objects in complex scenes.
Poster
Artem Nikonorov · Georgy Perevozchikov · Andrei Korepanov · Nancy Mehta · Mahmoud Afifi · Egor Ershov · Radu Timofte

[ Exhibit Hall I ]

Abstract
We present cmKAN, a versatile framework for color matching. Given an input image with colors from a source color distribution, our method effectively and accurately maps these colors to match a target color distribution in both supervised and unsupervised settings. Our framework leverages the spline capabilities of Kolmogorov-Arnold Networks (KANs) to model the color matching between source and target distributions. Specifically, we developed a hypernetwork that generates spatially varying weight maps to control the nonlinear splines of a KAN, enabling accurate color matching. As part of this work, we introduce a first large-scale dataset of paired images captured by two distinct cameras and evaluate the efficacy of our and existing methods in matching colors. We evaluated our approach across various color-matching tasks, including: (1) raw-to-raw mapping, where the source color distribution is in one camera’s raw color space and the target in another camera’s raw space; (2) raw-to-sRGB mapping, where the source color distribution is in a camera’s raw space and the target is in the display sRGB space, emulating the color rendering of a camera ISP; and (3) sRGB-to-sRGB mapping, where the goal is to transfer colors from a source sRGB space (e.g., produced by a source camera ISP) …
Poster
Junyuan Deng · Wei Yin · Xiaoyang Guo · Qian Zhang · Xiaotao Hu · Weiqiang Ren · XIAOXIAO LONG · Ping Tan

[ Exhibit Hall I ]

Abstract
In this paper, we present DM-Calib, a diffusion-based approach for estimating pinhole camera intrinsic parameters from a single input image. Monocular camera calibration is essential for many 3D vision tasks. However, most existing methods depend on handcrafted assumptions or are constrained by limited training data, resulting in poor generalization across diverse real-world images. Recent advancements in stable diffusion models, trained on massive data, have shown the ability to generate high-quality images with varied characteristics. Emerging evidence indicates that these models implicitly capture the relationship between camera focal length and image content. Building on this insight, we explore how to leverage the powerful priors of diffusion models for monocular pinhole camera calibration. Specifically, we introduce a new image-based representation, termed Camera Image, which losslessly encodes the numerical camera intrinsics and integrates seamlessly with the diffusion framework. Using this representation, we reformulate the problem of estimating camera intrinsics as the generation of a dense Camera Image conditioned on an input image. By fine-tuning a stable diffusion model to generate a Camera Image from a single RGB input, we can extract camera intrinsics via a RANSAC operation. We further demonstrate that our monocular calibration method enhances performance across various 3D tasks, including zero-shot …
Poster
Zhenyang Liu · Yikai Wang · Kuanning Wang · Longfei Liang · Xiangyang Xue · Yanwei Fu

[ Exhibit Hall I ]

Abstract
Visual imitation learning is effective for robots to learn versatile tasks. However, many existing methods rely on behavior cloning with supervised historical trajectories, limiting their 3D spatial and 4D spatiotemporal awareness. Consequently, these methods struggle to capture the 3D structures and 4D spatiotemporal relationships necessary for real-world deployment. In this work, we propose 4D Diffusion Policy (DP4), a novel visual imitation learning method that incorporates spatiotemporal awareness into diffusion-based policies. Unlike traditional approaches that rely on trajectory cloning, DP4 leverages a dynamic Gaussian world model to guide the learning of 3D spatial and 4D spatiotemporal perceptions from interactive environments. Our method constructs the current 3D scene from a single-view RGB-D observation and predicts the future 3D scene, optimizing trajectory generation by explicitly modeling both spatial and temporal dependencies. Extensive experiments across 17 simulation tasks with 173 variants and 3 real-world robotic tasks demonstrate that the 4D Diffusion Policy (DP4) outperforms baseline methods, improving the average simulation task success rate by 16.4% (Adroit), 14% (DexArt), and 6.45% (RLBench), and the average real-world robotic task success rate by 8.6%.
Poster
Muhammad Danish · Muhammad Akhtar Munir · Syed Shah · Kartik Kuckreja · Fahad Khan · Paolo Fraccaro · Alexandre Lacoste · Salman Khan

[ Exhibit Hall I ]

Abstract
While numerous recent benchmarks focus on evaluating generic Vision-Language Models (VLMs), they do not effectively address the specific challenges of geospatial applications. Generic VLM benchmarks are not designed to handle the complexities of geospatial data, an essential component for applications such as environmental monitoring, urban planning, and disaster management. Key challenges in the geospatial domain include temporal change detection, large-scale object counting, tiny object detection, and understanding relationships between entities in remote sensing imagery. To bridge this gap, we present GEOBench-VLM, a comprehensive benchmark specifically designed to evaluate VLMs on geospatial tasks, including scene understanding, object counting, localization, fine-grained categorization, segmentation, and temporal analysis. Our benchmark features over 10,000 manually verified instructions spanning diverse visual conditions, object types, and scales. We evaluate several state-of-the-art VLMs to assess performance on geospatial-specific challenges. The results indicate that although existing VLMs demonstrate potential, they face challenges when dealing with geospatial-specific tasks, highlighting the room for further improvements. Notably, the best-performing LLaVa-OneVision achieves only 41.7% accuracy on MCQs, slightly more than GPT-4o, which is approximately double the random guess performance. Our benchmark will be publicly available.
Poster
Petr Hruby · Marc Pollefeys

[ Exhibit Hall I ]

Abstract
We propose a novel approach for estimating the relative pose between rolling shutter cameras using the intersections of line projections with a single scanline per image. This allows pose estimation without explicitly modeling camera motion. Alternatively, scanlines can be selected within a single image, enabling single-view relative pose estimation for scanlines of rolling shutter cameras. Our approach is designed as a foundational building block for rolling shutter structure-from-motion (SfM), where no motion model is required, and each scanline's pose can be computed independently. We classify minimal solvers for this problem in both generic and specialized settings, including cases with parallel lines and known gravity direction. Furthermore, we develop minimal solvers for the parallel-lines scenario, both with and without gravity priors, by leveraging connections between this problem and the estimation of 2D structure from 1D cameras. Experiments on rolling shutter images from the Fastec dataset demonstrate the feasibility of our approach for initializing rolling shutter SfM, highlighting its potential for further development. The code will be made publicly available.
Poster
William Gao · Dilin Wang · Yuchen Fan · Aljaz Bozic · Tuur Stuyck · Zhengqin Li · Zhao Dong · Rakesh Ranjan · Nikolaos Sarafianos

[ Exhibit Hall I ]

Abstract
We present a novel approach to mesh shape editing, building on recent progress in 3D reconstruction from multi-view images. We formulate shape editing as a conditional reconstruction problem, where the model must reconstruct the input shape with the exception of a specified 3D region, in which the geometry should be generated from the conditional signal. To this end, we train a conditional Large Reconstruction Model (LRM) for masked reconstruction, using multi-view consistent masks rendered from a randomly generated 3D occlusion, and using one clean viewpoint as the conditional signal. During inference, we manually define a 3D region to edit and provide an edited image from a canonical viewpoint to fill that region. We demonstrate that, in just a single forward pass, our method not only preserves the input geometry in the unmasked region through reconstruction capabilities on par with SoTA, but is also expressive enough to perform a variety of mesh edits from a single image guidance that past works struggle with, while being 2-10 times faster than the top-performing prior work.
Poster
Yulin Wang · Mengting Hu · Hongli Li · Chen LUO

[ Exhibit Hall I ]

Abstract
In pose estimation for seen objects, a prevalent pipeline involves using neural networks to predict dense 3D coordinates of the object surface on 2D images, which are then used to establish dense 2D-3D correspondences. However, current methods primarily focus on more efficient encoding techniques to improve the precision of predicted 3D coordinates on the object's front surface, overlooking the potential benefits of incorporating the back surface and interior of the object. To better utilize the full surface and interior of the object, this study predicts 3D coordinates of both the object's front and back surfaces and densely samples 3D coordinates between them. This process creates ultra-dense 2D-3D correspondences, effectively enhancing pose estimation accuracy based on the Perspective-n-Point (PnP) algorithm. Additionally, we propose Hierarchical Continuous Coordinate Encoding (HCCE) to provide a more accurate and efficient representation of front and back surface coordinates. Experimental results show that, compared to existing state-of-the-art (SOTA) methods on the BOP website, the proposed approach outperforms across seven classic BOP core datasets.
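A minimal sketch of the generic dense-correspondence-to-pose step this abstract feeds into, using OpenCV's RANSAC PnP on synthetic 2D-3D matches. It does not implement the paper's HCCE encoding or back-surface sampling; the camera intrinsics, noise level, and point counts are assumed for illustration.

```python
"""Hedged sketch: pose recovery from dense 2D-3D correspondences via PnP."""
import numpy as np
import cv2

rng = np.random.default_rng(0)

# Synthetic "ultra-dense" object-space coordinates (stand-in for predicted
# front/back/interior 3D coordinates).
pts_3d = rng.uniform(-0.1, 0.1, size=(5000, 3)).astype(np.float64)

# Assumed ground-truth pose and a simple pinhole camera.
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
rvec_gt = np.array([0.1, -0.2, 0.05])
tvec_gt = np.array([0.02, -0.01, 0.6])

# Project to 2D and add pixel noise to mimic imperfect predictions.
pts_2d, _ = cv2.projectPoints(pts_3d, rvec_gt, tvec_gt, K, None)
pts_2d = pts_2d.reshape(-1, 2) + rng.normal(0, 0.5, size=(5000, 2))

# RANSAC-PnP over the dense correspondences.
ok, rvec, tvec, inliers = cv2.solvePnPRansac(pts_3d, pts_2d, K, None,
                                             reprojectionError=2.0)
# Rough sanity check of the recovered pose (axis-angle / translation gap).
print("rotation gap:", np.linalg.norm(rvec.ravel() - rvec_gt))
print("translation gap:", np.linalg.norm(tvec.ravel() - tvec_gt))
```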
Poster
Kaijie Yin · Zhiyuan Zhang · Shu Kong · Tian Gao · Cheng-zhong Xu · Hui Kong

[ Exhibit Hall I ]

Abstract
In this paper, we propose Binarized Change Detection (BiCD), the first binary neural network (BNN) designed specifically for change detection. Conventional network binarization approaches, which directly quantize both weights and activations in change detection models, severely limit the network's ability to represent input data and distinguish between changed and unchanged regions. This results in significantly lower detection accuracy compared to real-valued networks. To overcome these challenges, BiCD enhances both the representational power and feature separability of BNNs, improving detection performance. Specifically, we introduce an auxiliary objective based on the Information Bottleneck (IB) principle, guiding the encoder to retain essential input information while promoting better feature discrimination. Since directly computing mutual information under the IB principle is intractable, we design a compact, learnable auxiliary module as an approximation target, leading to a simple yet effective optimization strategy that minimizes both reconstruction loss and standard change detection loss. Extensive experiments on street-view and remote sensing datasets demonstrate that BiCD establishes a new benchmark for BNN-based change detection, achieving state-of-the-art performance in this domain.
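For readers unfamiliar with network binarization, a hedged sketch of the standard BNN building block such change-detection models quantize: sign binarization of weights and activations with a straight-through estimator (STE). This is not the BiCD architecture and omits its IB auxiliary module; shapes and the toy bi-temporal input are assumptions.

```python
"""Hedged sketch: a binarized convolution with a straight-through estimator."""
import torch
import torch.nn as nn
import torch.nn.functional as F

class BinarizeSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        # Standard STE: pass gradients only where |x| <= 1.
        return grad_out * (x.abs() <= 1).float()

class BinaryConv2d(nn.Conv2d):
    def forward(self, x):
        w_bin = BinarizeSTE.apply(self.weight)
        x_bin = BinarizeSTE.apply(x)
        return F.conv2d(x_bin, w_bin, self.bias, self.stride,
                        self.padding, self.dilation, self.groups)

# Toy bi-temporal change-detection input: two RGB frames stacked on channels.
frames = torch.randn(2, 6, 64, 64, requires_grad=True)
layer = BinaryConv2d(6, 16, kernel_size=3, padding=1)
out = layer(frames)
out.mean().backward()                     # gradients flow through the STE
print(out.shape, frames.grad.abs().sum() > 0)
```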
Poster
Andrew Bond · Jui-Hsien Wang · Long Mai · Erkut Erdem · Aykut Erdem

[ Exhibit Hall I ]

Abstract
Efficient neural representations for dynamic video scenes are critical for applications ranging from video compression to interactive simulations. Yet, existing methods often face challenges related to high memory usage, lengthy training times, and temporal consistency. To address these issues, we introduce a novel neural video representation that combines 3D Gaussian splatting with continuous camera motion modeling. By leveraging Neural ODEs, our approach learns smooth camera trajectories while maintaining an explicit 3D scene representation through Gaussians. Additionally, we introduce a spatiotemporal hierarchical learning strategy, progressively refining spatial and temporal features to enhance reconstruction quality and accelerate convergence. This memory-efficient approach achieves high-quality rendering at impressive speeds. Experimental results show that our hierarchical learning, combined with robust camera motion modeling, captures complex dynamic scenes with strong temporal consistency, achieving state-of-the-art performance across diverse video datasets in both high- and low-motion scenarios. Unlike prior methods that depend heavily on extensive external supervision, our approach operates entirely within a self-contained pipeline without requiring any additional supervision.
Poster
Shuting Dong · Mingzhi Chen · Feng Lu · Hao Yu · Guanghao Li · Zhe Wu · Ming Tang · Chun Yuan

[ Exhibit Hall I ]

Abstract
With the rapid advancement of Visual Place Recognition (VPR) systems, their unauthorized use on social media images enables monitoring of individuals' daily movements, posing serious privacy risks. However, privacy protection for addressing these risks in VPR systems remains an underexplored area. While adversarial perturbations have been widely explored for visual privacy protection, existing methods still fail to simultaneously satisfy the black-box constraint, imperceptibility, and real-time performance required in realistic VPR privacy protection scenarios. In this paper, we present the first look at privacy protection in VPR systems and introduce VPR-Cloak, an efficient privacy-preserving network. We introduce a saliency-aware prior to identify decisive regions for place recognition and propose Saliency-Aware Prior Guided Perturbation Optimization (SAP-PO) to selectively optimize perturbation generation in these areas. To enhance imperceptibility, we further optimize perturbations in the frequency domain, meticulously refining high-frequency components of perturbations while preserving low-frequency structures essential for human perception. Extensive experiments on multiple benchmark datasets and on various black-box VPR models verify that our method outperforms existing SOTA methods. Additionally, our method achieves a \textbf{15× speedup} in runtime compared to SOTA methods. We also validate the effectiveness of our method based on commercial APIs, including \textbf{Google and Microsoft Bing}, demonstrating the practical …
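A hedged sketch of the generic frequency-domain idea mentioned above: constraining a perturbation to high spatial frequencies with an FFT mask so that low-frequency structure, which matters most for human perception, is left untouched. The mask radius and the perturbation itself are illustrative; this is not VPR-Cloak's saliency-guided optimization.

```python
"""Hedged sketch: keep only high-frequency components of a perturbation."""
import torch

def high_frequency_only(delta: torch.Tensor, radius: int = 8) -> torch.Tensor:
    """Zero out low-frequency FFT coefficients of a (C, H, W) perturbation."""
    C, H, W = delta.shape
    spec = torch.fft.fftshift(torch.fft.fft2(delta), dim=(-2, -1))
    yy, xx = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    dist = (((yy - H // 2) ** 2 + (xx - W // 2) ** 2).float()).sqrt()
    keep = (dist > radius).to(spec.dtype)          # high-frequency mask
    spec = spec * keep
    return torch.fft.ifft2(torch.fft.ifftshift(spec, dim=(-2, -1))).real

delta = torch.randn(3, 224, 224) * 0.05            # toy perturbation
hf_delta = high_frequency_only(delta)
print(hf_delta.shape, float(hf_delta.abs().mean()))
```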
Poster
Nuo Chen · Chao Xiao · Yimian Dai · Shiman He · Miao Li · Wei An

[ Exhibit Hall I ]

Abstract
Small object detection (SOD) in the anti-UAV task is a challenging problem due to the small size of UAVs and complex backgrounds. Traditional frame-based cameras struggle to detect small objects in complex environments due to their low frame rates, limited dynamic range, and data redundancy. Event cameras, with microsecond temporal resolution and high dynamic range, provide a more effective solution for SOD. However, existing event-based object detection datasets are limited in scale, feature large target sizes, and lack diverse backgrounds, making them unsuitable for SOD benchmarks. In this paper, we introduce an Event-based Small Object Detection (EVSOD) dataset, named EV-UAV, the first large-scale, highly diverse benchmark for anti-UAV tasks. It includes 147 sequences with over 2.3 million event-level annotations, featuring extremely small targets (averaging 6.8 × 5.4 pixels) and diverse scenarios such as urban clutter and extreme lighting conditions. Furthermore, based on the observation that small moving targets form continuous curves in spatiotemporal event point clouds, we propose the Event-based Sparse Segmentation Network (EV-SpSegNet), a novel baseline for event segmentation in point cloud space, along with a Spatiotemporal Correlation (STC) loss that leverages motion continuity to guide the network in retaining target events. Extensive experiments on the EV-UAV dataset demonstrate the …
Poster
Hanxiao Jiang · Hao-Yu Hsu · Kaifeng Zhang · Hsin-Ni Yu · Shenlong Wang · Yunzhu Li

[ Exhibit Hall I ]

Abstract
Creating a physical digital twin of a real-world object has immense potential in robotics, content creation, and XR. In this paper, we present PhysTwin, a novel framework that uses sparse videos of dynamic objects in interaction to produce a photo- and physically realistic, real-time interactive virtual replica. Our approach centers on two key components: (1) a physics-informed representation that combines spring-mass models for realistic physical simulation, generative shape models for geometry, and Gaussian splats for rendering, and (2) a novel multi-stage optimization-based inverse modeling framework that reconstructs complete geometry, infers dense physical properties, and replicates realistic appearance from videos. Our method integrates an inverse physics framework with visual perception cues, enabling high-fidelity reconstruction even from partial, occluded, and limited viewpoints. PhysTwin supports modeling various deformable objects, including ropes, stuffed animals, cloth, and delivery packages. Experiments show that PhysTwin outperforms competing methods in reconstruction, rendering, future prediction, and simulation under novel interactions. We further demonstrate its applications in interactive real-time simulation and model-based robotic motion planning. (See our supplement webpage for all videos and demos.)
Poster
Xinli Xu · Wenhang Ge · Dicong Qiu · ZhiFei Chen · Dongyu Yan · Zhuoyun LIU · Haoyu Zhao · hanfeng Zhao · Shunsi Zhang · Junwei Liang · Ying-Cong Chen

[ Exhibit Hall I ]

Abstract
Estimating physical properties for visual data is a crucial task in computer vision, graphics, and robotics, underpinning applications such as augmented reality, physical simulation, and robotic grasping. However, this area remains under-explored due to the inherent ambiguities in physical property estimation. To address these challenges, we introduce GaussianProperty, a training-free framework that assigns physical properties of materials to 3D Gaussians. Specifically, we integrate the segmentation capability of SAM with the recognition capability of GPT-4V(ision) to formulate a global-local physical property reasoning module for 2D images. Then we project the physical properties from multi-view 2D images to 3D Gaussians using a voting strategy. We demonstrate that 3D Gaussians with physical property annotations enable applications in physics-based dynamic simulation and robotic grasping. For physics-based dynamic simulation, we leverage the Material Point Method (MPM) for realistic dynamic simulation. For robot grasping, we develop a grasping force prediction strategy that estimates a safe force range required for object grasping based on the estimated physical properties. Extensive experiments on material segmentation, physics-based dynamic simulation, and robotic grasping validate the effectiveness of our proposed method, highlighting its crucial role in understanding physical properties from visual data.
Poster
Gang Fu

[ Exhibit Hall I ]

Abstract
Dichromatic Reflection Model (DRM), a widely used physical image formation model, has been extensively applied to specular highlight removal. However, traditional DRM solvers fail to effectively recover the missing content underneath specular highlights and are prone to incur visual artifacts. Additionally, existing deep learning-based methods do not exploit the underlying variables in DRM; instead, they primarily learn to translate an input image into its diffuse image (and specular residue image). As a result, their performance remains somewhat limited. To overcome these issues, we propose a neural DRM solver for specular highlight removal. Our pipeline for the solver consists of three networks: Highlight Detection Network (HDNet), Alpha-chrom Estimation Network (ACENet), and Refinement Network (RNet). Specifically, HDNet is first used to detect specular highlights. Meanwhile, leveraging multi-level contextual contrasted features from HDNet, ACENet estimates the underlying variables in DRM. Using these estimates, our new reconstruction models generate specular-free and specular residue images. To bridge the domain gap between color spaces, we additionally introduce RNet to refine the results. Extensive experiments on various datasets demonstrate that our neural solver is superior to previous traditional solvers as well as deep learning-based methods.
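A toy, hedged illustration of the dichromatic reflection model itself (image = diffuse body reflection + specular residue), which is the formation model the abstract's solver inverts. The HDNet/ACENet/RNet networks are not reproduced; the illumination color and highlight map are made-up placeholders.

```python
"""Hedged sketch: DRM image formation and idealized highlight removal."""
import numpy as np

H, W = 64, 64
rng = np.random.default_rng(0)

diffuse = rng.uniform(0.0, 0.8, size=(H, W, 3))               # body reflection
illum_color = np.array([1.0, 0.95, 0.9])                      # assumed light color
alpha_spec = np.clip(rng.normal(0.0, 0.15, (H, W, 1)), 0, 1)  # highlight strength

specular_residue = alpha_spec * illum_color                   # interface reflection
image = np.clip(diffuse + specular_residue, 0.0, 1.0)         # DRM composition

# Under this toy model, "highlight removal" is subtracting the residue.
recovered_diffuse = np.clip(image - specular_residue, 0.0, 1.0)
print("max recovery error:", np.abs(recovered_diffuse - np.clip(diffuse, 0, 1)).max())
```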
Poster
Haowen Bai · Jiangshe Zhang · Zixiang Zhao · Lilun Deng · Yukun Cui · Shuang Xu

[ Exhibit Hall I ]

Abstract
Multi-exposure image fusion consolidates multiple low dynamic range images of the same scene into a singular high dynamic range image. Retinex theory, which separates image illumination from scene reflectance, is naturally adopted to ensure consistent scene representation and effective information fusion across varied exposure levels. However, the conventional pixel-wise multiplication of illumination and reflectance inadequately models the glare effect induced by overexposure. To better adapt this theory for multi-exposure image fusion, we introduce an unsupervised and controllable method termed Retinex-MEF. Specifically, our method decomposes multi-exposure images into separate illumination components and a shared reflectance component, effectively modeling the glare induced by overexposure. Employing a bidirectional loss constraint to learn the common reflectance component, our approach effectively mitigates the glare effect. Furthermore, we establish a controllable exposure fusion criterion, enabling global exposure adjustments while preserving contrast, thus overcoming the constraints of fixed-level fusion. A series of experiments across multiple datasets, including underexposure-overexposure fusion, exposure control fusion, and homogeneous extreme exposure fusion, demonstrate the effective decomposition and flexible fusion capability of our model. The code will be released.
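A hedged sketch of the Retinex factorization the abstract builds on, with each exposure modeled as an illumination map times a shared reflectance. The glare term and the learned decomposition are not included; all values are synthetic.

```python
"""Hedged sketch: Retinex model I_i = L_i * R with a shared reflectance R."""
import numpy as np

rng = np.random.default_rng(0)
H, W = 32, 32
R = rng.uniform(0.2, 1.0, size=(H, W, 3))          # shared scene reflectance

# Two exposures modeled as two (here constant) illumination maps.
L_under = np.full((H, W, 1), 0.3)
L_over = np.full((H, W, 1), 1.6)
I_under = L_under * R
I_over = np.clip(L_over * R, 0, 1)                  # clipping mimics overexposure

# Toy "controllable fusion": pick a target illumination and re-render R.
L_target = np.full((H, W, 1), 0.8)
fused = np.clip(L_target * R, 0, 1)
print(I_under.mean(), I_over.mean(), fused.mean())
```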
Poster
Yuqi Li · Chuanguang Yang · Hansheng Zeng · Zeyu Dong · Zhulin An · Yongjun Xu · Yingli Tian · Hao Wu

[ Exhibit Hall I ]

Abstract
Spatiotemporal forecasting tasks, such as traffic flow, combustion dynamics, and weather forecasting, often require complex models that suffer from low training efficiency and high memory consumption. This paper proposes a lightweight framework, Spectral Decoupled Knowledge Distillation, which transfers the multi-scale spatiotemporal representations from a complex teacher model to a more efficient lightweight student network. The teacher model follows an encoder-latent evolution-decoder architecture, where its latent evolution module decouples high-frequency details (e.g., instant traffic fluctuations) and low-frequency trends (e.g., long-term weather evolution) using convolution (local high-frequency extractor) and Transformer (global low-frequency modeler). However, the multi-layer convolution and deconvolution structures result in slow training and high memory usage. To address these issues, we propose a frequency-aligned knowledge distillation strategy, which extracts multi-scale spectral features from the teacher’s latent space, including high and low frequency components, to guide the lightweight student model (e.g., ResNet, U-Net) in capturing both local fine-grained variations and global evolution patterns. Experiments show that the student model achieves over 95% of the teacher’s forecasting accuracy while using only 20%-30% of its memory, with training speed improved by more than 50%. Our theoretical analysis reveals that the frequency-domain decoupling enables the student model to capture long-range dependencies without the need …
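A hedged sketch of frequency-decoupled feature distillation in the spirit described above: teacher and student latent maps are split into low- and high-frequency bands with an FFT mask and matched per band. The band radius, feature shapes, and loss weights are assumptions, not the paper's settings.

```python
"""Hedged sketch: per-band (low/high frequency) feature distillation loss."""
import torch
import torch.nn.functional as F

def split_bands(feat: torch.Tensor, radius: int = 4):
    """feat: (B, C, H, W) -> (low_freq, high_freq) spatial maps."""
    B, C, H, W = feat.shape
    spec = torch.fft.fftshift(torch.fft.fft2(feat), dim=(-2, -1))
    yy, xx = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    dist = (((yy - H // 2) ** 2 + (xx - W // 2) ** 2).float()).sqrt()
    low_mask = (dist <= radius).to(spec.dtype)
    low = torch.fft.ifft2(torch.fft.ifftshift(spec * low_mask, dim=(-2, -1))).real
    return low, feat - low

teacher_feat = torch.randn(2, 64, 32, 32)                # frozen teacher latents
student_feat = torch.randn(2, 64, 32, 32, requires_grad=True)

t_low, t_high = split_bands(teacher_feat)
s_low, s_high = split_bands(student_feat)
kd_loss = F.mse_loss(s_low, t_low) + 0.5 * F.mse_loss(s_high, t_high)
kd_loss.backward()
print(float(kd_loss))
```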
Poster
Zixuan Hu · Dongxiao Li · Xinzhu Ma · SHIXIANG TANG · Xiaotong Li · Wenhan Yang · LINGYU DUAN

[ Exhibit Hall I ]

Abstract
Accurate monocular 3D object detection (M3OD) is pivotal for safety-critical applications like autonomous driving, yet its reliability deteriorates significantly under real-world domain shifts caused by environmental or sensor variations. To address these shifts, Test-Time Adaptation (TTA) methods have emerged, enabling models to adapt to target distributions during inference. While prior TTA approaches recognize the positive correlation between low uncertainty and high generalization ability, they fail to address the dual uncertainty inherent to M3OD: semantic uncertainty (ambiguous class predictions) and geometric uncertainty (unstable spatial localization). To bridge this gap, we propose Dual Uncertainty Optimization (**DUO**), the first TTA framework designed to jointly minimize both uncertainties for robust M3OD. Through a convex optimization lens, we introduce an innovative convex structure of the focal loss and further derive a novel conjugate loss, enabling label-agnostic uncertainty weighting and balanced learning for high-uncertainty objects. In parallel, we design a semantic-aware normal field constraint that preserves geometric coherence in regions with clear semantic cues, reducing uncertainty from the unstable 3D representation. This dual-branch mechanism forms a complementary loop: enhanced spatial perception improves semantic classification, and robust semantic predictions further refine spatial understanding. Extensive experiments demonstrate the superiority of DUO over existing methods across various datasets and …
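For reference, a hedged sketch of the standard focal loss that the abstract's convex reformulation starts from; the derived conjugate loss and the semantic-aware normal field constraint are not reproduced here, and the tensor shapes are illustrative.

```python
"""Hedged sketch: the standard (binary) focal loss on raw logits."""
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    """targets in {0, 1}; gamma down-weights easy examples."""
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

logits = torch.randn(8, 10)
targets = (torch.rand(8, 10) > 0.9).float()   # sparse positives, as in detection
print(float(focal_loss(logits, targets)))
```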
Poster
Dimitrios Mallis · Ahmet Karadeniz · Sebastian Cavada · Danila Rukhovich · Niki Foteinopoulou · Kseniya Cherenkova · Anis Kacem · Djamila Aouada

[ Exhibit Hall I ]

Abstract
We propose CAD-Assistant, a general-purpose CAD agent for AI-assisted design. Our approach is based on a powerful Vision and Large Language Model (VLLM) as a planner and a tool-augmentation paradigm using CAD-specific tools. CAD-Assistant addresses multimodal user queries by generating actions that are iteratively executed on a Python interpreter equipped with the FreeCAD software, accessed via its Python API. Our framework is able to assess the impact of generated CAD commands on geometry and adapts subsequent actions based on the evolving state of the CAD design. We consider a wide range of CAD-specific tools including a sketch image parameterizer, rendering modules, a 2D cross-section generator, and other specialized routines. CAD-Assistant is evaluated on multiple CAD benchmarks, where it outperforms VLLM baselines and supervised task-specific methods. Beyond existing benchmarks, we qualitatively demonstrate the potential of tool-augmented VLLMs as general-purpose CAD solvers across diverse workflows.
Poster
Edgar Sucar · Zihang Lai · Eldar Insafutdinov · Andrea Vedaldi

[ Exhibit Hall I ]

Abstract
DUSt3R has recently shown that one can reduce many tasks in multi-view geometry, including estimating camera intrinsics and extrinsics, reconstructing the scene in 3D, and establishing image correspondences, to the prediction of a pair of viewpoint-invariant point maps, i.e., pixel-aligned point clouds defined in a common reference frame. This formulation is elegant and powerful, but unable to tackle dynamic scenes. To address this challenge, we introduce the concept of Dynamic Point Maps (DPM), extending standard point maps to support 4D tasks such as motion segmentation, scene flow estimation, 3D object tracking, and 2D correspondence. Our key intuition is that, when time is introduced, there are several possible spatial and time references that can be used to define the point maps. We identify a minimal subset of such combinations that can be regressed by a network to solve the subtasks mentioned above. We train a DPM predictor on a mixture of synthetic and real data and evaluate it across diverse benchmarks for video depth prediction, dynamic point cloud reconstruction, 3D scene flow and object pose tracking, achieving state-of-the-art performance.
Poster
Yufei Zhang · Zijun Cui · Jeffrey Kephart · Qiang Ji

[ Exhibit Hall I ]

Abstract
While 3D hand reconstruction from monocular images has made significant progress, generating accurate and temporally coherent motion estimates from video sequences remains challenging, particularly during complex hand-object interactions. In this paper, we present a novel 3D hand motion recovery framework that enhances image-based reconstructions through a diffusion-based and physics-augmented motion refinement model. Our model captures the distribution of refined motion estimates conditioned on initial ones, generating improved sequences through an iterative denoising process. Instead of relying on scarce annotated video data, we train our model only using existing motion capture data without images. Moreover, we identify valuable intuitive physics knowledge during hand-object interactions, including key motion states and their associated motion constraints. We effectively integrate these physical insights into our diffusion model to improve its performance. Extensive experiments demonstrate that our approach significantly improves various frame-wise reconstruction methods, achieving state-of-the-art (SOTA) performance on existing benchmarks.
Poster
Vahid Balazadeh · Mohammadmehdi Ataei · Hyunmin Cheong · Amir Khasahmadi · Rahul Krishnan

[ Exhibit Hall I ]

Abstract
Physical reasoning, which involves interpreting object behaviors within dynamic environments, remains a significant challenge for Vision-Language Models (VLMs). The limitations in physical reasoning arise from an inability to translate learned knowledge into predictions about physical behavior. We perform a careful study to show how continual fine-tuning can mitigate this issue. However, fine-tuning is expensive for large models and impractical to repeatedly perform for every task. This necessitates the creation of modular and scalable ways to teach VLMs about physical reasoning. To that end, we introduce Physics Context Builders (PCBs), a novel modular framework where specialized VLMs are fine-tuned to generate detailed physical scene descriptions. These can be used as physical contexts for larger VLMs to enhance their reasoning capabilities. PCBs enable the separation of visual perception from reasoning, allowing us to analyze their relative contributions to physical understanding. We perform careful experiments on CLEVRER and on Falling Tower, a stability detection dataset with both simulated and real-world scenes, to demonstrate that PCBs provide substantial performance improvements, increasing average accuracy by up to 13.8\% on complex physical reasoning tasks. Notably, PCBs show strong Sim2Real transfer, successfully generalizing from simulated training data to real-world scenes. Our work demonstrates that enhancing visual perception through …
Poster
Xiuyu Wu · Xinhao Wang · Xiubin Zhu · Lan Yang · Jiyuan Liu · Xingchen Hu

[ Exhibit Hall I ]

Abstract
Due to the arbitrary orientation of objects in aerial images, rotation equivariance is a critical property for aerial object detectors. However, recent studies on rotation-equivariant aerial object detection remain scarce. Most detectors rely on data augmentation to enable models to learn approximately rotation-equivariant features. A few detectors have constructed rotation-equivariant networks, but due to the breaking of strict rotation equivariance by typical downsampling processes, these networks only achieve approximately rotation-equivariant backbones. Whether strict rotation equivariance is necessary for aerial image object detection remains an open question. In this paper, we implement a strictly rotation-equivariant backbone and neck network with a more advanced network structure and compare it with approximately rotation-equivariant networks to quantitatively measure the impact of rotation equivariance on the performance of aerial image detectors. Additionally, leveraging the inherently grouped nature of rotation-equivariant features, we propose a multi-branch head network that reduces the parameter count while improving detection accuracy. Based on the aforementioned improvements, this study proposes the Multi-branch head rotation-equivariant single-stage Detector (MessDet), which achieves state-of-the-art performance on the challenging aerial image datasets DOTA-v1.0, DOTA-v1.5 and DIOR-R with an exceptionally low parameter count. The code will be made publicly available.
Poster
Shuo Zhang · Chen Gao · Youfang Lin

[ Exhibit Hall I ]

Abstract
Light Field (LF) images captured under low illumination conditions typically exhibit low quality. Recent learning-based methods for low-light LF enhancement are generally tailored to specific illumination inputs, limiting their performance in real-world scenes. Moreover, how to maintain the inherent view-consistency in the enhanced images also remains a difficult problem. In this paper, we propose to explore the view consistency for scene-adaptive low-light LF enhancement. We first analyze the view consistency for LF illumination maps and design a self-supervised view-consistent loss to keep the consistency between the illumination maps of different views in LFs. To enhance the model's perception of illumination, we combine both global and local information to estimate the illumination map, which is easily plugged into other models. Subsequently, we use the illumination maps to light up the low-light LF images and restore the corruption to produce the final enhanced image. Extensive experiments demonstrate that our View-Consistency Network (VCNet) outperforms state-of-the-art methods on real-world low-light LF datasets in both fixed lighting conditions and dynamic lighting conditions. Our proposed illumination adjustment is also demonstrated to comprehensively improve the performance of existing methods in terms of both image quality and view consistency.
Poster
Yash Garg · Saketh Bachu · Arindam Dutta · Rohit Lal · Sarosij Bose · Calvin-Khang Ta · M. Salman Asif · Amit Roy-Chowdhury

[ Exhibit Hall I ]

Abstract
Human pose and shape (HPS) estimation methods have been extensively studied, with many demonstrating high zero-shot performance on in-the-wild images and videos. However, these methods often struggle in challenging scenarios involving complex human poses or significant occlusions. Although some studies address 3D human pose estimation under occlusion, they typically evaluate performance on datasets that lack realistic or substantial occlusions, e.g., most existing datasets introduce occlusions with random patches over the human or clipart-style overlays, which may not reflect real-world challenges. To bridge this gap in realistic occlusion datasets, we introduce a novel benchmark dataset, VOccl3D, a $\textbf{V}$ideo-based human $\textbf{Occ}$lusion dataset with $\textbf{3D}$ body pose and shape annotations. Inspired by works such as AGORA and BEDLAM, we constructed this dataset using advanced computer graphics rendering techniques, incorporating diverse real-world occlusion scenarios, clothing textures, and human motions. Additionally, we fine-tuned recent HPS methods, CLIFF and BEDLAM-CLIFF, on our dataset, demonstrating significant qualitative and quantitative improvements across multiple public datasets, as well as on the test split of our dataset, while comparing its performance with other state-of-the-art methods. Furthermore, we leveraged our dataset to enhance human detection performance under occlusion by fine-tuning an existing object detector, YOLO11, thus leading to a robust end-to-end …
Poster
Jiahao Zhang · Zongli Jiang · Gang Wang · Jinli Zhang · Yixin Wei · Liang Li · Yizheng Wang

[ Exhibit Hall I ]

Abstract
Tracking flying drones in infrared videos is a crucial yet challenging task. Existing drone trackers and datasets have limitations in dealing with and characterizing tiny targets ($\leq$20×20 pixels) against highly complex backgrounds. To tackle this issue, we have developed a large-scale benchmark for tiny drone tracking in infrared videos (TDTIV), which comprises 290k frames and 280k manually annotated bounding boxes. Unlike traditional trackers that primarily rely on appearance matching, we introduce a novel method called Motion-Centric Adaptive Tracking (MCATrack), which initially employs a magnocell-inspired motion response to enhance the local signal-to-noise ratio of tiny target regions while suppressing complex clutter. Moreover, we design a Dynamic Cross-Guided module that integrates both initial and updated target features to address pose variations in long-term tracking. This module captures the latest target information to generate highly relevant candidate regions and refines them through precise optimization to achieve more accurate tracking results. Extensive experiments performed on the TDTIV and the well-recognized Anti-UAV 410 datasets have demonstrated the superiority of MCATrack over state-of-the-art competing trackers. The codes along with the benchmark will be made publicly available.
Poster
Jinhao Duan · Fei Kong · Hao Cheng · James Diffenderfer · Bhavya Kailkhura · Lichao Sun · Xiaofeng Zhu · Xiaoshuang Shi · Kaidi Xu

[ Exhibit Hall I ]

Abstract
Object Hallucination (OH) has been acknowledged as one of the major trustworthy challenges in Large Vision-Language Models (LVLMs). Recent advancements in Large Language Models (LLMs) indicate that internal states, such as hidden states, encode the “overall truthfulness” of generated responses. However, it remains under-explored how internal states in LVLMs function and whether they could serve as “per-token” hallucination indicators, which is essential for mitigating OH. In this paper, we first conduct an in-depth exploration of LVLM internal states in relation to OH issues and discover that (1) LVLM internal states are high-specificity per-token indicators of hallucination behaviors. Moreover, (2) different LVLMs encode universal patterns of hallucinations in common latent subspaces, indicating that there exist “generic truthful directions” shared by various LVLMs. Based on these discoveries, we propose Truthful-Guided Pre-Intervention (TruthPrInt) that first learns the truthful direction of LVLM decoding and then applies truthful-guided inference-time intervention during LVLM decoding. We further propose ComnHallu to enhance both cross-LVLM and cross-data hallucination detection transferability by constructing and aligning hallucination latent subspaces. We evaluate TruthPrInt in extensive experimental settings, including in-domain and out-of-domain scenarios, over popular LVLMs and OH benchmarks. Experimental results indicate that TruthPrInt significantly outperforms state-of-the-art methods in OH mitigation.
Poster
Johannes Jakubik · Felix Yang · Benedikt Blumenstiel · Erik Scheurer · Rocco Sedona · Stefano Maurogiovanni · Valerio Marsocci · Nikolaos Dionelis · Jente Bosmans · Niklas Kopp · Rahul Ramachandran · Paolo Fraccaro · Thomas Brunschwiler · Gabriele Cavallaro · Juan Moreno · Nicolas Longépé

[ Exhibit Hall I ]

Abstract
We present TerraMind, the first any-to-any generative, multimodal foundation model for Earth observation (EO). Unlike other multimodal models, TerraMind is pretrained on dual-scale representations combining both token-level and pixel-level data across modalities. On a token level, TerraMind encodes high-level contextual information to learn cross-modal relationships, while on a pixel level, TerraMind leverages fine-grained representations to capture critical spatial nuances. We pretrained TerraMind on nine geospatial modalities of a global, large-scale dataset. In this paper, we demonstrate that (i) TerraMind's dual-scale early fusion approach unlocks a range of zero-shot and few-shot applications for Earth observation, (ii) TerraMind introduces "thinking in modalities" (TiM)---the capability of generating additional artificial data during finetuning and inference to improve the model output---and (iii) TerraMind achieves beyond state-of-the-art performance in community-standard benchmarks for EO like PANGAEA. The pretraining dataset, the model weights, and our code will be open-sourced under a permissive license.
Poster
Erik Daxberger · Nina Wenzel · David Griffiths · Haiming Gang · Justin Lazarow · Gefen Kohavi · Kai Kang · Marcin Eichner · Yinfei Yang · Afshin Dehghan · Peter Grasch

[ Exhibit Hall I ]

Abstract
Multimodal large language models (MLLMs) excel at 2D visual understanding but remain limited in their ability to reason about 3D space. In this work, we leverage large-scale high-quality 3D scene data with open-set annotations to introduce 1) a novel supervised fine-tuning dataset and 2) a new evaluation benchmark, focused on indoor scenes. Our Cubify Anything VQA (CA-VQA) data covers diverse spatial tasks including spatial relationship prediction, metric size and distance estimation, and 3D grounding. We show that CA-VQA enables us to train MM-Spatial, a strong generalist MLLM that also achieves state-of-the-art performance on 3D spatial understanding benchmarks, including our own. We show how incorporating metric depth and multi-view inputs (provided in CA-VQA) can further improve 3D understanding, and demonstrate that data alone allows our model to achieve depth perception capabilities comparable to dedicated monocular depth estimation models. We will publish our SFT dataset and benchmark.
Poster
Yi-Ting Shen · Sungmin Eum · Doheon Lee · Rohit Shete · Chiao-Yi Wang · Heesung Kwon · Shuvra Bhattacharyya

[ Exhibit Hall I ]

Abstract
Composed pose retrieval (CPR) enables users to search for human poses by specifying a reference pose and a transition description, but progress in this field is hindered by the scarcity and inconsistency of annotated pose transitions. Existing CPR datasets rely on costly human annotations or heuristic-based rule generation, both of which limit scalability and diversity. In this work, we introduce AutoComPose, the first framework that leverages multimodal large language models (MLLMs) to automatically generate rich and structured pose transition descriptions. Our method enhances annotation quality by structuring transitions into fine-grained body part movements and introducing mirrored/swapped variations, while a cyclic consistency constraint ensures logical coherence between forward and reverse transitions. To advance CPR research, we construct and release two dedicated benchmarks, AIST-CPR and PoseFixCPR, supplementing prior datasets with enhanced attributes. Extensive experiments demonstrate that training retrieval models with AutoComPose yields superior performance over human-annotated and heuristic-based methods, significantly reducing annotation costs while improving retrieval quality. Our work pioneers the automatic annotation of pose transitions, establishing a scalable foundation for future CPR research.
Poster
Dinh-Vinh-Thuy Tran · Ruochen Chen · Shaifali Parashar

[ Exhibit Hall I ]

Abstract
Shape-from-Template (SfT) refers to the class of methods that reconstruct the 3D shape of a deforming object from images/videos using a 3D template. Traditional SfT methods require point correspondences between images and the texture of the 3D template in order to reconstruct 3D shapes from images/videos in real time. Their performance severely degrades when encountered with severe occlusions in the images because of the unavailability of correspondences. In contrast, modern SfT methods use a correspondence-free approach by incorporating deep neural networks to reconstruct 3D objects, thus requiring huge amounts of data for supervision. Recent advances use a fully unsupervised or self-supervised approach by combining differentiable physics and graphics to deform a 3D template to match input images. In this paper, we propose an unsupervised SfT which uses only image observations (color features, gradients, and silhouettes) along with a mesh inextensibility constraint, reconstructing at a $400\times$ faster pace than the best-performing unsupervised SfT. Moreover, when it comes to generating finer details and handling severe occlusions, our method outperforms the existing methodologies by a large margin. Code will be released upon acceptance.
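A hedged sketch of a mesh inextensibility term of the kind named above, penalizing edge-length changes between the template and the deformed mesh. The mesh, edge list, and weighting are synthetic and not the paper's exact constraint.

```python
"""Hedged sketch: edge-length preservation (inextensibility) loss."""
import torch

def inextensibility_loss(verts_def, verts_tmpl, edges):
    """edges: (E, 2) long tensor of vertex indices into both meshes."""
    def edge_len(v):
        return (v[edges[:, 0]] - v[edges[:, 1]]).norm(dim=-1)
    return ((edge_len(verts_def) - edge_len(verts_tmpl)) ** 2).mean()

verts_tmpl = torch.rand(100, 3)                          # template vertices
verts_def = (verts_tmpl + 0.01 * torch.randn(100, 3)).requires_grad_()
edges = torch.randint(0, 100, (300, 2))                  # toy connectivity

loss = inextensibility_loss(verts_def, verts_tmpl, edges)
loss.backward()                                          # gradients w.r.t. deformation
print(float(loss))
```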
Poster
shengyuan zhang · An Zhao · Ling Yang · Zejian Li · Chenye Meng · Haoran Xu · Tianrun Chen · AnYang Wei · Perry GU · Lingyun Sun

[ Exhibit Hall I ]

Abstract
Diffusion models have been applied to 3D LiDAR scene completion due to their strong training stability and high completion quality. However, the slow sampling speed limits the practical application of diffusion-based scene completion models since autonomous vehicles require an efficient perception of surrounding environments. This paper proposes a novel distillation method tailored for 3D LiDAR scene completion models, dubbed $\textbf{ScoreLiDAR}$, which achieves efficient yet high-quality scene completion. ScoreLiDAR enables the distilled model to sample in significantly fewer steps after distillation. To improve completion quality, we also introduce a novel $\textbf{Structural Loss}$, which encourages the distilled model to capture the geometric structure of the 3D LiDAR scene. The loss contains a scene-wise term constraining the holistic structure and a point-wise term constraining the key landmark points and their relative configuration. Extensive experiments demonstrate that ScoreLiDAR significantly accelerates the completion time from 30.55 to 5.37 seconds per frame ($>$5$\times$) on SemanticKITTI and achieves superior performance compared to state-of-the-art 3D LiDAR scene completion models.
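A hedged sketch of a two-term structural objective in the spirit of the abstract: a scene-wise Chamfer-style term plus a point-wise term preserving the relative configuration of a few landmark points. The point sets, landmark selection, and weights are illustrative, not ScoreLiDAR's formulation.

```python
"""Hedged sketch: scene-wise + point-wise structural terms on point clouds."""
import torch

def chamfer(a, b):
    d = torch.cdist(a, b)                      # (N, M) pairwise distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

def landmark_config(points, idx):
    lm = points[idx]
    return torch.cdist(lm, lm)                 # relative landmark configuration

student_pts = torch.rand(2048, 3, requires_grad=True)   # distilled model output (toy)
teacher_pts = torch.rand(2048, 3)                       # teacher / reference scene (toy)
idx = torch.randperm(2048)[:32]                         # "key landmark" indices (toy)

scene_term = chamfer(student_pts, teacher_pts)
point_term = (landmark_config(student_pts, idx)
              - landmark_config(teacher_pts, idx)).abs().mean()
loss = scene_term + 0.1 * point_term
loss.backward()
print(float(loss))
```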
Poster
Ruonan Yu · Songhua Liu · Zigeng Chen · Jingwen Ye · Xinchao Wang

[ Exhibit Hall I ]

Abstract
Dataset distillation or condensation aims to condense a large-scale training dataset into a much smaller synthetic one such that the training performance of distilled and original sets on neural networks are similar. Although the number of training samples can be reduced substantially, current state-of-the-art methods heavily rely on enormous soft labels to achieve satisfactory performance. As a result, the required storage can be comparable even to original datasets, especially for large-scale ones. To solve this problem, instead of storing these heavy labels, we propose a novel label-lightening framework termed HeLlO aiming at effective image-to-label projectors, with which synthetic labels can be directly generated online from synthetic images. Specifically, to construct such projectors, we leverage prior knowledge in open-source foundation models, e.g., CLIP, and introduce a LoRA-like fine-tuning strategy to mitigate the gap between pre-trained and target distributions, so that original models for soft-label generation can be distilled into a group of low-rank matrices. Moreover, an effective image optimization method is proposed to further mitigate the potential error between the original and distilled label generators. Extensive experiments show that our method significantly reduces the storage cost to merely 0.001% compared to full soft-label storage methods while achieving comparable performance to state-of-the-art …
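A hedged sketch of a LoRA-style low-rank adapter on a frozen linear projector, illustrating the "group of low-rank matrices" idea mentioned above; the feature dimension, rank, and scaling are assumptions, and this is not the HeLlO projector itself.

```python
"""Hedged sketch: LoRA-style low-rank adaptation of a frozen linear layer."""
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():          # freeze pre-trained weights
            p.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # Frozen projection plus trainable low-rank update B @ A.
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())

proj = LoRALinear(nn.Linear(512, 1000))           # e.g., image feature -> soft label
x = torch.randn(4, 512)
print(proj(x).shape)                               # torch.Size([4, 1000])
```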
Poster
Yung-Hsu Yang · Luigi Piccinelli · Mattia Segu · Siyuan Li · Rui Huang · Yuqian Fu · Marc Pollefeys · Hermann Blum · Zuria Bauer

[ Exhibit Hall I ]

Abstract
Monocular 3D object detection is valuable for various applications such as robotics and AR/VR. Existing methods are confined to closed-set settings, where the training and testing sets consist of the same scenes and/or object categories. However, real-world applications often introduce new environments and novel object categories, posing a challenge to these methods. In this paper, we address monocular 3D object detection in an open-set setting and introduce the first end-to-end 3D Monocular Open-set Object Detector (3D-MOOD). We propose to lift the open-set 2D detection into 3D space through our designed 3D bounding box head, enabling end-to-end joint training for both 2D and 3D tasks to yield better overall performance. We condition the object queries with geometry prior and overcome the generalization for 3D estimation across diverse scenes. To further improve performance, we design the canonical image space for more efficient cross-dataset training. We evaluate 3D-MOOD on both closed-set settings (Omni3D) and open-set settings (Omni3D → Argoverse 2, ScanNet), and achieve new state-of-the-art results. Code and models will be released.
Poster
Juliette Marrie · Romain Menegaux · Michael Arbel · Diane Larlus · Julien Mairal

[ Exhibit Hall I ]

Abstract
We address the problem of extending the capabilities of vision foundation models such as DINO, SAM, and CLIP, to 3D tasks. Specifically, we introduce a novel method to uplift 2D image features into Gaussian Splatting representations of 3D scenes. Unlike traditional approaches that rely on minimizing a reconstruction loss, our method employs a simpler and more efficient feature aggregation technique, augmented by a graph diffusion mechanism. Graph diffusion refines 3D features, such as coarse segmentation masks, by leveraging 3D geometry and pairwise similarities induced by DINOv2. Our approach achieves performance comparable to the state of the art on multiple downstream tasks while delivering significant speed-ups. Notably, we obtain competitive segmentation results using only generic DINOv2 features, despite DINOv2 not being trained on millions of annotated segmentation masks like SAM. When applied to CLIP features, our method demonstrates strong performance in open-vocabulary object localization tasks, highlighting the versatility of our approach.
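A hedged sketch of feature graph diffusion over a kNN graph with similarity-derived weights, loosely mirroring the refinement step described above; the graph construction, step count, and toy DINO-like features are assumptions, not the paper's design.

```python
"""Hedged sketch: a few steps of graph diffusion of per-point features."""
import torch

def knn_graph(xyz, k=8):
    d = torch.cdist(xyz, xyz)
    return d.topk(k + 1, largest=False).indices[:, 1:]      # drop self-neighbor

def diffuse(feat, xyz, sim_feat, steps=5, lam=0.5, k=8):
    nbrs = knn_graph(xyz, k)                                 # (N, k)
    w = torch.exp(-torch.cdist(sim_feat, sim_feat))          # similarity weights
    for _ in range(steps):
        nbr_feat = feat[nbrs]                                # (N, k, C)
        nbr_w = torch.gather(w, 1, nbrs)                     # (N, k)
        nbr_w = nbr_w / nbr_w.sum(dim=1, keepdim=True)
        feat = (1 - lam) * feat + lam * (nbr_w.unsqueeze(-1) * nbr_feat).sum(1)
    return feat

xyz = torch.rand(500, 3)                                     # Gaussian centers (toy)
coarse_mask = (torch.rand(500, 1) > 0.5).float()             # coarse segmentation
dino_like = torch.randn(500, 32)                             # similarity features (toy)
refined = diffuse(coarse_mask, xyz, dino_like)
print(refined.shape)
```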
Poster
Gwanghyun Kim · Xueting Li · Ye Yuan · Koki Nagano · Tianye Li · Jan Kautz · Se Young Chun · Umar Iqbal

[ Exhibit Hall I ]

Abstract
Estimating accurate and temporally consistent 3D human geometry from videos is a challenging problem in computer vision. Existing methods, primarily optimized for single images, often suffer from temporal inconsistencies and fail to capture fine-grained dynamic details. To address these limitations, we present GeoMan, a novel architecture designed to produce accurate and temporally consistent depth and normal estimations from monocular human videos. GeoMan addresses two key challenges: the scarcity of high-quality 4D training data and the need for metric depth estimation to accurately model human size. To overcome the first challenge, GeoMan employs an image-based model to estimate depth and normals for the first frame of a video, which then conditions a video diffusion model, reframing video geometry estimation task as an image-to-video generation problem. This design offloads the heavy lifting of geometric estimation to the image model and simplifies the video model’s role to focus on intricate details while using priors learned from large-scale video datasets. Consequently, GeoMan improves temporal consistency and generalizability while requiring minimal 4D training data. To address the challenge of accurate human size estimation, we introduce a root-relative depth representation that retains critical human-scale details and is easier to be estimated from monocular inputs, overcoming the …
Poster
Xiaopeng LIN · Yulong Huang · Hongwei Ren · Zunchang Liu · Hongxiang Huang · Yue Zhou · Haotian FU · Bojun Cheng

[ Exhibit Hall I ]

Abstract
Motion deblurring addresses the challenge of image blur caused by camera or scene movement. Event cameras provide motion information that is encoded in the asynchronous event streams. To efficiently leverage the temporal information of event streams, we employ Spiking Neural Networks (SNNs) for motion feature extraction and Artificial Neural Networks (ANNs) for color information processing. Due to the non-uniform distribution and inherent redundancy of event data, existing cross-modal feature fusion methods exhibit certain limitations. Inspired by the visual attention mechanism in the human visual system, this study introduces a bioinspired dual-drive hybrid network (BDHNet). Specifically, the Neuron Configurator Module (NCM) is designed to dynamically adjust neuron configurations based on cross-modal features, thereby focusing the spikes in blurry regions and adapting to varying blurry scenarios dynamically. Additionally, the Region of Blurry Attention Module (RBAM) is introduced to generate a blurry mask in an unsupervised manner, effectively extracting motion clues from the event features and guiding more accurate cross-modal feature fusion. Extensive subjective and objective evaluations demonstrate that our method outperforms current state-of-the-art methods on both synthetic and real-world datasets.
Poster
Zekun Qian · Ruize Han · Junhui Hou · Linqi Song · Wei Feng

[ Exhibit Hall I ]

Abstract
Open-vocabulary multi-object tracking (OVMOT) represents a critical new challenge involving the detection and tracking of diverse object categories in videos, encompassing both seen categories (base classes) and unseen categories (novel classes). This issue amalgamates the complexities of open-vocabulary object detection (OVD) and multi-object tracking (MOT). Existing approaches to OVMOT often merge OVD and MOT methodologies as separate modules, not fully leveraging the video information. In this work, we propose VOVTrack, a novel method that integrates object states relevant to MOT and video-centric training to address this challenge from a video analysis standpoint. First, we consider the tracking-related state of the objects during tracking and propose a new prompt-guided attention mechanism for more accurate detection (localization and classification) of time-varying objects. Subsequently, we leverage raw video data without annotations for training by formulating a self-supervised object similarity learning technique to facilitate temporal object tracking (association). Experimental results underscore that VOVTrack outperforms existing methods, establishing itself as a state-of-the-art solution for the open-vocabulary tracking task.
Poster
Mostofa Rafid Uddin · Jana Armouti · Min Xu

[ Exhibit Hall I ]

Abstract
Identifying different protein compositions and conformations from microscopic images of protein mixtures is a challenging open problem. We address this problem through disentangled representation learning, where separating protein compositions and conformations in an intermediate latent space enables accurate identification. Since conformations manifest as transformations that cause subtle changes in voxel space and compositions correspond to content invariant to these transformations, the task reduces to content-transformation disentangling. However, existing content-transformation disentanglement methods require an explicit parametric form for the transformation, which conformation transformations lack, making those methods unsuitable. To overcome this limitation, we propose DualContrast, a novel contrastive learning-based method that implicitly parameterizes both transformation and content and disentangles them. DualContrast achieves this by generating positive and negative pairs for content and transformation in both data and latent spaces. We demonstrate that existing contrastive approaches fail under similar implicit parameterization, underscoring the necessity of our method. We validate our claims through extensive experiments on 3D microscopic images of protein mixtures and additional shape-focused datasets beyond microscopy. Finally, we achieve the first completely unsupervised identification of different protein compositions and conformations in 3D microscopic images of protein mixtures.
Poster
Liying Yang · Chen Liu · Zhenwei Zhu · Ajian Liu · Hui Ma · Jian Nong · Yanyan Liang

[ Exhibit Hall I ]

Abstract
Recently, the generation of dynamic 3D objects from a video has shown impressive results. Existing methods directly optimize Gaussians using whole information in frames. However, when dynamic regions are interwoven with static regions within frames, particularly if the static regions account for a large proportion, existing methods often overlook information in dynamic regions and are prone to overfitting on static regions. This leads to results with blurry textures. We consider that decoupling dynamic-static features to enhance dynamic representations can alleviate this issue. Thus, we propose a dynamic-static feature decoupling module (DSFD). Along temporal axes, it regards the regions of current frame features that possess significant differences relative to reference frame features as dynamic features. Conversely, the remaining parts are the static features. Then, we acquire decoupled features driven by dynamic features and current frame features. Moreover, to further enhance the dynamic representation of decoupled features from different viewpoints and ensure accurate motion prediction, we design a temporal-spatial similarity fusion module (TSSF). Along spatial axes, it adaptively selects similar information of dynamic regions. Building on the above, we construct a novel approach, DS4D. Experimental results verify our method achieves state-of-the-art (SOTA) results in video-to-4D. In addition, the experiments on a …
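A hedged sketch of the basic dynamic/static split described above, thresholding per-location differences between current and reference frame features; the threshold and feature shapes are illustrative, and this is not the DSFD module.

```python
"""Hedged sketch: threshold-based dynamic/static feature decoupling."""
import torch

def decouple(curr_feat, ref_feat, tau=0.5):
    """curr_feat, ref_feat: (C, H, W) feature maps from two frames."""
    diff = (curr_feat - ref_feat).abs().mean(dim=0, keepdim=True)   # (1, H, W)
    dyn_mask = (diff > tau).float()
    dynamic = curr_feat * dyn_mask
    static = curr_feat * (1.0 - dyn_mask)
    return dynamic, static, dyn_mask

curr = torch.randn(64, 32, 32)
ref = curr.clone()
ref[:, 8:16, 8:16] += 2.0                 # simulate a moving region
dyn, sta, mask = decouple(curr, ref)
print(int(mask.sum().item()), "dynamic locations")
```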
Poster
Shijie Li · Chunyu Liu · Xun Xu · Si Yong Yeo · Xulei Yang

[ Exhibit Hall I ]

Abstract
Motion forecasting is a crucial component of autonomous driving systems, enabling the generation of accurate and smooth future trajectories to ensure safe navigation to the destination. In previous methods, potential future trajectories are often absent in the scene encoding stage, which may lead to suboptimal outcomes. Additionally, prior approaches typically employ transformer architectures for spatiotemporal modeling of trajectories and map information, which suffer from the quadratic scaling complexity of the transformer architecture. In this work, we propose an interaction-based method, named Future-Aware Interaction Network, that introduces potential future trajectories into scene encoding for a comprehensive traffic representation. Furthermore, a State Space Model (SSM), specifically Mamba, is introduced for both spatial and temporal modeling. To adapt Mamba for spatial interaction modeling, we propose an adaptive reordering strategy that transforms unordered data into a structured sequence. Additionally, Mamba is employed to refine generated future trajectories temporally, ensuring more consistent predictions. These enhancements not only improve model efficiency but also enhance the accuracy and diversity of predictions.We conduct comprehensive experiments on the widely used Argoverse 1 and Argoverse 2 datasets, demonstrating that the proposed method achieves superior performance compared to previous approaches in a more efficient way. The code will be released according …
Poster
Jinxiu Liang · Bohan Yu · Siqi Yang · Haotian Zhuang · Jieji Ren · Peiqi Duan · Boxin Shi

[ Exhibit Hall I ]

Abstract
We present EventUPS, the first uncalibrated photometric stereo method using event cameras—neuromorphic sensors that asynchronously detect brightness changes with microsecond resolution. Frame-based uncalibrated photometric stereo methods impose high bandwidth demands, limiting their applicability in dynamic scenes. They require dense image correspondence under varying illumination and cannot be directly applied to event data due to its fundamentally different sensing paradigm. Our approach introduces three key innovations: i) an augmented null space formulation that directly relates each event to constraints on surface normals and lighting, naturally handling ambient illumination; ii) a continuous parameterization of time-varying illumination that bridges asynchronous events to synchronized lighting estimation; iii) a structured lighting approach with known relative geometry that reduces the ambiguity to merely a convex-concave uncertainty. We validate EventUPS using a custom-built LED-based lighting system implementing dual-ring and trefoil curve patterns. Extensive experiments on synthetic, semi-real, and real data demonstrate that our method achieves accuracy surpassing its frame-based counterpart while requiring only 5\% of the data bandwidth.
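As background only, a hedged sketch of classic calibrated Lambertian photometric stereo solved by least squares; the event-based, uncalibrated null-space formulation of EventUPS is not reproduced, and the lights, albedo, and normals are synthetic and idealized (no shadows or noise).

```python
"""Hedged sketch: calibrated Lambertian photometric stereo via least squares."""
import numpy as np

rng = np.random.default_rng(0)
n_pix, n_lights = 1000, 6

# Ground-truth surface normals and albedo (idealized setting).
normals = rng.normal(size=(n_pix, 3))
normals /= np.linalg.norm(normals, axis=1, keepdims=True)
albedo = rng.uniform(0.3, 1.0, size=(n_pix, 1))

# Known light directions (the calibrated case; EventUPS is uncalibrated).
L = rng.normal(size=(n_lights, 3))
L /= np.linalg.norm(L, axis=1, keepdims=True)

I = albedo * (normals @ L.T)                     # Lambertian image intensities

# Least squares per pixel: solve L @ g = I^T, where g = albedo * normal.
G, *_ = np.linalg.lstsq(L, I.T, rcond=None)      # shape (3, n_pix)
est = (G / np.linalg.norm(G, axis=0, keepdims=True)).T
err = np.degrees(np.arccos(np.clip((est * normals).sum(1), -1.0, 1.0)))
print("mean angular error (deg):", err.mean())
```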
Poster
Hsuan-I Ho · Chen Guo · Po-Chen Wu · Ivan Shugurov · Chengcheng Tang · Abhay Mittal · Sizhe An · Manuel Kaufmann · Linguang Zhang

[ Exhibit Hall I ]

Abstract
We introduce PHD, a novel approach for 3D human pose and shape estimation that leverages user identity information from videos to improve pose estimation accuracy and shape consistency. Unlike traditional methods designed to be user-agnostic and optimized for generalization, our pipeline precomputes the body shape and then employs a personalized pose fitting process conditioned on the body shape and input image. We observe that while existing methods commonly improve 2D alignment by refining the pose with constraints derived from the 2D image, the lack of 3D pose prior often reduces pose plausibility, thereby compromising 3D accuracy. To address this, we integrate a body shape-conditioned 3D pose prior, implemented as a Point Diffusion model, to iteratively guide pose fitting via a Point Distillation loss. Our results demonstrate that our 3D pose prior significantly prevents artifacts introduced by 2D-only constraints, which consequently improves the pose accuracy. In addition, our 3D prior-driven fitting method is highly versatile and can be seamlessly combined with state-of-the-art 3D pose estimators to improve pose accuracy.
Poster
Chengxu Liu · Lu Qi · Jinshan Pan · Xueming Qian · Ming-Hsuan Yang

[ Exhibit Hall I ]

Abstract
Unpaired image dehazing has attracted increasing attention due to its flexible data requirements during model training. Dominant methods based on contrastive learning not only introduce haze-unrelated content information, but also ignore haze-specific properties in the frequency domain (i.e., haze-related degradation is mainly manifested in the amplitude spectrum). To address these issues, we propose a novel frequency domain-based diffusion model, named FrDiff, for fully exploiting the beneficial knowledge in unpaired clear data. In particular, inspired by the strong generative ability shown by Diffusion Models (DMs), we tackle the dehazing task from the perspective of frequency domain reconstruction and employ DMs to yield an amplitude spectrum consistent with the distribution of clear images. To implement it, we propose an Amplitude Residual Encoder (ARE) to extract the amplitude residuals, which effectively compensate for the amplitude gap from the hazy to clear domains, as well as provide supervision for the DMs training. In addition, we propose a Phase Correction Module (PCM) to eliminate artifacts by further refining the phase spectrum during dehazing with a simple attention mechanism. Experimental results demonstrate that our FrDiff outperforms other state-of-the-art methods on both synthetic and real-world datasets.
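A hedged sketch of the amplitude/phase observation the abstract relies on: decomposing images in the Fourier domain and recombining one image's phase with another's amplitude spectrum. This is a standard illustration, not FrDiff's diffusion-based reconstruction; the images here are random placeholders.

```python
"""Hedged sketch: amplitude/phase decomposition and recombination."""
import torch

def amp_phase(img):
    spec = torch.fft.fft2(img)
    return spec.abs(), spec.angle()

def recombine(amp, phase):
    return torch.fft.ifft2(torch.polar(amp, phase)).real

hazy = torch.rand(3, 128, 128)
clear = torch.rand(3, 128, 128)

amp_h, pha_h = amp_phase(hazy)
amp_c, _ = amp_phase(clear)

# Keep the hazy image's phase (structure) but borrow a clear amplitude spectrum.
dehazed_like = recombine(amp_c, pha_h)
print(dehazed_like.shape)
```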
Poster
Zhu Yu · Bowen Pang · Lizhe Liu · Runmin Zhang · Qiang Li · Si-Yuan Cao · Maochun Luo · Mingxia Chen · Sheng Yang · Hui-liang Shen

[ Exhibit Hall I ]

Abstract
This work presents LOcc, an effective and generalizable framework for open-vocabulary occupancy (OVO) prediction. Previous approaches typically supervise the networks through coarse voxel-to-text correspondences via image features as intermediates or noisy and sparse correspondences from voxel-based model-view projections. To alleviate the inaccurate supervision, we propose a semantic transitive labeling pipeline to generate dense and fine-grained 3D language occupancy ground truth. Our pipeline presents a feasible way to dig into the valuable semantic information of images, transferring text labels from images to LiDAR point clouds and ultimately to voxels, to establish precise voxel-to-text correspondences. By replacing the original prediction head of supervised occupancy models with a geometry head for binary occupancy states and a language head for language features, LOcc effectively uses the generated language ground truth to guide the learning of 3D language volume. Through extensive experiments, we demonstrate that our transitive semantic labeling pipeline can produce more accurate pseudo-labeled ground truth, diminishing labor-intensive human annotations. Additionally, we validate LOcc across various architectures, where all models consistently outperform state-of-the-art zero-shot occupancy prediction approaches on the Occ3D-nuScenes dataset. The code for the proposed method is available.
Poster
Ruixuan Cong · Yu Wang · Mingyuan Zhao · Da Yang · Rongshan Chen · Hao Sheng

[ Exhibit Hall I ]

Abstract
Deep learning-based light field image super-resolution methods have witnessed remarkable success in recent years. However, most of them only focus on the encoder design and overlook the importance of the upsampling process in the decoder part. Inspired by the recent progress in the single image domain with implicit neural representation, we elaborately propose the spatial-epipolar implicit image function (SEIIF), which optimizes the upsampling process to significantly improve performance and supports arbitrary-scale light field image super-resolution. Specifically, SEIIF contains two complementary upsampling patterns. One is the spatial implicit image function (SIIF), which exploits intra-view information in sub-aperture images. The other is the epipolar implicit image function (EIIF), which mines inter-view information in epipolar plane images. By unifying the upsampling step of the two branches, SEIIF additionally introduces cross-branch feature interaction to fully fuse intra-view information and inter-view information. Besides, given that the line structure in epipolar plane images integrates the spatial-angular correlation of light fields, we present an oriented line sampling strategy to exactly aggregate inter-view information. The experimental results demonstrate that our SEIIF can be effectively combined with most encoders and achieves outstanding performance on both fixed-scale and arbitrary-scale light field image super-resolution. Our code will be available upon acceptance.
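A hedged sketch of a local implicit image function query (LIIF-style), the generic mechanism behind arbitrary-scale upsampling that the abstract extends with spatial and epipolar branches; the encoder features and MLP are toy placeholders, not SEIIF.

```python
"""Hedged sketch: coordinate-based implicit decoding of a feature map."""
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImplicitDecoder(nn.Module):
    def __init__(self, feat_dim=64, hidden=128, out_dim=3):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 2, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim))

    def forward(self, feat_map, coords):
        """feat_map: (1, C, H, W); coords: (N, 2) in [-1, 1]."""
        grid = coords.view(1, -1, 1, 2)
        sampled = F.grid_sample(feat_map, grid, align_corners=False)
        sampled = sampled.squeeze(-1).squeeze(0).t()          # (N, C)
        return self.mlp(torch.cat([sampled, coords], dim=-1))

feat = torch.randn(1, 64, 32, 32)                # encoder output (toy)
query = torch.rand(4096, 2) * 2 - 1              # arbitrary-scale query coordinates
rgb = ImplicitDecoder()(feat, query)
print(rgb.shape)                                  # torch.Size([4096, 3])
```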
Poster
Shizun Wang · Zhenxiang Jiang · Xingyi Yang · Xinchao Wang

[ Exhibit Hall I ]

Abstract
Recovering 4D from monocular video, which jointly estimates dynamic geometry and camera poses, is an inevitably challenging problem. While recent pointmap-based 3D reconstruction methods (e.g., DUSt3R) have made great progress in reconstructing static scenes, directly applying them to dynamic scenes leads to inaccurate results. This discrepancy arises because moving objects violate multi-view geometric constraints, disrupting the reconstruction. To address this, we introduce **C4D**, a framework that leverages temporal **C**orrespondences to extend existing 3D reconstruction formulation to **4D**. Specifically, apart from predicting pointmaps, C4D captures two types of *correspondences*: *short-term* optical flow and *long-term* point tracking. We train a dynamic-aware point tracker that provides additional mobility information, facilitating the estimation of motion masks to separate moving elements from the static background, thus offering more reliable guidance for dynamic scenes. Furthermore, we introduce a set of dynamic scene optimization objectives to recover per-frame 3D geometry and camera parameters. Simultaneously, the correspondences lift 2D trajectories into smooth 3D trajectories, enabling fully integrated 4D reconstruction. Experiments show that our framework achieves complete 4D recovery and demonstrates strong performance across multiple downstream tasks, including depth estimation, camera pose estimation, and point tracking.
Poster
Yuan Liang · Yang Zhou · Ziming Sun · Tianyi Xiang · Guiqing Li · Shengfeng He

[ Exhibit Hall I ]

Abstract
Depth estimation in dynamic, multi-object scenes remains a major challenge, especially under severe occlusions. Existing monocular models, including foundation models, struggle with instance-wise depth consistency due to their reliance on global regression. We tackle this problem from two key aspects: data and methodology. First, we introduce the Group Instance Depth (GID) dataset, the first large-scale video depth dataset with instance-level annotations, featuring 101,500 frames from real-world activity scenes. GID bridges the gap between synthetic and real-world depth data by providing high-fidelity depth supervision for multi-object interactions. Second, we propose InstanceDepth, the first occlusion-aware depth estimation framework for multi-object environments. Our two-stage pipeline consists of (1) Holistic Depth Initialization, which assigns a coarse scene-level depth structure, and (2) Instance-Aware Depth Rectification, which refines instance-wise depth using object masks, shape priors, and spatial relationships. By enforcing geometric consistency across occlusions, our method sets a new state-of-the-art on the GID dataset and multiple benchmarks.
Poster
Ao Li · Jinpeng Liu · Yixuan Zhu · Yansong Tang

[ Exhibit Hall I ]

Abstract
Joint reconstruction of human-object interaction marks a significant milestone in comprehending the intricate interrelations between humans and their surrounding environment. Nevertheless, previous optimization methods often struggle to achieve physically plausible reconstruction results due to the lack of prior knowledge about human-object interactions. In this paper, we introduce ScoreHOI, an effective diffusion-based optimizer that introduces diffusion priors for the precise recovery of human-object interactions. By harnessing the controllability of score-guided sampling, the diffusion model can reconstruct a conditional distribution of human and object pose given the image observation and object features. During inference, ScoreHOI effectively improves the reconstruction results by guiding the denoising process with specific physical constraints. Furthermore, we propose a contact-driven iterative refinement approach to enhance contact plausibility and improve reconstruction accuracy. Extensive evaluations on standard benchmarks demonstrate ScoreHOI’s superior performance over state-of-the-art methods, highlighting its ability to achieve a precise and robust improvement in joint human-object interaction reconstruction.
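A hedged sketch of score-guided denoising with a physics term, in the spirit of the constraint-guided sampling described above. The denoiser, schedule, and the `contact_penalty` constraint below are placeholders, not the paper's models.

```python
import torch

def guided_denoise_step(x_t, t, denoiser, constraint, alpha=0.9, guide_scale=0.1):
    """One reverse step: blend toward the denoiser's proposal, nudged away from
    constraint violations via the constraint's gradient."""
    with torch.no_grad():
        x_prop = denoiser(x_t, t)                # model's denoised proposal
    x_prop = x_prop.detach().requires_grad_(True)
    penalty = constraint(x_prop)                 # scalar physical-plausibility cost
    grad = torch.autograd.grad(penalty, x_prop)[0]
    return alpha * x_prop.detach() + (1 - alpha) * x_t - guide_scale * grad

def contact_penalty(pose_params):
    """Toy stand-in: penalize pose parameters drifting far from zero."""
    return (pose_params ** 2).mean()

x = torch.randn(1, 72)                           # e.g. an SMPL-like pose vector
x_next = guided_denoise_step(x, t=10, denoiser=lambda x, t: 0.5 * x,
                             constraint=contact_penalty)
```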
Poster
Liang Qin · Min Wang · Peiwei Li · Wengang Zhou · Houqiang Li

[ Exhibit Hall I ]

Abstract
Object Goal Navigation (ObjectNav) in unknown environments presents significant challenges, particularly in Open-Vocabulary Mobile Manipulation (OVMM), where robots must efficiently explore large spaces, locate small objects, and accurately position themselves for subsequent manipulation. Existing approaches struggle to meet these demands: rule-based methods offer structured exploration but lack adaptability, while reinforcement learning (RL)-based methods enhance adaptability but fail to ensure effective long-term navigation. Moreover, both approaches often overlook precise stopping positions, which are critical for successful manipulation. To address these challenges, we propose APRR (Active Perception Meets Rule-Guided RL), a two-phase framework that designs a new rule-guided RL policy for the exploration phase and a novel active target perception policy for the last-mile navigation phase. Inspired by human search behavior, our rule-guided RL policy enables efficient and adaptive exploration by combining structured heuristics with learning-based decision-making. In the last-mile navigation phase, we introduce an RL-based policy enhanced with active target perception, allowing the robot to refine its position dynamically based on real-time detection feedback. Experimental results demonstrate that APRR improves the success rate by 13%, significantly outperforming existing methods. Furthermore, real-world experiments validate the practicality and effectiveness of APRR in real-world mobile manipulation scenarios, offering a robust and adaptable solution for precise …
Poster
Jan Skvrna · Lukas Neumann

[ Exhibit Hall I ]

Abstract
Inferring object 3D position and orientation from a single RGB camera is a foundational task in computer vision with many important applications. Traditionally, 3D object detection methods are trained in a fully-supervised setup, requiring LiDAR and vast amounts of human annotations, which are laborious, costly, and do not scale well with the ever-increasing amounts of data being captured. We present a novel method to train a 3D object detector from a single RGB camera without domain-specific human annotations, making orders of magnitude more data available for training. The method uses a newly proposed Local Object Motion Model to disentangle the source of object movement between subsequent frames, is approximately 700 times faster than previous work, and compensates for camera focal length differences to aggregate multiple datasets. The method is evaluated on three public datasets, where despite using no human labels, it outperforms prior work by a significant margin. It also shows its versatility as a pre-training tool for fully-supervised training and shows that combining pseudo-labels from multiple datasets can achieve comparable accuracy to using human labels from a single dataset.
Poster
Haipeng Li · Tianhao Zhou · Zhanglei Yang · WuYi WuYi · Chen Yan · Zijing Mao · Shen Cheng · Bing Zeng · Shuaicheng Liu

[ Exhibit Hall I ]

Abstract
Estimating 2D camera motion is a fundamental task in computer vision, representing the non-linear projection of 3D rotation and translation onto a 2D plane. Current methods primarily rely on homography-based approaches, which model perspective transformations for planar scenes, or meshflow-based techniques, which utilize grid-based local homographies to accommodate non-linear motion. However, homography is restricted to dominant planes and meshflow’s nonlinear capacity remains limited. To address these challenges, we introduce **CamFlow**, a novel representation that captures non-linear 2D camera motion through the use of hybrid motion bases: 1) physical bases to model essential motion patterns and 2) noisy motion bases to enhance flexibility. In addition, we propose a hybrid probabilistic loss function, leveraging a Laplace distribution to improve robustness and facilitate efficient training. We also design a test-time adaptation strategy to refine motion estimates for video stabilization in unseen video contexts. To evaluate the camera motion, we propose a new benchmark by masking dynamic objects in existing optical flow datasets. Extensive experiments, including zero-shot evaluations across diverse conditions, demonstrate that CamFlow outperforms state-of-the-art homography and meshflow methods in terms of robustness and generalization. Code and dataset will be released upon publication.
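A sketch of the two ingredients named above, under assumed shapes and names (not the CamFlow release): a flow field composed from a small set of motion bases, and a Laplace negative log-likelihood used as a robust training loss.

```python
import torch

def compose_flow(bases: torch.Tensor, coeffs: torch.Tensor) -> torch.Tensor:
    """bases: K x H x W x 2 basis flow fields, coeffs: K weights -> H x W x 2."""
    return torch.einsum("k,khwc->hwc", coeffs, bases)

def laplace_nll(pred: torch.Tensor, target: torch.Tensor, log_b: torch.Tensor):
    """Negative log-likelihood of a Laplace(pred, b) observation model:
    |x - mu| / b + log(2b)."""
    b = log_b.exp()
    return ((pred - target).abs() / b + log_b + torch.log(torch.tensor(2.0))).mean()

K, H, W = 8, 32, 32
bases = torch.randn(K, H, W, 2)   # e.g. translation, rotation, zoom, noise bases
coeffs = torch.randn(K)
flow = compose_flow(bases, coeffs)
loss = laplace_nll(flow, torch.randn(H, W, 2), log_b=torch.zeros(H, W, 2))
```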
Poster
Risa Shinoda · Nakamasa Inoue · Hirokatsu Kataoka · Masaki Onishi · Yoshitaka Ushiku

[ Exhibit Hall I ]

Abstract
Precise automated understanding of agricultural tasks such as disease identification is essential for sustainable crop production. Recent advances in vision-language models (VLMs) are expected to further expand the range of agricultural tasks by facilitating human-model interaction through easy, text-based communication. Here, we introduce AgroBench (Agronomist AI Benchmark), a benchmark for evaluating VLMs across seven agricultural topics, covering key areas in agricultural engineering and relevant to real-world farming. Unlike recent agricultural VLM benchmarks, AgroBench is annotated by expert agronomists. AgroBench covers a comprehensive range of categories, including 197 crop categories and 682 disease categories, to thoroughly evaluate VLM capabilities. In our evaluation on AgroBench, we reveal that VLMs have room for improvement in fine-grained identification tasks. Notably, in weed identification, most open-source VLMs perform close to random. With our wide range of topics and expert-annotated categories, we analyze the types of errors made by VLMs and suggest potential pathways for future VLM development. Our dataset and code will be available.
Poster
Karhan Kayan · Stamatis Alexandropoulos · Rishabh Jain · Yiming Zuo · Erich Liang · Jia Deng

[ Exhibit Hall I ]

Abstract
We introduce PosedVideo365, a diverse dataset of 365 videos with accurate camera pose. Our dataset bridges the gap between accuracy and data diversity in current SLAM benchmarks by introducing a novel ground truth collection framework that leverages calibration boards and a $360^{\circ}$ camera. We collect indoor, outdoor, and object scanning videos with synchronized monocular and stereo RGB video outputs as well as IMU. We further propose a new scene scale-aware evaluation metric for SLAM based on the optical flow induced by the camera pose estimation error. In contrast to existing metrics such as Average Trajectory Error (ATE), our new metric allows the performance of SLAM methods to be compared across scenes, allowing researchers to analyze the failure modes of their methods. We also propose a challenging Novel View Synthesis benchmark that covers cases not covered by current NVS benchmarks, such as fully non-Lambertian scenes with $360^{\circ}$ camera trajectories.
Poster
Heng Jia · Na Zhao · Linchao Zhu

[ Exhibit Hall I ]

Abstract
Despite recent advances in feed-forward 3DGS methods, generalizable 3D reconstruction remains challenging, particularly in multi-view correspondence modeling. We present a hybrid framework for multi-view correspondence modeling, which integrates volumetric latent fusion with Transformer-based feature aggregation. Our framework consists of two complementary components: a latent volume that encodes view-invariant correspondences through epipolar geometry, and a camera-aware Transformer conditioned on Plücker coordinates. By combining explicit and implicit feature aggregation mechanisms, our approach enhances generalization while demonstrating accelerated convergence, requiring only half the training steps to achieve results comparable to state-of-the-art methods. Additionally, through comprehensive evaluation, we show that Visual Foundation Models trained with pixel-aligned supervision are more suitable for 3D reconstruction tasks. Our approach supports variable input views, improving reconstruction quality as view count increases while demonstrating robust cross-dataset generalization. Extensive experiments show that our method achieves state-of-the-art performance across multiple benchmarks, with PSNR improvements of 0.59 dB, 1.06 dB, and 0.22 dB on the RealEstate10K, ACID, and DTU datasets, respectively. Code will be released.
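A minimal sketch of the Plücker-ray conditioning mentioned above: each pixel's camera ray is encoded as (direction, origin × direction), one common convention. The pinhole setup and names are assumptions for illustration, not the paper's exact code.

```python
import numpy as np

def plucker_rays(K: np.ndarray, R: np.ndarray, t: np.ndarray, H: int, W: int):
    """Return an H x W x 6 Plücker embedding for a camera with intrinsics K
    and world-to-camera extrinsics [R | t]."""
    cam_center = -R.T @ t                               # camera origin in world frame
    u, v = np.meshgrid(np.arange(W) + 0.5, np.arange(H) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)    # H x W x 3 homogeneous pixels
    dirs = pix @ np.linalg.inv(K).T @ R                 # back-project, rotate to world
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)
    moment = np.cross(np.broadcast_to(cam_center, dirs.shape), dirs)
    return np.concatenate([dirs, moment], axis=-1)

rays = plucker_rays(np.diag([500.0, 500.0, 1.0]), np.eye(3), np.zeros(3), 4, 4)
```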
Poster
Bo Wang · Huiyuan Fu · Zhiye Huang · Siru Zhang · Xin Wang · Huadong Ma

[ Exhibit Hall I ]

Abstract
Exposure correction aims to restore over/under-exposed images to well-exposed ones using a single network. However, existing methods mainly handle non-extreme exposure conditions and struggle with the severe luminance and texture loss caused by extreme exposure. Through a thorough investigation, we find that the lack of high-quality benchmark datasets significantly limits progress in extreme exposure correction. To address this issue, we introduce the first Real-world Extreme Exposure Dataset, REED. By leveraging the burst shooting mode of cameras, we capture image sequences covering a luminance range from extremely dark to extremely bright. To prevent misalignment caused by camera motion and scene changes, we apply cropping and an improved SIFT algorithm to ensure precise alignment. We also propose a novel Context-Guided Luminance-Normalized Iterative Exposure Refinement Network. We employ Contrastive Loss and Luminance Normalizer to disentangle the coupled distribution of over/under-exposed images. In certain cases, luminance alone is insufficient for determining over/under-exposure, so we integrate semantic guidance into the Semantic-aware Exposure Diffusion Model to further enhance luminance and texture restoration. Inspired by the effectiveness of iterative correction in improving color and texture, we introduce the CLIP-Guided Iterative Refinement Strategy. Extensive experiments validate the superiority of our dataset and method. Our dataset and code will be publicly …
Poster
Hai Jiang · Binhao Guan · Zhen Liu · Xiaohong Liu · Jian Yu · Zheng Liu · Songchen Han · Shuaicheng Liu

[ Exhibit Hall I ]

Abstract
Learning-based methods have made promising advances in low-light RAW image enhancement, but their capability in extremely dark scenes, where the environmental illuminance drops as low as 0.0001 lux, remains unexplored due to the lack of corresponding datasets. To this end, we propose a paired-to-paired data synthesis pipeline capable of generating well-calibrated extremely low-light RAW images at three precise illuminance ranges of 0.01-0.1 lux, 0.001-0.01 lux, and 0.0001-0.001 lux, together with high-quality sRGB references, forming a large-scale paired dataset named See-in-the-Extremely-Dark (SIED) to benchmark low-light RAW image enhancement approaches. Furthermore, we propose a diffusion-based framework that leverages the generative ability and intrinsic denoising property of diffusion models to restore visually pleasing results from extremely low-SNR RAW inputs, in which an Adaptive Illumination Correction Module (AICM) and a color consistency loss are introduced to ensure accurate exposure correction and color restoration. Extensive experiments on the proposed SIED and publicly available benchmarks demonstrate the effectiveness of our method. The code and dataset will be released to facilitate future research.
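The color consistency loss is not specified in detail above; the sketch below shows one plausible form (an assumption, not the paper's definition): penalizing the angular difference between predicted and reference RGB vectors so that chromaticity is preserved independently of brightness.

```python
import torch
import torch.nn.functional as F

def color_consistency_loss(pred: torch.Tensor, ref: torch.Tensor, eps: float = 1e-6):
    """pred, ref: B x 3 x H x W sRGB images in [0, 1]."""
    pred_flat = pred.permute(0, 2, 3, 1).reshape(-1, 3)
    ref_flat = ref.permute(0, 2, 3, 1).reshape(-1, 3)
    # 1 - cosine similarity of per-pixel RGB vectors (small eps avoids zero vectors).
    cos = F.cosine_similarity(pred_flat + eps, ref_flat + eps, dim=-1)
    return (1.0 - cos).mean()

loss = color_consistency_loss(torch.rand(2, 3, 16, 16), torch.rand(2, 3, 16, 16))
```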
Poster
Zhi Hou · Tianyi Zhang · Yuwen Xiong · Haonan Duan · Hengjun Pu · Ronglei Tong · Chengyang Zhao · Xizhou Zhu · Yu Qiao · Jifeng Dai · Yuntao Chen

[ Exhibit Hall I ]

Abstract
While recent vision-language-action models trained on diverse robot datasets exhibit promising generalization capabilities with limited in-domain data, their reliance on compact action heads to predict discretized or continuous actions constrains adaptability to heterogeneous action spaces. We present Dita, a scalable framework that leverages Transformer architectures to directly denoise continuous action sequences through a unified multimodal diffusion process. Departing from prior methods that condition denoising on fused embeddings via shallow networks, Dita employs in-context conditioning—enabling fine-grained alignment between denoised actions and raw visual tokens from historical observations. This design explicitly models action deltas and environmental nuances. By capitalizing on the Transformer's scalability, Dita effectively unifies cross-embodiment datasets spanning varying camera perspectives, tasks, and action spaces. Evaluations across extensive benchmarks demonstrate state-of-the-art or comparable performance in simulation. Notably, Dita achieves robust real-world adaptation to environmental variances and complex long-horizon tasks through 10-shot finetuning, using only third-person camera inputs. The architecture establishes a versatile, lightweight and open-source baseline for generalist robot policy learning. The code and website are included in the supplementary materials.
Poster
Fengrui Tian · Tianjiao Ding · Jinqi Luo · Hancheng Min · Rene Vidal

[ Exhibit Hall I ]

Abstract
This paper studies the problem of generating an unbounded dynamic scene from a single view, which has wide applications in augmented/virtual reality and robotics. Since the scene is changing over time, different generated views need to be consistent with the underlying 3D motions. While previous works learn such consistency by training from multiple views, the generated scene regions are bounded to be close to the training views with limited camera movements. To address this issue, we propose DynamicVoyager that reformulates the dynamic scene generation as a scene outpainting process for new dynamic content. As 2D outpainting models can hardly generate 3D consistent motions from only 2D pixels at a single view, we consider pixels as rays to enrich the pixel input with the ray context, so that the 3D motion consistency can be learned from the ray information. More specifically, we first map the single-view video input to a dynamic point cloud with the estimated video depths. Then we render the partial video at a novel view and outpaint the video with ray contexts from the point cloud to generate 3D consistent motions. We employ the outpainted video to update the point cloud, which is used for scene outpainting from …
Poster
Haotian Wang · Aoran Xiao · Xiaoqin Zhang · Meng Yang · Shijian Lu

[ Exhibit Hall I ]

Abstract
Generalizable depth completion enables the acquisition of dense metric depth maps for unseen environments, offering robust perception capabilities for various downstream tasks. However, training such models typically requires large-scale datasets with metric depth labels, which are often labor-intensive to collect. This paper presents PacGDC, a label-efficient technique that enhances geometry diversity with minimal annotation effort for generalizable depth completion. PacGDC builds on insights into inherent 2D-to-3D projection ambiguities and consistencies in object shapes and positions, allowing the synthesis of numerous pseudo geometries for the same visual scene. This process greatly broadens data coverage by manipulating scene scales of the corresponding depth maps. To leverage this property, we propose a novel data synthesis pipeline built upon multiple depth foundation models. These models robustly provide pseudo depth labels with varied scene scales in both local objects and global layouts, while ensuring projection consistency that contributes to generalization. To further diversify geometries, we introduce interpolation and relocation strategies, as well as unlabeled images, extending the coverage beyond the individual use of foundation models. Extensive experiments show that PacGDC achieves remarkable generalizability across multiple benchmarks, excelling in diverse scene semantics/scales and depth sparsity/patterns under both zero-shot and few-shot settings.
Poster
Mingze Sun · Shiwei Mao · Keyi Chen · Yurun Chen · Shunlin Lu · Jingbo Wang · Junting Dong · Ruqi Huang

[ Exhibit Hall I ]

Abstract
Recent advancements in large-scale generative models have significantly improved the quality and diversity of 3D shape generation. However, most existing methods focus primarily on generating static 3D models, overlooking the potential dynamic nature of certain shapes, such as humanoids, animals, and insects. To address this gap, we focus on rigging, a fundamental task in animation that establishes skeletal structures and skinning for 3D models. In this paper, we introduce OmniRig, the first large-scale rigging dataset, comprising 79,499 meshes with detailed skeleton and skinning information. Unlike traditional benchmarks that rely on predefined standard poses (e.g., A-pose, T-pose), our dataset embraces diverse shape categories, styles, and poses. Leveraging this rich dataset, we propose ARMO, a novel rigging framework that utilizes an autoregressive model to predict both joint positions and connectivity relationships in a unified manner. By treating the skeletal structure as a complete graph and discretizing it into tokens, we encode the joints using an auto-encoder to obtain a latent embedding and use an autoregressive model to predict the tokens. A mesh-conditioned latent diffusion model is used to predict the latent embedding for conditional skeleton generation. Our method addresses the limitations of regression-based approaches, which often suffer from error accumulation and suboptimal connectivity estimation. …
Poster
Jiahao Ma · Tianyu Wang · Miaomiao Liu · David Ahmedt Aristizabal · Chuong Nguyen

[ Exhibit Hall I ]

Abstract
Multiview pedestrian detection typically involves two stages: human modeling and pedestrian localization. Human modeling represents pedestrians in 3D space by fusing multiview information, making its quality crucial for detection accuracy. However, existing methods often introduce noise and have low precision. While some approaches reduce noise by fitting on costly multiview 3D annotations, they often struggle to generalize across diverse scenes. To eliminate reliance on human-labeled annotations and accurately model humans, we propose Depth-Consistency Human Modeling (DCHM), a framework designed for consistent depth estimation and multiview fusion in global coordinates. Specifically, our proposed pipeline iteratively achieves multiview depth consistency in sparse-view, large-scale, and crowded scenarios, producing precise point clouds for pedestrian localization. Extensive experiments demonstrate that our method significantly reduces noise during human modeling, outperforming previous state-of-the-art baselines. Additionally, to the best of our knowledge, we are the first to reconstruct pedestrians and perform multiview segmentation in such a challenging setting.
Poster
Weihao Wang · Yu Lan · Mingyu You · Bin He

[ Exhibit Hall I ]

Abstract
3D assembly completion represents a fundamental task in 3D computer vision and robotics. This task aims to retrieve the missing parts from a set of candidates and predict their 6-DoF poses to make the partial assembly complete. However, due to the inherent uncertainty in completion and the similarity among candidates, even humans struggle to achieve precise completion without external guidance. To address this challenge, we introduce an auxiliary image depicting the complete assembly from a specific view. The primary challenge lies in the lack of correspondence or grounding between the partial assembly and the image, leading to ambiguities in identifying missing parts and ineffective guidance for completion. Moreover, this correspondence heavily depends on the view of image, which, unfortunately, is often unknown in real-world scenarios. To this end, we propose a novel cross-modal 3D assembly completion framework. At its core is missing-oriented feature fusion augmented by self-supervised view alignment to establish view-consistent 2D-3D correspondence between the image and the partial assembly, which effectively captures clues of missing parts from the image and provides targeted guidance for completion. Extensive experiments demonstrate our state-of-the-art performance on the PartNet dataset and show its generalization capabilities in two downstream applications: component suggestion and furniture …
Poster
Ye Tao · Jiawei Zhang · Yahao Shi · Dongqing Zou · Bin Zhou

[ Exhibit Hall I ]

Abstract
Image-based 3D generation has vast applications in robotics and gaming, where high-quality, diverse outputs and consistent 3D representations are crucial. However, existing methods have limitations: 3D diffusion models are limited by dataset scarcity and the absence of strong pre-trained priors, while 2D diffusion-based approaches struggle with geometric consistency. We propose a method that leverages 2D diffusion models' implicit 3D reasoning ability while ensuring 3D consistency via Gaussian-splatting-based geometric distillation. Specifically, the proposed Gaussian Splatting Decoder enforces 3D consistency by transforming SV3D latent outputs into an explicit 3D representation. Unlike SV3D, which only relies on implicit 2D representations for video generation, Gaussian Splatting explicitly encodes spatial and appearance attributes, enabling multi-view consistency through geometric constraints. These constraints correct view inconsistencies, ensuring robust geometric consistency. As a result, our approach simultaneously generates high-quality, multi-view-consistent images and accurate 3D models, providing a scalable solution for single-image-based 3D generation and bridging the gap between 2D Diffusion diversity and 3D structural coherence. Experimental results demonstrate state-of-the-art multi-view consistency and strong generalization across diverse datasets. The code will be made publicly available upon acceptance.
Poster
Xiaogang Xu · Jiafei Wu · Qingsen Yan · Jiequan Cui · Richang Hong · Bei Yu

[ Exhibit Hall I ]

Abstract
A major challenge in Low-Light Image Enhancement (LLIE) is its ill-posed nature: low-light images often lack sufficient information to align with normal-light ones (e.g., not all training data can be fully fitted to the ground truth). Numerous studies have attempted to bridge the gap between low- and normal-light data by introducing effective additional information, which is called "references" in this paper. However, existing methods overlook the valuable references hidden within the training dataset itself. In this work, we propose a novel LLIE strategy that simultaneously learns image-specific features by neural networks while formulating effective common features from the training data as the reference. These common features are correlated with the samples that are not fully fitted by the LLIE network itself, and they are represented as a set of Learnable Feature Patches and Vectors (LFPVs) in the hidden feature space. LFPVs are updated through two mechanisms: the sample-updater, which extracts useful features from training samples to refine LFPVs, and the mutual-updater, which propagates information across LFPVs to mutually update them. LFPVs can be adaptively aligned with image-specific features via our designed query-and-fusion procedure, boosting the LLIE performance. Our proposed method can be integrated into any LLIE framework, improving both enhancement …
Poster
Zesong Yang · Bangbang Yang · Wenqi Dong · Chenxuan Cao · Liyuan Cui · Yuewen Ma · Zhaopeng Cui · Hujun Bao

[ Exhibit Hall I ]

Abstract
Humans can naturally identify and mentally complete occluded objects in cluttered environments. However, imparting a similar cognitive ability to robots remains challenging even with advanced reconstruction techniques, which model scenes as undifferentiated wholes and fail to recognize complete objects from partial observations. In this paper, we propose InstaScene, a new paradigm towards holistic 3D perception of complex scenes with a primary goal: decomposing arbitrary instances while ensuring complete reconstruction. To achieve precise decomposition, we develop a novel spatial contrastive learning scheme that traces the rasterization of each instance across views, significantly enhancing semantic supervision in cluttered scenes. To overcome incompleteness from limited observations, we introduce in-situ generation that harnesses valuable observations and geometric cues, effectively guiding 3D generative models to reconstruct complete instances that seamlessly align with the real world. Experiments on scene decomposition and object completion across complex real-world and synthetic scenes demonstrate that our method achieves superior decomposition accuracy while producing geometrically faithful and visually intact objects. Code will be released upon acceptance.
Poster
Mohammad Mohammadi · Ziyi Wu · Igor Gilitschenski

[ Exhibit Hall I ]

Abstract
Long-term temporal information is crucial for event-based perception tasks, as raw events only encode pixel brightness changes. Recent works show that when trained from scratch, recurrent models achieve better results than feedforward models in these tasks. However, when leveraging self-supervised pre-trained weights, feedforward models can outperform their recurrent counterparts. Current self-supervised learning (SSL) methods for event-based pre-training largely mimic RGB image-based approaches. They pre-train feedforward models on raw events within a short time interval, ignoring the temporal information of events. In this work, we introduce TESPEC, a self-supervised pre-training framework tailored for recurrent architectures. To unleash the power of recurrent models, TESPEC is the first method utilizing longer sequences of events in the pre-training stage. TESPEC employs the masked image modeling paradigm with a new reconstruction target. We design a novel method to accumulate events into pseudo grayscale videos containing high-level semantic information about the underlying scene, which is robust to sensor noise and reduces motion blur. Reconstructing this target thus requires the model to reason about long-term history of events. Extensive experiments demonstrate our state-of-the-art performance in downstream tasks, including object detection, semantic segmentation, and monocular depth estimation.
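A hedged sketch of turning an event stream into pseudo-grayscale frames by accumulating per-pixel signed polarity over long windows, in the spirit of the reconstruction target described above (the paper's exact target construction likely differs).

```python
import numpy as np

def events_to_pseudo_frames(x, y, t, p, H, W, n_frames):
    """x, y: pixel coords; t: timestamps; p: polarity in {-1, +1}."""
    t_edges = np.linspace(t.min(), t.max(), n_frames + 1)
    frames = np.zeros((n_frames, H, W), dtype=np.float32)
    frame_idx = np.clip(np.searchsorted(t_edges, t, side="right") - 1, 0, n_frames - 1)
    np.add.at(frames, (frame_idx, y, x), p)       # accumulate signed events per bin
    # Cumulative sum over time approximates a log-intensity-like signal.
    return np.cumsum(frames, axis=0)

N = 10_000
frames = events_to_pseudo_frames(
    x=np.random.randint(0, 64, N), y=np.random.randint(0, 48, N),
    t=np.sort(np.random.rand(N)), p=np.random.choice([-1, 1], N),
    H=48, W=64, n_frames=8)
```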
Poster
Gengze Zhou · Yicong Hong · Zun Wang · Chongyang Zhao · Mohit Bansal · Qi Wu

[ Exhibit Hall I ]

Abstract
The academic field of learning instruction-guided visual navigation can be generally categorized into high-level category-specific search and low-level language-guided navigation, depending on the granularity of language instruction, in which the former emphasizes the exploration process, while the latter concentrates on following detailed textual commands. Despite the differing focuses of these tasks, the underlying requirements of interpreting instructions, comprehending the surroundings, and inferring action decisions remain consistent. This paper consolidates diverse navigation tasks into a unified and generic framework -- we investigate the core difficulties of sharing general knowledge and exploiting task-specific capabilities in learning navigation and propose a novel State-Adaptive Mixture of Experts (SAME) model that effectively enables an agent to infer decisions based on different-granularity language and dynamic observations. Powered by SAME, we present a versatile agent capable of addressing seven navigation tasks simultaneously that outperforms or achieves highly comparable performance to task-specific agents.
Poster
Jan Ackermann · Jonas Kulhanek · Shengqu Cai · Haofei Xu · Marc Pollefeys · Gordon Wetzstein · Leonidas Guibas · Songyou Peng

[ Exhibit Hall I ]

Abstract
In dynamic 3D environments, accurately updating scene representations over time is crucial for applications in robotics, mixed reality, and embodied AI. As scenes evolve, efficient methods to incorporate changes are needed to maintain up-to-date, high-quality reconstructions without the computational overhead of re-optimizing the entire scene. This paper introduces CL-Splats, which incrementally updates Gaussian splatting-based 3D representations from sparse scene captures. CL-Splats integrates a robust change-detection module that segments updated and static components within the scene, enabling focused, local optimization that avoids unnecessary re-computation. Moreover, CL-Splats supports storing and recovering previous scene states, facilitating temporal segmentation and new scene-analysis applications. Our extensive experiments demonstrate that CL-Splats achieves efficient updates with improved reconstruction quality over the state-of-the-art. This establishes a robust foundation for future real-time adaptation in 3D scene reconstruction tasks. We will release our source code and the synthetic and real-world datasets we created to support further research in this area.
Poster
Ziqi Ma · Yisong Yue · Georgia Gkioxari

[ Exhibit Hall I ]

Abstract
Why don't we have foundation models in 3D yet? A key limitation is data scarcity. For 3D object part segmentation, existing datasets are small in size and lack diversity. We show that it is possible to break this data barrier by building a data engine powered by 2D foundation models. Our data engine automatically annotates any number of object parts: 1755x more unique part types than existing datasets combined. By training on our annotated data with a simple contrastive objective, we obtain an open-world model that generalizes to any part in any object based on any text query. Even when evaluated zero-shot, we outperform existing methods on the datasets they train on. We achieve 260% improvement in mIoU and boost speed by 6x to 300x. Our scaling analysis confirms that this generalization stems from the data scale, which underscores the impact of our data engine. Finally, to advance general-category open-world 3D part segmentation, we release a benchmark covering a wide range of objects and parts.
Poster
Junho Kim · Gwangtak Bae · Eun Sun Lee · Young Min Kim

[ Exhibit Hall I ]

Abstract
Understanding scene contexts is crucial for machines to perform tasks and adapt prior knowledge in unseen or noisy 3D environments. As data-driven learning is intractable to comprehensively encapsulate diverse ranges of layouts and open spaces, we propose teaching machines to identify relational commonalities in 3D spaces. Instead of focusing on point-wise or object-wise representations, we introduce 3D scene analogies, which are smooth maps between 3D scene regions that align spatial relationships. Unlike well-studied single instance-level maps, these scene-level maps smoothly link large scene regions, potentially enabling unique applications in trajectory transfer in AR/VR, long demonstration transfer for imitation learning, and context-aware object rearrangement. To find 3D scene analogies, we propose neural contextual scene maps, which extract descriptor fields summarizing semantic and geometric contexts, and holistically align them in a coarse-to-fine manner for map estimation. This approach reduces reliance on individual feature points, making it robust to input noise or shape variations. Experiments demonstrate the effectiveness of our approach in identifying scene analogies and transferring trajectories or object placements in diverse indoor scenes, indicating its potential for robotics and AR/VR applications.
Poster
Yidi Shao · Mu Huang · Chen Change Loy · Bo Dai

[ Exhibit Hall I ]

Abstract
We introduce GausSim, a novel neural network-based simulator designed to capture the dynamic behaviors of real-world elastic objects represented through Gaussian kernels. We leverage continuum mechanics and treat each kernel as a Center of Mass System (CMS) that describes a continuous piece of matter, accounting for realistic deformations without idealized assumptions. To improve computational efficiency and fidelity, we employ a hierarchical structure that further organizes kernels into CMSs with explicit formulations, enabling a coarse-to-fine simulation approach. This structure significantly reduces computational overhead while preserving detailed dynamics. In addition, GausSim incorporates explicit physics constraints, such as mass and momentum conservation, ensuring interpretable results and robust, physically plausible simulations. To validate our approach, we present a new dataset, READY, containing multi-view videos of real-world elastic deformations. Experimental results demonstrate that GausSim achieves superior performance compared to existing physics-driven baselines, offering a practical and accurate solution for simulating complex dynamic behaviors. Code and model will be released.
Poster
Zhenhua Ning · Zhuotao Tian · Shaoshuai Shi · Daojing He · Guangming Lu · Wenjie Pei · Li Jiang

[ Exhibit Hall I ]

Abstract
Recent advances in point cloud perception have demonstrated remarkable progress in scene understanding through vision-language alignment leveraging large language models (LLMs). However, existing methods may still encounter challenges in handling complex instructions that require accurate spatial reasoning, even if the 3D point cloud data provides detailed spatial cues such as size and position for identifying the targets. To tackle this issue, we propose Relevant Reasoning Segmentation (R$^2$S), a reasoning-based segmentation framework. The framework emulates human cognitive processes by decomposing spatial reasoning into two sequential stages: first identifying relevant elements, then processing instructions guided by their associated visual priors. Furthermore, acknowledging the inadequacy of existing datasets in complex reasoning tasks, we introduce 3D ReasonSeg, a reasoning-based segmentation dataset comprising 25,185 training samples and 3,966 validation samples with precise annotations. Both quantitative and qualitative experiments demonstrate that the R$^2$S and 3D ReasonSeg effectively endow 3D point cloud perception with stronger spatial reasoning capabilities, and we hope that they can serve as a new baseline and benchmark for future work.
Poster
Kailai Zhou · Fuqiang Yang · Shixian Wang · Bihan Wen · Chongde Zi · Linsen Chen · Qiu Shen · Xun Cao

[ Exhibit Hall I ]

Abstract
RGB-Thermal (RGBT) multispectral vision is essential for robust perception in complex environments. Most RGBT tasks follow a case-by-case research paradigm, relying on manually customized models to learn task-oriented representations. Nevertheless, this paradigm is inherently constrained by artificial inductive bias, modality bias, and data bottleneck. To address these limitations, we make the initial attempt to build a Generalized RGBT MultiSpectral foundation model (M-SpecGene), which aims to learn modality-invariant representations from large-scale broad data in a self-supervised manner. M-SpecGene provides new insights into multispectral fusion and integrates prior case-by-case studies into a unified paradigm. Considering the unique characteristic of information imbalance in RGBT data, we introduce the Cross-Modality Structural Sparsity (CMSS) metric to quantify the information density across two modalities. Then we develop the GMM-CMSS progressive masking strategy to facilitate a flexible, easy-to-hard, and object-centric pre-training process. Comprehensive experiments validate M-SpecGene's generalizability across eleven datasets for four RGBT downstream tasks.
Poster
Han Wang · Shengyang Li · Jian Yang · Yuxuan Liu · Yixuan Lv · Zhuang Zhou

[ Exhibit Hall I ]

Abstract
Detecting and tracking ground objects using earth observation imagery remains a significant challenge in the field of remote sensing. Continuous maritime ship tracking is crucial for applications such as maritime search and rescue, law enforcement, and shipping analysis. However, most current ship tracking methods rely on geostationary satellites or video satellites. The former offer low resolution and are susceptible to weather conditions, while the latter have short filming durations and limited coverage areas, making them less suitable for the real-world requirements of ship tracking. To address these limitations, we present the Hybrid Optical and Synthetic Aperture Radar (SAR) Ship Re-Identification Dataset (HOSS ReID dataset), designed to evaluate the effectiveness of ship tracking using low-Earth orbit constellations of optical and SAR sensors. This approach ensures shorter re-imaging cycles and enables all-weather tracking. The HOSS ReID dataset includes images of the same ship captured over extended periods under diverse conditions, using different satellites of different modalities at varying times and angles. Furthermore, we propose a baseline method for cross-modal ship re-identification, TransOSS, which is built on the Vision Transformer architecture. It refines the patch embedding structure to better accommodate cross-modal tasks, incorporates additional embeddings to introduce more reference information, and employs contrastive learning …
Poster
Junru Lin · Chirag Vashist · Mikaela Uy · Colton Stearns · Xuan Luo · Leonidas Guibas · Ke Li

[ Exhibit Hall I ]

Abstract
Existing dynamic scene interpolation methods typically assume that the motion between consecutive time steps is small enough so that displacements can be locally approximated by linear models. In practice, even slight deviations from this small-motion assumption can cause conventional techniques to fail. In this paper, we introduce Global Motion Corresponder (GMC), a novel approach that robustly handles large motion and achieves smooth transitions. GMC learns a unary potential field that predicts SE(3) mappings into a shared canonical space, balancing correspondence, spatial and semantic smoothness, and local rigidity. We demonstrate that our method significantly outperforms existing baselines on 3D scene interpolation when the two states undergo large global motions. Furthermore, our method enables extrapolation where other baseline methods cannot.
Poster
Shouwei Ruan · Hanqing Liu · Yao Huang · Xiaoqi Wang · Caixin Kang · Hang Su · Yinpeng Dong · Xingxing Wei

[ Exhibit Hall I ]

Abstract
Vision Language Models (VLMs) have exhibited remarkable generalization capabilities, yet their robustness in dynamic real-world scenarios remains largely unexplored. To systematically evaluate VLMs' robustness to real-world 3D variations, we propose AdvDreamer, the first framework capable of generating physically reproducible Adversarial 3D Transformation (Adv-3DT) samples from single-view observations. In AdvDreamer, we integrate three key innovations: Firstly, to precisely characterize real-world 3D variations with limited prior knowledge, we design a zero-shot Monocular Pose Manipulation pipeline built upon generative 3D priors. Secondly, to ensure the visual quality of worst-case Adv-3DT samples, we propose a Naturalness Reward Model that provides continuous naturalness regularization during adversarial optimization, effectively preventing convergence to hallucinated or unnatural elements. Thirdly, to enable systematic evaluation across diverse VLM architectures and visual-language tasks, we introduce the Inverse Semantic Probability loss as the adversarial optimization objective, which solely operates in the fundamental visual-textual alignment space. Based on the captured Adv-3DT samples with high aggressiveness and transferability, we establish MM3DTBench, the first VQA benchmark dataset tailored to evaluate VLM robustness under challenging 3D variations. Extensive evaluations of representative VLMs with varying architectures reveal that real-world 3D variations can pose severe threats to model performance across various tasks.
Poster
Siqi Yang · Jinxiu Liang · Zhaojun Huang · Yeliduosi Xiaokaiti · Yakun Chang · Zhaofei Yu · Boxin Shi

[ Exhibit Hall I ]

Abstract
High-speed video reconstruction from neuromorphic spike cameras offers a promising alternative to traditional frame-based imaging, providing superior temporal resolution and dynamic range with reduced power consumption. Nevertheless, reconstructing high-quality colored videos from spikes captured in an ultra-short time interval remains challenging due to the noisy nature of spikes. While some existing methods extend the temporal capture window to improve reconstruction quality, they compromise the temporal resolution advantages of spike cameras. In this paper, we introduce SpikeDiff, the first zero-shot framework that leverages pretrained diffusion models to reconstruct high-quality colored videos from sub-millisecond chromatic spikes. By incorporating physics-based guidance into the diffusion sampling process, SpikeDiff bridges the domain gap between chromatic spikes and conventional images, enabling high-fidelity reconstruction without requiring domain-specific training data. Extensive experiments demonstrate that SpikeDiff achieves impressive reconstruction quality while maintaining ultra-high temporal resolution, outperforming existing methods across diverse challenging scenarios.
Poster
Hao Chen · Tao Han · Song Guo · Jie Zhang · Yonghan Dong · Yunlong Yu · Lei Bai

[ Exhibit Hall I ]

Abstract
This paper presents Variables-Adaptive Mixture of Experts (VA-MoE), a novel framework for incremental weather forecasting that dynamically adapts to evolving spatiotemporal patterns in real-time data. Traditional weather prediction models often struggle with exorbitant computational expenditure and the need to continuously update forecasts as new observations arrive. VA-MoE addresses these challenges by leveraging a hybrid architecture of experts, where each expert specializes in capturing distinct sub-patterns of atmospheric variables (e.g., temperature, humidity, wind speed). Moreover, the proposed method employs a variable-adaptive gating mechanism to dynamically select and combine relevant experts based on the input context, enabling efficient knowledge distillation and parameter sharing. This design significantly reduces computational overhead while maintaining high forecast accuracy. Experiments on the real-world ERA5 dataset demonstrate that VA-MoE performs comparably to state-of-the-art models in both short-term (e.g., 1–3 days) and long-term (e.g., 5 days) forecasting tasks, with only about 25% of the trainable parameters and 50% of the initial training data.
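A minimal mixture-of-experts sketch illustrating the variable-adaptive gating idea described above (the expert count, gating inputs, and layer sizes are assumptions, not the VA-MoE specification).

```python
import torch
import torch.nn as nn

class VariableGatedMoE(nn.Module):
    def __init__(self, dim=128, n_experts=4, n_variables=5):
        super().__init__()
        self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(n_experts)])
        # Gate conditioned on both the feature and a variable-id embedding
        # (e.g. temperature vs. wind speed), so each variable selects its experts.
        self.var_embed = nn.Embedding(n_variables, dim)
        self.gate = nn.Linear(2 * dim, n_experts)

    def forward(self, x, var_id):
        # x: B x dim features, var_id: B variable indices
        g = torch.softmax(self.gate(torch.cat([x, self.var_embed(var_id)], -1)), -1)
        expert_out = torch.stack([e(x) for e in self.experts], dim=1)  # B x E x dim
        return (g.unsqueeze(-1) * expert_out).sum(dim=1)

moe = VariableGatedMoE()
out = moe(torch.randn(8, 128), torch.randint(0, 5, (8,)))
```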
Poster
Hyunjin Cho · Giyun Choi · Jongwon Choi

[ Exhibit Hall I ]

Abstract
Existing Human Mesh Recovery (HMR) methods typically assume a standard human body structure, overlooking diverse anatomical conditions such as limb loss or mobility impairments. This assumption biases the models when applied to individuals with disabilities—a shortcoming further exacerbated by the limited availability of suitable datasets. To address this gap, we propose Amputated Joint Aware Human Recovery (AJAHR), an adaptive pose estimation framework that enhances mesh reconstruction for individuals with impairments. Our model incorporates a body-part amputation classifier—jointly trained alongside human mesh recovery—to detect potential amputations. We also introduce Amputee 3D (A3D), a synthetic dataset offering a wide range of amputee poses for more robust training. While maintaining strong performance on non-amputees, our approach achieves state-of-the-art results for amputated individuals.
Poster
Zhexiong Wan · Jianqin Luo · Yuchao Dai · Gim Hee Lee

[ Exhibit Hall I ]

Abstract
Recent point tracking methods have made great strides in recovering the trajectories of any point (especially key points) in long video sequences associated with large motions. However, the spatial and temporal granularities of point trajectories remain constrained by limited motion estimation accuracy and video frame rate. Leveraging the high temporal resolution and motion sensitivity of event cameras, we introduce event data for the first time to recover spatially dense and temporally continuous trajectories of every point at any time. Specifically, we define the dense and continuous point trajectory representation as estimating multiple control points of curves for each pixel and model the movement of sparse events triggered along continuous point trajectories. Building on this, we propose a novel multi-frame iterative streaming framework that first estimates local inter-frame motion representations from two consecutive frames with inter-frame events, then aggregates them into a global long-term motion representation to utilize input full video and event data with an arbitrary number of frames. Extensive experiments on simulated and real data demonstrate the significant improvement of our framework over state-of-the-art methods and the crucial role of introducing events to model continuous point trajectories.
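A sketch of the "multiple control points per pixel" trajectory representation, assuming a cubic Bézier parameterization (the paper's exact curve family may differ): each pixel stores four 2D control points and its trajectory can be queried at any continuous time in [0, 1].

```python
import torch

def bezier_trajectory(ctrl: torch.Tensor, tau: torch.Tensor) -> torch.Tensor:
    """ctrl: H x W x 4 x 2 control points, tau: T query times -> T x H x W x 2."""
    tau = tau.view(-1, 1, 1, 1)
    c0, c1, c2, c3 = ctrl[..., 0, :], ctrl[..., 1, :], ctrl[..., 2, :], ctrl[..., 3, :]
    return ((1 - tau) ** 3 * c0 + 3 * (1 - tau) ** 2 * tau * c1
            + 3 * (1 - tau) * tau ** 2 * c2 + tau ** 3 * c3)

ctrl = torch.randn(48, 64, 4, 2)
traj = bezier_trajectory(ctrl, torch.linspace(0, 1, 10))   # 10 x 48 x 64 x 2
```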
Poster
Athinoulla Konstantinou · Georgios Leontidis · Mamatha Thota · Aiden Durrant

[ Exhibit Hall I ]

Abstract
Learning self-supervised representations that are invariant and equivariant to transformations is crucial for advancing beyond traditional visual classification tasks. However, many methods rely on predictor architectures to encode equivariance, despite evidence that architectural choices, such as capsule networks, inherently excel at learning interpretable pose-aware representations. To explore this, we introduce EquiCaps (Equivariant Capsule Network), a capsule-based approach to pose-aware self-supervision that eliminates the need for a specialised predictor for enforcing equivariance. Instead, we leverage the intrinsic pose-awareness capabilities of capsules to improve performance in pose estimation tasks. To further challenge our assumptions, we increase task complexity via multi-geometric transformations to enable a more thorough evaluation of invariance and equivariance by introducing 3DIEBench-T, an extension of a 3D object-rendering benchmark dataset. Empirical results demonstrate that EquiCaps outperforms prior state-of-the-art equivariant methods on geometric tasks, including rotation and translation, achieving a supervised-level $R^2$ of 0.78 on the 3DIEBench rotation prediction benchmark and improving upon SIE and CapsIE by 0.05 and 0.04 $R^2$, respectively. Moreover, in contrast to non-capsule-based equivariant approaches, EquiCaps maintains robust equivariant performance under combined geometric transformations, underscoring its generalisation capabilities and the promise of predictor-free capsule architectures. Code and dataset will be released.
Poster
Ye Lu · Jie Wang · Jianjun Gao · Rui Gong · Chen Cai · Kim-Hui Yap

[ Exhibit Hall I ]

Abstract
Recent Mamba-based methods for the pose-lifting task tend to model joint dependencies by 2D-to-1D mapping with diverse scanning strategies. Though effective, they struggle to model intricate joint connections and uniformly process all joint motion trajectories, neglecting the intrinsic differences across motion characteristics. In this work, we propose a structure-aware and motion-adaptive framework, named SAMA, to capture spatial joint topology along with diverse motion dynamics independently. Specifically, SAMA consists of a Structure-aware State Integrator (SSI) and a Motion-adaptive State Modulator (MSM). The Structure-aware State Integrator is tasked with leveraging dynamic joint relationships to fuse information at both the joint feature and state levels in the state space, based on pose topology rather than sequential state transitions. The Motion-adaptive State Modulator is responsible for recognizing joint-specific motion characteristics, thus applying tailored adjustments to diverse motion patterns across different joints. Through the above key modules, our algorithm enables structure-aware and motion-adaptive pose lifting. Extensive experiments across multiple benchmarks demonstrate that our algorithm achieves strong results at a lower computational cost.
Poster
Dehao Yuan · Levi Burner · Jiayi Wu · Minghui Liu · Jingxi Chen · Yiannis Aloimonos · Cornelia Fermuller

[ Exhibit Hall I ]

Abstract
Event-based motion field estimation is an important task. However, current optical flow methods face challenges: learning-based approaches, often frame-based and relying on CNNs, lack cross-domain transferability, while model-based methods, though more robust, are less accurate. To address the limitations of optical flow estimation, recent works have focused on normal flow, which can be more reliably measured in regions with limited texture or strong edges. However, existing normal flow estimators are predominantly model-based and suffer from high errors. In this paper, we propose a novel supervised point-based method for normal flow estimation that overcomes the limitations of existing event learning-based approaches. Using a local point cloud encoder, our method directly estimates per-event normal flow from raw events, offering multiple unique advantages: 1) It produces temporally and spatially sharp predictions. 2) It supports more diverse data augmentation, such as random rotation, to improve robustness across various domains. 3) It naturally supports uncertainty quantification via ensemble inference, which benefits downstream tasks. 4) It enables training and inference on undistorted data in normalized camera coordinates, improving transferability across cameras. Extensive experiments demonstrate our method achieves better and more consistent performance than state-of-the-art methods when transferred across different datasets. Leveraging this transferability, we train our model …
Poster
Shuang Guo · Friedhelm Hamann · Guillermo Gallego

[ Exhibit Hall I ]

Abstract
Event cameras rely on motion to obtain information about scene appearance. In other words, for event cameras, motion and appearance are either both observed or neither is, and both are encoded in the output event stream. Previous works consider recovering these two visual quantities as separate tasks, which does not fit the nature of event cameras and neglects the inherent relations between the two tasks. In this paper, we propose an unsupervised learning framework that jointly estimates optical flow (motion) and image intensity (appearance) with a single network. Starting from the event generation model, we newly derive the event-based photometric error as a function of optical flow and image intensity, which is further combined with the contrast maximization framework, yielding a comprehensive loss function that provides proper constraints for both flow and intensity estimation. Exhaustive experiments show that our model achieves state-of-the-art performance for both optical flow (achieving 20% and 25% improvements in EPE and AE, respectively, in the unsupervised learning category) and intensity estimation (producing competitive results with other baselines, particularly in high dynamic range scenarios). Last but not least, our model achieves shorter inference time than all the other optical flow models and many of the image reconstruction models, while they …
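The derivation of the event-based photometric error is not reproduced above; the following is a hedged sketch under standard assumptions (a contrast threshold $C$, per-event displacement $\mathbf{u}\,\Delta t$), not the paper's exact formulation.

```latex
% Event generation model: an event of polarity p_k at pixel x_k marks a
% log-intensity change of p_k C over the time since the previous event there:
\begin{align}
  p_k\,C &\approx \log I(\mathbf{x}_k, t_k) - \log I(\mathbf{x}_k, t_k - \Delta t_k).
\end{align}
% Substituting brightness constancy along the flow,
% I(x_k, t_k) = I(x_k - u(x_k) dt_k, t_k - dt_k), gives a per-event residual
% that couples a single intensity map I and the flow u:
\begin{align}
  r_k(I, \mathbf{u}) = \log I\big(\mathbf{x}_k - \mathbf{u}(\mathbf{x}_k)\,\Delta t_k\big)
                       - \log I(\mathbf{x}_k) - p_k\,C,
\end{align}
% and a training loss sums a robust penalty over all events, optionally combined
% with a contrast-maximization term as described above.
```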
Poster
Adrian Chow · Evelien Riddell · Yimu Wang · Sean Sedwards · Krzysztof Czarnecki

[ Exhibit Hall I ]

Abstract
Open-vocabulary 3D object detection for autonomous driving aims to detect novel objects beyond the predefined training label sets in point cloud scenes. Existing approaches achieve this by connecting traditional 3D object detectors with vision-language models (VLMs) to regress 3D bounding boxes for novel objects and perform open-vocabulary classification through cross-modal alignment between 3D and 2D features. However, achieving robust cross-modal alignment remains a challenge due to semantic inconsistencies when generating corresponding 3D and 2D feature pairs. To overcome this challenge, we present OV-SCAN, an Open-Vocabulary 3D framework that enforces Semantically Consistent Alignment for Novel object discovery. OV-SCAN employs two core strategies: discovering precise 3D annotations and filtering out low-quality or corrupted alignment pairs (arising from 3D annotation, occlusion-induced, or resolution-induced noise). Extensive experiments on the nuScenes dataset demonstrate that OV-SCAN achieves state-of-the-art performance.
Poster
Atin Pothiraj · Jaemin Cho · Elias Stengel-Eskin · Mohit Bansal

[ Exhibit Hall I ]

Abstract
Recognizing and reasoning about occluded (partially or fully hidden) objects is vital to understanding visual scenes, as occlusions frequently occur in real-world environments and act as obstacles for spatial comprehension. To test models' ability to reason about multiple occluded objects, we introduce a novel task, **C**ounting **A**modally for **P**atterns **T**hrough **U**nseen **RE**gions (CAPTURe), which requires a model to count objects arranged in a pattern by inferring how the pattern continues behind an occluder (an object which blocks parts of the scene). CAPTURe requires both recognizing visual patterns and reasoning, making it an ideal testbed for evaluating vision-language models (VLMs) on whether they understand occluded patterns and possess spatial understanding skills. By requiring models to reason about occluded objects, CAPTURe also tests VLMs' ability to form world models, allowing them to fill in missing information. CAPTURe consists of two parts: (1) CAPTURe-real, with manually filtered images of real objects in patterns and (2) CAPTURe-synthetic, a controlled diagnostic with generated patterned images. We evaluate four strong VLMs -- GPT-4o, Intern-VL2-Llama3, Molmo, and Qwen2-VL -- on CAPTURe, finding that models struggle to count on both occluded and unoccluded patterns. Crucially, we find that models perform worse with occlusion, suggesting that VLMs are also deficient in …
Poster
Zhijian Huang · Chengjian Feng · Baihui Xiao · Feng yan · ZEQUN JIE · Yujie Zhong · Xiaodan Liang · Lin Ma

[ Exhibit Hall I ]

Abstract
Large Multimodal Models (LMMs) have demonstrated exceptional comprehension and interpretation capabilities in Autonomous Driving (AD) by incorporating large language models. Despite these advancements, current data-driven AD approaches tend to concentrate on a single dataset and specific tasks, neglecting their overall capabilities and ability to generalize. To bridge these gaps, we propose DriveMM, a general large multimodal model designed to process diverse data inputs, such as images and multi-view videos, while performing a broad spectrum of AD tasks, including perception, prediction, and planning. Initially, the model undergoes curriculum pre-training to process varied visual signals and perform basic visual comprehension and perception tasks. Subsequently, we augment and standardize various AD-related datasets to fine-tune the model, resulting in an all-in-one LMM for autonomous driving. To assess the general capabilities and generalization ability, we conduct evaluations on six public benchmarks and undertake zero-shot transfer on an unseen dataset, where DriveMM achieves state-of-the-art performance across all tasks. We hope DriveMM will serve as a promising solution for future end-to-end autonomous driving applications in the real world.
Poster
Yuyi Liu · Xinhang Song · Tianliang Qi · Shuqiang Jiang

[ Exhibit Hall I ]

Abstract
Towards visual room rearrangement for embodied agents, this paper tackles the intricate challenge of restoring a disarrayed scene configuration to its intended goal state. The task necessitates a range of sophisticated capabilities, including efficient spatial navigation, precise and accurate object interaction, sensitive scene change detection, and meticulous restoration techniques. The inherent complexity of this endeavor stems from the diverse nature of potential object changes, encompassing movements within the space, alterations in appearance, and changes in existence—where objects may be introduced or removed from the scene. Previous methods, either end-to-end reinforcement learning or modular approaches, struggle with handling these changes in a unified manner due to the heterogeneous nature of the inference spaces. To address this, this paper proposes a Trial-Oriented Visual Rearrangement (TOR) framework, which leverages the principles of stronger embodiment to prune the joint reasoning space and identify a smaller shared space for processing various object changes. TOR maintains a differential point cloud representation to capture environmental changes and uses two core mechanisms, assessment and refinement, to iteratively restore the scene to the goal state. Experimental results demonstrate the effectiveness of TOR in restoring both object movement and appearance changes and show its generalization to complex multi-room environments.
Poster
Yufeng Jin · Vignesh Prasad · Snehal Jauhri · Mathias Franzius · Georgia Chalvatzaki

[ Exhibit Hall I ]

Abstract
Efficient and accurate object pose estimation is an essential component for modern vision systems in many applications such as Augmented Reality, autonomous driving, and robotics. While research in model-based 6D object pose estimation has delivered promising results, model-free methods are hindered by the high computational load in rendering and inferring consistent poses of arbitrary objects in a live RGB-D video stream. To address this issue, we present 6DOPE-GS, a novel method for online 6D object pose estimation & tracking with a single RGB-D camera by effectively leveraging advances in Gaussian Splatting. Thanks to the fast differentiable rendering capabilities of Gaussian Splatting, 6DOPE-GS can simultaneously optimize for 6D object poses and 3D object reconstruction. To achieve the necessary efficiency and accuracy for live tracking, our method uses incremental 2D Gaussian Splatting with an intelligent dynamic keyframe selection procedure to achieve high spatial object coverage and prevent erroneous pose updates. We also propose an opacity statistic-based pruning mechanism for adaptive Gaussian density control, to ensure training stability and efficiency. We evaluate our method on the HO3D and YCBInEOAT datasets and show that 6DOPE-GS matches the performance of state-of-the-art baselines for model-free simultaneous 6D pose tracking and reconstruction while providing a 5x speedup. …
Poster
Javier Tirado-Garín · Javier Civera

[ Exhibit Hall I ]

Abstract
We present AnyCalib, a method for calibrating the intrinsic parameters of a camera from a single in-the-wild image, that is agnostic to the camera model. Current methods are predominantly tailored to specific camera models and/or require extrinsic cues, such as the direction of gravity, to be visible in the image. In contrast, we argue that the perspective and distortion cues inherent in images are sufficient for model-agnostic camera calibration. To demonstrate this, we frame the calibration process as the regression of the rays corresponding to each pixel. We show, for the first time, that this intermediate representation allows for a closed-form recovery of the intrinsics for a wide range of camera models, including but not limited to: pinhole, Brown-Conrady and Kannala-Brandt. Our approach also applies to edited (cropped and stretched) images. Experimentally, we demonstrate that AnyCalib consistently outperforms alternative methods, including 3D foundation models, despite being trained on orders of magnitude less data. We will make our code and weights publicly available.
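To illustrate why per-pixel rays admit a closed-form intrinsics recovery, here is a minimal NumPy sketch for the plain pinhole case only (function and variable names are ours; the paper's recovery covers far more general camera models):

```python
import numpy as np

def pinhole_intrinsics_from_rays(pixels, rays):
    """Closed-form pinhole intrinsics (fx, fy, cx, cy) from per-pixel rays.

    pixels: (N, 2) array of (u, v) pixel coordinates.
    rays:   (N, 3) ray directions for those pixels in the camera frame (z > 0),
            e.g. as predicted by a ray-regression network.
    """
    x, y, z = rays[:, 0], rays[:, 1], rays[:, 2]
    u, v = pixels[:, 0], pixels[:, 1]
    # Pinhole projection: u = fx * (x / z) + cx,  v = fy * (y / z) + cy.
    # Each equation is linear in (f, c), so two tiny least-squares solves suffice.
    Au = np.stack([x / z, np.ones_like(z)], axis=1)
    Av = np.stack([y / z, np.ones_like(z)], axis=1)
    (fx, cx), *_ = np.linalg.lstsq(Au, u, rcond=None)
    (fy, cy), *_ = np.linalg.lstsq(Av, v, rcond=None)
    return fx, fy, cx, cy

# Sanity check with synthetic rays from known intrinsics.
if __name__ == "__main__":
    fx, fy, cx, cy = 500.0, 480.0, 320.0, 240.0
    uv = np.random.rand(1000, 2) * [640, 480]
    dirs = np.stack([(uv[:, 0] - cx) / fx, (uv[:, 1] - cy) / fy, np.ones(len(uv))], axis=1)
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    print(pinhole_intrinsics_from_rays(uv, dirs))  # approximately (500, 480, 320, 240)
```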
Poster
Zukang Liao · Min Chen

[ Exhibit Hall I ]

Abstract
In many applications, machine-learned (ML) models are required to hold some invariance qualities, such as rotation, size, and intensity invariance. Among these, testing for background invariance presents a significant challenge due to the vast and complex data space it encompasses. To evaluate invariance qualities, we use a visualization-based testing framework which allows human analysts to assess and make informed decisions about the invariance properties of ML models. We show such informative testing framework is preferred as ML models with the same global statistics (e.g., accuracy scores) can behave differently and have different visualized testing patterns. However, such human analysts might not lead to consistent decisions without a systematic sampling approach to select representative testing suites. In this work, we present a technical solution for selecting background scenes according to their semantic proximity to a target image that contains a foreground object being tested. We construct an ontology for storing knowledge about relationships among different objects using association analysis. This ontology enables efficient and meaningful search for background scenes of different semantic distances to a target image, enabling the selection of a test suite that is both diverse and reasonable. Compared with other testing techniques, e.g., random sampling, nearest neighbours, or …
Poster
Sung-Yeon Park · Can Cui · Yunsheng Ma · Ahmadreza Moradipari · Rohit Gupta · Kyungtae Han · Ziran Wang

[ Exhibit Hall I ]

Abstract
Recent advances in multi-modal large language models (MLLMs) have demonstrated strong performance across various domains; however, their ability to comprehend driving scenes remains less proven. The complexity of driving scenarios, which includes multi-view information, poses significant challenges for existing MLLMs. In this paper, we introduce NuPlanQA-Eval, a multi-view, multi-modal evaluation benchmark for driving scene understanding. To further support generalization to multi-view driving scenarios, we also propose NuPlanQA-1M, a large-scale dataset comprising 1M real-world visual question-answering (VQA) pairs. For context-aware analysis of traffic scenes, we categorize our dataset into nine subtasks across three core skills: Road Environment Perception, Spatial Relations Recognition, and Ego-Centric Reasoning. Furthermore, we present BEV-LLM, integrating Bird's-Eye-View (BEV) features from multi-view images into MLLMs. Our evaluation results reveal key challenges that existing MLLMs face in driving scene-specific perception and spatial reasoning from ego-centric perspectives. In contrast, BEV-LLM demonstrates remarkable adaptability to this domain, outperforming other models in six of the nine subtasks. These findings highlight how BEV integration enhances multi-view MLLMs while also identifying key areas that require further refinement for effective adaptation to driving scenes. To facilitate further research, we will publicly release both NuPlanQA-Eval and NuPlanQA-1M upon acceptance of this paper.
Poster
Byeongjun Kwon · Munchurl Kim

[ Exhibit Hall I ]

Abstract
Zero-shot depth estimation (DE) models exhibit strong generalization performance as they are trained on large-scale datasets. However, existing models struggle with high-resolution images due to the discrepancy between the image resolutions used during training (smaller) and inference (higher). Processing images at full resolution leads to decreased depth estimation accuracy and tremendous memory consumption, while downsampling to the training resolution results in blurred edges in the estimated depth images. Prevailing high-resolution depth estimation methods adopt a patch-based approach, which introduces depth discontinuity issues when reassembling the estimated depth patches and results in test-time inefficiency. Additionally, to obtain fine-grained depth details, these methods rely on synthetic datasets due to the sparsity of real-world ground-truth depth, leading to poor generalizability. To tackle these limitations, we propose Patch Refine Once (PRO), an efficient and generalizable tile-based framework. Our PRO consists of two key components: (i) Grouped Patch Consistency Training, which enhances test-time efficiency while mitigating the depth discontinuity problem by jointly processing four overlapping patches and enforcing a consistency loss on their overlapping regions within a single backpropagation step, and (ii) Bias-Free Masking, which prevents the DE models from overfitting to dataset-specific biases, enabling better generalization to real-world datasets even after …
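A rough PyTorch sketch of the kind of overlap-consistency penalty described above, using hypothetical names and a simple variance-over-overlap formulation (the paper enforces the loss jointly over four overlapping patches in a single backpropagation step):

```python
import torch

def overlap_consistency_loss(depth_patches, offsets, full_hw):
    """Penalize disagreement between overlapping per-patch depth predictions.

    depth_patches: list of (h, w) tensors, depth predicted for each patch.
    offsets:       list of (top, left) positions of each patch in the full image.
    full_hw:       (H, W) size of the full image.
    """
    H, W = full_hw
    total = torch.zeros(H, W)
    total_sq = torch.zeros(H, W)
    count = torch.zeros(H, W)
    for d, (top, left) in zip(depth_patches, offsets):
        h, w = d.shape
        total[top:top + h, left:left + w] += d
        total_sq[top:top + h, left:left + w] += d ** 2
        count[top:top + h, left:left + w] += 1
    overlap = count > 1
    if not overlap.any():
        return total.sum() * 0.0  # no overlapping pixels -> zero loss
    mean = total[overlap] / count[overlap]
    var = total_sq[overlap] / count[overlap] - mean ** 2
    # Variance across patches at each overlapped pixel; zero when patches agree.
    return var.clamp(min=0).mean()
```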
Poster
Xiao Fang · Minhyek Jeon · Zheyang Qin · Stanislav Panev · Celso de Melo · Shuowen Hu · Shayok Chakraborty · Fernando De la Torre

[ Exhibit Hall I ]

Abstract
Detecting vehicles in aerial imagery is a critical task with applications in traffic monitoring, urban planning, and defense intelligence. Deep learning methods have provided state-of-the-art (SOTA) results for this application. However, a significant challenge arises when models trained on data from one geographic region fail to generalize effectively to other areas. Variability in factors such as environmental conditions, urban layouts, road networks, vehicle types, and image acquisition parameters (e.g., resolution, lighting, and angle) leads to domain shifts that degrade model performance. This paper presents a novel approach to address this challenging problem by leveraging generative AI for the high-quality synthesis of aerial images and corresponding labels to enhance detector training through data augmentation. Our key contribution is the development of a multi-stage, multi-modal knowledge transfer framework utilizing fine-tuned latent diffusion models (LDMs) to mitigate the distribution gap between the source and target environments. Through extensive experiments across diverse aerial imagery domains, we demonstrate significant performance gains (more than 40% in some cases) over existing domain adaptation and weakly supervised learning methods. Our method also outperforms the baseline detectors trained on a source dataset by 4-12%. Furthermore, we introduce two newly annotated aerial datasets from New Zealand and Utah, which along …
Poster
Chong Cheng · Yu Hu · Sicheng Yu · Beizhen ZHAO · Zijian Wang · Hao Wang

[ Exhibit Hall I ]

Abstract
3D Gaussian Splatting (3DGS) has demonstrated its potential in reconstructing scenes from unposed images. However, optimization-based 3DGS methods struggle with sparse views due to limited prior knowledge. Meanwhile, feed-forward Gaussian approaches are constrained by input formats, making it challenging to incorporate more input views. To address these challenges, we propose RegGS, a 3D Gaussian registration-based framework for reconstructing unposed sparse views. RegGS aligns local 3D Gaussians generated by a feed-forward network into a globally consistent 3D Gaussian representation. Technically, we implement an entropy-regularized Sinkhorn algorithm to efficiently solve the optimal transport Mixture 2-Wasserstein $(\text{MW}_2)$ distance, which serves as an alignment metric for Gaussian mixture models (GMMs) in $\mathrm{Sim}(3)$ space. Furthermore, we design a joint 3DGS registration module that integrates the $\text{MW}_2$ distance, photometric consistency, and depth geometry. This enables a coarse-to-fine registration process while accurately estimating camera poses and aligning the scene. Experiments on the RE10K and ACID datasets demonstrate that RegGS effectively registers local Gaussians with high fidelity, achieving precise pose estimation and high-quality novel-view synthesis.
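A generic sketch of an entropy-regularized Sinkhorn solve for a mixture-Wasserstein (MW2) coupling between two GMMs, with the Gaussian-to-Gaussian W2 (Bures) cost; the names and the plain (non-Sim(3)-aware) formulation are our simplification of what the paper describes:

```python
import numpy as np
from scipy.linalg import sqrtm

def gaussian_w2_sq(m1, S1, m2, S2):
    """Squared 2-Wasserstein distance between two Gaussians (Bures metric)."""
    s1_half = np.real(sqrtm(S1))
    cross = np.real(sqrtm(s1_half @ S2 @ s1_half))
    return float(np.sum((m1 - m2) ** 2) + np.trace(S1 + S2 - 2.0 * cross))

def mw2_sinkhorn(means1, covs1, w1, means2, covs2, w2, eps=0.01, iters=200):
    """Entropy-regularized mixture-Wasserstein coupling between two GMMs.

    The ground cost between mixture components is the Gaussian W2 distance;
    Sinkhorn iterations then couple the component weights w1 and w2.
    eps should be chosen relative to the scale of the costs.
    """
    C = np.array([[gaussian_w2_sq(m1, S1, m2, S2)
                   for m2, S2 in zip(means2, covs2)]
                  for m1, S1 in zip(means1, covs1)])
    K = np.exp(-C / eps)
    u = np.ones(len(w1))
    v = np.ones(len(w2))
    for _ in range(iters):
        u = w1 / (K @ v)
        v = w2 / (K.T @ u)
    P = np.diag(u) @ K @ np.diag(v)      # soft component-to-component coupling
    return np.sum(P * C), P              # regularized MW2 value and transport plan
```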
Poster
Xiaoyang Hao · Han Li

[ Exhibit Hall I ]

Abstract
Monocular 3D human pose estimation (HPE) methods estimate the 3D positions of joints from individual images. Existing 3D HPE approaches often use the cropped image alone as input for their models. However, the relative depths of joints cannot be accurately estimated from cropped images without the corresponding camera intrinsics, which determine the perspective relationship between 3D objects and the cropped images. In this work, we introduce Perspective Encoding (PE) to encode the camera intrinsics of the cropped images. Moreover, since the human subject can appear anywhere within the original image, the perspective relationship between the 3D scene and the cropped image varies significantly across images, which complicates model fitting. Additionally, the further the human subject deviates from the image center, the greater the perspective distortions in the cropped image. To address these issues, we propose Perspective Rotation (PR), a transformation applied to the original image that centers the human subject, thereby reducing perspective distortions and alleviating the difficulty of model fitting. By incorporating PE and PR, we propose a novel 3D HPE framework, PersPose. Experimental results demonstrate that PersPose achieves state-of-the-art (SOTA) performance on the 3DPW, MPI-INF-3DHP, and Human3.6M datasets. For example, on the in-the-wild dataset 3DPW, PersPose achieves an MPJPE of 60.1 …
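A small NumPy sketch of the "rotate the view so the subject is centered" idea behind such a transformation (our construction and names; the actual PR operation in the paper may differ in its details):

```python
import numpy as np

def perspective_rotation_homography(K, bbox_center_uv):
    """Homography that rotates the view so the subject's bbox center lies on
    the optical axis, removing off-center perspective distortion.

    K:              3x3 camera intrinsics of the original image.
    bbox_center_uv: (u, v) pixel coordinates of the subject's bbox center.
    Returns H = K @ R @ K^{-1}; warp the image with it, e.g. via
    cv2.warpPerspective(image, H, (W, H)).
    """
    u, v = bbox_center_uv
    d = np.linalg.inv(K) @ np.array([u, v, 1.0])
    z_axis = d / np.linalg.norm(d)              # new optical axis through the subject
    up = np.array([0.0, 1.0, 0.0])
    x_axis = np.cross(up, z_axis)
    x_axis /= np.linalg.norm(x_axis)
    y_axis = np.cross(z_axis, x_axis)
    R = np.stack([x_axis, y_axis, z_axis])      # maps old-frame rays to the new frame
    return K @ R @ np.linalg.inv(K)
```

With this construction, the ray through the bbox center maps exactly to the principal point of the rotated view, so the cropped subject sees the same perspective regardless of where it appeared in the original image.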
Poster
ZIYU ZHU · Xilin Wang · Yixuan Li · Zhuofan Zhang · Xiaojian Ma · Yixin Chen · Baoxiong Jia · Wei Liang · Qian Yu · Zhidong Deng · Siyuan Huang · Qing Li

[ Exhibit Hall I ]

Abstract
Embodied scene understanding requires not only comprehending visual-spatial information that has been observed but also determining where to explore next in the 3D physical world. Existing 3D Vision-Language (3D-VL) models primarily focus on grounding objects in static observations from 3D reconstruction, such as meshes and point clouds, but lack the ability to actively perceive and explore their environment. To address this limitation, we introduce **M**ove **t**o **U**nderstand (**MTU3D**), a unified framework that integrates active perception with **3D** vision-language learning, enabling embodied agents to effectively explore and understand their environment. This is achieved by three key innovations: 1) Online query-based representation learning, enabling direct spatial memory construction from RGB-D frames, eliminating the need for explicit 3D reconstruction. 2) A unified objective for grounding and exploration that represents unexplored locations as frontier queries and jointly optimizes object grounding and frontier selection. 3) End-to-end trajectory learning that combines **V**ision-**L**anguage-**E**xploration pre-training over a million diverse trajectories collected from both simulated and real-world RGB-D sequences. Extensive evaluations across various embodied navigation and question-answering benchmarks show that MTU3D outperforms state-of-the-art reinforcement learning and modular navigation approaches by 14%, 27%, 11%, and 3% in success rate on HM3D-OVON, GOAT-Bench, SG3D, and A-EQA, respectively. MTU3D's versatility enables navigation …
Poster
Yinuo Zhao · Jiale Yuan · Zhiyuan Xu · Xiaoshuai Hao · Xinyi Zhang · Kun Wu · Zhengping Che · Chi Liu · Jian Tang

[ Exhibit Hall I ]

Abstract
Recent advances in vision-language models (VLMs) have significantly improved performance in embodied tasks such as goal decomposition and visual comprehension. However, providing accurate rewards for robotic manipulation without fine-tuning VLMs remains challenging due to the absence of domain-specific robotic knowledge in pre-trained datasets and high computational costs that hinder real-time applicability. To address this, we propose **$T^2$-VLM**, a novel training-free, temporally consistent framework that generates accurate rewards through tracking the changes in VLM-derived subgoals. Specifically, our method first queries the VLM to establish spatially aware subgoals and an initial completion estimate before each round of interaction. We then employ a Bayesian tracking algorithm to update the goal completion status dynamically, using subgoal hidden states to generate structured rewards for reinforcement learning (RL) agents. This approach enhances long-horizon decision-making and improves failure recovery capabilities with RL. Extensive experiments indicate that **$T^2$-VLM** achieves state-of-the-art performance in two robot manipulation benchmarks, demonstrating superior reward accuracy with reduced computation consumption. We believe our approach not only advances reward generation techniques but also contributes to the broader field of embodied AI.
Poster
Hongjin Lyu · Bo Li · Paul Rosin · Yu-Kun Lai

[ Exhibit Hall I ]

Abstract
Image colorization is a typical ill-posed problem. Among various colorization methods, scribble-based methods have a unique advantage that allows users to accurately resolve ambiguities and modify the colors of any objects to suit their specific tastes. However, due to the time-consuming scribble drawing process, users tend to draw sparse scribbles instead of dense and detailed scribbles, which makes it challenging for existing methods, especially for regions with no immediate scribbles. Facing the above problems, this paper proposes a novel colorization algorithm named Local and Global Affinity Net (LGA-Net) that formulates the scribble-based colorization task as an affinity propagation process at both local and global levels. Instead of predicting color values directly, our neural network learns to predict local and global affinity relationships between pixels for a given grayscale input, describing how colors should be propagated, which are independent of the scribbles. Given reliable affinity relationships, the color propagation process is formulated as a maximum a posteriori problem. Both local and global affinities are represented using a weighted graph and enabled by a graph Laplacian regularizer to ensure accurate color propagation. Extensive experiments demonstrate that LGA-Net produces state-of-the-art colorization results when using sparse scribbles.
Poster
Aneel Damaraju · Dean Hazineh · Todd Zickler

[ Exhibit Hall I ]

Abstract
Vision benefits from grouping pixels into objects and understanding their spatial relationships, both laterally and in depth. This is captured by a scene representation comprising an occlusion-ordered stack of "object layers," each containing an isolated and amodally-completed object. To infer this representation from an image, we introduce a diffusion-based architecture named Concurrent Object Layers (CObL). CObL generates a stack of object layers concurrently, using Stable Diffusion as a prior for natural objects, and using inference-time guidance to ensure the inferred layers composite back to the input image. We train CObL using a few thousand synthetically-generated images of multi-object tabletop scenes, and we find that it zero-shot generalizes to scenes of real-world tabletops with varying numbers of novel objects. In contrast to recent models for amodal object completion, CObL reconstructs multiple partially-occluded objects without any user prompting and without knowing the number of objects beforehand; and unlike previous models for object-centric representation learning, CObL is not limited to the closed world it was trained in.
Poster
Matthew Beveridge · Shree Nayar

[ Exhibit Hall I ]

Abstract
We introduce a taxonomy of solid materials for hierarchical material recognition from local appearance. Our taxonomy is motivated by vision applications, and is arranged according to the physical traits of materials. We contribute a diverse dataset of images and aligned depth maps of materials in the wild. The depth maps can be used to generate novel views to augment the dataset. Utilizing the taxonomy and dataset, we present a learning-based approach to hierarchical material recognition that uses graph neural networks. Our model leverages taxonomic proximity between material classes, and achieves state-of-the-art performance. We show that our model has the potential to generalize in few-shot learning settings. As a result, it achieves coarse classification of underrepresented materials.
Poster
Haoran Wang · Zekun Li · Jian Zhang · Lei Qi · Yinghuan Shi

[ Exhibit Hall I ]

Abstract
Large vision models like the Segment Anything Model (SAM) exhibit significant limitations when applied to downstream tasks in the wild. Consequently, reference segmentation, which leverages reference images and their corresponding masks to impart novel knowledge to the model, emerges as a promising new direction for adapting vision models. However, existing reference segmentation approaches predominantly rely on meta-learning, which still necessitates an extensive meta-training process and incurs massive data and computational costs. In this study, we propose a novel approach by representing the inherent correspondence between reference-target image pairs as a pseudo video. This perspective allows the latest version of SAM, known as SAM2, which is equipped with interactive video object segmentation (iVOS) capabilities, to be adapted to downstream tasks in a lightweight manner. We term this approach Correspondence As Video for SAM (CAV-SAM). CAV-SAM comprises two key modules: the Diffusion-Based Semantic Transition (DBST) module employs a diffusion model to construct a semantic transformation sequence, while the Test-Time Geometric Alignment (TTGA) module aligns the geometric changes within this sequence through test-time fine-tuning. We evaluated CAV-SAM on widely-used datasets, achieving segmentation performance improvements exceeding 5% over SOTA methods. Implementation is provided in the supplementary materials.
Poster
Vladislav Bargatin · Egor Chistov · Alexander Yakovenko · Dmitriy Vatolin

[ Exhibit Hall I ]

Abstract
Recent advances in optical flow estimation have prioritized accuracy at the cost of growing GPU memory consumption, particularly for high-resolution (FullHD) inputs. We introduce MEMFOF, a memory-efficient multi-frame optical flow method that identifies a favorable trade-off between multi-frame estimation and GPU memory usage. Notably, MEMFOF requires only 2.09 GB of GPU memory at runtime for 1080p inputs, and 28.5 GB during training, which uniquely positions our method to be trained at native 1080p without the need for cropping or downsampling. We systematically revisit design choices from RAFT-like architectures, integrating reduced correlation volumes and high-resolution training protocols alongside multi-frame estimation, to achieve state-of-the-art performance across multiple benchmarks while substantially reducing memory overhead. Our method outperforms more resource-intensive alternatives in both accuracy and runtime efficiency, validating its robustness for flow estimation at high resolutions. At the time of submission, our method ranks first on the Spring benchmark with a 1-pixel (1px) outlier rate of 3.289. On Sintel (clean), we share first place with the 5-frame VideoFlow-MOF, achieving an endpoint error (EPE) of 0.991, and on KITTI-2015, we place first with an Fl-all error of 2.94%. Ablation studies demonstrate the critical role of multi-frame strategies, correlation-volume scaling, and resolution-aware training in striking an optimal …
Poster
Qingwang Zhang · Yingying Zhu

[ Exhibit Hall I ]

Abstract
This paper addresses the limitations of existing cross-view object geo-localization schemes, which rely on rectangular proposals to localize irregular objects in satellite imagery. These "rectangular shackles" inherently struggle to precisely define objects with complex geometries, leading to incomplete coverage or erroneous localization. We propose a novel scheme, cross-view object segmentation (CVOS), which achieves fine-grained geo-localization by predicting pixel-level segmentation masks of query objects. CVOS enables accurate extraction of object shapes, sizes, and areas—critical for applications like urban planning and agricultural monitoring. We also created the CVOGL-Seg dataset specifically to support and evaluate CVOS. To tackle CVOS challenges, we introduce Transformer Object Geo-localization (TROGeo), a two-stage framework. First, the Heterogeneous Task Training Stage (HTTS) employs a single transformer encoder with a Cross-View Object Perception Module (CVOPM) and is trained by learning a heterogeneous task. Second, the SAM Prompt Stage (SPS) utilizes SAM's zero-shot segmentation capability, guided by HTTS outputs, to generate precise masks. We extensively evaluate our method on the CVOGL and CVOGL-Seg datasets and demonstrate state-of-the-art performance compared to existing models. Our work demonstrates that CVOS breaks the rectangular shackles and unlocks new potential for fine-grained object geo-localization.
Poster
Xianghui Xie · Jan Lenssen · Gerard Pons-Moll

[ Exhibit Hall I ]

Abstract
We propose MVGBench, a comprehensive benchmark for multi-view image generation models (MVGs) that evaluates 3D consistency in geometry and texture, image quality, and semantics (using vision language models). Recently, MVGs have been the main driving force in 3D object creation. However, existing metrics compare generated images against ground truth target views, which is not suitable for generative tasks where multiple solutions exist while differing from the ground truth. Furthermore, different MVGs are trained on different view angles, synthetic data and specific lightings -- robustness to these factors and generalization to real data are rarely evaluated thoroughly. Without a rigorous evaluation protocol, it is also unclear what design choices contribute to the progress of MVGs. MVGBench evaluates three different aspects: best setup performance, generalization to real data and robustness. Instead of comparing against ground truth, we introduce a novel 3D self-consistency metric which compares 3D reconstructions from disjoint generated multi-views. We systematically compare 12 existing MVGs on 4 different curated real and synthetic datasets. With our analysis, we identify important limitations of existing methods, especially in terms of robustness and generalization, and we find the most critical design choices. Using the discovered best practices, we propose ViFiGen, a method that outperforms all evaluated …
Poster
Dongwoo Kang · Akhil Perincherry · Zachary Coalson · Aiden Gabriel · Stefan Lee · Sanghyun Hong

[ Exhibit Hall I ]

Abstract
An emerging paradigm in vision-and-language navigation (VLN) is the use of history-aware multi-modal transformer models. Given a language instruction, these models process observation and navigation history to predict the most appropriate action for an agent. While they have significantly improved performance, the scale of these models can be a bottleneck in practical settings with limited computational resources. In this work, we propose a novel input adaptive navigation method to enhance VLN model efficiency. We first show that existing input-adaptive mechanisms fail to reduce computations without substantial performance degradation. To address this, we introduce three adaptive algorithms, each deployed at a different level: (1) To improve spatial efficiency, we selectively process panoramic views at each observation of an agent. (2) To improve intra-model efficiency, we propose importance-based adaptive thresholding for the early-exit methods. (3) To improve temporal efficiency, we implement a caching mechanism that prevents reprocessing of views previously seen by the agent. In evaluations on seven VLN benchmarks, we demonstrate over a 2$\times$ reduction in computation across three off-the-shelf agents in both standard and continuous environments.
Poster
Shaocong Dong · Lihe Ding · Xiao Chen · Yaokun Li · Yuxin WANG · Yucheng Wang · Qi WANG · Jaehyeok Kim · Chenjian Gao · Zhanpeng Huang · Zibin Wang · Tianfan Xue · Dan Xu

[ Exhibit Hall I ]

Abstract
To generate 3D objects, early research focused on multi-view-driven approaches relying solely on 2D renderings. Recently, the 3D native latent diffusion paradigm has demonstrated superior performance in 3D generation, because it fully leverages the geometric information provided in ground truth 3D data. Despite its fast development, 3D diffusion still faces three challenges. First, the majority of these methods represent a 3D object by a single latent, regardless of its complexity. This may lead to detail loss when generating 3D objects with multiple complicated parts. Second, most 3D assets are designed part by part, yet the current holistic latent representation overlooks the independence of these parts and their interrelationships, limiting the model's generative ability. Third, current methods rely on global conditions (e.g., text, image, point cloud) to control the generation process, lacking detailed controllability. Therefore, motivated by how 3D designers create a 3D object, we present a new part-based 3D generation framework, CoPart, which represents a 3D object with multiple contextual part latents and simultaneously generates coherent 3D parts. This part-based framework has several advantages, including: i) it reduces the encoding burden of intricate objects by decomposing them into simpler parts, ii) it facilitates part learning and part relationship modeling, and iii) it naturally …
Poster
Xin Wang · Xinlin Wang · Shuiping Gou

[ Exhibit Hall I ]

Abstract
Vision-based geolocation techniques that establish spatial correspondences between smaller query images and larger georeferenced images have gained significant attention. Existing approaches typically employ a separate "retrieve-then-match" paradigm, yet such paradigms suffer from computational inefficiency or precision limitations. To this end, we propose TopicGeo, a unified framework for direct and precise query-to-reference image matching via three key innovations. The textual object semantics, called topics, distilled from CLIP prompt learning are embedded into the geolocation framework to eliminate intra-class and inter-class distribution discrepancies while also enhancing processing efficiency. Center-based adaptive label assignment and outlier rejection mechanisms, as a joint retrieval-matching optimization strategy, ensure task-coherent feature learning and precise spatial correspondences. A multi-level fine matching pipeline is introduced to refine matching in both quality and quantity. Evaluations on large-scale synthetic and real-world datasets illustrate that TopicGeo achieves state-of-the-art performance in retrieval recall and matching accuracy while maintaining a balance in computational efficiency.
Poster
Haoyu Wu · Jingyi Xu · Hieu Le · Dimitris Samaras

[ Exhibit Hall I ]

Abstract
Token merging can effectively accelerate various vision systems by processing groups of similar tokens only once and sharing the results across them. However, existing token grouping methods are often ad hoc and random, disregarding the actual content of the samples. We show that preserving high-information tokens during merging—those essential for semantic fidelity and structural details—significantly improves sample quality, producing finer details and more coherent, realistic generations. Despite being simple and intuitive, this approach remains underexplored. To this end, we propose an importance-based token merging method that prioritizes the most critical tokens in computational resource allocation, leveraging readily available importance scores, such as those from classifier-free guidance in diffusion models. Experiments show that our approach significantly outperforms baseline methods across multiple applications, including text-to-image synthesis, multi-view image generation, and video generation with various model architectures such as Stable Diffusion, Zero123++, AnimateDiff, or PixArt-$\alpha$.
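A minimal PyTorch sketch of importance-based token merging in the spirit described above (hypothetical names; the importance scores could come, e.g., from classifier-free-guidance magnitudes, and the paper's actual merging rule may differ):

```python
import torch
import torch.nn.functional as F

def importance_token_merge(tokens, importance, keep_ratio=0.5):
    """Keep the most important tokens and map every dropped token to its most
    similar kept token, so heavy computation runs only on the kept set.

    tokens:     (N, D) token features.
    importance: (N,) per-token importance scores.
    """
    n_keep = max(1, int(tokens.shape[0] * keep_ratio))
    keep_idx = importance.topk(n_keep).indices
    kept = tokens[keep_idx]
    # Cosine similarity between every token and the kept tokens.
    sim = F.normalize(tokens, dim=-1) @ F.normalize(kept, dim=-1).T
    assign = sim.argmax(dim=-1)          # each token -> index into the kept set
    return kept, keep_idx, assign

def unmerge(processed_kept, assign):
    """Share processed results of the kept tokens back to all original tokens."""
    return processed_kept[assign]
```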
Poster
Yunuo Chen · Zezheng Lyu · Bing He · Ning Cao · Gang chen · Guo Lu · Wenjun Zhang

[ Exhibit Hall I ]

Abstract
Recent learned image compression (LIC) models have achieved remarkable rate-distortion (RD) performance, yet their high computational complexity severely limits practical deployment. To overcome this challenge, we propose a novel Stage-wise Modular Distillation framework, SMoDi, which efficiently compresses LIC models while preserving RD performance. This framework treats each stage of LIC models as an independent sub-task, mirroring the teacher model's task decomposition to the student, thereby simplifying knowledge transfer. We identify two crucial factors determining the effectiveness of knowledge distillation: student model construction and loss function design. Specifically, we first propose Teacher-Guided Student Model Construction, a pruning-like method ensuring architectural consistency between teacher and student models. Next, we introduce Implicit End-to-end Supervision, facilitating adaptive energy compaction and bitrate regularization. Based on these insights, we develop KDIC, a lightweight student model derived from the state-of-the-art S2CFormer model. Experimental results demonstrate that KDIC achieves top-tier RD performance with significantly reduced computational complexity. To our knowledge, this work is among the first successful applications of knowledge distillation to learned image compression.
Poster
Zihan Wang · Jeff Tan · Tarasha Khurana · Neehar Peri · Deva Ramanan

[ Exhibit Hall I ]

Abstract
We address the problem of dynamic scene reconstruction from sparse-view videos. Prior work often requires dense multi-view captures with hundreds of calibrated cameras (e.g. Panoptic Studio) - such multi-view setups are prohibitively expensive to build and cannot capture diverse scenes in-the-wild. In contrast, we aim to reconstruct dynamic human behaviors, such as repairing a bike or dancing, from a small set of sparse-view cameras with complete scene coverage (e.g. four equidistant inward-facing static cameras). We find that dense multi-view reconstruction methods struggle to adapt to this sparse-view setup due to limited overlap between viewpoints. To address these limitations, we carefully align independent monocular reconstructions of each camera to produce time- and view-consistent dynamic scene reconstructions. Extensive experiments on PanopticStudio and Ego-Exo4D demonstrate that our method achieves higher quality reconstructions than prior art, particularly when rendering novel views.
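One standard way to align independent per-camera monocular reconstructions into a common frame is a similarity fit over corresponding 3D points; below is a minimal Umeyama sketch under that assumption (names are ours, and the paper's alignment may involve more than a point-wise similarity fit):

```python
import numpy as np

def umeyama_alignment(src, dst):
    """Similarity transform (scale s, rotation R, translation t) minimizing
    || s * R @ src.T + t - dst.T || for corresponding 3D points (N, 3),
    e.g. points of one camera's monocular reconstruction matched to a
    reference reconstruction.
    """
    mu_s, mu_d = src.mean(0), dst.mean(0)
    xs, xd = src - mu_s, dst - mu_d
    cov = xd.T @ xs / len(src)
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1                       # enforce a proper rotation
    R = U @ S @ Vt
    var_s = (xs ** 2).sum() / len(src)
    s = np.trace(np.diag(D) @ S) / var_s
    t = mu_d - s * R @ mu_s
    return s, R, t
```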
Poster
Boqian Li · Zeyu Cai · Michael Black · Haiwen Feng · Yuliang Xiu

[ Exhibit Hall I ]

Abstract
Fitting a body to a 3D clothed human point cloud is a common yet challenging task. Traditional optimization-based approaches use multi-stage pipelines that are sensitive to pose initialization, while recent learning-based methods often struggle with generalization across diverse poses and garment types. We propose Equivariant Tightness Fitting for Clothed Humans, or ETCH, a novel pipeline that estimates cloth-to-body surface mapping through locally approximate SE(3) equivariance, encoding tightness as displacement vectors from the cloth surface to the underlying body. Following this mapping, pose-invariant body features regress sparse body markers, simplifying clothed human fitting into an inner-body marker fitting task. Extensive experiments on CAPE and 4D-Dress show that ETCH significantly outperforms state-of-the-art methods -- both tightness-agnostic and tightness-aware -- in body fitting accuracy on loose clothing (16.7% ~ 69.5%) and shape accuracy (average 49.9%). It also reduces directional errors by 67.2% ~ 89.8% in few-shot settings (<1% data). Qualitative results demonstrate strong performance regardless of body shape, loose clothing, or challenging poses. We will release the code and models for research purposes.
Poster
David Serrano · Aditya Arora · Luis Herranz · Kosta Derpanis · Michael Brown · Javier Vazquez-Corral

[ Exhibit Hall I ]

Abstract
White balance (WB) correction in scenes with multiple illuminants remains a persistent challenge in computer vision. Recent methods explored fusion-based approaches, where a neural network linearly blends multiple sRGB versions of an input image, each processed with predefined WB presets. However, we demonstrate that these methods are suboptimal for common multi-illuminant scenarios. Additionally, existing fusion-based methods rely on sRGB WB datasets lacking dedicated multi-illuminant images, limiting both training and evaluation. To address these challenges, we introduce two key contributions. First, we propose an efficient transformer-based model that effectively captures spatial dependencies across sRGB WB presets, substantially improving upon linear fusion techniques. Second, we introduce a large-scale multi-illuminant dataset comprising over 16,000 sRGB images rendered with five different WB settings, along with WB-corrected images. Our method achieves up to 100% improvement over existing techniques on our new multi-illuminant image fusion dataset. We will release our code and dataset upon acceptance.
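For reference, the per-pixel preset blending that fusion-based WB correction builds on can be sketched as below (hypothetical shapes and names; the paper's contribution is a transformer that captures spatial dependencies across the preset set to predict better fusion weights, not this blending step itself):

```python
import torch

def fuse_wb_presets(presets, weight_logits):
    """Spatially varying fusion of white-balance presets.

    presets:       (P, 3, H, W) sRGB renderings of the same capture under P
                   predefined WB presets.
    weight_logits: (P, H, W) per-pixel fusion logits predicted by a network.
    Returns the (3, H, W) fused, WB-corrected image.
    """
    weights = weight_logits.softmax(dim=0).unsqueeze(1)  # (P, 1, H, W), sums to 1 per pixel
    return (weights * presets).sum(dim=0)
```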
Poster
Siyu Chen · Ting Han · Changshe Zhang · Xin Luo · Meiliu Wu · Guorong Cai · Jinhe Su

[ Exhibit Hall I ]

Abstract
Vision Foundation Models (VFMs) have delivered remarkable performance in Domain Generalized Semantic Segmentation (DGSS). However, recent methods often overlook the fact that visual cues are susceptible to domain shifts, whereas the underlying geometry remains stable, rendering depth information more robust. In this paper, we investigate the potential of integrating depth information with features from VFMs, to improve the geometric consistency within an image and boost the generalization performance of VFMs. We propose a novel fine-tuning DGSS framework, named DepthForge, which integrates the visual cues from frozen DINOv2 or EVA02 and depth cues from frozen Depth Anything V2. In each layer of the VFMs, we incorporate depth-aware learnable tokens to continuously decouple domain-invariant visual and spatial information, thereby enhancing depth awareness and attention of the VFMs. Finally, we develop a depth refinement decoder and integrate it into the model architecture to adaptively refine multi-layer VFM features and depth-aware learnable tokens. Extensive experiments are conducted based on various DGSS settings and five different datasets as unseen target domains. The qualitative and quantitative results demonstrate that our method significantly outperforms alternative approaches with stronger performance, steadier visual-spatial attention, and superior generalization ability. In particular, DepthForge exhibits outstanding performance under extreme conditions (e.g., night and snow). Code …
Poster
Mingtao Feng · Longlong Mei · Zijie Wu · Jianqiao Luo · Fenghao Tian · Jie Feng · Weisheng Dong · Yaonan Wang

[ Exhibit Hall I ]

Abstract
Text to point cloud cross-modal localization is a crucial vision-language task for future human-robot collaboration. Existing coarse-to-fine frameworks assume that each query text precisely corresponds to the center area of a submap, limiting their applicability in real-world scenarios. This work redefines the task under a more realistic assumption, relaxing the one-to-one retrieval constraint by allowing partially matching query text and submap pairs. To address this challenge, we augment datasets with partially matching submaps and introduce an uncertainty-aware framework. Specifically, we model cross-modal ambiguity in fine-grained location regression by integrating uncertainty scores, represented as 2D Gaussian distributions, to mitigate the impact of challenging samples. Additionally, we propose an uncertainty-aware similarity metric that enhances similarity assessment between query text and submaps by propagating uncertainty into coarse place recognition, enabling the model to learn discriminative features, effectively handle partially matching samples, and improve task synergy. Extensive experiments on KITTI360Pose and CityRefer demonstrate that our method achieves state-of-the-art performance across both stages. Our code will be publicly available.
Poster
Zhuoyuan Li · Jiahao Lu · Jiacheng Deng · Hanzhi Chang · Lifan Wu · Yanzhe Liang · Tianzhu Zhang

[ Exhibit Hall I ]

Abstract
The open vocabulary capability of 3D models is increasingly valued, as traditional methods with models trained on fixed categories fail to recognize unseen objects in complex dynamic 3D scenes. In this paper, we propose a simple yet effective approach, SAS, to integrate the open vocabulary capability of multiple 2D models and migrate it to the 3D domain. Specifically, we first propose Model Alignment via Text to map different 2D models into the same embedding space using text as a bridge. Then, we propose Annotation-Free Model Capability Construction to explicitly quantify each 2D model's capability of recognizing different categories using diffusion models. Following this, point cloud features from different 2D models are fused under the guidance of the constructed model capabilities. Finally, the integrated 2D open vocabulary capability is transferred to the 3D domain through feature distillation. SAS outperforms previous methods by a large margin across multiple datasets, including ScanNet v2, Matterport3D, and nuScenes, while its generalizability is further validated on downstream tasks, e.g., Gaussian segmentation and instance segmentation.
Poster
Yijun Yang · Zhao-Yang Wang · Qiuping Liu · Shu Wen Sun · Kang Wang · Rama Chellappa · Zongwei Zhou · Alan Yuille · Lei Zhu · Yu-Dong Zhang · Jieneng Chen

[ Exhibit Hall I ]

Abstract
Providing effective treatment and making informed decisions are essential goals of modern medicine and clinical care. We are interested in simulating disease dynamics for clinical decision-making, leveraging recent advances in large generative models. To this end, we introduce the Medical World Model (MeWM), the first world model in medicine that predicts future disease states based on clinical decisions. MeWM comprises (i) vision-language models to serve as policy models, and (ii) tumor generative models as dynamics models. The policy model generates action plans, such as clinical treatments, while the dynamics model simulates tumor progression or regression under given treatment conditions. Building on this, we propose an inverse dynamics model that applies survival analysis to the simulated post-treatment tumor, enabling the evaluation of treatment efficacy and the selection of the optimal clinical action plan. As a result, the proposed MeWM simulates disease dynamics by synthesizing post-treatment tumors, with state-of-the-art specificity in Turing tests evaluated by radiologists. Simultaneously, its inverse dynamics model outperforms medical-specialized GPTs in optimizing individualized treatment protocols across all metrics. Notably, MeWM improves clinical decision-making for interventional physicians, boosting the F1-score for selecting the optimal TACE protocol by 13%, paving the way for future integration of medical world models as second readers.
Poster
Yanrui Bin · Wenbo Hu · Haoyuan Wang · Xinya Chen · Bing WANG

[ Exhibit Hall I ]

Abstract
Surface normal estimation serves as a cornerstone for a spectrum of computer vision applications. While numerous efforts have been devoted to static image scenarios, ensuring temporal coherence in video-based normal estimation remains a formidable challenge. Instead of merely augmenting existing methods with temporal components, we present NormalCrafter to leverage the inherent temporal priors of video diffusion models. To secure high-fidelity normal estimation across sequences, we propose Semantic Feature Regularization (SFR), which aligns diffusion features with semantic cues, encouraging the model to concentrate on the intrinsic semantics of the scene. Moreover, we introduce a two-stage training protocol that leverages both latent and pixel space learning to preserve spatial accuracy while maintaining long temporal context. Extensive evaluations demonstrate the efficacy of our method, showcasing superior performance in generating temporally consistent normal sequences with intricate details from diverse videos. Code and models will be publicly available.
Poster
Han Han · Wei Zhai · Yang Cao · Bin Li · Zheng-Jun Zha

[ Exhibit Hall I ]

Abstract
Tracking Any Point (TAP) plays a crucial role in motion analysis. Video-based approaches rely on iterative local matching for tracking, but they assume linear motion during the blind time between frames, which leads to point loss under large displacements or nonlinear motion. The high temporal resolution and motion blur-free characteristics of event cameras provide continuous, fine-grained motion information, capturing subtle variations with microsecond precision. This paper presents an event-based framework for tracking any point, which tackles the challenges posed by spatial sparsity and motion sensitivity in events through two tailored modules. Specifically, to resolve ambiguities caused by event sparsity, a motion-guidance module incorporates kinematic vectors into the local matching process. Additionally, a variable motion aware module is integrated to ensure temporally consistent responses that are insensitive to varying velocities, thereby enhancing matching precision. To validate the effectiveness of the approach, two event datasets for tracking any point are constructed by simulation. The method improves the $Survival_{50}$ metric by 17.9% over an event-only tracking-any-point baseline. Moreover, on standard feature tracking benchmarks, it outperforms all existing methods, even those that combine events and video frames.
Poster
Ruijie Zhu · Mulin Yu · Linning Xu · Lihan Jiang · Yixuan Li · Tianzhu Zhang · Jiangmiao Pang · Bo Dai

[ Exhibit Hall I ]

Abstract
3D Gaussian Splatting is renowned for its high-fidelity reconstructions and real-time novel view synthesis, yet its lack of semantic understanding limits object-level perception. In this work, we propose ObjectGS, an object-aware framework that unifies 3D scene reconstruction with semantic understanding. Instead of treating the scene as a unified whole, ObjectGS models individual objects as local anchors that generate neural Gaussians and share object IDs, enabling precise object-level reconstruction. During training, we dynamically grow or prune these anchors and optimize their features, while a one-hot ID encoding with a classification loss enforces clear semantic constraints. We show through extensive experiments that ObjectGS not only outperforms state-of-the-art methods on open-vocabulary and panoptic segmentation tasks, but also integrates seamlessly with applications like mesh extraction and scene editing.
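A tiny PyTorch sketch of the one-hot ID classification constraint in spirit (names are ours; we assume per-pixel ID logits obtained by compositing each Gaussian's one-hot ID encoding, which simplifies the paper's full pipeline):

```python
import torch
import torch.nn.functional as F

def object_id_classification_loss(rendered_id_logits, gt_id_map):
    """Semantic constraint via per-pixel object-ID classification.

    rendered_id_logits: (K, H, W) per-pixel logits over K object IDs, e.g.
                        alpha-composited one-hot ID encodings of the Gaussians.
    gt_id_map:          (H, W) integer object IDs from 2D segmentation.
    """
    K = rendered_id_logits.shape[0]
    logits = rendered_id_logits.permute(1, 2, 0).reshape(-1, K)  # (H*W, K)
    return F.cross_entropy(logits, gt_id_map.reshape(-1))
```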
Poster
Zhiqiang Yan · Zhengxue Wang · Haoye Dong · Jun Li · Jian Yang · Gim Hee Lee

[ Exhibit Hall I ]

Abstract
We introduce DuCos, a novel depth super-resolution framework grounded in Lagrangian duality theory, offering a flexible integration of multiple constraints and reconstruction objectives to enhance accuracy and robustness. Our DuCos is the first to significantly improve generalization across diverse scenarios with foundation models as prompts. The prompt design consists of two key components: Correlative Fusion (CF) and Gradient Regulation (GR). CF facilitates precise geometric alignment and effective fusion between prompt and depth features, while GR refines depth predictions by enforcing consistency with sharp-edged depth maps derived from foundation models. Crucially, these prompts are seamlessly embedded into the Lagrangian constraint term, forming a synergistic and principled framework. Extensive experiments demonstrate that DuCos outperforms existing state-of-the-art methods, achieving superior accuracy, robustness, and generalization. The source codes and pre-trained models will be publicly available.
Poster
Muhammad Usama Saleem · Ekkasit Pinyoanuntapong · Mayur Patel · Hongfei Xue · Ahmed Helmy · Srijan Das · Pu Wang

[ Exhibit Hall I ]

Abstract
Reconstructing a 3D hand mesh from a single RGB image is challenging due to complex articulations, self-occlusions, and depth ambiguities. Traditional discriminative methods, which learn a deterministic mapping from a 2D image to a single 3D mesh, often struggle with the inherent ambiguities in 2D-to-3D mapping. To address this challenge, we propose MaskHand, a novel generative masked model for hand mesh recovery that synthesizes plausible 3D hand meshes by learning and sampling from the probabilistic distribution of the ambiguous 2D-to-3D mapping process. MaskHand consists of two key components: (1) a VQ-MANO, which encodes 3D hand articulations as discrete pose tokens in a latent space, and (2) a Context-Guided Masked Transformer that randomly masks out pose tokens and learns their joint distribution, conditioned on corrupted token sequence, image context, and 2D pose cues. This learned distribution facilitates confidence-guided sampling during inference, producing mesh reconstructions with low uncertainty and high precision. Extensive evaluations on benchmark and real-world datasets demonstrate that MaskHand achieves state-of-the-art accuracy, robustness, and realism in 3D hand mesh reconstruction. Project website: https://anonymous-ml-model.github.io/MaskHand.
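A generic sketch of confidence-guided iterative decoding for masked pose tokens (the `predict_logits` callable is a hypothetical stand-in for the context-guided masked transformer conditioned on image context and 2D pose cues; the unmasking schedule here is a simplification):

```python
import torch

@torch.no_grad()
def confidence_guided_sampling(predict_logits, seq_len, steps=8, mask_id=-1):
    """Start from a fully masked token sequence and, at each step, commit the
    highest-confidence predictions while re-masking the rest, so the final
    pose tokens carry low uncertainty.

    predict_logits: callable taking an (L,) token sequence (mask_id marks
                    masked slots) and returning (L, V) logits.
    """
    tokens = torch.full((seq_len,), mask_id, dtype=torch.long)
    for step in range(steps):
        probs = predict_logits(tokens).softmax(-1)       # (L, V)
        conf, pred = probs.max(-1)
        # Already-committed positions keep maximal confidence so they stay kept.
        conf = torch.where(tokens == mask_id, conf, torch.ones_like(conf))
        n_keep = max(1, int(seq_len * (step + 1) / steps))
        keep = conf.topk(n_keep).indices
        new_tokens = torch.full_like(tokens, mask_id)
        new_tokens[keep] = pred[keep]
        new_tokens[tokens != mask_id] = tokens[tokens != mask_id]
        tokens = new_tokens
    return tokens

# Example with a dummy predictor returning random logits over 512 pose tokens:
# logits_fn = lambda toks: torch.randn(toks.shape[0], 512)
# pose_tokens = confidence_guided_sampling(logits_fn, seq_len=64)
```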
Poster
Ziyue Huang · Yongchao Feng · Ziqi Liu · Shuai Yang · Qingjie Liu · Yunhong Wang

[ Exhibit Hall I ]

Abstract
Remote sensing object detection has made significant progress, but most studies still focus on closed-set detection, limiting generalization across diverse datasets. Open-vocabulary object detection (OVD) provides a solution by leveraging multimodal associations between text prompts and visual features. However, existing OVD methods for remote sensing (RS) images are constrained by small-scale datasets and fail to address the unique challenges of remote sensing interpretation, include oriented object detection and the need for both high precision and real-time performance in diverse scenarios. To tackle these challenges, we propose OpenRSD, a universal open-prompt RS object detection framework. OpenRSD supports multimodal prompts and integrates multi-task detection heads to balance accuracy and real-time requirements. Additionally, we design a multi-stage training pipeline to enhance the generalization of model. Evaluated on seven public datasets, OpenRSD demonstrates superior performance in oriented and horizontal bounding box detection, with real-time inference capabilities suitable for large-scale RS image analysis. Compared to YOLO-World, OpenRSD exhibits an 8.7% higher average precision and achieves an inference speed of 20.8 FPS. Codes and models will be released.
Poster
Maolin Wei · Wanzhou Liu · Eshed Ohn-Bar

[ Exhibit Hall I ]

Abstract
If a Large Language Model (LLM) were to take a driving knowledge test today, would it pass? Beyond standard spatial and visual question answering (QA) tasks on current autonomous driving benchmarks, driving knowledge tests require a complete understanding of all traffic rules, signage, and right-of-way principles. To pass this test, human drivers must discern various edge cases that rarely appear in real-world datasets. In this work, we present RoadRules, an extensive open-source text and vision-based benchmark that exhaustively covers traffic regulations and scenarios. Through our experiments using RoadRules, we show that (1) state-of-the-art LLMs and Multimodal LLMs (MLLMs) perform well on basic traffic rules but exhibit significant weaknesses in numerical reasoning and complex right-of-way scenarios, traffic sign variations, and spatial layouts (2) fine-tuning on RoadRules improves accuracy across multiple categories, particularly in regulatory sign recognition and intersection decision-making, (3) controlled variations in RoadRules-V provide insights into model sensitivity to environmental factors such as lighting, perspective, distance, and weather conditions, and (4) pretraining on RoadRules enhances downstream driving task performance, leading to improved results on real-world datasets such as nuScenes and DriveLM, while also demonstrating that models can internalize text and synthetic traffic knowledge to generalize effectively across downstream QA tasks. …
Poster
Huixin Sun · Yanjing Li · Linlin Yang · Xianbin Cao · Baochang Zhang

[ Exhibit Hall I ]

Abstract
Despite advances in generic object detection, there remains a performance gap in detecting small objects compared to normal-scale objects. We reveal that conventional object localization methods suffer from gradient instability on small objects due to sharper loss curvature, leading to a convergence challenge. To address this issue, we propose Uncertainty-Aware Gradient Stabilization (UGS), a framework that reformulates object localization as a classification task to stabilize gradients. UGS quantizes continuous labels into non-uniform discrete interval representations. Under a classification-based objective, the localization branch generates bounded and confidence-driven gradients, mitigating instability. Furthermore, UGS integrates an uncertainty minimization (UM) loss that reduces prediction variance and an uncertainty-guided refinement (UR) module that identifies and refines high-uncertainty regions via perturbations. Evaluated on four benchmarks, UGS consistently improves anchor-based, anchor-free, and state-of-the-art small object detectors. In particular, UGS boosts the prior art DNTR by 3.2% AP on the VisDrone dataset. The code will be released upon acceptance.
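A small PyTorch sketch of reformulating localization as classification over non-uniform offset bins, with a soft expectation decode (the bin construction and names are our assumptions, not the paper's exact scheme):

```python
import torch
import torch.nn.functional as F

def make_nonuniform_bins(max_offset=16.0, n_bins=32):
    """Non-uniform interval centers (denser near zero) for offset quantization."""
    edges = torch.linspace(0, 1, n_bins + 1) ** 2 * max_offset
    centers = 0.5 * (edges[:-1] + edges[1:])
    return torch.cat([-centers.flip(0), centers])        # symmetric around zero

def localization_as_classification(pred_logits, target_offset, bins):
    """Classification-style localization loss plus expectation decoding.

    pred_logits:   (N, 2*n_bins) logits over the discrete offset bins.
    target_offset: (N,) continuous regression targets (e.g. box-edge offsets).
    """
    target_bin = (target_offset[:, None] - bins[None, :]).abs().argmin(dim=1)
    loss = F.cross_entropy(pred_logits, target_bin)       # bounded, confidence-driven gradients
    decoded = (pred_logits.softmax(-1) * bins).sum(-1)    # soft, differentiable decode
    return loss, decoded
```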
Poster
Sangwon Baik · Hyeonwoo Kim · Hanbyul Joo

[ Exhibit Hall I ]

Abstract
We present a method for learning 3D spatial relationships between object pairs, referred to as object-object spatial relationships (OOR), by leveraging synthetically generated 3D samples from pre-trained 2D diffusion models. We hypothesize that images synthesized by 2D diffusion models inherently capture plausible and realistic OOR cues, enabling efficient ways to collect a 3D dataset to learn OOR for various unbounded object categories. Our approach begins by synthesizing diverse images that capture plausible OOR cues, which we then uplift into 3D samples. Leveraging our diverse collection of plausible 3D samples for the object pairs, we train a score-based OOR diffusion model to learn the distribution of their relative spatial relationships. Additionally, we extend our pairwise OOR to multi-object OOR by enforcing consistency across pairwise relations. Extensive experiments demonstrate the robustness of our method across various object-object spatial relationships, along with its applicability to real-world 3D scene arrangement tasks using the OOR diffusion model.
Poster
Yuru Jia · Valerio Marsocci · Ziyang Gong · Xue Yang · Maarten Vergauwen · Andrea Nascetti

[ Exhibit Hall I ]

Abstract
Self-supervised learning (SSL) has revolutionized representation learning in Remote Sensing (RS), advancing Geospatial Foundation Models (GFMs) to leverage vast unlabeled satellite imagery for diverse downstream tasks. Currently, GFMs primarily focus on discriminative objectives, such as contrastive learning or masked image modeling, owing to their proven success in learning transferable representations. However, generative diffusion models—which demonstrate the potential to capture multi-grained semantics essential for RS tasks during image generation—remain underexplored for discriminative applications. This prompts the question: can generative diffusion models also excel and serve as GFMs with sufficient discriminative power? In this work, we answer this question with SatDiFuser, a framework that transforms a diffusion-based generative geospatial foundation model into a powerful pretraining tool for discriminative RS. By systematically analyzing multi-stage, noise-dependent diffusion features, we develop three fusion strategies to effectively leverage these diverse representations. Extensive experiments on remote sensing benchmarks show that SatDiFuser outperforms state-of-the-art GFMs, achieving gains of up to +5.7% mIoU in semantic segmentation and +7.9% F1-score in classification, demonstrating the capacity of diffusion-based generative foundation models to rival or exceed discriminative GFMs.
Poster
Xiaoxiao Wang · Chunxiao Li · Peng Sun · Boming Miao · Yunjian Zhang · Yao Zhu

[ Exhibit Hall I ]

Abstract
Human keypoint detection is fundamental in computer vision, with applications in pose estimation and action recognition. However, existing evaluation metrics (e.g., OKS, PCP, PDJ) rely on human-annotated ground truth, a labor-intensive process that increases costs and limits scalability. To address this, we propose KPAScore (KeyPoint-Answering Score), an annotation-free metric independent of ground truth. It evaluates keypoint detection using a two-stage VLM-based question-answering process: first, the VLM identifies the presence of keypoints within the image, and second, visual prompts are introduced to query the likelihood of each keypoint being accurately localized within a predefined boundary. To validate the rationale behind KPAScore, we propose KPUBench (KeyPoint Understanding Benchmark), which comprehensively evaluates the VLM's ability to determine keypoint presence and localization. Extensive experiments demonstrate KPAScore’s effectiveness from three perspectives: consistency under keypoint variation, correlation with traditional metrics, and alignment with human perception. We hope KPAScore will reduce reliance on manual annotations, facilitating broader adoption of keypoint detection in real-world applications.
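A hedged sketch of how such a two-stage, annotation-free score could be computed; `vlm_yes_probability` is a hypothetical helper standing in for the VLM query, and the prompt wording and boundary radius are illustrative rather than the paper's exact protocol.

```python
# Sketch of a two-stage VLM question-answering score for predicted keypoints.
from typing import Callable, Dict, Tuple

def kpa_style_score(image,
                    keypoints: Dict[str, Tuple[float, float]],
                    vlm_yes_probability: Callable[[object, str], float],
                    radius_px: int = 20) -> float:
    scores = []
    for name, (x, y) in keypoints.items():
        # Stage 1: is the keypoint present/visible in the image at all?
        p_present = vlm_yes_probability(
            image, f"Is the {name} of the person visible in this image?")
        # Stage 2: with a visual prompt (e.g., a drawn circle), is the
        # predicted location within the marked boundary?
        p_localized = vlm_yes_probability(
            image, f"Is the {name} inside the circle of radius {radius_px}px "
                   f"centered at ({x:.0f}, {y:.0f})?")
        scores.append(p_present * p_localized)
    return sum(scores) / max(len(scores), 1)
```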
Poster
Chengkai Hou · Yanjie Ze · Yankai Fu · Zeyu Gao · Songbo Hu · Yue Yu · Shanghang Zhang · Huazhe Xu

[ Exhibit Hall I ]

Abstract
General visual representations learned from web-scale datasets for robotics have achieved great success in recent years, enabling data-efficient robot learning on manipulation tasks; yet these pre-trained representations are mostly learned from 2D images, neglecting the inherent 3D nature of the world. However, due to the scarcity of large-scale 3D data, it is still hard to extract a universal 3D representation from web datasets. Instead, we seek a general visual pre-training framework that could improve all 3D representations as an alternative. Our framework, called FVP, is a novel 4D Visual Pre-training framework for real-world robot learning. FVP frames the visual pre-training objective as a next-point-cloud-prediction problem, models the prediction model as a diffusion model, and pre-trains the model directly on large public datasets. Across twelve real-world manipulation tasks, FVP boosts the average success rate of 3D Diffusion Policy (DP3) for these tasks by 28%. The FVP pre-trained DP3 achieves state-of-the-art performance across imitation learning methods. Moreover, FVP remains effective across various point cloud encoders and datasets. Finally, we apply FVP to the RDT-1B, a larger Vision-Language-Action robotic model, enhancing its performance on various robot tasks.
Poster
Jiakai Zhang · Shouchen Zhou · Haizhao Dai · Xinhang Liu · Peihao Wang · Zhiwen Fan · Yuan Pei · Jingyi Yu

[ Exhibit Hall I ]

Abstract
Pose estimation from unordered images is fundamental for 3D reconstruction, robotics, and scientific imaging. Recent geometric foundation models, such as DUSt3R, enable end-to-end dense 3D reconstruction but remain underexplored in scientific imaging fields like cryo-electron microscopy (cryo-EM) for near-atomic protein reconstruction. In cryo-EM, pose estimation and 3D reconstruction from unordered particle images still depend on time-consuming iterative optimization, primarily due to challenges such as low signal-to-noise ratios (SNR) and distortions from the contrast transfer function (CTF). We introduce CryoFastAR, the first geometric foundation model that can directly predict poses from noisy cryo-EM images for Fast ab initio Reconstruction. By integrating multi-view features and training on large-scale simulated cryo-EM data with realistic noise and CTF modulations, CryoFastAR enhances pose estimation accuracy and generalization. To enhance training stability, we propose a progressive training strategy that first allows the model to extract essential features under simpler conditions before gradually increasing difficulty to improve robustness. Experiments show that CryoFastAR achieves comparable quality while significantly accelerating inference compared with traditional iterative approaches on both synthetic and real datasets. We will release our code, models, and datasets to stimulate further research.
Poster
Huachao Zhu · Zelong Liu · Zhichao Sun · Yuda Zou · Gui-Song Xia · Yongchao Xu

[ Exhibit Hall I ]

Abstract
Recognizing out-of-distribution (OoD) objects on roads is crucial for safe driving. Most existing methods rely on segmentation models' uncertainty as anomaly scores, often resulting in false positives - especially at ambiguous regions like boundaries, where segmentation models inherently exhibit high uncertainty. Additionally, it is challenging to define a suitable threshold to generate anomaly masks, especially with the inconsistencies in predictions across consecutive frames. We propose DetSeg, a novel paradigm that helps incorporate object-level understanding. DetSeg first detects all objects in the open world and then suppresses in-distribution (ID) bounding boxes, leaving only OoD proposals. These proposals can either help previous methods eliminate false positives (DetSeg-$\mathcal{R}$), or generate binary anomaly masks without complex threshold search when combined with a box-prompted segmentation module (DetSeg-$\mathcal{S}$). Additionally, we introduce vanishing point guided Hungarian matching (VPHM) to smooth the prediction results within a video clip, mitigating abrupt variations of predictions between consecutive frames. Comprehensive experiments on various benchmarks demonstrate that DetSeg significantly improves performance, reducing the FPR$\it{_{95}}$ of previous methods by up to 37.45\%, offering a more robust and practical solution for this domain.
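The proposal-filtering step can be pictured with the following sketch; the detection format and the in-distribution class list are assumptions for illustration, not the released DetSeg code.

```python
# Sketch: detect everything in the open world, then drop boxes whose predicted
# class matches the in-distribution (ID) label set, keeping the rest as
# out-of-distribution proposals.
ID_CLASSES = {"road", "car", "truck", "person", "traffic sign", "building"}

def ood_proposals(detections, id_classes=ID_CLASSES, min_score=0.3):
    """detections: list of dicts like {'box': (x1, y1, x2, y2),
    'label': str, 'score': float} from an open-world detector."""
    proposals = []
    for det in detections:
        if det["score"] < min_score:
            continue                   # ignore weak detections
        if det["label"].lower() in id_classes:
            continue                   # suppress in-distribution boxes
        proposals.append(det["box"])   # remaining boxes are OoD candidates
    return proposals
```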
Poster
Yu Wang · Bo Dang · Wanchun Li · Wei Chen · Yansheng Li

[ Exhibit Hall I ]

Abstract
With the increasing resolution of remote sensing imagery (RSI), large-size RSI has emerged as a vital data source for high-precision vector mapping of geographic objects. Existing methods are typically constrained to processing small image patches, which often leads to the loss of contextual information and produces fragmented vector outputs. To address these issues, this paper introduces \textbf{HoliTracer}, the first framework designed to holistically extract vectorized geographic objects from large-size RSI. In HoliTracer, we enhance segmentation of large-size RSI using the Context Attention Net (CAN), which employs a local-to-global attention mechanism to capture contextual dependencies. Furthermore, we achieve holistic vectorization through a robust pipeline that leverages the Mask Contour Reformer (MCR) to reconstruct polygons and the Polygon Sequence Tracer (PST) to trace vertices. Extensive experiments on large-size RSI datasets, including buildings, water bodies, and roads, demonstrate that HoliTracer outperforms state-of-the-art methods. Our code will be made publicly available.
Poster
Tomasz Niewiadomski · Anastasios Yiannakidis · Hanz Cuevas Velasquez · Soubhik Sanyal · Michael Black · Silvia Zuffi · Peter Kulits

[ Exhibit Hall I ]

Abstract
The model-based estimation of 3D animal pose and shape from images enables computational modeling of animal behavior. Training models for this purpose requires large amounts of labeled image data with precise pose and shape annotations. However, capturing such data requires the use of multi-view or marker-based motion-capture systems, which are impractical to adapt to wild animals in situ and impossible to scale across a comprehensive set of animal species. Some have attempted to address the challenge of procuring training data by pseudo-labeling individual real-world images through manual 2D annotation, followed by 3D-parameter optimization to those labels. While this approach may produce silhouette-aligned samples, the obtained pose and shape parameters are often implausible due to the ill-posed nature of the monocular fitting problem. Sidestepping real-world ambiguity, others have designed complex synthetic-data-generation pipelines leveraging video-game engines and collections of artist-designed 3D assets. Such engines yield perfect ground-truth annotations but are often lacking in visual realism and require considerable manual effort to adapt to new species or environments. Motivated by these shortcomings, we propose an alternative approach to synthetic-data generation: rendering with a conditional image-generation model. We introduce a pipeline that samples a diverse set of poses and shapes for a variety of mammalian quadrupeds and generates realistic images with corresponding …
Poster
Haiwen Feng · Junyi Zhang · Qianqian Wang · Yufei Ye · Pengcheng Yu · Michael Black · Trevor Darrell · Angjoo Kanazawa

[ Exhibit Hall I ]

Abstract
Dynamic 3D reconstruction and point tracking in videos are typically treated as separate tasks, despite their deep connection. We propose St4RTrack, a feed-forward framework that simultaneously reconstructs and tracks dynamic video content in a world coordinate frame from RGB inputs. This is achieved by predicting two appropriately defined pointmaps for a pair of frames captured at different moments. Specifically, we predict both pointmaps at the same moment, in the same world, capturing both static and dynamic scene geometry while maintaining 3D correspondences. Chaining these predictions through the video sequence with respect to a reference frame naturally computes long-range correspondences, effectively combining 3D reconstruction with 3D tracking. Unlike prior methods that rely heavily on 4D ground truth supervision, we employ a novel adaptation scheme based on a reprojection loss. We establish a new extensive benchmark for world-frame reconstruction and tracking, demonstrating the effectiveness and efficiency of our unified, data-driven framework.
Poster
Leekyeung Han · Hyunji Min · Gyeom Hwangbo · Jonghyun Choi · Paul Hongsuck Seo

[ Exhibit Hall I ]

Abstract
We introduce DialNav, a novel dialog-based navigation task, where an embodied agent (Navigator) collaborates with a remote guide (Guide) through multi-turn dialog to reach a goal location. Unlike prior works, our setting requires the Guide to infer the Navigator's location from dialog alone, making dialog crucial for success. To support this task, we collect and release the Remote Assistance in Navigation (RAIN) dataset, consisting of human-human dialogs paired with navigation trajectories in photorealistic environments. We design a comprehensive benchmark evaluating both navigation and dialog, and conduct extensive experiments analyzing the impact of different Navigator and Guide models. We highlight key challenges and publicly release the dataset, code, and evaluation framework to foster advancements in dialog-based embodied AI.
Poster
Taewoo Kim · Kuk-Jin Yoon

[ Exhibit Hall I ]

Abstract
In low-light environments, a longer exposure time is generally required to enhance image visibility; however, this setting inevitably causes motion blur. Even with a long exposure time, videos captured in low-light environments still suffer from issues such as low visibility, low contrast, and color distortion. Additionally, the long exposure time results in videos with a low frame rate. Therefore, videos captured in low-light exhibit low visibility and motion blur, as well as low frame rates. To overcome these limitations, we propose a novel problem aimed at transforming motion-blurred, low-frame-rate videos with poor visibility in low-light environments into high-frame-rate videos while simultaneously enhancing their visibility. To tackle this challenge, we leverage the unique advantages of event cameras, which capture scene changes asynchronously, providing superior temporal resolution and a wider dynamic range compared to conventional frame-based cameras. These properties make event cameras particularly effective in reducing motion blur, compensating for low frame rates, and enhancing visibility in low-light conditions. To this end, we developed a hybrid camera system that integrates two RGB cameras and an event camera, capturing a dedicated dataset for this task and proposing novel network architectures to effectively address this problem. For future work, we plan to release the …
Poster
Haoyi Zhu · Yifan Wang · Jianjun Zhou · Wenzheng Chang · Yang Zhou · Zizun Li · Junyi Chen · Chunhua Shen · Jiangmiao Pang · Tong He

[ Exhibit Hall I ]

Abstract
The integration of geometric reconstruction and generative modeling remains a critical challenge in developing AI systems capable of human-like spatial reasoning. This paper proposes Aether, a unified framework that enables geometry-aware reasoning in world models by jointly optimizing three core capabilities: (1) 4D dynamic reconstruction, (2) action-conditioned video prediction, and (3) goal-conditioned visual planning. Through task-interleaved feature learning, Aether achieves synergistic knowledge sharing across reconstruction, prediction, and planning objectives. Building upon video generation models, our framework demonstrates unprecedented synthetic-to-real generalization despite never observing real-world data during training. Furthermore, our approach achieves zero-shot generalization in both action following and reconstruction tasks, thanks to its intrinsic geometric modeling. Remarkably, even without real-world data, its reconstruction performance far exceeds that of domain-specific models. Additionally, Aether leverages a geometry-informed action space to seamlessly translate predictions into actions, enabling effective autonomous trajectory planning. We hope our work inspires the community to explore new frontiers in physically-reasonable world modeling and its applications.
Poster
Hyeongjin Nam · Donghwan Kim · Gyeongsik Moon · Kyoung Mu Lee

[ Exhibit Hall I ]

Abstract
The misaligned human texture across different human parts is one of the main limitations of existing 3D human reconstruction methods. Each human part, such as a jacket or pants, should maintain a distinct texture without blending into others. The structural coherence of human parts serves as a crucial cue to infer human textures in the invisible regions of a single image. However, most existing 3D human reconstruction methods do not explicitly exploit such part segmentation priors, leading to misaligned textures in their reconstructions. In this regard, we present PARTE, which uses 3D human part information as a key guide to reconstruct 3D human textures. Our framework comprises two core components. First, to infer 3D human part information from a single image, we propose a 3D part segmentation module (PartSegmenter) that initially reconstructs a textureless human surface and predicts human part labels based on the textureless surface. Second, to incorporate part information into texture reconstruction, we introduce a part-guided texturing module (PartTexturer), which acquires prior knowledge from a pre-trained image generation network on texture alignment of human parts. Our extensive experiments demonstrate that PARTE achieves state-of-the-art quality in 3D human reconstruction. We will release our code.
Poster
Kim Kiehn · Albin Ahlbäck · Kathlén Kohn

[ Exhibit Hall I ]

Abstract
We completely classify all minimal problems for Structure-from-Motion (SfM) where arrangements of points and lines are fully observed by multiple uncalibrated pinhole cameras. We find 291 minimal problems, 73 of which have unique solutions and can thus be solved linearly. Two of the linear problems allow an arbitrary number of views, while all other minimal problems have at most 9 cameras. All minimal problems have at most 7 points and at most 12 lines. We compute the number of solutions of each minimal problem, as this gives a measurement of the problem's intrinsic difficulty, and find that these numbers are relatively low (e.g., when comparing with minimal problems for calibrated cameras). Finally, by exploring stabilizer subgroups of subarrangements, we develop a geometric and systematic way to 1) factorize minimal problems into smaller problems, 2) identify minimal problems in underconstrained problems, and 3) formally prove non-minimality.
Poster
Haoye Dong · Gim Hee Lee

[ Exhibit Hall I ]

Abstract
Human pose sequence refinement plays a crucial role in improving the accuracy, smoothness, and temporal coherence of pose estimation across a sequence of frames. Despite its importance in real-world applications, human pose sequence refinement has received less attention than human pose estimation. In this paper, we propose PS-Mamba, a novel framework that refines human pose sequences by integrating spatial-temporal graph learning with state space modeling. Specifically, we introduce the Spatial-Temporal Graph State Space (ST-GSS) block, which captures spatial and temporal dependencies across joints to smooth pose sequences while preserving structural integrity. The spatial-temporal graph models intricate joint interactions, while the state space component effectively manages temporal dynamics, reducing both short- and long-term pose instability. Additionally, we incorporate a dynamic graph weight matrix to adaptively model the relative influence of joint interactions, further mitigating pose ambiguity. Extensive experiments on challenging benchmarks demonstrate that our PS-Mamba outperforms SOTAs, achieving $\mathbf{-14.21}$ mm MPJPE (+18.5\%), $\mathbf{-13.59}$ mm PA-MPJPE (+22.1\%), and $\mathbf{-0.42}$ mm/s² ACCEL (+9.7\%) compared to SynSP on AIST++, significantly reducing jitters and enhancing pose stability. Our code has been submitted as supplementary and will be open-sourced upon acceptance.
Poster
Amin Karimi Monsefi · Mridul Khurana · Rajiv Ramnath · Anuj Karpatne · Wei-Lun (Harry) Chao · Cheng Zhang

[ Exhibit Hall I ]

Abstract
We propose TaxaDiffusion, a taxonomy-informed training framework for diffusion models to generate fine-grained animal images with high morphological and identity accuracy. Unlike standard approaches that treat each species as an independent category, TaxaDiffusion incorporates domain knowledge that many species exhibit strong visual similarities, with distinctions often residing in subtle variations of shape, pattern, and color. To exploit these relationships, TaxaDiffusion progressively trains conditioned diffusion models across different taxonomic levels --- starting from broad classifications such as Class and Order, refining through Family and Genus, and ultimately distinguishing at the Species level. This hierarchical learning strategy first captures coarse-grained morphological traits shared by species with common ancestors, facilitating knowledge transfer, before refining fine-grained differences for species-level distinction. As a result, TaxaDiffusion enables accurate generation even with limited training samples per species. Extensive experiments on three fine-grained animal datasets demonstrate that TaxaDiffusion outperforms existing approaches, achieving superior fidelity in fine-grained animal image generation. Our model and code will be publicly available.
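The coarse-to-fine schedule can be summarized by the following sketch; `train_diffusion` and the taxonomy label fields are placeholders rather than the authors' API.

```python
# Sketch: fine-tune a conditional diffusion model level by level down the
# taxonomy, reusing the weights learned at the previous (coarser) level.
TAXONOMY_LEVELS = ["class", "order", "family", "genus", "species"]

def train_taxadiffusion_style(model, dataset, train_diffusion,
                              epochs_per_level=10):
    for level in TAXONOMY_LEVELS:
        # Condition only on labels at the current taxonomic rank; coarser
        # ranks group many species, so early stages see more samples per label.
        labels = [sample["taxonomy"][level] for sample in dataset]
        model = train_diffusion(model, dataset, labels,
                                epochs=epochs_per_level)
    return model
```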
Poster
Zhenjun Yu · Wenqiang Xu · Pengfei Xie · Yutong Li · Brian Anthony · Zhuorui Zhang · Cewu Lu

[ Exhibit Hall I ]

Abstract
We present ViTaM-D, a novel visual-tactile framework for reconstructing dynamic hand-object interaction with distributed tactile sensing to enhance contact modeling. Existing methods, relying solely on visual inputs, often fail to capture occluded interactions and object deformation. To address this, we introduce DF-Field, a distributed force-aware contact representation leveraging kinetic and potential energy in hand-object interactions. ViTaM-D first reconstructs interactions using a visual network with contact constraint, then refines contact details through force-aware optimization, improving object deformation modeling. To evaluate deformable object reconstruction, we introduce the HOT dataset, featuring 600 hand-object interaction sequences in a high-precision simulation environment. Experiments on DexYCB and HOT datasets show that ViTaM-D outperforms state-of-the-art methods in reconstruction accuracy for both rigid and deformable objects. DF-Field also proves more effective in refining hand poses and enhancing contact modeling than previous refinement methods. The code, models, and datasets will be made public.
Poster
Shijie Zhou · Alexander Vilesov · Xuehai He · Ziyu Wan · Shuwang Zhang · Aditya Nagachandra · Di Chang · Dongdong Chen · Xin Wang · Achuta Kadambi

[ Exhibit Hall I ]

Abstract
Vision-language models (VLMs) have shown remarkable capabilities in integrating linguistic and visual reasoning but remain fundamentally limited in understanding dynamic spatiotemporal interactions. Humans effortlessly track and reason about object movements, rotations, and perspective shifts—abilities essential for robust real-world understanding yet notably lacking in current VLMs. In this paper, we introduce VLM4D, the first benchmark specifically designed to evaluate the spatiotemporal reasoning capabilities of VLMs. Our benchmark comprises diverse real-world and synthetic videos accompanied by carefully curated question-answer pairs emphasizing translational and rotational motions, perspective awareness, and motion continuity. Through comprehensive evaluations of state-of-the-art open and closed-source VLMs, we identify significant performance gaps compared to human baselines, highlighting fundamental deficiencies in existing models. Extensive analysis reveals that VLMs struggle particularly with integrating multiple visual cues and maintaining temporal coherence. We further explore promising directions, such as leveraging 4D feature field reconstruction and targeted spatiotemporal supervised fine-tuning, demonstrating their effectiveness in enhancing spatiotemporal comprehension. Our work aims to encourage deeper exploration into improving VLMs’ spatial and temporal grounding, paving the way towards more capable and reliable visual intelligence for dynamic environments.
Poster
Evan Casey · Tianyu Zhang · Shu Ishida · John Thompson · Amir Khasahmadi · Joseph Lambourne · Pradeep Kumar Jayaraman · Karl Willis

[ Exhibit Hall I ]

Abstract
We adapt alignment techniques from reasoning LLMs to the task of generating engineering sketch constraints found in computer-aided design (CAD) models. Engineering sketches consist of geometric primitives (e.g. points, lines) connected by constraints (e.g. perpendicular, tangent) that define the relationships between them. For a design to be easily editable, the constraints must effectively capture design intent, ensuring the geometry updates predictably when parameters change. Although current approaches can generate CAD designs, an open challenge remains: aligning model outputs with design intent, a problem we label `design alignment'. A critical first step towards aligning generative CAD models is to generate constraints which fully-constrain all geometric primitives, without over-constraining or distorting sketch geometry. Using alignment techniques to train an existing constraint generation model with feedback from a constraint solver, we are able to fully-constrain 93\% of sketches compared to 34\% when using a naïve supervised fine-tuning (SFT) baseline and only 8.9\% without alignment. Our approach can be applied to any existing constraint generation model and sets the stage for further research bridging alignment strategies between the language and design domains.
Poster
Qian Liang · Ruixu Geng · Jinbo Chen · Haoyu Wang · Yan Chen · Yang Hu

[ Exhibit Hall I ]

Abstract
Remote physiological measurement based on video and radar has made significant progress in recent years. However, unimodal methods based solely on video or radar sensor have notable limitations due to their measurement principles, and multimodal remote photoplethysmography (rPPG) that combines these modalities has emerged as a promising direction. Despite its potential, the lack of large-scale multimodal data and the significant modality gap between video and radar pose substantial challenges in building robust video-radar rPPG models. To handle these problems, we suggest leveraging unimodal pre-training and present the Spatial alignment and Temporal Matching (SATM) Adapter to effectively fine-tune pre-trained unimodal backbones into a multimodal rPPG model. Given the distinct measurement principles of video- and radar-based methods, we propose Spatial Alignment to align the spatial distribution of their features. Furthermore, Temporal Matching is applied to mitigate waveform discrepancies between video and radar signals. By integrating these two modules into adapters, the unimodal backbones could retain their modality-specific knowledge while effectively extracting complementary features from each other. Extensive experiments across various challenging scenarios, including low light conditions and head motions, demonstrate that our approach significantly surpasses the state-of-the-art methods. Code will be released upon acceptance.
Poster
Yusuke Hirota · Ryo Hachiuma · Boyi Li · Ximing Lu · Michael Boone · Boris Ivanovic · Yejin Choi · Marco Pavone · Yu-Chiang Frank Wang · Noa Garcia · Yuta Nakashima · Chao-Han Yang

[ Exhibit Hall I ]

Abstract
Gender bias in vision-language foundation models (VLMs) raises concerns about their safe deployment and is typically evaluated using benchmarks with gender annotations on real-world images. However, as these benchmarks often contain spurious correlations between gender and non-gender features, such as objects and backgrounds, we identify a critical oversight in gender bias evaluation: Do confounding features distort gender bias evaluation? To address this question, we systematically perturb non-gender features across four widely used benchmarks (COCO-gender, FACET, MIAP, and PHASE) and various VLMs to quantify their impact on bias measurements. Our findings reveal that even minimal perturbations, such as masking just 10% of objects or weakly blurring backgrounds, can dramatically alter bias scores, shifting metrics by up to 175% in generative VLMs and 43% in CLIP variants. This suggests that current bias evaluations often reflect model responses to confounders rather than true gender bias, undermining their reliability. Since creating confounder-free benchmarks is fundamentally challenging, we recommend reporting bias metrics alongside confounder-sensitivity measurements to enable a more reliable assessment of gender bias in VLMs.
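A sketch of the recommended confounder-sensitivity report, assuming hypothetical helpers `bias_metric`, `mask_objects`, and `blur_background`; the 10% mask ratio and blur strength mirror the perturbations described above but are otherwise illustrative.

```python
# Sketch: perturb non-gender features and report how much the bias metric moves.
def confounder_sensitivity(model, images, bias_metric,
                           mask_objects, blur_background,
                           mask_ratio=0.1, blur_sigma=2.0):
    base = bias_metric(model, images)
    masked = bias_metric(model, [mask_objects(img, ratio=mask_ratio)
                                 for img in images])
    blurred = bias_metric(model, [blur_background(img, sigma=blur_sigma)
                                  for img in images])
    # Large relative shifts suggest the metric is reacting to confounders
    # rather than to gender bias itself.
    return {
        "base": base,
        "shift_mask_%": 100.0 * abs(masked - base) / max(abs(base), 1e-8),
        "shift_blur_%": 100.0 * abs(blurred - base) / max(abs(base), 1e-8),
    }
```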
Poster
Peizheng Li · Shuxiao Ding · You Zhou · Qingwen Zhang · Onat Inak · Larissa Triess · Niklas Hanselmann · Marius Cordts · Andreas Zell

[ Exhibit Hall I ]

Abstract
Open-world 3D semantic occupancy prediction aims to generate a voxelized 3D representation from sensor inputs while recognizing both known and unknown objects. Transferring open-vocabulary knowledge from vision-language models (VLMs) offers a promising direction but remains challenging. However, methods based on VLM-derived 2D pseudo-labels with traditional supervision are limited by a predefined label space and lack general prediction capabilities. Direct alignment with pretrained image embeddings, on the other hand, fails to achieve reliable performance due to often inconsistent image and text representations in VLMs. To address these challenges, we propose AGO, a novel 3D occupancy prediction framework with adaptive grounding to handle diverse open-world scenarios. AGO first encodes surrounding images and class prompts into 3D and text embeddings, respectively, leveraging similarity-based grounding training with 3D pseudo-labels. Additionally, a modality adapter maps 3D embeddings into a space aligned with VLM-derived image embeddings, reducing modality gaps. Experiments on Occ3D-nuScenes show that AGO improves unknown object prediction in zero-shot and few-shot transfer while achieving state-of-the-art closed-world self-supervised performance, surpassing prior methods by 4.09 mIoU. The code will be released upon acceptance.
Poster
Quanmin Liang · Qiang Li · Shuai Liu · Xinzi Cao · Jinyi Lu · Feidiao Yang · Wei Zhang · Kai Huang · Yonghong Tian

[ Exhibit Hall I ]

Abstract
Applying the pretraining-finetuning paradigm to event cameras presents significant challenges due to the scarcity of large-scale event datasets and the inherently sparse nature of event data, which increases the risk of overfitting during extensive pretraining. In this paper, we explore the transfer of pretrained image knowledge to the domain of event cameras to address this challenge. The key to our approach lies in adapting event data representations to align with image pretrained models while simultaneously integrating spatiotemporal information and mitigating data sparsity. To achieve this, we propose a lightweight SpatioTemporal information fusion Prompting (STP) method, which progressively fuses the spatiotemporal characteristics of event data through a dynamic perception module with multi-scale spatiotemporal receptive fields, enabling compatibility with image pretrained models. STP enhances event data representation by capturing local information within a large receptive field and performing global information exchange along the temporal dimension. This strategy effectively reduces sparse regions in event data while refining fine-grained details, all while preserving its inherent spatiotemporal structure. Our method significantly outperforms previous state-of-the-art approaches across classification, semantic segmentation, and optical flow estimation tasks. For instance, it achieves a top-1 accuracy of 68.87\% (+4.04\%) on N-ImageNet with only 1/10 of the pretraining parameters and 1/3 of the training …
Poster
Jiahao Xia · Yike Wu · Wenjian Huang · Jianguo Zhang · Jian Zhang

[ Exhibit Hall I ]

Abstract
Part-level features are crucial for image understanding, but few studies focus on them because of the lack of fine-grained labels. Although unsupervised part discovery can eliminate the reliance on labels, most of them cannot maintain robustness across various categories and scenarios, which restricts their application range. To overcome this limitation, we present a more effective paradigm for unsupervised part discovery, named Masked Part Autoencoder (MPAE). It first learns part descriptors as well as a feature map from the inputs and produces patch features from a masked version of the original images. Then, the masked regions are filled with the learned part descriptors based on the similarity between the local features and descriptors. By restoring these masked patches using the part descriptors, they become better aligned with their part shapes, guided by appearance features from unmasked patches. Finally, MPAE robustly discovers meaningful parts that closely match the actual object shapes, even in complex scenarios. Moreover, several looser yet more effective constraints are proposed to enable MPAE to identify the presence of parts across various scenarios and categories in an unsupervised manner. This provides the foundation for addressing challenges posed by occlusion and for exploring part similarity across multiple categories. Extensive experiments …
Poster
Shaobo Zhang · Yuhang Huang · Wanqing Zhao · Wei Zhao · Ziyu Guan · Jinye Peng

[ Exhibit Hall I ]

Abstract
This paper introduces EA6D, a novel diffusion-based framework for 6D pose estimation that operates effectively in any environment. Traditional pose estimation methods struggle with the variability and complexity of real-world scenarios, often leading to overfitting on controlled datasets and poor generalization to new scenes. To address these challenges, we propose a generative pose estimation paradigm that generates environment-independent object representations for pose estimation, which are robust to environmental variations such as illumination, occlusion, and background clutter. Specifically, we propose the novel Environment Decoupling Diffusion Models (EDDM) which separates object representations from environmental factors while enabling efficient few-step sampling by leveraging input image priors instead of pure noise initialization. We validate our approach on four standard benchmarks and a self-made dataset DiverseScenes. The results demonstrate that EA6D, trained using only synthetic data, can outperform the state-of-the-art methods with both synthetic and realistic data. In particular, for fair comparisons with synthetic data, we can exceed the previous SOTA by $18.1\%$ and $33.5\%$ on LINEMOD and Linemod-Occluded datasets respectively.
Poster
Peng-Hao Hsu · Ke Zhang · Fu-En Wang · Tao Tu · Ming-Feng Li · Yu-Lun Liu · Albert Y. C. Chen · Min Sun · Cheng-Hao Kuo

[ Exhibit Hall I ]

Abstract
Open-vocabulary (OV) 3D object detection is an emerging field, yet its exploration through image-based methods remains limited compared to 3D point cloud-based methods. We introduce OpenM3D, a novel open-vocabulary multi-view indoor 3D object detector trained without human annotations. In particular, OpenM3D is a single-stage detector adapting the 2D-induced voxel features from the ImGeoNet model. To support OV, it is jointly trained with a class-agnostic 3D localization loss requiring high-quality 3D pseudo boxes and a voxel-semantic alignment loss requiring diverse pre-trained CLIP features. We follow the training setting of OV-3DET where posed RGB-D images are given but no human annotations of 3D boxes or classes are available. We propose a 3D Pseudo Box Generation method using a graph embedding technique that combines 2D segments into coherent 3D structures. Our pseudo-boxes achieve higher precision and recall than other methods, including the method proposed in OV-3DET. We further sample diverse CLIP features from 2D segments associated with each coherent 3D structure to align with the corresponding voxel feature. The key to training a highly accurate single-stage detector is learning both losses toward high-quality targets. At inference, OpenM3D, a highly efficient detector, requires only multi-view images as input and demonstrates superior accuracy …
Poster
Duo Wu · Jinghe Wang · Yuan Meng · Yanning Zhang · Le Sun · Zhi Wang

[ Exhibit Hall I ]

Abstract
Utilizing large language models (LLMs) for tool planning has emerged as a promising avenue for developing general AI systems, where LLMs automatically schedule external tools (e.g., vision models) to tackle complex tasks based on task descriptions. To push this paradigm toward practical applications, it is crucial for LLMs to consider tool execution costs (e.g., execution time) for tool planning. Unfortunately, prior studies overlook the tool execution costs, leading to the generation of expensive plans whose costs outweigh their benefits in terms of task performance. To fill this gap, we propose the Cost-Aware Tool Planning with LLMs (CATP-LLM) framework, which for the first time provides a coherent design to empower LLMs for cost-aware tool planning. Specifically, to facilitate efficient concurrent tool execution and cost reduction, we design a tool planning language that enables the LLM to create multi-branch, non-sequential plans. Moreover, we propose a cost-aware offline reinforcement learning algorithm to fine-tune the LLM to optimize the performance-cost trade-off in tool planning. In the absence of public cost-related datasets, we further present OpenCATP, the first dataset for cost-aware planning, which comprises 11,100 evaluation samples from diverse tasks. Extensive experiments show that CATP-LLM outperforms GPT-4 even when using Llama2-7B as its backbone, with the …
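The performance-cost trade-off can be illustrated with a toy utility function for ranking candidate plans; the reward and cost numbers below are made up, and the linear trade-off is an assumption for illustration, not the paper's learned objective.

```python
# Toy sketch: rank candidate tool plans by expected reward minus weighted cost.
def plan_utility(expected_reward: float, execution_cost: float,
                 cost_weight: float = 0.5) -> float:
    return expected_reward - cost_weight * execution_cost

candidates = [
    {"plan": "detector -> captioner",              "reward": 0.82, "cost": 1.4},
    {"plan": "detector -> segmenter -> captioner", "reward": 0.85, "cost": 3.1},
]
best = max(candidates, key=lambda c: plan_utility(c["reward"], c["cost"]))
print(best["plan"])  # the cheaper plan wins despite slightly lower reward
```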
Poster
Qiaole Dong · Yanwei Fu

[ Exhibit Hall I ]

Abstract
Dense point tracking is a challenging task requiring the continuous tracking of every point in the initial frame throughout a substantial portion of a video, even in the presence of occlusions. Traditional methods use optical flow models to directly estimate long-range motion, but they often suffer from appearance drifting without considering temporal consistency. Recent point tracking algorithms usually depend on sliding windows for indirect information propagation from the first frame to the current one, which is slow and less effective for long-range tracking. To account for temporal consistency and enable efficient information propagation, we present a lightweight and fast model with $\textbf{S}$treaming memory for dense $\textbf{PO}$int $\textbf{T}$racking and online video processing. The $\textbf{SPOT}$ framework features three core components: a customized memory reading module for feature enhancement, a sensory memory for short-term motion dynamics modeling, and a visibility-guided splatting module for accurate information propagation. This combination enables SPOT to perform dense point tracking with state-of-the-art accuracy on the CVO benchmark, as well as comparable or superior performance to offline models on sparse tracking benchmarks such as TAP-Vid and RoboTAP. Notably, SPOT with 10$\times$ smaller parameter numbers operates at least 2$\times$ faster than previous state-of-the-art models while maintaining the best performance on …
Poster
Dadong Jiang · Zhi Hou · Zhihui Ke · Xianghui Yang · Xiaobo Zhou · Tie Qiu

[ Exhibit Hall I ]

Abstract
Dynamic scene reconstruction is a long-term challenge in 3D vision. Recent methods extend 3D Gaussian Splatting to dynamic scenes via additional deformation fields and apply explicit constraints like motion flow to guide the deformation. However, they learn motion changes from individual timestamps independently, making it challenging to reconstruct complex scenes, particularly when dealing with violent movement, extreme-shaped geometries, or reflective surfaces. To address the above issue, we design a plug-and-play module called TimeFormer to enable existing deformable 3D Gaussians reconstruction methods with the ability to implicitly model motion patterns from a learning perspective. Specifically, TimeFormer includes a Cross-Temporal Transformer Encoder, which adaptively learns the temporal relationships of deformable 3D Gaussians. Furthermore, we propose a two-stream optimization strategy that transfers the motion knowledge learned from TimeFormer to the base stream during the training phase. This allows us to remove TimeFormer during inference, thereby preserving the original rendering speed. Extensive experiments in the multi-view and monocular dynamic scenes validate qualitative and quantitative improvement brought by TimeFormer. Project Page: https://anonymous-create-ui.github.io/TimeFormer
Poster
Ryan Po · Yotam Nitzan · Richard Zhang · Berlin Chen · Tri Dao · Eli Shechtman · Gordon Wetzstein · Xun Huang

[ Exhibit Hall I ]

Abstract
Video diffusion models have recently shown promise for world modeling through autoregressive frame prediction conditioned on actions. However, they struggle to maintain long-term memory due to the high computational cost associated with processing extended sequences in attention layers. To overcome this limitation, we propose a novel architecture leveraging state-space models (SSMs) to extend temporal memory without compromising computational efficiency. Unlike previous approaches that retrofit SSMs for non-causal vision tasks, our method fully exploits the inherent advantages of SSMs in causal sequence modeling. Central to our design is a block-wise SSM scanning scheme, which strategically trades off spatial consistency for extended temporal memory, combined with dense local attention to ensure coherence between consecutive frames. We evaluate the long-term memory capabilities of our model through spatial retrieval and reasoning tasks over extended horizons. Experiments on Memory Maze and Minecraft datasets demonstrate that our approach surpasses baselines in preserving long-range memory, while maintaining practical inference speeds suitable for interactive applications.
Poster
Shaojie Ma · Yawei Luo · Wei Yang · Yi Yang

[ Exhibit Hall I ]

Abstract
3D reconstruction and simulation, although interrelated, have distinct objectives: reconstruction requires a flexible 3D representation that can adapt to diverse scenes, while simulation needs a structured representation to model motion principles effectively. This paper introduces the Mesh-adsorbed Gaussian Splatting (MaGS) method to address this challenge. MaGS constrains 3D Gaussians to roam near the mesh, creating a mutually adsorbed mesh-Gaussian 3D representation. Such representation harnesses both the rendering flexibility of 3D Gaussians and the structured property of meshes. To achieve this, we introduce RMD-Net, a network that learns motion priors from video data to refine mesh deformations, alongside RGD-Net, which models the relative displacement between the mesh and Gaussians to enhance rendering fidelity under mesh constraints. To generalize to novel, user-defined deformations beyond input video without reliance on temporal data, we propose MPE-Net, which leverages inherent mesh information to bootstrap RMD-Net and RGD-Net. Due to the universality of meshes, MaGS is compatible with various deformation priors such as ARAP, SMPL, and soft physics simulation. Extensive experiments on the D-NeRF, DG-Mesh, and PeopleSnapshot datasets demonstrate that MaGS achieves state-of-the-art performance in both reconstruction and simulation.
Poster
Yunwei Lan · Zhigao Cui · Xin Luo · Chang Liu · Nian Wang · Menglin Zhang · Yanzhao Su · Dong Liu

[ Exhibit Hall I ]

Abstract
Recent advancements in unpaired dehazing, particularly those using GANs, show promising performance in processing real-world hazy images. However, these methods tend to face limitations due to the generator's limited transport mapping capability, which hinders the full exploitation of their effectiveness in unpaired training paradigms. To address these challenges, we propose DehazeSB, a novel unpaired dehazing framework based on the Schrödinger Bridge. By leveraging optimal transport (OT) theory, DehazeSB directly bridges the distributions between hazy and clear images. This enables optimal transport mappings from hazy to clear images in fewer steps, thereby generating high-quality dehazed results. To ensure the consistency of structural information and localized details in the restored images, we introduce detail-preserving regularization, which enforces pixel-level alignment between hazy inputs and dehazed outputs. Furthermore, we propose a novel prompt learning to leverage pre-trained CLIP models in distinguishing hazy images and clear ones, by learning a haze-aware vision-language alignment. Extensive experiments on multiple real-world datasets demonstrate our method's superiority. Our code will be open-sourced.
Poster
Minwen Liao · Hao Dong · Xinyi Wang · Kurban Ubul · Ziyang Yan · Yihua Shao

[ Exhibit Hall I ]

Abstract
Low-light enhancement has wide applications in autonomous driving, 3D reconstruction, remote sensing, surveillance, and other domains, where it can significantly improve information utilization. However, most existing methods lack generalization and are limited to specific tasks such as image recovery. To address these issues, we propose Gated-Mechanism Mixture-of-Experts (GM-MoE), the first framework to introduce a mixture-of-experts network for low-light image enhancement. GM-MoE comprises a dynamic gated weight conditioning network and three sub-expert networks, each specializing in a distinct enhancement task. A self-designed gating mechanism dynamically adjusts the weights of the sub-expert networks for different data domains. Additionally, we integrate local and global feature fusion within sub-expert networks to enhance image quality by capturing multi-scale features. Experimental results demonstrate that GM-MoE achieves superior generalization compared with 25 existing approaches, reaching state-of-the-art PSNR on 5 benchmarks and state-of-the-art SSIM on 4 benchmarks.
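A minimal PyTorch-style sketch of a gated mixture-of-experts layer in the spirit described above: a small gating network predicts per-expert weights and the output is their weighted combination. Layer sizes and the placeholder experts are illustrative, not the GM-MoE architecture.

```python
# Sketch of a gated mixture-of-experts layer for image enhancement.
import torch
import torch.nn as nn

class GatedMoE(nn.Module):
    def __init__(self, experts, in_channels=3):
        super().__init__()
        self.experts = nn.ModuleList(experts)          # sub-expert networks
        self.gate = nn.Sequential(                     # dynamic gating network
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(in_channels, len(experts)), nn.Softmax(dim=-1))

    def forward(self, x):
        weights = self.gate(x)                                        # (B, E)
        outputs = torch.stack([e(x) for e in self.experts], dim=1)    # (B, E, C, H, W)
        return (weights[:, :, None, None, None] * outputs).sum(dim=1)

# Usage with trivial placeholder experts:
experts = [nn.Conv2d(3, 3, 3, padding=1) for _ in range(3)]
enhanced = GatedMoE(experts)(torch.randn(2, 3, 64, 64))
```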
Poster
Abhinav Kumar · Yuliang Guo · Zhihao Zhang · Xinyu Huang · Liu Ren · Xiaoming Liu

[ Exhibit Hall I ]

Abstract
Monocular 3D object detectors, while effective on data from one ego camera height, struggle with unseen or out-of-distribution camera heights. Existing methods often rely on Plucker embeddings, image transformations or data augmentation. This paper takes a step towards this understudied problem by investigating the impact of camera height variations on state-of-the-art (SoTA) Mono3D models. With a systematic analysis on the extended CARLA dataset with multiple camera heights, we observe that depth estimation is a primary factor influencing performance under height variations. We mathematically prove and also empirically observe consistent negative and positive trends in mean depth error of regressed and ground-based depth models, respectively, under camera height changes. To mitigate this, we propose Camera Height Robust Monocular 3D Detector (CHARM3R), which averages both depth estimates within the model. CHARM3R significantly improves generalization to unseen camera heights, achieving SoTA performance on the CARLA dataset. Our code, models, and extended datasets will be publicly available.
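The averaging idea can be illustrated with a one-line sketch, assuming one regressed and one ground-plane-based depth estimate per object; the equal weighting is an assumption for illustration, since the two estimates are described as drifting in opposite directions under camera-height changes.

```python
# Sketch: fuse a regressed depth and a ground-plane-based depth so that their
# opposite-signed errors under camera-height shifts partially cancel.
def fused_depth(regressed_depth: float, ground_based_depth: float,
                alpha: float = 0.5) -> float:
    return alpha * regressed_depth + (1.0 - alpha) * ground_based_depth

print(fused_depth(regressed_depth=23.8, ground_based_depth=25.1))
```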
Poster
Tongshun Zhang · Pingping Liu · Yubing Lu · Mengen Cai · Zijian Zhang · Zhe Zhang · Qiuzhan Zhou

[ Exhibit Hall I ]

Abstract
Traditional Low-Light Image Enhancement (LLIE) methods primarily focus on uniform brightness adjustment, often neglecting instance-level semantic information and the inherent characteristics of different features. To address these limitations, we propose CWNet (Causal Wavelet Network), a novel architecture that leverages wavelet transforms for causal reasoning. Specifically, our approach comprises two key components: 1) Inspired by the concept of intervention in causality, we adopt a causal reasoning perspective to reveal the underlying causal relationships in low-light enhancement. From a global perspective, we employ a metric learning strategy to ensure causal embeddings adhere to causal principles, separating them from non-causal confounding factors while focusing on the invariance of causal factors. At the local level, we introduce an instance-level CLIP semantic loss to precisely maintain causal factor consistency. 2) Based on our causal analysis, we present a wavelet transform-based backbone network that models high-frequency information through an SS2D scanning strategy aligned with high-frequency components, enabling precise recovery of high-frequency details, while complex modeling of low-frequency information is achieved by combining the advantages of Fast Fourier Convolution and wavelet convolution. Extensive experiments demonstrate that CWNet significantly outperforms current state-of-the-art methods across multiple datasets, showcasing its robust performance across diverse scenes.
Poster
Hongyi Zhou · Xiaogang Wang · Yulan Guo · Kai Xu

[ Exhibit Hall I ]

Abstract
Accurately analyzing the motion parts and their motion attributes in dynamic environments is crucial for advancing key areas such as embodied intelligence. Addressing the limitations of existing methods that rely on dense multi-view images or detailed part-level annotations, we propose an innovative framework that can analyze 3D mobility from monocular videos in a zero-shot manner. This framework can precisely parse motion parts and motion attributes only using a monocular video, completely eliminating the need for annotated training data. Specifically, our method first constructs the scene geometry and roughly analyzes the motion parts and their initial motion attributes combining depth estimation, optical flow analysis and point cloud registration method, then employs 2D Gaussian splatting for scene representation. Building on this, we introduce an end-to-end dynamic scene optimization algorithm specifically designed for articulated objects, refining the initial analysis results to ensure the system can handle ‘rotation’, ‘translation’, and even complex movements (‘rotation+translation’), demonstrating high flexibility and versatility. To validate the robustness and wide applicability of our method, we created a comprehensive dataset comprising both simulated and real-world scenarios. Experimental results show that our framework can effectively analyze articulated object motions in an annotation-free manner, showcasing its significant potential in future embodied intelligence …
Poster
Xinqi Fan · Xueli CHEN · Luoxiao Yang · Chuin Hong Yap · Rizwan Qureshi · Qi Dou · Moi Hoon Yap · Mubarak Shah

[ Exhibit Hall I ]

Abstract
Vision-language models (VLMs) have shown promise in test-time adaptation tasks due to their remarkable capabilities in understanding and reasoning about visual content through natural language descriptions. However, training VLMs typically demands substantial computational resources, and they often struggle to adapt efficiently to new domains or tasks. Additionally, dynamically estimating the test distribution from streaming data at test time remains a significant challenge. In this work, we propose a novel test-time retrieval-augmented adaptation (TT-RAA) method that enables VLMs to maintain high performance across diverse visual recognition tasks without the need for task-specific training or large computational overhead. During inference, TT-RAA employs a streaming mixture of Gaussian database (SMGD) to continuously estimate test distributions, requiring minimal storage. Then, TT-RAA retrieves the most relevant information from the SMGD, enhancing the original VLM outputs. A key limitation of CLIP-based VLMs is their inter-modal vision-language optimization, which does not optimize vision-space similarity, leading to larger intra-modal variance. To address this, we propose a multimodal retrieval augmentation module that transforms the SMGD into a unified multimodal space, enabling retrieval that aligns both vision and language modalities. Extensive experiments across both cross-domain and out-of-distribution benchmarks comprising fourteen datasets demonstrate TT-RAA’s superior performance compared to state-of-the-art methods. Ablation …
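A sketch of what a streaming Gaussian database could look like, using Welford-style running statistics per (pseudo-)label so storage stays constant as test data streams in; this illustrates the idea, not the TT-RAA implementation.

```python
# Sketch: streaming per-label Gaussian statistics over test-time features.
import numpy as np

class StreamingGaussianDB:
    def __init__(self, feature_dim: int):
        self.dim = feature_dim
        self.stats = {}   # label -> (count, mean, M2) for Welford's update

    def update(self, label: int, feature: np.ndarray) -> None:
        count, mean, m2 = self.stats.get(
            label, (0, np.zeros(self.dim), np.zeros(self.dim)))
        count += 1
        delta = feature - mean
        mean = mean + delta / count
        m2 = m2 + delta * (feature - mean)   # running sum of squared deviations
        self.stats[label] = (count, mean, m2)

    def retrieve(self, feature: np.ndarray, top_k: int = 3):
        # Return the labels whose Gaussian means lie closest to the query.
        dists = {lbl: np.linalg.norm(feature - mean)
                 for lbl, (_, mean, _) in self.stats.items()}
        return sorted(dists, key=dists.get)[:top_k]
```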
Poster
Jinxi Li · Ziyang Song · Bo Yang

[ Exhibit Hall I ]

Abstract
In this paper, we aim to model 3D scene geometry, appearance, and physical information just from dynamic multi-view videos in the absence of any human labels. By leveraging physics-informed losses as soft constraints or integrating simple physics models into neural networks, existing works often fail to learn complex motion physics, or doing so requires additional labels such as object types or masks. In this work, we propose a new framework named **TRACE** to model the motion physics of complex dynamic 3D scenes. The key novelty of our approach is that, by formulating each 3D point as a rigid particle with size and orientation in space, we choose to directly learn a translation-rotation dynamics system for each particle, explicitly estimating a complete set of physical parameters to govern the particle's motion over time. Extensive experiments on three existing dynamic datasets and one newly created challenging synthetic dataset demonstrate the extraordinary performance of our method over baselines in the task of future frame extrapolation. A nice property of our framework is that multiple objects or parts can be easily segmented just by clustering the learned physical parameters. Our datasets and code will be released at https://github.com/.
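A toy sketch of rolling rigid particles forward under per-particle translation-rotation dynamics; constant velocities and the scipy rotation parameterization are assumptions for illustration, whereas the paper learns these physical parameters per particle.

```python
# Sketch: explicit forward rollout of rigid particles, each carrying a linear
# velocity and an angular velocity (axis-angle rate).
import numpy as np
from scipy.spatial.transform import Rotation as R

def rollout(positions, orientations, lin_vel, ang_vel, dt=0.1, steps=10):
    """positions: (N, 3); orientations: list of N scipy Rotations;
    lin_vel: (N, 3) in m/s; ang_vel: (N, 3) axis-angle rate in rad/s."""
    traj = [positions.copy()]
    for _ in range(steps):
        positions = positions + dt * lin_vel              # translation update
        orientations = [R.from_rotvec(dt * w) * q         # rotation update
                        for q, w in zip(orientations, ang_vel)]
        traj.append(positions.copy())
    return np.stack(traj), orientations
```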
Poster
Kevin Tandi · Xiang Dai · Chinmay Talegaonkar · Gal Mishne · Nicholas Antipa

[ Exhibit Hall I ]

Abstract
Compressive video capture encodes a short high-speed video into a single measurement using a low-speed sensor, then computationally reconstructs the original video. Prior implementations rely on expensive hardware and are restricted to imaging sparse scenes with empty backgrounds. We propose RnGCam, a system that fuses measurements from low-speed consumer-grade rolling-shutter (RS) and global-shutter (GS) sensors into video at kHz frame rates. The RS sensor is combined with a pseudorandom optic, called a diffuser, which spatially multiplexes scene information. The GS sensor is coupled with a conventional lens. The RS-diffuser provides low spatial detail and high temporal detail, complementing the GS-lens system's high spatial detail and low temporal detail. We propose a reconstruction method using implicit neural representations (INR) to fuse the measurements into a high-speed video. Our INR method separately models the static and dynamic scene components, while regularizing dynamics explicitly. In simulation, we show that our approach significantly outperforms previous RS compressive video methods, as well as state-of-the-art frame interpolators. We validate our approach in a dual-camera hardware setup, which generates 230 frames of video at 4,800 frames per second for dense scenes, using hardware that costs 10x less than previous compressive video systems.
Poster
Shengjie Lin · Jiading Fang · Muhammad Zubair Irshad · Vitor Campagnolo Guizilini · Rares Ambrus · Greg Shakhnarovich · Matthew Walter

[ Exhibit Hall I ]

Abstract
Reconstructing articulated objects prevalent in daily environments is crucial for applications in augmented/virtual reality and robotics. However, existing methods face scalability limitations (requiring 3D supervision or costly annotations), robustness issues (being susceptible to local optima), and rendering shortcomings (lacking speed or photorealism). We introduce SplArt, a self-supervised, category-agnostic framework that leverages 3D Gaussian Splatting (3DGS) to reconstruct articulated objects and infer kinematics from two sets of posed RGB images captured at different articulation states, enabling real-time photorealistic rendering for novel viewpoints and articulations. SplArt augments 3DGS with a differentiable mobility parameter per Gaussian, achieving refined part segmentation. A multi-stage optimization strategy is employed to progressively handle reconstruction, part segmentation, and articulation estimation, significantly enhancing robustness and accuracy. SplArt exploits geometric self-supervision, effectively addressing challenging scenarios without requiring 3D annotations or category-specific priors. Evaluations on established and newly proposed benchmarks, along with applications to real-world scenarios using a handheld RGB camera, demonstrate SplArt's state-of-the-art performance and real-world practicality.
Poster
Qiusheng Huang · Xiaohui Zhong · Xu Fan · Hao Li

[ Exhibit Hall I ]

Abstract
Similar to conventional video generation, current deep learning-based weather prediction frameworks often lack explicit physical constraints, leading to unphysical outputs that limit their reliability for operational forecasting. Among various physical processes requiring proper representation, radiation plays a fundamental role as it drives Earth's weather and climate systems. However, accurate simulation of radiative transfer processes remains challenging for traditional numerical weather prediction (NWP) models due to their inherent complexity and high computational costs. Here, we propose FuXi-RTM, a hybrid physics-guided deep learning framework designed to enhance weather forecast accuracy while enforcing physical consistency. FuXi-RTM integrates a primary forecasting model (FuXi) with a fixed deep learning-based radiative transfer model (DLRTM) surrogate that efficiently replaces conventional radiation parameterization schemes. This represents the first deep learning-based weather forecasting framework to explicitly incorporate physical process modeling. Evaluated over a comprehensive 5-year dataset, FuXi-RTM outperforms its unconstrained counterpart in 88.51\% of 3320 variable and lead time combinations, with improvements in radiative flux predictions. By incorporating additional physical processes, FuXi-RTM paves the way for next-generation weather forecasting systems that are both accurate and physically consistent.
Poster
Chengbo Yuan · Geng Chen · Li Yi · Yang Gao

[ Exhibit Hall I ]

Abstract
Egocentric videos provide valuable insights into human interactions with the physical world, which has sparked growing interest in the computer vision and robotics communities. A critical challenge in fully understanding the geometry and dynamics of egocentric videos is dense scene reconstruction. However, the lack of high-quality labeled datasets in this field has hindered the effectiveness of current supervised learning methods. In this work, we aim to address this issue by exploring a self-supervised dynamic scene reconstruction approach. We introduce **EgoMono4D**, a novel model that unifies the estimation of multiple variables necessary for *Ego*centric *Mono*cular *4D* reconstruction, including camera intrinsics, camera poses, and video depth, all within a fast feed-forward framework. Starting from a pretrained single-frame depth and intrinsics estimation model, we extend it with camera pose estimation and align multi-frame results on large-scale unlabeled egocentric videos. We evaluate EgoMono4D in both in-domain and zero-shot generalization settings, achieving superior performance in dense point cloud sequence reconstruction compared to all baselines. EgoMono4D represents the first attempt to apply self-supervised learning for point cloud sequence reconstruction to the label-scarce egocentric field, enabling fast, dense, and generalizable reconstruction. The code and trained models will be released in the future.
Poster
Ruofan Wang · Juncheng Li · Yixu Wang · Bo Wang · Xiaosen Wang · Yan Teng · Yingchun Wang · Xingjun Ma · Yu-Gang Jiang

[ Exhibit Hall I ]

Abstract
As large Vision-Language Models (VLMs) gain prominence, ensuring their safe deployment has become critical. Recent studies have explored VLM robustness against jailbreak attacks—techniques that exploit model vulnerabilities to elicit harmful outputs. However, the limited availability of diverse multimodal data has constrained current approaches to rely heavily on adversarial or manually crafted images derived from harmful text datasets, which often lack effectiveness and diversity across different contexts. In this paper, we propose IDEATOR, a novel jailbreak method that autonomously generates malicious image-text pairs for black-box jailbreak attacks. IDEATOR is grounded in the insight that VLMs themselves could serve as powerful red team models for generating multimodal jailbreak prompts. Specifically, IDEATOR leverages a VLM to create targeted jailbreak texts and pairs them with jailbreak images generated by a state-of-the-art diffusion model. Extensive experiments demonstrate IDEATOR’s high effectiveness and transferability, achieving a 94% attack success rate (ASR) in jailbreaking MiniGPT-4 with an average of only 5.34 queries, and high ASRs of 82%, 88%, and 75% when transferred to LLaVA, InstructBLIP, and Chameleon, respectively. Building on IDEATOR’s strong transferability and automated process, we introduce the VLBreakBench, a safety benchmark comprising 3,654 multimodal jailbreak samples. Our benchmark results on 11 recently released VLMs reveal significant …
Poster
Tatiana Zemskova · Dmitry Yudin

[ Exhibit Hall I ]

Abstract
A 3D scene graph represents a compact scene model, storing information about the objects and the semantic relationships between them, making its use promising for robotic tasks. When interacting with a user, an embodied intelligent agent should be capable of responding to various queries about the scene formulated in natural language. Large Language Models (LLMs) are beneficial solutions for user-robot interaction due to their natural language understanding and reasoning abilities. Recent methods for creating learnable representations of 3D scenes have demonstrated the potential to improve the quality of LLM responses by adapting to the 3D world. However, the existing methods do not explicitly utilize information about the semantic relationships between objects, limiting themselves to information about their coordinates. In this work, we propose 3DGraphLLM, a method for constructing a learnable representation of a 3D scene graph. The learnable representation is used as input for LLMs to perform 3D vision-language tasks. In our experiments on the popular ScanRefer, Multi3DRefer, ScanQA, Sqa3D, and Scan2cap datasets, we demonstrate the advantage of this approach over baseline methods that do not use information about the semantic relationships between objects.
Poster
Qirui Wu · Denys Iliash · Daniel Ritchie · Manolis Savva · Angel Chang

[ Exhibit Hall I ]

Abstract
Reconstructing structured 3D scenes from RGB images using CAD objects unlocks efficient and compact scene representations that maintain compositionality and interactability. Existing works propose training-heavy methods relying on either expensive yet inaccurate real-world annotations or controllable yet monotonous synthetic data that do not generalize well to unseen objects or domains. We present Diorama, the first zero-shot open-world system that holistically models 3D scenes from single-view RGB observations without requiring end-to-end training or human annotations. We show the feasibility of our approach by decomposing the problem into subtasks and introduce better solutions to each: architecture reconstruction, 3D shape retrieval, object pose estimation, and scene layout optimization. We evaluate our system on both synthetic and real-world data to show we significantly outperform baselines from prior work. We also demonstrate generalization to real-world internet images and the text-to-scene task.
Poster
Tim Seizinger · Florin-Alexandru Vasluianu · Marcos Conde · Zongwei Wu · Radu Timofte

[ Exhibit Hall I ]

Abstract
Bokeh rendering methods play a key role in creating the visually appealing, softly blurred backgrounds seen in professional photography. While recent learning-based approaches show promising results, generating realistic Bokeh with controllable strength remains challenging. Existing methods require additional inputs and suffer from unrealistic Bokeh reproduction due to reliance on synthetic data. In this work, we propose Bokehlicious, a highly efficient network that provides intuitive control over Bokeh strength through an Aperture-Aware Attention mechanism, mimicking the physical lens aperture. To further address the lack of high-quality real-world data, we present RealBokeh, a novel dataset featuring 23,000 high-resolution (24-MP) images captured by professional photographers, covering diverse scenes with varied aperture and focal length settings. Evaluations on both our new RealBokeh and established Bokeh rendering benchmarks show that Bokehlicious consistently outperforms SOTA methods while significantly reducing computational cost and exhibiting strong zero-shot generalization. Our method and dataset further extend to defocus deblurring, achieving competitive results on the RealDOF benchmark. Our code and data will be public.
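To make the Aperture-Aware Attention idea concrete, here is a hedged sketch under the assumption that a scalar aperture value conditions a standard multi-head attention block through a learned embedding added to the queries; the paper's actual mechanism may differ, and all module names are illustrative.

```python
import torch
import torch.nn as nn

class ApertureAwareAttention(nn.Module):
    """Self-attention over image tokens, conditioned on a scalar aperture value (illustrative)."""

    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Maps the scalar aperture to a conditioning vector (assumed design, not the paper's).
        self.aperture_mlp = nn.Sequential(nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim))

    def forward(self, x, aperture):
        # x: (B, N, dim) image tokens; aperture: (B,) f-number controlling Bokeh strength
        a = self.aperture_mlp(aperture[:, None]).unsqueeze(1)   # (B, 1, dim)
        q = x + a                                               # aperture-conditioned queries
        out, _ = self.attn(q, x, x)
        return out

tokens = torch.randn(2, 196, 64)
aperture = torch.tensor([1.8, 16.0])          # wide open vs. stopped down
blk = ApertureAwareAttention(64)
print(blk(tokens, aperture).shape)            # torch.Size([2, 196, 64])
```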
Poster
Yannick Burkhardt · Simon Schaefer · Stefan Leutenegger

[ Exhibit Hall I ]

Abstract
Event-based keypoint detection and matching holds significant potential, enabling the integration of event sensors into highly optimized Visual SLAM systems developed for frame cameras over decades of research. Unfortunately, existing approaches struggle with the motion-dependent appearance of keypoints and the complex noise prevalent in event streams, resulting in severely limited feature matching capabilities and poor performance on downstream tasks. To mitigate this problem, we propose SuperEvent, a data-driven approach to predict stable keypoints with expressive descriptors. Due to the absence of event datasets with ground truth keypoint labels, we leverage existing frame-based keypoint detectors on readily available event-aligned and synchronized gray-scale frames for self-supervision: we generate temporally sparse keypoint pseudo-labels considering that events are a product of both scene appearance and camera motion. Combined with our novel, information-rich event representation, we enable SuperEvent to effectively learn robust keypoint detection and description in event streams. Finally, we demonstrate the usefulness of SuperEvent by its integration into a modern sparse keypoint and descriptor-based SLAM framework originally developed for traditional cameras, surpassing the state-of-the-art in event-based SLAM by a wide margin. The source code and model weights will be published after acceptance.
Poster
Runyang Feng · Hyung Jin Chang · Tze Ho Elden Tse · Boeun Kim · Yi Chang · Yixing Gao

[ Exhibit Hall I ]

Abstract
Modeling high-resolution spatiotemporal representations, including both global dynamic contexts (e.g., holistic human motion tendencies) and local motion details (e.g., high-frequency changes of keypoints), is essential for video-based human pose estimation (VHPE). Current state-of-the-art methods typically unify spatiotemporal learning within a single type of modeling structure (convolution or attention-based blocks), which inherently have difficulties in balancing global and local dynamic modeling and may bias the network to one of them, leading to suboptimal performance. Moreover, existing VHPE models suffer from quadratic complexity when capturing global dependencies, limiting their applicability especially for high-resolution sequences. Recently, the state space models (known as Mamba) have demonstrated significant potential in modeling long-range contexts with linear complexity; however, they are restricted to 1D sequential data. In this paper, we present a novel framework that extends Mamba from two aspects to separately learn global and local high-resolution spatiotemporal representations for VHPE. Specifically, we first propose a Global Spatiotemporal Mamba, which performs 6D selective space-time scan and spatial- and temporal-modulated scan merging to efficiently extract global representations from high-resolution sequences. We further introduce a windowed space-time scan-based Local Refinement Mamba to enhance the high-frequency details of localized keypoint motions. Extensive experiments on four benchmark datasets demonstrate that the …
Poster
Xiaorong Qin · Xinhang Song · Sixian Zhang · Xinyao Yu · Xinmiao Zhang · Shuqiang Jiang

[ Exhibit Hall I ]

Abstract
Object navigation tasks require an agent to locate a target object using visual observations in unseen environments, where unfamiliar layouts and novel object appearances can hinder navigation. Most existing methods lack the adaptability needed to handle these uncertainties, as their navigation models remain fixed during testing. In this paper, we address this challenge by examining object-conditioned trajectory distribution shifts in navigation caused by changes in environmental dynamics. We propose learning a central conditional distribution as a prior that approximates the specific distributions of diverse environments. To retain environment-specific information during navigation, we allow each environment-specific distribution to approximate this central distribution rather than relying on it directly. To implement this, we introduce a meta-learning mechanism that integrates with traditional navigation methods, offering tailored solutions for various types of navigation approaches. Our approach, Learning on the Go (LOG), enables agents to learn on the go, allowing for flexible, adaptive, real-time learning during navigation. Our theoretical analysis highlights the benefits of learning a central distribution for effective generalization across environments, and empirical results confirm the proposed method’s effectiveness, demonstrating superior performance compared to existing approaches.
Poster
Zhihao ZHU · Yifan Zheng · Siyu Pan · Yaohui Jin · Yao Mu

[ Exhibit Hall I ]

Abstract
The fragmentation between high-level task semantics and low-level geometric features remains a persistent critical challenge in robotic manipulation. While vision-language models (VLMs) have demonstrated their potential in generating affordance-aware visual representations, the lack of semantic grounding in canonical spaces and reliance on manual annotations severely limit their ability to capture dynamic semantic-affordance relationships. To address these limitations, we propose Primitive-Aware Semantic Grounding (PASG), a closed-loop framework that introduces: (1) Automatic primitive extraction through geometric feature aggregation, enabling cross-category detection of keypoints and axes; (2) VLM-driven semantic anchoring that dynamically couples geometric primitives with functional affordances and task-relevant descriptions; (3) A spatial-semantic reasoning benchmark and a fine-tuned VLM (Qwen2.5VL-PA). Extensive experiments demonstrate PASG achieves a finer-grained semantic-affordance understanding of objects, establishing a unified paradigm for bridging geometric primitives with task semantics in robotic manipulation.
Poster
Yihong Cao · Jiaming Zhang · Xu Zheng · Hao Shi · Kunyu Peng · Hang Liu · Kailun Yang · Hui Zhang

[ Exhibit Hall I ]

Abstract
Panoramic image processing is essential for omni-context perception, yet faces constraints like distortions, perspective occlusions, and limited annotations. Previous unsupervised domain adaptation methods transfer knowledge from labeled pinhole data to unlabeled panoramic images, but they require access to source pinhole data. To address these constraints, we introduce a more practical task, i.e., Source-Free Occlusion-Aware Seamless Segmentation (SFOASS), and propose its first solution, called UNconstrained Learning Omni-Context Knowledge (UNLOCK). Specifically, UNLOCK includes two key modules: Omni Pseudo-Labeling Learning and Amodal-Driven Context Learning. While adapting without relying on source data or target labels, this framework enhances models to achieve segmentation with 360° viewpoint coverage and occlusion-aware reasoning. Furthermore, we benchmark the proposed SFOASS task through both real-to-real and synthetic-to-real adaptation settings. Experimental results show that our source-free method achieves performance comparable to source-dependent methods, yielding SOTA scores of 10.9 in mAAP and 11.6 in mAP, along with an absolute improvement of +4.3 in mAPQ over the source-only method. All data and code will be made publicly available.
Poster
Mahmoud Ahmed · Junjie Fei · Jian Ding · Eslam Abdelrahman · Mohamed Elhoseiny

[ Exhibit Hall I ]

Abstract
In this paper, we introduce Part-Aware Point Grounded Description (PaPGD), a challenging task aimed at advancing 3D multimodal learning for fine-grained, part-aware segmentation grounding and detailed explanation of 3D objects. Existing 3D datasets largely focus on either vision-only part segmentation or vision-language scene segmentation, lacking the fine-grained multimodal segmentation needed for robotic navigation and interaction in real-world environments. To address this gap, we present the 3DCoMPaT Grounded Instructions (3DCoMPaT-GrIn) Dataset, a comprehensive resource that pairs rich point cloud descriptions with corresponding part-level segmentation masks. This dataset encompasses extensive samples designed for both PaPGD and fine-grained single-part grounding tasks. To tackle the inherent challenges of grounding objects and generating grounded descriptions at the part level, we propose Kestrel, a part-aware 3D multimodal large language model that integrates an advanced language model for nuanced language comprehension with multi-level point feature propagation and query refinement mechanism to enhance spatial reasoning at the part level. The extensive experiments demonstrate that Kestrel effectively bridges the gap between part-aware language understanding and 3D segmentation grounding, paving the way for more robust and interpretable 3D object comprehension that meets the demands of real-world robotic applications.
Poster
Xiangyu Yin · Boyuan Yang · Weichen Liu · Qiyao Xue · Abrar Alamri · Goeran Fiedler · Wei Gao

[ Exhibit Hall I ]

Abstract
Prosthetic legs play a pivotal role in clinical rehabilitation, allowing individuals with lower-limb amputations to regain mobility and improve their quality of life. Gait analysis is fundamental for optimizing prosthesis design and alignment, directly impacting the mobility and life quality of individuals with lower-limb amputations. Vision-based machine learning (ML) methods offer a scalable and non-invasive solution to gait analysis, but face challenges in correctly detecting and analyzing prostheses, due to their unique appearances and new movement patterns. In this paper, we aim to bridge this gap by introducing a multi-purpose dataset, namely ProGait, to support multiple vision tasks including Video Object Segmentation, 2D Human Pose Estimation, and Gait Analysis (GA). ProGait provides 412 video clips from four above-knee amputees when testing multiple newly-fitted prosthetic legs through walking trials, and depicts the presence, contours, poses, and gait patterns of human subjects with transfemoral prosthetic legs. Alongside the dataset itself, we also present benchmark tasks and fine-tuned baseline models to illustrate the practical application and performance of the ProGait dataset. We compared our baseline models against pre-trained vision models, demonstrating improved generalizability when applying the ProGait dataset for prosthesis-specific tasks.
Poster
Zhiyuan Yang · Anqi Cheng · Haiyue Zhu · Tianjiao Li · Pey Yuen Tao · Kezhi Mao

[ Exhibit Hall I ]

Abstract
Depth completion, the task of reconstructing dense depth maps from sparse depth and RGB images, plays a critical role in 3D scene understanding. However, existing methods often struggle to recover high-frequency details, such as regions with fine structures or weak signals, since depth sensors may fail to capture accurate depth maps in those regions, leading to imperfect supervision ground truth. To overcome this limitation, it is essential to introduce an alternative training source for the models. Emerging depth foundation models excel at producing high-frequency details from RGB images, yet their depth maps suffer from inconsistent scaling. Therefore, we propose a novel teacher-student framework that enhances depth completion by distilling high-frequency knowledge from depth foundation models across multiple scales. Our approach introduces two key innovations: Adaptive Local Wavelet Decomposition, which dynamically adjusts the wavelet decomposition level based on local complexity for efficient feature extraction, and Topological Constraints, which apply persistent homology to enforce structural coherence and suppress spurious depth edges. Experimental results demonstrate that our method outperforms state-of-the-art methods, preserving high-frequency details and overall depth fidelity.
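As a rough illustration of the Adaptive Local Wavelet Decomposition idea, the sketch below picks the number of wavelet levels per patch from a simple local-complexity proxy (mean gradient magnitude). The thresholds, the complexity measure, and the use of PyWavelets are assumptions for illustration only, not the paper's formulation.

```python
import numpy as np
import pywt

def local_complexity(patch):
    # Mean gradient magnitude as a cheap proxy for local structural complexity.
    gy, gx = np.gradient(patch.astype(np.float64))
    return float(np.mean(np.hypot(gx, gy)))

def adaptive_wavelet_features(patch, wavelet="haar"):
    c = local_complexity(patch)
    level = 1 if c < 0.05 else 2 if c < 0.2 else 3     # more levels for busier regions (illustrative thresholds)
    coeffs = pywt.wavedec2(patch, wavelet, level=level)
    return level, coeffs                                # high-frequency bands would feed the distillation loss

patch = np.random.rand(64, 64)
level, coeffs = adaptive_wavelet_features(patch)
print(level, len(coeffs))                               # chosen level and number of coefficient groups
```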
Poster
Miroslav Purkrabek · Jiri Matas

[ Exhibit Hall I ]

Abstract
Human pose estimation methods work well on isolated people but struggle with multiple-bodies-in-proximity scenarios. Previous work has addressed this problem by conditioning pose estimation on detected bounding boxes or keypoints, but overlooked instance masks. We propose to iteratively enforce mutual consistency of bounding boxes, instance masks, and poses. The introduced BBox-Mask-Pose (BMP) method uses three specialized models that improve each other's output in a closed loop. All models are adapted for mutual conditioning, which improves robustness in multi-body scenes. MaskPose, a new mask-conditioned pose estimation model, is the best among top-down approaches on OCHuman. BBox-Mask-Pose pushes the SOTA on the OCHuman dataset in all three tasks -- detection, instance segmentation, and pose estimation. It also achieves SOTA performance on COCO pose estimation. The method is especially effective in scenes with large instance overlap, where it improves detection by 39% over the baseline detector. With small specialized models and faster runtime, BMP is an effective alternative to large human-centered foundational models. Code and models will be published.
Poster
Xinkuan Qiu · Meina Kan · Yongbin Zhou · Shiguang Shan

[ Exhibit Hall I ]

Abstract
Multimodal Large Language Models (MLLMs) have made significant strides in visual and language tasks. However, despite their impressive performance on standard datasets, these models encounter considerable robustness challenges when processing corrupted images, raising concerns about their reliability in safety-critical applications. To address this issue, we introduce the MLLM-IC benchmark, specifically designed to assess the performance of MLLMs under image corruption scenarios. MLLM-IC offers a more comprehensive evaluation of corruption robustness compared to existing benchmarks, enabling a multi-dimensional assessment of various MLLM capabilities across a broad range of corruption types. It includes 40 distinct corruption types and 34 low-level multimodal capabilities, each organized into a three-level hierarchical structure. Notably, it is the first corruption robustness benchmark designed to facilitate the evaluation of fine-grained MLLM capabilities. We further evaluate several prominent MLLMs and derive valuable insights into their characteristics. We believe the MLLM-IC benchmark will provide crucial insights into the robustness of MLLMs in handling corrupted images and contribute to the development of more resilient MLLMs.
Poster
Xinhang Liu · Jiawei Shi · Zheng Dang · Yuchao Dai

[ Exhibit Hall I ]

Abstract
We present MixRI, a lightweight network that solves the CAD-based novel object pose estimation problem in RGB images. It can be instantly applied to a novel object at test time without finetuning. We design our network to meet the demands of real-world applications, emphasizing reduced memory requirements and fast inference time. Unlike existing works that utilize many reference images and have large network parameters, we directly match points based on the multi-view information between the query and reference images with a lightweight network. Thanks to our reference image fusion strategy, we significantly decrease the number of reference images, thereby decreasing the time needed to process these images and the memory required to store them. Furthermore, with our lightweight network, our method requires less inference time. Despite using fewer reference images, experiments on the seven core datasets in the BOP challenge show that our method achieves results comparable to other methods that require more reference images and larger network parameters.
Poster
Chenwei Lin · Hanjia Lyu · Xian Xu · Jiebo Luo

[ Exhibit Hall I ]

Abstract
Large Vision-Language Models (LVLMs) have demonstrated outstanding performance in various general multimodal applications and have shown promising potential in specialized domains. However, the application potential of LVLMs in the insurance domain—characterized by rich application scenarios and abundant multimodal data—has not been effectively explored. There is no systematic review of multimodal tasks in the insurance domain, nor a benchmark specifically designed to evaluate the capabilities of LVLMs in insurance. This gap hinders the development of LVLMs within the insurance domain. In this paper, we systematically review and distill multimodal tasks for 4 representative types of insurance: auto insurance, property insurance, health insurance, and agricultural insurance. We propose INS-MMBench, the first hierarchical LVLMs benchmark tailored for the insurance domain. INS-MMBench encompasses 22 fundamental tasks, 12 meta-tasks and 5 scenario tasks—enabling a comprehensive and progressive assessment from basic tasks to real-world insurance scenarios. Furthermore, we evaluate multiple representative LVLMs, including closed-source models such as GPT-4o and open-source models like LLaVA. Our evaluation not only validates the effectiveness of our benchmark but also provides an in-depth performance analysis of current LVLMs on various multimodal tasks in the insurance domain. We hope that INS-MMBench will facilitate the further application of LVLMs in the insurance domain …
Poster
ADEELA ISLAM · Stefano Fiorini · Stuart James · Pietro Morerio · ALESSIO DEL BUE

[ Exhibit Hall I ]

Abstract
The task of reassembly is a significant challenge across multiple domains, including archaeology, genomics, and molecular docking, requiring the precise placement and orientation of elements to reconstruct an original structure. In this work, we address key limitations in state-of-the-art Deep Learning methods for reassembly, namely i) scalability; ii) multimodality; and iii) real-world applicability: beyond square or simple geometric shapes, realistic and complex erosion, or other real-world problems. We propose ReassembleNet, a method that reduces complexity by representing each input piece as a set of contour keypoints and learning to select the most informative ones with techniques inspired by Graph Neural Network pooling. ReassembleNet effectively lowers computational complexity while enabling the integration of features from multiple modalities, including both geometric and texture data. The model is further enhanced through pretraining on a semi-synthetic dataset. We then apply diffusion-based pose estimation to recover the original structure. We improve on prior methods by 55% and 86% for RMSE Rotation and Translation, respectively.
Poster
Wenqi Wang · Reuben Tan · Pengyue Zhu · Jianwei Yang · Zhengyuan Yang · Lijuan Wang · Andrey Kolobov · Jianfeng Gao · Boqing Gong

[ Exhibit Hall I ]

Abstract
Spatial intelligence (SI) represents a cognitive ability encompassing the visualization, manipulation, and reasoning about spatial relationships, underpinning disciplines from neuroscience to robotics. We introduce SITE, a benchmark dataset towards SI Thorough Evaluation in a standardized format of multi-choice visual question-answering, designed to assess large vision-language models' spatial intelligence across diverse visual modalities (single-image, multi-image, and video) and SI factors (figural to environmental scales, spatial visualization and orientation, intrinsic and extrinsic, static and dynamic). Our approach to curating the benchmark combines a bottom-up survey of 31 existing datasets and a top-down strategy drawing upon three classification systems in cognitive science, which prompt us to design two novel types of tasks about view-taking and dynamic scenes. Extensive experiments reveal that leading models fall behind human experts especially in spatial orientation, a fundamental SI factor. Moreover, we demonstrate a positive correlation between a model's spatial reasoning proficiency and its performance on an embodied AI task. Code and data will be publicly available.
Poster
Shuai Jin · Yuhua Qian · Feijiang Li · Guoqing Liu · Xinyan Liang

[ Exhibit Hall I ]

Abstract
Unsupervised low-light image enhancement presents the challenge of preserving both local texture details and global illumination consistency. Existing methods often rely on uniform, predefined strategies within fixed neighborhoods (e.g., fixed convolution kernels or average pooling), which are limited in their ability to adaptively capture the dynamic interdependencies between pixels during the enhancement process. As a result, these methods may lead to oversaturation or the loss of fine details. To address these issues, we introduce PASD, a novel pixel-adaptive adjustment approach inspired by swarm dynamics. PASD establishes inter-pixel cooperative constraints that adjust pixel intensities based on dynamic neighborhood interactions, thereby forming a population dynamics system for image enhancement that ensures a balance between local enhancement and global consistency. Furthermore, a distributed multi-agent reinforcement learning mechanism is employed to optimize the interactions within the dynamic system, while a multi-scale coordination framework ensures strategy consistency and stability. Experimental results demonstrate that PASD significantly outperforms existing state-of-the-art methods, providing a more flexible and efficient solution for low-light image enhancement.
Poster
Alexander Mai · Peter Hedman · George Kopanas · Dor Verbin · David Futschik · Qiangeng Xu · Falko Kuester · Jonathan Barron · Yinda Zhang

[ Exhibit Hall I ]

Abstract
We present Exact Volumetric Ellipsoid Rendering (EVER), a method for real-time 3D reconstruction. EVER accurately blends an unlimited number of overlapping primitives together in 3D space, eliminating the popping artifacts that 3D Gaussian Splatting (3DGS) and other related methods exhibit. EVER represents a radiance field as a set of constant-density volumetric ellipsoids, which are raytraced by intersecting each primitive twice (once upon ray entrance and another on ray exit) and accumulating the derivatives of the densities and colors along the ray. Because EVER is built around ray tracing, it also enables effects such as defocus blur and fish-eye camera distortion, while still achieving frame rates of ~30 FPS at 720p on an NVIDIA RTX4090. We show that our method is more accurate on the challenging large-scale scenes from the Zip-NeRF dataset, where it achieves state-of-the-art SSIM, even higher than Zip-NeRF.
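The accumulation over constant-density primitives can be illustrated with a small numeric sketch: given entry and exit distances of each primitive along a ray, the density field is piecewise constant between sorted boundaries, so the transmittance integral has a closed form per segment. Ellipsoid intersection and gradients are omitted; this illustrates the blending rule only, not the paper's renderer.

```python
import numpy as np

def render_ray(enters, exits, densities, colors):
    """Color accumulation along one ray through constant-density primitives.

    enters, exits: (K,) entry/exit distances of each primitive along the ray.
    densities: (K,) constant densities; colors: (K, 3) primitive colors.
    """
    events = np.unique(np.concatenate([enters, exits]))   # sorted segment boundaries
    color, transmittance = np.zeros(3), 1.0
    for t0, t1 in zip(events[:-1], events[1:]):
        mid = 0.5 * (t0 + t1)
        active = (enters <= mid) & (mid < exits)          # primitives covering this segment
        sigma = densities[active].sum()                   # density is constant on the segment
        if sigma == 0.0:
            continue
        alpha = 1.0 - np.exp(-sigma * (t1 - t0))          # exact opacity of the segment
        seg_color = (densities[active][:, None] * colors[active]).sum(0) / sigma
        color += transmittance * alpha * seg_color
        transmittance *= 1.0 - alpha
    return color, transmittance

enters, exits = np.array([0.5, 1.0]), np.array([1.5, 2.0])    # two overlapping primitives
densities, colors = np.array([2.0, 1.0]), np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
print(render_ray(enters, exits, densities, colors))
```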
Poster
Uranik Berisha · Jens Mehnert · Alexandru Condurache

[ Exhibit Hall I ]

Abstract
Increasingly expensive training of ever larger models such as Vision Transformers motivates reusing the vast library of already trained state-of-the-art networks. However, their latency, high computational costs, and memory demands pose significant challenges for deployment, especially on resource-constrained hardware. While structured pruning methods can reduce these factors, they often require costly retraining, sometimes for up to hundreds of epochs, or even training from scratch to recover the lost accuracy resulting from the structural modifications. Maintaining the performance of trained models after structured pruning, and thereby avoiding extensive retraining, remains a challenge. To solve this, we introduce Variance-Based Pruning, a simple and structured one-shot pruning technique for efficiently compressing networks, with minimal finetuning. Our approach first gathers activation statistics, which are then used to select neurons for pruning. Simultaneously, the mean activations are integrated back into the model to preserve a high degree of performance. On ImageNet-1k recognition tasks, we demonstrate that directly after pruning DeiT-Base retains over 70% of its original performance and requires only 10 epochs of fine-tuning to regain 99% of the original accuracy, while simultaneously reducing MACs by 35% and model size by 36%, thus speeding up the model by 1.44 times.
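A minimal sketch of the described recipe (gather activation statistics, prune low-variance neurons, fold their mean activations back into the following layer) for a pair of fully connected layers is given below. It assumes a ReLU between the two layers and an illustrative keep ratio; it is not the paper's code.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def variance_prune(fc1, fc2, calib_inputs, keep_ratio=0.7):
    """One-shot pruning of fc1's output neurons by activation variance (fc1 -> ReLU -> fc2 assumed)."""
    acts = torch.relu(fc1(calib_inputs))                      # (B, H) calibration activations
    var, mean = acts.var(dim=0), acts.mean(dim=0)
    keep = var.argsort(descending=True)[: int(keep_ratio * var.numel())]
    drop = torch.tensor(sorted(set(range(var.numel())) - set(keep.tolist())))

    # Fold the dropped neurons' mean activation into fc2's bias to preserve the expected output.
    fc2.bias += fc2.weight[:, drop] @ mean[drop]

    new_fc1 = nn.Linear(fc1.in_features, keep.numel())
    new_fc1.weight.copy_(fc1.weight[keep]); new_fc1.bias.copy_(fc1.bias[keep])
    new_fc2 = nn.Linear(keep.numel(), fc2.out_features)
    new_fc2.weight.copy_(fc2.weight[:, keep]); new_fc2.bias.copy_(fc2.bias)
    return new_fc1, new_fc2

fc1, fc2 = nn.Linear(128, 256), nn.Linear(256, 64)
calib = torch.randn(512, 128)                                 # calibration batch
p1, p2 = variance_prune(fc1, fc2, calib)
print(p1.weight.shape, p2.weight.shape)                       # pruned hidden width
```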
Poster
Zizhang Li · Hong-Xing Yu · Wei Liu · Yin Yang · Charles Herrmann · Gordon Wetzstein · Jiajun Wu

[ Exhibit Hall I ]

Abstract
WonderPlay is a novel framework integrating physics simulation with video generation for generating action-conditioned dynamic 3D scenes from a single image. Our hybrid generative simulator first uses a physics solver to simulate coarse 3D dynamics, which subsequently conditions a video generator to produce a video with finer, more realistic motion. The generated video is then used to update the simulated dynamic 3D scene, closing the loop between the physics solver and the video generator. This approach enables intuitive user control to be combined with the accurate dynamics of physics-based simulators and the expressivity of diffusion-based video generators. Experimental results demonstrate that WonderPlay enables users to interact with various scenes of diverse content, including cloth, sand, snow, liquid, smoke, elasticity, and rigid bodies -- all using a single image input. Code will be made public.
Poster
Kaixuan Jiang · Yang Liu · Weixing Chen · Jingzhou Luo · Ziliang Chen · Ling Pan · Guanbin Li · Liang Lin

[ Exhibit Hall I ]

Abstract
Embodied Question Answering (EQA) is a challenging task in embodied intelligence that requires agents to dynamically explore 3D environments, actively gather visual information, and perform multi-step reasoning to answer questions. However, current EQA approaches suffer from critical limitations in exploration efficiency, dataset design, and evaluation metrics. Moreover, existing datasets often introduce biases or prior knowledge, leading to disembodied reasoning, while frontier-based exploration strategies struggle in cluttered environments and fail to ensure fine-grained exploration of task-relevant areas. To address these challenges, we construct the EXPloration-awaRe Embodied queStion anSwering Benchmark (EXPRESS-Bench), the largest dataset designed specifically to evaluate both exploration and reasoning capabilities. EXPRESS-Bench consists of 777 exploration trajectories and 2,044 question-trajectory pairs. To improve exploration efficiency, we propose Fine-EQA, a hybrid exploration model that integrates frontier-based and goal-oriented navigation to guide agents toward task-relevant regions more effectively. Additionally, we introduce a novel evaluation metric, Exploration-Answer Consistency (EAC), which ensures faithful assessment by measuring the alignment between answer grounding and exploration reliability. Extensive experimental comparisons with state-of-the-art EQA models demonstrate the effectiveness of our EXPRESS-Bench in advancing embodied exploration and question reasoning.
Poster
Junwen Huang · Shishir Reddy Vutukur · Peter Yu · Nassir Navab · Slobodan Ilic · Benjamin Busam

[ Exhibit Hall I ]

Abstract
Typical template-based object pose pipelines first find the closest template and then align it to the current observation. The failure to find the closest template results in the wrong pose estimate. Instead, we reformulate object pose estimation with template images as a ray alignment problem where viewing directions from multiple posed template views need to mutually align with a non-posed object query. Inspired by recent advancements in denoising diffusion frameworks for camera pose estimation, we integrate this formulation into a diffusion transformer architecture capable of aligning a single query image of an object to a set of template views. Our method reparametrizes object rotation by introducing object-centered camera rays and object translation by extending Scale-Invariant Translation Estimation (SITE) to dense translation offsets. Our method leverages view priors from template images to enhance the model's ability to accurately infer query object poses. Using a coarse-to-fine training strategy with narrowed template sampling, our approach improves performance without modifying the network architecture, increasing robustness in 6D object pose estimation. Extensive evaluations on various benchmark datasets demonstrate the superiority of our method over state-of-the-art approaches in unseen object pose estimation. Our code will be made publicly available.
Poster
Mateusz Michalkiewicz · Xinyue Bai · Mahsa Baktashmotlagh · Varun Jampani · Guha Balakrishnan

[ Exhibit Hall I ]

Abstract
In this paper, we analyze the viewpoint stability of foundational models - specifically, their sensitivity to changes in viewpoint - and define instability as significant feature variations resulting from minor changes in viewing angle, leading to generalization gaps in 3D reasoning tasks. We investigate nine foundational models, focusing on their responses to viewpoint changes, including the often-overlooked accidental viewpoints where specific camera orientations obscure an object's true 3D structure. Our methodology enables recognizing and classifying accidental, stable, and other viewpoints using feature representations alone, without accessing the actual images. Our findings indicate that while foundation models consistently encode accidental viewpoints, they vary in their interpretation of other viewpoints due to inherent biases, at times leading to object misclassifications based on geometric resemblance. Through quantitative and qualitative evaluations on three downstream tasks - classification, VQA, and 3D reconstruction - we illustrate the impact of viewpoint instability and underscore the importance of feature robustness across diverse viewing conditions.
Poster
Arindam Dutta · Meng Zheng · Zhongpai Gao · Benjamin Planche · Anwesa Choudhuri · Terrence Chen · Amit Roy-Chowdhury · Ziyan Wu

[ Exhibit Hall I ]

Abstract
Reconstructing clothed humans from a single image is a fundamental task in computer vision with wide-ranging applications. Although existing monocular clothed human reconstruction solutions have shown promising results, they often rely on the assumption that the human subject is in an occlusion-free environment. Thus, when encountering in-the-wild occluded images, these algorithms produce multiview inconsistent and fragmented reconstructions. Additionally, most algorithms for monocular 3D human reconstruction leverage geometric priors such as SMPL annotations for training and inference, which are extremely challenging to acquire in real-world applications. To address these limitations, we propose CHROME: Clothed Human Reconstruction with Occlusion-Resilience and Multiview-ConsistEncy from a Single Image, a novel pipeline designed to reconstruct occlusion-resilient 3D humans with multiview consistency from a single occluded image, without requiring either ground-truth geometric prior annotations or 3D supervision. Specifically, CHROME leverages a multiview diffusion model to first synthesize occlusion-free human images from the occluded input, compatible with off-the-shelf pose control to explicitly enforce cross-view consistency during synthesis. A 3D reconstruction model is then trained to predict a set of 3D Gaussians conditioned on both the occluded input and synthesized views, aligning cross-view details to produce a cohesive and accurate 3D representation. CHROME achieves significant improvements in terms of …
Poster
Yingying Zhang · Lixiang Ru · Kang Wu · Lei Yu · Lei Liang · Yansheng Li · Jingdong Chen

[ Exhibit Hall I ]

Abstract
The multi-modal remote sensing foundation model (MM-RSFM) has significantly advanced various Earth observation tasks, such as urban planning, environmental monitoring, and natural disaster management. However, most existing approaches generally require the training of separate backbone networks for each data modality, leading to redundancy and inefficient parameter utilization. Moreover, prevalent pre-training methods typically apply self-supervised learning (SSL) techniques from natural images without adequately accommodating the characteristics of remote sensing (RS) images, such as the complicated semantic distribution within a single RS image. In this work, we present SkySense V2, a unified MM-RSFM that employs a single transformer backbone to handle multiple modalities. This backbone is pre-trained with a novel SSL strategy tailored to the distinct traits of RS data. In particular, SkySense V2 incorporates an innovative adaptive patch merging module and learnable modality prompt tokens to address challenges related to varying resolutions and limited feature diversity across modalities. In addition, we incorporate the mixture of experts (MoE) module to further enhance the performance of the foundation model. SkySense V2 demonstrates impressive generalization abilities through an extensive evaluation involving 16 datasets over 7 tasks, outperforming SkySense by an average of 1.8 points.
Poster
Mengxue Qu · Yibo Hu · Kunyang Han · Yunchao Wei · Yao Zhao

[ Exhibit Hall I ]

Abstract
Recent advancements in Large Vision-Language Models (LVLMs) have greatly improved their ability to understand both visual and text information. However, a common problem in LVLMs is confirmation bias, where models tend to repeat previous assumptions and follow earlier viewpoints instead of reflecting and correcting themselves. This problem is more common in smaller-scale LVLMs, as they are usually fine-tuned with training data that is mostly positive, focusing on generating coherent dialogue. To address this issue, we introduce ReCoT, a method designed to mitigate confirmation bias in smaller-scale LVLMs through Reflective Self-Correction Training. The method follows a two-stage SFT-DPO paradigm: the first SFT stage aims to cultivate the model's reflective correction abilities, while the DPO stage focuses on enhancing the consistency between answers and reflections. Specifically, we construct dialogue-based reflective samples, which serve as adversarial samples during SFT. In this process, the model is initially presented with a potentially incorrect answer, followed by a reflection and correction phase to generate the final answer. To enhance answer-reflection consistency, we propose the consistency direct preference optimization. To comprehensively evaluate the effectiveness of our ReCoT, we introduce a set of novel metrics to measure the accuracy of the reflection and correction process. Extensive experiments show that …
Poster
Xingyu Chen · Yue Chen · Yuliang Xiu · Andreas Geiger · Anpei Chen

[ Exhibit Hall I ]

Abstract
Recent advances in DUSt3R have enabled robust estimation of dense point clouds and camera parameters of static scenes, leveraging Transformer network architectures and direct supervision on large-scale 3D datasets. In contrast, the limited scale and diversity of available 4D datasets present a major bottleneck for training a highly generalizable 4D model. This constraint has driven conventional 4D methods to fine-tune 3D models on scalable dynamic video data with additional geometric priors such as optical flow and depths. In this work, we take an opposite path and introduce Easi3R, a simple yet efficient training-free method for 4D reconstruction. Our approach applies attention adaptation during inference, eliminating the need for from-scratch pre-training or network fine-tuning. We find that the attention layers in DUSt3R inherently encode rich information about camera and object motion. By carefully disentangling these attention maps, we achieve accurate dynamic region segmentation, camera pose estimation, and 4D dense point map reconstruction. Extensive experiments on real-world dynamic videos demonstrate that our lightweight attention adaptation significantly outperforms previous state-of-the-art methods that are trained or fine-tuned on extensive dynamic datasets.
Poster
Ciyu Ruan · Ruishan Guo · Zihang GONG · Jingao Xu · Wenhan Yang · Xinlei Chen

[ Exhibit Hall I ]

Abstract
Event cameras excel in high temporal resolution and dynamic range but suffer from dense noise in rainy conditions. Existing event deraining methods face trade-offs between temporal precision, deraining effectiveness, and computational efficiency. In this paper, we propose PRE-Mamba, a novel point-based event camera deraining framework that fully exploits the spatiotemporal characteristics of raw events and rain. Our framework introduces a 4D event cloud representation that integrates dual temporal scales to preserve high temporal precision, a Spatio-Temporal Decoupling and Fusion module (STDF) that enhances deraining capability by enabling shallow decoupling and interaction of temporal and spatial information, and a Multi-Scale State Space Model (MS3M) that captures deeper rain dynamics across dual-temporal and multi-spatial scales with linear computational complexity. Enhanced by frequency-domain regularization, PRE-Mamba achieves superior performance (0.95 SR, 0.91 NR, and 0.4s/M events) with only 0.26M parameters on EventRain-27K, a comprehensive dataset with labeled synthetic and real-world sequences. Moreover, our method generalizes well across varying rain intensities, viewpoints, and even snowy conditions. Code and dataset will be publicly available upon acceptance.
Poster
Tianhao Wu · Chuanxia Zheng · Frank Guan · Andrea Vedaldi · Tat-Jen Cham

[ Exhibit Hall I ]

Abstract
Most image-based 3D object reconstructors assume that objects are fully visible, ignoring occlusions that commonly occur in real-world scenarios. In this paper, we introduce Amodal3R, a conditional 3D generative model designed to reconstruct 3D objects from partial observations. We start from a "foundation" 3D generative model and extend it to recover plausible 3D geometry and appearance from occluded objects. We introduce a mask-weighted multi-head cross-attention mechanism followed by an occlusion-aware attention layer that explicitly leverages occlusion priors to guide the reconstruction process. We demonstrate that, by training solely on synthetic data, Amodal3R learns to recover full 3D objects even in the presence of occlusions in real scenes. It substantially outperforms existing methods that independently perform 2D amodal completion followed by 3D reconstruction, thereby establishing a new benchmark for occlusion-aware 3D reconstruction.
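The mask-weighted cross-attention component can be sketched as follows, under the assumption that a per-token visibility weight is applied multiplicatively to the attention distribution (implemented here as a log-bias on the logits); the actual layer in the paper may be wired differently, and all names are illustrative.

```python
import torch
import torch.nn.functional as F

def mask_weighted_cross_attention(q, k, v, vis_mask, eps=1e-6):
    """Cross-attention where occluded image tokens are down-weighted by a visibility mask.

    q: (B, Nq, D) 3D/latent tokens; k, v: (B, Nk, D) image tokens; vis_mask: (B, Nk) in [0, 1].
    """
    logits = q @ k.transpose(1, 2) / q.shape[-1] ** 0.5       # (B, Nq, Nk) scaled dot products
    logits = logits + torch.log(vis_mask + eps)[:, None, :]   # occluded tokens get a large negative bias
    attn = F.softmax(logits, dim=-1)
    return attn @ v

q = torch.randn(1, 32, 64)
k, v = torch.randn(1, 196, 64), torch.randn(1, 196, 64)
mask = (torch.rand(1, 196) > 0.3).float()                     # 1 = visible, 0 = occluded
print(mask_weighted_cross_attention(q, k, v, mask).shape)     # torch.Size([1, 32, 64])
```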
Poster
Sixiang Chen · Tian Ye · Yunlong Lin · Yeying Jin · Yijun Yang · Haoyu Chen · Jianyu Lai · Song Fei · Zhaohu Xing · Fugee Tsung · Lei Zhu

[ Exhibit Hall I ]

Abstract
Real-world image dehazing is crucial for enhancing visual quality in computer vision applications. However, existing physics-based haze generation paradigms struggle to model the complexities of real-world haze and lack controllability, limiting the performance of existing baselines on real-world images. In this paper, we introduce GenHaze, a pioneering haze generation framework that enables the one-step generation of high-quality, reference-controllable hazy images. GenHaze leverages the pre-trained latent diffusion model (LDM) with a carefully designed clean-to-haze generation protocol to produce realistic hazy images. Additionally, by leveraging its fast, controllable generation of paired high-quality hazy images, we illustrate that existing dehazing baselines can be unleashed in a simple and efficient manner. Extensive experiments indicate that GenHaze achieves visually convincing and quantitatively superior hazy images. It also significantly improves multiple existing dehazing models across 7 non-reference metrics with minimal fine-tuning epochs. Our work demonstrates that LDM possesses the potential to generate realistic degradations, providing an effective alternative to prior generation pipelines.
Poster
Junwei Luo · Yingying Zhang · Xue Yang · Kang Wu · Qi Zhu · Lei Liang · Jingdong Chen · Yansheng Li

[ Exhibit Hall I ]

Abstract
Efficient vision-language understanding of large Remote Sensing Images (RSIs) is meaningful but challenging. Current Large Vision-Language Models (LVLMs) typically employ limited pre-defined grids to process images, leading to information loss when handling gigapixel RSIs. Conversely, using unlimited grids significantly increases computational costs. To preserve image details while reducing computational complexity, we propose a text-guided token pruning method with Dynamic Image Pyramid (DIP) integration. Our method introduces: (i) a Region Focus Module (RFM) that leverages text-aware region localization capability to identify critical vision tokens, and (ii) a coarse-to-fine image tile selection and vision token pruning strategy based on DIP, which is guided by RFM outputs and avoids directly processing the entire large imagery. Additionally, existing benchmarks for evaluating LVLMs' perception ability on large RSI suffer from limited question diversity and constrained image sizes. We construct a new benchmark named LRS-VQA, which contains 7,333 QA pairs across 8 categories, with image length up to 27,328 pixels. Our method outperforms existing high-resolution strategies on four datasets using the same data. Moreover, compared to existing token reduction methods, our approach demonstrates higher efficiency under high-resolution settings. The code and dataset will be made publicly available.
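A hedged sketch of the text-guided pruning step: score each vision token against the text embedding and keep only the top-scoring tokens before they reach the language model. The cosine scoring and keep ratio below are simple stand-ins for the paper's Region Focus Module and coarse-to-fine tile selection, not their actual implementation.

```python
import torch
import torch.nn.functional as F

def prune_vision_tokens(vision_tokens, text_embedding, keep_ratio=0.25):
    """Keep the vision tokens most relevant to the text query (illustrative scoring).

    vision_tokens: (N, D) tokens from a large image pyramid; text_embedding: (D,).
    """
    scores = F.normalize(vision_tokens, dim=-1) @ F.normalize(text_embedding, dim=-1)
    k = max(1, int(keep_ratio * vision_tokens.shape[0]))
    keep_idx = scores.topk(k).indices.sort().values            # preserve original token order
    return vision_tokens[keep_idx], keep_idx

tokens = torch.randn(4096, 1024)                               # hypothetical tokens from a gigapixel RSI
text = torch.randn(1024)                                       # hypothetical text embedding
kept, idx = prune_vision_tokens(tokens, text)
print(kept.shape)                                              # torch.Size([1024, 1024])
```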
Poster
Florin-Alexandru Vasluianu · Tim Seizinger · Zongwei Wu · Radu Timofte

[ Exhibit Hall I ]

Abstract
Illumination in practical scenarios is inherently complex, involving colored light sources, occlusions, and diverse material interactions that produce intricate reflectance and shading effects. However, existing methods often oversimplify this challenge by assuming a single light source or uniform, white-balanced lighting, leaving many of these complexities unaddressed. In this paper, we introduce CL3AN, the first large-scale, high-resolution dataset of its kind designed to facilitate the restoration of images captured under multiple Colored Light sources to their Ambient-Normalized counterparts. Through benchmarking, we find that leading approaches often produce artifacts—such as illumination inconsistencies, texture leakage, and color distortion—primarily due to their limited ability to precisely disentangle illumination from reflectance. Motivated by this insight, we achieve such a desired decomposition through a novel learning framework that leverages explicit chromaticity-luminance components guidance, drawing inspiration from the principles of the Retinex model. Extensive evaluations on existing benchmarks and our dataset demonstrate the effectiveness of our approach, showcasing enhanced robustness under non-homogeneous color lighting and material-specific reflectance variations, all while maintaining a highly competitive computational cost. Our code and dataset will be made public upon acceptance.
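A classical chromaticity-luminance split in the Retinex spirit, which the described guidance presumably builds on, can be written in a few lines; the exact guidance terms in the paper are unknown, so this is illustration only.

```python
import torch

def chroma_luma_decompose(img, eps=1e-6):
    """Split an RGB image into a luminance map and a chromaticity ratio image.

    img: (B, 3, H, W) in [0, 1]. This is a textbook decomposition, not the paper's exact one.
    """
    luminance = img.max(dim=1, keepdim=True).values            # illumination-dominated component
    chromaticity = img / (luminance + eps)                      # reflectance-like per-channel ratio
    return chromaticity, luminance

x = torch.rand(2, 3, 64, 64)
chroma, luma = chroma_luma_decompose(x)
recon = chroma * luma                                           # exact reconstruction (up to eps)
print((recon - x).abs().max().item())
```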
Poster
Ruiyang Zhang · Hu Zhang · Zhedong Zheng

[ Exhibit Hall I ]

Abstract
Unsupervised 3D object detection aims to identify objects of interest from unlabeled raw data, such as LiDAR points. Recent approaches usually adopt pseudo 3D bounding boxes (3D bboxes) from a clustering algorithm to initialize the model training. However, pseudo bboxes inevitably contain noise, and such inaccuracies accumulate to the final model, compromising the performance. Therefore, in an attempt to mitigate the negative impact of inaccurate pseudo bboxes, we introduce a new uncertainty-aware framework for unsupervised 3D object detection, dubbed UA3D. In particular, our method consists of two phases: uncertainty estimation and uncertainty regularization. (1) In the uncertainty estimation phase, we incorporate an extra auxiliary detection branch alongside the original primary detector. The prediction disparity between the primary and auxiliary detectors could reflect fine-grained uncertainty at the box coordinate level. (2) Based on the assessed uncertainty, we adaptively adjust the weight of every 3D bbox coordinate via uncertainty regularization, refining the training process on pseudo bboxes. For pseudo bbox coordinates with high uncertainty, we assign a relatively low loss weight. Extensive experiments verify that the proposed method is robust against the noisy pseudo bboxes, yielding substantial improvements on nuScenes and Lyft compared to existing approaches, with increases of +3.9% AP$_{BEV}$ and +1.5% …
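The uncertainty regularization step can be sketched as a per-coordinate loss reweighting: the absolute disparity between the primary and auxiliary branch predictions acts as an uncertainty estimate, and coordinates with larger disparity receive exponentially smaller weight. The exact weighting function and box parameterization below are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def uncertainty_weighted_box_loss(pred_primary, pred_auxiliary, pseudo_boxes, tau=1.0):
    """Regression on noisy pseudo boxes, down-weighted where the two branches disagree.

    All tensors: (N, 7) boxes, e.g. (x, y, z, w, l, h, yaw); parameterization is illustrative.
    """
    disparity = (pred_primary - pred_auxiliary).abs().detach()   # per-coordinate uncertainty proxy
    weight = torch.exp(-disparity / tau)                         # high uncertainty -> low weight
    loss = weight * F.smooth_l1_loss(pred_primary, pseudo_boxes, reduction="none")
    return loss.mean()

primary = torch.randn(32, 7, requires_grad=True)
auxiliary = torch.randn(32, 7)
pseudo = torch.randn(32, 7)                                      # noisy pseudo labels from clustering
uncertainty_weighted_box_loss(primary, auxiliary, pseudo).backward()
print(primary.grad.shape)                                        # torch.Size([32, 7])
```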
Poster
Phillip Y. Lee · Jihyeon Je · Chanho Park · Mikaela Uy · Leonidas Guibas · Minhyuk Sung

[ Exhibit Hall I ]

Abstract
We present a framework for perspective-aware reasoning in vision-language models (VLMs) through mental imagery simulation. Perspective-taking - the ability to perceive an environment or situation from an alternative viewpoint - is a key benchmark for human-level visual understanding, essential for environmental interaction and collaboration with autonomous agents. Despite advancements in spatial reasoning within VLMs, recent research has shown that modern VLMs significantly lack perspective-aware reasoning capabilities and exhibit a strong bias toward egocentric interpretations. To bridge the gap between VLMs and human perception, we focus on the role of mental imagery, where humans perceive the world through abstracted representations that facilitate perspective shifts. Motivated by this, we propose a framework for perspective-aware reasoning that effectively leverages vision foundation models, such as object detection, segmentation, and orientation estimation, to construct scene abstractions and enable perspective transformations. Our experiments on synthetic and real-image benchmarks, tested across various VLMs, demonstrate consistent improvements in perspective-aware reasoning with our framework, outperforming fine-tuned spatial reasoning models and novel-view-synthesis-based approaches.
Poster
Jianzhe Gao · Rui Liu · Wenguan Wang

[ Exhibit Hall I ]

Abstract
Vision-language navigation (VLN) task requires an agent to traverse complex 3D environments based on natural language instructions, necessitating a thorough scene understanding. While existing works provide agents with various scene maps to enhance their spatial awareness, integrating 3D geometric priors and semantics into a unified map remains challenging. Moreover, these methods often neglect to account for the complex spatial relationships and the open nature of VLN scenarios in their map design, which limits their ability to generalize across diverse and unseen environments. To address these challenges, this work proposes a 3D Gaussian Map that represents the environment as a set of differentiable 3D Gaussians and accordingly develops a navigation strategy for VLN. Specifically, Egocentric Gaussian Map is constructed online by initializing 3D Gaussians from sparse pseudo-lidar point clouds, providing informative geometric priors to boost spatial awareness. Each Gaussian primitive is further enriched through Open-Set Semantic Grouping operation, which groups 3D Gaussians based on their membership in object instances or stuff categories within the open world. These processes result in a unified 3D Gaussian Map that integrates geometric priors with open-set semantics. Building on this map, Multi-Level Action Prediction strategy, which combines spatial-semantic cues at multiple granularities, is designed to assist …
Poster
Guanxing Lu · Baoxiong Jia · Puhao Li · Yixin Chen · Ziwei Wang · Yansong Tang · Siyuan Huang

[ Exhibit Hall I ]

Abstract
Training robot policies within a learned world model is trending due to the inefficiency of real-world interactions. The established image-based world models and policies have shown prior success, but lack robust geometric information that requires consistent spatial and physical understanding of the three-dimensional world, even when pre-trained on internet-scale video sources. To this end, we propose a novel branch of world model named **Gaussian World Model (GWM)** for robotic manipulation, which reconstructs the future state by inferring the propagation of Gaussian primitives under the effect of robot actions. At its core is a latent Diffusion Transformer (DiT) combined with a 3D variational autoencoder, enabling fine-grained scene-level future state reconstruction with Gaussian Splatting. GWM can not only enhance the visual representation for imitation learning agents by self-supervised future prediction training, but can also serve as a neural simulator that supports model-based reinforcement learning. Both simulated and real-world experiments show that GWM can precisely predict future scenes conditioned on diverse robot actions, and can be further utilized to train policies that outperform the state-of-the-art by impressive margins, showcasing the initial data scaling potential of 3D world models.
Poster
Haochen Wang · Yucheng Zhao · Tiancai Wang · Haoqiang Fan · Xiangyu Zhang · Zhaoxiang Zhang

[ Exhibit Hall I ]

Abstract
The rapid development of Large Multimodal Models (LMMs) for 2D images and videos has spurred efforts to adapt these models for interpreting 3D scenes. However, the absence of large-scale 3D vision-language datasets has posed a significant obstacle. To address this issue, typical approaches focus on injecting 3D awareness into 2D LMMs by designing 3D input-level scene representations. This work provides a new perspective. We introduce reconstructive visual instruction tuning with 3D-awareness (ROSS3D), which integrates 3D aware visual supervision into the training procedure. Specifically, it incorporates cross-view and global-view reconstruction. The former requires reconstructing masked views by aggregating overlapping information from other views. The latter aims to aggregate information from all available views to recover Bird’s-Eye-View images, contributing to a comprehensive overview of the entire scene. Empirically, ROSS3D achieves state-of-the-art performance across various 3D scene understanding benchmarks. More importantly, our semi-supervised experiments demonstrate significant potential in leveraging large amounts of unlabeled 3D vision-only data. The code will be made publicly available upon acceptance.
Poster
Yiming Zuo · Willow Yang · Zeyu Ma · Jia Deng

[ Exhibit Hall I ]

Abstract
Depth completion (DC) aims to predict a dense depth map from an RGB image and a sparse depth map. Existing DC methods generalize poorly to new datasets or unseen sparse depth patterns, limiting their real-world applications. We propose OMNI-DC, a highly robust DC model that generalizes well zero-shot to various datasets. The key design is a novel Multi-Resolution Depth Integrator, allowing our model to deal with very sparse depth inputs. We also introduce a novel Laplacian loss to model the ambiguity in the training process. Moreover, we train OMNI-DC on a mixture of high-quality datasets with a scale normalization technique and synthetic depth patterns. Extensive experiments on 7 datasets show consistent improvements over baselines, reducing errors by as much as 43%. Codes and checkpoints will be made public.
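The "Laplacian loss" is plausibly a Laplacian negative log-likelihood in which the network predicts a per-pixel scale alongside depth, letting ambiguous pixels widen their distribution rather than paying a full L1 penalty. The sketch below implements that common formulation as an assumption about the paper's loss, not its actual definition.

```python
import torch

def laplacian_nll(pred_depth, pred_log_b, gt_depth, valid_mask):
    """Negative log-likelihood of ground-truth depth under a per-pixel Laplace distribution.

    pred_depth, pred_log_b, gt_depth: (B, 1, H, W); valid_mask: boolean mask of supervised pixels.
    """
    b = pred_log_b.exp()                                        # predicted scale, models ambiguity
    nll = (pred_depth - gt_depth).abs() / b + pred_log_b        # -log Laplace(gt | pred, b) up to a constant
    return nll[valid_mask].mean()

d = torch.rand(2, 1, 64, 64, requires_grad=True)
log_b = torch.zeros_like(d, requires_grad=True)
gt = torch.rand(2, 1, 64, 64)
mask = gt > 0.1                                                 # e.g., pixels with valid ground truth
laplacian_nll(d, log_b, gt, mask).backward()
print(log_b.grad.abs().mean().item())
```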
Poster
Yichen Shen · Yijin Li · Shuo Chen · Guanglin Li · Zhaoyang Huang · Hujun Bao · Zhaopeng Cui · Guofeng Zhang

[ Exhibit Hall I ]

Abstract
Event cameras, known for their high temporal resolution and ability to capture asynchronous changes, have gained significant attention for their potential in feature tracking, especially in challenging conditions. However, event cameras lack the fine-grained texture information that conventional cameras provide, leading to error accumulation in tracking. To address this, we propose a novel framework, BlinkTrack, which integrates event data with grayscale images for high-frequency feature tracking. Our method extends the traditional Kalman filter into a learning-based framework, utilizing differentiable Kalman filters in both event and image branches. This approach improves single-modality tracking and effectively solves the data association and fusion from asynchronous event and image data. We also introduce new synthetic and augmented datasets to better evaluate our model. Experimental results indicate that BlinkTrack significantly outperforms existing methods, exceeding 80 FPS with multi-modality data and 100 FPS with preprocessed event data.
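For readers unfamiliar with the building block, a differentiable Kalman filter step is simply the standard predict-update recursion written with autograd-friendly tensor ops, so learned quantities (for example, a measurement noise predicted from the event or image branch) receive gradients. The sketch below uses a fixed constant-velocity model and placeholder matrices; it is not the paper's learned filter.

```python
import torch

def kalman_step(x, P, z, R_meas, dt=0.01, q=1e-4):
    """One differentiable predict-update step of a constant-velocity Kalman filter.

    x: (4,) state (px, py, vx, vy); P: (4, 4) covariance; z: (2,) position measurement;
    R_meas: (2, 2) measurement noise, e.g. a quantity a network could predict.
    """
    F = torch.eye(4); F[0, 2] = dt; F[1, 3] = dt              # constant-velocity transition
    H = torch.zeros(2, 4); H[0, 0] = 1.0; H[1, 1] = 1.0       # observe position only
    Q = q * torch.eye(4)                                      # process noise

    x_pred, P_pred = F @ x, F @ P @ F.T + Q                   # predict
    S = H @ P_pred @ H.T + R_meas                             # innovation covariance
    K = P_pred @ H.T @ torch.linalg.inv(S)                    # Kalman gain (differentiable)
    x_new = x_pred + K @ (z - H @ x_pred)                     # update
    P_new = (torch.eye(4) - K @ H) @ P_pred
    return x_new, P_new

x, P = torch.zeros(4), torch.eye(4)
z = torch.tensor([0.5, -0.2])
R_meas = (0.01 * torch.eye(2)).requires_grad_()               # stand-in for a learned noise term
x, P = kalman_step(x, P, z, R_meas)
x.sum().backward()                                            # gradients flow into R_meas
print(R_meas.grad)
```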
Poster
Regine Hartwig · Dominik Muhle · Riccardo Marin · Daniel Cremers

[ Exhibit Hall I ]

Abstract
Recent advancements in feature computation have revealed that self-supervised feature extractors can recognize semantic correspondences. However, these features often lack an understanding of objects' underlying 3D geometry. In this paper, we focus on learning features capable of semantically characterizing parts distinguished by their geometric properties, e.g., left/right eyes or front/back legs. We propose GECO, a novel, optimal-transport-based learning method that produces geometrically coherent features which characterize symmetric points well. GECO uses a lightweight model architecture that results in fast inference, capable of processing images at 30fps. Our method is interpretable and generalizes across datasets, achieving state-of-the-art performance on the PFPascal, APK, and CUB datasets, improving by 6.0%, 6.2%, and 4.1%, respectively. We achieve a speed-up of 98.2% compared to previous methods by using a smaller backbone and a more efficient training scheme. Finally, we find PCK insufficient to analyze the geometrical properties of the features. Hence, we expand our analysis, proposing novel metrics and insights that will be instrumental in developing more geometrically-aware methods.
Poster
Quankai Gao · Iliyan Georgiev · Tuanfeng Wang · Krishna Kumar Singh · Ulrich Neumann · Jae Shin Yoon

[ Exhibit Hall I ]

Abstract
3D generation has made significant progress; however, it still largely remains at the object level. Feedforward 3D scene-level generation has been rarely explored due to the lack of models capable of scaling up latent representation learning on 3D scene-level data. Unlike object-level generative models, which are trained on well-labeled 3D data in a bounded canonical space, scene-level generations with 3D scenes represented by 3D Gaussian Splatting (3DGS) are unbounded and exhibit scale inconsistency across different scenes, making unified latent representation learning for generative purposes extremely challenging. In this paper, we introduce Can3Tok, the first 3D scene-level variational autoencoder (VAE) capable of encoding a large number of Gaussian primitives into a low-dimensional latent embedding, which effectively captures both semantic and spatial information of the inputs. Beyond model design, we propose a general pipeline for 3D scene data processing to address the scale inconsistency issue. We validate our method on the recent scene-level 3D dataset DL3DV-10K, where we found that only Can3Tok successfully generalizes to novel 3D scenes, while compared methods fail to converge on even a few hundred scene inputs during training and exhibit zero generalization ability during inference. Finally, we demonstrate image-to-3DGS and text-to-3DGS generation as applications to demonstrate its ability to …
Poster
Pradyumn Goyal · Dmitrii Petrov · Sheldon Andrews · Yizhak Ben-Shabat · Hsueh-Ti Derek Liu · Evangelos Kalogerakis

[ Exhibit Hall I ]

Abstract
We present GEOPARD, a transformer-based architecture for predicting articulation from a single static snapshot of a 3D shape. The key idea of our method is a pretraining strategy that allows our transformer to learn plausible candidate articulations for 3D shapes based on a geometry-driven search, without manual articulation annotation. The search automatically discovers physically valid part motions that do not cause detachments or collisions with other shape parts. Our experiments indicate that this geometric pretraining strategy, along with carefully designed choices in our transformer architecture, yields state-of-the-art results for articulation inference on the popular Part-Mobility shape dataset.
Poster
Zengyu Wan · Wei Zhai · Yang Cao · Zheng-Jun Zha

[ Exhibit Hall I ]

Abstract
Visual 3D motion estimation aims to infer the motion of 2D pixels in 3D space based on visual cues. The key challenge arises from spatio-temporal motion inconsistencies induced by depth variation, which disrupt the assumptions of local spatial or temporal motion smoothness in previous motion estimation frameworks. In contrast, event cameras offer new possibilities for 3D motion estimation through continuous adaptive pixel-level responses to scene changes. This paper presents EMoTive, a novel event-based framework that models spatio-temporal trajectories via event-guided non-uniform parametric curves, effectively characterizing locally heterogeneous spatio-temporal motion. Specifically, we first introduce the Event Kymograph, an event projection method that leverages a continuous temporal projection kernel and decouples spatial observations to encode fine-grained temporal evolution explicitly. For motion representation, we introduce a density-aware adaptation mechanism to fuse spatial and temporal features under event guidance, coupled with a non-uniform rational curve parameterization framework to adaptively model heterogeneous trajectories. The final 3D motion estimation is achieved through multi-temporal sampling of parametric trajectories, yielding optical flow and depth motion fields. To facilitate evaluation, we introduce CarlaEvent3D, a multi-dynamic synthetic dataset for comprehensive validation. Extensive experiments on both this dataset and a real-world benchmark demonstrate the effectiveness of the proposed method.
Poster
Philipp Wulff · Felix Wimbauer · Dominik Muhle · Daniel Cremers

[ Exhibit Hall I ]

Abstract
Volumetric scene reconstruction from a single image is crucial for a broad range of applications like autonomous driving and robotics. Recent volumetric reconstruction methods achieve impressive results, but generally require expensive 3D ground truth or multi-view supervision. We propose to leverage pre-trained 2D diffusion models and depth prediction models to generate synthetic scene geometry from a single image. This can then be used to distill a feed-forward scene reconstruction model. Our experiments on the challenging KITTI-360 and Waymo datasets demonstrate that our method matches or outperforms state-of-the-art baselines that use multi-view supervision, and offers unique advantages, for example regarding dynamic scenes.
Poster
Chenhang Ying · Huiyu Yang · Jieyi Ge · Zhaodong Sun · Xu Cheng · Kui Ren · Xiaobai Li

[ Exhibit Hall I ]

Abstract
Remote physiological measurement using visible light cameras has emerged as a powerful tool for non-contact health monitoring, yet its reliability degrades under challenging conditions such as low-light environments or diverse skin tones. These limitations have motivated the exploration of alternative sensing modalities, such as near-infrared sensors and radar systems, which offer complementary physiological information due to their distinct sensing principles. However, existing methods fail to holistically integrate these heterogeneous data. Our key insight is that while visible light, near-infrared, and radar operate on distinct physical principles, they all capture temporally dynamic physiological signatures that can be represented as time-varying signals reflecting underlying physiological processes. Based on this insight, we propose FusionPhys, a novel framework that implements an adaptive integration mechanism to refine physiological information across complementary modalities. We further introduce a sub-modality embedding technique that extends fusion principles to single-modality videos. Extensive experiments across five benchmark datasets demonstrate that FusionPhys achieves competitive performance in diverse sensing configurations, representing a significant advancement toward more reliable and versatile remote physiological measurement systems.
Poster
Yuanhong Yu · Xingyi He · Chen Zhao · Junhao Yu · Jiaqi Yang · Ruizhen Hu · Yujun Shen · Xing Zhu · Xiaowei Zhou · Sida Peng

[ Exhibit Hall I ]

Abstract
This paper presents a generalizable RGB-based approach for object pose estimation, specifically designed to address challenges in sparse-view settings. While existing methods can estimate the poses of unseen objects, their generalization ability remains limited in scenarios involving occlusions and sparse reference views, restricting their real-world applicability. To overcome these limitations, we introduce corner points of the object bounding box as an intermediate representation of the object pose. The 3D object corners can be reliably recovered from sparse input views, while the 2D corner points in the target view are estimated through a novel reference-based point synthesizer, which works well even in scenarios involving occlusions. As object semantic points, object corners naturally establish 2D-3D correspondences for object pose estimation with a PnP algorithm. Extensive experiments on the YCB-Video and Occluded-LINEMOD datasets show that our approach outperforms state-of-the-art methods, highlighting the effectiveness of the proposed representation and significantly enhancing the generalization capabilities of object pose estimation, which is crucial for real-world applications. The code will be released for reproducibility.
Poster
Shuo LIANG · Yiwu Zhong · Zi-Yuan Hu · Yeyao Tao · Liwei Wang

[ Exhibit Hall I ]

Abstract
Spatiotemporal video grounding aims to localize target entities in videos based on textual queries, yet existing studies predominantly focus on exocentric videos. In comparison, egocentric video grounding remains underexplored despite its wide applications such as augmented reality and robotics. In this work, we conduct a systematic analysis of the discrepancies between egocentric and exocentric videos, revealing key challenges such as shorter object durations, sparser trajectories, smaller object sizes, and larger positional shifts. Further, we introduce EgoMask, the first pixel-level benchmark for fine-grained spatiotemporal grounding in egocentric videos. It is constructed by our proposed automatic annotation pipeline which annotates referring expressions and object masks across short-, mid-, and long-term videos. Additionally, we create EgoMask-Train, a large-scale training dataset to facilitate model development. Experiments demonstrate that the state-of-the-art spatiotemporal grounding models perform poorly on our benchmark EgoMask, but fine-tuning on EgoMask-Train yields significant improvements, while preserving performance on exocentric datasets. Our work thus provides essential resources and insights for advancing egocentric video understanding.
Poster
Shunya Nagashima · Komei Sugiura

[ Exhibit Hall I ]

Abstract
Accurate, reliable solar flare prediction is crucial for mitigating potential disruptions to critical infrastructure, yet predicting solar flares remains a significant challenge. Existing methods based on heuristic physical features often lack representation learning from solar images. On the other hand, end-to-end learning approaches struggle to model long-range temporal dependencies in solar images. In this study, we propose the Deep Space Weather Model (Deep SWM), which is based on multiple deep state space models for handling both ten-channel solar images and long-range spatio-temporal dependencies. Deep SWM also features a sparse masked autoencoder, a novel pretraining strategy that employs a two-phase masking approach to preserve crucial regions such as sunspots while compressing spatial information. Furthermore, we built FlareBench, a new public benchmark for solar flare prediction covering a full 11-year solar activity cycle, to validate our method. Our method outperformed both baseline methods and human experts on standard metrics of performance and reliability. The project page can be found at https://iccv25-6qrol.kinsta.page.
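The two-phase masking strategy is only described at a high level; one simple way to "preserve crucial regions such as sunspots while compressing spatial information" is to restrict MAE-style random patch masking to non-salient patches. The sketch below illustrates that idea under this assumption (the saliency map, patch size, and mask ratio are hypothetical) and is not Deep SWM's actual procedure.

```python
import numpy as np

def region_preserving_mask(saliency, patch=16, mask_ratio=0.75, rng=None):
    """Return a boolean patch mask that never covers salient patches.

    saliency: (H, W) bool array marking regions to preserve (e.g., sunspots).
    A patch is maskable only if it contains no salient pixel.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    H, W = saliency.shape
    gh, gw = H // patch, W // patch
    patch_salient = saliency[:gh * patch, :gw * patch] \
        .reshape(gh, patch, gw, patch).any(axis=(1, 3))
    maskable = np.flatnonzero(~patch_salient.ravel())
    n_mask = min(int(mask_ratio * gh * gw), len(maskable))
    chosen = rng.choice(maskable, size=n_mask, replace=False)
    mask = np.zeros(gh * gw, dtype=bool)
    mask[chosen] = True
    return mask.reshape(gh, gw)

saliency = np.zeros((128, 128), dtype=bool)
saliency[40:60, 40:60] = True                 # pretend sunspot region
mask = region_preserving_mask(saliency)       # True = patch gets masked
```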
Poster
Youngho Kim · Hoonhee Cho · Kuk-Jin Yoon

[ Exhibit Hall I ]

Abstract
Human pose estimation is critical for applications such as rehabilitation, sports analytics, and AR/VR systems. However, rapid motion and low-light conditions often introduce motion blur, significantly degrading pose estimation due to the domain gap between sharp and blurred images. Most datasets assume stable conditions, making models trained on sharp images struggle in blurred environments. To address this, we introduce a novel domain adaptation approach that leverages event cameras, which capture high temporal resolution motion data and are inherently robust to motion blur. Using event-based augmentation, we generate motion-aware blurred images, effectively bridging the domain gap between sharp and blurred domains without requiring paired annotations. Additionally, we develop a student-teacher framework that iteratively refines pseudo-labels, leveraging mutual uncertainty masking to eliminate incorrect labels and enable more effective learning. Experimental results demonstrate that our approach outperforms conventional domain-adaptive human pose estimation methods, achieving robust pose estimation under motion blur without requiring annotations in the target domain. Our findings highlight the potential of event cameras as a scalable and effective solution for domain adaptation in real-world motion blur environments. We will make our code publicly available.
Poster
James Amato · Yunan Xie · Leonel Medina-Varela · Ammar Aljerwi · Adam McCutcheon · T. Rippentrop · Kristian Gonzalez · Jacques Delabrouille · Mustapha Ishak · Nicholas Ruozzi

[ Exhibit Hall I ]

Abstract
The Cosmic Microwave Background (CMB) radiation is a pillar of modern cosmology. This GHz-range signal enables a better understanding of the fundamental parameters of the universe, but it requires sophisticated signal separation. While the astrophysics community has developed computational methods, the adoption of computer-vision methods for these tasks has been proposed by several groups. Results are difficult to compare, as the underlying datasets and evaluations are inconsistent and have not been made publicly available. We propose CMB-ML, a dataset and library that integrates dataset creation, model inference, and result evaluation into a pipeline. The library and links to data are available on GitHub at https://github.com/iccv-author-5412/cmb-ml.
Poster
Yue Li · Meng Tian · Zhenyu Lin · Jiangtong Zhu · Dechang Zhu · Haiqiang Liu · Yueyi Zhang · Zhiwei Xiong · Xinhai Zhao

[ Exhibit Hall I ]

Abstract
Existing benchmarks for Vision-Language Models (VLMs) on autonomous driving (AD) primarily assess interpretability through open-form visual question answering (QA) within coarse-grained tasks, which remain insufficient to assess capabilities in complex driving scenarios. To this end, we introduce $\textbf{VLADBench}$, a challenging and fine-grained dataset featuring close-form QAs that progress from static foundational knowledge and elements to advanced reasoning for dynamic on-road situations. The elaborate $\textbf{VLADBench}$ spans 5 key domains: Traffic Knowledge Understanding, General Element Recognition, Traffic Graph Generation, Target Attribute Comprehension, and Ego Decision-Making and Planning. These domains are further broken down into 11 secondary aspects and 29 tertiary tasks for a granular evaluation. A thorough assessment of general and domain-specific (DS) VLMs on this benchmark reveals both their strengths and critical limitations in AD contexts. To further exploit the cognitive and reasoning interactions among the 5 domains for AD understanding, we start from a small-scale VLM and train the DS models on individual domain datasets (collected from 1.4M DS QAs across public sources). The experimental results demonstrate that the proposed benchmark provides a crucial step toward a more comprehensive assessment of VLMs in AD, paving the way for the development of more cognitively sophisticated and reasoning-capable AD systems.
Poster
Chanhwi Jeong · Inhwan Bae · Jin-Hwi Park · Hae-Gon Jeon

[ Exhibit Hall I ]

Abstract
Zero-shot depth completion with metric scales poses significant challenges, primarily due to performance limitations such as domain specificity and sensor characteristics. One recently emerging solution is to integrate monocular depth foundation models into depth completion frameworks, yet these efforts still face issues with suboptimal performance and often require further adaptation to the target task. Surprisingly, we find that simple test-time training, which fine-tunes monocular depth foundation models on sparse depth measurements from sensors as they are, yields reasonable results. However, this test-time training obviously incurs high computational costs and introduces biases towards specific conditions, making it impractical for real-world scenarios. In this paper, we introduce a new approach toward parameter-efficient zero-shot depth completion. The key idea of this work is to leverage visual prompt tuning, achieving sensor-specific depth scale adaptation without forgetting foundational knowledge. Experimental results on diverse datasets demonstrate that our approach outperforms relevant state-of-the-art methods, showing superior generalization and efficiency. Our source code is available in the supplementary materials.
Poster
Liuyi Wang · Xinyuan Xia · Hui Zhao · Hanqing Wang · Tai Wang · Yilun Chen · Chengju Liu · Qijun Chen · Jiangmiao Pang

[ Exhibit Hall I ]

Abstract
Recent Vision-and-Language Navigation (VLN) advancements are promising, but their idealized assumptions about robot movement and control fail to reflect physically embodied deployment challenges. To bridge this gap, we introduce VLN-PE, a physically realistic VLN platform supporting humanoid, quadruped, and wheeled robots. For the first time, we systematically evaluate several ego-centric VLN methods in physical robotic settings across different technical pipelines, including classification models for single-step discrete action prediction, a diffusion model for dense waypoint prediction, and a training-free, map-based large language model (LLM) integrated with path planning. Our results reveal significant performance degradation due to limited robot observation space, environmental lighting variations, and physical challenges like collisions and falls. This also exposes locomotion constraints for legged robots in complex environments. VLN-PE is highly extensible, allowing seamless integration of new scenes beyond MP3D, thereby enabling more comprehensive VLN evaluation. Despite the weak generalization of current models in physical deployment, VLN-PE provides a new pathway for improving overall cross-embodiment adaptability. We hope our findings and tools inspire the community to rethink VLN limitations and advance robust, practical VLN models. Our code will be publicly released.
Poster
Matan Kichler · Shai Bagon · Mark Sheinin

[ Exhibit Hall I ]

Abstract
Computer vision seeks to infer a wide range of information about scene objects and events. However, vision systems based on conventional imaging are limited to extracting information only from the visible surfaces of scene objects. For instance, a vision system can detect and identify a Coke can in the scene but cannot determine whether it is full or empty. In this paper, we seek to extend the scope of computer vision to include the novel task of inferring the hidden liquid levels of opaque containers by sensing the tiny vibrations on their surfaces. First, we propose a novel speckle-based vibration sensing system for capturing scene vibrations on a 2D grid of points, at once. We use our system to efficiently and remotely capture a dataset of vibration responses for a plurality of everyday liquid containers. Then, we develop a transformer-based approach for analyzing the captured vibrations and classifying the container type and its hidden liquid level at measurement time. Our architecture is invariant to the vibration source, yielding correct liquid level estimates for controlled and ambient scene sound sources. Moreover, we show that the model can generalize to unseen container instances and fluid levels. We demonstrate our method by recovering …
Poster
Ren-Jie Lu · Yu Zhou · hao cheng · Jingke Meng · Wei-Shi Zheng

[ Exhibit Hall I ]

Abstract
Vision and Language Navigation (VLN) requires agents to navigate 3D environments by following natural language instructions. While existing methods predominantly assume access to panoramic observations, many practical robots are equipped with monocular RGBD cameras, creating a significant configuration disparity. In this work, we address this critical gap by developing a novel 3DGS-based framework for monocular VLN agents, focusing on the intrinsic information incompleteness challenge. Our approach incorporates two key innovations: (1) an implicit partial completion module for inferring representations of missing regions in incompletely rendered panoramic feature maps, and (2) an uncertainty-aware active perception strategy that enables the agent to actively acquire visual observations when uncertain about its decision. Extensive experiments on the R2R-CE and RxR-CE datasets demonstrate that our monoVLN outperforms all existing monocular methods, improving the success rate on R2R-CE by 8\% compared to previous monocular methods. We also validate our monoVLN in real-world environments, providing a practical solution for real-world VLN. Furthermore, our findings challenge the conventional wisdom regarding panoramic observations, suggesting they may not be the optimal configuration and providing insights for future research directions in the VLN literature. Code will be released.
Poster
Seunggeun Chi · Pin-Hao Huang · Enna Sachdeva · Kwonjoon Lee

[ Exhibit Hall I ]

Abstract
Amodal completion, the task of inferring the complete appearance of objects despite partial occlusions, is crucial for understanding complex human–object interactions (HOI) in computer vision and robotics. Existing methods, including pre-trained diffusion models, often struggle to generate plausible completions in dynamic scenarios due to their limited understanding of HOI. To address this challenge, we propose a novel approach that leverages physical prior knowledge alongside a specialized multi-regional inpainting technique tailored for HOI. By incorporating physical constraints derived from human topology and contact information, we define two distinct regions: the primary region, where occluded object parts are most likely to reside, and the secondary region, where occlusions are less probable. Our multi-regional inpainting method employs customized denoising strategies across these regions within a diffusion model, thereby enhancing the accuracy and realism of generated completions in both shape and visual detail. Experimental results demonstrate that our approach substantially outperforms existing methods in HOI scenarios, advancing machine perception toward a more human-like understanding of dynamic environments. Furthermore, we show that our pipeline remains robust even without ground-truth contact annotations, broadening its applicability to tasks such as 3D reconstruction and novel view/pose synthesis. Code will be made publicly available upon acceptance.
Poster
Hiroyasu Akada · Jian Wang · Vladislav Golyanik · Christian Theobalt

[ Exhibit Hall I ]

Abstract
Egocentric 3D human pose estimation has been actively studied using cameras installed in front of a head-mounted device (HMD). While frontal placement is optimal, and the only option, for some tasks such as hand tracking, it remains unclear if the same holds for full-body tracking due to self-occlusion and limited field-of-view coverage. Notably, even the state-of-the-art methods often fail to estimate accurate 3D poses in many scenarios, such as when HMD users tilt their heads upward---a common motion in human activities. A key limitation of existing HMD designs is their neglect of the back of the body, despite its potential to provide crucial 3D reconstruction cues. Hence, this paper investigates the usefulness of rear cameras in the HMD design for full-body tracking. We also show that simply adding rear views to the frontal inputs is not optimal for existing methods due to their dependence on individual 2D joint detectors without effective multi-view integration. To address this issue, we propose a new transformer-based method that refines 2D joint heatmap estimation with multi-view information and heatmap uncertainty, thereby improving 3D pose tracking. Moreover, we introduce two new large-scale datasets, Ego4View-Syn and Ego4View-RW, for a rear-view evaluation. Our experiments show that the …
Poster
Chengxuan Zhu · Qingnan Fan · Qi Zhang · Jinwei Chen · Huaqi Zhang · Chao Xu · Boxin Shi

[ Exhibit Hall I ]

Abstract
We introduce a novel lens blur rendering approach that leverages a generative diffusion prior to achieve physically accurate outcomes. Previous lens blur methods are bounded by the accuracy of depth estimation methods, thus introducing artifacts at depth discontinuities. Our method employs a physics-inspired self-attention module that aligns with the image formation process, incorporating a depth-dependent circle of confusion constraint and self-occlusion effects. We adapt the diffusion model to a one-step inference scheme without introducing additional noise, and achieve results of high quality and fidelity. To address the lack of scalable paired training data, we propose to synthesize photorealistic foregrounds with transparency using diffusion models, balancing image authenticity and scene diversity.
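For reference, the depth-dependent circle-of-confusion constraint mentioned above is typically grounded in the thin-lens model; a minimal helper computing the blur-circle diameter is sketched below. The function and parameter names are ours, and the paper's module may use a different parameterization.

```python
def circle_of_confusion(depth, focus_dist, focal_len, f_number):
    """Thin-lens blur-circle diameter (all distances in the same length unit).

    c = A * |depth - focus_dist| / depth * focal_len / (focus_dist - focal_len),
    with aperture diameter A = focal_len / f_number.
    """
    aperture = focal_len / f_number
    return (aperture * abs(depth - focus_dist) / depth
            * focal_len / (focus_dist - focal_len))

# e.g., a 50 mm lens at f/1.8 focused at 2 m, object at 4 m (values in mm)
c = circle_of_confusion(depth=4000.0, focus_dist=2000.0,
                        focal_len=50.0, f_number=1.8)
```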
Poster
Jiahao Wu · Rui Peng · Jianbo Jiao · Jiayu Yang · Luyang Tang · Kaiqiang Xiong · Jie Liang · Jinbo Yan · runling liu · Ronggang Wang

[ Exhibit Hall I ]

Abstract
Due to the complex and highly dynamic motions in the real world, synthesizing dynamic videos from multi-view inputs for arbitrary viewpoints is challenging. Previous works based on neural radiance fields or 3D Gaussian splatting are limited to modeling fine-scale motion, greatly restricting their application. In this paper, we introduce our method, which consists of two parts that adapt it to both large-scale and fine-scale motion scenes: 1) We decompose a complex dynamic scene into streamlined local spaces defined by seeds, enabling global modeling by capturing motion within each local space. 2) We decouple static and dynamic features for local space motion modeling. A static feature shared across time steps captures static information, while a dynamic residual field provides time-specific features. These are combined and decoded to generate Temporal Gaussians, modeling motion within each local space. As a result, we propose a novel dynamic scene reconstruction framework to model highly dynamic real-world scenes more realistically. Our method not only demonstrates competitive performance on various fine-scale datasets compared to state-of-the-art (SOTA) methods, but also represents the first attempt to model larger and more complex highly dynamic scenes. Code and models will be made publicly available.
Poster
Jiajin Tang · Zhengxuan Wei · Ge Zheng · Sibei Yang

[ Exhibit Hall I ]

Abstract
Humans can perform previously unexperienced interactions with novel objects simply by observing others engage with them. Weakly-supervised affordance grounding mimics this process by learning to locate object regions that enable actions on egocentric images, using exocentric interaction images with image-level annotations. However, extracting affordance knowledge solely from exocentric images and transferring it one-way to egocentric images limits the applicability of previous works in complex interaction scenarios. Instead, this study introduces TransLoop, a novel closed-loop framework that not only transfers knowledge from exocentric to egocentric images, but also transfers knowledge back to enhance exocentric knowledge extraction. Within TransLoop, several innovative mechanisms are introduced, including unified cross-modal localization and denoising knowledge distillation, to bridge domain gaps between object-centered egocentric and interaction-centered exocentric images, while enhancing knowledge transfer. Experiments show that TransLoop achieves consistent improvements across all metrics on image and video benchmarks, even handling challenging scenarios where object interaction regions are fully occluded by the human body.
Poster
Nahyuk Lee · Juhong Min · Junhong Lee · Chunghyun Park · Minsu Cho

[ Exhibit Hall I ]

Abstract
This paper introduces a new shape-matching methodology, combinative matching, to combine interlocking parts for geometric shape assembly. Previous methods for geometric assembly typically rely on aligning parts by finding identical surfaces between the parts as in conventional shape matching and registration. In contrast, we explicitly model two distinct properties of interlocking shapes: 'identical surface shape' and 'opposite volume occupancy.' Our method thus learns to establish correspondences across regions where their surface shapes appear identical but their volumes occupy the inverted space to each other. To facilitate this process, we also learn to align regions in rotation by estimating their shape orientations via equivariant neural networks. The proposed approach significantly reduces local ambiguities in matching and allows a robust combination of parts in assembly. Experimental results on geometric assembly benchmarks demonstrate the efficacy of our method, consistently outperforming the state of the art.
Poster
Yihan Cao · Jiazhao Zhang · Zhinan Yu · Shuzhen Liu · Zheng Qin · Qin Zou · Bo Du · Kai Xu

[ Exhibit Hall I ]

Abstract
Object goal navigation (ObjectNav) is a fundamental task in embodied AI, requiring an agent to locate a target object in previously unseen environments. This task is particularly challenging because it requires both perceptual and cognitive processes, including object recognition and decision-making. While substantial advancements in perception have been driven by the rapid development of visual foundation models, progress on the cognitive aspect remains constrained, primarily limited to either implicit learning through simulator rollouts or explicit reliance on predefined heuristic rules. Inspired by neuroscientific findings demonstrating that humans maintain and dynamically update fine-grained cognitive states during object search tasks in novel environments, we propose CogNav, a framework designed to mimic this cognitive process using large language models. Specifically, we model the cognitive process using a finite state machine comprising fine-grained cognitive states, ranging from exploration to identification. Transitions between states are determined by a large language model based on a dynamically constructed heterogeneous cognitive map, which contains spatial and semantic information about the scene being explored. Extensive evaluations on the HM3D, MP3D, and RoboTHOR benchmarks demonstrate that our cognitive process modeling improves the success rate of ObjectNav by at least 14% relative to the state of the art. The code has been submitted …
Poster
Xinggang Hu · Chenyangguang Zhang · Mingyuan Zhao · Yuanze Gui · Xiangkui Zhang · Xiangyang Ji

[ Exhibit Hall I ]

Abstract
In dynamic scenes, achieving accurate camera localization and reconstructing a long-term consistent map containing only the static background are two major challenges faced by Visual Simultaneous Localization and Mapping (VSLAM). In current traditional dynamic VSLAM systems, the methods used to handle dynamic objects are primarily designed for localization; if applied to reconstruction, they are prone to introducing motion artifacts. Meanwhile, mask compensation strategies in NeRF- or 3DGS-based dynamic VSLAM systems also face challenges, such as the inability to completely eliminate dynamic object artifacts and low real-time performance. To address these issues, we leverage object detection to extract semantic information and propose a dynamic feature detection algorithm based on both geometry and appearance. This algorithm accurately identifies known and unknown moving objects and determines their actual motion states. To mitigate the issue of insufficient detection box coverage, we design a dynamic object box correction algorithm based on clustering and Gaussian mixture models to comprehensively identify moving object regions. Furthermore, to overcome the limitations of sparse features in texture-scarce environments, we introduce a feature densification strategy based on image texture complexity, enhancing reconstruction quality while maintaining real-time performance. Extensive experimental evaluations demonstrate that our system achieves state-of-the-art localization and reconstruction performance in …
Poster
Sivan Doveh · Nimrod Shabtay · Eli Schwartz · Leonid Karlinsky · Raja Giryes · Hilde Kuehne · Rogerio Feris · James Glass · Assaf Arbelle · Shimon Ullman · Muhammad Jehanzeb Mirza

[ Exhibit Hall I ]

Abstract
Vision-Language Models (VLMs) have shown remarkable capabilities across diverse visual tasks, including image recognition, video understanding, and Visual Question Answering (VQA) when explicitly trained for these tasks. Despite these advances, we find that present-day VLMs (including the proprietary GPT-4o) lack a fundamental cognitive ability: learning to localize specific objects in a scene by taking the context into account. In this work, we focus on the task of few-shot personalized localization, where a model is given a small set of annotated images (in-context examples) -- each with a category label and bounding box -- and is tasked with localizing the same object type in a query image. Personalized localization can be particularly important in cases of ambiguity, where several related objects match a textual description or where an object is hard to describe with words. To elicit personalized localization abilities in models, we present a data-centric solution that fine-tunes them using carefully curated data from video object tracking datasets. By leveraging sequences of frames tracking the same object across multiple shots, we simulate instruction-tuning dialogues that promote context awareness. To reinforce this, we introduce a novel regularization technique that replaces object labels with pseudo-names, ensuring the model relies on visual context …
Poster
Jiasheng Guo · Xin Gao · Yuxiang Yan · Guanghao Li · Jian Pu

[ Exhibit Hall I ]

Abstract
Low-light object detection is crucial for many real-world applications but remains challenging due to degraded image quality. While recent studies have shown that RAW images offer superior potential over RGB images, existing approaches either use RAW-RGB images with information loss or employ complex frameworks. To address these issues, we propose a lightweight and self-adaptive Image Signal Processing (ISP) plugin, Dark-ISP, which directly processes Bayer RAW images in dark environments, enabling seamless end-to-end training for object detection. Our key innovations are: (1) We deconstruct conventional ISP pipelines into sequential linear (sensor calibration) and nonlinear (tone mapping) sub-modules, recasting them as differentiable components optimized through task-driven losses. Each module is equipped with content-aware adaptability and physics-informed priors, enabling automatic RAW-to-RGB conversion aligned with detection objectives. (2) By exploiting the ISP pipeline's intrinsic cascade structure, we devise a self-boosting strategy that facilitates cooperation between sub-modules. Through extensive experiments on three RAW image datasets, we demonstrate that our method outperforms state-of-the-art RGB- and RAW-based detection approaches, achieving superior results with minimal parameters in challenging low-light environments.
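As a toy illustration of the linear-then-nonlinear decomposition described in (1), the sketch below chains a differentiable calibration stage (black level and per-channel gain) with a learnable tone-mapping curve, so both can be optimized through a detection loss. It is a deliberately simplified stand-in, not Dark-ISP's content-aware modules; all names and defaults are ours.

```python
import torch
import torch.nn as nn

class TinyDifferentiableISP(nn.Module):
    """Linear calibration followed by nonlinear tone mapping, trainable
    end to end with a downstream detector loss. Toy stand-in only."""
    def __init__(self, channels=4):                      # RGGB Bayer planes
        super().__init__()
        self.black_level = nn.Parameter(torch.zeros(1, channels, 1, 1))
        self.gain = nn.Parameter(torch.ones(1, channels, 1, 1))
        self.log_gamma = nn.Parameter(torch.tensor(0.0))  # gamma = exp(log_gamma)

    def forward(self, raw):                               # raw: (B, 4, H, W) in [0, 1]
        x = (raw - self.black_level) * self.gain          # linear: sensor calibration
        x = x.clamp(min=1e-6, max=1.0)
        gamma = torch.exp(self.log_gamma)
        return x ** gamma                                 # nonlinear: tone mapping

isp = TinyDifferentiableISP()
processed = isp(torch.rand(2, 4, 64, 64))                 # feed into a detector next
```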
Poster
Romain Thoreau · Valerio Marsocci · Dawa Derksen

[ Exhibit Hall I ]

Abstract
As large-scale heterogeneous data sets become increasingly available, adapting Foundation Models at low cost has become a key issue. Seminal works in natural language processing, e.g. Low-Rank Adaptation (LoRA), leverage the low "intrinsic rank" of parameter updates during adaptation. In this paper, we argue that stronger inductive biases on the data and on the models can improve the adaptation of Foundation Models pretrained on RGB satellite images to other sources of satellite data. The pretrained parameters of Geospatial Foundation Models (GFMs) indeed provide a strong prior on the spatial dimension of multispectral images. For this reason, we introduce DEFLECT (Deflecting Embeddings for Finetuning Latent representations for Earth and Climate Tasks), a novel strategy for adapting GFMs to multispectral satellite imagery with very few additional parameters. DEFLECT improves the representation capabilities of the extracted features, particularly enhancing spectral information, which is essential for geoscience and environmental-related tasks. We demonstrate the effectiveness of our method across three different GFMs and five diverse datasets, ranging from forest monitoring to marine environment segmentation. Compared to competing methods, DEFLECT achieves on-par or higher accuracy with 5-10x fewer parameters for classification and segmentation tasks. The code will be made publicly available.
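For context on the LoRA baseline the abstract cites, the sketch below shows the standard low-rank update y = Wx + (alpha/r) * B(Ax) around a frozen linear layer. DEFLECT itself departs from this generic update by adding data- and model-specific inductive biases, which are not reproduced here; the hyperparameters and class name below are illustrative.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update (LoRA-style)."""
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                  # keep pretrained weights frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T

layer = LoRALinear(nn.Linear(768, 768), r=8)
out = layer(torch.randn(4, 768))    # only A and B receive gradients
```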
Poster
Ran Zhao · Xinxin Dai · Pengpeng Hu · Vasile Palade · Adrian Munteanu

[ Exhibit Hall I ]

Abstract
While automatic anthropometric measurement extraction has witnessed growth in recent years, effective, non-contact, and precise measurement methods for dressed humans in arbitrary poses are still lacking, limiting the widespread application of this technology. The occlusion caused by clothing and the adverse influence of posture on body shape significantly increase the complexity of this task. Additionally, current methods often assume the availability of a complete 3D body mesh in a canonical pose (e.g., "A" or "T" pose), which is not always the case in practice. To address these challenges, we propose MeasureXpert, a novel learning-based model that requires only two unregistered, partial, and dressed body scans as input, and accommodates entirely independent and arbitrary poses for each scan. MeasureXpert computes a comprehensive representation of the naked body shape by synergistically fusing features from the front- and back-view partial point clouds. The comprehensive representation obtained is mapped onto a 3D undressed body shape space, assuming a canonical posture and incorporating predefined measurement landmarks. A point-based offset optimization is also developed to refine the reconstructed complete body shape, enabling accurate regression of measurement values. To train the proposed model, a new large-scale dataset, consisting of 300K samples, was synthesized. The proposed model was …
Poster
Dimitrije Antić · Georgios Paschalidis · Shashank Tripathi · Theo Gevers · Sai Kumar Dwivedi · Dimitrios Tzionas

[ Exhibit Hall I ]

Abstract
Recovering 3D object pose and shape from a single image is a challenging and highly ill-posed problem. This is due to strong (self-)occlusions, depth ambiguities, the vast intra- and inter-class shape variance, and lack of 3D ground truth for natural images. While existing methods train deep networks on synthetic datasets to predict 3D shapes, they often struggle to generalize to real-world scenarios, lack an explicit feedback loop for refining noisy estimates, and primarily focus on geometry without explicitly considering pixel alignment. To this end, we make two key observations: (1) a robust solution requires a model that imposes a strong category-specific shape prior to constrain the search space, and (2) foundational models embed 2D images and 3D shapes in joint spaces; both help resolve ambiguities. Hence, we propose SDFit, a novel optimization framework that is built on three key innovations: First, we use a learned morphable signed-distance-function (mSDF) model that acts as a strong shape prior, thus constraining the shape space. Second, we use foundational models to establish rich 2D-to-3D correspondences between image features and the mSDF. Third, we develop a fitting pipeline that iteratively refines both shape and pose, aligning the mSDF to the image. We evaluate SDFit on …
Poster
Sanghun Jung · Jingjing Zheng · Ke Zhang · Nan Qiao · Albert Y. C. Chen · Lu Xia · Chi Liu · Yuyin Sun · Xiao Zeng · Hsiang-Wei Huang · Byron Boots · Min Sun · Cheng-Hao Kuo

[ Exhibit Hall I ]

Abstract
Unlike closed-vocabulary 3D instance segmentation that is trained end-to-end, open-vocabulary 3D instance segmentation (OV-3DIS) leverages vision-language models (VLMs) to generate 3D instance proposals and classify them. While various concepts have been proposed from existing research, we observe that these individual concepts are not mutually exclusive but complementary. In this paper, we propose a new state-of-the-art solution for OV-3DIS by carefully designing a recipe to combine the concepts together and refining them to address key challenges. Our solution follows the two-stage scheme: 3D proposal generation and instance classification. We employ robust 3D tracking-based proposal aggregation to generate 3D proposals and remove overlapped or partial proposals by iterative merging/removal. For the classification stage, we replace the standard CLIP model with Alpha-CLIP, which incorporates object masks as an alpha channel to reduce background noise and obtain object-centric representation. Additionally, we introduce the standardized maximum similarity (SMS) score to normalize text-to-proposal similarity, effectively filtering out false positives and boosting precision. Our framework achieves state-of-the-art performance on ScanNet200, S3DIS, and Replica across all AP and AR metrics, even surpassing an end-to-end closed-vocabulary method.
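The standardized maximum similarity (SMS) score is not defined in the abstract; one plausible reading is to z-score each text class's similarities across proposals before taking the per-proposal maximum, so that scores become comparable across classes and weak matches can be filtered out. The sketch below follows that assumption only and may differ from the paper's definition.

```python
import numpy as np

def standardized_max_similarity(sim):
    """sim: (num_proposals, num_classes) text-to-proposal cosine similarities.

    Standardize each class column across proposals, then score every proposal
    by its best standardized match. One possible reading of the SMS score.
    """
    z = (sim - sim.mean(axis=0, keepdims=True)) / (sim.std(axis=0, keepdims=True) + 1e-8)
    best_class = z.argmax(axis=1)
    best_score = z.max(axis=1)
    return best_class, best_score

sim = np.random.rand(100, 200)       # 100 proposals, 200 open-vocabulary classes
cls, score = standardized_max_similarity(sim)
keep = score > 1.5                   # drop likely false-positive proposals
```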
Poster
Gene Chou · Wenqi Xian · Guandao Yang · Mohamed Abdelfattah · Bharath Hariharan · Noah Snavely · Ning Yu · Paul Debevec

[ Exhibit Hall I ]

Abstract
A versatile video depth estimation model should be consistent and accurate across frames, produce high-resolution depth maps, and support real-time streaming. We propose a method, FlashDepth, that satisfies all three requirements, performing depth estimation for a 2044x1148 streaming video at 24 FPS. We show that, with careful modifications to pretrained single-image depth models, these capabilities are enabled with relatively little data and training. We validate our approach across multiple unseen datasets against state-of-the-art depth models, and find that our method outperforms them in terms of boundary sharpness and speed by a significant margin, while maintaining competitive accuracy. We hope our model will enable various applications that require high-resolution depth, such as visual effects editing, and online decision-making, such as robotics.
Poster
Zhirui Gao · Renjiao Yi · Yuhang Huang · Wei Chen · Chenyang Zhu · Kai Xu

[ Exhibit Hall I ]

Abstract
Low-level 3D representations, such as point clouds, meshes, NeRFs and 3D Gaussians, are commonly used for modeling 3D objects and scenes. However, cognitive studies indicate that human perception operates at higher levels and interprets 3D environments by decomposing them into meaningful structural parts, rather than low-level elements like points or voxels. Structured geometric decomposition enhances scene interpretability and facilitates downstream tasks requiring component-level manipulation. In this work, we introduce $\textit{\textbf{PartGS}}$, a self-supervised part-aware reconstruction framework that integrates 2D Gaussians and superquadrics to parse objects and scenes into an interpretable decomposition, leveraging multi-view image inputs to uncover 3D structural information. Our method jointly optimizes superquadric meshes and Gaussians by coupling their parameters within a hybrid representation. On one hand, superquadrics enable the representation of a wide range of shape primitives, facilitating flexible and meaningful decomposition. On the other hand, 2D Gaussians capture detailed texture and geometric details, ensuring high-fidelity appearance and geometry reconstruction. Operating in a self-supervised manner, our approach demonstrates superior performance compared to state-of-the-art methods across extensive experiments on the DTU, ShapeNet, and real-world datasets.
Poster
Qianqian Wang · Vickie Ye · Hang Gao · Weijia Zeng · Jake Austin · Zhengqi Li · Angjoo Kanazawa

[ Exhibit Hall I ]

Abstract
Monocular dynamic reconstruction is a challenging and long-standing vision problem due to the highly ill-posed nature of the task. Existing approaches depend on templates, are effective only in quasi-static scenes, or fail to model 3D motion explicitly. We introduce a method for reconstructing generic dynamic scenes, featuring explicit, persistent 3D motion trajectories in the world coordinate frame, from casually captured monocular videos. We tackle the problem with two key insights: First, we exploit the low-dimensional structure of 3D motion by representing scene motion with a compact set of SE(3) motion bases. Each point's motion is expressed as a linear combination of these bases, facilitating soft decomposition of the scene into multiple rigidly-moving groups. Second, we take advantage of off-the-shelf data-driven priors such as monocular depth maps and long-range 2D tracks, and devise a method to effectively consolidate these noisy supervisory signals, resulting in a globally consistent representation of the dynamic scene. Experiments show that our method achieves state-of-the-art performance for both long-range 3D/2D motion estimation and novel view synthesis on dynamic scenes.
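One simple way to realize "each point's motion as a linear combination of SE(3) motion bases" is to blend the bases in log space (axis-angle rotations plus translations) with per-point weights, as sketched below. This parameterization is an assumption made for illustration; the paper's exact formulation may differ.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def apply_blended_motion(points, weights, basis_rotvecs, basis_trans):
    """points: (P, 3); weights: (P, B) rows summing to 1;
    basis_rotvecs, basis_trans: (B, 3) per motion basis at one time step."""
    rotvec = weights @ basis_rotvecs          # blend rotations in axis-angle (log) space
    trans = weights @ basis_trans             # blend translations linearly
    R = Rotation.from_rotvec(rotvec)          # one rotation per point
    return R.apply(points) + trans

P, B = 1000, 10
points = np.random.randn(P, 3)
weights = np.random.dirichlet(np.ones(B), size=P)   # soft assignment to rigid groups
basis_rotvecs = 0.1 * np.random.randn(B, 3)
basis_trans = 0.05 * np.random.randn(B, 3)
moved = apply_blended_motion(points, weights, basis_rotvecs, basis_trans)
```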
Poster
Zhenyu Li · Mykola Lavreniuk · Jian Shi · Shariq Bhat · Peter Wonka

[ Exhibit Hall I ]

Abstract
Amodal depth estimation aims to predict the depth of occluded (invisible) parts of objects in a scene. This task addresses the question of whether models can effectively perceive the geometry of occluded regions based on visible cues. Prior methods primarily rely on synthetic datasets and focus on metric depth estimation, limiting their generalization to real-world settings due to domain shifts and scalability challenges. In this paper, we propose a novel formulation of amodal depth estimation in the wild, focusing on relative depth prediction to improve model generalization across diverse natural images. We introduce a new large-scale dataset, Amodal Depth In the Wild (ADIW), created using a scalable pipeline that leverages segmentation datasets and compositing techniques. Depth maps are generated using large pre-trained depth models, and a scale-and-shift alignment strategy is employed to refine and blend depth predictions, ensuring consistency in ground-truth annotations. To tackle the amodal depth task, we present two complementary frameworks: Amodal-DAV2, a deterministic model based on Depth Anything V2, and Amodal-DepthFM, a generative model that integrates conditional flow matching principles. Our proposed frameworks effectively leverage the capabilities of large pre-trained models with minimal modifications to achieve high-quality amodal depth predictions. Experiments validate our design choices, demonstrating the …
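The scale-and-shift alignment used to make depth maps consistent before blending is a standard least-squares fit; a minimal version is sketched below (function and variable names are ours), aligning a relative depth map to a reference over valid pixels.

```python
import numpy as np

def align_scale_shift(pred, target, mask):
    """Solve min over (s, t) of || s * pred + t - target ||^2 on valid pixels."""
    p = pred[mask].ravel()
    t = target[mask].ravel()
    A = np.stack([p, np.ones_like(p)], axis=1)        # (N, 2) design matrix
    (s, b), *_ = np.linalg.lstsq(A, t, rcond=None)
    return s * pred + b

pred = np.random.rand(240, 320)                       # relative depth prediction
target = 2.0 * pred + 0.3 + 0.01 * np.random.randn(240, 320)
aligned = align_scale_shift(pred, target, mask=np.ones_like(pred, dtype=bool))
```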
Poster
Deepayan Das · Davide Talon · Yiming Wang · Massimiliano Mancini · Elisa Ricci

[ Exhibit Hall I ]

Abstract
Vision Language Models (VLMs) have led to major improvements in multimodal reasoning, yet they still struggle to understand user-specific concepts. Existing personalization methods address this limitation but heavily rely on training procedures that can be either costly or unpleasant for individual users. We depart from existing work and, for the first time, explore the training-free setting in the context of personalization. We propose a novel method, Retrieval and Reasoning for Personalization (R2P), leveraging the internal knowledge of VLMs. First, we leverage VLMs to extract the concept fingerprint, i.e., key attributes uniquely defining the concept within its semantic class. When a query arrives, the most similar fingerprints are retrieved and scored via chain-of-thought reasoning. To reduce the risk of hallucinations, the scores are validated through cross-modal verification at the attribute level: in case of a discrepancy between the scores, R2P refines the concept association via pairwise multimodal matching, where the retrieved fingerprints and their images are directly compared with the query. We validate R2P on two publicly available benchmarks and a newly introduced dataset for concept identification, Personal Concepts with Visual Ambiguity (PerVA), which highlights challenges of visual ambiguity. R2P consistently outperforms state-of-the-art approaches on various downstream tasks across all benchmarks. Code will be available upon acceptance.
Poster
Artem Zholus · Carl Doersch · Yi Yang · Skanda Koppula · Viorica Patraucean · Xu He · Ignacio Rocco · Mehdi S. M. Sajjadi · Sarath Chandar · Ross Goroshin

[ Exhibit Hall I ]

Abstract
Tracking Any Point (TAP) in a video is a challenging computer vision problem with many demonstrated applications in robotics, video editing, and 3D reconstruction. Existing methods for TAP rely heavily on complex tracking-specific inductive biases and heuristics, limiting their generality and potential for scaling. To address these challenges, we present TAPNext, a new approach that casts TAP as sequential masked token decoding. Our model is causal, tracks in a purely online fashion, and removes tracking-specific inductive biases. This enables TAPNext to run with minimal latency, and removes the temporal windowing required by many existing state-of-the-art trackers. Despite its simplicity, TAPNext achieves new state-of-the-art tracking performance among both online and offline trackers. Finally, we present evidence that many widely used tracking heuristics emerge naturally in TAPNext through end-to-end training.
Poster
Minghua Liu · Mikaela Uy · Donglai Xiang · Hao Su · Sanja Fidler · Nicholas Sharp · Jun Gao

[ Exhibit Hall I ]

Abstract
We propose PartField, a feedforward approach for learning part-based 3D features, which captures the general concept of parts and their hierarchy without relying on predefined templates or text-based names, and can be applied to open-world 3D shapes across various modalities. PartField requires only a 3D feedforward pass at inference time, significantly improving runtime and robustness compared to prior approaches. Our model is trained by distilling 2D and 3D part proposals from a mix of labeled datasets and image segmentations on large unsupervised datasets, via a contrastive learning formulation. It produces a continuous feature field which can be clustered to yield a hierarchical part decomposition. Comparisons show that PartField is up to 20\% more accurate and often orders of magnitude faster than other recent class-agnostic part-segmentation methods. Beyond single-shape part decomposition, consistency in the learned field emerges across shapes, enabling tasks such as co-segmentation and correspondence, which we demonstrate in several applications of these general-purpose, hierarchical, and consistent 3D feature fields.
Poster
Meiao Wang · Xuejing Kang · Yaxi Lu · Jie Xu

[ Exhibit Hall I ]

Abstract
Low-light video enhancement (LLVE) aims to restore videos degraded by insufficient illumination. While existing methods have demonstrated their effectiveness, they often face challenges with intra-frame noise, overexposure, and inter-frame inconsistency, since they fail to exploit the temporal continuity across frames. Inspired by the progressive video understanding mechanism of humans, we propose a novel end-to-end two-stage memory controller (MC)-dominated network (RetinexMCNet). Specifically, we first define the overall optimization objective for Retinex-based LLVE and design our framework accordingly. In stage one, aided by a dual-perspective Lightness-Texture Stability (LTS) loss, we perform per-frame enhancement without the MC, using a channel-aware Illumination Adjustment Module (IAM) and an illumination-guided Reflectance Denoising Module (RDM) based on Retinex theory to mitigate intra-frame noise and overexposure. In stage two, we activate the MC to simulate human temporal memory and integrate it with high-quality single frames for global consistency. Extensive qualitative and quantitative experiments on common low-light sRGB datasets demonstrate that our method significantly outperforms state-of-the-art approaches. Code is available at xxx/xxx/xxx.
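For readers unfamiliar with the Retinex model underlying the framework, an observed frame is treated as the element-wise product of reflectance and illumination, and enhancement typically lifts the illumination while keeping a denoised reflectance. The sketch below is a toy recombination under that model, not the paper's learned IAM/RDM modules; the gamma value and names are ours.

```python
import numpy as np

def retinex_enhance(reflect, illum, gamma=0.45):
    """Retinex model: observed frame ~ reflect * illum (element-wise).

    Enhancement keeps the (denoised) reflectance and applies a gamma curve to
    the estimated illumination so dark regions are lifted without blowing out
    already-bright areas.
    """
    adjusted = np.clip(illum, 1e-3, 1.0) ** gamma
    return np.clip(reflect * adjusted, 0.0, 1.0)

frame = np.random.rand(64, 64, 3) * 0.2            # dark sRGB frame
illum = np.full((64, 64, 1), 0.2)                  # pretend estimated illumination
reflect = frame / np.maximum(illum, 1e-3)          # pretend estimated reflectance
enhanced = retinex_enhance(reflect, illum)
```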
Poster
Ronggang Huang · Haoxin Yang · Yan Cai · Xuemiao Xu · Huaidong Zhang · Shengfeng He

[ Exhibit Hall I ]

Abstract
3D visual grounding aims to identify and localize objects in a 3D space based on textual descriptions. However, existing methods struggle with disentangling targets from anchors in complex multi-anchor queries and resolving inconsistencies in spatial descriptions caused by perspective variations. To tackle these challenges, we propose ViewSRD, a framework that formulates 3D visual grounding as a structured multi-view decomposition process. First, the Simple Relation Decoupling (SRD) module restructures complex multi-anchor queries into a set of targeted single-anchor statements, generating a structured set of perspective-aware descriptions that clarify positional relationships. These decomposed representations serve as the foundation for the Multi-view Textual-Scene Interaction (Multi-TSI) module, which integrates textual and scene features across multiple viewpoints using shared Cross-modal Consistent View Tokens (CCVTs) to preserve spatial correlations. Finally, a Textual-Scene Reasoning module synthesizes multi-view predictions into a unified and robust 3D visual grounding result. Experiments on 3D visual grounding datasets show that ViewSRD significantly outperforms state-of-the-art methods, particularly in complex queries requiring precise spatial differentiation.
Poster
Shunsuke Yasuki · Taiki Miyanishi · Nakamasa Inoue · Shuhei Kurita · Koya Sakamoto · Daichi Azuma · Masato Taki · Yutaka Matsuo

[ Exhibit Hall I ]

Abstract
The advancement of 3D language fields has enabled intuitive interactions with 3D scenes via natural language. However, existing approaches are typically limited to small-scale environments, lacking the scalability and compositional reasoning capabilities necessary for large, complex urban settings. To overcome these limitations, we propose GeoProg3D, a visual programming framework that enables natural language-driven interactions with city-scale high-fidelity 3D scenes. GeoProg3D consists of two key components: (i) a Geography-aware City-scale 3D Language Field (GCLF) that leverages a memory-efficient hierarchical 3D model to handle large-scale data, integrated with geographic information for efficiently filtering vast urban spaces using directional cues, distance measurements, elevation data, and landmark references; and (ii) Geographical Vision APIs (GV-APIs), specialized geographic vision tools such as area segmentation and object detection. Our framework employs large language models (LLMs) as reasoning engines to dynamically combine GV-APIs and operate GCLF, effectively supporting diverse geographic vision tasks. To assess performance in city-scale reasoning, we introduce GeoEval3D, a comprehensive benchmark dataset containing 952 query-answer pairs across five challenging tasks: grounding, spatial reasoning, comparison, counting, and measurement. Experiments demonstrate that GeoProg3D significantly outperforms existing 3D language fields and vision-language models across multiple tasks. To our knowledge, GeoProg3D is the first framework enabling compositional geographic reasoning …
Poster
Wajahat Khalid · Bin Liu · Xulin Li · MUHAMMAD WAQAS · MUHAMMAD SHER AFGAN

[ Exhibit Hall I ]

Abstract
Aerial-Ground Person Re-Identification (AG-ReID) is a practical yet challenging task that involves cross-platform matching between aerial and ground cameras. Existing person Re-Identification (Re-ID) methods are primarily designed for homogeneous camera settings, such as ground-to-ground or aerial-to-aerial matching. Therefore, these conventional Re-ID approaches underperform due to the significant viewpoint discrepancies introduced by cross-platform cameras in the AG-ReID task. To address this limitation, we propose a novel and efficient approach, termed View-Invariant Feature Learning for Aerial-Ground Person Re-Identification (VIF-AGReID), which explores view-invariant features without leveraging any auxiliary information. Our approach introduces two key components: (1) Patch-Level RotateMix (PLRM), an augmentation strategy that enhances rotational diversity within local regions of training samples, enabling the model to capture fine-grained view-invariant features, and (2) View-Invariant Angular Loss (VIAL), which mitigates the impact of perspective variations by imposing angular constraints that exponentially penalize large angular deviations, optimizing the similarity of positive pairs while enhancing dissimilarity for hard negatives. These components interact synergistically to drive view-invariant feature learning, enhancing robustness across diverse viewpoints. We conduct extensive experiments on benchmark AG-ReID datasets, including CARGO and AG-ReID, to evaluate the effectiveness of our proposed method. Experimental results demonstrate that VIF-AGReID significantly outperforms existing state-of-the-art methods, achieving superior performance in …
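The View-Invariant Angular Loss is described only qualitatively ("exponentially penalize large angular deviations"); the sketch below shows one plausible shape of such a penalty for positive pairs, purely as an illustration. The exact VIAL formulation, including how hard negatives are handled, is not given in the abstract, and all names and constants here are ours.

```python
import torch
import torch.nn.functional as F

def angular_penalty(anchor, positive, scale=2.0):
    """Exponentially penalize the angle between an anchor and its positive.

    angle = arccos(cosine similarity); loss = exp(scale * angle) - 1, so small
    deviations cost little while large angular deviations are punished sharply.
    """
    cos = F.cosine_similarity(anchor, positive, dim=-1).clamp(-1 + 1e-6, 1 - 1e-6)
    angle = torch.acos(cos)
    return (torch.exp(scale * angle) - 1.0).mean()

anchor = torch.randn(32, 256)                 # aerial-view embeddings
positive = anchor + 0.1 * torch.randn(32, 256)  # matching ground-view embeddings
loss = angular_penalty(anchor, positive)
```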
Poster
Jinming Li · Yichen Zhu · Zhibin Tang · Junjie Wen · Minjie Zhu · Xiaoyu Liu · Chengmeng Li · Ran Cheng · Yaxin Peng · Yan Peng · Feifei Feng

[ Exhibit Hall I ]

Abstract
Robot foundation models, particularly Vision-Language-Action (VLA) models, have garnered significant attention for their ability to enhance robot policy learning, greatly improving robots' generalization and robustness. OpenAI's recent model, O1, showcased impressive capabilities in solving complex problems by utilizing extensive reasoning chains. This prompts an important question: can robot models achieve better performance in multi-task, complex environments by reviewing prior observations and then providing task-specific reasoning to guide action prediction? In this paper, we introduce **Chain-of-Affordance (CoA-VLA)**, a novel approach to scaling robot models by incorporating reasoning in the form of sequential robot affordances to facilitate task completion. Specifically, we prompt the model to consider the following four types of affordances before taking action: (1) *object affordance* — what object to manipulate and where it is; (2) *grasp affordance* — the specific object part to grasp; (3) *spatial affordance* — the optimal space to place the object; and (4) *movement affordance* — the collision-free path for movement. We further transform each affordance into two prompting formats: ***visual affordance and textual affordance***. We introduce a novel vision-language co-injection module that integrates this knowledge into the policy network. This allows the robot to leverage essential contextual information during action inference, resulting in improved precision …
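As a rough illustration of the textual-affordance format, the snippet below serializes the four affordance types from the abstract into a prompt that could precede action prediction. The field names and template are illustrative assumptions, not CoA-VLA's actual schema.

```python
# Hedged sketch: serialising the four affordance types into a textual prompt
# for a VLA policy. Field names and template are illustrative, not the
# paper's actual prompting format.
def build_affordance_prompt(task, affordances):
    chain = (
        f"Object affordance: {affordances['object']}\n"
        f"Grasp affordance: {affordances['grasp']}\n"
        f"Spatial affordance: {affordances['spatial']}\n"
        f"Movement affordance: {affordances['movement']}\n"
    )
    return f"Task: {task}\nReason over affordances before acting:\n{chain}Action:"

example = build_affordance_prompt(
    task="put the mug on the shelf",
    affordances={
        "object": "mug at image region (0.42, 0.61)",
        "grasp": "handle on the right side of the mug",
        "spatial": "empty shelf area at region (0.70, 0.25)",
        "movement": "lift 10 cm, move right, avoid the kettle",
    },
)
print(example)
```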
Poster
Fengbo Lan · Chang Wen Chen

[ Exhibit Hall I ]

Abstract
Reflective flares are common artifacts in photography that degrade image quality, introducing in-focus flares, which appear as bright, regular spot patterns, and out-of-focus flares, which are diffuse and semi-transparent, obscuring the underlying scene. While previous methods have achieved some success in removing in-focus flares, they struggle with the diffuse nature of out-of-focus flares. The lack of an out-of-focus flare dataset has further hindered the development of effective flare removal models. In this work, we construct a large-scale out-of-focus flare dataset generated based on physical principles. We propose a novel color alignment approach using diffusion models to address the challenges of out-of-focus reflective flare removal. Rather than reconstructing flare-affected regions, our method adjusts the color distribution to reduce artifact visibility while preserving image content. Specifically, we introduce a differentiable histogram loss, derived from the Earth Mover's Distance (EMD), to effectively align color distributions. The proposed approach outperforms existing methods on both synthetic and real-world data, demonstrating improved performance in flare removal.
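For 1-D distributions, the Earth Mover's Distance reduces to the L1 distance between cumulative histograms, which makes a differentiable histogram loss straightforward to sketch. The soft-binning kernel and bin count below are assumptions; the paper's exact construction is not specified in the abstract.

```python
# Hedged sketch of a differentiable histogram loss based on the 1-D Earth
# Mover's Distance (EMD): 1-D EMD = L1 distance between cumulative histograms.
# The Gaussian soft-binning kernel is an assumption, not the paper's exact form.
import torch

def soft_histogram(x, bins=64, sigma=0.02):
    """x: (N,) values in [0, 1]. Returns a normalised, differentiable histogram."""
    centers = torch.linspace(0.0, 1.0, bins, device=x.device)
    weights = torch.exp(-0.5 * ((x[:, None] - centers[None, :]) / sigma) ** 2)
    hist = weights.sum(dim=0)
    return hist / (hist.sum() + 1e-8)

def emd_histogram_loss(pred, target, bins=64):
    """pred/target: flattened channel intensities in [0, 1]."""
    h_pred = soft_histogram(pred.flatten(), bins)
    h_tgt = soft_histogram(target.flatten(), bins)
    return torch.abs(torch.cumsum(h_pred - h_tgt, dim=0)).sum()

# Toy usage: align the colour distribution of a predicted patch to a target.
pred = torch.rand(32 * 32, requires_grad=True)
target = torch.rand(32 * 32)
loss = emd_histogram_loss(pred, target)
loss.backward()
print(float(loss), pred.grad.shape)
```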
Poster
Baicheng Li · Zike Yan · Dong Wu · Hongbin Zha

[ Exhibit Hall I ]

Abstract
Human behavior is a major cause of scene dynamics and inherently contains rich cues about those dynamics. This paper formalizes a new task of proactive scene decomposition and reconstruction, an online approach that leverages human-object interactions to iteratively disassemble and reconstruct the environment. By observing these intentional interactions, we can dynamically refine the decomposition and reconstruction process, addressing inherent ambiguities in static object-level reconstruction. The proposed system effectively integrates multiple tasks in dynamic environments, such as accurate camera and object pose estimation, instance decomposition, and online map updating, capitalizing on cues from human-object interactions in egocentric live streams to offer a flexible, progressive alternative to conventional object-level reconstruction methods. Aided by the Gaussian splatting technique, accurate and consistent dynamic scene modeling is achieved with photorealistic and efficient rendering. The efficacy of the system is validated in multiple real-world scenarios with promising advantages.
Poster
Tom Fischer · Xiaojie Zhang · Eddy Ilg

[ Exhibit Hall I ]

Abstract
Recognizing objects in images is a fundamental problem in computer vision. While detecting objects in 2D images is common, many applications require determining their pose in 3D space. Traditional category-level methods rely on RGB-D inputs, which may not always be available, or employ two-stage approaches that use separate models and representations for detection and pose estimation. For the first time, we introduce a unified model that integrates detection and pose estimation into a single framework for RGB images by leveraging neural mesh models with learned features and multi-model RANSAC. Our approach achieves state-of-the-art results on RGB category-level pose on REAL275, outperforming the current state-of-the-art by 5.5\%, averaged across all scale-agnostic metrics. Finally, we demonstrate that our unified method exhibits significantly greater robustness compared to single-stage baselines.
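The abstract mentions pose estimation via multi-model RANSAC over learned feature correspondences. The generic robust-fitting loop below illustrates the RANSAC step on a toy 2-D rigid alignment problem; it is a stand-in sketch, not the authors' neural-mesh, multi-model pipeline, and the threshold and iteration count are arbitrary assumptions.

```python
# Hedged sketch: RANSAC fitting of a rigid 2-D transform to noisy
# correspondences, as a generic stand-in for pose-from-correspondences.
import numpy as np

def fit_rigid_2d(src, dst):
    """Least-squares rotation + translation mapping src -> dst (2-D Kabsch)."""
    sc, dc = src.mean(0), dst.mean(0)
    H = (src - sc).T @ (dst - dc)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:   # avoid reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    return R, dc - R @ sc

def ransac_rigid(src, dst, iters=200, thresh=0.05, seed=0):
    rng = np.random.default_rng(seed)
    best_inliers = np.zeros(len(src), dtype=bool)
    for _ in range(iters):
        idx = rng.choice(len(src), size=2, replace=False)  # minimal sample
        R, t = fit_rigid_2d(src[idx], dst[idx])
        err = np.linalg.norm((src @ R.T + t) - dst, axis=1)
        inliers = err < thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    return fit_rigid_2d(src[best_inliers], dst[best_inliers]), best_inliers

# Toy data: 80 exact correspondences under a known transform, 20 outliers.
rng = np.random.default_rng(1)
src = rng.standard_normal((100, 2))
theta = 0.4
R_true = np.array([[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]])
dst = src @ R_true.T + np.array([0.3, -0.2])
dst[80:] += rng.standard_normal((20, 2))   # corrupt 20 correspondences
(R_est, t_est), inliers = ransac_rigid(src, dst)
print(inliers.sum(), np.allclose(R_est, R_true, atol=1e-2))
```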
Poster
Danila Rukhovich · Elona Dupont · Dimitrios Mallis · Kseniya Cherenkova · Anis Kacem · Djamila Aouada

[ Exhibit Hall I ]

Abstract
Computer-Aided Design (CAD) models are typically constructed by sequentially drawing parametric sketches and applying CAD operations to obtain a 3D model. The problem of 3D CAD reverse engineering consists of reconstructing the sketch and CAD operation sequences from 3D representations such as point clouds. In this paper, we address this challenge through novel contributions across three levels: CAD sequence representation, network design, and dataset. In particular, we represent CAD sketch-extrude sequences as Python code. The proposed CAD-Recode translates a point cloud into Python code that, when executed, reconstructs the CAD model. Taking advantage of the exposure of pre-trained Large Language Models (LLMs) to Python code, we leverage a relatively small LLM as a decoder for CAD-Recode and combine it with a lightweight point cloud projector. CAD-Recode is trained solely on a proposed synthetic dataset of one million diverse CAD sequences. CAD-Recode significantly outperforms existing methods across three datasets while requiring fewer input points. Notably, it achieves 10 times lower mean Chamfer distance than state-of-the-art methods on DeepCAD and Fusion360 datasets. Furthermore, we show that our CAD Python code output is interpretable by off-the-shelf LLMs, enabling CAD editing and CAD-specific question answering from point clouds.
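To make the sketch-extrude-as-Python idea concrete, the toy example below shows what an executable CAD program of that kind might look like. The `Sketch`/`Solid` primitives are defined here purely for illustration; CAD-Recode's actual code dialect and library may differ.

```python
# Hedged sketch: a CAD model expressed as executable Python. The primitives
# are a toy stand-in API, not CAD-Recode's actual output dialect.
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Sketch:
    polygon: List[Tuple[float, float]]  # closed 2-D profile as (x, y) vertices

@dataclass
class Solid:
    sketches: List[Tuple[Sketch, float]] = field(default_factory=list)

    def extrude(self, sketch: Sketch, height: float) -> "Solid":
        self.sketches.append((sketch, height))
        return self

    def volume(self) -> float:
        # Shoelace area of each profile times its extrusion height.
        total = 0.0
        for sk, h in self.sketches:
            pts = sk.polygon
            area = 0.5 * abs(sum(
                pts[i][0] * pts[(i + 1) % len(pts)][1]
                - pts[(i + 1) % len(pts)][0] * pts[i][1]
                for i in range(len(pts))
            ))
            total += area * h
        return total

# A "CAD program" of the kind a decoder could emit: a base plate with a boss.
model = (Solid()
         .extrude(Sketch([(0, 0), (40, 0), (40, 20), (0, 20)]), height=5.0)
         .extrude(Sketch([(15, 5), (25, 5), (25, 15), (15, 15)]), height=12.0))
print(model.volume())  # 40*20*5 + 10*10*12 = 5200.0
```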
Poster
Dmitrii Torbunov · Yihui Ren · Animesh Ghose · Odera Dim · Yonggang Cui

[ Exhibit Hall I ]

Abstract
Event-based cameras (EBCs) have emerged as a bio-inspired alternative to traditional cameras, offering advantages in power efficiency, temporal resolution, and high dynamic range. However, the development of image analysis methods for EBCs is challenging due to the sparse and asynchronous nature of the data. This work addresses the problem of object detection for EBCs. Current approaches to EBC object detection focus on constructing complex data representations and rely on specialized architectures. We introduce I2EvDet (Image-to-Event Detection), a novel adaptation framework that bridges mainstream object detection with temporal event data processing. First, we demonstrate that the Real-Time DEtection TRansformer (RT-DETR), a state-of-the-art natural-image detector, trained on a simple image-like representation of the EBC data, achieves performance comparable to specialized EBC methods. Next, as part of our framework, we develop an efficient adaptation technique that transforms image-based detectors into event-based detection models by modifying their frozen latent representation space through minimal architectural additions. The resulting EvRT-DETR model reaches state-of-the-art performance on the standard benchmark datasets Gen1 (mAP $+2.3$) and 1Mpx/Gen4 (mAP $+1.4$). These results demonstrate a fundamentally new approach to EBC object detection through principled adaptation of mainstream architectures, offering an efficient alternative with potential applications to other temporal visual domains.
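One common way to build an image-like representation from an event stream is to accumulate per-pixel counts of positive and negative events over a time window, as sketched below. The abstract does not specify EvRT-DETR's exact representation, so treat this as a generic example only.

```python
# Hedged sketch of a simple image-like event representation: per-pixel counts
# of positive and negative events over a time window (a generic example, not
# necessarily the representation used by I2EvDet/EvRT-DETR).
import numpy as np

def events_to_frame(x, y, polarity, height, width):
    """x, y: pixel coordinates; polarity: +1/-1 per event.
    Returns a (2, H, W) array: channel 0 = positive, channel 1 = negative counts."""
    frame = np.zeros((2, height, width), dtype=np.float32)
    pos = polarity > 0
    np.add.at(frame[0], (y[pos], x[pos]), 1.0)
    np.add.at(frame[1], (y[~pos], x[~pos]), 1.0)
    return frame

# Toy event stream on a 240x304 sensor (Gen1-like resolution).
rng = np.random.default_rng(0)
n = 10_000
frame = events_to_frame(
    x=rng.integers(0, 304, n), y=rng.integers(0, 240, n),
    polarity=rng.choice([-1, 1], n), height=240, width=304,
)
print(frame.shape, frame.sum())  # (2, 240, 304) 10000.0
```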
Poster
Connor Malone · Somayeh Hussaini · Tobias Fischer · Michael Milford

[ Exhibit Hall I ]

Abstract
Visual Place Recognition (VPR) enables coarse localization by comparing query images to a reference database of geo-tagged images. Recent breakthroughs in deep learning architectures and training regimes have led to methods with improved robustness to factors like environment appearance change, but with the downside that the required training and/or matching compute scales with the number of distinct environmental conditions encountered. Here, we propose Hyperdimensional One Place Signatures (HOPS) to simultaneously improve the performance, compute and scalability of these state-of-the-art approaches by fusing the descriptors from multiple reference sets captured under different conditions. HOPS scales to any number of environmental conditions by leveraging the Hyperdimensional Computing framework. Extensive evaluations demonstrate that our approach is highly generalizable and consistently improves recall performance across all evaluated VPR methods and datasets by large margins. Arbitrarily fusing reference images without compute penalty enables numerous other useful possibilities, three of which we demonstrate here: descriptor dimensionality reduction with no performance penalty, stacking synthetic images, and coarse localization to an entire traverse or environmental section.
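Hyperdimensional computing typically fuses vectors by binding each with a condition-specific random hypervector and bundling (summing) the results. The sketch below illustrates that generic pattern for per-place descriptors captured under several conditions; the dimensionality, the binding operator, and the direct use of raw descriptors as hypervectors are assumptions, not HOPS's exact operations.

```python
# Hedged sketch of hyperdimensional fusion: bind each condition's descriptors
# with a condition-specific bipolar key, then bundle by summation into one
# signature per place. Generic HDC pattern, not HOPS's exact recipe.
import numpy as np

rng = np.random.default_rng(0)
D = 4096  # hyperdimensional signature size (assumed)

def random_bipolar(n, d):
    return rng.choice([-1.0, 1.0], size=(n, d))

def fuse(descriptors_per_condition, condition_keys):
    """descriptors_per_condition: list of (P, D) arrays, one per condition.
    Returns a (P, D) fused, L2-normalised signature per place."""
    fused = np.zeros_like(descriptors_per_condition[0])
    for desc, key in zip(descriptors_per_condition, condition_keys):
        fused += desc * key  # element-wise binding, then bundling by summation
    return fused / np.linalg.norm(fused, axis=1, keepdims=True)

P = 100  # number of reference places
conditions = ["day", "night", "rain"]
keys = random_bipolar(len(conditions), D)
reference_sets = [rng.standard_normal((P, D)) for _ in conditions]
signatures = fuse(reference_sets, keys)

# Query matching: cosine similarity of a normalised query against signatures.
query = rng.standard_normal(D)
query /= np.linalg.norm(query)
best = int(np.argmax(signatures @ query))
print(signatures.shape, best)
```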
Poster
Fangqi Zhu · Hongtao Wu · Song Guo · Yuxiao Liu · Chilam Cheang · Tao Kong

[ Exhibit Hall I ]

Abstract
World models allow autonomous agents to plan and explore by predicting the visual outcomes of different actions. However, for robot manipulation, it is challenging to accurately model fine-grained robot-object interaction within the visual space using existing methods, which overlook the precise alignment between each action and the corresponding frame. In this paper, we present IRASim, a novel world model capable of generating videos with fine-grained robot-object interaction details, conditioned on historical observations and robot action trajectories. We train a diffusion transformer and introduce a novel frame-level action-conditioning module within each transformer block to explicitly model and strengthen the action-frame alignment. Extensive experiments show that: (1) the quality of the videos generated by our method surpasses all compared baseline methods and scales effectively with increased model size and computation; (2) policy evaluations using IRASim exhibit a strong correlation with those using the ground-truth simulator, highlighting its potential to accelerate real-world policy evaluation; (3) test-time scaling through model-based planning with IRASim significantly enhances policy performance, as evidenced by an improvement in the IoU metric on the Push-T benchmark from 0.637 to 0.961; (4) IRASim provides flexible action controllability, allowing virtual robotic arms in datasets to be controlled via a keyboard or VR controller. Video and code …
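One plausible way to condition each transformer block on a per-frame action is adaptive layer normalization: each frame's action vector produces a scale and shift that modulate only that frame's tokens. The module below sketches this design under those assumptions; it is not IRASim's actual frame-level action-conditioning module.

```python
# Hedged sketch of frame-level action conditioning: each frame's action vector
# modulates that frame's video tokens via adaptive layer norm. One plausible
# design, not IRASim's exact module.
import torch
import torch.nn as nn

class FrameActionConditioning(nn.Module):
    def __init__(self, dim, action_dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(action_dim, 2 * dim)

    def forward(self, tokens, actions):
        """tokens: (B, T, N, dim) video tokens with N tokens per frame.
        actions: (B, T, action_dim), one action per frame."""
        scale, shift = self.to_scale_shift(actions).chunk(2, dim=-1)  # (B, T, dim)
        x = self.norm(tokens)
        # Broadcast the per-frame modulation over that frame's N tokens.
        return x * (1 + scale.unsqueeze(2)) + shift.unsqueeze(2)

B, T, N, dim, action_dim = 2, 8, 64, 384, 7
module = FrameActionConditioning(dim, action_dim)
out = module(torch.randn(B, T, N, dim), torch.randn(B, T, action_dim))
print(out.shape)  # torch.Size([2, 8, 64, 384])
```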
Poster
Zhiqiang Yuan · Ting Zhang · Yeshuang Zhu · Jiapei Zhang · Ying Deng · Zexi Jia · Peixiang Luo · Xiaoyue Duan · Jie Zhou · Jinchao Zhang

[ Exhibit Hall I ]

Abstract
Approximately 200 million individuals around the world suffer from varying degrees of visual impairment, making it crucial to leverage AI technology to offer walking assistance to these people. With the recent progress of vision-language models (VLMs), applying VLMs to offer walking guidance has become popular. However, existing methods of walking guidance are mainly based on self-curated question-answering datasets that are not publicly accessible, without a standardized benchmark for training or evaluation. Moreover, walking assistance often requires real-time streaming video analysis and the generation of concise yet informative reminders, which makes VLMs struggle due to excessive responses and low inference efficiency. In this paper, we introduce the first large-scale dataset dedicated to walking assistance, comprising 12,000 video-annotation pairs, to provide a unified benchmark for training and evaluating systems that help visually impaired individuals walk. Furthermore, we propose WalkVLM, a model that employs chain-of-thought reasoning for hierarchical planning to generate concise but informative reminders and utilizes temporal-aware adaptive prediction to reduce the temporal redundancy of reminders. Finally, we establish a solid benchmark for the blind walking task and verify the advantages of WalkVLM in stream video processing for this task compared to other VLMs.
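A simple way to picture temporal redundancy reduction is a gate that suppresses a newly generated reminder when it closely repeats a recent one. The sketch below uses string similarity and a minimum interval as illustrative stand-ins; the actual temporal-aware adaptive prediction mechanism in WalkVLM is not described in the abstract.

```python
# Hedged sketch of temporally-adaptive reminder gating: suppress a new
# reminder if it is too similar to what was said very recently. Thresholds and
# similarity measure are illustrative assumptions, not WalkVLM's mechanism.
import difflib
import time

class ReminderGate:
    def __init__(self, similarity_threshold=0.8, min_interval_s=3.0):
        self.similarity_threshold = similarity_threshold
        self.min_interval_s = min_interval_s
        self.last_text = ""
        self.last_time = float("-inf")

    def should_emit(self, text, now=None):
        now = time.monotonic() if now is None else now
        similar = difflib.SequenceMatcher(None, text, self.last_text).ratio()
        if now - self.last_time < self.min_interval_s and similar >= self.similarity_threshold:
            return False  # redundant: same advice, too soon
        self.last_text, self.last_time = text, now
        return True

gate = ReminderGate()
print(gate.should_emit("Curb ahead, step up", now=0.0))              # True
print(gate.should_emit("Curb ahead, step up.", now=1.0))             # False (redundant)
print(gate.should_emit("Bicycle approaching on the left", now=1.5))  # True
```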
Poster
Lei Tian · Xiaomin Li · Liqian Ma · Hao Yin · Zirui Zheng · Hefei Huang · Taiqing Li · Huchuan Lu · Xu Jia

[ Exhibit Hall I ]

Abstract
Recent advances in 3D reconstruction techniques and vision-language models have fueled significant progress in 3D semantic understanding—a capability critical to robotics, autonomous driving, and virtual/augmented reality. However, methods that rely on 2D priors are prone to a critical challenge: cross-view semantic inconsistencies induced by occlusion, image blur, and view-dependent variations. These inconsistencies, when propagated via projection supervision, deteriorate the quality of 3D Gaussian semantic fields and introduce artifacts in the rendered outputs. To mitigate this limitation, we propose CCL-LGS, a novel framework that enforces view-consistent semantic supervision by integrating multi-view semantic cues. Specifically, our approach first employs a zero-shot tracker to align a set of 2D masks—provided by SAM—to reliably identify their corresponding categories. Next, we utilize CLIP to extract robust semantic encodings across views. Finally, our Contrastive Codebook Learning (CCL) module distills discriminative semantic features by enforcing intra-class compactness and inter-class distinctiveness. In contrast to previous methods that directly apply CLIP to imperfect masks, our framework explicitly resolves semantic conflicts while preserving category discriminability. Extensive experiments demonstrate CCL-LGS's superiority over previous state-of-the-art methods.
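A generic way to realize intra-class compactness and inter-class distinctiveness over a codebook is a prototype-contrastive loss: each per-mask feature is pulled toward its class prototype and pushed away from the others via a softmax over prototype similarities. The sketch below follows that generic pattern; the temperature and the exact CCL formulation are assumptions, not the paper's implementation.

```python
# Hedged sketch of a prototype-contrastive objective over a semantic codebook:
# pull each feature toward its class prototype, push it away from the rest.
# Generic formulation, not necessarily the exact CCL module.
import torch
import torch.nn.functional as F

def codebook_contrastive_loss(features, labels, codebook, temperature=0.07):
    """features: (N, D) per-mask semantic features; labels: (N,) class ids;
    codebook: (K, D) learnable class prototypes."""
    f = F.normalize(features, dim=1)
    c = F.normalize(codebook, dim=1)
    logits = f @ c.t() / temperature        # (N, K) similarity to each prototype
    return F.cross_entropy(logits, labels)  # softmax contrast over prototypes

# Toy usage: 128 mask features, 512-D, 16 semantic classes.
N, D, K = 128, 512, 16
features = torch.randn(N, D)
labels = torch.randint(0, K, (N,))
codebook = torch.randn(K, D, requires_grad=True)
loss = codebook_contrastive_loss(features, labels, codebook)
loss.backward()
print(float(loss), codebook.grad.shape)
```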
Poster
Jiaying Ying · Heming Du · Kaihao Zhang · Lincheng Li · Xin Yu

[ Exhibit Hall I ]

Abstract
Human pose estimation aims to predict the location of body keypoints and enable various practical applications. However, existing research focuses solely on individuals with full physical bodies and overlooks those with limb deficiencies. As a result, current pose estimation annotation formats cannot be generalized to individuals with limb deficiencies. In this paper, we introduce the **Limb-Deficient Pose Estimation task**, which not only predicts the locations of standard human body keypoints, but also estimates the endpoints of missing limbs. To support this task, we present **Limb-Deficient Pose (LDPose), the first-ever human pose estimation dataset for individuals with limb deficiencies**. LDPose comprises over 28k images for approximately 100k individuals across diverse limb deficiency types and ethnic backgrounds. The annotation process is guided by internationally accredited para-athletics classifiers to ensure high precision. In addition, we propose a **Limb-Deficient Loss (LDLoss)** to better distinguish residual limb keypoints by contrasting residual limb keypoints with intact limb keypoints. Furthermore, we design **Limb-Deficient Metrics (LD Metrics)** to quantitatively measure the keypoint predictions of both residual and intact limbs, and benchmark our dataset using state-of-the-art human pose estimation methods. Experimental results indicate that LDPose is a challenging dataset, and we believe that it will foster further research and ultimately support individuals with limb deficiencies …
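The abstract only states that LDLoss contrasts residual-limb keypoints against intact-limb keypoints. The snippet below is an illustrative margin-based formulation of that idea, keeping each group compact around its own centroid while pushing the two groups apart; it is an assumption for illustration, not LDLoss itself.

```python
# Hedged sketch of a contrastive term separating residual-limb and intact-limb
# keypoint embeddings with a margin; an illustrative formulation, not LDLoss.
import torch
import torch.nn.functional as F

def limb_contrast_loss(residual_feats, intact_feats, margin=0.5):
    """residual_feats: (R, D); intact_feats: (I, D) keypoint embeddings."""
    r = F.normalize(residual_feats, dim=1)
    i = F.normalize(intact_feats, dim=1)
    # Compactness: members of each group stay close to their own centroid.
    compact = ((r - r.mean(0)).pow(2).sum(1).mean()
               + (i - i.mean(0)).pow(2).sum(1).mean())
    # Separation: cross-group cosine similarity pushed below (1 - margin).
    cross = r @ i.t()
    separate = F.relu(cross - (1.0 - margin)).mean()
    return compact + separate

loss = limb_contrast_loss(torch.randn(6, 128), torch.randn(11, 128))
print(float(loss))
```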