Poster
Sara Rojas Martinez · Matthieu Armando · Bernard Ghanem · Philippe Weinzaepfel · Vincent Leroy · Grégory Rogez
[ Exhibit Hall I ]
Abstract
Recovering the 3D geometry of a scene from a sparse set of uncalibrated images is a long-standing problem in computer vision. While recent learning-based approaches such as DUSt3R and MASt3R have demonstrated impressive results by directly predicting dense scene geometry, they are primarily trained on outdoor scenes with static environments and struggle to handle human-centric scenarios. In this work, we introduce HAMSt3R, an extension of MASt3R for joint human and scene 3D reconstruction from sparse, uncalibrated multi-view images. First, we obtain a strong image encoder by distilling the encoders of MASt3R and of a state-of-the-art Human Mesh Recovery (HMR) model, multi-HMR, for a better understanding of scene geometry and human bodies. Our method then incorporates additional network heads to segment humans, estimate dense correspondences via DensePose, and predict depth in human-centric environments, enabling a more holistic 3D reconstruction. By leveraging the outputs of our different heads, HAMSt3R produces a dense point map enriched with human semantic information in 3D. Unlike existing methods that rely on complex optimization pipelines, our approach is fully feed-forward and efficient, making it suitable for real-world applications. We evaluate our model on EgoHumans and EgoExo4D, two challenging benchmarks containing diverse human-centric scenarios. Additionally, we validate its generalization to …
Poster
Mahmoud Afifi · Luxi Zhao · Abhijith Punnappurath · Mohamed Abdelsalam · Ran Zhang · Michael Brown
[ Exhibit Hall I ]
Abstract
Cameras rely on auto white balance (AWB) to correct undesirable color casts caused by scene illumination and the camera’s spectral sensitivity. This is typically achieved using an illuminant estimator that determines the global color cast solely from the color information in the camera's raw sensor image. Mobile devices provide valuable additional metadata, such as capture timestamp and geolocation, that offers strong contextual clues to help narrow down the possible illumination solutions. This paper proposes an illuminant estimation method that incorporates such contextual metadata, along with additional capture information and image colors, into a lightweight model (~5K parameters), achieving promising results that match or surpass those of larger models. To validate our method, we introduce a dataset of 3,224 smartphone images with contextual metadata collected at various times of day and under diverse lighting conditions. The dataset includes ground-truth illuminant colors, determined using a color chart, and user-preferred illuminants validated through a user study, providing a comprehensive benchmark for AWB evaluation.
Poster
Hanxue Zhang · Haoran Jiang · Qingsong Yao · Yanan SUN · Renrui Zhang · Hao Zhao · Hongyang Li · Hongzi Zhu · Zetong Yang
[ Exhibit Hall I ]
Abstract
Despite the success of deep learning in closed-set 3D object detection, existing approaches struggle with zero-shot generalization to novel objects and camera configurations. We introduce DetAny3D, a promptable 3D detection foundation model capable of detecting any novel object under arbitrary camera configurations using only monocular inputs. Training a foundation model for 3D detection is fundamentally constrained by the limited availability of annotated 3D data, which motivates DetAny3D to leverage the rich prior knowledge embedded in extensively pre-trained 2D foundation models to compensate for this scarcity. To effectively transfer 2D knowledge to 3D, DetAny3D incorporates two core modules: the 2D Aggregator, which aligns features from different 2D foundation models, and the 3D Interpreter with Zero-Embedding Mapping, which mitigates catastrophic forgetting in 2D-to-3D knowledge transfer. Experimental results validate the strong generalization of our DetAny3D, which not only achieves state-of-the-art performance on unseen categories and novel camera configurations, but also surpasses most competitors on in-domain data. DetAny3D sheds light on the potential of the 3D foundation model for diverse applications in real-world scenarios, e.g., rare object detection in autonomous driving, and demonstrates promise for further exploration of 3D-centric tasks in open-world settings.
Poster
Marko Mihajlovic · Siwei Zhang · Gen Li · KAIFENG ZHAO · Lea Müller · Siyu Tang
[ Exhibit Hall I ]
Abstract
Parametric human body models play a crucial role in computer graphics and vision, enabling applications ranging from human motion analysis to understanding human-environment interactions. Traditionally, these models use surface meshes, which pose challenges in efficiently handling interactions with other geometric entities, such as objects and scenes, typically represented as meshes or point clouds. To address this limitation, recent research has explored volumetric neural implicit body models. However, existing works are either insufficiently robust for complex human articulations or impose high computational and memory costs, limiting their widespread use. To this end, we introduce VolumetricSMPL, a neural volumetric body model that leverages Neural Blend Weights (NBW) to generate compact, yet efficient MLP decoders. Unlike prior approaches that rely on large MLPs, NBW dynamically blends a small set of learned weight matrices using predicted shape- and pose-dependent coefficients, significantly improving computational efficiency while preserving expressiveness. VolumetricSMPL outperforms prior volumetric occupancy model COAP with 10× faster inference, 6× lower GPU memory usage, enhanced accuracy, and a Signed Distance Function (SDF) for efficient and differentiable contact modeling. We demonstrate VolumetricSMPL’s strengths across four challenging tasks: (1) reconstructing human-object interactions from in-the-wild images, (2) recovering human meshes in 3D scenes from egocentric views, (3) scene-constrained …
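The blending idea described for Neural Blend Weights can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the matrix shapes, the ReLU activation, and the random stand-ins for learned matrices and predicted coefficients are all assumptions made for illustration. It only shows how a small bank of weight matrices can be mixed into one layer weight by per-sample coefficients.

```python
import numpy as np

def blend_weights(basis, coeffs):
    """Mix K basis matrices of shape (K, out, in) using K coefficients.

    Returns a single (out, in) weight matrix: sum_k coeffs[k] * basis[k].
    """
    return np.tensordot(coeffs, basis, axes=1)

def blended_layer(x, basis, coeffs):
    """One hypothetical decoder layer: blended linear map followed by ReLU."""
    weight = blend_weights(basis, coeffs)
    return np.maximum(weight @ x, 0.0)

rng = np.random.default_rng(0)
basis = rng.normal(size=(4, 8, 3))        # stand-in for 4 learned matrices
coeffs = np.array([0.1, 0.2, 0.3, 0.4])   # stand-in for predicted shape/pose coefficients
x = rng.normal(size=3)                    # stand-in for a query-point feature
y = blended_layer(x, basis, coeffs)
```

Because only the coefficients change per body shape and pose, the layer stays as small as a single matrix at evaluation time, which is the efficiency argument the abstract makes.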
Poster
Maria-Paola Forte · Nikos Athanasiou · Giulia Ballardini · Jan Bartels · Katherine J. Kuchenbecker · Michael Black
[ Exhibit Hall I ]
Abstract
Capturing accurate 3D human pose in the wild would provide valuable data for training motion-generation and pose-estimation methods. While video-based capture methods are increasingly accurate, we observe that they often fail in cases involving self-contact, such as a hand touching the face. Natural human behavior frequently includes self-contact, but determining when it occurs is challenging from video alone. In contrast, wearable bioimpedance sensing can cheaply and unobtrusively measure ground-truth skin-to-skin contact. Consequently, we propose a novel approach that combines visual pose estimators with bioimpedance sensing to capture the 3D pose of people by taking self-contact into account. Our method, BioTUCH, initializes the pose using an off-the-shelf estimator and introduces contact-aware pose optimization that minimizes reprojection error and deviations from the input estimate while enforcing vertex proximity constraints based on the measured start and end of self-touch. We validate our approach using a new dataset of synchronized RGB video, bioimpedance measurements, and 3D motion capture, demonstrating an average of 18.5% improvement in reconstruction accuracy. Our framework enables efficient large-scale collection of contact-aware training data for improving pose estimation and generation. Code and data will be shared publicly.
Poster
Lin Bie · Siqi Li · Yifan Feng · Yue Gao
[ Exhibit Hall I ]
Abstract
Monocular depth estimation (MDE) is a fundamental problem in computer vision with wide-ranging applications in various downstream tasks. While multi-scale features are perceptually critical for MDE, existing transformer-based methods have yet to leverage them explicitly. To address this limitation, we propose a hypergraph-based multi-scale representation fusion framework, Hyper-Depth. The proposed Hyper-Depth incorporates two key components: a Semantic Consistency Enhancement (SCE) module and a Geometric Consistency Constraint (GCC) module. The SCE module, designed based on hypergraph convolution, aggregates global information and enhances the representation of multi-scale patch features. Meanwhile, the GCC module provides geometric guidance to reduce over-fitting errors caused by excessive reliance on local features. In addition, we introduce a Correlation-based Conditional Random Fields (C-CRFs) module as the decoder to filter correlated patches and compute attention weights more effectively. Extensive experiments demonstrate that our method significantly outperforms state-of-the-art approaches across all evaluation metrics on the KITTI and NYU-Depth-v2 datasets, achieving improvements of 6.21% and 3.32% on the main metric RMSE, respectively. Furthermore, zero-shot evaluations on the nuScenes and SUN-RGBD datasets validate the generalizability of our approach.
Poster
Varun Sundar · Tianyi Zhang · Sacha Jungerman · Mohit Gupta
[ Exhibit Hall I ]
Abstract
Quanta image sensors record individual photons, enabling capabilities like imaging in near-complete darkness and ultra-high-speed videography. Yet, most research on quanta sensors is limited to recovering image intensities. Can we go beyond just imaging, and develop algorithms that can extract high-level scene information from quanta sensors? This could unlock new possibilities in vision systems, offering reliable operation in extreme conditions. The challenge: raw photon streams captured by quanta sensors have fundamentally different characteristics than conventional images, making them incompatible with vision models. One approach is to first transform raw photon streams to conventional-like images, but this is prohibitively expensive in terms of compute, memory, and latency, making it impractical for most vision and robotics systems. We propose quanta neural networks (QNNs) that directly produce downstream task objectives from raw photon streams. Our core proposal is a trainable QNN layer that can seamlessly integrate with existing image- and video-based neural networks, producing quanta counterparts. By avoiding image reconstruction and allocating computational resources on a scene-adaptive basis, QNNs achieve 1 to 2 orders of magnitude improvements across all efficiency metrics (compute, latency, readout bandwidth) as compared to reconstruction-based quanta vision, while maintaining high task accuracy across a wide gamut of challenging scenarios including low …
Poster
Haokai Zhu · Bo Qu · Si-Yuan Cao · Runmin Zhang · Shujie Chen · Bailin Yang · Hui-liang Shen
[ Exhibit Hall I ]
Abstract
Previous deep image registration methods that employ single homography, multi-grid homography, or thin-plate spline often struggle with real scenes containing depth disparities due to their inherent limitations. To address this, we propose an Exponential-Decay Free-Form Deformation Network (EDFFDNet), which employs free-form deformation with an exponential-decay basis function. This design achieves higher efficiency and performs well in scenes with depth disparities, benefiting from its inherent locality. We also introduce an Adaptive Sparse Motion Aggregator (ASMA), which replaces the MLP motion aggregator used in previous methods. By transforming dense interactions into sparse ones, ASMA reduces parameters and improves accuracy. Additionally, we propose a progressive correlation refinement strategy that leverages global-local correlation patterns for coarse-to-fine motion estimation, further enhancing efficiency and accuracy. Experiments demonstrate that EDFFDNet reduces parameters, memory, and total runtime by 70.5%, 32.6%, and 33.7%, respectively, while achieving a 0.5 dB PSNR gain over the state-of-the-art method. With an additional local refinement stage, EDFFDNet-2 further improves PSNR by 1.06 dB while maintaining lower computational costs. Our method also demonstrates strong generalization ability across datasets, outperforming previous deep learning methods.
Poster
Ruifei Zhang · Junlin Xie · Wei Zhang · Weikai Chen · Xiao Tan · Xiang Wan · Guanbin Li
[ Exhibit Hall I ]
Abstract
Effectively integrating Large Language Models (LLMs) into autonomous driving requires a balance between leveraging high-level reasoning and maintaining real-time efficiency. Existing approaches either activate LLMs too frequently, causing excessive computational overhead, or use fixed schedules, failing to adapt to dynamic driving conditions. To address these challenges, we propose AdaDrive, an adaptively collaborative slow-fast framework that optimally determines when and how LLMs contribute to decision-making. (1) When to activate the LLM: AdaDrive employs a novel adaptive activation loss that dynamically determines LLM invocation based on a comparative learning mechanism, ensuring activation only in complex or critical scenarios. (2) How to integrate LLM assistance: Instead of rigid binary activation, AdaDrive introduces an adaptive fusion strategy that modulates a continuous, scaled LLM influence based on scene complexity and prediction confidence, ensuring seamless collaboration with conventional planners. Through these strategies, AdaDrive provides a flexible, context-aware framework that maximizes decision accuracy without compromising real-time performance. Extensive experiments on language-grounded autonomous driving benchmarks demonstrate that AdaDrive achieves state-of-the-art performance in terms of both driving accuracy and computational efficiency.
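To make the contrast between binary activation and a continuous, scaled LLM influence concrete, here is a small sketch. The abstract does not specify the fusion rule, so everything below is hypothetical: the sigmoid gate, its slope of 4.0, the additive fusion, and the names `fusion_weight` and `fuse` are illustrative assumptions, not AdaDrive's actual design.

```python
import numpy as np

def fusion_weight(scene_complexity, planner_confidence, slope=4.0):
    """Hypothetical gate in (0, 1).

    Grows when the scene is complex and shrinks when the conventional
    planner is already confident. Both inputs are assumed to lie in [0, 1].
    """
    z = slope * (scene_complexity - planner_confidence)
    return 1.0 / (1.0 + np.exp(-z))

def fuse(planner_feat, llm_feat, scene_complexity, planner_confidence):
    """Add the LLM feature to the planner feature, scaled by the gate."""
    alpha = fusion_weight(scene_complexity, planner_confidence)
    return planner_feat + alpha * llm_feat

planner_feat = np.array([1.0, 0.0])
llm_feat = np.array([0.0, 2.0])
easy = fuse(planner_feat, llm_feat, scene_complexity=0.1, planner_confidence=0.9)
hard = fuse(planner_feat, llm_feat, scene_complexity=0.9, planner_confidence=0.1)
```

In this toy setting the LLM feature contributes far more in the `hard` case than in the `easy` one, while the planner feature is always passed through unchanged.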
Poster
Yuhang Yang · Fengqi Liu · Yixing Lu · Qin Zhao · Pingyu Wu · Wei Zhai · Ran Yi · Yang Cao · Lizhuang Ma · Zheng-Jun Zha · Junting Dong
[ Exhibit Hall I ]
Abstract
3D human digitization has long been a highly pursued yet challenging task. Existing methods aim to generate high-quality 3D digital humans from single or multiple views, but remain primarily constrained by current paradigms and the scarcity of 3D human assets. Specifically, recent approaches fall into several paradigms: optimization-based and feed-forward (both single-view regression and multi-view generation with reconstruction). However, they are limited by slow speed, low quality, cascade reasoning, and ambiguity in mapping low-dimensional planes to high-dimensional space due to occlusion and invisibility, respectively. Furthermore, existing 3D human assets remain small-scale, insufficient for large-scale training. To address these challenges, we propose a latent space generation paradigm for 3D human digitization. By compressing multi-view images into Gaussians via a UV-structured VAE and combining this with DiT-based conditional generation, we transform the ill-posed low-to-high-dimensional mapping problem into a learnable distribution shift, which also supports end-to-end inference. In addition, we employ the multi-view optimization approach combined with synthetic data to construct the HGS-1M dataset, which contains 1 million 3D Gaussian assets to support large-scale training. Experimental results demonstrate that our paradigm, powered by large-scale training, produces high-quality 3D human Gaussians with intricate textures, facial details, and loose clothing deformation. All training code, models, …
Poster
Xuying Zhang · Yutong Liu · Yangguang Li · Renrui Zhang · Yufei Liu · Kai Wang · Wanli Ouyang · Zhiwei Xiong · Peng Gao · Qibin Hou · Ming-Ming Cheng
[ Exhibit Hall I ]
Abstract
We present TAR3D, a novel framework that consists of a 3D-aware Vector Quantized-Variational AutoEncoder (VQ-VAE) and a Generative Pre-trained Transformer (GPT) to generate high-quality 3D assets. The core insight of this work is to migrate the multimodal unification and promising learning capabilities of the next-token prediction paradigm to conditional 3D object generation. To achieve this, the 3D VQ-VAE first encodes a wide range of 3D shapes into a compact triplane latent space and utilizes a set of discrete representations from a trainable codebook to reconstruct fine-grained geometries under the supervision of query point occupancy. Then, the 3D GPT, equipped with a custom triplane position embedding called TriPE, predicts the codebook index sequence with prefilling prompt tokens in an autoregressive manner so that the composition of 3D geometries can be modeled part by part. Extensive experiments on ShapeNet and Objaverse demonstrate that TAR3D can achieve superior generation quality over existing methods in text-to-3D and image-to-3D tasks.
Poster
Jinjing Zhu · Tianbo Pan · Zidong Cao · Yexin Liu · James Kwok · Hui Xiong
[ Exhibit Hall I ]
Abstract
With the superior sensitivity of event cameras to high-speed motion and extreme lighting conditions, event-based monocular depth estimation has gained popularity for predicting structural information about surrounding scenes in challenging environments. However, the scarcity of labeled event data constrains prior supervised learning methods. Unleashing the promising potential of the existing RGB-based depth foundation model, DAM, we propose Depth Any Event stream (EventDAM) to achieve high-performance event-based monocular depth estimation in an annotation-free manner. EventDAM effectively combines paired dense RGB images with sparse event data by incorporating three key cross-modality components: Sparsity-aware Feature Mixture (SFM), Sparsity-aware Feature Distillation (SFD), and Sparsity-invariant Consistency Module (SCM). With the proposed sparsity metric, SFM mixes features from RGB images and event data to generate auxiliary depth predictions, while SFD facilitates adaptive feature distillation. Furthermore, SCM ensures output consistency across varying sparsity levels in event data, thereby endowing EventDAM with zero-shot capabilities across diverse scenes. Extensive experiments across a variety of benchmark datasets, compared to approaches using diverse input modalities, robustly substantiate the generalization and zero-shot capabilities of EventDAM.
Poster
Bing Fan · Yunhe Feng · Yapeng Tian · James Liang · Yuewei Lin · Yan Huang · Heng Fan
[ Exhibit Hall I ]
Abstract
Egocentric visual query localization (EgoVQL) focuses on localizing the target of interest in space and time from first-person videos, given a visual query. Despite recent progress, existing methods often struggle to handle severe object appearance changes and cluttered backgrounds in the video because they lack sufficient target cues, leading to degradation. To address this, we introduce PRVQL, a novel Progressive knowledge-guided Refinement framework for EgoVQL. The core idea is to continuously exploit target-relevant knowledge directly from videos and utilize it as guidance to refine both query and video features for improving target localization. Our PRVQL contains multiple processing stages. The target knowledge from one stage, comprising appearance and spatial knowledge extracted via two specially designed knowledge learning modules, is utilized as guidance to refine the query and video features for the next stage, which are used to generate more accurate knowledge for further feature refinement. With such a progressive process, target knowledge in PRVQL can be gradually improved, which, in turn, leads to better refined query and video features for localization in the final stage. Compared to previous methods, our PRVQL, besides the given object cues, enjoys additional crucial target information from a video as guidance to refine features, and hence enhances …
Poster
Yansong Guo · Jie Hu · Yansong Qu · Liujuan Cao
[ Exhibit Hall I ]
Abstract
Recent advances in interactive 3D segmentation from 2D images have demonstrated impressive performance. However, current models typically require extensive scene-specific training to accurately reconstruct and segment objects, which limits their applicability in real-time scenarios. In this paper, we introduce WildSeg3D, an efficient approach that enables the segmentation of arbitrary 3D objects across diverse environments using a feed-forward mechanism. A key challenge of this feed-forward approach lies in the accumulation of 3D alignment errors across multiple 2D views, which can lead to inaccurate 3D segmentation results. To address this issue, we propose Dynamic Global Aligning (DGA), a technique that improves the accuracy of global multi-view alignment by focusing on difficult-to-match 3D points across images, using a dynamic adjustment function. Additionally, for real-time interactive segmentation, we introduce Multi-view Group Mapping (MGM), a method that utilizes an object mask cache to integrate multi-view segmentations and respond rapidly to user prompts. WildSeg3D demonstrates robust generalization across arbitrary scenes, thereby eliminating the need for scene-specific training. Specifically, WildSeg3D not only attains the accuracy of state-of-the-art (SOTA) methods but also achieves a 40× speedup compared to existing SOTA models. Our code will be publicly available.
Poster
Junli Liu · Qizhi Chen · Zhigang Wang · Yiwen Tang · Yiting Zhang · Chi Yan · Dong Wang · Xuelong Li · Bin Zhao
[ Exhibit Hall I ]
Abstract
Visual grounding (VG) aims to localize target objects in an image based on natural language descriptions. In this paper, we propose AerialVG, a new task focusing on visual grounding from aerial views. Compared to traditional VG, AerialVG poses new challenges, e.g., appearance-based grounding is insufficient to distinguish among multiple visually similar objects, and positional relations should be emphasized. Besides, existing VG models struggle when applied to aerial imagery, where high-resolution images cause significant difficulties. To address these challenges, we introduce the first AerialVG dataset, consisting of 5K real-world aerial images, 50K manually annotated descriptions, and 103K objects. In particular, each annotation in the AerialVG dataset contains multiple target objects annotated with relative spatial relations, requiring models to perform comprehensive spatial reasoning. Furthermore, we propose an innovative model designed for the AerialVG task, where a Hierarchical Cross-Attention is devised to focus on target regions, and a Relation-Aware Grounding module is designed to infer positional relations. Experimental results validate the effectiveness of our dataset and method, highlighting the importance of spatial reasoning in aerial visual grounding. The code and dataset will be released.
Poster
Weida Wang · Changyong He · Jin Zeng · Di Qiu
[ Exhibit Hall I ]
Abstract
Depth images captured by Time-of-Flight (ToF) sensors are prone to noise, requiring denoising for reliable downstream applications. Previous works either focus on single-frame processing, or perform multi-frame processing without considering depth variations at corresponding pixels across frames, leading to undesirable temporal inconsistency and spatial ambiguity. In this paper, we propose a novel ToF depth denoising network leveraging motion-invariant graph fusion to simultaneously enhance temporal stability and spatial sharpness. Specifically, despite depth shifts across frames, graph structures exhibit temporal self-similarity, enabling cross-frame geometric attention for graph fusion. Then, by incorporating an image smoothness prior on the fused graph and a data fidelity term derived from the ToF noise distribution, we formulate a maximum a posteriori problem for ToF denoising. Finally, the solution is unrolled into iterative filters whose weights are adaptively learned from the graph-informed geometric attention, producing a high-performance yet interpretable network. Experimental results demonstrate that the proposed scheme achieves state-of-the-art performance in terms of accuracy and consistency on the synthetic DVToF dataset and exhibits robust generalization on the real Kinectv2 dataset.
Poster
Suchisrit Gangopadhyay · Jung Hee Kim · Xien Chen · Patrick Rim · Hyoungseob Park · Alex Wong
[ Exhibit Hall I ]
Abstract
Monocular depth estimation (MDE) has advanced significantly with the introduction of transformer-based foundational vision models. However, their application to fisheye images, widely used in robotics, security systems, autonomous vehicles, and augmented reality due to their wide field of view, remains challenging due to severe radial distortions and calibration differences. Standard transformer-based models trained on perspective images fail to generalize effectively to fisheye inputs, resulting in poor depth predictions. To address this, we introduce calibration tokens, a lightweight, token-based adaptation method that allows perspective-trained foundational models to handle fisheye distortions without retraining or fine-tuning the entire network. Calibration tokens learn to realign distorted fisheye features with the perspective latent distribution in a self-supervised manner using a novel inverse warping consistency loss. This training approach leverages existing perspective image datasets and pre-trained foundational models without requiring labeled fisheye images. Experiments demonstrate that our calibration tokens improve performance on real-world fisheye datasets for monocular depth estimation tasks, surpassing baselines while maintaining computational efficiency and inference-time simplicity.
Poster
WonJun Moon · Hyun Seok Seong · Jae-Pil Heo
[ Exhibit Hall I ]
Abstract
Facilitating an entity's interaction with objects requires accurately identifying parts that afford specific actions. Weakly supervised affordance grounding (WSAG) seeks to imitate human learning from third-person demonstrations, where humans intuitively grasp functional parts without needing pixel-level annotations. To achieve this, grounding is typically learned using a shared classifier across images from different perspectives, along with distillation strategies incorporating a part discovery process. However, since affordance-relevant parts are not always easily distinguishable, models primarily rely on classification, often focusing on common features that are unrelated to affordance. To address this limitation, we move beyond isolated part-level learning by introducing selective prototypical and pixel contrastive objectives that adaptively learn affordance-relevant cues at both the part and object levels, depending on the granularity of the available information. Initially, we find the action-associated objects in both egocentric (object-focused) and exocentric (third-person example) images by leveraging CLIP. Then, cross-referencing the discovered objects of complementary views, we excavate the precise part-level affordance clues in each perspective. By consistently learning to distinguish affordance-relevant regions from affordance-irrelevant background context, our approach effectively shifts activation from irrelevant areas toward meaningful affordance cues. Experimental results demonstrate the effectiveness of our method.
Poster
Fan Pei · jinchen bai · Xiang Feng · Zoubin Bi · Kun Zhou · Hongzhi Wu
[ Exhibit Hall I ]
Abstract
We present OpenSubstance, a high-quality measured dataset with 1.8 million high-dynamic-range images of 151 objects with a wide variety in shape and appearance, captured under 270 camera views and 1,637 lighting conditions, including 1,620 one-light-at-a-time, 8 environment, 8 linear and 1 full-on illumination. For each image, the corresponding lighting condition, camera parameters and foreground segmentation mask are provided. High-precision 3D geometry is also acquired for rigid objects. It takes 1 hour on average to capture one object with our custom-built high-performance lightstage and a top-grade commercial 3D scanner. We perform comprehensive quantitative evaluation on state-of-the-art techniques across different tasks, including single- and multi-view photometric stereo, as well as relighting. The project is publicly available at ***anonymous link***.
Poster
peilin Tao · Hainan Cui · Diantao Tu · Shuhan Shen
[ Exhibit Hall I ]
Abstract
Multi-camera systems are increasingly vital in the environmental perception of autonomous vehicles and robotics. Their physical configuration offers inherent fixed relative pose constraints that benefit Structure-from-Motion (SfM). However, traditional global SfM systems struggle with robustness due to their optimization framework. We propose a novel global motion averaging framework for multi-camera systems, featuring two core components: a decoupled rotation averaging module and a hybrid translation averaging module. Our rotation averaging employs a hierarchical strategy by first estimating relative rotations within rigid camera units and then computing global rigid unit rotations. To enhance the robustness of translation averaging, we incorporate both camera-to-camera and camera-to-point constraints to initialize camera positions and 3D points with a convex distance-based objective function and refine them with an unbiased non-bilinear angle-based objective function. Experiments on large-scale datasets show that our system matches or exceeds incremental SfM accuracy while significantly improving efficiency. Our framework outperforms existing global SfM methods, establishing itself as a robust solution for real-world multi-camera SfM applications. We will share our system as an open-source implementation.
Poster
Haoyu Zhao · Hao Wang · Xingyue Zhao · Hao Fei · Hongqiu Wang · Chengjiang Long · Hua Zou
[ Exhibit Hall I ]
Abstract
Recent advancements in 3D generation models have opened new possibilities for simulating dynamic 3D object movements and customizing behaviors, yet creating this content remains challenging. Current methods often require manual assignment of precise physical properties for simulations or rely on video generation models to predict them, which is computationally intensive. In this paper, we rethink the usage of multi-modal large language models (MLLMs) in physics-based simulation, and present PhysSplat, a physics-based approach that efficiently endows static 3D objects with interactive dynamics. We begin with detailed scene reconstruction and object-level 3D open-vocabulary segmentation, progressing to multi-view image in-painting. Inspired by human visual reasoning, we propose MLLM-based Physical Property Perception (MLLM-P3) to predict the mean physical properties of objects in a zero-shot manner. The Material Property Distribution Prediction model (MPDP) then estimates physical property distributions via geometry-conditioned probabilistic sampling of MLLM-P3 outputs, reformulating the problem as probability distribution estimation to reduce computational costs. Finally, we simulate objects in 3D scenes with particles sampled via the Physical-Geometric Adaptive Sampling (PGAS) strategy, efficiently capturing complex deformations and significantly reducing computational costs. Extensive experiments and user studies demonstrate that our PhysSplat achieves more realistic motion than state-of-the-art methods within 2 minutes on a single GPU.
Poster
Adam Harley · Yang You · Yang Zheng · Xinglong Sun · Nikhil Raghuraman · Sheldon Liang · Yunqi Gu · Wen-Hsuan Chu · Suya You · Achal Dave · Rares Ambrus · Katerina Fragkiadaki · Leonidas Guibas
[ Exhibit Hall I ]
Abstract
We introduce AllTracker: a method that estimates long-range point tracks by way of estimating the flow field between a query frame and every other frame of a video. Unlike existing point tracking methods, our approach delivers high-resolution and dense (all-pixel) correspondence fields, which can be visualized as flow maps. Unlike existing optical flow methods, our approach corresponds one frame to hundreds of subsequent frames, rather than just the next frame. We develop a new architecture for this task, blending techniques from existing work in optical flow and point tracking: the model performs iterative inference on low-resolution grids of correspondence estimates, propagating information spatially via 2D convolution layers, and propagating information temporally via pixel-aligned attention layers. The model is fast and parameter-efficient (16 million parameters), and delivers state-of-the-art point tracking accuracy at high resolution (i.e., tracking 768x1024 pixels, on a 40G GPU). A benefit of our design is that we can train jointly on flow datasets and point tracking datasets, and we find that doing so is crucial for top performance. We provide an extensive ablation study on our architecture details and training recipe, making it clear which details matter most. We will publicly release our code and model weights.
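The relationship between per-frame flow fields and point tracks can be sketched in a few lines of NumPy. This is a minimal illustration rather than AllTracker's code: the array layout `(T, H, W, 2)` with `(dx, dy)` channels, the integer query coordinates, and the name `tracks_from_flows` are assumptions. Given the flow from the query frame to every other frame, the track of a query pixel is its position plus the flow sampled at that pixel.

```python
import numpy as np

def tracks_from_flows(flows, queries):
    """Read point tracks out of query-to-frame flow fields.

    flows:   (T, H, W, 2) array; flows[t, y, x] is the (dx, dy) displacement
             of query-frame pixel (x, y) in frame t.
    queries: (N, 2) integer array of (x, y) pixel coordinates in the query frame.
    Returns: (T, N, 2) array of (x, y) positions, one track per query.
    """
    xs, ys = queries[:, 0], queries[:, 1]
    displacements = flows[:, ys, xs, :]           # (T, N, 2)
    return queries[None].astype(float) + displacements

# Toy example: 3 frames of a 4x5 image; in frame 1 everything shifts right by 1 pixel.
flows = np.zeros((3, 4, 5, 2))
flows[1, :, :, 0] = 1.0
queries = np.array([[2, 1], [0, 3]])
tracks = tracks_from_flows(flows, queries)
```

Since the flow is defined at every pixel of the query frame, any pixel can be used as a query, which is what makes the resulting correspondence dense.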
Poster
Christian Löwens · Thorben Funke · Jingchao Xie · Alexandru Condurache
[ Exhibit Hall I ]
Abstract
Online mapping approaches show remarkable results in predicting vectorized maps from multi-view camera images only. However, all existing approaches still rely on ground-truth high-definition maps during training, which are expensive to obtain and often not geographically diverse enough for reliable generalization. In this work, we propose PseudoMapTrainer, a novel approach to online mapping that uses pseudo labels generated from unlabeled sensor data. We derive those pseudo labels by reconstructing the road surface from multi-camera imagery using Gaussian splatting and semantics of a pre-trained 2D segmentation network. In addition, we introduce a mask-aware assignment algorithm and loss function to handle partially masked pseudo labels, allowing for the first time the training of online mapping models without any ground-truth maps. Furthermore, our pseudo labels can be effectively used to pre-train an online model in a semi-supervised manner to leverage large-scale unlabeled crowdsourced data. The code will be made publicly available.
Poster
Zhuoguang Chen · Minghui Qin · Tianyuan Yuan · Zhe Liu · Hang Zhao
[ Exhibit Hall I ]
Abstract
Recent advancements in sparse multi-view scene reconstruction have been significant, yet existing methods face limitations when processing streams of input images. These methods either rely on time-consuming offline optimization or are restricted to shorter sequences, hindering their applicability in real-time scenarios. In this work, we propose LONG3R (LOng sequence streamiNG 3D Reconstruction), a novel model designed for streaming multi-view 3D scene reconstruction over longer sequences. Our model achieves real-time processing by operating recurrently, maintaining and updating memory with each new observation. We introduce a refined decoder that facilitates coarse-to-fine interaction between memory and new observations using memory gating and a dual-source attention structure. To effectively capture long-sequence memory, we propose a 3D spatio-temporal memory that dynamically prunes redundant spatial information while adaptively adjusting resolution along the scene. To enhance our model’s performance on long sequences while maintaining training efficiency, we employ a two-stage curriculum training strategy, each stage targeting specific capabilities. Experiments on multiple multi-view reconstruction datasets demonstrate that LONG3R outperforms state-of-the-art streaming methods, particularly for longer sequences, while maintaining real-time inference speed.
Poster
JINPENG DONG · Chen Li · Yutong Lin · Jingwen Fu · Sanping Zhou · Nanning Zheng
[ Exhibit Hall I ]
Abstract
High-definition (HD) maps are an important component supporting navigation and planning for autonomous driving vehicles. Predicting map elements with high quality (high classification and localization scores) is crucial to the safety of autonomous driving vehicles. However, current methods perform poorly on high-quality predictions. Two main factors are responsible for this: 1) inappropriate classification labels, because one-to-many matching queries share labels, and 2) sub-optimal task features, because tasks share sampling features. In this paper, we reveal these two inherent defects in current methods and develop a novel HD map construction method named DAMap to address them. Specifically, DAMap consists of three components: Distance-aware Focal Loss (DAFL), Hybrid Loss Scheme (HLS), and Task Modulated Deformable Attention (TMDA). The DAFL is introduced to assign appropriate classification labels for one-to-many matching samples. The TMDA is proposed to obtain discriminative task-specific features. Furthermore, HLS is proposed to better utilize the advantages of the proposed DAFL. We perform extensive experiments and consistently achieve performance improvement on the NuScenes and Argoverse2 benchmarks under different metrics, baselines, splits, backbones, and schedules.
Poster
Zhu Yihang · Jinhao Zhang · Yuxuan Wang · Aming WU · Cheng Deng
[ Exhibit Hall I ]
Abstract
As an important direction of embodied intelligence, 3D Visual Grounding has attracted much attention, aiming to identify 3D objects matching the given language description. Most existing methods follow a two-stage process, i.e., first detecting proposal objects and then identifying the right objects based on their relevance to the given query. However, when the query is complex, it is difficult to leverage an abstract language representation to lock onto the corresponding objects accurately, which degrades grounding performance. In general, given a specific object, humans usually follow two clues to finish the corresponding grounding, i.e., attribute and location clues. To this end, we explore a new mechanism, attribute-to-location clue reasoning, to conduct accurate grounding. In particular, we propose a VGMamba network that consists of an SVD-based attribute mamba, a location mamba, and a multi-modal fusion mamba. Taking a 3D point cloud scene and language query as the input, we first exploit SVD to decompose the extracted features. Then, a sliding-window operation is conducted to capture attribute characteristics. Next, a location mamba is presented to obtain the corresponding location information. Finally, by means of multi-modal mamba fusion, the model can effectively localize the object that matches the given query. In the experiment, our method …
Poster
Mingquan Zhou · Chen He · Ruiping Wang · Xilin Chen
[ Exhibit Hall I ]
Abstract
Open-vocabulary 3D instance segmentation (OV-3DIS), which aims to segment and classify objects beyond predefined categories, is a critical capability for embodied AI applications. Existing methods rely on pre-trained 2D foundation models, focusing on instance-level features while overlooking contextual relationships, limiting their ability to generalize to rare or ambiguous objects. To address these limitations, we propose an OV-3DIS framework guided by contextual information. First, we employ a Class-agnostic Proposal Module, integrating a pre-trained 3D segmentation model with a SAM-guided segmenter to extract robust 3D instance masks. Subsequently, we design a Semantic Reasoning Module, which selects the best viewpoint for each instance and constructs three 2D context-aware representations. The representations are processed using Multimodal Large Language Models with Chain-of-Thought prompting to enhance semantic inference. Notably, our method outperforms state-of-the-art methods on the ScanNet200 and Replica datasets, demonstrating superior open-vocabulary segmentation capabilities. Moreover, preliminary implementation in real-world scenarios verifies our method's robustness and accuracy, highlighting its potential for embodied AI tasks such as object-driven navigation.
Poster
Boyi Sun · Yuhang Liu · Houxin He · Yonglin Tian · Fei-Yue Wang
[ Exhibit Hall I ]
Abstract
Manual annotation of 3D bounding boxes in large-scale 3D scenes is expensive and time-consuming. This motivates the exploration of annotation-free 3D object detection using unlabeled point cloud data. Existing unsupervised 3D detection frameworks predominantly identify moving objects via scene flow, which has significant limitations: (1) limited detection classes ($<3$), (2) difficulty in detecting stationary objects, and (3) reliance on high frame rates. To address these limitations, we propose AnnofreeOD, a novel Annotation-free Object Detection framework based on 2D-to-3D knowledge distillation. First, we explore an effective strategy to generate high-quality pseudo boxes using single-frame 2D knowledge. Second, we observe the noise from the previous step and introduce Noise-Resistant Regression (NRR) based on Box Augmentation (BA). AnnofreeOD achieves state-of-the-art performance across multiple experiments. On the nuScenes dataset, we establish the first annotation-free 10-class object detection baseline, achieving 40% of fully supervised performance. Furthermore, in 3-class and class-agnostic object detection tasks, our approach surpasses prior state-of-the-art methods by +9.3% mAP (+12.2% NDS) and +6.0% AP (+7.2% NDS), significantly improving precision. The code and model weights are provided in the supplementary material.
Poster
Longliang Liu · Miaojie Feng · Junda Cheng · Jijun Xiang · Xuan Zhu · Xin Yang
[ Exhibit Hall I ]
Abstract
Panoramic optical flow enables a comprehensive understanding of temporal dynamics across wide fields of view. However, severe distortions caused by sphere-to-plane projections, such as the equirectangular projection (ERP), significantly degrade the performance of conventional perspective-based optical flow methods, especially in polar regions. To address this challenge, we propose PriOr-Flow, a novel dual-branch framework that leverages the low-distortion nature of the orthogonal view to enhance optical flow estimation in these regions. Specifically, we introduce the Dual-Cost Collaborative Lookup (DCCL) operator, which jointly retrieves correlation information from both the primitive and orthogonal cost volumes, effectively mitigating distortion noise during cost volume construction. Furthermore, our Ortho-Driven Distortion Compensation (ODDC) module iteratively refines motion features from both branches, further suppressing polar distortions. Extensive experiments demonstrate that PriOr-Flow is compatible with various perspective-based iterative optical flow methods and consistently achieves state-of-the-art performance on publicly available panoramic optical flow datasets, setting a new benchmark for wide-field motion estimation.
Poster
Haoyu Zhen · Qiao Sun · Hongxin Zhang · Junyan Li · Siyuan Zhou · Yilun Du · Chuang Gan
[ Exhibit Hall I ]
Abstract
This paper presents an effective approach for learning novel 4D embodied world models, which predict the dynamic evolution of 3D scenes over time in response to an embodied agent's actions, providing both spatial and temporal consistency. We propose to learn a 4D world model by training on RGB-DN (RGB, Depth, and Normal) videos. This not only surpasses traditional 2D models by incorporating detailed shape, configuration, and temporal changes into its predictions, but also allows us to effectively learn accurate inverse dynamic models for an embodied agent. Specifically, we first extend existing robotic manipulation video datasets with depth and normal information leveraging off-the-shelf models. Next, we fine-tune a video generation model on this annotated dataset, which jointly predicts RGB-DN for each frame. We then present an algorithm to directly convert generated RGB, Depth, and Normal videos into a high-quality 4D scene of the world. Our method ensures temporal and spatial coherence in 4D scene predictions from embodied scenarios, enables novel view synthesis for embodied environments, and facilitates policy learning, yielding policies that significantly outperform those derived from prior video-based world models.
Poster
Fatemeh Saleh · Sadegh Aliakbarian · Charlie Hewitt · Lohit Petikam · Xiao Xian · Antonio Criminisi · Thomas J. Cashman · Tadas Baltrusaitis
[ Exhibit Hall I ]
Abstract
The state of the art in human-centric computer vision achieves high accuracy and robustness across a diverse range of tasks. The most effective models in this domain have billions of parameters, thus requiring extremely large datasets, expensive training regimes, and compute-intensive inference. In this paper, we demonstrate that it is possible to train models on much smaller but high-fidelity synthetic datasets, with no loss in accuracy and higher efficiency. Using synthetic training data provides us with excellent levels of detail and perfect labels, while providing strong guarantees for data provenance, usage rights, and user consent. Procedural data synthesis also provides us with explicit control over data diversity, which we can use to address unfairness in the models we train. Extensive quantitative assessment on real input images demonstrates the accuracy of our models on three dense prediction tasks: depth estimation, surface normal estimation, and soft foreground segmentation. Our models require only a fraction of the cost of training and inference when compared with foundational models of similar accuracy. We release our annotated synthetic dataset, SynthHuman, as well as our models, upon publication.
Poster
Massimiliano Viola · Kevin Qu · Nando Metzger · Bingxin Ke · Alexander Becker · Konrad Schindler · Anton Obukhov
[ Exhibit Hall I ]
Abstract
Depth completion upgrades sparse depth measurements into dense depth maps guided by a conventional image. Existing methods for this highly ill-posed task operate in tightly constrained settings, and tend to struggle when applied to images outside the training domain, as well as when the available depth measurements are sparse, irregularly distributed, or of varying density. Inspired by recent advances in monocular depth estimation, we reframe depth completion as image-conditional depth map generation guided by a sparse set of measurements. Our method, Marigold-DC, builds on a pretrained latent diffusion model (LDM) for depth estimation and injects the depth observations as test-time guidance, via an optimization scheme that runs in tandem with the iterative inference of denoising diffusion. The method exhibits excellent zero-shot generalization across a diverse range of environments and handles even extremely sparse guidance effectively. Our results suggest that contemporary monodepth priors greatly robustify depth completion: it may be better to view the task as recovering dense depth from (dense) image pixels, guided by sparse depth; rather than as inpainting (sparse) depth, guided by an image.
Poster
Qiaomu Miao · Vivek Golani · Jingyi Xu · Progga Paromita Dutta · Minh Hoai · Dimitris Samaras
[ Exhibit Hall I ]
Abstract
This paper presents a method that utilizes multiple camera views for the gaze target estimation (GTE) task. The approach integrates information from different camera views to improve accuracy and expand applicability, addressing limitations in existing single-view methods that face challenges such as face occlusion, target ambiguity, and out-of-view targets. Our method processes a pair of camera views as input, incorporating a Head Information Aggregation (HIA) module for leveraging head information from both views for more accurate gaze estimation, an Uncertainty-based Gaze Selection (UGS) for identifying the most reliable gaze output, and an Epipolar-based Scene Attention (ESA) module for cross-view background information sharing. This approach significantly outperforms single-view baselines, especially when the second camera provides a clear view of the person's face. Additionally, our method can estimate the gaze target in the first view using the image of the person in the second view only, a capability not possessed by single-view GTE methods. The paper also introduces a multi-view dataset for developing and evaluating multi-view GTE. Data and code will be made available.
Poster
Kent Gauen · Stanley Chan
[ Exhibit Hall I ]
Abstract
This paper presents an efficient method to compute space-time superpixels and an application of the superpixels called superpixel convolution. The space-time superpixel method extends a single-image Bayesian method named BASS. Our approach, named Bayesian-inspired Space-Time Superpixels (BIST), is inspired by hill-climbing to a local mode of a Dirichlet-Process Gaussian Mixture Model conditioned on the previous frame's superpixel information. The method is only Bayesian-inspired, rather than actually Bayesian, because the split/merge steps are treated as a classification problem rather than derived from a Gibbs sampling update step. However, this heuristic reduces the number of split/merge steps from several hundred per frame to only a few. BIST is over twice as fast as BASS and over 10 times faster than other space-time superpixel methods with favorable (and sometimes superior) quality. Additionally, to garner interest in superpixels, this paper demonstrates their use within deep neural networks. We present a superpixel-weighted convolution layer for single-image denoising that outperforms standard convolution by 1 dB PSNR.
Poster
Wen Jiang · BOSHU LEI · Katrina Ashton · Kostas Daniilidis
[ Exhibit Hall I ]
Abstract
We present an active mapping system that can plan for long-horizon exploration goals and short-term actions with a 3D Gaussian Splatting (3DGS) representation. Existing methods either do not take advantage of recent developments in multimodal Large Language Models (LLMs) or do not consider localization uncertainty, which is critical for embodied agents. We propose employing multimodal LLMs for long-horizon planning in conjunction with detailed motion planning using our information-based algorithm. By leveraging high-quality view synthesis from our 3DGS representation, our method employs a multimodal LLM as a zero-shot planner for long-horizon exploration goals from the semantic perspective. We also introduce an uncertainty-aware path proposal and selection algorithm that balances the dual objectives of maximizing the information gain for the environment while minimizing the cost of localization errors. Experiments conducted on the Gibson and Habitat-Matterport 3D datasets demonstrate state-of-the-art results of the proposed method.
Poster
SaiKiran Tedla · Junyong Lee · Beixuan Yang · Mahmoud Afifi · Michael Brown
[ Exhibit Hall I ]
Abstract
Multispectral (MS) images capture detailed scene information across a wide range of spectral bands, making them invaluable for applications requiring rich spectral data. Integrating MS imaging into multi-camera devices, such as smartphones, has the potential to enhance both spectral applications and RGB image quality. A critical step in processing MS data is demosaicing, which reconstructs color information from the mosaic MS images captured by the camera. This paper proposes a method for MS image demosaicing specifically designed for dual-camera setups where both RGB and MS cameras capture the same scene. Our approach leverages co-captured RGB images, which typically have higher spatial fidelity, to guide the demosaicing of lower-fidelity MS images. We introduce the Dual-camera RGB-MS Dataset -- a large collection of paired RGB and MS mosaiced images with ground-truth demosaiced outputs -- that enables training and evaluation of our method. Experimental results demonstrate that our method achieves state-of-the-art accuracy compared to existing techniques.
Poster
Yue-Jiang Dong · Wang Zhao · Jiale Xu · Ying Shan · Song-Hai Zhang
[ Exhibit Hall I ]
Abstract
Diffusion-based video depth estimation methods have achieved remarkable success with strong generalization ability. However, predicting depth for long videos remains challenging. Existing methods typically split videos into overlapping sliding windows, leading to accumulated scale discrepancies across different windows, particularly as the number of windows increases. Additionally, these methods rely solely on 2D diffusion priors, overlooking the inherent 3D geometric structure of video depths, which results in geometrically inconsistent predictions. In this paper, we propose DepthSync, a novel, training-free framework using diffusion guidance to achieve scale- and geometry-consistent depth predictions for long videos. Specifically, we introduce scale guidance to synchronize the depth scale across windows and geometry guidance to enforce geometric alignment within windows based on the inherent 3D constraints in video depths. These two terms work synergistically, steering the denoising process toward consistent depth predictions. Experiments on various datasets validate the effectiveness of our method in producing depth estimates with improved scale and geometry consistency, particularly for long videos.
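To make the scale-discrepancy problem concrete, here is a minimal sketch of the conventional post-hoc alternative: rescaling each window with a least-squares factor estimated on the frames it shares with the previous window. This is not DepthSync's guidance mechanism, which acts during denoising. The function names and toy depths are assumptions, and each frame is reduced to a single scalar depth for brevity.

```python
def overlap_scale(prev_overlap, next_overlap):
    """Least-squares scale s minimizing sum((s * next - prev) ** 2) over the overlap."""
    num = sum(p * n for p, n in zip(prev_overlap, next_overlap))
    den = sum(n * n for n in next_overlap)
    if den == 0:
        raise ValueError("overlap depths are all zero; scale is undefined")
    return num / den


def align_windows(windows, overlap):
    """Chain per-window scales so every window matches the first window's scale.

    windows: list of windows, each a list of per-frame depth values
             (one scalar per frame stands in for a full depth map).
    overlap: number of frames shared by consecutive windows.
    """
    aligned = [list(windows[0])]
    for window in windows[1:]:
        # Estimate the scale against the already-aligned tail of the previous window.
        prev_tail = aligned[-1][-overlap:]
        s = overlap_scale(prev_tail, window[:overlap])
        aligned.append([s * d for d in window])
    return aligned


# Toy example: the second window predicts the shared frames at half the scale.
windows = [[2.0, 4.0, 6.0, 8.0], [3.0, 4.0, 5.0, 6.0]]
aligned = align_windows(windows, overlap=2)
```

Each scale in this chain is estimated against a window that was itself rescaled, so estimation errors compound as the number of windows grows.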
Poster
Takehiko Ohkawa · Jihyun Lee · Shunsuke Saito · Jason Saragih · Fabian Prada · Yichen Xu · Shoou-I Yu · Ryosuke Furuta · Yoichi Sato · Takaaki Shiratori
[ Exhibit Hall I ]
Abstract
One can hardly model self-contact of human poses without considering underlying body shapes. For example, the pose of rubbing a belly for a person with a low BMI leads to penetration of the hand into the belly for a person with a high BMI. Despite its importance, existing self-contact datasets lack the variety of self-contact poses and precise body shapes, limiting conclusive analysis between self-contact and shapes. To address this, we begin by introducing the first extensive self-contact dataset with precise body shape registration, Goliath-SC, consisting of 383K self-contact poses across 130 subjects. Using this dataset, we propose generative modeling of a self-contact prior conditioned by body shape parameters, based on a body-part-wise latent diffusion with self-attention. We further incorporate this prior into single-view human pose estimation, while refining estimated poses to be in contact. Our experiments suggest that shape conditioning is vital to the successful modeling of self-contact pose distribution, hence improving pose estimation in self-contact from a single image.
Poster
Haihao Zhang · Yunjian Zhang · Jianing Li · Lin Zhu · Meng Lv · Yao Zhu · Yanwei Liu · Xiangyang Ji
[ Exhibit Hall I ]
Abstract
Accurate stereo matching under fast motion and extreme lighting conditions is a challenge for many vision applications. Event cameras have the advantages of low latency and high dynamic range, thus providing a reliable solution to this challenge. However, since events are sparse, obtaining dense disparity from events alone is an ill-posed problem. In this work, we propose a novel framework for event-based dense stereo via cross-sensor knowledge distillation. Specifically, a multi-level intensity-to-event distillation strategy is designed to maximize the potential of long-range information, local texture details, and task-related knowledge of the intensity images. Simultaneously, to enforce cross-view consistency, an intensity-event joint left-right consistency module is proposed. With our framework, the extensive dense and structural information contained in intensity images is distilled to the event branch, so that the event branch alone can predict dense disparities during inference, preserving the low-latency characteristics of events. Extensive experiments conducted on the MVSEC and DSEC datasets demonstrate that our method exhibits stereo matching performance superior to baselines, both quantitatively and qualitatively.
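For readers unfamiliar with left-right consistency, the classical check on a single scanline of a rectified stereo pair can be sketched as follows. It is a generic illustration, not the paper's intensity-event joint module; the function name, the threshold `tau`, and the toy disparities are assumptions.

```python
def left_right_consistent(disp_left, disp_right, tau=0.5):
    """Return a per-pixel consistency mask for one scanline.

    A left pixel x with disparity d matches right pixel x - d. The pixel is
    consistent when the right view's disparity at that location agrees with d
    to within tau.
    """
    width = len(disp_left)
    mask = []
    for x, d in enumerate(disp_left):
        xr = int(round(x - d))
        if 0 <= xr < width:
            mask.append(abs(d - disp_right[xr]) <= tau)
        else:
            mask.append(False)  # the match falls outside the right image
    return mask


disp_left = [5.0, 1.0, 1.0, 3.0]
disp_right = [1.0, 1.0, 0.0, 0.0]
mask = left_right_consistent(disp_left, disp_right)
```

In this toy scanline the first pixel fails because its match lies outside the image, and the last fails because the two views disagree by two pixels.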
Poster
Jie Zhu · Sungkil Lee
[ Exhibit Hall I ]
Abstract
Flare and glare are common nighttime artifacts that degrade image quality and hinder computer vision tasks. Existing synthetic datasets lack physical realism and diversity, while deep learning-based removal methods struggle in complex scenes, posing significant challenges. To address these issues, we introduce the high-quality annotated Physically-Based Flare and Glare (PBFG) dataset and a Flare and Glare Removal Network (FGRNet). PBFG comprises 2,600 flares and 4,000 glares generated with our computational rendering scheme using diverse lens systems and optical configurations. Our advanced streak synthesis enhances template fidelity and improves streak removal accuracy. FGRNet leverages spatial-frequency features for comprehensive local and global feature extraction. It introduces a Spatial-Frequency Enhanced Module with a Spatial Reconstruction Unit and a Frequency-Enhanced Unit to extract multi-scale spatial information and enhance frequency representation. This design effectively removes complex artifacts, including large-area glares, diverse flares, and multiple or off-screen-induced streaks. Additionally, a histogram-matching module ensures stylistic and visual consistency with ground truth. Extensive experiments confirm that PBFG accurately replicates real-world patterns, and FGRNet outperforms state-of-the-art methods both quantitatively and qualitatively, resulting in significant PSNR gains (up to 2.3 dB and 3.14 dB in an image and its glare regions, respectively).
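Since the gains above are reported in PSNR, both over a whole image and over its glare regions, a short reference implementation of the metric may help. This is a sketch under assumptions: images are flat lists of pixel values in `[0, max_value]`, and the optional `mask` argument restricting the computation to a region is a hypothetical convenience, not the paper's evaluation code.

```python
import math


def psnr(reference, estimate, mask=None, max_value=255.0):
    """Peak signal-to-noise ratio in dB, optionally restricted to masked pixels."""
    if len(reference) != len(estimate):
        raise ValueError("reference and estimate must have the same length")
    # Keep only the pixels selected by the mask (all pixels when mask is None).
    pairs = [
        (r, e)
        for i, (r, e) in enumerate(zip(reference, estimate))
        if mask is None or mask[i]
    ]
    if not pairs:
        raise ValueError("no pixels selected")
    mse = sum((r - e) ** 2 for r, e in pairs) / len(pairs)
    if mse == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(max_value ** 2 / mse)


worst_case = psnr([0, 0], [255, 255])
region_only = psnr([0, 100], [255, 100], mask=[False, True])
```

Because PSNR is logarithmic in the mean squared error, each 1 dB of gain corresponds to the error shrinking by a factor of about 1.26.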
Poster
Shaohan Li · Hao Yang · Min Chen · Xiaolin Qin
[ Exhibit Hall I ]
Abstract
The increasing frequency of extreme weather events due to global climate change calls for accurate weather prediction. Recently, great advances have been made by end-to-end methods, thanks to deep learning techniques, but they suffer from representation inconsistency in multivariable integration and struggle to effectively capture the dependency between variables, which is required in complex weather systems. Treating different variables as distinct modalities and applying a two-stage training approach from multimodal models can partially alleviate this issue, but because the training tasks of the two stages are inconsistent, the results are often suboptimal. To address these challenges, we propose an implicit two-stage training method, configuring separate encoders and decoders for each variable. In detail, in the first stage, the Translator is frozen while the Encoders and Decoders learn a shared latent space; in the second stage, the Encoders and Decoders are frozen, and the Translator captures inter-variable interactions for prediction. In addition, introducing a self-attention mechanism for multivariable fusion in the latent space further improves performance. Empirically, extensive experiments show state-of-the-art performance of our method. Specifically, it reduces the MSE for near-surface air temperature and relative humidity predictions by 28.82% and 23.39%, respectively.
Poster
Yifan Jiao · Yunhao Li · Junhua Ding · Qing Yang · Song Fu · Heng Fan · Libo Zhang
[ Exhibit Hall I ]
Abstract
In this paper, we present a novel benchmark, GSOT3D, that aims at facilitating development of generic 3D single object tracking (SOT) in the wild. Specifically, GSOT3D offers 620 sequences with 123K frames, and covers a wide selection of 54 object categories. Each sequence is offered with multiple modalities, including the point cloud (PC), RGB image, and depth. This allows GSOT3D to support various 3D tracking tasks, such as single-modal 3D SOT on PC and multi-modal 3D SOT on RGB-PC or RGB-D, and thus greatly broadens research directions for 3D object tracking. To provide high-quality per-frame 3D annotations, all sequences are labeled manually with multiple rounds of meticulous inspection and refinement. To the best of our knowledge, GSOT3D is the largest benchmark dedicated to various generic 3D object tracking tasks. To understand how existing 3D trackers perform and to provide comparisons for future research on GSOT3D, we assess eight representative point cloud-based tracking models. Our evaluation results show that these models degrade heavily on GSOT3D, and that more effort is required for robust and generic 3D object tracking. In addition, to encourage future research, we present a simple yet effective generic 3D tracker, named PROT3D, that localizes the target object via a progressive spatial-temporal network …
Poster
Tongtong Cheng · Rongzhen Li · Yixin Xiong · Tao Zhang · Jing Wang · Kai Liu
[ Exhibit Hall I ]
Abstract
Accurate driving behavior recognition and reasoning are critical for autonomous driving video understanding. However, existing methods often capture only shallow causal relationships, fail to address spurious correlations across modalities, and ignore causality modeling at the ego-vehicle level. To overcome these limitations, we propose a novel Multimodal Causal Analysis Model (MCAM) that constructs latent causal structures between visual and language modalities. Firstly, we design a multi-level feature extractor to capture long-range dependencies. Secondly, we design a causal analysis module that dynamically models driving scenarios using a directed acyclic graph (DAG) of driving states. Thirdly, we utilize a vision-language transformer to align critical visual features with their corresponding linguistic expressions. Extensive experiments on the BDD-X and CoVLA datasets demonstrate that MCAM achieves SOTA performance in visual-language causal relationship learning. Furthermore, the model exhibits superior capability in capturing causal characteristics within video sequences, showcasing its effectiveness for autonomous driving applications.
Poster
Xuejian Gou · Fang Liu · Licheng Jiao · Shuo Li · Lingling Li · Hao Wang · Xu Liu · Puhua Chen · wenping ma
[ Exhibit Hall I ]
Abstract
In real-world scenarios, objects and their parts inherently possess both coarse-grained differences and intricate fine-grained structural relationships. These characteristics can be formalized as knowledge, leveraged for fine-grained part comprehension. However, existing part segmentation models consistently fail to capture these complex inter-part relationships, treating parts as independent entities and disregarding object-level distinctions. To address these limitations, we propose a novel Knowledge-Guided Part Segmentation (KPS) framework. Our approach automatically extracts structural relationships between parts using a large language model (LLM) and integrates them into a knowledge graph. Subsequently, a structural knowledge guidance module employs a graph convolutional network (GCN) to model these relationships. Furthermore, a coarse-grained object guidance module captures object-specific distinctions and integrates them as visual guidance. The integrated insights from the part structure and object differentiation guide the fine-grained part segmentation. Our KPS achieves notable improvements in segmentation performance, with a 4.96% mIoU gain on PartImageNet and a 3.73% gain on Pascal-Part. Moreover, in the open-vocabulary setting on Pascal-Part-116, it improves hIoU by 3.25%, highlighting the effectiveness of knowledge guidance in enhancing fine-grained part segmentation.
Poster
Hayeon Kim · Ji Ha Jang · Se Young Chun
[ Exhibit Hall I ]
Abstract
Due to limited 3D data, recent prior work in 3D editing relies mainly on the Score Distillation Sampling (SDS) loss, which edits and segments in 2D rendered views using pre-trained diffusion priors and then projects back onto 3D space to update the model. While these approaches are effective for 3D instance-level editing, they struggle with 3D part-level editing, especially for Gaussian splatting, due to inconsistent multi-view 2D part segmentations and the inherent ambiguity of the SDS loss combined with the localized nature of Gaussians. To address these limitations, we propose RoMaP, a novel local 3D Gaussian editing framework that enables drastic part-level changes. Firstly, we propose 3D-geometry aware label prediction (3D-GALP) exploiting the uncertainty in soft-label segmentations. Secondly, we propose a regularized SDS loss with masks that consists of a usual SDS loss with the predicted 3D mask and an L1 regularizer as an anchor loss for high-quality part-edited 2D images using our proposed scheduled latent mixing and part editing (SLaMP) method. Our SDS loss improves flexibility in local editing by removing 3D masked regions, allowing changes beyond existing context. SLaMP uses the projected 2D mask of the predicted 3D mask to confine modifications to the target region while preserving contextual coherence. Experimental results demonstrate that …
Poster
Siqi Zhang · Yanyuan Qiao · Qunbo Wang · Zike Yan · Qi Wu · Zhihua Wei · Jing Liu
[ Exhibit Hall I ]
Abstract
Vision-and-Language Navigation (VLN) tasks have gained prominence within artificial intelligence research due to their potential application in fields like home assistants. Many contemporary VLN approaches, while based on transformer architectures, have increasingly incorporated additional components such as external knowledge bases or map information to enhance performance. These additions, while boosting performance, also lead to larger models and increased computational costs. In this paper, to achieve both high performance and low computational costs, we propose a novel architecture with the **co**mbination of **s**elective **m**em**o**rization (COSMO). Specifically, COSMO integrates state-space modules and transformer modules, and incorporates two VLN-customized selective state space modules: the Round Selective Scan (RSS) and the Cross-modal Selective State Space Module (CS3). RSS facilitates comprehensive inter-modal interactions within a single scan, while the CS3 module adapts the selective state space module into a dual-stream architecture, thereby enhancing the acquisition of cross-modal interactions. Experimental validations on three mainstream VLN benchmarks, REVERIE, R2R, and R2R-CE, not only demonstrate competitive navigation performance of our model but also show a significant reduction in computational costs.
Poster
Ilya A. Petrov · Riccardo Marin · Julian Chibane · Gerard Pons-Moll
[ Exhibit Hall I ]
Abstract
Modeling 3D human-object interaction (HOI) is a problem of great interest for computer vision and a key enabler for virtual and mixed-reality applications. Existing methods work in a one-way direction: some recover plausible human interactions conditioned on a 3D object; others recover the object pose conditioned on a human pose. Instead, we provide the first unified model, TriDi, which works in any direction. Concretely, we generate Human, Object, and Interaction modalities simultaneously with a new three-way diffusion process, allowing us to model seven distributions with one network. We implement TriDi as a transformer attending to the various modalities' tokens, thereby discovering conditional relations between them. The user can control the interaction either as a text description of HOI or a contact map. We embed these two representations onto a shared latent space, combining the practicality of text descriptions with the expressiveness of contact maps. Using a single network, TriDi unifies all the special cases of prior work and extends to new ones, modeling a family of seven distributions. Remarkably, despite using a single model, TriDi-generated samples surpass one-way specialized baselines on GRAB and BEHAVE in terms of both qualitative and quantitative metrics, and demonstrate better diversity. We show applicability …
Poster
Xuan Yao · Junyu Gao · Changsheng Xu
[ Exhibit Hall I ]
Abstract
Vision-and-Language Navigation in Continuous Environments (VLN-CE) requires agents to execute sequential navigation actions in complex environments guided by natural language instructions. Current approaches often struggle with generalizing to novel environments and adapting to ongoing changes during navigation. Inspired by human cognition, we present NavMorph, a self-evolving world model framework that enhances environmental understanding and decision-making in VLN-CE tasks. NavMorph employs compact latent representations to model environmental dynamics, equipping agents with foresight for adaptive planning and policy refinement. By integrating a novel Contextual Evolution Memory, NavMorph leverages scene-contextual information to support effective navigation while maintaining online adaptability. Extensive experiments demonstrate that our method achieves notable performance improvements on popular VLN-CE benchmarks. Code is available in the Supplementary Material.
Poster
Shani Gamrian · Hila Barel · Feiran Li · Masakazu Yoshimura · Daisuke Iso
[ Exhibit Hall I ]
Abstract
Object detection models are typically applied to standard RGB images processed through Image Signal Processing (ISP) pipelines, which are designed to enhance sensor-captured RAW images for human vision. However, these ISP functions can lead to a loss of critical information that may be essential in optimizing for computer vision tasks, such as object detection. In this work, we introduce Raw Adaptation Module (RAM), a module designed to replace the traditional ISP, with parameters optimized specifically for RAW object detection. Inspired by the parallel processing mechanisms of the human visual system, RAM departs from existing learned ISP methods by applying multiple ISP functions in parallel rather than sequentially, allowing for a more comprehensive capture of image features. These processed representations are then fused in a specialized module, which dynamically integrates and optimizes the information for the target task. This novel approach not only leverages the full potential of RAW sensor data but also enables task-specific pre-processing, resulting in superior object detection performance. Our approach outperforms RGB-based methods and achieves state-of-the-art results across diverse RAW image datasets under varying lighting conditions and dynamic ranges.
Poster
Xiao Chen · Tai Wang · Quanyi Li · Tao Huang · Jiangmiao Pang · Tianfan Xue
[ Exhibit Hall I ]
Abstract
Generalizable active mapping in complex unknown environments remains a critical challenge for mobile robots. Existing methods, constrained by limited training data and conservative exploration strategies, struggle to generalize across scenes with diverse layouts and complex connectivity. To enable scalable training and reliable evaluation, we present GLEAM-Bench, the first large-scale benchmark with 1,152 diverse 3D scenes from synthetic and real datasets. In this work, we propose GLEAM, a generalizable exploration policy for active mapping. Its superior generalizability comes from our semantic representations, long-term goal, and randomized strategies. It significantly outperforms state-of-the-art methods, achieving 68.16% coverage (+11.41%) with efficient trajectories, and improved mapping accuracy on 128 unseen complex scenes.
Poster
Gencer Sumbul · Chang Xu · Emanuele Dalsasso · Devis Tuia
[ Exhibit Hall I ]
Abstract
From optical sensors to microwave radars, leveraging the complementary strengths of remote sensing (RS) sensors is of great importance for achieving dense spatio-temporal monitoring of our planet. In contrast, recent deep learning models—task-specific or foundational—are often specific to single sensors or to fixed combinations: adapting such models to different sensory inputs requires both architectural changes and re-training, limiting scalability and generalization across multiple RS sensors. On the contrary, a single model able to modulate its feature representations to accept diverse sensors as input would pave the way to agile and flexible multi-sensor RS data processing. To address this, we introduce SA-MAE, a generic and versatile foundation model lifting sensor-specific/dependent efforts and enabling scalability and generalization to diverse RS sensors: SA-MAE projects data from heterogeneous sensors into a shared spectrum-aware space, enabling the usage of arbitrary combinations of bands—a key discriminative property for RS—both for training and inference. To obtain sensor-agnostic representations, we train a single, unified transformer model reconstructing masked multi-sensor data with cross-sensor token mixup. On both single- and multi-modal tasks across diverse sensors, SA-MAE outperforms previous models that rely on sensor-specific pretraining.
Poster
Björn Braun · Rayan Armani · Manuel Meier · Max Moebus · Christian Holz
[ Exhibit Hall I ]
Abstract
Egocentric vision systems aim to understand the spatial surroundings and the wearer's behavior within them, including motions, activities, and interaction with objects. Meta's Project Aria 2 recently added a heart rate (HR) contact sensor to additionally capture the wearer's cardiac activity, which can impact the person's attention and situational responses. In this paper, we propose egoPPG, a novel non-contact-based method to recover cardiac activity from the eye-tracking cameras in previous egocentric vision systems. Our method continuously estimates the person's photoplethysmogram (PPG) from areas around the eyes and fuses motion cues from the headset's inertial measurement unit to track HR values. We demonstrate egoPPG's downstream benefit for existing egocentric datasets on EgoExo4D, where we find that augmenting existing models with tracked HR values improves proficiency estimation by 14%. To train and validate egoPPG, we collected a dataset of 13+ hours of eye-tracking videos from Project Aria and contact-based blood volume pulse signals as well as an electrocardiogram (ECG) for ground-truth HR values. Twenty-five participants performed diverse everyday activities such as office work, cooking, dancing, and exercising, which induced significant natural motion and HR variation (44 to 164 bpm). Our model robustly estimates HR (MAE=7.67 bpm) and captures patterns (r=0.85). Our results …
Poster
Fu-Jen Tsai · Yan-Tsung Peng · Yen-Yu Lin · Chia-Wen Lin
[ Exhibit Hall I ]
Abstract
Image dehazing aims to remove unwanted hazy artifacts in images. Although previous research has collected paired real-world hazy and haze-free images to improve dehazing models' performance in real-world scenarios, these models often experience significant performance drops when handling unseen real-world hazy images due to limited training data. This issue motivates us to develop a flexible domain adaptation method to enhance dehazing performance during testing. Observing that predicting haze patterns is generally easier than recovering clean content, we propose the Physics-guided Haze Transfer Network (PHATNet) which transfers haze patterns from unseen target domains to source-domain haze-free images, creating domain-specific fine-tuning sets to update dehazing models for effective domain adaptation. Additionally, we introduce a Haze-Transfer-Consistency loss and a Content-Leakage Loss to enhance PHATNet's disentanglement ability. Experimental results demonstrate that PHATNet significantly boosts state-of-the-art dehazing models on benchmark real-world image dehazing datasets.
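As a rough illustration of what "transferring haze" onto a haze-free image can mean, the sketch below applies the standard atmospheric scattering model, I = J·t + A·(1 − t), per pixel. This is only an assumption about the kind of physics involved: the abstract does not state which model PHATNet uses, and the function name `transfer_haze`, the grayscale nested-list image format, and the scalar airlight are all hypothetical choices made for the example.

```python
def transfer_haze(clean, transmission, airlight):
    """Synthesize a hazy image with I = J * t + A * (1 - t).

    clean        -- haze-free grayscale image J, nested lists of floats in [0, 1]
    transmission -- per-pixel transmission map t, same shape, values in [0, 1]
    airlight     -- global atmospheric light A (scalar; an assumption here)
    """
    return [
        [j * t + airlight * (1.0 - t) for j, t in zip(row_j, row_t)]
        for row_j, row_t in zip(clean, transmission)
    ]


# Hypothetical usage: a transmission map taken from a target-domain hazy
# image is applied to a source-domain clean image.
clean_image = [[0.2, 0.8], [0.5, 0.5]]
target_transmission = [[1.0, 0.0], [0.5, 0.25]]
hazy_image = transfer_haze(clean_image, target_transmission, airlight=0.9)
```

Where the transmission is 1 the pixel is unchanged, and where it is 0 the pixel is replaced by the airlight, so the haze pattern is carried entirely by the transmission map and the airlight.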
Poster
Sitao Zhang · Hongda Mao · Qingshuang Chen · Yelin Kim
[ Exhibit Hall I ]
Abstract
Visual place recognition is crucial for autonomous navigation and robotic mapping. Current methods struggle with perceptual aliasing and computational inefficiency. We present SemVPR, a novel approach integrating multimodal semantic knowledge into VPR. By leveraging a pre-trained vision-language model as a teacher during the training phase, SemVPR learns local visual and semantic descriptors simultaneously, effectively mitigating perceptual aliasing through semantic-aware aggregation without extra inference cost. The proposed nested descriptor learning strategy generates a series of ultra-compact global descriptors, reduced by approximately compared to state-of-the-art methods, in a coarse-to-fine manner, eliminating the need for offline dimensionality reduction or training multiple models. Extensive experiments across various VPR benchmarks demonstrate that SemVPR consistently outperforms state-of-the-art methods with significantly lower computational costs, making it feasible for latency-sensitive scenarios in real-world applications.
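One way to picture nested, coarse-to-fine descriptors is sketched below. It assumes that "nested" means each shorter descriptor is a prefix of the longer one, which the abstract does not confirm; the helper names (`nested_descriptors`, `coarse_to_fine_search`) and the two-stage shortlist-then-rerank retrieval are hypothetical and are not SemVPR's actual procedure.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def nested_descriptors(full, dims):
    """Return {d: unit-norm prefix of length d} for every d in dims."""
    return {d: l2_normalize(full[:d]) for d in dims}


def cosine(a, b):
    """Cosine similarity of two unit-norm vectors."""
    return sum(x * y for x, y in zip(a, b))


def coarse_to_fine_search(query, database, coarse_dim, fine_dim, shortlist=2):
    """Shortlist with the short prefix, then re-rank with the longer one."""
    dims = (coarse_dim, fine_dim)
    q = nested_descriptors(query, dims)
    db = [nested_descriptors(d, dims) for d in database]
    ranked = sorted(
        range(len(db)),
        key=lambda i: cosine(q[coarse_dim], db[i][coarse_dim]),
        reverse=True,
    )
    candidates = ranked[:shortlist]
    return max(candidates, key=lambda i: cosine(q[fine_dim], db[i][fine_dim]))
```

Under this assumption a single stored vector serves every descriptor size, so no separate dimensionality-reduction step or per-size model is needed at query time.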
Poster
CHANGHEE YANG · Hyeonseop Song · Seokhun Choi · Seungwoo Lee · Jaechul Kim · Hoseok Do
[ Exhibit Hall I ]
Abstract
Despite considerable efforts to enhance the generalization of 3D pose estimators without costly 3D annotations, existing data augmentation methods struggle in real-world scenarios with diverse human appearances and complex poses. We propose PoseSyn, a novel data synthesis framework that transforms abundant in-the-wild 2D pose datasets into diverse 3D pose–image pairs. PoseSyn comprises two key components: the Error Extraction Module (EEM), which identifies challenging poses from the 2D pose datasets, and the Motion Synthesis Module (MSM), which synthesizes motion sequences around the challenging poses. Then, by generating realistic 3D training data (aligned with challenging poses and appearances) via a human animation model, PoseSyn boosts the accuracy of various 3D pose estimators by up to 14% across real-world benchmarks including various backgrounds and occlusions, challenging poses, and multi-view scenarios. Extensive experiments further confirm that PoseSyn is a scalable and effective approach for improving generalization without relying on expensive 3D annotations, regardless of the pose estimator's model size or design.
Poster
Yun Li · Yiming Zhang · Tao Lin · Xiangrui Liu · Wenxiao Cai · Zheng Liu · Bo Zhao
[ Exhibit Hall I ]
Abstract
The use of Multimodal Large Language Models (MLLMs) as an end-to-end solution for Embodied AI and Autonomous Driving has become a prevailing trend. While MLLMs have been extensively studied for visual semantic understanding tasks, their ability to perform precise and quantitative spatial-temporal understanding in real-world applications remains largely unexamined, leading to uncertain prospects. To address this gap, we introduce ST-Bench, a benchmark designed to evaluate MLLMs' spatial-temporal understanding through challenging tasks such as estimating and predicting the appearance, pose, displacement, and motion of objects. Our benchmark encompasses a wide range of robot and vehicle operations across desktop, indoor, and outdoor scenarios. Extensive experiments reveal that state-of-the-art MLLMs still struggle with real-world spatial-temporal understanding, especially in tasks requiring precise distance estimation and motion analysis.
Poster
Anna-Maria Halacheva · Yang Miao · Jan-Nico Zaech · Xi Wang · Luc Gool · Danda Pani Paudel
[ Exhibit Hall I ]
Abstract
3D scene understanding is a long-standing challenge in computer vision and a key component in enabling mixed reality, wearable computing, and embodied AI. Providing a solution to these applications requires a multifaceted approach that covers scene-centric, object-centric, as well as interaction-centric capabilities. While there exist numerous datasets approaching the former two problems, the task of understanding interactable and articulated objects is underrepresented and only partly covered in the research field. In this work, we address this shortcoming by introducing: (1) Articulate3D, an expertly curated 3D dataset featuring high-quality manual annotations on 280 indoor scenes. Articulate3D provides 8 types of annotations for articulated objects, covering parts and detailed motion information, all stored in a standardized scene representation format designed for scalable 3D content creation, exchange and seamless integration into simulation environments. (2) USDNet, a novel unified framework capable of simultaneously predicting part segmentation along with a full specification of motion attributes for articulated objects. We evaluate USDNet on Articulate3D as well as two existing datasets, demonstrating the advantage of our unified dense prediction approach. Furthermore, we highlight the value of Articulate3D through cross-dataset and cross-domain evaluations and showcase its applicability in downstream tasks such as scene editing through LLM prompting and robotic …
Poster
Khurram Azeem Hashmi · Karthik Suresh · Didier Stricker · Muhammad Zeshan Afzal
[ Exhibit Hall I ]
Abstract
Low-light conditions significantly degrade the performance of high-level vision tasks. Existing approaches either enhance low-light images without considering normal illumination scenarios, leading to poor generalization, or are tailored to specific tasks. We propose **TorchAdapt**, a real-time adaptive feature enhancement framework that generalizes robustly across varying illumination conditions without degrading performance in well-lit scenarios. TorchAdapt consists of two complementary modules: the **Torch** module enhances semantic features beneficial for downstream tasks, while the **Adapt** module dynamically modulates these enhancements based on input content. Leveraging a novel light-agnostic learning strategy, TorchAdapt aligns feature representations of enhanced and well-lit images to produce powerful illumination-invariant features. Extensive experiments on multiple high-level vision tasks, including object detection, face detection, instance segmentation, semantic segmentation, and video object detection, demonstrate that TorchAdapt consistently outperforms state-of-the-art low-light enhancement and task-specific methods in both low-light and light-agnostic settings. TorchAdapt thus provides a unified, flexible solution for robust visual perception across diverse lighting conditions. Code and models are provided as supplementary material.
Poster
Christopher Xie · Armen Avetisyan · Henry Howard-Jenkins · Yawar Siddiqui · Julian Straub · Richard Newcombe · Vasileios Balntas · Jakob Engel
[ Exhibit Hall I ]
Abstract
We present a novel human-in-the-loop approach to estimate 3D scene layout that uses human feedback from an egocentric standpoint. We study this approach through the introduction of a novel local correction task, where users identify local errors and prompt a model to automatically correct them. Building on SceneScript, a state-of-the-art framework for 3D scene layout estimation that leverages structured language, we propose a solution that structures this problem as "infilling", a task studied in natural language processing. We train a multi-task version of SceneScript that maintains performance on global predictions while significantly improving its local correction ability. We integrate this into a human-in-the-loop system, enabling a user to iteratively refine scene layout estimates via a low-friction "one-click fix" workflow. Our system enables the final refined layout to diverge from the training distribution, allowing for more accurate modelling of complex layouts.
Poster
Matteo Poggi · Fabio Tosi
[ Exhibit Hall I ]
Abstract
We present FlowSeek, a novel framework for optical flow requiring minimal hardware resources for training. FlowSeek marries the latest advances in the design space of optical flow networks with cutting-edge single-image depth foundation models and classical low-dimensional motion parametrization, implementing a compact yet accurate architecture. FlowSeek is trained on a single consumer-grade GPU, a hardware budget about 8× lower than that of most recent methods, and still achieves the best cross-dataset generalization on Sintel Final and KITTI, with relative improvements of 10% and 15% over the previous state-of-the-art, as well as on the Spring and LayeredFlow datasets, representing a solid step towards more responsible hardware use.
Poster
Songyan Zhang · Yongtao Ge · Jinyuan Tian · Guangkai Xu · Hao Chen · Chen Lv · Chunhua Shen
[ Exhibit Hall I ]
Abstract
3D reconstruction in dynamic scenes primarily relies on the combination of geometry estimation and matching modules, where the latter is pivotal for distinguishing dynamic regions, which helps to mitigate the interference introduced by moving objects. Furthermore, the matching module explicitly models object motion, enabling the tracking of specific targets and advancing motion understanding in complex scenarios. The recently proposed pointmap representation in DUSt3R suggests a potential solution to unify both geometry estimation and matching in 3D space, but it still struggles with ambiguous matching in dynamic regions, which may hamper further improvement. In this work, we present POMATO, a unified framework for dynamic 3D reconstruction by marrying pointmap matching with temporal motion. Specifically, our method first learns an explicit matching relationship by mapping corresponding RGB pixels across different views to 3D pointmaps within a shared coordinate system. Furthermore, we introduce a temporal motion module for dynamic motions that ensures scale consistency across different frames and enhances performance in tasks requiring both precise geometry and reliable matching, most notably 3D point tracking. We show the effectiveness of the proposed pointmap matching and temporal fusion paradigm by demonstrating remarkable performance across multiple downstream tasks, including video depth estimation, …
Poster
Yongsheng Yuan · Jie Zhao · Dong Wang · Huchuan Lu
[ Exhibit Hall I ]
Abstract
Modern visual trackers have achieved robust performance with precisely initialized target bounding boxes. However, providing high-precision initial annotations is both labor-intensive and error-prone in real-world scenarios. Interactive initialization (e.g., click-based, scribble-based) presents a more practical alternative. In this paper, we introduce a unified Click-and-Track (CAT) framework for full-process tracking, eliminating the need for auxiliary models or complex initializing pipelines. We present a novel fine-tuning paradigm that bridges the information gap inherent in click-based initialization through two key innovations: 1) The proposed click-based location and joint spatial-visual prompt refinement are sequentially performed to remedy the geometric information loss (e.g., boundary ambiguity, shape uncertainty) inherent in click-based initialization. 2) We design a parameter-efficient module called CTMoE to leverage the tracker's inherent capabilities when fine-tuning. The proposed CTMoE enables the foundation model to learn different matching patterns, unifying click-based initialization and tracking within a single architecture. Extensive experimental results demonstrate state-of-the-art performance of our click-based tracking method on the LaSOT benchmark (70.5% AUC) while maintaining parameter efficiency, surpassing existing click-based tracking frameworks by a large margin and even outperforming some bounding-box-initialized trackers.
Poster
Yapeng Meng · Yihan Lin · Taoyi Wang · Yuguo Chen · Lijian Wang · Rong Zhao
[ Exhibit Hall I ]
Abstract
Recording and reconstructing high-speed scenes poses a significant challenge. The high bandwidth of high-speed cameras makes continuous recording unsustainable, while frame interpolation methods using traditional RGB cameras (typically 30 fps) introduce artifacts and are affected by motion blur. Leveraging sensors inspired by the human visual system, such as event cameras, provides high-speed, sparse temporal or spatial variation data that alleviates the ill-conditioned problem of high-speed reconstruction with traditional RGB cameras. However, existing methods still suffer from RGB blur, temporal aliasing, and loss of event information. To overcome these challenges, we leverage a novel dual-pathway complementary vision sensor, which outputs high-speed, sparse spatio-temporal differential frames between two RGB frames as reconstruction conditions. Further, we propose a cascaded bi-directional recurrent diffusion model (CBRDM) that reconstructs accurate, sharp, color-rich video frames. Our method improves the LPIPS metric by 37.6% over state-of-the-art RGB interpolation algorithms and achieves superior performance in real-world comparisons with event cameras. Our code and dataset will be publicly available.
Poster
Sihang Li · Zeyu Jiang · Grace Chen · Chenyang Xu · Siqi Tan · Xue Wang · Irving Fang · Kristof Zyskowski · Shannon McPherron · Radu Iovita · Chen Feng · Jing Zhang
[ Exhibit Hall I ]
Abstract
3D reassembly is a challenging spatial intelligence task with broad applications across scientific domains. While large-scale synthetic datasets have fueled promising learning-based approaches, their generalizability to different domains is limited. Critically, it remains uncertain whether models trained on synthetic datasets can generalize to real-world fractures where breakage patterns are more complex. To bridge this gap, we propose GARF, a **g**eneralizable 3D re**a**ssembly framework for **r**eal-world **f**ractures. GARF leverages fracture-aware pretraining to learn fracture features from individual fragments, while flow matching enables precise 6-DoF alignments. At inference time, we introduce one-step preassembly, improving robustness to unseen objects and varying numbers of fractures. In collaboration with archaeologists, paleoanthropologists, and ornithologists, we curate a diverse dataset for the vision and learning communities, featuring real-world fracture types across ceramics, bones, eggshells, and lithics. Comprehensive experiments demonstrate that our approach consistently outperforms state-of-the-art methods on both synthetic and real-world datasets, achieving 82.87% lower rotation error and 25.15% higher part accuracy. This work sheds light on training on synthetic data to advance real-world 3D puzzle solving, showcasing its strong generalization across unseen object shapes and diverse fracture types.
Poster
Qingcheng Zhao · Xiang Zhang · Haiyang Xu · Zeyuan Chen · Jianwen Xie · Yuan Gao · Zhuowen Tu
[ Exhibit Hall I ]
Abstract
We propose DePR, a novel depth-guided single-view scene reconstruction framework that integrates instance-level diffusion priors. Our approach follows a compositional reconstruction paradigm, where individual objects are first generated before being arranged into a coherent scene. Unlike previous methods that solely use depth for object layout estimation during inference—thus underutilizing its rich geometric information—DePR leverages depth throughout both training and inference. Specifically, we introduce depth-guided conditioning to effectively encode shape priors into image-conditioned diffusion models. During inference, depth further aids in layout optimization and guided DDIM sampling, ensuring better alignment between reconstructed objects and the input image. Despite being trained on limited synthetic data, DePR achieves state-of-the-art performance and strong generalizability in single-view scene reconstruction, as demonstrated through evaluations on both synthetic and real-world datasets.
Poster
Yuedong Tan · Zongwei Wu · Yuqian Fu · Zhuyun Zhou · Guolei Sun · Eduard Zamfir · Chao Ma · Danda Pani Paudel · Luc Gool · Radu Timofte
[ Exhibit Hall I ]
Abstract
Multimodal sensing has proven valuable for visual tracking, as different sensor types offer unique strengths in handling specific challenging scenes where object appearance varies. While a generalist model capable of leveraging all modalities would be ideal, development is hindered by data sparsity: in practice, typically only one modality is available at a time. Therefore, it is crucial to ensure that knowledge gained from multimodal sensing, such as identifying relevant features and regions, is effectively shared, even when certain modalities are unavailable at inference. We start from a simple assumption: similar samples across different modalities have more knowledge to share than dissimilar ones. To implement this, we employ a classifier with a weak loss tasked with distinguishing between modalities. More specifically, if the classifier "fails" to accurately identify the modality of the given sample, this signals an opportunity for cross-modal knowledge sharing. Intuitively, knowledge transfer is facilitated whenever a sample from one modality is sufficiently close to and aligned with another. Technically, we achieve this by routing samples from one modality to the experts of the others, within a mixture-of-experts framework designed for multimodal video object tracking. During inference, the expert of the respective modality is chosen, which …
Poster
Emery Pierson · Lei Li · Angela Dai · Maks Ovsjanikov
[ Exhibit Hall I ]
Abstract
Deep functional maps have recently emerged as a powerful tool for solving non-rigid shape correspondence tasks. Methods that use this approach combine the power and flexibility of the functional map framework, with data-driven learning for improved accuracy and generality. However, most existing methods in this area restrict the learning aspect only to the feature functions and still rely on axiomatic modeling for formulating the training loss or for functional map regularization inside the networks. This limits both the accuracy and the applicability of the resulting approaches only to scenarios where assumptions of the axiomatic models hold. In this work, we show, for the first time, that both in-network regularization and functional map training can be replaced with data-driven methods. For this, we first train a generative model of functional maps in the spectral domain using score-based generative modeling, built from a large collection of high-quality maps. We then exploit the resulting model to promote the structural properties of ground truth functional maps on new shape collections. Remarkably, we demonstrate that the learned models are category-agnostic, and can fully replace commonly used strategies such as enforcing Laplacian commutativity or orthogonality of functional maps. Our key technical contribution is a novel distillation …
Poster
Seunghyun Lee · Tae-Kyun Kim
[ Exhibit Hall I ]
Abstract
Recent diffusion models have shown promising results in category-level 6D object pose estimation by modeling the conditional pose distribution with depth image input. Existing methods, however, suffer from slow convergence during training because they learn the encoder together with the diffusion denoising network in an end-to-end fashion, and they require an additional network that evaluates sampled pose hypotheses to filter out low-quality pose candidates. In this paper, we propose a novel pipeline that tackles these limitations with two key components. First, the proposed method pre-trains the encoder with a direct pose regression head, and jointly learns the networks via the regression head and the denoising diffusion head, significantly accelerating training convergence while achieving higher accuracy. Second, sampling guidance via time-dependent score scaling is proposed so that the exploration-exploitation trade-off is handled effectively, eliminating the need for the additional evaluation network. The sampling guidance maintains the multi-modal characteristics of symmetric objects at early denoising steps while ensuring high-quality pose generation at the final steps. Extensive experiments on multiple benchmarks, including REAL275, HouseCat6D, and ROPE, demonstrate that the proposed method, simple yet effective, achieves state-of-the-art accuracy even with single-pose inference, while being more efficient in both training and inference.
Poster
Fangwei Zhong · Kui Wu · Churan Wang · Hao Chen · Hai Ci · Zhoujun Li · Yizhou Wang
[ Exhibit Hall I ]
Abstract
We introduce UnrealZoo, a rich collection of 100 photo-realistic 3D virtual worlds built on Unreal Engine, designed to reflect the complexity and variability of open worlds with scales up to $16\,\mathrm{km}^2$ landscapes. Additionally, we offer a rich variety of playable entities including humans, animals, robots, and vehicles for embodied AI. We extend UnrealCV with optimized Python APIs and tools for data collection, environment augmentation, distributed training, and benchmarking, achieving significant improvements in the efficiency of rendering and communication, to support advanced applications, such as multi-agent interactions. Our experimental evaluation across complex navigation and tracking tasks reveals two key insights: first, the substantial benefits of the diversity of environments for developing generalizable reinforcement learning (RL) agents; second, the persistent challenges that current embodied agents face in open-world settings. These challenges include transferring to a new embodiment at test time, managing latency in closed-loop control systems for dynamic environments, and effectively reasoning about complex 3D spatial structures in unstructured terrain. UnrealZoo thus provides both a powerful testing ground and a pathway toward more capable embodied AI systems for real-world deployment.
Poster
Valter Piedade · Chitturi Sidhartha · José Gaspar · Venu Madhav Govindu · Pedro Miraldo
[ Exhibit Hall I ]
Abstract
Outliers are ubiquitous in geometric vision contexts such as pose estimation and mapping, leading to inaccurate estimates. While robust loss functions tackle outliers, it is challenging to make the estimation robust to the choice of initialization and to estimate the appropriate robust loss shape parameter that allows distinguishing inliers from outliers. Graduated non-convexity (GNC) often mitigates these issues. However, typical GNC uses a fixed annealing factor to update the shape parameter, which can lead to low-quality or inefficient estimates. This paper proposes a novel approach to adaptively anneal the shape parameter within a GNC framework. We develop a search strategy that incorporates a sampling of annealing choices and model scorings to select the most promising shape parameter at each GNC iteration. Additionally, we propose new stopping criteria and an initialization technique that improves performance for diverse data, and we show the benefits of combining discrete and continuous robust estimation strategies. We evaluate our method using synthetic and real-world data in two problems: 3D registration and pose graph optimization in SLAM sequences. Our results demonstrate greater efficiency and robustness compared to previous GNC schemes.
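For readers unfamiliar with the baseline being improved, the sketch below shows GNC with a fixed annealing factor on a toy problem: robustly estimating a 1D location from samples that contain outliers. It is not the adaptive scheme proposed in the abstract. It assumes the commonly used Geman-McClure surrogate, and the noise bound `c`, the annealing factor of 1.4, and the function name `gnc_location` are illustrative choices rather than values taken from the paper.

```python
def gnc_location(samples, c=1.0, anneal=1.4, max_iters=100):
    """Robust 1D location estimate via fixed-factor GNC (Geman-McClure).

    c      -- assumed inlier noise bound
    anneal -- fixed factor by which the shape parameter mu is divided
    """
    x = sum(samples) / len(samples)  # non-robust least-squares start
    max_r2 = max((s - x) ** 2 for s in samples)
    # Large mu makes the surrogate nearly quadratic (convex); mu = 1
    # recovers the original Geman-McClure loss.
    mu = max(2.0 * max_r2 / (c * c), 1.0)
    weights = [1.0] * len(samples)
    for _ in range(max_iters):
        # Closed-form weight update for the current surrogate.
        weights = [
            (mu * c * c / ((s - x) ** 2 + mu * c * c)) ** 2 for s in samples
        ]
        # Weighted least-squares update of the estimate.
        x = sum(w * s for w, s in zip(weights, samples)) / sum(weights)
        if mu <= 1.0:
            break
        mu = max(mu / anneal, 1.0)  # the fixed annealing step
    return x, weights


# Five inliers near 5.0 and three gross outliers.
estimate, final_weights = gnc_location(
    [4.9, 5.0, 5.1, 5.05, 4.95, 50.0, 60.0, -40.0]
)
```

The single line that divides `mu` by a constant is the step the abstract identifies as a limitation: a factor that is too aggressive can lock in a poor estimate, while one that is too cautious wastes iterations.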
Poster
Zhengbo Zhang · Lin Geng Foo · Hossein Rahmani · Jun Liu · De Wen Soh
[ Exhibit Hall I ]
Abstract
Single image defocus deblurring (SIDD) is a challenging task that aims to recover an all-in-focus image from a defocused one. In this paper, we make the observation that a defocused image can be viewed as a blend of illuminated blobs based on fundamental imaging principles, and the defocus blur in the defocused image is caused by large illuminated blobs intermingling with each other. Thus, from a novel perspective, we perform SIDD by adjusting the shape and opacity of the illuminated blobs that compose the defocused image. With this aim, we design a novel 2D Gaussian blob representation for illuminated blobs and a differentiable rasterization method to obtain the parameters of the 2D Gaussian blobs that compose the defocused image. Additionally, we propose a blob deblurrer to adjust the parameters of the 2D Gaussian blobs corresponding to the defocused image, thereby obtaining a sharp image. We also explore incorporating prior depth information via our depth-based regularization loss to regularize the size of Gaussian blobs, further improving the performance of our method. Extensive experiments on five widely-used datasets validate the effectiveness of our proposed method.
Poster
Guowei Shi · Zian Mao · Peisen Huang
[ Exhibit Hall I ]
Abstract
Ultra-precision measurement of 6DoF pose is essential in applications such as semiconductor manufacturing and nanoscale manipulation. Conventional vision‐based techniques are often hampered by sensitivity to defocus and by the limited number of periods available when imaging periodic patterns, among other issues. In this paper, we propose a novel two-dimensional interpolated Discrete Fourier Transform (2D-IpDFT) method for robust 6DoF pose estimation using periodic patterns. We further develop a mathematical framework that links image parameters—phase and frequency—to 6DoF pose, which is applicable to both orthographic and quasi-orthographic imaging systems. Extensive experiments on a low-cost setup, featuring an industrial camera and etched periodic patterns, demonstrate nanometer-level translational accuracy and microradian-level rotational precision.
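The idea behind an interpolated DFT can be illustrated in one dimension: the true frequency of a periodic signal usually falls between DFT bins, and a fractional offset can be recovered from the peak bin and its two neighbours. The sketch below uses Jacobsen's three-bin estimator as a stand-in; it is not the 2D-IpDFT formulation of the paper, and the function name is hypothetical.

```python
import numpy as np

def interpolated_peak_frequency(signal):
    """Estimate the dominant frequency (cycles per sample) with sub-bin accuracy."""
    x = np.asarray(signal, dtype=float)
    n = x.size
    spectrum = np.fft.rfft(x)
    # Locate the peak bin, skipping the first and last bins so both neighbours exist.
    k = int(np.argmax(np.abs(spectrum[1:-1]))) + 1
    left, mid, right = spectrum[k - 1], spectrum[k], spectrum[k + 1]
    # Jacobsen's estimator: fractional bin offset from three complex DFT values.
    delta = np.real((left - right) / (2 * mid - left - right))
    return (k + delta) / n
```

For example, a cosine with 20.3 periods across 256 samples peaks at bin 20, and the interpolation recovers the remaining 0.3 of a bin.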
Poster
Gabriele Berton · Alex Stoken · Carlo Masone
[ Exhibit Hall I ]
Abstract
Thousands of photos of Earth are taken every day by astronauts from the International Space Station. The localization of these photos, which has been performed manually for decades, has recently been approached through image retrieval solutions: given an astronaut photo, the goal is to find its most similar match among a large database of geo-tagged satellite images, in a task called Astronaut Photography Localization (APL). Yet, existing APL approaches are trained only using satellite images, without taking advantage of the millions of open-source astronaut photos. In this work, we present the first APL pipeline capable of leveraging astronaut photos for training. We first produce full localization information for 300,000 manually weakly labeled astronaut photos through an automated pipeline, and then use these images to train a model, called AstroLoc. AstroLoc learns a robust representation of Earth's surface features through two objective functions: pairing astronaut photos with their matching satellite counterparts in a pairwise loss, and a second loss on clusters of satellite imagery weighted by their relevance to astronaut photography through unsupervised mining. AstroLoc achieves a staggering 35% average improvement in recall@1 over previous SOTA, reaching a recall@100 consistently over 99% for existing datasets. Moreover, without fine-tuning, AstroLoc provides excellent results …
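For readers unfamiliar with the metric, recall@k in a retrieval setting is the fraction of queries for which at least one correct database item appears among the top k results. A minimal sketch, with hypothetical argument names and no connection to the AstroLoc codebase:

```python
def recall_at_k(ranked_ids, positives, k):
    """Fraction of queries with at least one correct match in the top-k results.

    ranked_ids: list of per-query result lists, best match first.
    positives:  list of per-query sets of correct database ids.
    """
    hits = sum(
        1
        for ranked, correct in zip(ranked_ids, positives)
        if any(item in correct for item in ranked[:k])
    )
    return hits / len(ranked_ids)
```

Under this definition, recall@100 is always at least recall@1, which is why the two numbers in the abstract describe different aspects of retrieval quality.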
Poster
Hanwen Jiang · Hao Tan · Peng Wang · Haian Jin · Yue Zhao · Sai Bi · Kai Zhang · Fujun Luan · Kalyan Sunkavalli · Qixing Huang · Georgios Pavlakos
[ Exhibit Hall I ]
Abstract
We present RayZer, a self-supervised multi-view 3D Vision model trained without any 3D supervision, i.e., camera poses and scene geometry, while exhibiting emerging 3D awareness. Concretely, RayZer takes unposed and uncalibrated images as input, recovers camera parameters, reconstructs a scene representation, and synthesizes novel views. During training, RayZer relies solely on its self-predicted camera poses to render target views, eliminating the need for any ground-truth camera annotations and allowing RayZer to be trained with 2D image supervision. The emerging 3D awareness of RayZer is attributed to two key factors. First, we design a self-supervised framework, which achieves 3D-aware auto-encoding of input images by disentangling camera and scene representations. Second, we design a transformer-based model in which the only 3D prior is the ray structure, connecting camera, pixel, and scene simultaneously. RayZer demonstrates novel view synthesis performance comparable or even superior to that of "oracle" methods that rely on pose annotations in both training and testing.
Poster
Weirong Chen · Ganlin Zhang · Felix Wimbauer · Rui Wang · Nikita Araslanov · Andrea Vedaldi · Daniel Cremers
[ Exhibit Hall I ]
Abstract
Traditional SLAM systems, which rely on bundle adjustment, often struggle with highly dynamic scenes commonly found in casual videos. Such videos entangle the motion of dynamic elements, undermining the assumption of static environments required by traditional systems. Existing techniques either filter out dynamic elements or model their motion independently. However, the former often results in incomplete reconstructions, whereas the latter can lead to inconsistent motion estimates. This work proposes a novel approach that leverages a 3D point tracker to decouple the static and dynamic motion, effectively separating the camera-induced motion from the motion of dynamic objects. Bundle adjustment can therefore operate reliably considering only the camera-induced component of the observed motion. We further ensure depth consistency across video frames with lightweight post-processing based on scale maps. Our framework combines the core of traditional SLAM -- bundle adjustment -- with a robust learning-based 3D tracker front-end. By integrating motion decomposition, bundle adjustment, and depth refinement into a unified framework, our method accurately tracks the camera motion and produces temporally coherent and scale-consistent dense reconstructions, accommodating both static and dynamic elements. Our experiments on challenging datasets reveal significant improvements in camera pose estimation and 3D reconstruction accuracy.
Poster
Hanwen Jiang · Qixing Huang · Georgios Pavlakos
[ Exhibit Hall I ]
Abstract
Training single-view Large Reconstruction Models (LRMs) follows the fully supervised route, requiring multi-view supervision. However, the multi-view data typically comes from synthetic 3D assets, which are hard to scale further and are not representative of the distribution of real-world object shapes. To address these limitations, we introduce Real3D, the first LRM that uses single-view real images for training, benefiting from their scalability and capturing the real-world shape distribution. Real3D introduces a novel self-training framework, including unsupervised losses at the pixel- and semantic-level, enabling LRMs to learn from these single-view images without multi-view supervision. Simultaneously, to deal with the noise of real data, Real3D also presents an automatic data curation approach to gather high-quality examples that have a positive impact on training. Our experiments show that Real3D consistently outperforms prior work in diverse evaluation settings that include real and synthetic data, as well as both in-domain and out-of-domain shapes.
Poster
Olaf Dünkel · Thomas Wimmer · Christian Theobalt · Christian Rupprecht · Adam Kortylewski
[ Exhibit Hall I ]
Abstract
Finding correspondences between semantically similar points across images and object instances is one of the everlasting challenges in computer vision. While large pre-trained vision models have recently been demonstrated as effective priors for semantic matching, they still suffer from ambiguities for symmetric objects or repeated object parts. We propose to improve semantic correspondence estimation via 3D-aware pseudo-labeling. Specifically, we refine off-the-shelf features using pseudo ground truth obtained via 3D-aware chaining, filtering wrong labels through relaxed cyclic consistency, and 3D spherical prototype mapping constraints. While reducing the need for dataset-specific annotations compared to prior work, we set a new state-of-the-art on SPair-71k by over 4% absolute gain and by over 7% against methods with similar supervision requirements. The generality of our proposed approach simplifies extension of training to other data sources, which we demonstrate in our experiments.
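As a rough illustration of how a cyclic consistency check with a tolerance can filter pseudo-labels, the sketch below matches features from image A to image B and back, and keeps a match only if the round trip lands within `tol` pixels of where it started. This is a generic forward-backward check written under assumed inputs; the function name, arguments, and tolerance are hypothetical, and the paper's relaxed criterion and its 3D-aware chaining may differ.

```python
import numpy as np

def cycle_consistent_matches(feat_a, feat_b, pos_a, tol=1.0):
    """Return (index_in_a, index_in_b) pairs that survive an A -> B -> A round trip.

    feat_a, feat_b: (N, D) and (M, D) feature arrays.
    pos_a:          (N, 2) pixel coordinates of the features in image A.
    tol:            allowed round-trip displacement in pixels (0 = strict check).
    """
    fa = feat_a / np.linalg.norm(feat_a, axis=1, keepdims=True)
    fb = feat_b / np.linalg.norm(feat_b, axis=1, keepdims=True)
    sim = fa @ fb.T                      # cosine similarity matrix
    fwd = sim.argmax(axis=1)             # nearest neighbour A -> B
    bwd = sim.argmax(axis=0)             # nearest neighbour B -> A
    back = bwd[fwd]                      # where each A point returns to in A
    err = np.linalg.norm(pos_a - pos_a[back], axis=1)
    return [(int(i), int(fwd[i])) for i in np.flatnonzero(err <= tol)]
```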
Poster
Pengjie Zhang · Lin Zhu · Xiao Wang · Lizhi Wang · Hua Huang
[ Exhibit Hall I ]
Abstract
Event cameras have shown promise in vision applications like optical flow estimation and stereo matching with many specialized architectures. However, existing works only focus on event data within the confines of task-specific domains, overlooking the correlations between tasks across the temporal and spatial domains. In this paper, we propose a novel matching-based framework for event cameras to estimate flow and disparity simultaneously in a shared representation space, reformulating them as a unified pixel-wise correspondence matching problem. Specifically, our method utilizes a Temporal Recurrent Network to aggregate asynchronous event features across temporal or spatial domains, and a Spatial Contextual Attention to enhance knowledge transfer across event flows via temporal or spatial interactions. By utilizing a shared pixel-wise feature similarities module, our network performs optical flow estimation from temporal event segments and stereo matching from spatial event segments simultaneously. Our unified model inherently supports multi-task unification and cross-task transfer, which facilitate training and streamline deployment. Without the need for retraining on specific tasks, our model can effectively handle both event-based flow and stereo estimation, achieving state-of-the-art performance on both tasks. Our code will be released upon acceptance.
Poster
Lojze Zust · Yohann Cabon · Juliette Marrie · Leonid Antsfeld · Boris Chidlovskii · Jerome Revaud · Gabriela Csurka
[ Exhibit Hall I ]
Abstract
Panoptic segmentation of 3D scenes, which consists in isolating object instances in a dense 3D reconstruction of a scene, is challenging given only unposed images. Existing approaches typically extract 2D panoptic segmentations for each image using an off-the-shelf model, before optimizing an implicit geometric representation (often NeRF-based) that integrates and fuses 2D panoptic constraints. Not only does this require camera parameters and costly test-time optimization for each scene, but we also argue that performing 2D panoptic segmentation, despite the problem at hand being fundamentally 3D and multi-view, is likely suboptimal. In this work, we instead propose a simple integrated and unified approach. Our novel network, named PanSt3R, jointly predicts the 3D geometry and panoptic segmentation without any test-time optimization in a single forward pass. PanSt3R builds upon recent advances in 3D reconstruction, specifically upon MUSt3R, a scalable multi-view version of DUSt3R, which we endow with semantic knowledge and panoptic segmentation capabilities. We additionally revisit the standard post-processing mask merging procedure and introduce a more principled approach. Overall, the proposed PanSt3R is simple, fast and scalable. We conduct extensive experiments on multiple benchmarks and show that our method yields state-of-the-art results while being orders of magnitude faster.
Poster
Pingchuan Ma · Ming Gui · Johannes Schusterbauer · Xiaopei Yang · Olga Grebenkova · Vincent Tao Hu · Björn Ommer
[ Exhibit Hall I ]
Abstract
Generative probabilistic models have rapidly advanced and are now widely used in content creation. They have achieved impressive results in generating artwork and demonstrated an understanding of different styles. However, their understanding of art primarily remains at the level of individual pieces, limiting their ability to reveal broader stylistic trends and transitions over time. To analyze how art evolves, a distributional perspective is required, as single-instance observations do not capture the relation between them, which is essential for such a study. In this work, we introduce a diverse and high-quality dataset of over 656,536 artworks spanning various genres, including paintings, illustrations, and other art forms, along with relevant metadata and annotations. Building on this dataset, we present a method that models the evolution of art as an optimal transport problem with stochastic interpolant to examine stylistic changes over time without requiring paired data. This approach allows us to study and understand the historical progression of art, uncovering the transitions and stylistic shifts that have occurred over centuries. Our code and dataset will be released upon publication.
Poster
Matteo Dunnhofer · Zaira Manigrasso · Christian Micheloni
[ Exhibit Hall I ]
Abstract
Visual object tracking and segmentation are becoming fundamental tasks for understanding human activities in egocentric vision. Recent research has benchmarked state-of-the-art methods and concluded that first person egocentric vision presents challenges compared to previously studied domains. However, these claims are based on evaluations conducted across significantly different scenarios. Many of the challenging characteristics attributed to egocentric vision are also present in third person videos of human-object activities. This raises a critical question: how much of the observed performance drop stems from the unique first person viewpoint inherent to egocentric vision versus the domain of human-object activities? To address this question, we introduce a new benchmark study designed to disentangle such factors. Our evaluation strategy enables a more precise separation of challenges related to the first person perspective from those linked to the broader domain of human-object activity understanding. By doing so, we provide deeper insights into the true sources of difficulty in egocentric tracking and segmentation, facilitating more targeted advancements in this field.
Poster
Junyan Ye · Honglin Lin · Leyan Ou · Dairong Chen · Zihao Wang · Qi Zhu · Conghui He · Weijia Li
[ Exhibit Hall I ]
Abstract
Cross-view geo-localization identifies the locations of street-view images by matching them with geo-tagged satellite images or OSM. However, most existing studies focus on image-to-image retrieval, with fewer addressing text-guided retrieval, a task vital for applications like pedestrian navigation and emergency response. In this work, we introduce a novel task for cross-view geo-localization with natural language descriptions, which aims to retrieve corresponding satellite images or OSM database based on scene text descriptions. To support this task, we construct the CVG-Text dataset by collecting cross-view data from multiple cities and employing a scene text generation approach that leverages the annotation capabilities of Large Multimodal Models to produce high-quality scene text descriptions with localization details. Additionally, we propose a novel text-based retrieval localization method, CrossText2Loc, which improves recall by 10% and demonstrates excellent long-text retrieval capabilities. In terms of explainability, it not only provides similarity scores but also offers retrieval reasons. More information can be found at https://cvg-text.github.io/CVG-Text/.
Poster
Gongwei Chen · Xurui Zhou · Rui Shao · Yibo Lyu · Kaiwen Zhou · Shuai Wang · WenTao Li · Yinchuan Li · Zhongang Qi · Liqiang Nie
[ Exhibit Hall I ]
Abstract
The research focus of GUI agents is shifting from text-dependent to pure-vision-based approaches, which, though promising, prioritize comprehensive pre-training data collection while neglecting contextual modeling challenges. We probe the characteristics of element and history contextual modeling in GUI agents and summarize: 1) the high-density and loose-relation of element context highlight the existence of many unrelated elements and their negative influence; 2) the high redundancy of history context reveals the inefficient history modeling in current GUI agents. In this work, we propose a context-aware simplification framework for building an efficient and effective GUI Agent, termed SimpAgent. To mitigate potential interference from numerous unrelated elements, we introduce a masking-based element pruning method that circumvents the intractable relation modeling through an efficient masking mechanism. To reduce the redundancy in historical information, we devise a consistency-guided history compression module, which enhances implicit LLM-based compression through innovative explicit guidance, achieving an optimal balance between performance and efficiency. With the above components, SimpAgent reduces FLOPs by 27% and achieves superior GUI navigation performance. Comprehensive navigation experiments across diverse web and mobile environments demonstrate the effectiveness and potential of our agent.
Poster
Jungdae Lee · Taiki Miyanishi · Shuhei Kurita · Koya Sakamoto · Daichi Azuma · Yutaka Matsuo · Nakamasa Inoue
[ Exhibit Hall I ]
Abstract
Vision-and-language navigation (VLN) aims to develop agents capable of navigating in realistic environments. While recent cross-modal training approaches have significantly improved navigation performance in both indoor and outdoor scenarios, aerial navigation over real-world cities remains underexplored primarily due to limited datasets and the difficulty of integrating visual and geographic information. To fill this gap, we introduce CityNav, the first large-scale real-world dataset for aerial VLN. Our dataset consists of 32,637 human demonstration trajectories, each paired with a natural language description, covering 4.65 km$^2$ across two real cities: Cambridge and Birmingham. In contrast to existing datasets composed of synthetic scenes such as AerialVLN, our dataset presents a unique challenge because agents must interpret spatial relationships between real-world landmarks and the navigation destination, making CityNav an essential benchmark for advancing aerial VLN. Furthermore, as an initial step toward addressing this challenge, we provide a methodology for creating geographic semantic maps that can be used as an auxiliary modality input during navigation. In our experiments, we compare the performance of three representative aerial VLN agents (Seq2seq, CMA and AerialVLN models) and demonstrate that the semantic map representation significantly improves their navigation performance.
Poster
Ruifei Zhang · Wei Zhang · Xiao Tan · Sibei Yang · Xiang Wan · Xiaonan Luo · Guanbin Li
[ Exhibit Hall I ]
Abstract
Recent advancements in language-grounded autonomous driving have been significantly promoted by the sophisticated cognition and reasoning capabilities of large language models (LLMs). However, current LLM-based approaches encounter critical challenges: (1) Failure analysis reveals that frequent collisions and obstructions, stemming from limitations in visual representations, remain primary obstacles to robust driving performance. (2) The large parameter counts of LLMs pose considerable deployment hurdles. To address these limitations, we introduce VLDrive, a novel approach featuring a lightweight MLLM architecture with enhanced vision components. VLDrive achieves compact visual tokens through innovative strategies, including cycle-consistent dynamic visual pruning and memory-enhanced feature aggregation. Furthermore, we propose a distance-decoupled instruction attention mechanism to improve joint visual-linguistic feature learning, particularly for long-range visual tokens. Extensive experiments conducted in the CARLA simulator demonstrate VLDrive's effectiveness. Notably, VLDrive achieves state-of-the-art driving performance while reducing parameters by 81% (from 7B to 1.3B), yielding substantial driving score improvements of 15.4%, 16.8%, and 7.6% at tiny, short, and long distances, respectively, in closed-loop evaluations.
Poster
Zhengyao Lyu · Tianlin Pan · Chenyang Si · Zhaoxi Chen · Wangmeng Zuo · Ziwei Liu · Kwan-Yee K. Wong
[ Exhibit Hall I ]
Abstract
Multimodal Diffusion Transformers (MM-DiTs) have achieved remarkable progress in text-driven visual generation. However, even state-of-the-art MM-DiT models like FLUX struggle with achieving precise alignment between text prompts and generated content. We identify two key issues in the attention mechanism of MM-DiT, namely 1) the suppression of cross-modal attention due to token imbalance between visual and textual modalities and 2) the lack of timestep-aware attention weighting, which hinder the alignment. To address these issues, we propose Temperature-Adjusted Cross-modal Attention (TACA), a parameter-efficient method that dynamically rebalances multimodal interactions through temperature scaling and timestep-dependent adjustment. When combined with LoRA fine-tuning, TACA significantly enhances text-image alignment on the T2I-CompBench benchmark with minimal computational overhead. We tested TACA on state-of-the-art models like FLUX and SD3.5, demonstrating its ability to improve image-text alignment in terms of object appearance, attribute binding, and spatial relationships. Our findings highlight the importance of balancing cross-modal attention in improving semantic fidelity in text-to-image diffusion models. Our code will be made publicly available.
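The token-imbalance issue and the effect of temperature scaling can be seen in a toy example. The sketch below is an assumption-laden illustration, not the released TACA code: it computes attention for image-token queries over concatenated text and image keys and multiplies only the image-to-text logits by a factor `gamma` (a hypothetical parameter name; a timestep-dependent schedule for it is not modeled here).

```python
import numpy as np

def image_query_attention(q, k_text, k_image, gamma=1.0):
    """Softmax attention weights over [text keys, image keys] for image queries.

    gamma > 1 amplifies the cross-modal (image-to-text) logits before the
    softmax; gamma = 1 recovers standard scaled dot-product attention.
    """
    d = q.shape[-1]
    logits_text = gamma * (q @ k_text.T) / np.sqrt(d)
    logits_image = (q @ k_image.T) / np.sqrt(d)
    logits = np.concatenate([logits_text, logits_image], axis=-1)
    logits = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=-1, keepdims=True)
```

With 2 text keys and 6 image keys of equal positive similarity, the text tokens receive 25% of the attention mass at `gamma = 1` and roughly 35% at `gamma = 1.5`. Note that multiplicative scaling raises the text share only where the cross-modal logits are positive.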
Poster
Yizhou Zhao · Haoyu Chen · Chunjiang Liu · Zhenyang Li · Charles Herrmann · Junhwa Hur · Yinxiao Li · Ming-Hsuan Yang · Bhiksha Raj · Min Xu
[ Exhibit Hall I ]
Abstract
System identification from videos aims to recover object geometry and governing physical laws. Existing methods integrate differentiable rendering with simulation but rely on predefined material priors, limiting their ability to handle unknown ones. We introduce MASIV, the first vision-based framework for material-agnostic system identification. Unlike existing approaches that depend on hand-crafted constitutive laws, MASIV employs learnable neural constitutive models, inferring object dynamics without assuming a scene-specific material prior. However, the absence of full particle state information imposes unique challenges, leading to unstable optimization and physically implausible behaviors. To address this, we introduce dense geometric guidance by reconstructing continuum particle trajectories, providing temporally rich motion constraints beyond sparse visual cues. Comprehensive experiments show that MASIV achieves state-of-the-art performance in geometric accuracy, rendering quality, and generalization ability.
Poster
SHIBO WANG · Haonan He · Maria Parelli · Christoph Gebhardt · Zicong Fan · Jie Song
[ Exhibit Hall I ]
Abstract
We interact with objects every day, making the holistic 3D reconstruction of hands and objects from videos essential for applications like robotic in-hand manipulation. While most RGB-based methods rely on object templates, existing template-free approaches depend heavily on image observations, assuming full visibility of the object in the video. However, this assumption often does not hold in real-world scenarios, where cameras are fixed and objects are held in a static grip. As a result, parts of the object may remain unobserved, leading to unrealistic reconstructions when the object is under-observed. To this end, we introduce MagicHOI, a method for reconstructing hands and objects from short monocular interaction videos, even under limited views. Our key insight is that, although paired 3D hand-object data is extremely scarce, large-scale diffusion models like image-to-3D models offer abundant object supervision. This additional supervision can act as a prior to help regularize unseen object regions during hand interactions. Leveraging this insight, MagicHOI incorporates an existing image-to-3D diffusion model into a hand-object reconstruction framework. We then refine hand poses by incorporating hand-object interaction constraints. Our results demonstrate that MagicHOI significantly outperforms existing state-of-the-art template-free hand-object reconstruction methods. We also show that image-to-3D diffusion priors effectively regularize unseen object …
Poster
Meiqi Cao · Xiangbo Shu · Xin Jiang · Rui Yan · Yazhou Yao · Jinhui Tang
[ Exhibit Hall I ]
Abstract
While event cameras excel in capturing microsecond temporal dynamics, they suffer from sparse spatial representations compared to traditional RGB data. Thus, multi-modal event-based action recognition approaches aim to synergize complementary strengths by independently extracting and integrating paired RGB-Event features. However, this paradigm inevitably introduces additional data acquisition costs, while eroding the inherent privacy advantages of event-based sensing. Drawing inspiration from event-to-image reconstruction, texture-enriched visual representation directly reconstructed from asynchronous event streams is a promising solution. In response, we propose an Enhanced Multimodal Perceptual (EMP) framework that hierarchically explores multimodal cues (e.g., edges and textures) from raw event streams through two synergistic innovations spanning representation to feature levels. Specifically, we introduce a Cross-Modal Frequency Enhancer (CFE) that leverages complementary frequency characteristics between reconstructed frames and stacked frames to refine event representations. Furthermore, to achieve unified feature encoding across modalities, we develop a High-Frequency Guided Selector (HGS) for semantic consistency token selection guided by dynamic edge features while suppressing redundant multimodal information interference adaptively. Extensive experiments on four benchmark datasets demonstrate the superior effectiveness of our proposed framework.
Poster
Runmin Zhang · Zhu Yu · Si-Yuan Cao · Lingyu Zhu · Guangyi Zhang · Xiaokai Bai · Hui-liang Shen
[ Exhibit Hall I ]
Abstract
This work presents SGCDet, a novel multi-view indoor object detection framework based on adaptive 3D volume construction. Unlike previous approaches that restrict the receptive field of voxels to fixed locations on images, we introduce a geometry and context aware aggregation module to integrate geometric and contextual information within an adaptive region, enhancing the representation capability of voxel features. Furthermore, we propose a sparse volume construction strategy that adaptively identifies and selects voxels with a high occupancy probability for feature refinement, minimizing redundant computation in free space. Benefiting from the above designs, our framework achieves effective and efficient volume construction in an adaptive way. Better still, our network can be supervised using only 3D bounding boxes, eliminating the dependence on ground-truth scene geometry. Experimental results demonstrate that SGCDet achieves state-of-the-art performance on the ScanNet and ARKitScenes datasets. Compared to the previous state-of-the-art approach, our SGCDet reduces training memory, training time, inference memory, and inference time by 42.9%, 47.2%, 50%, and 40.8%, respectively, while achieving notable improvements in mAP@0.50 of 3.9 on ScanNet and 3.3 on ARKitScenes.
Poster
Xiao Lin · Yun Peng · Liuyi Wang · xianyou zhong · Minghao Zhu · Jingwei Yang · Yi Feng · Chengju Liu · Qijun Chen
[ Exhibit Hall I ]
Abstract
In the effort to achieve robust and generalizable category-level object pose estimation, recent methods primarily focus on learning fundamental representations from data. However, the inherent biases within the data are often overlooked: the repeated training samples and similar environments may mislead the models to over-rely on specific patterns, hindering models' performance on novel instances. In this paper, we present CleanPose, a novel method that mitigates the data biases to enhance category-level pose estimation by integrating causal learning and knowledge distillation. By incorporating key causal variables (structural information and hidden confounders) into causal modeling, we propose the causal inference module based on front-door adjustment, which promotes unbiased estimation by reducing potential spurious correlations. Additionally, to further confront the data bias at the feature level, we devise a residual-based knowledge distillation approach to transfer unbiased semantic knowledge from a 3D foundation model, providing comprehensive causal supervision. Extensive experiments across multiple benchmarks (REAL275, CAMERA25 and HouseCat6D) highlight the superiority of the proposed CleanPose over state-of-the-art methods. Code will be released.
Poster
Younjoon Chung · Hyoungseob Park · Patrick Rim · Xiaoran Zhang · Jihe He · Ziyao Zeng · Safa Cicek · Byung-Woo Hong · James Duncan · Alex Wong
[ Exhibit Hall I ]
Abstract
We propose a method of adapting pretrained depth completion models to test time data in an unsupervised manner. Depth completion models are (pre)trained to produce dense depth maps from pairs of RGB images and sparse depth maps in ideal capture conditions (source domain), e.g., well-illuminated, high signal-to-noise. When models are transferred to capture conditions outside the ideal case (target domain), they produce erroneous output dense depth maps due to the covariate shift. To identify cases of out-of-distribution errors, we propose to learn an energy model in the source domain that assigns scalars that represent the likelihood of the output dense depth maps. This energy model is further used to adapt the pretrained depth completion models at test time, leading to our method: Energy-based Test-time Adaptation (ETA). ETA can localize regions of errors as high energy; test-time adaptation involves updating the model weights to minimize the energy, which in turn mitigates the covariate shift. We evaluate ETA across 3 indoor and 3 outdoor datasets, where ETA improves over the previous state of the art by an average of 6.94% on outdoor and 10.23% on indoor settings.
Poster
Nikita Karaev · Iurii Makarov · Jianyuan Wang · Natalia Neverova · Andrea Vedaldi · Christian Rupprecht
[ Exhibit Hall I ]
Abstract
We introduce CoTracker3, a new state-of-the-art point tracker. With CoTracker3, we revisit the design of recent trackers, removing components and reducing the number of parameters while also improving performance. We also explore the interplay of synthetic and real data. Recent trackers are trained on synthetic videos due to the difficulty of collecting tracking annotations for real data. However, this can result in suboptimal performance due to the statistical gap between synthetic and real videos. We thus suggest using off-the-shelf trackers as teachers, annotating real videos with pseudo-labels. Compared to other recent attempts at using real data for learning trackers, this scheme is much simpler and achieves better results using 1,000 times less data. CoTracker3 is available in online (causal) and offline variants and is particularly robust to occlusions.
Poster
Yunpeng Bai · Qixing Huang
[ Exhibit Hall I ]
Abstract
Monocular Depth Estimation (MDE) is a fundamental 3D vision problem with numerous applications such as 3D scene reconstruction, autonomous navigation, and AI content creation. However, robust and generalizable MDE remains challenging due to limited real-world labeled data and distribution gaps between synthetic datasets and real data. Existing methods often struggle on real-world test data with low efficiency, reduced accuracy, and lack of detail. To address these issues, we propose an efficient MDE approach named FiffDepth. The key feature of FiffDepth is its use of diffusion priors. It transforms diffusion-based image generators into a feed-forward architecture for detailed depth estimation. FiffDepth preserves key generative features and integrates the strong generalization capabilities of models like DINOv2. Through benchmark evaluations, we demonstrate that FiffDepth achieves exceptional accuracy, stability, and fine-grained detail, offering significant improvements in MDE performance against state-of-the-art MDE approaches.
Poster
Inwoo Hwang · Bing Zhou · Young Min Kim · Jian Wang · Chuan Guo
[ Exhibit Hall I ]
Abstract
Modeling human-scene interactions (HSI) is essential for understanding and simulating everyday human behaviors. Recent approaches utilizing generative modeling have made progress in this domain; however, they are limited in controllability and flexibility for real-world applications. To address these challenges, we propose reformulating the HSI modeling problem as Scene-aware Motion In-betweening, a more tractable and practical task. We introduce SceneMI, a framework that supports several practical applications, including keyframe-guided character animation in 3D scenes and enhancing the motion quality of imperfect HSI data. SceneMI employs dual scene descriptors to comprehensively encode global and local scene context. Furthermore, our framework leverages the inherent denoising nature of diffusion models to generalize to noisy keyframes. Experimental results demonstrate SceneMI's effectiveness in scene-aware keyframe in-betweening and generalization to the real-world GIMO dataset, where motions and scenes are acquired by noisy IMU sensors and smartphones. We further showcase SceneMI's applicability in HSI reconstruction from monocular videos.
Poster
Zerui Chen · Rolandos Alexandros Potamias · Shizhe Chen · Cordelia Schmid
[ Exhibit Hall I ]
Abstract
Reconstructing hand-held objects in 3D from monocular images remains a significant challenge in computer vision. Most existing approaches rely on implicit 3D representations, which produce overly smooth reconstructions and are time-consuming to generate explicit 3D shapes. While more recent methods directly reconstruct point clouds with diffusion models, the multi-step denoising makes high-resolution reconstruction inefficient. To address these limitations, we propose a transformer-based model to efficiently reconstruct dense 3D point clouds of hand-held objects. Our method follows a coarse-to-fine strategy, first generating a sparse point cloud from the image and progressively refining it into a dense representation using pixel-aligned image features. To enhance reconstruction accuracy, we integrate image features with 3D hand geometry to jointly predict the object point cloud and its pose relative to the hand. Our model is trained end-to-end for optimal performance. Experimental results on both synthetic and real datasets demonstrate that our method achieves state-of-the-art accuracy with much faster inference speed, while generalizing well to in-the-wild images.
Poster
Marvin Burges · Philipe Dias · Dalton Lunga · Carson Woody · Sarah Walters
[ Exhibit Hall I ]
Abstract
Object detection in remote sensing demands extensive, high-quality annotations, a process that is both labor-intensive and time-consuming. In this work, we introduce a real-time active learning and semi-automated labeling framework that leverages foundation models to streamline dataset annotation for object detection in remote sensing imagery. By integrating a Segment Anything Model (SAM), our approach generates mask-based bounding boxes that serve as the basis for dual sampling: (a) uncertainty estimation to pinpoint challenging samples, and (b) diversity assessment to ensure broad data coverage. Furthermore, our Dynamic Box Switching Module (DBS) addresses the well-known cold start problem for object detection models by replacing their suboptimal initial predictions with SAM-derived masks, thereby enhancing early-stage localization accuracy. Extensive evaluations on multiple remote sensing datasets, plus a real-world user study, demonstrate that our framework not only reduces annotation effort but also significantly boosts detection performance compared to traditional active learning sampling methods. The code for training and the user interface will be made available.
Poster
Fengyuan Yang · Kerui Gu · Ha Linh Nguyen · Tze Ho Elden Tse · Angela Yao
[ Exhibit Hall I ]
Abstract
Accurate camera motion estimation is essential for recovering global human motion in world coordinates from RGB video inputs. While SLAM is widely used for estimating camera trajectory and point cloud, monocular SLAM does so only up to an unknown scale factor. Previous works estimate the scale factor through optimization, but this is unreliable and time-consuming. This paper presents an optimization-free scale calibration framework, Human as Checkerboard (HAC). HAC explicitly leverages the human body predicted by a human mesh recovery model as a calibration reference. Specifically, it uses the absolute depth of human-scene contact joints as references to calibrate the corresponding relative scene depth from SLAM. HAC benefits from geometric priors encoded in human mesh recovery models to estimate the SLAM scale and achieves precise global human motion estimation. Simple yet powerful, our method sets a new state of the art for global human mesh estimation tasks. It reduces motion errors by 50% over prior local-to-global methods while using 100$\times$ less post-SLAM inference time than optimization-based methods. Our code will be made public.
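The calibration idea in this abstract (absolute depths at contact joints used as references for relative SLAM depths) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `estimate_scale`, the input lists, and the use of a median over per-point ratios are all assumptions made for illustration.

```python
from statistics import median


def estimate_scale(human_depths_m, slam_depths):
    """Estimate a metric scale factor for an up-to-scale SLAM reconstruction.

    human_depths_m: absolute depths (metres) at contact joints (hypothetical input).
    slam_depths:    relative SLAM depths sampled at the same image locations.
    """
    # One ratio per contact point; skip non-positive SLAM depths.
    ratios = [h / s for h, s in zip(human_depths_m, slam_depths) if s > 0]
    if not ratios:
        raise ValueError("no valid contact points")
    # Assumption: a median keeps a single bad correspondence from dominating.
    return median(ratios)


human = [2.0, 4.1, 3.0, 6.0]
slam = [1.0, 2.0, 1.5, 1.0]  # the last pair is a deliberate outlier
scale = estimate_scale(human, slam)
```

Multiplying the SLAM trajectory and point cloud by `scale` would then place them in metric units under these assumptions.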
Poster
Xin Qiao · Matteo Poggi · Xing Wei · Pengchao Deng · Yanhui Zhou · Stefano Mattoccia
[ Exhibit Hall I ]
Abstract
Under-display ToF imaging aims to both achieve precise depth sensing and maximize user experience by embedding a ToF camera beneath a screen panel. However, multiple complex degradations may occur during the imaging process, resulting in a significant loss of depth quality. To alleviate this drawback, we introduce a hybrid framework, named Learnable Fractional Reaction-Diffusion Dynamics (LFRD$^2$), which integrates the robust feature representation capabilities of neural networks with the interpretability of physical models. Specifically, we design a neural module implementing the time-fractional reaction-diffusion equation with dynamically generated differential orders, which allows for iterative refinement to enhance depth quality. This module can correlate the current state of the predicted depth with any preceding states, keeping track of the long-term memory of the system itself. Furthermore, we propose a novel approach to construct an efficient continuous convolution operator based on coefficient prediction and repeated differentiation, further enhancing the final quality. Experimental results illustrate the effectiveness of our framework on four benchmark datasets. The code will be made available upon acceptance.
Poster
Lena Wild · Rafael Valencia · Patric Jensfelt
[ Exhibit Hall I ]
Abstract
Reliable integration of prior information is crucial for self-verifying and self-updating HD maps. However, no public dataset includes the required triplet of prior maps, current maps, and sensor data. As a result, existing methods must rely on synthetic priors, which create inconsistencies and lead to a significant sim2real gap. To address this, we introduce ArgoTweak, the first dataset to complete the triplet with realistic map priors. At its core, ArgoTweak employs a bijective mapping framework, breaking down large-scale modifications into fine-grained atomic changes at the map element level, thus ensuring interpretability. This paradigm shift enables accurate change detection and integration while preserving unchanged elements with high fidelity. Experiments show that training models on ArgoTweak significantly reduces the sim2real gap compared to synthetic priors. Extensive ablations further highlight the impact of structured priors and detailed change annotations. By establishing a benchmark for explainable, prior-aided HD mapping, ArgoTweak advances scalable, self-improving mapping solutions. Code, dataset, and our map modification toolbox will be made available at [URL].
Poster
Jijun Xiang · Xuan Zhu · Xianqi Wang · Yu Wang · Hong Zhang · Fei Guo · Xin Yang
[ Exhibit Hall I ]
Abstract
Depth enhancement, which uses RGB images as guidance to convert raw signals from dToF into high-precision, dense depth maps, is a critical task in computer vision. Although existing super-resolution-based methods show promising results on public datasets, they often rely on idealized assumptions like accurate region correspondences and reliable dToF inputs, overlooking calibration errors that cause misalignment and anomaly signals inherent to dToF imaging, limiting real-world applicability. To address these challenges, we propose a novel completion-based method, named DEPTHOR, featuring advances in both the training strategy and model architecture. First, we propose a method to simulate real-world dToF data from the accurate ground truth in synthetic datasets to enable noise-robust training. Second, we design a novel network that incorporates monocular depth estimation (MDE), leveraging global depth relationships and contextual information to improve prediction in challenging regions. On the ZJU-L5 dataset, our training strategy significantly enhances depth completion models, achieving results comparable to depth super-resolution methods, while our model achieves state-of-the-art results, improving Rel and RMSE by 27% and 18%, respectively. On a more challenging set of dToF samples we collected, our method outperforms SOTA methods on preliminary stereo-based GT, improving Rel and RMSE by 23% and 22%, respectively. Our code, trained …
Poster
Jingxi Liao · Shijie Hao · Richang Hong · Meng Wang
[ Exhibit Hall I ]
Abstract
Low-light image enhancement (LLIE) aims to improve the visual quality of images captured under poor lighting conditions. In supervised LLIE tasks, there exists a significant yet often overlooked inconsistency between the overall brightness of an enhanced image and its ground truth counterpart, referred to as $\textit{brightness mismatch}$ in this study. Brightness mismatch negatively impacts supervised LLIE models by misleading model training. However, this issue is largely neglected in current research. In this context, we propose the $\textit{GT-mean loss}$, a simple yet effective loss function directly modeling the mean values of images from a probabilistic perspective. The GT-mean loss is flexible, as it extends existing supervised LLIE loss functions into the GT-mean form with minimal additional computational costs. Extensive experiments demonstrate that the incorporation of the GT-mean loss results in consistent performance improvements across various methods and datasets.
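To make brightness mismatch concrete, the sketch below compares a plain L1 loss with the same loss computed after rescaling the prediction so that its mean matches the ground-truth mean. The abstract does not give the GT-mean loss formula, so this is only an illustration of the mismatch it targets, not the proposed loss; `brightness_aligned_loss` and the toy pixel lists are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)


def l1_loss(pred, target):
    """Mean absolute error between two equally sized pixel lists."""
    return mean([abs(p - t) for p, t in zip(pred, target)])


def brightness_aligned_loss(pred, target, base_loss=l1_loss, eps=1e-8):
    """Apply base_loss after matching the prediction's mean to the target's.

    Assumption: a single global gain is enough to model the brightness offset.
    """
    gain = mean(target) / (mean(pred) + eps)
    aligned = [gain * p for p in pred]
    return base_loss(aligned, target)


# The prediction has the right structure but is globally half as bright.
pred = [0.1, 0.2, 0.3]
target = [0.2, 0.4, 0.6]
plain = l1_loss(pred, target)
aligned = brightness_aligned_loss(pred, target)
```

Here `plain` is dominated by the global brightness offset, while `aligned` is close to zero because the remaining structure already matches.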
Poster
Li Mi · Manon Béchaz · Zeming Chen · Antoine Bosselut · Devis Tuia
[ Exhibit Hall I ]
Abstract
Active Geo-localization (AGL) is the task of localizing a goal, represented in various modalities (e.g., aerial images, ground-level images, or text), within a predefined search area. Current methods approach AGL as a goal-reaching reinforcement learning (RL) problem with a distance-based reward. They localize the goal by learning to estimate the relative distance from it. However, when distance estimation becomes challenging or when encountering unseen targets and environments, the agent exhibits reduced robustness and generalization ability due to the less reliable exploration strategy learned during training. In this paper, we propose GeoExplorer, an AGL agent that incorporates curiosity-driven exploration through intrinsic rewards. Unlike distance-based rewards, our curiosity-driven reward is goal-agnostic, enabling robust, diverse, and contextually relevant exploration based on effective environment modeling. These capabilities have been proven through extensive experiments across four AGL benchmarks, demonstrating the effectiveness and generalization ability of GeoExplorer in diverse settings, particularly in localizing unfamiliar targets and environments.
Poster
Anurag Ghosh · Shen Zheng · Robert Tamburo · Khiem Vuong · Juan Alvarez-Padilla · Hailiang Zhu · Nicholas Dunn · Michael Cardei · Christoph Mertz · Srinivasa Narasimhan
[ Exhibit Hall I ]
Abstract
Perceiving and navigating autonomously through work zones is a challenging and underexplored problem. Open datasets for developing algorithms for this long-tailed scenario are scarce. We propose the ROADWork dataset to learn to recognize, observe, analyze, and drive through work zones. State-of-the-art foundation models perform poorly when applied to work zones. Fine-tuning models on our dataset significantly improves perception and navigation in work zones. With ROADWork, we discover new work zone images with higher precision (+32.5%) at a much higher rate (12.8×) around the world. Open-vocabulary methods fail on work zones, whereas detectors fine-tuned on our data improve performance (+32.2 AP). Vision-Language Models (VLMs) struggle to describe work zones, but fine-tuning substantially improves performance (+36.7 SPICE). Beyond fine-tuning, we show the value of simple techniques: Video label propagation provides additional gains (+2.6 AP). While reading work zone signs, composing a work zone detector and text spotter through crop-scaling improves performance (+14.2% 1-NED). Composing work zone detections to provide context further reduces hallucinations (+3.9 SPICE) in VLMs. We compute drivable paths from work zone navigation videos and predict navigational goals and pathways. Incorporating road work semantics ensures 53.6% goals have angular error (AE) < 0.5 degrees (+9.9%) and 75.3% pathways have AE < 0.5 …
Poster
Rangel Daroya · Elijah Cole · Oisin Mac Aodha · Grant Horn · Subhransu Maji
[ Exhibit Hall I ]
Abstract
Species distributions encode valuable ecological and environmental information, yet their potential for guiding representation learning in remote sensing remains underexplored. We introduce WildSAT, which pairs satellite images with millions of geo-tagged wildlife observations readily available on citizen science platforms. WildSAT employs a contrastive learning approach that jointly leverages satellite images, species occurrence maps, and textual habitat descriptions to train or fine-tune models. This approach significantly improves performance on diverse satellite image recognition tasks, outperforming both ImageNet-pretrained models and satellite-specific baselines. Additionally, by aligning visual and textual information, WildSAT enables zero-shot retrieval, allowing users to search geographic locations based on textual descriptions. WildSAT surpasses recent cross-modal learning methods, including approaches that align satellite images with ground imagery or wildlife photos, demonstrating the advantages of our approach. Finally, we analyze the impact of key design choices and highlight the broad applicability of WildSAT to remote sensing and biodiversity monitoring.
Poster
Lily Goli · Sara Sabour · Mark Matthews · Marcus Brubaker · Dmitry Lagun · Alec Jacobson · David Fleet · Saurabh Saxena · Andrea Tagliasacchi
[ Exhibit Hall I ]
Abstract
There has been extensive progress in the reconstruction and generation of 4D scenes from monocular casually-captured video. Estimating accurate camera poses from videos through structure-from-motion (SfM) relies on robustly separating static and dynamic parts of a video. We propose a novel approach to video-based motion segmentation to identify the components of a scene that are moving w.r.t. a fixed world frame. Our simple but effective iterative method, RoMo, combines optical flow and epipolar cues with a pre-trained video segmentation model. It outperforms unsupervised baselines for motion segmentation as well as supervised baselines trained from synthetic data. More importantly, the combination of an off-the-shelf SfM pipeline with our segmentation masks establishes a new state-of-the-art on camera calibration for scenes with dynamic content, outperforming existing methods by a substantial margin.
Poster
Hao Zhou · Zhanning Gao · Zhili Chen · Maosheng Ye · Qifeng Chen · Tongyi Cao · Honggang Qi
[ Exhibit Hall I ]
Abstract
In light of the dynamic nature of autonomous driving environments and stringent safety requirements, general MLLMs combined with CLIP alone often struggle to accurately represent driving-specific scenarios, particularly in complex interactions and long-tail cases. To address this, we propose the Hints of Prompt (HoP) framework, which introduces three key enhancements: Affinity hint to emphasize instance-level structure by strengthening token-wise connections, Semantic hint to incorporate high-level information relevant to driving-specific cases, such as complex interactions among vehicles and traffic signs, and Question hint to align visual features with the query context, focusing on question-relevant regions. These hints are fused through a Hint Fusion module, enriching visual representations by capturing driving-related representations with limited domain data, ensuring faster adaptation to driving scenarios. Extensive experiments confirm the effectiveness of the HoP framework, showing that it significantly outperforms previous state-of-the-art methods in all key metrics.
Poster
Hoonhee Cho · Yuhwan Jeong · Kuk-Jin Yoon
[ Exhibit Hall I ]
Abstract
With advancements in sensor and display technologies, high-resolution imagery is becoming increasingly prevalent in diverse applications. As a result, optical flow estimation needs to adapt to larger image resolutions, where even moderate movements lead to substantial pixel displacements, making long-range motion estimation more critical than ever. However, existing datasets primarily focus on short-range flow in low-resolution settings, limiting the generalization of models to high-resolution scenarios with large displacements. Additionally, there is a lack of suitable datasets for evaluating model capacity in long-range motion estimation, further hindering progress in this area. To address this, we introduce RelayFlow-4K, a high-resolution 4K optical flow dataset designed to capture diverse motion patterns, including long-range intermediate frame flows. While such datasets provide valuable training resources, long-range estimation remains challenging due to increased matching ambiguity. Simply incorporating these datasets does not inherently improve performance. To this end, we propose a novel training framework that integrates matching cost distillation and incremental time-step learning to refine cost volume estimation and stabilize training. Additionally, we leverage the distance map, which measures the distance from unmatched regions to their nearest matched pixels, improving occlusion handling. Our approach significantly enhances long-range optical flow estimation in high-resolution settings. Our datasets and code will …
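The distance map mentioned in this abstract (distance from unmatched regions to their nearest matched pixels) can be sketched with a multi-source breadth-first search. The abstract does not say which metric is used, so the city-block (4-connected) distance here is an assumption, as are the function name `distance_map` and the boolean mask layout.

```python
from collections import deque


def distance_map(matched):
    """City-block distance from every pixel to its nearest matched pixel.

    matched: 2D list of bools, True where a pixel has a valid match.
    Returns a 2D list of ints (None everywhere if nothing is matched).
    """
    h, w = len(matched), len(matched[0])
    dist = [[None] * w for _ in range(h)]
    queue = deque()
    # Seed the search with every matched pixel at distance 0.
    for y in range(h):
        for x in range(w):
            if matched[y][x]:
                dist[y][x] = 0
                queue.append((y, x))
    # Expand outward one 4-connected step at a time.
    while queue:
        y, x = queue.popleft()
        for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w and dist[ny][nx] is None:
                dist[ny][nx] = dist[y][x] + 1
                queue.append((ny, nx))
    return dist


mask = [
    [True, False, False],
    [True, False, False],
    [True, True, False],
]
dmap = distance_map(mask)
```

Larger values in `dmap` mark pixels deep inside unmatched (likely occluded) regions.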
Poster
Ze Li · Feng Zhang · Xiatian Zhu · Zhang Meng · Yanghong Zhou · P.Y. Mok
[ Exhibit Hall I ]
Abstract
Synthesizing normal-light novel views from low-light multiview images remains a challenging yet practical task due to low visibility and high ISO noise. Existing low-light enhancement methods often struggle to preprocess these images effectively due to their inability to structurally correlate multiple views. While state-of-the-art approaches have advanced by manipulating illumination-related components during rendering, they often introduce color distortions and artifacts. Moreover, they rely solely on NeRF’s multi-view optimization, which offers limited denoising effectiveness. In this paper, we propose a novel Robust Low-light Scene Restoration framework, termed RoSe, which enables novel-view synthesis under normal lighting from low-light multiview images. Inspired by the 2D Retinex theory, we frame this task as an illuminance transition estimation problem in 3D space, further conceptualizing it as a specialized rendering task. This multiview-consistent illuminance transition field establishes a robust connection between low-light and normal-light conditions. By further exploiting the inherent low-rank property of illumination to constrain the transition representation, we achieve more effective denoising without complex 2D techniques or explicit noise modeling. To this end, we design a concise dual-branch architecture and propose a low-rank denoising module. Experiments demonstrate that RoSe significantly outperforms state-of-the-art models in both rendering quality and multiview consistency on standard …
Poster
Dongyoung Kim · Mahmoud Afifi · Dongyun Kim · Michael Brown · Seon Joo Kim
[ Exhibit Hall I ]
Abstract
Computational color constancy, or white balancing, is a key module in a camera’s image signal processor (ISP) that corrects color casts from scene lighting. Because this operation occurs in the camera-specific raw color space, white balance algorithms must adapt to different cameras. This paper introduces a learning-based method for cross-camera color constancy that generalizes to new cameras without retraining. Our method leverages pre-calibrated color correction matrices (CCMs) available on ISPs that map the camera’s raw color space to a standard space (e.g., CIE XYZ). Our method uses these CCMs to transform predefined illumination colors (i.e., along the Planckian locus) into the test camera's raw space. The mapped illuminants are encoded into a compact _camera fingerprint embedding_ (CFE) that enables the network to adapt to unseen cameras. To prevent overfitting due to limited cameras and CCMs during training, we introduce a data augmentation technique that interpolates between cameras and their CCMs. Experimental results across multiple datasets and backbones show that our method achieves state-of-the-art cross-camera color constancy while remaining lightweight and relying only on data readily available in camera ISPs.
Poster
Jie Feng · Shengyuan Wang · Tianhui Liu · Yanxin Xi · Yong Li
[ Exhibit Hall I ]
Abstract
Urban research involves a wide range of scenarios and tasks that require the understanding of multi-modal data, such as structured geospatial data, trajectory data, satellite image data, and street view image data. Current methods often focus on specific data types and lack a unified framework in the urban field for processing them comprehensively. The recent success of multi-modal large language models (MLLMs) presents a promising opportunity to overcome this limitation. In this paper, we introduce UrbanLLaVA, a multi-modal large language model designed to process these four types of data simultaneously and achieve strong performance across diverse urban tasks compared with general MLLMs. In UrbanLLaVA, we first curate a diverse urban instruction dataset encompassing both single-modal and cross-modal urban data, spanning from the location view to the global view of the urban environment. Additionally, we design an effective multi-stage training pipeline to ensure training stability and compatibility across various urban tasks. We also extend an existing benchmark for urban research to assess the performance of MLLMs across a wide range of urban tasks. Experimental results from three cities demonstrate that UrbanLLaVA outperforms open-source and commercial MLLMs in both single-modal tasks and complex cross-modal tasks and shows robust generalization abilities across cities. UrbanLLaVA sheds light …
Poster
Hao Zheng · Yuting Zheng · Hanbo Huang · Chaofan Sun · Enhui Liao · Lin Liu · Yi Han · Hao Zhou · Shiyu Liang
[ Exhibit Hall I ]
Abstract
Reconstructing atmospheric surface $\text{CO}_2$ is crucial for understanding climate dynamics and informing global mitigation strategies. Traditional inversion models achieve precise global $\text{CO}_2$ reconstruction but rely heavily on uncertain prior estimates of fluxes and emissions. Inspired by recent advances in data-driven weather forecasting, we explore whether data-driven models can reduce reliance on these priors. However, $\text{CO}_2$ reconstruction presents unique challenges, including complex spatio-temporal dynamics, periodic patterns and sparse observations. We propose $\text{CO}_2$-Net, a data-driven model that addresses these challenges without requiring extensive prior data. We formulate $\text{CO}_2$ reconstruction as solving a constrained advection-diffusion equation and derive three key components: physics-informed spatio-temporal factorization for capturing complex transport dynamics, wind-based embeddings for modeling periodic variations and a semi-supervised loss for integrating sparse $\text{CO}_2$ observations with dense meteorological data. $\text{CO}_2$-Net is designed in three sizes, small (S), base (B) and large (L), to balance performance and efficiency. On CMIP6 reanalysis data, $\text{CO}_2$-Net (S) and (L) reduce RMSE by 11% and 71%, respectively, when compared to the best data-driven baseline. On real observations, $\text{CO}_2$-Net (L) achieves RMSE comparable to inversion models. The ablation study shows that the effectiveness of wind-based embedding and semi-supervised loss stems from their compatibility with our spatio-temporal factorization.
Poster
Pattaramanee Arsomngern · Sasikarn Khwanmuang · Matthias Nießner · Supasorn Suwajanakorn
[ Exhibit Hall I ]
Abstract
One practical approach to infer 3D scene structure from a single image is to retrieve a closely matching 3D model from a database and align it with the object in the image. Existing retrieve-and-align methods rely on supervised training with images and pose annotations, which limits them to a narrow set of object categories. To address this, we propose an unsupervised 9-DoF alignment method for inexact 3D models that requires no pose annotations and generalizes to unseen categories. Our approach derives a novel feature space based on foundation features that ensures multi-view consistency and overcomes the symmetry ambiguities inherent in foundation features using a self-supervised triplet loss. Additionally, we introduce a texture-invariant pose refinement technique that performs dense alignment in normalized object coordinates, estimated through the enhanced feature space. We conduct extensive evaluations on the real-world ScanNet25k dataset, where our method outperforms SOTA unsupervised baselines by +4.3% mean alignment accuracy and is the only unsupervised approach to surpass the supervised ROCA by +2.7%. To assess generalization, we introduce SUN2CAD, a real-world test set with 20 novel object categories, where our method achieves SOTA results without prior training on them.
Poster
Yahao Liu · Qin Wang · Lixin Duan · Wen Li
[ Exhibit Hall I ]
Abstract
Regression is fundamental in computer vision and is widely used in various tasks including age estimation, depth estimation, target localization, etc. However, real-world data often exhibits an imbalanced distribution, making regression models perform poorly, especially for target values with rare observations (known as the imbalanced regression problem). In this paper, we reframe imbalanced regression as an imbalanced generalization problem. To tackle that, we look into the loss sharpness property for measuring the generalization ability of regression models in the observation space. Namely, given a certain perturbation on the model parameters, we check how model performance changes according to the loss values of different target observations. We propose a simple yet effective approach called Balanced Sharpness-Aware Minimization (BSAM) to enforce the uniform generalization ability of regression models for the entire observation space. In particular, we start from the traditional sharpness-aware minimization and then introduce a novel targeted reweighting strategy to homogenize the generalization ability across the observation space, which guarantees a theoretical generalization bound. Extensive experiments on multiple vision regression tasks, including age and depth estimation, demonstrate that our BSAM method consistently outperforms existing approaches. The code will be available soon.
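As background for the starting point named in this abstract, the sketch below shows one step of traditional sharpness-aware minimization (SAM) on a toy linear regression: perturb the weights along the normalized gradient, then descend using the gradient taken at the perturbed point. The targeted reweighting that BSAM adds is not described in the abstract and is not reproduced here; the toy data, `lr`, and `rho` are illustrative assumptions.

```python
import math


def loss_and_grad(w, data):
    """Mean squared error of y = w[0] * x + w[1], with its gradient."""
    n = len(data)
    loss, g0, g1 = 0.0, 0.0, 0.0
    for x, y in data:
        err = w[0] * x + w[1] - y
        loss += err * err / n
        g0 += 2 * err * x / n
        g1 += 2 * err / n
    return loss, [g0, g1]


def sam_step(w, data, lr=0.05, rho=0.05):
    """One plain SAM update (no BSAM reweighting)."""
    _, g = loss_and_grad(w, data)
    norm = math.sqrt(sum(gi * gi for gi in g)) or 1.0  # avoid division by zero
    # Ascent step: move to the (approximate) worst point within radius rho.
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    # Descent step: apply the gradient measured at the perturbed weights.
    _, g_adv = loss_and_grad(w_adv, data)
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]


data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]  # samples of y = 2x + 1
w = [0.0, 0.0]
initial_loss, _ = loss_and_grad(w, data)
for _ in range(200):
    w = sam_step(w, data)
final_loss, _ = loss_and_grad(w, data)
```

A balanced variant would, under the abstract's description, reweight how different target observations contribute so that rare targets are not neglected.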
Poster
Chen Liang · Zhicheng Shi · Wenguan Wang · Yi Yang
[ Exhibit Hall I ]
Abstract
Language-based human motion understanding focuses on describing human motions using natural language descriptions. Conversely, human motion generation aims to generate human motions from textual inputs. Despite significant progress in both fields, further advancements are hindered by two primary challenges: (i) Both tasks heavily rely on vast amounts of paired motion-language data for model training. However, human labeling is costly, making it increasingly unsustainable as model scales increase. (ii) Existing models often learn the two tasks in parallel. The strong reciprocity between them has not been fully explored. In response, this work proposes Dual Reciprocal Learning (DRL) for language-based human motion understanding and generation. DRL establishes a symmetric learning framework where both tasks collaboratively evolve in a closed-loop, bootstrapping manner, effectively leveraging the reciprocity between them. In DRL, the tasks serve as evaluators for each other, enabling the generation of informative feedback signals even with easily acquired unidirectional motion or language data. Furthermore, to mitigate dataset-specific bias in existing evaluations, we propose a generalized protocol that extends evaluation to a general-domain cross-modal feature space. Experimental results on standard benchmarks demonstrate that DRL achieves remarkable performance boosts over state-of-the-art models in both tasks. Our code will be made publicly available.
Poster
Junseong Shin · Seungwoo Chung · Yunjeong Yang · Tae Hyun Kim
[ Exhibit Hall I ]
Abstract
Dehazing involves removing haze or fog from images to restore clarity and improve visibility by estimating atmospheric scattering effects. While deep learning methods show promise, the lack of paired real-world training data and the resulting domain gap hinder generalization to real-world scenarios. In this context, physics-grounded learning becomes crucial; however, traditional methods based on the Atmospheric Scattering Model (ASM) often fall short in handling real-world complexities and diverse haze patterns. To solve this problem, we propose HazeFlow, a novel ODE-based framework that reformulates ASM as an ordinary differential equation (ODE). Inspired by Rectified Flow (RF), HazeFlow learns an optimal ODE trajectory to map hazy images to clean ones, enhancing real-world dehazing performance with only a single inference step. Additionally, we introduce a non-homogeneous haze generation method using Markov Chain Brownian Motion (MCBM) to address the scarcity of paired real-world data. By simulating realistic haze patterns through MCBM, we enhance the adaptability of HazeFlow to diverse real-world scenarios. Through extensive experiments, we demonstrate that HazeFlow achieves state-of-the-art performance across various real-world dehazing benchmark datasets.
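For readers unfamiliar with the Atmospheric Scattering Model that this abstract builds on, the classical per-pixel form is I = J·t + A·(1 − t), with transmission t = exp(−β·d). The sketch below applies and inverts that classical model for a single intensity value. It does not implement HazeFlow's ODE reformulation or MCBM; the values of `airlight`, `beta`, and `t_min` are illustrative assumptions.

```python
import math


def transmission(depth, beta):
    """Fraction of scene radiance that survives scattering over `depth`."""
    return math.exp(-beta * depth)


def add_haze(clean, depth, airlight=0.9, beta=1.0):
    """Classical ASM forward model: I = J * t + A * (1 - t)."""
    t = transmission(depth, beta)
    return clean * t + airlight * (1.0 - t)


def remove_haze(hazy, depth, airlight=0.9, beta=1.0, t_min=0.05):
    """Invert the ASM; t is clamped so distant pixels do not blow up."""
    t = max(transmission(depth, beta), t_min)
    return (hazy - airlight * (1.0 - t)) / t


clean_pixel = 0.3
hazy_pixel = add_haze(clean_pixel, depth=1.5)
recovered = remove_haze(hazy_pixel, depth=1.5)
```

With known depth and airlight the inversion is exact; in practice both are unknown, which is the gap learned dehazing methods aim to close.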
Poster
Yuan Wang · Yuxin Chen · Zhongang Qi · Lijun Liu · Jile Jiao · Xuetao Feng · Yujia Liang · Ying Shan · Zhipeng Zhang
[ Exhibit Hall I ]
Abstract
3D vision-language (3D-VL) reasoning, connecting natural language with the 3D physical world, represents a milestone in advancing spatial intelligence. While transformer-based methods dominate 3D-VL research, their quadratic complexity and simplistic positional embedding mechanisms severely limit effective modeling of long-range 3D-VL dependencies and spatial relationships in 3D-VL tasks. State Space Models (SSM) have emerged as promising linear-complexity alternatives for sequential data processing, while their inherent selection mechanism offers notable capability for spatial modeling. Despite its potential, straightforward adoption of Mamba to 3D-VL tasks encounters two obstacles: (1) how to perceive the position of 3D objects and understand complex spatial relationships, and (2) how to achieve thorough synergies of multi-modal features. In this paper, we propose Mamba-3VL, a pioneering 3D-VL framework to model complex intra- and inter-modality correlations and enhance spatial relation reasoning, while guaranteeing top-tier performance, high efficiency, and generalization potential for 3D-VL tasks. Specifically, Mamba Mixer explicitly models 3D-VL interaction via channel twisting and a relation-prioritized spatial scanning policy. It maximally retains the spatial relations of object-centric features. To further provide precise spatial encoding for Mamba, we develop the Instance-aware Dynamic Position Adapter (IDPA) to dynamically adjust instance-specific positional embeddings and enhance the local spatial relations of 3D objects. Extensive results validate Mamba-3VL trumps other competitors …
Poster
Hai Wu · Hongwei Lin · Xusheng Guo · Xin Li · Mingming Wang · Cheng Wang · Chenglu Wen
[ Exhibit Hall I ]
Abstract
The performance of unsupervised 3D object classification and bounding box regression relies heavily on the quality of initial pseudo-labels. Traditionally, the labels of classification and regression are represented by a single set of candidate boxes generated by motion or geometry heuristics. However, due to the similarity of many objects to the background in shape or lack of motion, the labels often fail to achieve high accuracy in two tasks simultaneously. Using these labels to directly train the network results in decreased detection performance. To address this challenge, we introduce Motal that performs unsupervised 3D object detection by Modality and task-specific knowledge transfer. Motal decouples the pseudo-labels into two sets of candidates, from which Motal discovers classification knowledge by motion and image appearance prior, and discovers box regression knowledge by geometry prior, respectively. Motal finally transfers all knowledge to a single student network by a TMT (Task-specific Masked Training) scheme, attaining high performance in both classification and regression. Motal can greatly enhance various unsupervised methods by about 2x mAP. For example, on the WOD test set, Motal improves the state-of-the-art CPD by 21.56% mAP L1 (from 20.54% to 42.10%) and 19.90% mAP L2 (from 18.18% to 38.08%). These achievements highlight the …
Poster
Rui Wang · Quentin Lohmeyer · Mirko Meboldt · Siyu Tang
[ Exhibit Hall I ]
Abstract
Reconstructing clean, distractor-free 3D scenes from real-world captures remains a significant challenge, particularly in highly dynamic and cluttered settings such as egocentric videos. To tackle this problem, we introduce DeGauss, a simple and robust self-supervised framework for dynamic scene reconstruction based on a decoupled dynamic-static Gaussian Splatting design. DeGauss models dynamic elements with foreground Gaussians and static content with background Gaussians, using a probabilistic mask to coordinate their composition and enable independent yet complementary optimization. DeGauss generalizes robustly across a wide range of real-world scenarios, from casual image collections to long, dynamic egocentric videos, without relying on complex heuristics or extensive supervision. Experiments on benchmarks including NeRF-on-the-go, ADT, AEA, Hot3D, and EPIC-Fields demonstrate that DeGauss consistently outperforms existing methods, establishing a strong baseline for generalizable, distractor-free 3D reconstruction in highly dynamic, interaction-rich environments.
Poster
Jiahao Zhang · Anoop Cherian · Cristian Rodriguez-Opazo · Weijian Deng · Stephen Gould
[ Exhibit Hall I ]
Abstract
Assembling furniture amounts to solving the discrete-continuous optimization task of selecting the furniture parts to assemble and estimating their connecting poses in a physically realistic manner. The problem is hampered by its combinatorially large yet sparse solution space, thus making learning to assemble a challenging task for current machine learning models. In this paper, we attempt to solve this task by leveraging the assembly instructions provided in diagrammatic manuals that typically accompany the furniture parts. Our key insight is to use the cues in these diagrams to split the problem into discrete and continuous phases. Specifically, we present Manual-PA, a transformer-based instruction Manual-guided 3D Part Assembly framework that learns to semantically align 3D parts with their illustrations in the manuals using a contrastive learning backbone towards predicting the assembly order and infers the 6D pose of each part via relating it to the final furniture depicted in the manual. To validate the efficacy of our method, we conduct experiments on the benchmark PartNet dataset. Our results show that using the diagrams and the order of the parts leads to significant improvements in assembly performance against the state of the art. Further, Manual-PA demonstrates strong generalization to real-world IKEA furniture assembly …
Poster
Pingrui Zhang · Xianqiang Gao · Yuhan Wu · Kehui Liu · Dong Wang · Zhigang Wang · Bin Zhao · Yan Ding · Xuelong Li
[ Exhibit Hall I ]
Abstract
In mobile manipulation, navigation and manipulation are often treated as separate problems, resulting in a significant gap between merely approaching an object and engaging with it effectively. Many navigation approaches primarily define success by proximity to the target, often overlooking the necessity for optimal positioning that facilitates subsequent manipulation. To address this, we introduce \ours, a benchmark dataset comprising over 100k samples that provide training data for models to learn optimal final navigation positions for seamless transition to manipulation. Our dataset includes affordance-grounded floor labels collected from diverse kitchen environments, in which robotic mobile manipulators of different models attempt to grasp target objects amidst clutter. Using a fully automated pipeline, we simulate diverse real-world scenarios and generate affordance labels for optimal manipulation positions. Visual data are collected from RGB-D inputs captured by a first-person view camera mounted on the robotic arm, ensuring consistency in viewpoint during data collection. We also develop a lightweight baseline model, \ourmodel, for navigation affordance grounding that demonstrates promising performance on the \ours benchmark. Our approach enables models to learn affordance-based final positioning that accommodates different arm types and platform heights, thereby paving the way for more robust and generalizable integration of navigation and manipulation in …
Poster
Peiran Xu · Xicheng Gong · Yadong Mu
[ Exhibit Hall I ]
Abstract
In this work, we concentrate on the task of goal-oriented Vision-and-Language Navigation (VLN). Existing methods often make decisions based on historical information, overlooking the future implications and long-term outcomes of the actions. In contrast, we aim to develop a foresighted agent. Specifically, we draw upon Q-learning to train a Q-model using large-scale unlabeled trajectory data, in order to learn the general knowledge regarding the layout and object relations within indoor scenes. This model can generate a Q-feature, analogous to the Q-value in a traditional Q-network, for each candidate action, which describes the potential future information that may be observed after taking the specific action. Subsequently, a cross-modal future encoder integrates the task-agnostic Q-feature with navigation instructions to produce a set of action scores reflecting future prospects. These scores, when combined with the original scores based on history, facilitate an A*-style searching strategy to effectively explore the regions that are more likely to lead to the destination. Extensive experiments conducted on widely used goal-oriented VLN datasets validate the effectiveness of the proposed method.
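As background for the Q-value analogy above, here is a minimal sketch of the textbook tabular Q-learning update, Q(s, a) ← Q(s, a) + α·(r + γ·max Q(s', a') − Q(s, a)). It is not the paper's Q-model, which produces feature vectors rather than scalar values; the state and action names, the dictionary-based table, and the hyperparameter values are hypothetical.

```python
def q_update(q, state, action, reward, next_state, actions, alpha=0.5, gamma=0.9):
    # Best value reachable from the next state (unseen pairs default to 0.0).
    best_next = max(q.get((next_state, a), 0.0) for a in actions)
    old = q.get((state, action), 0.0)
    # Move the old estimate toward the bootstrapped target r + gamma * best_next.
    q[(state, action)] = old + alpha * (reward + gamma * best_next - old)
    return q[(state, action)]

# Hypothetical two-action navigation toy example.
actions = ["left", "right"]
q_table = {}
first = q_update(q_table, "s0", "right", 1.0, "s1", actions)
q_table[("s1", "left")] = 2.0
second = q_update(q_table, "s0", "right", 0.0, "s1", actions)
```

The scalar Q-value summarizes expected future reward for an action; the Q-feature described in the abstract plays a similar forward-looking role but encodes what may be observed after the action.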
Poster
Yue Fan · Xiaojian Ma · Rongpeng Su · Jun Guo · Rujie Wu · Xi Chen · Qing Li
[ Exhibit Hall I ]
Abstract
This paper investigates the problem of understanding dynamic 3D scenes from egocentric observations, a key challenge in robotics and embodied AI. Unlike prior studies that explored this as long-form video understanding and utilized egocentric video only, we instead propose an LLM-based agent, Embodied VideoAgent, which constructs scene memory from both egocentric video and embodied sensory inputs (e.g. depth and pose sensing). We further introduce a VLM-based approach to automatically update the memory when actions or activities over objects are perceived. Embodied VideoAgent attains significant advantages over counterparts in challenging reasoning and planning tasks in 3D scenes, achieving gains of 6.5% on Ego4D-VQ3D, 2.6% on OpenEQA, and 15.3% on EnvQA. We have also demonstrated its potential in various embodied AI tasks including generating embodied interactions and perception for robot manipulation. The code and demo will be made public.
Poster
Shr-Ruei Tsai · Wei-Cheng Chang · Jie-Ying Lee · Chih-Hai Su · Yu-Lun Liu
[ Exhibit Hall I ]
Abstract
Lens flare significantly degrades image quality, impacting critical computer vision tasks like object detection and autonomous driving. Recent Single Image Flare Removal (SIFR) methods perform poorly when off-frame light sources are incomplete or absent. We propose LightsOut, a diffusion-based outpainting framework tailored to enhance SIFR by reconstructing off-frame light sources. Our method leverages a multitask regression module and LoRA fine-tuned diffusion model to ensure realistic and physically consistent outpainting results. Comprehensive experiments demonstrate LightsOut consistently boosts the performance of existing SIFR methods across challenging scenarios without additional retraining, serving as a universally applicable plug-and-play preprocessing solution.
Poster
Tianyi Zhao · Boyang Liu · Yanglei Gao · Yiming Sun · Maoxun Yuan · Xingxing Wei
[ Exhibit Hall I ]
Abstract
Multi-Modal Object Detection (MMOD), due to its stronger adaptability to various complex environments, has been widely applied in various applications. Extensive research is dedicated to RGB-IR object detection, primarily focusing on how to integrate complementary features from RGB-IR modalities. However, these works neglect the problem of insufficient mono-modality learning, i.e., the decreased feature extraction capability of each modality under multi-modal joint learning. This leads to an unreasonable but prevalent phenomenon, Fusion Degradation, which hinders the performance improvement of the MMOD model. Motivated by this, in this paper, we introduce linear probing evaluation to the multi-modal detectors and rethink the multi-modal object detection task from the mono-modality learning perspective. Therefore, we construct a novel framework called M$^2$D-LIF, which consists of the Mono-Modality Distillation (M$^2$D) method and the Local Illumination-aware Fusion (LIF) module. The M$^2$D-LIF framework facilitates the sufficient learning of mono-modality during multi-modal joint training and explores a lightweight yet effective feature fusion manner to achieve superior object detection performance. Extensive experiments conducted on three MMOD datasets demonstrate that our M$^2$D-LIF effectively mitigates the Fusion Degradation phenomenon and outperforms the previous SOTA detectors.
Poster
Phillip Mueller · Talip Ünlü · Sebastian Schmidt · Marcel Kollovieh · Jiajie Fan · Stephan Günnemann · Lars Mikelsons
[ Exhibit Hall I ]
Abstract
Precise geometric control in image generation is essential for fields like engineering and product design and creative industries to control 3D object features accurately in 2D image space. Traditional 3D editing approaches are time-consuming and demand specialized skills, while current image-based generative methods lack accuracy in geometric conditioning. To address these challenges, we propose GeoDiffusion, a training-free framework for accurate and efficient geometric conditioning of 3D features in image generation. GeoDiffusion employs a class-specific 3D object as a geometric prior to define keypoints and parametric correlations in 3D space. We ensure viewpoint consistency through a rendered image of a reference 3D object, followed by style transfer to meet user-defined appearance specifications. At the core of our framework is GeoDrag, improving accuracy and speed of drag-based image editing on geometry guidance tasks and general instructions on DragBench. Our results demonstrate that GeoDiffusion enables precise geometric modifications across various iterative design workflows.
Poster
Heng Su · Mengying Xie · Nieqing Cao · Yan Ding · Beichen Shao · Xianlei Long · Fuqiang Gu · Chao Chen
[ Exhibit Hall I ]
Abstract
In recent years, affordance detection has become essential for robotic manipulation in real-world scenes, where robots must autonomously interpret commands and perform actions. Current methods often focus on individual point cloud objects or simple semantic queries, limiting their effectiveness in diverse scenes and complex instructions. To address this, we introduce OVA-Fields, a framework for affordance detection in 3D scenes with complex semantics. By integrating multilevel geometric encoding and enhanced semantic affordance embeddings, OVA-Fields maps user commands directly to operational parts, embedding enriched affordance information into the 3D scene. Experimental results demonstrate that OVA-Fields achieves 52.4% mIoU on complex semantic real-world scenes and a 90% success rate in real-world robot manipulation tasks (e.g., "take out some food from the refrigerator") using RGB-D sensing. Our approach enables the precise identification of operational parts, transforming natural language queries into targeted manipulations in real-world environments.
Poster
Jianhua Sun · Yuxuan Li · Jiude Wei · Xu Longfei · Wang Nange · Yining Zhang · Cewu Lu
[ Exhibit Hall I ]
Abstract
The acquisition of substantial volumes of 3D articulated object data is expensive and time-consuming, and consequently the scarcity of 3D articulated object data becomes an obstacle for deep learning methods to achieve remarkable performance in various articulated object understanding tasks. Meanwhile, pairing these object data with detailed annotations to enable training for various tasks is also difficult and labor-intensive to achieve. In order to expeditiously gather a significant number of 3D articulated objects with comprehensive and detailed annotations for training, we propose Articulated Object Procedural Generation toolbox, a.k.a. Arti-PG toolbox. Arti-PG toolbox consists of i) descriptions of articulated objects by means of a generalized structure program along with their analytic correspondence to the objects’ point cloud, ii) procedural rules about manipulations on the structure program to synthesize large-scale and diverse new articulated objects, and iii) mathematical descriptions of knowledge (e.g. affordance, semantics, etc.) to provide annotations to the synthesized object. Arti-PG has two appealing properties for providing training data for articulated object understanding tasks: i) objects are created with unlimited variations in shape through program-oriented structure manipulation, ii) Arti-PG is widely applicable to diverse tasks by easily providing comprehensive and detailed annotations. Arti-PG now supports the procedural generation of 26 …
Poster
Xiaoding Yuan · Prakhar Kaushik · Guofeng Zhang · Artur Jesslen · Adam Kortylewski · Alan Yuille
[ Exhibit Hall I ]
Abstract
Deep learning algorithms for object classification and 3D object pose estimation lack robustness to out-of-distribution factors such as synthetic stimuli, changes in weather conditions, and partial occlusion. Recently, a class of Neural Mesh Models has been developed where objects are represented in terms of 3D meshes with learned features at the vertices. These models have shown robustness in small-scale settings, involving 10 objects, but it is unclear whether they can be scaled up to 100s of object classes. The main problem is that their training involves contrastive learning among the vertices of all object classes, which scales quadratically with the number of classes. We present a strategy which exploits the compositionality of the objects, i.e. the independence of the feature vectors of the vertices, which greatly reduces the training time while also improving the performance of the algorithms. We first restructure the per-vertex contrastive learning into contrasting within class and between classes. Then we propose a process that dynamically decouples the contrast between classes which are rarely confused, and enhances the contrast between the vertices of classes that are most confused. Our large-scale 3D compositional model not only achieves state-of-the-art performance on the task of predicting classification and pose estimation …
Poster
Yufeng Zhong · Chengjian Feng · Feng yan · Fanfan Liu · Liming Zheng · Lin Ma
[ Exhibit Hall I ]
Abstract
In language-guided visual navigation, agents locate target objects in unseen environments using natural language instructions. For reliable navigation in unfamiliar scenes, agents must possess strong perception, planning, and prediction capabilities. Additionally, when agents revisit previously explored areas during long-term navigation, they may retain irrelevant and redundant historical perceptions, leading to suboptimal results. In this work, we introduce P3Nav, a unified framework that integrates Perception, Planning, and Prediction capabilities through Multitask Collaboration on navigation and embodied question answering (EQA) tasks, thereby enhancing navigation performance. Furthermore, P3Nav employs an Adaptive 3D-aware History Sampling strategy to effectively and efficiently utilize historical observations. By leveraging large language models (LLMs), P3Nav comprehends diverse commands and complex visual scenes, resulting in appropriate navigation actions. P3Nav achieves a 75% success rate in object goal navigation on the $\mathrm{CHORES}$-$\mathbb{S}$ benchmark, setting a new state-of-the-art performance.
Poster
Chen-Liang Fan · Mingpei Cao · Chih-Chien Hung · Yuesheng Zhu
[ Exhibit Hall I ]
Abstract
Autofocus (AF) is essential for imaging systems, particularly in industrial applications such as automated optical inspection (AOI), where achieving precise focus is critical. Conventional AF methods rely on peak-searching algorithms that require dense focal sampling, making them inefficient in small depth-of-field (DoF) scenarios. Deep learning (DL)-based AF methods, while effective in general imaging, have a limited working range in small DoF conditions due to defocus uncertainty. In this work, we propose a novel AF framework that integrates an optical model-based sharpness indicator with a deep learning approach to predict sharpness from defocused images. We leverage sharpness estimation as a reliable focus measure and apply an adaptive adjustment algorithm to adjust the focus position based on the sharpness-to-distance mapping. This method effectively addresses defocus uncertainty and enables robust autofocus across a 35× DoF range. Experimental results on an AOI system demonstrate that our approach achieves reliable autofocus even from highly defocused starting points and remains robust across different textures and illumination conditions. Compared to conventional and existing DL-based approaches, our method offers improved precision, efficiency, and adaptability, making it suitable for industrial applications and small DoF scenarios.
Poster
Samuel Clarke · Suzannah Wistreich · Yanjie Ze · Jiajun Wu
[ Exhibit Hall I ]
Abstract
Understanding objects through multiple sensory modalities is fundamental to human perception, enabling cross-sensory integration and richer comprehension. For AI and robotic systems to replicate this ability, access to diverse, high-quality multi-sensory data is critical. Existing datasets are often limited by their focus on controlled environments, simulated objects, or restricted modality pairings. We introduce X-Capture, an open-source, portable, and cost-effective device for real-world multi-sensory data collection, capable of capturing correlated RGBD images, tactile readings, and impact audio. With a build cost under $1,000, X-Capture democratizes the creation of multi-sensory datasets, requiring only consumer-grade tools for assembly. Using X-Capture, we curate a sample dataset from 3,000 total points on 500 everyday objects from diverse, real-world environments, offering both richness and breadth. Our experiments demonstrate the value of both the quantity and the sensory breadth of our data for both pretraining and fine-tuning multi-modal representations for object-centric tasks such as cross-sensory retrieval and reconstruction. X-Capture lays the groundwork for advancing human-like sensory representations in AI, emphasizing scalability, accessibility, and real-world applicability.
Poster
Chen Lin · Weizhi Du · Zhixiang Min · Baochen She · Enrique Dunn · Sonya Hanson
[ Exhibit Hall I ]
Abstract
We explore a quaternion adjugate matrix-based representation for rotational motion in the Perspective-n-Point (PnP) problem. Leveraging quadratic quaternion terms within a Determinant Ratio Matrix (DRaM) estimation framework, we extend its application to perspective scenarios, providing a robust and efficient initialization for iterative PnP pose estimation. Notably, by solving the orthographic projection least-squares problem, DRaM provides a reliable initialization that enhances the accuracy and stability of iterative PnP solvers. Experiments on synthetic and real data demonstrate its efficiency, accuracy, and robustness, particularly under high noise conditions. Furthermore, our non-minimal formulation ensures numerical stability, making it effective for real-world applications.
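To make the "quadratic quaternion terms" above concrete, the sketch below converts a unit quaternion (w, x, y, z) into a 3×3 rotation matrix using the standard formula, in which every matrix entry is a quadratic polynomial in the quaternion components. It illustrates only this well-known identity, not the DRaM estimation framework itself; the function name and the normalization step are assumptions for illustration.

```python
import math

def quat_to_rot(w, x, y, z):
    # Normalize so the result is a proper rotation even for non-unit input.
    n = math.sqrt(w * w + x * x + y * y + z * z)
    w, x, y, z = w / n, x / n, y / n, z / n
    # Each entry is quadratic in (w, x, y, z).
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]

# A 90-degree rotation about the z-axis: q = (cos 45°, 0, 0, sin 45°).
half = math.sqrt(0.5)
rot_z90 = quat_to_rot(half, 0.0, 0.0, half)
```

Because the ten distinct products of quaternion components determine the rotation, a solver can estimate those quadratic terms first and recover the quaternion afterwards, which is the general idea behind representations built on such terms.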
Poster
Minchao Jiang · Shunyu Jia · Jiaming Gu · Xiaoyuan Lu · Guangming Zhu · Anqi Dong · zhang liang
[ Exhibit Hall I ]
Abstract
3D Gaussian Splatting (3DGS) has become a workhorse for high-quality, real-time rendering in novel view synthesis of 3D scenes. However, existing methods focus primarily on geometric and appearance modeling, lacking deeper scene understanding while also incurring high training costs that complicate the originally streamlined differentiable rendering pipeline. To this end, we propose VoteSplat, a novel 3D scene understanding framework that integrates Hough voting with 3DGS. Specifically, Segment Anything Model (SAM) is utilized for instance segmentation, extracting objects, and generating 2D vote maps. We then embed spatial offset vectors into Gaussian primitives. These offsets construct 3D spatial votes by associating them with 2D image votes, while depth distortion constraints refine localization along the depth axis. For open-vocabulary object localization, VoteSplat maps 2D image semantics to 3D point clouds via voting points, reducing training costs associated with high-dimensional CLIP features while preserving semantic unambiguity. Extensive experiments demonstrate VoteSplat’s effectiveness in open-vocabulary 3D instance localization, 3D point cloud understanding, click-based 3D object localization, hierarchical segmentation, and ablation studies.
Poster
Siyuan Yao · Rui Zhu · Ziqi Wang · Wenqi Ren · Yanyang Yan · Xiaochun Cao
[ Exhibit Hall I ]
Abstract
Visual object tracking has made promising progress in past decades. Most of the existing approaches focus on learning target representation in well-conditioned daytime data, while for unconstrained real-world scenarios with adverse weather conditions, e.g. nighttime or foggy environments, the tremendous domain shift leads to significant performance degradation. In this paper, we propose UMDATrack, which is capable of maintaining high-quality target state prediction under various adverse weather conditions within a unified domain adaptation framework. Specifically, we first use a controllable scenario generator to synthesize a small number of unlabeled videos (less than 2% of the frames in source daytime datasets) in multiple weather conditions under the guidance of different text prompts. Afterwards, we design a simple yet effective domain-customized adapter (DCA), allowing the target objects' representation to rapidly adapt to various weather conditions without redundant model updating. Furthermore, to enhance the localization consistency between source and target domains, we propose a target-aware confidence alignment module (TCA) based on optimal transport theory. Extensive experiments demonstrate that UMDATrack can surpass existing advanced visual trackers and achieve new state-of-the-art performance by a significant margin.
Poster
Pengfei Ren · Jingyu Wang · Haifeng Sun · Qi Qi · Xingyu Liu · Menghao Zhang · Lei Zhang · Jing Wang · Jianxin Liao
[ Exhibit Hall I ]
Abstract
3D hand pose estimation plays a critical role in various human-computer interaction tasks. Single-frame 3D hand pose estimation methods have poor temporal smoothness and are easily affected by self-occlusion, which severely impacts their practical applicability. Traditional joint-based sequential pose estimation methods primarily focus on the human body and struggle to handle the complex hand structure, high degrees of freedom in hand motion, and rapidly changing hand motion trends. To address these challenges, we propose a prior-aware dynamic temporal modeling framework for sequential 3D hand pose estimation. We introduce a flexible memory mechanism to model hand prior information, which alleviates the scale and depth ambiguity in single-frame hand pose estimation. Additionally, we propose a dynamic temporal convolution module that adjusts the receptive field size and feature aggregation weights based on the motion information at each moment, effectively capturing rapid motion trends. By decoupling dynamic temporal modeling at the joint and hand levels, our method captures both subtle short-term variations and long-term motion trends, significantly improving the smoothness and accuracy of hand pose estimation. Experiments on four public datasets demonstrate that our method achieves state-of-the-art results in terms of hand pose estimation accuracy and temporal smoothness.
Poster
Chen Gao · Shuo Zhang · Youfang Lin
[ Exhibit Hall I ]
Abstract
Disparity estimation is an essential step in processing and analyzing Light Field (LF) images. Recent methods construct a cost volume to exploit the correspondence of the LFs over a preset maximum disparity, which limits their ability to process large-parallax scenes. Instead of constructing a cost volume, the self-attention mechanism calculates the parallax attention between epipolar lines to find the matching points. However, for LFs that have different views, the related disparity scales are different in parallax attention since the baselines with the central view are different. Moreover, if the matching information is occluded in one view, the disparity information can be explored through other views. Therefore, mapping these attentions to the same scale and selecting effective matching information are key points for disparity estimation from parallax attention. In this paper, we explore parallax attention for LF and design an unsupervised method, named Epipolar Consistent Attention Aggregation Network (ECAAN). We first introduce an epipolar consistent scale unification block by considering the consistency relationships to standardize disparity scales of the parallax attention maps. Based on the intra-properties and inter-relationships of parallax attention, we further propose a consistent occlusion-free aggregation block to integrate the information from the occlusion-free areas. Besides, we design an improved …
Poster
Tianma Shen · Aditya Shrish Puranik · James Vong · Vrushabh Deogirikar · Ryan Fell · Julianna Dietrich · Maria Kyrarini · Christopher Kitts · David Jeong
[ Exhibit Hall I ]
Abstract
Egocentric human body estimation allows for the inference of user body pose and shape from a wearable camera's first-person perspective. Although research has used pose estimation techniques to overcome self-occlusions and image distortions caused by head-mounted fisheye images, similar advances in 3D human mesh recovery (HMR) techniques have been limited. We introduce Fish2Mesh, a fisheye-aware transformer-based model designed for 3D egocentric human mesh recovery. We propose an egocentric position embedding block to generate an ego-specific position table for the Swin Transformer to reduce fisheye image distortion. Our model utilizes multi-task heads for SMPL parametric regression and camera translations, estimating 3D and 2D joints as auxiliary loss to support model training. To address the scarcity of egocentric camera data, we create a training dataset by employing the pre-trained 4D-Human model and third-person cameras for weak supervision. Our experiments demonstrate that Fish2Mesh outperforms previous state-of-the-art 3D HMR models.
Poster
Jinhyung Park · Javier Romero · Shunsuke Saito · Fabian Prada · Takaaki Shiratori · Yichen Xu · Federica Bogo · Shoou-I Yu · Kris Kitani · Rawal Khirodkar
[ Exhibit Hall I ]
Abstract
Parametric body models offer expressive 3D representations of humans across a wide range of poses, shapes, and facial expressions, typically derived by learning a basis over registered 3D meshes. However, existing human mesh modeling approaches struggle to capture detailed variations across diverse body poses and shapes, largely due to limited training data diversity and restrictive modeling assumptions. Moreover, the common paradigm first optimizes the external body surface using a linear basis, then regresses internal skeletal joints from surface vertices. This approach introduces problematic dependencies between internal skeleton and outer soft tissue, limiting direct control over body height and bone lengths. To address these issues, we present ATLAS, a high-fidelity body model learned from 600k high-resolution scans captured using 240 synchronized cameras. Unlike previous methods, we explicitly decouple the shape and skeleton bases by grounding our mesh representation in the human skeleton. This decoupling enables enhanced shape expressivity, fine-grained customization of body attributes, and keypoint fitting independent of external soft-tissue characteristics. ATLAS outperforms existing methods by fitting unseen subjects in diverse poses more accurately, and quantitative evaluations show that our non-linear pose correctives more effectively capture complex poses compared to linear models. The code and model will be made publicly available.
Poster
Bowen Chen · Yun Sing Koh · Gillian Dobbie
[ Exhibit Hall I ]
Abstract
Traditional image segmentation methods struggle with fine-grained pattern extraction, especially in an unsupervised setting without labeled data. Shallow and deep learning approaches either lack structural coherence or focus on object-level segmentation rather than internal textures. Additionally, existing methods often fail to generalize across diverse animal species due to variations in pattern complexity and lighting. We introduce GloPER, an unsupervised segmentation framework that extracts fine-grained animal patterns without labeled supervision. By enforcing local image reconstruction with only two colors per region, GloPER captures structured patterns while mitigating the effects of shadows and lighting inconsistencies. Given the lack of fine-detailed labeled data, we construct a dataset of 10 animal species, each with at least 100 well-labeled images, enabling direct segmentation assessment. Experimental results show that GloPER outperforms both shallow and deep segmentation baselines, with a 42.44% higher DICE score on average across all 10 animal species. We also assess its effectiveness through animal re-identification (ReID), where GloPER’s extracted binary patterns achieve superior accuracy, in some cases exceeding full-image ReID performance, underscoring the discriminative power of structured segmentation.
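The DICE score cited above is the standard overlap measure 2·|A ∩ B| / (|A| + |B|) between a predicted and a ground-truth binary mask. A minimal sketch follows, assuming masks are given as flat lists of 0/1 values; the function name and the convention of returning 1.0 when both masks are empty are illustrative choices, not taken from the paper.

```python
def dice_score(pred, target):
    # pred and target are equal-length sequences of 0/1 labels (flattened masks).
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    if total == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    return 2.0 * intersection / total

# Toy 4-pixel masks: one overlapping foreground pixel out of two in each mask.
example = dice_score([1, 1, 0, 0], [1, 0, 1, 0])
```

A score of 1.0 means the masks coincide exactly and 0.0 means they share no foreground pixels.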
Poster
Yuqian Fu · Runze Wang · Bin Ren · Guolei Sun · Biao Gong · Yanwei Fu · Danda Pani Paudel · Xuanjing Huang · Luc Gool
[ Exhibit Hall I ]
Abstract
Bridging the gap between ego-centric and exo-centric views has been a long-standing question in computer vision. In this paper, we focus on the emerging Ego-Exo object correspondence task, which aims to understand object relations across ego-exo perspectives through segmentation. While numerous segmentation models have been proposed, most operate on a single image (view), making them impractical for cross-view scenarios. PSALM, a recently proposed segmentation method, stands out as a notable exception with its demonstrated zero-shot ability on this task. However, due to the drastic viewpoint change between ego and exo, PSALM fails to accurately locate and segment objects, especially in complex backgrounds or when object appearances change significantly. To address these issues, we propose ObjectRelator, a novel approach featuring two key modules: Multimodal Condition Fusion (MCFuse) and SSL-based Cross-View Object Alignment (XObjAlign). MCFuse introduces language as an additional cue, integrating both visual masks and textual descriptions to improve object localization and prevent incorrect associations. XObjAlign enforces cross-view consistency through self-supervised alignment, enhancing robustness to object appearance variations. Extensive experiments demonstrate ObjectRelator’s effectiveness on the large-scale Ego-Exo4D benchmark and HANDAL-X (an adapted dataset for cross-view segmentation) with state-of-the-art performance. Codes and models will be released.
Poster
Mohammadreza Salehi · Shashanka Venkataramanan · Ioana Simion · Stratis Gavves · Cees Snoek · Yuki Asano
[ Exhibit Hall I ]
Abstract
Dense self-supervised learning has shown great promise for learning pixel- and patch-level representations, but extending it to videos remains challenging due to the complexity of motion dynamics. Existing approaches struggle as they rely on static augmentations that fail under object deformations, occlusions, and camera movement, leading to inconsistent feature learning over time. We propose a motion-guided self-supervised learning framework that clusters dense point tracks to learn spatiotemporally consistent representations. By leveraging an off-the-shelf point tracker, we extract long-range motion trajectories and optimize feature clustering through a momentum-encoder-based optimal transport mechanism. To ensure temporal coherence, we propagate cluster assignments along tracked points, enforcing feature consistency across views despite viewpoint changes. Integrating motion as an implicit supervisory signal, our method learns representations that generalize across frames, improving robustness in dynamic scenes and challenging occlusion scenarios. By initializing from strong image-pretrained models and leveraging video data for training, we improve the state of the art by 1% to 6% on six image and video datasets and four evaluation benchmarks.
Poster
Spyros Kondylatos · Nikolaos Ioannis Bountos · Dimitrios Michail · Xiao Xiang Zhu · Gustau Camps-Valls · Ioannis Papoutsis
[ Exhibit Hall I ]
Abstract
Recent advances in Computer Vision have introduced the concept of pretrained representation uncertainty, enabling zero-shot uncertainty estimation. This holds significant potential for Earth Observation (EO), where trustworthiness is critical, yet the complexity of EO data poses challenges to uncertainty-aware methods. In this work, we investigate the generalization of representation uncertainty in EO, considering the domain's unique semantic characteristics. We pretrain uncertainties on large EO datasets and propose an evaluation framework to assess their zero-shot performance in multi-label classification and segmentation EO tasks. Our findings reveal that, unlike uncertainties pretrained on natural images, EO-pretraining exhibits strong generalization across unseen EO domains, geographic locations, and target granularities, while maintaining sensitivity to variations in ground sampling distance. We demonstrate the practical utility of pretrained uncertainties showcasing their alignment with task-specific uncertainties in downstream tasks, their sensitivity to real-world EO image noise, and their ability to generate spatial uncertainty estimates out-of-the-box. In this study, we explore representation uncertainty in EO, highlighting its strengths and limitations, laying the groundwork for future research in the field. Code and model checkpoints will be publicly released.
Poster
Yusuke Yoshiyasu · Leyuan Sun · Ryusuke Sagawa
[ Exhibit Hall I ]
Abstract
In this paper, we introduce MeshMamba, a neural network model for learning 3D articulated mesh models by employing the recently proposed Mamba State Space Models (SSMs). MeshMamba is efficient and scalable in handling a large number of input tokens, enabling the generation and reconstruction of body mesh models with approximately 10,000 vertices. The key to effectively learning MeshMamba is the serialization of mesh vertices into orderings that are easily processed by Mamba. This is achieved by sorting the vertices based on the body part annotations or the 3D vertex locations of a template mesh, such that the ordering respects the structure of articulated shapes. Based on MeshMamba we design 1) MambaDiff3D, a denoising diffusion model for generating 3D articulated meshes, and 2) Mamba-HMR, a 3D human mesh recovery model which reconstructs human body shape and pose from a single image. Experimental results show that MambaDiff3D can generate dense 3D human meshes in clothes, with grasping hands, etc., and outperforms previous approaches in the 3D human shape generation task. Also, Mamba-HMR extends the ability of previous non-parametric human mesh recovery approaches, which were limited in handling body-only poses using around 500 vertex tokens, to the whole-body setting with face …
Poster
Mingxuan Wu · Huang Huang · Justin Kerr · Chung Min Kim · Anthony Zhang · Brent Yi · Angjoo Kanazawa
[ Exhibit Hall I ]
Abstract
Whether snipping with scissors or opening a box, humans can quickly understand the 3D configurations of familiar objects. For novel objects, we can resort to long-form inspection to build intuition. The more we observe the object, the better we get at predicting its 3D state immediately. Existing systems, however, are limited to either optimizing underlying representations from multi-view observations or training a feed-forward predictor from supervised datasets. We introduce Predict-Optimize-Distill (POD), a self-improving framework that interleaves prediction and optimization in a mutually reinforcing cycle to achieve better 4D object understanding with increasing observation time. Given a multi-view object scan and a long-form monocular video of human-object interaction, POD iteratively trains a neural network to predict local part poses from RGB frames, uses this predictor to initialize a global optimization which refines output poses through inverse rendering, then finally distills the results of optimization back into the model by generating synthetic self-labeled training data from novel viewpoints. Each iteration improves both the predictive model and the optimized motion trajectory, creating a virtuous cycle that bootstraps its own training data to learn about the pose configurations of an object. We also introduce a quasi-multiview mining strategy for reducing depth ambiguity by leveraging …
Poster
Yue Li · Qi Ma · Runyi Yang · Huapeng Li · Mengjiao Ma · Bin Ren · Nikola Popovic · Nicu Sebe · Ender Konukoglu · Theo Gevers · Luc Gool · Martin R. Oswald · Danda Pani Paudel
[ Exhibit Hall I ]
Abstract
Recognizing arbitrary or previously unseen categories is essential for comprehensive real-world 3D scene understanding. Currently, all existing methods rely on 2D or textual modalities during training, or together at inference. This highlights a clear absence of a model capable of processing 3D data alone for learning semantics end-to-end, along with the necessary data to train such a model. Meanwhile, 3D Gaussian Splatting (3DGS) has emerged as the de facto standard for 3D scene representation across various vision tasks. However, effectively integrating semantic reasoning into 3DGS in a generalizable fashion remains an open challenge. To address these limitations we introduce SceneSplat, to our knowledge the first large-scale 3D indoor scene understanding approach that operates natively on 3DGS. Furthermore, we propose a self-supervised learning scheme that unlocks rich 3D feature learning from unlabeled scenes. To power the proposed methods, we introduce SceneSplat-7K, the first large-scale 3DGS dataset for indoor scenes, comprising 6,868 scenes derived from 7 established datasets such as ScanNet and Matterport3D. Generating SceneSplat-7K required computational resources equivalent to 119 GPU-days on an L4 GPU, enabling standardized benchmarking for 3DGS-based reasoning for indoor scenes. Our exhaustive experiments on SceneSplat-7K demonstrate the significant benefit of the proposed methods over the established baselines. …
Poster
Chen Zhao · Xuan Wang · Tong Zhang · Saqib Javed · Mathieu Salzmann
[ Exhibit Hall I ]
Abstract
3D Gaussian Splatting (3DGS) has demonstrated remarkable effectiveness in novel view synthesis (NVS). However, 3DGS tends to overfit when trained with sparse views, limiting its generalization to novel viewpoints. In this paper, we address this overfitting issue by introducing Self-Ensembling Gaussian Splatting (SE-GS). We achieve self-ensembling by incorporating an uncertainty-aware perturbation strategy during training. A $\mathbf{\Delta}$-model and a $\mathbf{\Sigma}$-model are jointly trained on the available images. The $\mathbf{\Delta}$-model is dynamically perturbed based on rendering uncertainty across training steps, generating diverse perturbed models with negligible computational overhead. Discrepancies between the $\mathbf{\Sigma}$-model and these perturbed models are minimized throughout training, forming a robust ensemble of 3DGS models. This ensemble, represented by the $\mathbf{\Sigma}$-model, is then used to generate novel-view images during inference. Experimental results on the LLFF, Mip-NeRF360, DTU, and MVImgNet datasets demonstrate that our approach enhances NVS quality under few-shot training conditions, outperforming existing state-of-the-art methods.
Poster
Shaoyuan Xie · Lingdong Kong · Yuhao Dong · Chonghao Sima · Wenwei Zhang · Qi Alfred Chen · Ziwei Liu · Liang Pan
[ Exhibit Hall I ]
Abstract
Recent advancements in Vision-Language Models (VLMs) have fueled interest in autonomous driving applications, particularly for interpretable decision-making. However, the assumption that VLMs provide visually grounded and reliable driving explanations remains unexamined. To address this, we introduce DriveBench, a benchmark evaluating 12 VLMs across 17 settings, covering 19,200 images, 20,498 QA pairs, and four key driving tasks. Our findings reveal that VLMs often generate plausible responses from general knowledge or textual cues rather than true visual grounding, especially under degraded or missing visual inputs. This behavior, concealed by dataset imbalances and insufficient evaluation metrics, poses significant risks in safety-critical scenarios like autonomous driving. We further observe that VLMs possess inherent corruption-awareness but only explicitly acknowledge these issues when directly prompted. Given the challenges and inspired by the inherent corruption awareness, we propose Robust Agentic Utilization (RAU), leveraging VLMs’ corruption awareness and agentic planning with external tools to enhance perception reliability for downstream tasks. Our study challenges existing evaluation paradigms and provides a roadmap toward more robust and interpretable autonomous driving systems.
Poster
Changwoon Choi · Jeongjun Kim · Geonho Cha · Minkwan Kim · Dongyoon Wee · Young Min Kim
[ Exhibit Hall I ]
Abstract
Recent works on dynamic 3D neural field reconstruction assume input from synchronized multi-view videos whose poses are known. These input constraints are often not satisfied in real-world setups, making the approach impractical. We show that unsynchronized videos from unknown poses can generate dynamic neural fields as long as the videos capture human motion. Humans are one of the most common dynamic subjects captured in videos, and their shapes and poses can be estimated using state-of-the-art libraries. While noisy, the estimated human shape and pose parameters provide a decent initialization point to start the highly non-convex and under-constrained problem of training a consistent dynamic neural representation. Given the shape and pose parameters of humans in individual frames, we formulate methods to calculate the time offsets between videos, followed by camera pose estimations that analyze the 3D joint positions. Then, we train the dynamic neural fields employing multiresolution grids while we concurrently refine both time offsets and camera poses. The setup still involves optimizing many parameters; therefore, we introduce a robust progressive learning strategy to stabilize the process. Experiments show that our approach achieves accurate spatio-temporal calibration and high-quality scene reconstruction in challenging conditions.
Poster
Hao Zhang · Haolan Xu · Chun Feng · Varun Jampani · Narendra Ahuja
[ Exhibit Hall I ]
Abstract
Skinning and rigging are fundamental components in animation, articulated object reconstruction, motion transfer, and 4D generation. Existing approaches predominantly rely on Linear Blend Skinning (LBS), due to its simplicity and differentiability. However, LBS introduces artifacts such as volume loss and unnatural deformations, and it fails to model elastic materials like soft tissues, fur, and flexible appendages (e.g., elephant trunks, ears, and fatty tissues). In this work, we propose PhysRig: a differentiable physics-based skinning and rigging framework that overcomes these limitations by embedding the rigid skeleton into a volumetric representation (e.g., a tetrahedral mesh), which is simulated as a deformable soft-body structure driven by the animated skeleton. Our method leverages continuum mechanics and discretizes the object as particles embedded in an Eulerian background grid to ensure differentiability with respect to both material properties and skeletal motion. Additionally, we introduce material prototypes, significantly reducing the learning space while maintaining high expressiveness. To evaluate our framework, we construct a comprehensive synthetic dataset using meshes from Objaverse, The Amazing Animals Zoo, and MixaMo, covering diverse object categories and motion patterns. Our method consistently outperforms traditional LBS-based approaches, generating more realistic and physically plausible results. Furthermore, we demonstrate the applicability of our framework in the …
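To make the LBS baseline mentioned above concrete, here is a minimal 2D sketch of Linear Blend Skinning, in which a skinned point is the weighted sum of that point under each bone's rigid transform. It illustrates only the standard LBS formula and one of its well-known failure cases; it is not PhysRig, and the helper names (`rotation`, `apply_rigid`, `lbs`) and the two-bone setup are assumptions for illustration.

```python
import math


def rotation(theta):
    """2x2 rotation matrix for an angle in radians."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


def apply_rigid(transform, v):
    """Apply a rigid 2D transform (R, t) to the point v."""
    R, t = transform
    return (R[0][0] * v[0] + R[0][1] * v[1] + t[0],
            R[1][0] * v[0] + R[1][1] * v[1] + t[1])


def lbs(v, transforms, weights):
    """Linear blend skinning: v' = sum_j w_j * (R_j v + t_j)."""
    x = y = 0.0
    for T, w in zip(transforms, weights):
        px, py = apply_rigid(T, v)
        x += w * px
        y += w * py
    return (x, y)


# Two hypothetical bones: one at rest, one rotated by half a turn.
identity = (rotation(0.0), (0.0, 0.0))
half_turn = (rotation(math.pi), (0.0, 0.0))

# A point influenced equally by both bones collapses toward the joint.
blended = lbs((1.0, 0.0), [identity, half_turn], [0.5, 0.5])
```

With equal weights the point `(1, 0)` lands at roughly the origin, because averaging transformed positions does not preserve distances. This collapse is the kind of volume-loss artifact the abstract attributes to LBS.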
Poster
Yingping Liang · Yutao Hu · Wenqi Shao · Ying Fu
[ Exhibit Hall I ]
Abstract
Feature matching plays a fundamental role in many computer vision tasks, yet existing methods heavily rely on scarce and clean multi-view image collections, which constrains their generalization to diverse and challenging scenarios. Moreover, conventional feature encoders are typically trained on single-view 2D images, limiting their capacity to capture 3D-aware correspondences. In this paper, we propose a novel two-stage framework that lifts 2D images to 3D space, named as Lift to Match (L2M), taking full advantage of large-scale and diverse single-view images. To be specific, in the first stage, we learn a 3D-aware feature encoder using a combination of multi-view image synthesis and 3D feature Gaussian representation, which injects 3D geometry knowledge into the encoder. In the second stage, a novel-view rendering strategy, combined with large-scale synthetic data generation from single-view images, is employed to learn a feature decoder for robust feature matching, thus achieving generalization across diverse domains. Extensive experiments demonstrate that our method achieves superior generalization across zero-shot evaluation benchmarks, highlighting the effectiveness of the proposed framework for robust feature matching.
Poster
Tian-Xing Xu · Xiangjun Gao · Wenbo Hu · Xiaoyu Li · Song-Hai Zhang · Ying Shan
[ Exhibit Hall I ]
Abstract
Despite remarkable advancements in video depth estimation, existing methods exhibit inherent limitations in achieving geometric fidelity through affine-invariant predictions, limiting their applicability in reconstruction and other metrically grounded downstream tasks. We propose GeometryCrafter, a novel framework that recovers high-fidelity point map sequences with temporal coherence from open-world videos, enabling accurate 3D/4D reconstruction, camera parameter estimation, and other depth-based applications. At the core of our approach lies a point map Variational Autoencoder (VAE) that learns a latent space agnostic to video latent distributions for effective point map encoding and decoding. Leveraging the VAE, we train a video diffusion model to model the distribution of point map sequences conditioned on the input videos. Extensive evaluations on diverse datasets demonstrate that GeometryCrafter achieves state-of-the-art 3D accuracy, temporal consistency, and generalization capability. Code and models will be publicly released.
Poster
Ahmed Abdelreheem · Filippo Aleotti · Jamie Watson · Zawar Qureshi · Abdelrahman Eldesokey · Peter Wonka · Gabriel Brostow · Sara Vicente · Guillermo Garcia-Hernando
[ Exhibit Hall I ]
Abstract
We introduce the novel task of Language-Guided Object Placement in 3D scenes. Our model is given a 3D scene's point cloud, a 3D asset, and a textual prompt broadly describing where the 3D asset should be placed. The task here is to find a valid placement for the 3D asset that respects the prompt. Compared with other language-guided localization tasks in 3D scenes such as grounding, this task has specific challenges: it is ambiguous because it has multiple valid solutions, and it requires reasoning about 3D geometric relationships and free space. We inaugurate this task by proposing a new benchmark and evaluation protocol. We also introduce a new dataset for training 3D LLMs on this task, as well as the first method to serve as a non-trivial baseline. We believe that this challenging task and our new benchmark could become part of the suite of benchmarks used to evaluate and compare generalist 3D LLMs. We will release the dataset, the benchmark, and the baseline code on acceptance.
Poster
Hongyu Shen · Junfeng Ni · Weishuo Li · Mingtao Pei · Yixin Chen · Siyuan Huang
[ Exhibit Hall I ]
Abstract
We address the challenge of lifting 2D visual segmentation to 3D in Gaussian Splatting. Existing methods often suffer from inconsistent 2D masks across viewpoints and produce noisy segmentation boundaries as they neglect these semantic cues to refine the learned Gaussians. To overcome this, we introduce Gaussian Instance Tracing (GIT), which augments the standard Gaussian representation with an instance weight matrix across input views. Leveraging the inherent consistency of Gaussians in 3D, we use this matrix to identify and correct 2D segmentation inconsistencies. Furthermore, since each Gaussian ideally corresponds to a single object, we propose a GIT-guided adaptive density control mechanism to split and prune ambiguous Gaussians during training, resulting in sharper and more coherent 2D and 3D segmentation boundaries. Experimental results show that our method extracts clean 3D assets and consistently improves 3D segmentation in both online (e.g., self-prompting) and offline (e.g., contrastive lifting) settings, enabling applications such as hierarchical segmentation, object extraction, and scene editing.
Poster
Lei Sun · Yuhan Bao · Jiajun Zhai · Jingyun Liang · YULUN ZHANG · Kaiwei Wang · Danda Pani Paudel · Luc Gool
[ Exhibit Hall I ]
Abstract
Low-light image enhancement (LLIE) aims to improve the visibility of images captured in poorly lit environments. Prevalent event-based solutions primarily utilize events triggered by motion, i.e., "motion events", to strengthen only the edge texture, while leaving the high dynamic range and excellent low-light responsiveness of event cameras largely unexplored. This paper instead opens a new avenue from the perspective of estimating the illumination using "temporal-mapping" events, i.e., by converting the timestamps of events triggered by a transmittance modulation into brightness values. The resulting fine-grained illumination cues facilitate a more effective decomposition and enhancement of the reflectance component in low-light images through the proposed Illumination-aided Reflectance Enhancement module. Furthermore, the degradation model of temporal-mapping events under low-light conditions is investigated for realistic training data synthesis. To address the lack of datasets under this regime, we construct a beam-splitter setup and collect the EvLowLight dataset, which includes images, temporal-mapping events, and motion events. Extensive experiments across 5 synthetic datasets and our real-world EvLowLight dataset substantiate that the devised pipeline, dubbed RetinEV, excels in producing well-illuminated, high dynamic range images, outperforming previous state-of-the-art event-based methods by up to 6.62 dB, while maintaining an efficient inference speed of 35.6 frames per second on a $640\times480$ image.
Poster
Wenyao Zhang · Hongsi Liu · Bohan Li · Jiawei He · Zekun Qi · Yunnan Wang · Eastern Institute of Technology Shengyang · Ningbo Institute Of Digital Twin XinQiang · Galbot Wenjun · Eastern Institute for Advanced Study Xin
[ Exhibit Hall I ]
Abstract
Current self-supervised monocular depth estimation (MDE) approaches encounter performance limitations due to insufficient semantic-spatial knowledge extraction. To address this challenge, we propose Hybrid-depth, a novel framework that systematically integrates foundation models (e.g., CLIP and DINO) to extract visual priors and acquire sufficient contextual information for MDE. Our approach introduces a coarse-to-fine progressive learning framework: 1) First, we aggregate multi-grained features from CLIP (global semantics) and DINO (local spatial details) under contrastive language guidance. A proxy task comparing close-distant image patches is designed to enforce depth-aware feature alignment using text prompts; 2) Next, building on the coarse features, we integrate camera pose information and pixel-wise language alignment to refine depth predictions. This module integrates with existing self-supervised MDE pipelines (e.g., Monodepth2, ManyDepth) as a plug-and-play depth encoder, enhancing continuous depth estimation. By aggregating CLIP's semantic context and DINO's spatial details through language guidance, our method effectively addresses feature granularity mismatches. Extensive experiments on the KITTI benchmark demonstrate that our method significantly outperforms SOTA methods across all metrics and also benefits downstream tasks like BEV perception.
Poster
Jeong Hun Yeo · Minsu Kim · Chae Won Kim · Stavros Petridis · Yong Man Ro
[ Exhibit Hall I ]
Abstract
We explore a novel zero-shot Audio-Visual Speech Recognition (AVSR) framework, dubbed Zero-AVSR, which enables speech recognition in target languages without requiring any audio-visual speech data in those languages. Specifically, we introduce the Audio-Visual Speech Romanizer (AV-Romanizer), which learns language-agnostic speech representations by predicting Roman text. Then, by leveraging the strong multilingual modeling capabilities of Large Language Models (LLMs), we propose converting the predicted Roman text into language-specific graphemes, forming the proposed Cascaded Zero-AVSR. Taking it a step further, we explore a unified Zero-AVSR approach by directly integrating the audio-visual speech representations encoded by the AV-Romanizer into the LLM. This is achieved through finetuning the adapter and the LLM using our proposed multi-task learning scheme. To capture the wide spectrum of phonetic and linguistic diversity, we also introduce a Multilingual Audio-Visual Romanized Corpus (MARC) consisting of 2,916 hours of audio-visual speech data across 82 languages, along with transcriptions in both language-specific graphemes and Roman text. Extensive analysis and experiments confirm that the proposed Zero-AVSR framework has the potential to expand language support beyond the languages seen during the training of the AV-Romanizer.
Poster
Jiaxin Lu · Gang Hua · Qixing Huang
[ Exhibit Hall I ]
Abstract
The automatic assembly problem has attracted increasing interest due to its complex challenges that involve 3D representation. This paper introduces Jigsaw++, a novel generative method designed to tackle the multifaceted challenges of reconstructing a complete shape for the reassembly problem. Existing approaches focus primarily on piecewise information for both part and fracture assembly, often overlooking the integration of a complete object prior. Jigsaw++ distinguishes itself by learning a category-agnostic shape prior of complete objects. It employs the proposed "retargeting" strategy that effectively leverages the output of any existing assembly method to generate complete shape reconstructions. This capability allows it to function orthogonally to the current methods. Through extensive evaluations on the Breaking Bad dataset and PartNet, Jigsaw++ has demonstrated its effectiveness, reducing reconstruction errors and enhancing the precision of shape reconstruction, which sets a new direction for future reassembly model developments.
Poster
Hongyu Wen · Yiming Zuo · Venkat Subramanian · Patrick Chen · Jia Deng
[ Exhibit Hall I ]
Abstract
Transparent objects are common in daily life, and understanding their multi-layer depth information (perceiving both the transparent surface and the objects behind it) is crucial for real-world applications that interact with transparent materials. In this paper, we introduce LayeredDepth, the first dataset with multi-layer depth annotations, including a real-world benchmark and a synthetic data generator, to support the task of multi-layer depth estimation. Our real-world benchmark consists of 1,500 images from diverse scenes, and evaluating state-of-the-art depth estimation methods on it reveals that they struggle with transparent objects. The synthetic data generator is fully procedural and capable of providing training data for this task with an unlimited variety of objects and scene compositions. Using this generator, we create a synthetic dataset with 15,300 images. Baseline models trained solely on this synthetic dataset produce good cross-domain multi-layer depth estimation. Fine-tuning state-of-the-art single-layer depth models on it substantially improves their performance on transparent objects, with quadruplet accuracy on our benchmark increased from 55.59% to 75.16%.
Poster
Yuxi Xiao · Jianyuan Wang · Nan Xue · Nikita Karaev · Iurii Makarov · Bingyi Kang · Xing Zhu · Hujun Bao · Yujun Shen · Xiaowei Zhou
[ Exhibit Hall I ]
Abstract
3D point tracking from monocular videos has recently shown promising results, attracting increasing attention from the community. However, existing methods typically struggle with two key challenges: (a) significant background motion caused by camera movement, and (b) frequent occlusions that necessitate re-identifying previously observed objects. Monocular egocentric videos are prime examples where these challenges prominently arise. In this work, we introduce SpatialTrackerV2, a novel 3D point tracking approach capable of computing accurate 3D trajectories for arbitrary 2D pixels, excelling not only in common video scenarios but also in challenging contexts with substantial camera motion and frequent occlusions. Our method separates camera motion from object motion, explicitly modeling the camera movement and its interplay with depth maps to significantly enhance 3D point tracking. Additionally, we propose a joint refinement module that simultaneously improves depth estimation, camera motion, and 3D tracking accuracy in a unified manner. Benefiting from large-scale training on a mixture of synthetic and real-world data, SpatialTrackerV2 demonstrates strong robustness and generalization capabilities. Extensive experiments across different benchmarks validate its effectiveness and substantial performance improvements over existing approaches.
Poster
Peiming Li · Ziyi Wang · Yulin Yuan · Hong Liu · Xiangming Meng · Junsong Yuan · Mengyuan Liu
[ Exhibit Hall I ]
Abstract
Point cloud videos capture dynamic 3D motion while reducing the effects of lighting and viewpoint variations, making them highly effective for recognizing subtle and continuous human actions. Although Selective State Space Models (SSMs) have shown good performance in sequence modeling with linear complexity, the spatio-temporal disorder of point cloud videos hinders their unidirectional modeling when directly unfolding the point cloud video into a 1D sequence through temporally sequential scanning. To address this challenge, we propose the Unified Spatio-Temporal State Space Model (UST-SSM), which extends the latest advancements in SSMs to point cloud videos. Specifically, we introduce Spatial-Temporal Selection Scanning (STSS), which reorganizes unordered points into semantic-aware sequences through prompt-guided clustering, thereby enabling the effective utilization of points that are spatially and temporally distant yet similar within the sequence. For missing 4D geometric and motion details, Spatio-Temporal Structure Aggregation (STSA) aggregates spatio-temporal features and compensates. To improve temporal interaction within the sampled sequence, Temporal Interaction Sampling (TIS) enhances fine-grained temporal dependencies through non-anchor frame utilization and expanded receptive fields. Experimental results on the MSR-Action3D, NTU RGB+D, and Synthia 4D datasets validate the effectiveness of our method. Our code is available at https://anonymous.4open.science/r/UST-SSM.
Poster
Qi Bi · Jingjun Yi · Huimin Huang · Hao Zheng · Haolan Zhan · Wei Ji · Yawen Huang · Yuexiang Li · Yefeng Zheng
[ Exhibit Hall I ]
Abstract
Diffusion models have demonstrated powerful capability as a versatilist for dense vision tasks, yet the generalization ability to unseen domains remains rarely explored. In light of this issue, we focus on investigating generalizable paradigms for diffusion-based dense prediction and propose an efficient frequency learning scheme, dubbed HarDiff, alleviating the domain gap across various scenes. Interestingly, the low-frequency features, converted by the Discrete Hartley Transform, activate the broader content of an image, while the high-frequency features maintain sufficient details for dense pixels. Hence, our HarDiff is driven by two compelling designs: (1) Low-Frequency Training Process, which extracts structural priors from the source domain, for enhancing understanding of task-related content; (2) High-Frequency Sampling Process, which utilizes detail-oriented guidance from the unseen target domain, to infer precise dense predictions with target-related details. Extensive empirical evidence shows that HarDiff can be easily plugged into various dense vision tasks, e.g., semantic segmentation, depth estimation and haze removal, yielding improvements over the state-of-the-art methods in twelve public benchmarks. We will release our code.
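For readers unfamiliar with the Discrete Hartley Transform named above, the sketch below computes the standard real-valued 1-D DHT, H[k] = sum_n x[n] * (cos(2*pi*n*k/N) + sin(2*pi*n*k/N)), and uses it to split a signal into low- and high-frequency parts. The 1-D setting, the hard frequency mask, and the names `dht` and `split_bands` are assumptions for illustration; the abstract does not specify how HarDiff separates the bands of image features.

```python
import math


def dht(x):
    """Discrete Hartley Transform of a real sequence (direct O(N^2) evaluation)."""
    n = len(x)
    out = []
    for k in range(n):
        acc = 0.0
        for j in range(n):
            angle = 2.0 * math.pi * j * k / n
            acc += x[j] * (math.cos(angle) + math.sin(angle))  # cas = cos + sin
        out.append(acc)
    return out


def split_bands(x, cutoff):
    """Split x into low- and high-frequency parts with a hard mask on DHT bins."""
    n = len(x)
    h = dht(x)
    # Bins k and n-k carry the same frequency, so both are kept together.
    low_h = [h[k] if min(k, n - k) <= cutoff else 0.0 for k in range(n)]
    high_h = [h[k] - low_h[k] for k in range(n)]
    # The DHT is its own inverse up to a factor 1/N.
    low = [v / n for v in dht(low_h)]
    high = [v / n for v in dht(high_h)]
    return low, high


signal = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0, 0.0, 2.0]
low, high = split_bands(signal, cutoff=1)
```

Since the transform is linear, `low` and `high` sum back to the original signal, and with `cutoff=0` the low band reduces to the signal mean. Unlike the Fourier transform, all intermediate values stay real, which is a common practical reason to choose the Hartley transform.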
Poster
Tuo Xiang · Xuemiao Xu · Bangzhen Liu · Jinyi Li · Yong Li · Shengfeng He
[ Exhibit Hall I ]
Abstract
The rapid growth of 3D digital content necessitates expandable recognition systems for open-world scenarios. However, existing 3D class-incremental learning methods struggle under extreme data scarcity due to geometric misalignment and texture bias. While recent approaches integrate 3D data with 2D foundation models (e.g., CLIP), they suffer from semantic blurring caused by texture-biased projections and indiscriminate fusion of geometric-textural cues, leading to unstable decision prototypes and catastrophic forgetting. To address these issues, we propose Cross-Modal Geometric Rectification (CMGR), a framework that enhances 3D geometric fidelity by leveraging CLIP’s hierarchical spatial semantics. Specifically, we introduce a Structure-Aware Geometric Rectification module to hierarchically align 3D part structures with CLIP’s intermediate spatial priors via attention-driven geometric fusion. Additionally, a Texture Amplification Module synthesizes minimal yet discriminative textures to suppress noise and reinforce cross-modal consistency. To further stabilize incremental prototypes, we employ a Base-Novel Discriminator that isolates geometric variations. Extensive experiments demonstrate that our method significantly improves 3D few-shot class-incremental learning, achieving superior geometric coherence and robustness to texture bias across cross-domain and within-domain settings.
Poster
Ruining Li · Chuanxia Zheng · Christian Rupprecht · Andrea Vedaldi
[ Exhibit Hall I ]
Abstract
Most 3D object generators focus on aesthetic quality, often neglecting physical constraints necessary in applications. One such constraint is that the 3D object should be self-supporting, i.e., remain balanced under gravity. Prior approaches to generating stable 3D objects used differentiable physics simulators to optimize geometry at test-time, which is slow, unstable, and prone to local optima. Inspired by the literature on aligning generative models to external feedback, we propose Direct Simulation Optimization (DSO), a framework to use the feedback from a (non-differentiable) simulator to increase the likelihood that the 3D generator outputs stable 3D objects directly. We construct a dataset of 3D objects labeled with a stability score obtained from the physics simulator. We can then fine-tune the 3D generator using the stability score as the alignment metric, via direct preference optimization (DPO) or direct reward optimization (DRO), a novel objective, which we introduce, to align diffusion models without requiring pairwise preferences. Our experiments show that the fine-tuned feed-forward generator, using either the DPO or the DRO objective, is much faster and more likely to produce stable objects than test-time optimization. Notably, the DSO framework works even without any ground-truth 3D objects for training, allowing the 3D generator to self-improve by automatically collecting simulation …
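As background for the DPO objective mentioned above, the sketch below evaluates the standard DPO loss for a single preference pair from log-likelihoods under the fine-tuned and reference models. It assumes, for illustration only, that the "winner" is the sample with the higher simulator stability score; it does not reproduce the diffusion-specific formulation or the DRO objective introduced in the paper, and the function name, arguments, and `beta` default are hypothetical.

```python
import math


def dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """Standard DPO loss: -log(sigmoid(beta * margin)) for one preference pair.

    margin compares how much the fine-tuned model has raised the likelihood of
    the preferred sample, relative to the reference model, against the same
    quantity for the rejected sample.
    """
    margin = (logp_win - ref_logp_win) - (logp_lose - ref_logp_lose)
    z = beta * margin
    # Numerically stable form of -log(sigmoid(z)) = log(1 + exp(-z)).
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


# Hypothetical pair: the stable object (winner) has become more likely than
# under the reference model, while the unstable one has become less likely.
good = dpo_loss(logp_win=-10.0, logp_lose=-14.0, ref_logp_win=-12.0, ref_logp_lose=-12.0)
neutral = dpo_loss(logp_win=-12.0, logp_lose=-12.0, ref_logp_win=-12.0, ref_logp_lose=-12.0)
```

When the fine-tuned model is identical to the reference, the margin is zero and the loss equals log 2; the loss falls as the model shifts probability toward the preferred sample.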
Poster
Aleksandar Jevtić · Christoph Reich · Felix Wimbauer · Oliver Hahn · Christian Rupprecht · Stefan Roth · Daniel Cremers
[ Exhibit Hall I ]
Abstract
Semantic scene completion (SSC) aims to infer both the 3D geometry and semantics of a scene from single images. In contrast to prior work on SSC that heavily relies on expensive ground-truth annotations, we approach SSC in an unsupervised setting. Our novel method, SceneDINO, adapts techniques from self-supervised representation learning and 2D unsupervised scene understanding to SSC. Our training exclusively utilizes multi-view consistency self-supervision without any form of semantic or geometric ground truth. Given a single input image, SceneDINO infers the 3D geometry and expressive 3D DINO features in a feed-forward manner. Through a novel 3D feature distillation approach, we obtain unsupervised 3D semantics. In both 3D and 2D unsupervised scene understanding, SceneDINO reaches state-of-the-art segmentation accuracy. Linear probing our 3D features matches the segmentation accuracy of a current supervised SSC approach. Additionally, we showcase the domain generalization and multi-view consistency of SceneDINO, taking the first steps towards a strong foundation for single image 3D scene understanding.
Poster
Xinhao Cai · Qiuxia Lai · Gensheng Pei · Xiangbo Shu · Yazhou Yao · Wenguan Wang
[ Exhibit Hall I ]
Abstract
In this paper, we propose a generation-detection cycle consistent (GDCC) learning framework that jointly optimizes both layout-to-image (L2I) generation and object detection (OD) tasks in an end-to-end manner. The key of GDCC lies in the inherent duality between the two tasks, where L2I takes all object boxes and labels as input conditions to generate images, and OD maps images back to these layout conditions. Specifically, in GDCC, L2I generation is guided by a layout translation cycle loss, ensuring that the layouts used to generate images align with those predicted from the synthesized images. Similarly, OD benefits from an image translation cycle loss, which enforces consistency between the synthesized images fed into the detector and those generated from predicted layouts. While current L2I and OD tasks benefit from large-scale annotated layout-image pairs, our GDCC enables more efficient use of annotation-free synthetic data, thereby further enhancing data efficiency. It is worth noting that our GDCC framework is computationally efficient thanks to the perturbative single-step sampling strategy and a priority timestep re-sampling strategy during training. Besides, GDCC preserves the architectures of L2I, OD models, and the generation pipeline within the framework, thus maintaining the original inference speed. Extensive experiments demonstrate that GDCC significantly …
Poster
Wenxuan Guo · Xiuwei Xu · Hang Yin · Ziwei Wang · Jianjiang Feng · Jie Zhou · Jiwen Lu
[ Exhibit Hall I ]
Abstract
Visual navigation with an image as goal is a fundamental and challenging problem. Conventional methods either rely on end-to-end RL or on a modular policy with a topological graph or BEV map as memory, which cannot fully model the geometric relationship between the explored 3D environment and the goal image. In order to efficiently and accurately localize the goal image in 3D space, we build our navigation system upon the renderable 3D Gaussian (3DGS) representation. However, due to the computational intensity of 3DGS optimization and the large search space of 6-DoF camera pose, directly leveraging 3DGS for image localization during the agent exploration process is prohibitively inefficient. To this end, we propose IGL-Nav, an Incremental 3D Gaussian Localization framework for efficient and 3D-aware image-goal navigation. Specifically, we incrementally update the scene representation as new images arrive with feed-forward monocular prediction. Then we coarsely localize the goal by leveraging the geometric information for discrete space matching, which can be equivalent to efficient 3D convolution. When the agent is close to the goal, we finally solve the fine target pose with optimization via differentiable rendering. The proposed IGL-Nav outperforms existing state-of-the-art methods by a large margin across diverse experimental configurations. It can also handle the …
Poster
Dejie Yang · Zijing Zhao · Yang Liu
[ Exhibit Hall I ]
Abstract
Visual Robot Manipulation (VRM) aims to enable a robot to follow natural language instructions based on robot states and visual observations, and therefore requires costly multi-modal data. To compensate for the deficiency of robot data, existing approaches have employed vision-language pretraining with large-scale data. However, they either utilize web data that differs from robotic tasks, or train the model in an implicit way (e.g., predicting future frames at the pixel level), thus showing limited generalization ability under insufficient robot data. In this paper, we propose to learn from large-scale human action video datasets in an explicit way (i.e., imitating human actions from hand keypoints), introducing Visual Robot Manipulation with Analogical Reasoning (AR-VRM). To acquire action knowledge explicitly from human action videos, we propose a keypoint Vision-Language Model (VLM) pretraining scheme, enabling the VLM to learn human action knowledge and directly predict human hand keypoints. During fine-tuning on robot data, to facilitate the robotic arm in imitating the action patterns of human motions, we first retrieve human action videos that perform similar manipulation tasks and have similar historical observations, and then learn the Analogical Reasoning (AR) map between human hand keypoints and robot components. Taking advantage of focusing on action …
Poster
Donghyeon Kwon · Youngseok Yoon · Hyeongseok Son · Suha Kwak
[ Exhibit Hall I ]
Abstract
Camera-based 3D object detection has gained attention for its cost-effectiveness, but it in general lags behind LiDAR-based approaches due to its lack of explicit 3D spatial cues. To take the best of both camera- and LiDAR-based detectors, we propose MemDistill, a novel cross-modal knowledge distillation framework for 3D object detection. MemDistill transfers rich 3D knowledge from a LiDAR-based teacher model to a camera-based student model through a dedicated memory unit and a scene-dependent memory retrieval module. To be specific, our framework distills the teacher's 3D knowledge, optimizes the memory to store that knowledge compactly, and learns the retriever that searches the memory to produce 3D features relevant to the input scene, compensating for the missing LiDAR modality. Experiments on the nuScenes dataset demonstrate that MemDistill significantly improves performance of its camera-only baseline, achieving the state of the art in camera-based 3D object detection.
Poster
Daixun Li · Yusi Zhang · Mingxiang Cao · donglai Liu · Weiying Xie · Tianlin Hui · Lunkai Lin · Zhiqiang Xie · Yunsong Li
[ Exhibit Hall I ]
Abstract
Vision-Language-Action (VLA) is crucial for autonomous decision-making in embodied systems. While current methods have advanced single-skill abilities, their short-horizon capability limits applicability in real-world scenarios. To address this challenge, we innovatively propose $\textbf{MindExplore}$, a general hierarchical VLA system with cross-skill for long-horizon tasks in highly dynamic sand. The key insight is to iteratively align the knowledge domain of task planning and action execution. Thus, this task-oriented action enables outstanding generalization across a wide range of real-world scenarios. In the reasoning layer, task-specific chains of thought (CoT) are designed for planning long-horizon task sequences and providing meta-action signals. In the acting layer, a simple but powerful Mixture of Policy Experts strategy is built inspired by signals and multimodal inputs for adaptively selecting skill experts and generating closed-loop action sequences. Also, it integrates a lightweight Multimodal Diffusion Policy (MMDP) to enhance spatial perception by fusing multi-visual modality features. Besides, the pioneering memory mechanism establishes feedback between the reasoning and acting layers, facilitating adaptive execution of long-horizon tasks and real-time replanning. Notably, we create $\textbf{SandGo-1k}$ and $\textbf{SandThink-21k}$, the first expert-level multimodal CoT dataset and embodied dataset tailored for sandy environments. At a high execution frequency of 30 FPS, MindExplore is 3.01 $\times$ more …
Poster
Minh Tran · Hongda Mao · Qingshuang Chen · Yelin Kim
[ Exhibit Hall I ]
Abstract
Generating body pose from head-mounted, egocentric inputs is essential for immersive VR/AR and assistive technologies, as it supports more natural interactions. However, the task is challenging due to limited visibility of body parts in first-person views and the sparseness of sensory data, with only a single device placed on the head. To address these challenges, we introduce Head2Body, a novel framework for body pose estimation that effectively combines IMU and visual data. First, we introduce a pre-trained IMU encoder, trained on over 1,700 hours of head-IMU data from wearable eyeglasses, to better capture detailed temporal motion cues given limited labeled egocentric pose data. For visual processing, we leverage large vision-language models (LVLMs) to segment body parts that appear sporadically in video frames to improve visual feature extraction. To better guide the pose generation process with sparse signals from only head-mounted devices, we incorporate a Vector Quantized Variational Autoencoder (VQ-VAE) to represent poses as discrete tokens, which capture high-frequency motion patterns and provide a more structured representation of body pose. Our experiments demonstrate the effectiveness of the proposed approach, yielding 8–13% gains over state-of-the-art baselines on four datasets: AMASS, KinPoly, GIMO, and EgoExo4D. By capturing subtle temporal dynamics and leveraging complementary …
Poster
Xingxiang Zhou · Xiangdong Su · Haoran Zhang · Wei Chen · Guanglai Gao
[ Exhibit Hall I ]
Abstract
Low-light image enhancement (LLIE) is a fundamental task in computer vision. Its goal is to extract more useful information from dark regions. Many existing methods have made excellent strides in improving image brightness and enhancing texture details. However, these approaches often lead to overexposure in certain regions when dealing with unevenly illuminated images, resulting in the loss of original information within the images. To address this issue, we propose a Bézier surface constraint (BSCNet) method based on task decoupling to enhance low-light images with uneven brightness. Specifically, we design a diffusion model with a branch structure that separates the enhancement process into brightness adjustment and color restoration, enabling independent control over brightness uniformity. Additionally, we use Bézier surfaces as a learning target to impose smoothness constraints on the image, thereby addressing the issue of uneven brightness in the enhanced image. To counteract potential detail loss introduced by Bézier surfaces, we introduce a spatial-frequency reconstruction module based on the Fourier transform to enhance fine-grained texture information. Experimental comparisons on six generalized LLIE datasets show that our proposed method demonstrates outstanding effectiveness.
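To make the smoothness argument in the abstract above concrete, here is a minimal sketch of how a tensor-product Bézier surface is evaluated from a grid of control values using Bernstein polynomials. Only the Bézier definition itself is standard; how BSCNet parameterizes or fits such a surface is not stated in the abstract, so the control grid, the function names, and the interpretation of the output as a brightness value are assumptions.

```python
from math import comb


def bernstein(n, i, t):
    """Bernstein basis polynomial B_{i,n}(t) for t in [0, 1]."""
    return comb(n, i) * t**i * (1 - t) ** (n - i)


def bezier_surface(control, u, v):
    """Evaluate a tensor-product Bezier surface at (u, v) in [0, 1]^2.

    control is an (n+1) x (m+1) grid of scalar control values, e.g. a
    coarse brightness map (hypothetical use, not taken from the paper).
    """
    n = len(control) - 1
    m = len(control[0]) - 1
    return sum(
        bernstein(n, i, u) * bernstein(m, j, v) * control[i][j]
        for i in range(n + 1)
        for j in range(m + 1)
    )
```

Because the basis functions are smooth polynomials that sum to one, the surface interpolates the corner control values and varies smoothly in between, which is the property that makes it usable as a smoothness constraint.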
Poster
Jae Young Kang · Hoonhee Cho · Kuk-Jin Yoon
[ Exhibit Hall I ]
Abstract
3D object detection is essential for autonomous systems, enabling precise localization and dimension estimation. While LiDAR and RGB cameras are widely used, their fixed frame rates create perception gaps in high-speed scenarios. Event cameras, with their asynchronous nature and high temporal resolution, offer a solution by capturing motion continuously. The recent approach, which integrates event cameras with conventional sensors for continuous-time detection, struggles in fast-motion scenarios due to its dependency on synchronized sensors. We propose a novel 3D object detection framework that relies solely on event cameras, eliminating the need for conventional 3D sensors. To compensate for the lack of semantic and geometric information in event data, we introduce a dual filter mechanism that extracts both. Additionally, we enhance regression by aligning bounding boxes with object-centric information. Experiments show that our method outperforms prior approaches in dynamic environments, demonstrating the potential of event cameras for robust, continuous-time 3D perception. Our project code will be publicly available.
Poster
Fang Zhang · Wenzhao Zheng · Linqing Zhao · Zelan Zhu · Jiwen Lu · Xiuzhuang Zhou
[ Exhibit Hall I ]
Abstract
3D plane recovery from monocular images constitutes a fundamental task in indoor scene understanding. Recent methods formulate this problem as 2D pixel-level segmentation through convolutional networks or query-based architectures, which purely rely on 2D pixel features while neglecting the inherent 3D spatial nature of planar surfaces. To address this limitation, we propose an end-to-end Plane Reconstruction, Aggregation, and Splatting (PlaneRAS) framework that explicitly leverages 3D geometric reasoning combined with online planar primitive reconstruction. Our framework introduces two core components: 1) a reconstruction module utilizing customized planar primitives to compactly represent the 3D scene, and 2) a recovery module that aggregates local primitives to derive globally consistent plane instances. The proposed 3D-aware representation enables direct integration of pretrained geometric priors, significantly enhancing performance beyond conventional 2D-centric approaches. Extensive experiments on ScanNet and NYUv2 datasets demonstrate state-of-the-art results across various evaluation metrics, resulting from our explicit 3D geometric modeling and effective fusion of cross-dimensional features.
Poster
Lorenzo Mur-Labadia · Maria Santos-Villafranca · Jesus Bermudez-cameo · Alejandro Perez-Yus · Ruben Martinez-Cantin · Jose Guerrero
[ Exhibit Hall I ]
Abstract
Understanding the world from multiple perspectives is essential for intelligent systems operating together, where segmenting common objects across different views remains an open problem. We introduce a new approach that re-defines cross-image segmentation by treating it as a mask matching task. Our method consists of: (1) a Mask-Context Encoder that pools dense DINOv2 semantic features to obtain discriminative object-level representations from FastSAM mask candidates, (2) an Ego↔Exo Cross-Attention that fuses multi-perspective observations, (3) a Mask Matching contrastive loss that aligns cross-view features in a shared latent space, and (4) a Hard Negative Adjacent Mining strategy to encourage the model to better differentiate between nearby objects. O-MaMa achieves the state of the art in the Ego-Exo4D Correspondences benchmark, obtaining relative gains of +31% and 94% in the Ego2Exo and Exo2Ego IoU against the official challenge baselines and a +13% and +6% compared with the SOTA with 1% of the training parameters.
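The mask matching idea in the abstract above can be illustrated with a generic InfoNCE-style contrastive loss over a similarity matrix between masks from two views. This is a common formulation given as a sketch only: the abstract does not specify the exact loss, the temperature value is an assumption, and the Hard Negative Adjacent Mining step is not modeled. The function name `info_nce` and the convention that matching pairs lie on the diagonal are hypothetical.

```python
import math


def info_nce(sim, temperature=0.07):
    """InfoNCE-style loss over a square similarity matrix.

    sim[i][j] is the similarity between the embedding of mask i in one view
    and candidate mask j in the other view; the true match for row i is
    assumed to sit on the diagonal (j == i).
    """
    total = 0.0
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        peak = max(logits)  # subtract the max for numerical stability
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(sim)
```

The loss is small when each mask is most similar to its true counterpart and grows when a non-matching candidate scores higher.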
Poster
Cui Miao · Tao Chang · meihan wu · Hongbin Xu · Chun Li · Ming Li · Xiaodong Wang
[ Exhibit Hall I ]
Abstract
Vision-Language-Action (VLA) models have significantly advanced robotic manipulation by enabling robots to interpret language instructions for task execution. However, training these models often relies on large-scale user-specific data, raising concerns about privacy and security, which in turn limits their broader adoption. To address this, we propose FedVLA, the first federated VLA learning framework, enabling distributed model training that preserves data privacy without compromising performance. Our framework integrates task-aware representation learning, adaptive expert selection, and expert-driven federated aggregation, enabling efficient and privacy-preserving training of VLA models. Specifically, we introduce an Instruction-Oriented Scene-Parsing mechanism, which decomposes and enhances object-level features based on task instructions, improving contextual understanding. To effectively learn diverse task patterns, we design a Dual Gating Mixture-of-Experts (DGMoE) mechanism, where not only input tokens but also self-aware experts adaptively decide their activation. Finally, we propose an Expert-Driven Aggregation strategy at the federated server, where model aggregation is guided by activated experts, ensuring effective cross-client knowledge transfer. Extensive simulations and real-world robotic experiments demonstrate the effectiveness of our proposals. Notably, DGMoE significantly improves computational efficiency compared to its vanilla counterpart, while FedVLA achieves task success rates comparable to centralized training, effectively preserving data privacy.
Poster
Richard D Paul · Johannes Seiffarth · David Rügamer · Hanno Scharr · Katharina Nöh
[ Exhibit Hall I ]
Abstract
Cell tracking is a key computational task in live-cell microscopy, but fully automated analysis of high-throughput imaging requires reliable and, thus, uncertainty-aware data analysis tools, as the amount of data recorded within a single experiment exceeds what humans are able to review. We here propose and benchmark various methods to reason about and quantify uncertainty in linear assignment-based cell tracking algorithms. Our methods take inspiration from statistics and machine learning, leveraging two perspectives on the cell tracking problem explored throughout this work: considering it as a Bayesian inference problem and as a classification problem. Our methods admit a framework-like character in that they equip any frame-to-frame tracking method with uncertainty quantification. We demonstrate this by applying it to various existing tracking algorithms including the recently presented Transformer-based trackers. We demonstrate empirically that our methods yield useful and well-calibrated tracking uncertainties.
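One way to picture uncertainty in assignment-based tracking, as discussed in the abstract above, is to treat every one-to-one assignment between two frames as a hypothesis and weight it by its total cost. The sketch below enumerates all assignments for a small square cost matrix and derives per-link marginal probabilities. The Boltzmann weighting `exp(-cost)`, which reads costs as negative log-likelihoods, is an assumption made for illustration and is not claimed to be the authors' method; brute-force enumeration is only feasible for a handful of cells.

```python
import math
from itertools import permutations


def assignment_posterior(cost):
    """Posterior over all one-to-one assignments of a square cost matrix.

    cost[i][j] is the cost of linking cell i in frame t to cell j in frame
    t+1. Each assignment is weighted by exp(-total cost) (assumed model).
    """
    n = len(cost)
    weights = {
        perm: math.exp(-sum(cost[i][perm[i]] for i in range(n)))
        for perm in permutations(range(n))
    }
    normalizer = sum(weights.values())
    return {perm: w / normalizer for perm, w in weights.items()}


def link_marginals(posterior, n):
    """Probability that cell i links to cell j, summed over assignments."""
    marginals = [[0.0] * n for _ in range(n)]
    for perm, prob in posterior.items():
        for i, j in enumerate(perm):
            marginals[i][j] += prob
    return marginals
```

The most probable assignment coincides with the minimum-cost solution that a linear assignment solver would return, while the marginals indicate how confident each individual link is.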
Poster
Wufei Ma · Haoyu Chen · Guofeng Zhang · Yu-Cheng Chou · Celso de Melo · Alan Yuille · Jieneng Chen
[ Exhibit Hall I ]
Abstract
3D spatial reasoning is the ability to analyze and interpret the positions, orientations, and spatial relationships of objects within the 3D space. This allows models to develop a comprehensive understanding of the 3D scene, enabling their applicability to a broader range of applications, such as autonomous navigation, robotics, and AR/VR. Despite the remarkable improvements achieved by large multi-modal models (LMMs) in a wide range of image and video understanding tasks, their abilities to perform 3D spatial reasoning are less studied. In this work we present the first comprehensive 3D spatial reasoning benchmark, 3DSRBench, with 3,000 annotated image question answering triplets from 12 question types. We balance the data distribution by collecting complementary images that lead to opposite answers given the same question. We also adopt a novel FlipEval for robust evaluation of 3D spatial reasoning capabilities. Moreover, to study the robustness of 3D spatial reasoning w.r.t. camera 3D viewpoints, our 3DSRBench involves two subsets with 3D spatial reasoning questions on images from the same scene with common and uncommon viewpoints. We benchmark a wide range of open-sourced and proprietary LMMs, revealing their limitations in different types of 3D awareness, i.e., height, orientation, location, and multi-object reasoning. Our 3DSRBench also allows …
Poster
Fengxiang Wang · Hongzhen Wang · Di Wang · Zonghao Guo · Zhenyu Zhong · Long Lan · Wenjing Yang · Jing Zhang
[ Exhibit Hall I ]
Abstract
Masked Image Modeling (MIM) has become an essential method for building foundational visual models in remote sensing (RS). However, the limitations in size and diversity of existing RS datasets restrict the ability of MIM methods to learn generalizable representations. Additionally, conventional MIM techniques, which require reconstructing all tokens, introduce unnecessary computational overhead. To address these issues, we present a new pre-training pipeline for RS models, featuring the creation of a large-scale RS dataset and an efficient MIM approach. We curated a high-quality dataset named **OpticalRS-13M** by collecting publicly available RS datasets and processing them through exclusion, slicing, and deduplication. OpticalRS-13M comprises 13 million optical images covering various RS tasks, such as object detection and pixel segmentation. To enhance efficiency, we propose **SelectiveMAE**, a pre-training method that dynamically encodes and reconstructs semantically rich patch tokens, thereby reducing the inefficiencies of traditional MIM models caused by redundant background pixels in RS images. Extensive experiments show that OpticalRS-13M significantly improves classification, detection, and segmentation performance, while SelectiveMAE increases training efficiency by over 2$\times$. This highlights the effectiveness and scalability of our pipeline in developing RS foundational models.
Poster
Taowen Wang · Cheng Han · James Liang · Wenhao Yang · Dongfang Liu · Luna Zhang · Qifan Wang · Jiebo Luo · Ruixiang Tang
[ Exhibit Hall I ]
Abstract
Recently in robotics, Vision-Language-Action (VLA) models have emerged as a transformative approach, enabling robots to execute complex tasks by integrating visual and linguistic inputs within an end-to-end learning framework. While VLA models offer significant capabilities, they also introduce new attack surfaces, making them vulnerable to adversarial attacks. With these vulnerabilities largely unexplored, this paper systematically quantifies the robustness of VLA-based robotic systems. Recognizing the unique demands of robotic execution, our attack objectives target the inherent spatial and functional characteristics of robotic systems. In particular, we introduce two untargeted attack objectives that leverage spatial foundations to destabilize robotic actions, and a targeted attack objective that manipulates the robotic trajectory. Additionally, we design an adversarial patch generation approach that places a small, colorful patch within the camera's view, effectively executing the attack in both digital and physical environments. Our evaluation reveals a marked degradation in task success rates, with up to a 100% reduction across a suite of simulated robotic tasks, highlighting critical security gaps in current VLA architectures. By unveiling these vulnerabilities and proposing actionable evaluation metrics, we advance both the understanding and enhancement of safety for VLA-based robotic systems, underscoring the necessity for continuously developing robust defense strategies prior to …
Poster
Shintaro Shiba · Yoshimitsu Aoki · Guillermo Gallego
[ Exhibit Hall I ]
Abstract
Event cameras are emerging vision sensors, whose noise is challenging to characterize. Existing denoising methods for event cameras consider other tasks such as motion estimation separately (i.e., sequentially after denoising). However, motion is an intrinsic part of event data, since scene edges cannot be sensed without motion. This work proposes, to the best of our knowledge, the first method that simultaneously estimates motion in its various forms (e.g., ego-motion, optical flow) and noise. The method is flexible, as it allows replacing the 1-step motion estimation of the widely-used Contrast Maximization framework with any other motion estimator, such as deep neural networks. The experiments show that the proposed method achieves state-of-the-art results on the E-MLB denoising benchmark and competitive results on the DND21 benchmark, while showing its efficacy on motion estimation and intensity reconstruction tasks. We believe that the proposed approach contributes to strengthening the theory of event-data denoising, as well as impacting practical denoising use-cases, as we release the code upon acceptance.
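For readers unfamiliar with the Contrast Maximization framework named in the abstract above, the following is a deliberately simplified one-dimensional sketch: events are warped back along a candidate velocity, accumulated into an image, and the velocity that yields the sharpest (highest-variance) image is selected. It covers only the motion-estimation step on integer-valued toy data; the joint noise estimation proposed in the paper is not reproduced, and all function names and the grid search over candidates are assumptions.

```python
def warped_image(events, velocity, width):
    """Accumulate events after warping them to the reference time t = 0.

    events is a list of (x, t) pairs; each event is moved back along the
    candidate velocity and counted in the nearest pixel.
    """
    image = [0] * width
    for x, t in events:
        warped_x = round(x - velocity * t)
        if 0 <= warped_x < width:
            image[warped_x] += 1
    return image


def contrast(image):
    """Variance of the accumulated image, used as the sharpness score."""
    mean = sum(image) / len(image)
    return sum((p - mean) ** 2 for p in image) / len(image)


def best_velocity(events, candidates, width):
    """Pick the candidate velocity that maximizes image contrast."""
    return max(candidates, key=lambda v: contrast(warped_image(events, v, width)))
```

When the candidate velocity matches the true motion, all events triggered by the same edge land on one pixel, so the accumulated image is sharp and its variance is maximal.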
Poster
Lu Chen · Yizhou Wang · SHIXIANG TANG · Qianhong Ma · Tong He · Wanli Ouyang · Xiaowei Zhou · Hujun Bao · Sida Peng
[ Exhibit Hall I ]
Abstract
This paper addresses the task of learning an agent model behaving like humans, which can jointly perceive, predict, and act in egocentric worlds. Previous methods usually train separate models for these three abilities, which prevents them from learning from each other. In this paper, we propose a joint predictive agent model, named EgoAgent, that simultaneously learns to represent the world, predict future states, and take reasonable actions within a single transformer. EgoAgent introduces two innovations to learn from the causal and temporally intertwined nature of these abilities: (1) Interleaved sequential modeling of states and actions with the causal attention mechanism, and (2) A joint embedding-action-prediction architecture featuring temporal asymmetric predictor-observer branches. Integrating these designs based on JEPA, EgoAgent unifies these capabilities in a cohesive learning framework. Comprehensive evaluations of EgoAgent on representative tasks such as image classification, egocentric future state prediction, and 3D human motion prediction tasks demonstrate the superiority of our method. The code and trained model will be released for reproducibility.
Poster
Stefan Stojanov · Linan Zhao · Yunzhi Zhang · Daniel Yamins · Jiajun Wu
[ Exhibit Hall I ]
Abstract
Establishing dense correspondences across image pairs is essential for tasks such as shape reconstruction and robot manipulation. In the challenging setting of matching across different categories, the function of an object, i.e., the effect that an object can cause on other objects, can guide how correspondences should be established. This is because object parts that enable specific functions often share similarities in shape and appearance. We derive the definition of dense functional correspondence based on this observation and propose a weakly-supervised learning paradigm to tackle the prediction task. The main insight behind our approach is that we can leverage vision-language models to pseudo-label multi-view images to obtain functional parts. We then integrate this with dense contrastive learning from pixel correspondences to distill both functional and spatial knowledge into a new model that can establish dense functional correspondence. Further, we curate synthetic and real evaluation datasets as task benchmarks. Our results demonstrate the advantages of our approach over baseline solutions consisting of off-the-shelf self-supervised image representations and grounded vision language models.
Poster
Jin-Hee Lee · Jae-keun Lee · Jeseok Kim · Kwon Soon
[ Exhibit Hall I ]
Abstract
To ensure safe autonomous driving in complex urban environments, it is essential not only to develop high-performance object detection models but also to establish a diverse and representative dataset that captures a wide range of urban scenarios and object characteristics. To address these challenges, we introduce a new multi-class 3D LiDAR dataset that comprehensively reflects various urban environments and object types, along with a robust semi-supervised 3D object detection (SSOD) framework. Our SSOD framework leverages a novel multiple teachers model, where similar object classes are grouped and supervised by category-specialized teacher networks. This category-specific collaborative guidance enables the student network to learn more effectively, leading to improved object detection performance. Additionally, we propose the Pseudo-points Generator (PointGen), a simple yet effective technique designed to enhance the generation of high-quality pseudo-labels for the teacher network, mitigating the impact of sparse LiDAR point clouds. Extensive experiments on the Waymo Open Dataset, KITTI, and our newly introduced dataset validate the effectiveness of both our dataset and SSOD framework. Experimental results demonstrate that our approach consistently outperforms state-of-the-art 3D SSOD methods across all evaluated datasets. To encourage further research in this domain, we will publicly release our multi-class LiDAR dataset and source code on …
Poster
Xuange Zhang · Dengjie Li · Bo Liu · Zenghao Bao · Yao Zhou · Baisong Yang · liuzhongying liuzhongying · Yujie Zhong · Tongtong Yuan
[ Exhibit Hall I ]
Abstract
Benefiting from recent advancements in large language models and modality alignment techniques, existing Large Vision-Language Models (LVLMs) have achieved prominent performance across a wide range of scenarios. However, the excessive computational complexity limits the widespread use of these models in practical applications. We argue that one main bottleneck in computational complexity is caused by the involvement of redundant vision sequences in model computation. This is inspired by a reassessment of the efficiency of vision and language information transmission in the language decoder of LVLMs. Then, we propose a novel vision-language interaction mechanism called **L**ayer-wise **V**ision **I**njection with **D**isentangled **A**ttention (LVIDA). In LVIDA, only the language sequence undergoes full forward propagation, while the vision sequence interacts with the language at specific stages within each language decoder layer. It is striking that our approach significantly reduces computational complexity with minimal performance loss. Specifically, LVIDA achieves approximately a 10× reduction in the computational cost of the language decoder across multiple LVLM models while maintaining comparable performance. Our code will be made publicly available soon.
Poster
Jianyu Wu · Yizhou Wang · Xiangyu Yue · Xinzhu Ma · Jinyang Guo · Dongzhan Zhou · Wanli Ouyang · SHIXIANG TANG
[ Exhibit Hall I ]
Abstract
While accurate and user-friendly Computer-Aided Design (CAD) is crucial for industrial design and manufacturing, existing methods still struggle to achieve this due to their over-simplified representations or architectures incapable of supporting multimodal design requirements. In this paper, we attempt to tackle this problem from both the method and dataset perspectives. First, we propose a cascade MAR with topology predictor (CMT), the first multimodal framework for CAD generation based on Boundary Representation (B-Rep). Specifically, the cascade MAR can effectively capture the "edge-counters-surface" priors that are essential in B-Reps, while the topology predictor directly estimates topology in B-Reps from the compact tokens in MAR. Second, to facilitate large-scale training, we develop a large-scale multimodal CAD dataset, mmABC, which includes over 1.3 million B-Rep models with multimodal annotations, including point clouds, text descriptions, and multi-view images. Extensive experiments show the superiority of CMT in both conditional and unconditional CAD generation tasks. For example, we improve Coverage and Valid ratio by +10.68% and +10.3%, respectively, compared to state-of-the-art methods on ABC in unconditional generation. CMT also improves Chamfer by +4.01 on image-conditioned CAD generation on mmABC. The dataset, code and pretrained network shall be released.
Poster
Haru Kondoh · Asako Kanezaki
[ Exhibit Hall I ]
Abstract
The field of multimodal robot navigation in indoor environments has garnered significant attention in recent years. However, as tasks and methods become more advanced, the action decision systems tend to become more complex and operate as black-boxes. For a reliable system, the ability to explain or describe its decisions is crucial; however, there tends to be a trade-off in that explainable systems cannot outperform non-explainable systems in terms of performance. In this paper, we propose incorporating the task of describing actions in language into the reinforcement learning of navigation as an auxiliary task. Existing studies have found it difficult to incorporate describing actions into reinforcement learning due to the absence of ground-truth data. We address this issue by leveraging knowledge distillation from pre-trained description generation models, such as vision-language models. We comprehensively evaluate our approach across various navigation tasks, demonstrating that it can describe actions while attaining high navigation performance. Furthermore, it achieves state-of-the-art performance in the particularly challenging multimodal navigation task of semantic audio-visual navigation.
Poster
Zhigang Wang · Yifei Su · Chenhui Li · Dong Wang · Yan Huang · Xuelong Li · Bin Zhao
[ Exhibit Hall I ]
Abstract
Open-vocabulary 3D scene understanding is indispensable for embodied agents. Recent works leverage pretrained vision-language models (VLMs) for object segmentation and project them to point clouds to build 3D maps. Despite progress, a point cloud is a set of unordered coordinates that requires substantial storage space and cannot directly convey occupancy information or spatial relations, making existing methods inefficient for downstream tasks, e.g., path planning and complex text-based object retrieval. To address these issues, we propose Octree-Graph, a novel scene representation for open-vocabulary 3D scene understanding. Specifically, a Chronological Group-wise Segment Merging (CGSM) strategy and an Instance Feature Aggregation (IFA) algorithm are first designed to get 3D instances and corresponding semantic features. Subsequently, an adaptive-octree structure is developed that stores semantics and depicts the occupancy of an object adjustably according to its shape. Finally, the Octree-Graph is constructed where each adaptive-octree acts as a graph node, and edges describe the spatial relations among nodes. Extensive experiments on various tasks are conducted on several widely-used datasets, demonstrating the versatility and effectiveness of our method. Code will be made publicly available.
Poster
Emily Jia · Jiageng Mao · Zhiyuan Gao · Yajie Zhao · Yue Wang
[ Exhibit Hall I ]
Abstract
Humans possess an exceptional ability to imagine 4D scenes, encompassing both motion and 3D geometry, from a single still image. This ability is rooted in our accumulated observations of similar scenes and an intuitive understanding of physics. In this paper, we aim to replicate this capacity in neural networks, specifically focusing on natural fluid imagery. Existing methods for this task typically employ simplistic 2D motion estimators to animate the image, leading to motion predictions that often defy physical principles, resulting in unrealistic animations. Our approach introduces a novel method for generating 4D scenes with physics-consistent animation from a single image. We propose the use of a physics-informed neural network that predicts motion for each point, guided by a loss term derived from fundamental physical principles, including the Navier-Stokes equations. To reconstruct the 3D geometry, we predict feature-based 3D Gaussians from the input image, which are then animated using the predicted motions and rendered from any desired camera perspective. Experimental results highlight the effectiveness of our method in producing physically plausible animations, showcasing significant performance improvements over existing methods.
Poster
Ming Dai · Wenxuan Cheng · Jiedong Zhuang · Jiang-Jiang Liu · Hongshen Zhao · Zhenhua Feng · Wankou Yang
[ Exhibit Hall I ]
Abstract
Recent advances in visual grounding have largely shifted away from traditional proposal-based two-stage frameworks due to their inefficiency and high computational complexity, favoring end-to-end direct reference paradigms. However, these methods rely exclusively on the referred target for supervision, overlooking the potential benefits of prominent prospective targets. Moreover, existing approaches often fail to incorporate multi-granularity discrimination, which is crucial for robust object identification in complex scenarios. To address these limitations, we propose PropVG, an end-to-end proposal-based framework that, to the best of our knowledge, is the first to seamlessly integrate foreground object proposal generation with referential object comprehension without requiring additional detectors. Furthermore, we introduce a Contrastive-based Refer Scoring (CRS) module, which employs contrastive learning at both sentence and word levels to enhance the model’s capability in understanding and distinguishing referred objects. Additionally, we design a Multi-granularity Target Discrimination (MTD) module that fuses object- and semantic-level information to improve the recognition of absent targets. Extensive experiments on gRefCOCO (GREC/GRES), Ref-ZOM, R-RefCOCO/+/g, and RefCOCO/+/g (REC/RES) benchmarks demonstrate the effectiveness of PropVG.
Poster
Zheng Zhang · Lihe Yang · Tianyu Yang · Chaohui Yu · Xiaoyang Guo · Yixing Lao · Hengshuang Zhao
[ Exhibit Hall I ]
Abstract
Recent advances in monocular depth estimation have significantly improved its robustness and accuracy. Despite these improvements, relative depth models, which offer strong generalization capability, fail to provide real-world depth measurements. Notably, these models exhibit severe flickering and 3D inconsistency when applied to video data, limiting their application for 3D reconstruction. To address these challenges, we introduce StableDepth, a scene-consistent and scale-invariant depth estimation method that achieves stable predictions with scene-level 3D consistency. We propose a dual decoder structure to learn smooth depth supervised by large-scale unlabeled video data. Our approach not only enhances the generalization capability but also reduces flickering during video depth estimation. Leveraging the vast amount of unlabeled video data, our method offers extensive stability and is easy to scale up with low cost. Unlike previous methods requiring full video sequences, StableDepth enables online inference at 13× faster speed, while achieving significant accuracy improvements (6.4%-86.8%) across multiple benchmarks and delivering comparable temporal consistency to video diffusion based depth estimators. We highly encourage viewing the supplementary video materials to gain a better understanding of the effectiveness of our approach.
Poster
Jun-Hee Kim · Jumin Han · Seong-Whan Lee
[ Exhibit Hall I ]
Abstract
Standard 3D human pose estimation (HPE) benchmarks employ root-centering, which normalizes poses relative to the pelvis but discards absolute root position information. While effective for evaluation, this approach limits real-world applications such as motion tracking, AR/VR, and human-computer interaction, where absolute root position is essential. Moreover, incorporating root position into these models often leads to performance degradation. To address these limitations, we introduce PoseAnchor, a unified framework that seamlessly integrates root position estimation while improving overall pose accuracy. PoseAnchor leverages Iterative Hard Thresholding Robust Least Squares Regression (ITRR), a novel robust regression approach introduced to 3D HPE for the first time. ITRR effectively mitigates the impact of noisy 2D detections, enabling more accurate root position estimation. With ITRR, PoseAnchor enables zero-shot root localization, allowing existing models to estimate absolute root positions without retraining or architectural modifications. ITRR identifies a support set of reliable joints based on their spatial relationships to achieve robust root estimation, effectively filtering out unreliable joints. Beyond zero-shot localization, PoseAnchor incorporates ITRR into a Data-Driven Training framework that selectively utilizes the support set to optimize pose learning. By dynamically filtering high-confidence joint data, PoseAnchor mitigates noise while improving robustness. Experiments demonstrate that PoseAnchor achieves state-of-the-art results, surpassing both root-centered and root-aware methods in fully …
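The general idea of iterative hard thresholding for robust least squares can be sketched as follows. This is a generic illustration on a 1D line fit, not the authors' ITRR: the function names, the fixed `support_size`, and the toy data are all assumptions made for the example. The loop alternates between fitting on the current support set and re-selecting the points with the smallest residuals.

```python
def fit_line(points):
    """Ordinary least-squares fit of y = a*x + b (assumes x values are not all equal)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in points)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    a = cov_xy / var_x
    return a, mean_y - a * mean_x


def iht_robust_fit(points, support_size, iterations=20):
    """Hypothetical helper: alternate between fitting and hard-thresholding residuals."""
    support = list(points)  # start from all points
    for _ in range(iterations):
        a, b = fit_line(support)
        # Keep only the support_size points that the current fit explains best.
        ranked = sorted(points, key=lambda p: abs(p[1] - (a * p[0] + b)))
        new_support = ranked[:support_size]
        if set(new_support) == set(support):
            break  # support set is stable
        support = new_support
    a, b = fit_line(support)  # final fit on the selected support set
    return a, b, support


# Toy data: ten points on y = 2x + 1 plus two gross outliers.
data = [(float(x), 2.0 * x + 1.0) for x in range(10)] + [(4.5, 40.0), (7.5, -30.0)]
slope, intercept, kept = iht_robust_fit(data, support_size=10)
```

In a pose setting, the "points" would be per-joint constraints and the support set would play the role of the reliable joints described above; how those constraints are built is specific to the method and is not shown here.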
Poster
Ling Liu · Jun Tian · Li Yi
[ Exhibit Hall I ]
Abstract
4D panoptic segmentation in a streaming setting is critical for highly dynamic environments, such as evacuating dense crowds and autonomous driving in complex scenarios, where real-time, fine-grained perception within a constrained time budget is essential. In this paper, we introduce 4DSegStreamer, a novel framework that employs a Dual-Thread System to efficiently process streaming frames. Our method consists of a predictive thread and an inference thread. The predictive thread leverages historical motion and geometric information to extract features and forecast future dynamics. The inference thread ensures timely prediction for incoming frames by aligning with the latest memory and compensating for ego-motion and dynamic object movements. We evaluate 4DSegStreamer on the indoor HOI4D dataset and the outdoor SemanticKITTI and nuScenes datasets. Comprehensive experiments demonstrate the effectiveness of our approach, particularly in accurately predicting dynamic objects in complex scenes.
Poster
Artem Nikonorov · Georgy Perevozchikov · Andrei Korepanov · Nancy Mehta · Mahmoud Afifi · Egor Ershov · Radu Timofte
[ Exhibit Hall I ]
Abstract
We present cmKAN, a versatile framework for color matching. Given an input image with colors from a source color distribution, our method effectively and accurately maps these colors to match a target color distribution in both supervised and unsupervised settings. Our framework leverages the spline capabilities of Kolmogorov-Arnold Networks (KANs) to model the color matching between source and target distributions. Specifically, we developed a hypernetwork that generates spatially varying weight maps to control the nonlinear splines of a KAN, enabling accurate color matching. As part of this work, we introduce a first large-scale dataset of paired images captured by two distinct cameras and evaluate the efficacy of our and existing methods in matching colors. We evaluated our approach across various color-matching tasks, including: (1) raw-to-raw mapping, where the source color distribution is in one camera’s raw color space and the target in another camera’s raw space; (2) raw-to-sRGB mapping, where the source color distribution is in a camera’s raw space and the target is in the display sRGB space, emulating the color rendering of a camera ISP; and (3) sRGB-to-sRGB mapping, where the goal is to transfer colors from a source sRGB space (e.g., produced by a source camera ISP) …
Poster
Junyuan Deng · Wei Yin · Xiaoyang Guo · Qian Zhang · Xiaotao Hu · Weiqiang Ren · XIAOXIAO LONG · Ping Tan
[ Exhibit Hall I ]
Abstract
In this paper, we present DM-Calib, a diffusion-based approach for estimating pinhole camera intrinsic parameters from a single input image. Monocular camera calibration is essential for many 3D vision tasks. However, most existing methods depend on handcrafted assumptions or are constrained by limited training data, resulting in poor generalization across diverse real-world images. Recent advancements in stable diffusion models, trained on massive data, have shown the ability to generate high-quality images with varied characteristics. Emerging evidence indicates that these models implicitly capture the relationship between camera focal length and image content. Building on this insight, we explore how to leverage the powerful priors of diffusion models for monocular pinhole camera calibration. Specifically, we introduce a new image-based representation, termed Camera Image, which losslessly encodes the numerical camera intrinsics and integrates seamlessly with the diffusion framework. Using this representation, we reformulate the problem of estimating camera intrinsics as the generation of a dense Camera Image conditioned on an input image. By fine-tuning a stable diffusion model to generate a Camera Image from a single RGB input, we can extract camera intrinsics via a RANSAC operation. We further demonstrate that our monocular calibration method enhances performance across various 3D tasks, including zero-shot …
Poster
Zhenyang Liu · Yikai Wang · Kuanning Wang · Longfei Liang · Xiangyang Xue · Yanwei Fu
[ Exhibit Hall I ]
Abstract
Visual imitation learning is effective for robots to learn versatile tasks. However, many existing methods rely on behavior cloning with supervised historical trajectories, limiting their 3D spatial and 4D spatiotemporal awareness. Consequently, these methods struggle to capture the 3D structures and 4D spatiotemporal relationships necessary for real-world deployment. In this work, we propose 4D Diffusion Policy (DP4), a novel visual imitation learning method that incorporates spatiotemporal awareness into diffusion-based policies. Unlike traditional approaches that rely on trajectory cloning, DP4 leverages a dynamic Gaussian world model to guide the learning of 3D spatial and 4D spatiotemporal perceptions from interactive environments. Our method constructs the current 3D scene from a single-view RGB-D observation and predicts the future 3D scene, optimizing trajectory generation by explicitly modeling both spatial and temporal dependencies. Extensive experiments across 17 simulation tasks with 173 variants and 3 real-world robotic tasks demonstrate that the 4D Diffusion Policy (DP4) outperforms baseline methods, improving the average simulation task success rate by 16.4% (Adroit), 14% (DexArt), and 6.45% (RLBench), and the average real-world robotic task success rate by 8.6%.
Poster
Muhammad Danish · Muhammad Akhtar Munir · Syed Shah · Kartik Kuckreja · Fahad Khan · Paolo Fraccaro · Alexandre Lacoste · Salman Khan
[ Exhibit Hall I ]
Abstract
While numerous recent benchmarks focus on evaluating generic Vision-Language Models (VLMs), they do not effectively address the specific challenges of geospatial applications. Generic VLM benchmarks are not designed to handle the complexities of geospatial data, an essential component for applications such as environmental monitoring, urban planning, and disaster management. Key challenges in the geospatial domain include temporal change detection, large-scale object counting, tiny object detection, and understanding relationships between entities in remote sensing imagery. To bridge this gap, we present GEOBench-VLM, a comprehensive benchmark specifically designed to evaluate VLMs on geospatial tasks, including scene understanding, object counting, localization, fine-grained categorization, segmentation, and temporal analysis. Our benchmark features over 10,000 manually verified instructions spanning diverse visual conditions, object types, and scales. We evaluate several state-of-the-art VLMs to assess performance on geospatial-specific challenges. The results indicate that although existing VLMs demonstrate potential, they face challenges when dealing with geospatial-specific tasks, highlighting the room for further improvements. Notably, the best-performing LLaVa-OneVision achieves only 41.7% accuracy on MCQs, slightly higher than GPT-4o and approximately double the random-guess performance. Our benchmark will be publicly available.
Poster
Petr Hruby · Marc Pollefeys
[ Exhibit Hall I ]
Abstract
We propose a novel approach for estimating the relative pose between rolling shutter cameras using the intersections of line projections with a single scanline per image. This allows pose estimation without explicitly modeling camera motion. Alternatively, scanlines can be selected within a single image, enabling single-view relative pose estimation for scanlines of rolling shutter cameras. Our approach is designed as a foundational building block for rolling shutter structure-from-motion (SfM), where no motion model is required, and each scanline's pose can be computed independently. We classify minimal solvers for this problem in both generic and specialized settings, including cases with parallel lines and known gravity direction. Furthermore, we develop minimal solvers for the parallel-lines scenario, both with and without gravity priors, by leveraging connections between this problem and the estimation of 2D structure from 1D cameras. Experiments on rolling shutter images from the Fastec dataset demonstrate the feasibility of our approach for initializing rolling shutter SfM, highlighting its potential for further development. The code will be made publicly available.
Poster
William Gao · Dilin Wang · Yuchen Fan · Aljaz Bozic · Tuur Stuyck · Zhengqin Li · Zhao Dong · Rakesh Ranjan · Nikolaos Sarafianos
[ Exhibit Hall I ]
Abstract
We present a novel approach to mesh shape editing, building on recent progress in 3D reconstruction from multi-view images. We formulate shape editing as a conditional reconstruction problem, where the model must reconstruct the input shape with the exception of a specified 3D region, in which the geometry should be generated from the conditional signal. To this end, we train a conditional Large Reconstruction Model (LRM) for masked reconstruction, using multi-view consistent masks rendered from a randomly generated 3D occlusion, and using one clean viewpoint as the conditional signal. During inference, we manually define a 3D region to edit and provide an edited image from a canonical viewpoint to fill that region. We demonstrate that, in just a single forward pass, our method not only preserves the input geometry in the unmasked region through reconstruction capabilities on par with SoTA, but is also expressive enough to perform a variety of mesh edits from a single image guidance that past works struggle with, while being 2-10 times faster than the top-performing prior work.
Poster
Yulin Wang · Mengting Hu · Hongli Li · Chen LUO
[ Exhibit Hall I ]
Abstract
In pose estimation for seen objects, a prevalent pipeline involves using neural networks to predict dense 3D coordinates of the object surface on 2D images, which are then used to establish dense 2D-3D correspondences. However, current methods primarily focus on more efficient encoding techniques to improve the precision of predicted 3D coordinates on the object's front surface, overlooking the potential benefits of incorporating the back surface and interior of the object. To better utilize the full surface and interior of the object, this study predicts 3D coordinates of both the object's front and back surfaces and densely samples 3D coordinates between them. This process creates ultra-dense 2D-3D correspondences, effectively enhancing pose estimation accuracy based on the Perspective-n-Point (PnP) algorithm. Additionally, we propose Hierarchical Continuous Coordinate Encoding (HCCE) to provide a more accurate and efficient representation of front and back surface coordinates. Experimental results show that the proposed approach outperforms existing state-of-the-art (SOTA) methods listed on the BOP website across seven classic BOP core datasets.
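The sampling step described above can be illustrated with a minimal sketch, assuming each pixel already has a predicted front-surface and back-surface 3D coordinate in the object frame. The function names, the dictionary layout, and the even spacing are hypothetical choices for illustration; the abstract does not specify how the samples are drawn. Every sampled 3D point is paired with the same pixel, which is what multiplies the number of 2D-3D correspondences.

```python
def sample_between(front, back, num_samples):
    """Return num_samples 3D points evenly spaced from front to back (inclusive)."""
    if num_samples < 2:
        raise ValueError("num_samples must be at least 2")
    points = []
    for i in range(num_samples):
        t = i / (num_samples - 1)  # interpolation weight in [0, 1]
        points.append(tuple(f + t * (b - f) for f, b in zip(front, back)))
    return points


def ultra_dense_correspondences(pixel_to_surfaces, num_samples=5):
    """Map {(u, v): (front_xyz, back_xyz)} to a flat list of ((u, v), xyz) pairs."""
    pairs = []
    for pixel, (front, back) in pixel_to_surfaces.items():
        for point in sample_between(front, back, num_samples):
            pairs.append((pixel, point))
    return pairs


# Two pixels with made-up front/back coordinates.
predictions = {
    (10, 20): ((0.0, 0.0, 0.0), (0.0, 0.0, 4.0)),
    (11, 20): ((1.0, 0.0, 0.0), (1.0, 0.0, 2.0)),
}
pairs = ultra_dense_correspondences(predictions, num_samples=5)
```

The resulting pairs could then be passed to any standard PnP solver to estimate the pose.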
Poster
Kaijie Yin · Zhiyuan Zhang · Shu Kong · Tian Gao · Cheng-zhong Xu · Hui Kong
[ Exhibit Hall I ]
Abstract
In this paper, we propose Binarized Change Detection (BiCD), the first binary neural network (BNN) designed specifically for change detection. Conventional network binarization approaches, which directly quantize both weights and activations in change detection models, severely limit the network's ability to represent input data and distinguish between changed and unchanged regions. This results in significantly lower detection accuracy compared to real-valued networks. To overcome these challenges, BiCD enhances both the representational power and feature separability of BNNs, improving detection performance. Specifically, we introduce an auxiliary objective based on the Information Bottleneck (IB) principle, guiding the encoder to retain essential input information while promoting better feature discrimination. Since directly computing mutual information under the IB principle is intractable, we design a compact, learnable auxiliary module as an approximation target, leading to a simple yet effective optimization strategy that minimizes both reconstruction loss and standard change detection loss. Extensive experiments on street-view and remote sensing datasets demonstrate that BiCD establishes a new benchmark for BNN-based change detection, achieving state-of-the-art performance in this domain.
Poster
Andrew Bond · Jui-Hsien Wang · Long Mai · Erkut Erdem · Aykut Erdem
[ Exhibit Hall I ]
Abstract
Efficient neural representations for dynamic video scenes are critical for applications ranging from video compression to interactive simulations. Yet, existing methods often face challenges related to high memory usage, lengthy training times, and temporal consistency. To address these issues, we introduce a novel neural video representation that combines 3D Gaussian splatting with continuous camera motion modeling. By leveraging Neural ODEs, our approach learns smooth camera trajectories while maintaining an explicit 3D scene representation through Gaussians. Additionally, we introduce a spatiotemporal hierarchical learning strategy, progressively refining spatial and temporal features to enhance reconstruction quality and accelerate convergence. This memory-efficient approach achieves high-quality rendering at impressive speeds. Experimental results show that our hierarchical learning, combined with robust camera motion modeling, captures complex dynamic scenes with strong temporal consistency, achieving state-of-the-art performance across diverse video datasets in both high- and low-motion scenarios. Unlike prior methods that depend heavily on extensive external supervision, our approach operates entirely within a self-contained pipeline without requiring any additional supervision.
Poster
Shuting Dong · Mingzhi Chen · Feng Lu · Hao Yu · Guanghao Li · Zhe Wu · Ming Tang · Chun Yuan
[ Exhibit Hall I ]
Abstract
With the rapid advancement of Visual Place Recognition (VPR) systems, their unauthorized use on social media images enables monitoring of individuals' daily movements, posing serious privacy risks. However, privacy protection for addressing these risks in VPR systems remains an underexplored area. While adversarial perturbations have been widely explored for visual privacy protection, existing methods still fail to simultaneously satisfy the black-box constraint, imperceptibility, and real-time performance required in realistic VPR privacy protection scenarios. In this paper, we present the first look at privacy protection in VPR systems and introduce VPR-Cloak, an efficient privacy-preserving network. We introduce a saliency-aware prior to identify decisive regions for place recognition and propose Saliency-Aware Prior Guided Perturbation Optimization (SAP-PO) to selectively optimize perturbation generation in these areas. To enhance imperceptibility, we further optimize perturbations in the frequency domain, meticulously refining high-frequency components of perturbations while preserving low-frequency structures essential for human perception. Extensive experiments on multiple benchmark datasets and on various black-box VPR models verify that our method outperforms existing SOTA methods. Additionally, our method achieves a 15× speedup in runtime compared to SOTA methods. We also validate the effectiveness of our method based on commercial APIs, including Google and Microsoft Bing, demonstrating the practical …
Poster
Nuo Chen · Chao Xiao · Yimian Dai · Shiman He · Miao Li · Wei An
[ Exhibit Hall I ]
Abstract
Small object detection (SOD) in anti-UAV tasks is a challenging problem due to the small size of UAVs and complex backgrounds. Traditional frame-based cameras struggle to detect small objects in complex environments due to their low frame rates, limited dynamic range, and data redundancy. Event cameras, with microsecond temporal resolution and high dynamic range, provide a more effective solution for SOD. However, existing event-based object detection datasets are limited in scale, feature large target sizes, and lack diverse backgrounds, making them unsuitable for SOD benchmarks. In this paper, we introduce an Event-based Small object detection (EVSOD) dataset (namely EV-UAV), the first large-scale, highly diverse benchmark for anti-UAV tasks. It includes 147 sequences with over 2.3 million event-level annotations, featuring extremely small targets (averaging 6.8 × 5.4 pixels) and diverse scenarios such as urban clutter and extreme lighting conditions. Furthermore, based on the observation that small moving targets form continuous curves in spatiotemporal event point clouds, we propose Event based Sparse Segmentation Network (EV-SpSegNet), a novel baseline for event segmentation in point cloud space, along with a Spatiotemporal Correlation (STC) loss that leverages motion continuity to guide the network in retaining target events. Extensive experiments on the EV-UAV dataset demonstrate the …
Poster
Hanxiao Jiang · Hao-Yu Hsu · Kaifeng Zhang · Hsin-Ni Yu · Shenlong Wang · Yunzhu Li
[ Exhibit Hall I ]
Abstract
Creating a physical digital twin of a real-world object has immense potential in robotics, content creation, and XR. In this paper, we present PhysTwin, a novel framework that uses sparse videos of dynamic objects in interaction to produce a photo- and physically realistic, real-time interactive virtual replica. Our approach centers on two key components: (1) a physics-informed representation that combines spring-mass models for realistic physical simulation, generative shape models for geometry, and Gaussian splats for rendering, and (2) a novel multi-stage optimization-based inverse modeling framework that reconstructs complete geometry, infers dense physical properties, and replicates realistic appearance from videos. Our method integrates an inverse physics framework with visual perception cues, enabling high-fidelity reconstruction even from partial, occluded, and limited viewpoints. PhysTwin supports modeling various deformable objects, including ropes, stuffed animals, cloth, and delivery packages. Experiments show that PhysTwin outperforms competing methods in reconstruction, rendering, future prediction, and simulation under novel interactions. We further demonstrate its applications in interactive real-time simulation and model-based robotic motion planning. (See our supplement webpage for all videos and demos.)
Poster
Xinli Xu · Wenhang Ge · Dicong Qiu · ZhiFei Chen · Dongyu Yan · Zhuoyun LIU · Haoyu Zhao · hanfeng Zhao · Shunsi Zhang · Junwei Liang · Ying-Cong Chen
[ Exhibit Hall I ]
Abstract
Estimating physical properties for visual data is a crucial task in computer vision, graphics, and robotics, underpinning applications such as augmented reality, physical simulation, and robotic grasping. However, this area remains under-explored due to the inherent ambiguities in physical property estimation. To address these challenges, we introduce GaussianProperty, a training-free framework that assigns physical properties of materials to 3D Gaussians. Specifically, we integrate the segmentation capability of SAM with the recognition capability of GPT-4V(ision) to formulate a global-local physical property reasoning module for 2D images. Then we project the physical properties from multi-view 2D images to 3D Gaussians using a voting strategy. We demonstrate that 3D Gaussians with physical property annotations enable applications in physics-based dynamic simulation and robotic grasping. For physics-based dynamic simulation, we leverage the Material Point Method (MPM) for realistic dynamic simulation. For robot grasping, we develop a grasping force prediction strategy that estimates a safe force range required for object grasping based on the estimated physical properties. Extensive experiments on material segmentation, physics-based dynamic simulation, and robotic grasping validate the effectiveness of our proposed method, highlighting its crucial role in understanding physical properties from visual data.
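A voting step of the kind mentioned above can be sketched as a per-Gaussian majority vote over the labels observed in each view. This is only an illustration under stated assumptions: the `(gaussian_id, label)` input format, the function name, and the material labels are hypothetical, and how a Gaussian is associated with a 2D segment in each view is not shown.

```python
from collections import Counter


def vote_labels(observations):
    """Assign each Gaussian the label it received most often across views.

    observations: iterable of (gaussian_id, label) pairs, one per view in which
    the Gaussian falls inside a labeled 2D segment.
    """
    votes = {}
    for gaussian_id, label in observations:
        votes.setdefault(gaussian_id, Counter())[label] += 1
    # On ties, Counter.most_common keeps the label that was seen first.
    return {gid: counter.most_common(1)[0][0] for gid, counter in votes.items()}


# Gaussian 0 is seen in three views, Gaussian 1 in one view.
observations = [(0, "wood"), (0, "metal"), (0, "wood"), (1, "plastic")]
labels = vote_labels(observations)
```

Majority voting makes the 3D label robust to a minority of views where the 2D prediction is wrong, at the cost of ignoring any per-view confidence.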
Poster
Gang Fu
[ Exhibit Hall I ]
Abstract
The Dichromatic Reflection Model (DRM), a widely used physical image formation model, has been extensively applied to specular highlight removal. However, traditional DRM solvers fail to effectively recover the missing content underneath specular highlights and are prone to incur visual artifacts. Additionally, existing deep learning-based methods do not exploit the underlying variables in DRM; instead, they primarily learn to translate an input image into its diffuse image (and specular residue image). As a result, their performance remains somewhat limited. To overcome these issues, we propose a neural DRM solver for specular highlight removal. Our pipeline for the solver consists of three networks: Highlight Detection Network (HDNet), Alpha-chrom Estimation Network (ACENet), and Refinement Network (RNet). Specifically, HDNet is first used to detect specular highlights. Meanwhile, leveraging multi-level contextual contrasted features from HDNet, ACENet estimates the underlying variables in DRM. Using these estimates, our new reconstruction models generate specular-free and specular residue images. To bridge the domain gap between color spaces, we additionally introduce RNet to refine the results. Extensive experiments on various datasets demonstrate that our neural solver is superior to previous traditional solvers as well as deep learning-based methods.
Poster
Haowen Bai · Jiangshe Zhang · Zixiang Zhao · Lilun Deng · Yukun Cui · Shuang Xu
[ Exhibit Hall I ]
Abstract
Multi-exposure image fusion consolidates multiple low dynamic range images of the same scene into a single high dynamic range image. Retinex theory, which separates image illumination from scene reflectance, is naturally adopted to ensure consistent scene representation and effective information fusion across varied exposure levels. However, the conventional pixel-wise multiplication of illumination and reflectance inadequately models the glare effect induced by overexposure. To better adapt this theory for multi-exposure image fusion, we introduce an unsupervised and controllable method termed Retinex-MEF. Specifically, our method decomposes multi-exposure images into separate illumination components and a shared reflectance component, while effectively modeling the glare induced by overexposure. Employing a bidirectional loss constraint to learn the common reflectance component, our approach effectively mitigates the glare effect. Furthermore, we establish a controllable exposure fusion criterion, enabling global exposure adjustments while preserving contrast, thus overcoming the constraints of fixed-level fusion. A series of experiments across multiple datasets, including underexposure-overexposure fusion, exposure control fusion, and homogeneous extreme exposure fusion, demonstrate the effective decomposition and flexible fusion capability of our model. The code will be released.
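For reference, the conventional Retinex formation model that the abstract takes as its starting point is a pixel-wise product of illumination and reflectance, with one illumination map per exposure and a single reflectance map shared across exposures. The sketch below shows only that baseline model on tiny nested-list images; the function names and values are assumptions, and the glare modeling that Retinex-MEF adds is not represented.

```python
def retinex_compose(illumination, reflectance):
    """Pixel-wise product of an illumination map and a reflectance map (same shape)."""
    return [
        [l * r for l, r in zip(l_row, r_row)]
        for l_row, r_row in zip(illumination, reflectance)
    ]


def simulate_exposures(illuminations, reflectance):
    """Render one image per illumination map from a single shared reflectance."""
    return [retinex_compose(illum, reflectance) for illum in illuminations]


# One shared 1x2 reflectance map, rendered under a dark and a bright illumination.
shared_reflectance = [[0.4, 0.8]]
under_exposed, over_exposed = simulate_exposures(
    [[[0.5, 0.5]], [[1.0, 1.0]]], shared_reflectance
)
```

Because the reflectance is shared, any difference between the rendered exposures comes from the illumination maps alone, which is the property a shared-reflectance decomposition relies on.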
Poster
Yuqi Li · Chuanguang Yang · Hansheng Zeng · Zeyu Dong · Zhulin An · Yongjun Xu · Yingli Tian · Hao Wu
[ Exhibit Hall I ]
Abstract
Spatiotemporal forecasting tasks, such as traffic flow, combustion dynamics, and weather forecasting, often require complex models that suffer from low training efficiency and high memory consumption. This paper proposes a lightweight framework, Spectral Decoupled Knowledge Distillation, which transfers the multi-scale spatiotemporal representations from a complex teacher model to a more efficient lightweight student network. The teacher model follows an encoder-latent evolution-decoder architecture, where its latent evolution module decouples high-frequency details (e.g., instant traffic fluctuations) and low-frequency trends (e.g., long-term weather evolution) using convolution (local high-frequency extractor) and Transformer (global low-frequency modeler). However, the multi-layer convolution and deconvolution structures result in slow training and high memory usage. To address these issues, we propose a frequency-aligned knowledge distillation strategy, which extracts multi-scale spectral features from the teacher’s latent space, including high- and low-frequency components, to guide the lightweight student model (e.g., ResNet, U-Net) in capturing both local fine-grained variations and global evolution patterns. Experiments show that the student model achieves over 95% of the teacher’s forecasting accuracy while using only 20%-30% of its memory, with training speed improved by more than 50%. Our theoretical analysis reveals that the frequency-domain decoupling enables the student model to capture long-range dependencies without the need …
Poster
Zixuan Hu · Dongxiao Li · Xinzhu Ma · SHIXIANG TANG · Xiaotong Li · Wenhan Yang · LINGYU DUAN
[ Exhibit Hall I ]
Abstract
Accurate monocular 3D object detection (M3OD) is pivotal for safety-critical applications like autonomous driving, yet its reliability deteriorates significantly under real-world domain shifts caused by environmental or sensor variations. To address these shifts, Test-Time Adaptation (TTA) methods have emerged, enabling models to adapt to target distributions during inference. While prior TTA approaches recognize the positive correlation between low uncertainty and high generalization ability, they fail to address the dual uncertainty inherent to M3OD: semantic uncertainty (ambiguous class predictions) and geometric uncertainty (unstable spatial localization). To bridge this gap, we propose Dual Uncertainty Optimization (**DUO**), the first TTA framework designed to jointly minimize both uncertainties for robust M3OD. Through a convex optimization lens, we introduce an innovative convex structure of the focal loss and further derive a novel conjugate loss, enabling label-agnostic uncertainty weighting and balanced learning for high-uncertainty objects. In parallel, we design a semantic-aware normal field constraint that preserves geometric coherence in regions with clear semantic cues, reducing uncertainty from the unstable 3D representation. This dual-branch mechanism forms a complementary loop: enhanced spatial perception improves semantic classification, and robust semantic predictions further refine spatial understanding. Extensive experiments demonstrate the superiority of DUO over existing methods across various datasets and …
Poster
Dimitrios Mallis · Ahmet Karadeniz · Sebastian Cavada · Danila Rukhovich · Niki Foteinopoulou · Kseniya Cherenkova · Anis Kacem · Djamila Aouada
[ Exhibit Hall I ]
Abstract
We propose CAD-Assistant, a general-purpose CAD agent for AI-assisted design. Our approach is based on a powerful Vision and Large Language Model (VLLM) as a planner and a tool-augmentation paradigm using CAD-specific tools. CAD-Assistant addresses multimodal user queries by generating actions that are iteratively executed on a Python interpreter equipped with the FreeCAD software, accessed via its Python API. Our framework is able to assess the impact of generated CAD commands on geometry and adapts subsequent actions based on the evolving state of the CAD design. We consider a wide range of CAD-specific tools including a sketch image parameterizer, rendering modules, a 2D cross-section generator, and other specialized routines. CAD-Assistant is evaluated on multiple CAD benchmarks, where it outperforms VLLM baselines and supervised task-specific methods. Beyond existing benchmarks, we qualitatively demonstrate the potential of tool-augmented VLLMs as general-purpose CAD solvers across diverse workflows.
Poster
Edgar Sucar · Zihang Lai · Eldar Insafutdinov · Andrea Vedaldi
[ Exhibit Hall I ]
Abstract
DUSt3R has recently shown that one can reduce many tasks in multi-view geometry, including estimating camera intrinsics and extrinsics, reconstructing the scene in 3D, and establishing image correspondences, to the prediction of a pair of viewpoint-invariant point maps, i.e., pixel-aligned point clouds defined in a common reference frame. This formulation is elegant and powerful, but it is unable to tackle dynamic scenes. To address this challenge, we introduce the concept of Dynamic Point Maps (DPM), extending standard point maps to support 4D tasks such as motion segmentation, scene flow estimation, 3D object tracking, and 2D correspondence. Our key intuition is that, when time is introduced, there are several possible spatial and time references that can be used to define the point maps. We identify a minimal subset of such combinations that can be regressed by a network to solve the subtasks mentioned above. We train a DPM predictor on a mixture of synthetic and real data and evaluate it across diverse benchmarks for video depth prediction, dynamic point cloud reconstruction, 3D scene flow and object pose tracking, achieving state-of-the-art performance.
Poster
Yufei Zhang · Zijun Cui · Jeffrey Kephart · Qiang Ji
[ Exhibit Hall I ]
Abstract
While 3D hand reconstruction from monocular images has made significant progress, generating accurate and temporally coherent motion estimates from video sequences remains challenging, particularly during complex hand-object interactions. In this paper, we present a novel 3D hand motion recovery framework that enhances image-based reconstructions through a diffusion-based and physics-augmented motion refinement model. Our model captures the distribution of refined motion estimates conditioned on initial ones, generating improved sequences through an iterative denoising process. Instead of relying on scarce annotated video data, we train our model only using existing motion capture data without images. Moreover, we identify valuable intuitive physics knowledge during hand-object interactions, including key motion states and their associated motion constraints. We effectively integrate these physical insights into our diffusion model to improve its performance. Extensive experiments demonstrate that our approach significantly improves various frame-wise reconstruction methods, achieving state-of-the-art (SOTA) performance on existing benchmarks.
Poster
Vahid Balazadeh · Mohammadmehdi Ataei · Hyunmin Cheong · Amir Khasahmadi · Rahul Krishnan
[ Exhibit Hall I ]
Abstract
Physical reasoning, which involves interpreting object behaviors within dynamic environments, remains a significant challenge for Vision-Language Models (VLMs). The limitations in physical reasoning arise from an inability to translate learned knowledge into predictions about physical behavior. We perform a careful study to show how continual fine-tuning can mitigate this issue. However, fine-tuning is expensive for large models and impractical to repeatedly perform for every task. This necessitates the creation of modular and scalable ways to teach VLMs about physical reasoning. To that end, we introduce Physics Context Builders (PCBs), a novel modular framework where specialized VLMs are fine-tuned to generate detailed physical scene descriptions. These can be used as physical contexts for larger VLMs to enhance their reasoning capabilities. PCBs enable the separation of visual perception from reasoning, allowing us to analyze their relative contributions to physical understanding. We perform careful experiments on CLEVRER and on Falling Tower, a stability detection dataset with both simulated and real-world scenes, to demonstrate that PCBs provide substantial performance improvements, increasing average accuracy by up to 13.8% on complex physical reasoning tasks. Notably, PCBs show strong Sim2Real transfer, successfully generalizing from simulated training data to real-world scenes. Our work demonstrates that enhancing visual perception through …
Poster
Xiuyu Wu · Xinhao Wang · Xiubin Zhu · Lan Yang · Jiyuan Liu · Xingchen Hu
[ Exhibit Hall I ]
Abstract
Due to the arbitrary orientation of objects in aerial images, rotation equivariance is a critical property for aerial object detectors. However, recent studies on rotation-equivariant aerial object detection remain scarce. Most detectors rely on data augmentation to enable models to learn approximately rotation-equivariant features. A few detectors have constructed rotation-equivariant networks, but due to the breaking of strict rotation equivariance by typical downsampling processes, these networks only achieve approximately rotation-equivariant backbones. Whether strict rotation equivariance is necessary for aerial image object detection remains an open question. In this paper, we implement a strictly rotation-equivariant backbone and neck network with a more advanced network structure and compare it with approximately rotation-equivariant networks to quantitatively measure the impact of rotation equivariance on the performance of aerial image detectors. Additionally, leveraging the inherently grouped nature of rotation-equivariant features, we propose a multi-branch head network that reduces the parameter count while improving detection accuracy. Based on the aforementioned improvements, this study proposes the Multi-branch head rotation-equivariant single-stage Detector (MessDet), which achieves state-of-the-art performance on the challenging aerial image datasets DOTA-v1.0, DOTA-v1.5 and DIOR-R with an exceptionally low parameter count. The code will be made publicly available.
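The claim that typical downsampling breaks strict rotation equivariance can be checked on a toy grid. The sketch below is an illustration under assumptions made here (a 4×4 grid, a 90° clockwise rotation, and stride-2 subsampling as the downsampling step); it is not the MessDet architecture. An operation `f` is rotation-equivariant when `f(rot(x)) == rot(f(x))`.

```python
def rot90(grid):
    """Rotate a square grid 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]


def square(grid):
    """Pointwise operation: square every entry."""
    return [[v * v for v in row] for row in grid]


def subsample(grid):
    """Stride-2 subsampling: keep every other row and column, starting at index 0."""
    return [row[::2] for row in grid[::2]]


grid = [[r * 4 + c for c in range(4)] for r in range(4)]

# A pointwise operation commutes with rotation.
pointwise_ok = square(rot90(grid)) == rot90(square(grid))

# Stride-2 subsampling on an even-sized grid does not: the retained
# sample positions are not mapped onto themselves by the rotation.
subsample_ok = subsample(rot90(grid)) == rot90(subsample(grid))
```

On this grid `pointwise_ok` is true and `subsample_ok` is false, which is the kind of mismatch that leaves a backbone only approximately equivariant.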
Poster
Shuo Zhang · Chen Gao · Youfang Lin
[ Exhibit Hall I ]
Abstract
Light Field (LF) images captured under low illumination conditions typically exhibit low quality. Recent learning-based methods for low-light LF enhancement are generally tailored to specific illumination inputs, limiting their performance in real-world scenes. Moreover, how to maintain the inherent view consistency in the enhanced images also remains a difficult problem. In this paper, we propose to explore the view consistency for scene-adaptive low-light LF enhancement. We first analyze the view consistency for LF illumination maps and design a self-supervised view-consistent loss to keep the consistency between the illumination maps of different views in LFs. To enhance the model's perception of illumination, we combine both global and local information to estimate the illumination map, which is easily plugged into other models. Subsequently, we use the illumination maps to light up the low-light LF images and restore the corruption to produce the final enhanced image. Extensive experiments demonstrate that our View-Consistency Network (VCNet) outperforms state-of-the-art methods on real-world low-light LF datasets in both fixed lighting conditions and dynamic lighting conditions. Our proposed illumination adjustment is also shown to comprehensively improve the performance of existing methods in terms of both image quality and view consistency.
Poster
Yash Garg · Saketh Bachu · Arindam Dutta · Rohit Lal · Sarosij Bose · Calvin-Khang Ta · M. Salman Asif · Amit Roy-Chowdhury
[ Exhibit Hall I ]
Abstract
Human pose and shape (HPS) estimation methods have been extensively studied, with many demonstrating high zero-shot performance on in-the-wild images and videos. However, these methods often struggle in challenging scenarios involving complex human poses or significant occlusions. Although some studies address 3D human pose estimation under occlusion, they typically evaluate performance on datasets that lack realistic or substantial occlusions, e.g., most existing datasets introduce occlusions with random patches over the human or clipart-style overlays, which may not reflect real-world challenges. To bridge this gap in realistic occlusion datasets, we introduce a novel benchmark dataset, VOccl3D, a $\textbf{V}$ideo-based human $\textbf{Occ}$lusion dataset with $\textbf{3D}$ body pose and shape annotations. Inspired by works such as AGORA and BEDLAM, we constructed this dataset using advanced computer graphics rendering techniques, incorporating diverse real-world occlusion scenarios, clothing textures, and human motions. Additionally, we fine-tuned recent HPS methods, CLIFF and BEDLAM-CLIFF, on our dataset, demonstrating significant qualitative and quantitative improvements across multiple public datasets, as well as on the test split of our dataset, while comparing its performance with other state-of-the-art methods. Furthermore, we leveraged our dataset to enhance human detection performance under occlusion by fine-tuning an existing object detector, YOLO11, thus leading to a robust end-to-end …
Poster
Jiahao Zhang · Zongli Jiang · Gang Wang · Jinli Zhang · Yixin Wei · Liang Li · Yizheng Wang
[ Exhibit Hall I ]
Abstract
Tracking flying drones in infrared videos is a crucial yet challenging task. Existing drone trackers and datasets have limitations in dealing with and characterizing tiny targets ($\leq$20×20 pixels) against highly complex backgrounds. To tackle this issue, we have developed a large-scale benchmark for tiny drone tracking in infrared videos (TDTIV), which comprises 290k frames and 280k manually annotated bounding boxes. Unlike traditional trackers that primarily rely on appearance matching, we introduce a novel method called Motion-Centric Adaptive Tracking (MCATrack), which initially employs a magnocell-inspired motion response to enhance the local signal-to-noise ratio of tiny target regions while suppressing complex clutter. Moreover, we design a Dynamic Cross-Guided module that integrates both initial and updated target features to address pose variations in long-term tracking. This module captures the latest target information to generate highly relevant candidate regions and refines them through precise optimization to achieve more accurate tracking results. Extensive experiments performed on the TDTIV and the well-recognized Anti-UAV 410 datasets have demonstrated the superiority of MCATrack over state-of-the-art competing trackers. The codes along with the benchmark will be made publicly available.
Poster
Jinhao Duan · Fei Kong · Hao Cheng · James Diffenderfer · Bhavya Kailkhura · Lichao Sun · Xiaofeng Zhu · Xiaoshuang Shi · Kaidi Xu
[ Exhibit Hall I ]
Abstract
Object Hallucination (OH) has been acknowledged as one of the major trustworthiness challenges in Large Vision-Language Models (LVLMs). Recent advancements in Large Language Models (LLMs) indicate that internal states, such as hidden states, encode the “overall truthfulness” of generated responses. However, it remains under-explored how internal states in LVLMs function and whether they could serve as “per-token” hallucination indicators, which is essential for mitigating OH. In this paper, we first conduct an in-depth exploration of LVLM internal states in relation to OH issues and discover that (1) LVLM internal states are high-specificity per-token indicators of hallucination behaviors. Moreover, (2) different LVLMs encode universal patterns of hallucinations in common latent subspaces, indicating that there exist “generic truthful directions” shared by various LVLMs. Based on these discoveries, we propose Truthful-Guided Pre-Intervention (TruthPrInt) that first learns the truthful direction of LVLM decoding and then applies truthful-guided inference-time intervention during LVLM decoding. We further propose ComnHallu to enhance both cross-LVLM and cross-data hallucination detection transferability by constructing and aligning hallucination latent subspaces. We evaluate TruthPrInt in extensive experimental settings, including in-domain and out-of-domain scenarios, over popular LVLMs and OH benchmarks. Experimental results indicate that TruthPrInt significantly outperforms state-of-the-art methods in OH mitigation.
Poster
Johannes Jakubik · Felix Yang · Benedikt Blumenstiel · Erik Scheurer · Rocco Sedona · Stefano Maurogiovanni · Valerio Marsocci · Nikolaos Dionelis · Jente Bosmans · Niklas Kopp · Rahul Ramachandran · Paolo Fraccaro · Thomas Brunschwiler · Gabriele Cavallaro · Juan Moreno · Nicolas Longépé
[ Exhibit Hall I ]
Abstract
We present TerraMind, the first any-to-any generative, multimodal foundation model for Earth observation (EO). Unlike other multimodal models, TerraMind is pretrained on dual-scale representations combining both token-level and pixel-level data across modalities. On a token level, TerraMind encodes high-level contextual information to learn cross-modal relationships, while on a pixel level, TerraMind leverages fine-grained representations to capture critical spatial nuances. We pretrained TerraMind on nine geospatial modalities of a global, large-scale dataset. In this paper, we demonstrate that (i) TerraMind's dual-scale early fusion approach unlocks a range of zero-shot and few-shot applications for Earth observation, (ii) TerraMind introduces "thinking in modalities" (TiM)---the capability of generating additional artificial data during finetuning and inference to improve the model output---and (iii) TerraMind achieves beyond state-of-the-art performance in community-standard benchmarks for EO like PANGAEA. The pretraining dataset, the model weights, and our code will be open-sourced under a permissive license.
Poster
Erik Daxberger · Nina Wenzel · David Griffiths · Haiming Gang · Justin Lazarow · Gefen Kohavi · Kai Kang · Marcin Eichner · Yinfei Yang · Afshin Dehghan · Peter Grasch
[ Exhibit Hall I ]
Abstract
Multimodal large language models (MLLMs) excel at 2D visual understanding but remain limited in their ability to reason about 3D space. In this work, we leverage large-scale high-quality 3D scene data with open-set annotations to introduce 1) a novel supervised fine-tuning dataset and 2) a new evaluation benchmark, focused on indoor scenes. Our Cubify Anything VQA (CA-VQA) data covers diverse spatial tasks including spatial relationship prediction, metric size and distance estimation, and 3D grounding. We show that CA-VQA enables us to train MM-Spatial, a strong generalist MLLM that also achieves state-of-the-art performance on 3D spatial understanding benchmarks, including our own. We show how incorporating metric depth and multi-view inputs (provided in CA-VQA) can further improve 3D understanding, and demonstrate that data alone allows our model to achieve depth perception capabilities comparable to dedicated monocular depth estimation models. We will publish our SFT dataset and benchmark.
Poster
Yi-Ting Shen · Sungmin Eum · Doheon Lee · Rohit Shete · Chiao-Yi Wang · Heesung Kwon · Shuvra Bhattacharyya
[ Exhibit Hall I ]
Abstract
Composed pose retrieval (CPR) enables users to search for human poses by specifying a reference pose and a transition description, but progress in this field is hindered by the scarcity and inconsistency of annotated pose transitions. Existing CPR datasets rely on costly human annotations or heuristic-based rule generation, both of which limit scalability and diversity. In this work, we introduce AutoComPose, the first framework that leverages multimodal large language models (MLLMs) to automatically generate rich and structured pose transition descriptions. Our method enhances annotation quality by structuring transitions into fine-grained body part movements and introducing mirrored/swapped variations, while a cyclic consistency constraint ensures logical coherence between forward and reverse transitions. To advance CPR research, we construct and release two dedicated benchmarks, AIST-CPR and PoseFixCPR, supplementing prior datasets with enhanced attributes. Extensive experiments demonstrate that training retrieval models with AutoComPose yields superior performance over human-annotated and heuristic-based methods, significantly reducing annotation costs while improving retrieval quality. Our work pioneers the automatic annotation of pose transitions, establishing a scalable foundation for future CPR research.
Poster
Dinh-Vinh-Thuy Tran · Ruochen Chen · Shaifali Parashar
[ Exhibit Hall I ]
Abstract
Shape-from-Template (SfT) refers to the class of methods that reconstruct the 3D shape of a deforming object from images/videos using a 3D template. Traditional SfT methods require point correspondences between images and the texture of the 3D template in order to reconstruct 3D shapes from images/videos in real time. Their performance degrades severely when faced with severe occlusions in the images because of the unavailability of correspondences. In contrast, modern SfT methods use a correspondence-free approach by incorporating deep neural networks to reconstruct 3D objects, thus requiring huge amounts of data for supervision. Recent advances use a fully unsupervised or self-supervised approach by combining differentiable physics and graphics to deform a 3D template to match input images. In this paper, we propose an unsupervised SfT method which uses only image observations (color features, gradients, and silhouettes) along with a mesh inextensibility constraint to reconstruct at a $400\times$ faster pace than the best-performing unsupervised SfT. Moreover, when it comes to generating finer details and handling severe occlusions, our method outperforms the existing methodologies by a large margin. Code will be released upon acceptance.
Poster
shengyuan zhang · An Zhao · Ling Yang · Zejian Li · Chenye Meng · Haoran Xu · Tianrun Chen · AnYang Wei · Perry GU · Lingyun Sun
[ Exhibit Hall I ]
Abstract
Diffusion models have been applied to 3D LiDAR scene completion due to their strong training stability and high completion quality. However, the slow sampling speed limits the practical application of diffusion-based scene completion models since autonomous vehicles require an efficient perception of surrounding environments. This paper proposes a novel distillation method tailored for 3D LiDAR scene completion models, dubbed $\textbf{ScoreLiDAR}$, which achieves efficient yet high-quality scene completion. ScoreLiDAR enables the distilled model to sample in significantly fewer steps after distillation. To improve completion quality, we also introduce a novel $\textbf{Structural Loss}$, which encourages the distilled model to capture the geometric structure of the 3D LiDAR scene. The loss contains a scene-wise term constraining the holistic structure and a point-wise term constraining the key landmark points and their relative configuration. Extensive experiments demonstrate that ScoreLiDAR significantly accelerates the completion time from 30.55 to 5.37 seconds per frame ($>$5$\times$) on SemanticKITTI and achieves superior performance compared to state-of-the-art 3D LiDAR scene completion models.
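The reported speed-up follows directly from the two per-frame timings in the abstract; the short check below only reproduces that arithmetic, with variable names chosen here for illustration.

```python
# Per-frame completion times quoted in the abstract (seconds).
baseline_seconds = 30.55
distilled_seconds = 5.37

# Ratio of the two timings, i.e. how many times faster the distilled model is.
speedup = baseline_seconds / distilled_seconds
print(f"speed-up: {speedup:.2f}x")
```

The ratio is roughly 5.7, consistent with the "$>$5$\times$" figure stated above.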
Poster
Ruonan Yu · Songhua Liu · Zigeng Chen · Jingwen Ye · Xinchao Wang
[ Exhibit Hall I ]
Abstract
Dataset distillation or condensation aims to condense a large-scale training dataset into a much smaller synthetic one such that the training performance of distilled and original sets on neural networks is similar. Although the number of training samples can be reduced substantially, current state-of-the-art methods heavily rely on enormous soft labels to achieve satisfactory performance. As a result, the required storage can be comparable even to original datasets, especially for large-scale ones. To solve this problem, instead of storing these heavy labels, we propose a novel label-lightening framework termed HeLlO aiming at effective image-to-label projectors, with which synthetic labels can be directly generated online from synthetic images. Specifically, to construct such projectors, we leverage prior knowledge in open-source foundation models, e.g., CLIP, and introduce a LoRA-like fine-tuning strategy to mitigate the gap between pre-trained and target distributions, so that original models for soft-label generation can be distilled into a group of low-rank matrices. Moreover, an effective image optimization method is proposed to further mitigate the potential error between the original and distilled label generators. Extensive experiments show that our method significantly reduces the storage cost to merely 0.001% compared to full soft-label storage methods while achieving comparable performance to state-of-the-art …
Poster
Yung-Hsu Yang · Luigi Piccinelli · Mattia Segu · Siyuan Li · Rui Huang · Yuqian Fu · Marc Pollefeys · Hermann Blum · Zuria Bauer
[ Exhibit Hall I ]
Abstract
Monocular 3D object detection is valuable for various applications such as robotics and AR/VR. Existing methods are confined to closed-set settings, where the training and testing sets consist of the same scenes and/or object categories. However, real-world applications often introduce new environments and novel object categories, posing a challenge to these methods. In this paper, we address monocular 3D object detection in an open-set setting and introduce the first end-to-end 3D Monocular Open-set Object Detector (3D-MOOD). We propose to lift the open-set 2D detection into 3D space through our designed 3D bounding box head, enabling end-to-end joint training for both 2D and 3D tasks to yield better overall performance. We condition the object queries on a geometry prior and overcome the generalization challenge of 3D estimation across diverse scenes. To further improve performance, we design the canonical image space for more efficient cross-dataset training. We evaluate 3D-MOOD on both closed-set settings (Omni3D) and open-set settings (Omni3D → Argoverse 2, ScanNet), and achieve new state-of-the-art results. Code and models will be released.
Poster
Juliette Marrie · Romain Menegaux · Michael Arbel · Diane Larlus · Julien Mairal
[ Exhibit Hall I ]
Abstract
We address the problem of extending the capabilities of vision foundation models such as DINO, SAM, and CLIP, to 3D tasks. Specifically, we introduce a novel method to uplift 2D image features into Gaussian Splatting representations of 3D scenes. Unlike traditional approaches that rely on minimizing a reconstruction loss, our method employs a simpler and more efficient feature aggregation technique, augmented by a graph diffusion mechanism. Graph diffusion refines 3D features, such as coarse segmentation masks, by leveraging 3D geometry and pairwise similarities induced by DINOv2. Our approach achieves performance comparable to the state of the art on multiple downstream tasks while delivering significant speed-ups. Notably, we obtain competitive segmentation results using only generic DINOv2 features, despite DINOv2 not being trained on millions of annotated segmentation masks like SAM. When applied to CLIP features, our method demonstrates strong performance in open-vocabulary object localization tasks, highlighting the versatility of our approach.
Poster
Gwanghyun Kim · Xueting Li · Ye Yuan · Koki Nagano · Tianye Li · Jan Kautz · Se Young Chun · Umar Iqbal
[ Exhibit Hall I ]
Abstract
Estimating accurate and temporally consistent 3D human geometry from videos is a challenging problem in computer vision. Existing methods, primarily optimized for single images, often suffer from temporal inconsistencies and fail to capture fine-grained dynamic details. To address these limitations, we present GeoMan, a novel architecture designed to produce accurate and temporally consistent depth and normal estimations from monocular human videos. GeoMan addresses two key challenges: the scarcity of high-quality 4D training data and the need for metric depth estimation to accurately model human size. To overcome the first challenge, GeoMan employs an image-based model to estimate depth and normals for the first frame of a video, which then conditions a video diffusion model, reframing the video geometry estimation task as an image-to-video generation problem. This design offloads the heavy lifting of geometric estimation to the image model and simplifies the video model’s role to focus on intricate details while using priors learned from large-scale video datasets. Consequently, GeoMan improves temporal consistency and generalizability while requiring minimal 4D training data. To address the challenge of accurate human size estimation, we introduce a root-relative depth representation that retains critical human-scale details and is easier to estimate from monocular inputs, overcoming the …
Poster
Xiaopeng LIN · Yulong Huang · Hongwei Ren · Zunchang Liu · Hongxiang Huang · Yue Zhou · Haotian FU · Bojun Cheng
[ Exhibit Hall I ]
Abstract
Motion deblurring addresses the challenge of image blur caused by camera or scene movement. Event cameras provide motion information that is encoded in the asynchronous event streams. To efficiently leverage the temporal information of event streams, we employ Spiking Neural Networks (SNNs) for motion feature extraction and Artificial Neural Networks (ANNs) for color information processing. Due to the non-uniform distribution and inherent redundancy of event data, existing cross-modal feature fusion methods exhibit certain limitations. Inspired by the visual attention mechanism in the human visual system, this study introduces a bioinspired dual-drive hybrid network (BDHNet). Specifically, the Neuron Configurator Module (NCM) is designed to dynamically adjust neuron configurations based on cross-modal features, thereby focusing the spikes in blurry regions and adapting to varying blurry scenarios dynamically. Additionally, the Region of Blurry Attention Module (RBAM) is introduced to generate a blurry mask in an unsupervised manner, effectively extracting motion clues from the event features and guiding more accurate cross-modal feature fusion. Extensive subjective and objective evaluations demonstrate that our method outperforms current state-of-the-art methods on both synthetic and real-world datasets.
Poster
Zekun Qian · Ruize Han · Junhui Hou · Linqi Song · Wei Feng
[ Exhibit Hall I ]
Abstract
Open-vocabulary multi-object tracking (OVMOT) represents a critical new challenge involving the detection and tracking of diverse object categories in videos, encompassing both seen categories (base classes) and unseen categories (novel classes). This issue amalgamates the complexities of open-vocabulary object detection (OVD) and multi-object tracking (MOT). Existing approaches to OVMOT often merge OVD and MOT methodologies as separate modules, not fully leveraging the video information. In this work, we propose VOVTrack, a novel method that integrates object states relevant to MOT and video-centric training to address this challenge from a video analysis standpoint. First, we consider the tracking-related state of the objects during tracking and propose a new prompt-guided attention mechanism for more accurate detection (localization and classification) of time-varying objects. Subsequently, we leverage raw video data without annotations for training by formulating a self-supervised object similarity learning technique to facilitate temporal object tracking (association). Experimental results underscore that VOVTrack outperforms existing methods, establishing itself as a state-of-the-art solution for the open-vocabulary tracking task.
Poster
Mostofa Rafid Uddin · Jana Armouti · Min Xu
[ Exhibit Hall I ]
Abstract
Identifying different protein compositions and conformations from microscopic images of protein mixtures is a challenging open problem. We address this problem through disentangled representation learning, where separating protein compositions and conformations in an intermediate latent space enables accurate identification. Since conformations manifest as transformations that cause subtle changes in voxel space and compositions correspond to content invariant to these transformations, the task reduces to content-transformation disentangling. However, existing content-transformation disentanglement methods require an explicit parametric form for the transformation, which conformation transformations lack, making those methods unsuitable. To overcome this limitation, we propose DualContrast, a novel contrastive learning-based method that implicitly parameterizes both transformation and content and disentangles them. DualContrast achieves this by generating positive and negative pairs for content and transformation in both data and latent spaces. We demonstrate that existing contrastive approaches fail under similar implicit parameterization, underscoring the necessity of our method. We validate our claims through extensive experiments on 3D microscopic images of protein mixtures and additional shape-focused datasets beyond microscopy. Finally, we achieve the first completely unsupervised identification of different protein compositions and conformations in 3D microscopic images of protein mixtures.
Poster
Liying Yang · Chen Liu · Zhenwei Zhu · Ajian Liu · Hui Ma · Jian Nong · Yanyan Liang
[ Exhibit Hall I ]
Abstract
Recently, the generation of dynamic 3D objects from a video has shown impressive results. Existing methods directly optimize Gaussians using whole information in frames. However, when dynamic regions are interwoven with static regions within frames, particularly if the static regions account for a large proportion, existing methods often overlook information in dynamic regions and are prone to overfitting on static regions. This leads to producing results with blurry textures. We consider that decoupling dynamic-static features to enhance dynamic representations can alleviate this issue. Thus, we propose a dynamic-static feature decoupling module (DSFD). Along temporal axes, it regards the regions of current frame features that possess significant differences relative to reference frame features as dynamic features. Conversely, the remaining parts are the static features. Then, we acquire decoupled features driven by dynamic features and current frame features. Moreover, to further enhance the dynamic representation of decoupled features from different viewpoints and ensure accurate motion prediction, we design a temporal-spatial similarity fusion module (TSSF). Along spatial axes, it adaptively selects similar information of dynamic regions. Hinging on the above, we construct a novel approach, DS4D. Experimental results verify our method achieves state-of-the-art (SOTA) results in video-to-4D. In addition, the experiments on a …
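The dynamic/static split described for DSFD can be illustrated with a toy thresholding rule. This is a sketch under assumptions made here, not the paper's module: the features are plain 2D lists of floats, the function name `decouple` is hypothetical, and the fixed `threshold` stands in for however "significant differences" are actually measured.

```python
def decouple(current, reference, threshold=0.5):
    """Split current-frame features into dynamic and static parts.

    A position counts as dynamic when it differs from the reference frame
    by more than `threshold` (an assumed, hand-picked value).
    """
    mask = [
        [abs(c - r) > threshold for c, r in zip(cur_row, ref_row)]
        for cur_row, ref_row in zip(current, reference)
    ]
    # Keep the feature where the mask is set, zero it elsewhere.
    dynamic = [
        [c if m else 0.0 for c, m in zip(cur_row, mask_row)]
        for cur_row, mask_row in zip(current, mask)
    ]
    # The complement: everything not flagged as dynamic.
    static = [
        [0.0 if m else c for c, m in zip(cur_row, mask_row)]
        for cur_row, mask_row in zip(current, mask)
    ]
    return mask, dynamic, static


reference = [[1.0, 1.0], [1.0, 1.0]]
current = [[1.0, 2.0], [1.1, 0.0]]
mask, dynamic, static = decouple(current, reference)
```

Here only the right-hand column changes by more than the threshold, so it alone is treated as dynamic while the small 0.1 drift in the lower-left entry stays in the static part.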
Poster
Shijie Li · Chunyu Liu · Xun Xu · Si Yong Yeo · Xulei Yang
[ Exhibit Hall I ]
Abstract
Motion forecasting is a crucial component of autonomous driving systems, enabling the generation of accurate and smooth future trajectories to ensure safe navigation to the destination. In previous methods, potential future trajectories are often absent in the scene encoding stage, which may lead to suboptimal outcomes. Additionally, prior approaches typically employ transformer architectures for spatiotemporal modeling of trajectories and map information, which suffer from the quadratic scaling complexity of the transformer architecture. In this work, we propose an interaction-based method, named Future-Aware Interaction Network, that introduces potential future trajectories into scene encoding for a comprehensive traffic representation. Furthermore, a State Space Model (SSM), specifically Mamba, is introduced for both spatial and temporal modeling. To adapt Mamba for spatial interaction modeling, we propose an adaptive reordering strategy that transforms unordered data into a structured sequence. Additionally, Mamba is employed to refine generated future trajectories temporally, ensuring more consistent predictions. These enhancements not only improve model efficiency but also enhance the accuracy and diversity of predictions. We conduct comprehensive experiments on the widely used Argoverse 1 and Argoverse 2 datasets, demonstrating that the proposed method achieves superior performance compared to previous approaches in a more efficient way. The code will be released according …
Poster
Jinxiu Liang · Bohan Yu · Siqi Yang · Haotian Zhuang · Jieji Ren · Peiqi Duan · Boxin Shi
[ Exhibit Hall I ]
Abstract
We present EventUPS, the first uncalibrated photometric stereo method using event cameras, neuromorphic sensors that asynchronously detect brightness changes with microsecond resolution. Frame-based uncalibrated photometric stereo methods impose high bandwidth demands, limiting their applicability in dynamic scenes. They also require dense image correspondence under varying illumination and cannot be directly applied to event data because of its fundamentally different sensing paradigm. Our approach introduces three key innovations: i) an augmented null space formulation that directly relates each event to constraints on surface normals and lighting, naturally handling ambient illumination; ii) a continuous parameterization of time-varying illumination that bridges asynchronous events to synchronized lighting estimation; iii) a structured lighting approach with known relative geometry that resolves the ambiguity to merely convex-concave uncertainty. We validate EventUPS using a custom-built LED-based lighting system implementing dual-ring and trefoil curve patterns. Extensive experiments on synthetic, semi-real, and real data demonstrate that our method achieves accuracy surpassing its frame-based counterpart while requiring only 5% of the data bandwidth.
Poster
Hsuan-I Ho · Chen Guo · Po-Chen Wu · Ivan Shugurov · Chengcheng Tang · Abhay Mittal · Sizhe An · Manuel Kaufmann · Linguang Zhang
[ Exhibit Hall I ]
Abstract
We introduce PHD, a novel approach for 3D human pose and shape estimation that leverages user identity information from videos to improve pose estimation accuracy and shape consistency. Unlike traditional methods designed to be user-agnostic and optimized for generalization, our pipeline precomputes the body shape and then employs a personalized pose fitting process conditioned on the body shape and input image. We observe that while existing methods commonly improve 2D alignment by refining the pose with constraints derived from the 2D image, the lack of 3D pose prior often reduces pose plausibility, thereby compromising 3D accuracy. To address this, we integrate a body shape-conditioned 3D pose prior, implemented as a Point Diffusion model, to iteratively guide pose fitting via a Point Distillation loss. Our results demonstrate that our 3D pose prior significantly prevents artifacts introduced by 2D-only constraints, which consequently improves the pose accuracy. In addition, our 3D prior-driven fitting method is highly versatile and can be seamlessly combined with state-of-the-art 3D pose estimators to improve pose accuracy.
Poster
Chengxu Liu · Lu Qi · Jinshan Pan · Xueming Qian · Ming-Hsuan Yang
[ Exhibit Hall I ]
Abstract
Unpaired image dehazing has attracted increasing attention due to its flexible data requirements during model training. Dominant methods based on contrastive learning not only introduce haze-unrelated content information, but also ignore haze-specific properties in the frequency domain (i.e., haze-related degradation is mainly manifested in the amplitude spectrum). To address these issues, we propose a novel frequency domain-based diffusion model, named FrDiff, for fully exploiting the beneficial knowledge in unpaired clear data. In particular, inspired by the strong generative ability shown by Diffusion Models (DMs), we tackle the dehazing task from the perspective of frequency domain reconstruction and use DMs to yield an amplitude spectrum consistent with the distribution of clear images. To implement it, we propose an Amplitude Residual Encoder (ARE) to extract the amplitude residuals, which effectively compensates for the amplitude gap from the hazy to clear domains and provides supervision for DM training. In addition, we propose a Phase Correction Module (PCM) to eliminate artifacts by further refining the phase spectrum during dehazing with a simple attention mechanism. Experimental results demonstrate that our FrDiff outperforms other state-of-the-art methods on both synthetic and real-world datasets.
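The frequency-domain view in this abstract can be illustrated with a toy sketch. The code below is not FrDiff; it only shows how a signal splits into an amplitude spectrum and a phase spectrum. It assumes, purely for illustration, that haze acts as a uniform contrast reduction plus a constant offset. All function names are hypothetical, and a naive 1D DFT stands in for the 2D FFT an image pipeline would normally use.

```python
import cmath
import math


def dft(signal):
    """Naive discrete Fourier transform of a sequence of numbers."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Naive inverse discrete Fourier transform."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def split_amplitude_phase(spectrum):
    """Return (amplitude, phase) lists for a complex spectrum."""
    return [abs(c) for c in spectrum], [cmath.phase(c) for c in spectrum]


def combine(amplitude, phase):
    """Rebuild a complex spectrum from amplitude and phase."""
    return [cmath.rect(a, p) for a, p in zip(amplitude, phase)]


# Toy "clear" scanline and a hypothetical hazy version of it
# (assumed degradation: half the contrast plus a constant offset).
clear = [0.1, 0.9, 0.3, 0.7, 0.2, 0.8, 0.4, 0.6]
hazy = [0.5 * v + 0.4 for v in clear]

clear_amp, clear_phase = split_amplitude_phase(dft(clear))
hazy_amp, hazy_phase = split_amplitude_phase(dft(hazy))

# Keep the hazy phase, but substitute the clear amplitude spectrum.
restored = [c.real for c in idft(combine(clear_amp, hazy_phase))]
```

In this toy setting the degradation changes only the amplitude spectrum, so pairing the hazy phase with the clear amplitude recovers the clear signal. Real haze varies spatially, so the separation is only approximate in practice.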
Poster
Zhu Yu · Bowen Pang · Lizhe Liu · Runmin Zhang · Qiang Li · Si-Yuan Cao · Maochun Luo · Mingxia Chen · Sheng Yang · Hui-liang Shen
[ Exhibit Hall I ]
Abstract
This work presents LOcc, an effective and generalizable framework for open-vocabulary occupancy (OVO) prediction. Previous approaches typically supervise the networks through coarse voxel-to-text correspondences via image features as intermediates or noisy and sparse correspondences from voxel-based model-view projections. To alleviate the inaccurate supervision, we propose a semantic transitive labeling pipeline to generate dense and fine-grained 3D language occupancy ground truth. Our pipeline presents a feasible way to dig into the valuable semantic information of images, transferring text labels from images to LiDAR point clouds and ultimately to voxels, to establish precise voxel-to-text correspondences. By replacing the original prediction head of supervised occupancy models with a geometry head for binary occupancy states and a language head for language features, LOcc effectively uses the generated language ground truth to guide the learning of 3D language volume. Through extensive experiments, we demonstrate that our transitive semantic labeling pipeline can produce more accurate pseudo-labeled ground truth, diminishing labor-intensive human annotations. Additionally, we validate LOcc across various architectures, where all models consistently outperform state-of-the-art zero-shot occupancy prediction approaches on the Occ3D-nuScenes dataset. The code for the proposed method is available.
Poster
Ruixuan Cong · Yu Wang · Mingyuan Zhao · Da Yang · Rongshan Chen · Hao Sheng
[ Exhibit Hall I ]
Abstract
Deep learning-based light field image super-resolution methods have witnessed remarkable success in recent years. However, most of them focus only on the encoder design and overlook the importance of the upsampling process in the decoder. Inspired by recent progress on implicit neural representations in the single-image domain, we propose the spatial-epipolar implicit image function (SEIIF), which optimizes the upsampling process to significantly improve performance and supports arbitrary-scale light field image super-resolution. Specifically, SEIIF contains two complementary upsampling patterns. One is the spatial implicit image function (SIIF), which exploits intra-view information in sub-aperture images. The other is the epipolar implicit image function (EIIF), which mines inter-view information in epipolar plane images. By unifying the upsampling step of the two branches, SEIIF additionally introduces cross-branch feature interaction to fully fuse intra-view and inter-view information. Besides, given that the line structure in an epipolar plane image integrates the spatial-angular correlation of the light field, we present an oriented line sampling strategy to precisely aggregate inter-view information. The experimental results demonstrate that our SEIIF can be effectively combined with most encoders and achieves outstanding performance on both fixed-scale and arbitrary-scale light field image super-resolution. Our code will be available upon acceptance.
Poster
Shizun Wang · Zhenxiang Jiang · Xingyi Yang · Xinchao Wang
[ Exhibit Hall I ]
Abstract
Recovering 4D from monocular video, which jointly estimates dynamic geometry and camera poses, is an inherently challenging problem. While recent pointmap-based 3D reconstruction methods (e.g., DUSt3R) have made great progress in reconstructing static scenes, directly applying them to dynamic scenes leads to inaccurate results. This discrepancy arises because moving objects violate multi-view geometric constraints, disrupting the reconstruction. To address this, we introduce **C4D**, a framework that leverages temporal **C**orrespondences to extend existing 3D reconstruction formulation to **4D**. Specifically, apart from predicting pointmaps, C4D captures two types of *correspondences*: *short-term* optical flow and *long-term* point tracking. We train a dynamic-aware point tracker that provides additional mobility information, facilitating the estimation of motion masks to separate moving elements from the static background, thus offering more reliable guidance for dynamic scenes. Furthermore, we introduce a set of dynamic scene optimization objectives to recover per-frame 3D geometry and camera parameters. Simultaneously, the correspondences lift 2D trajectories into smooth 3D trajectories, enabling fully integrated 4D reconstruction. Experiments show that our framework achieves complete 4D recovery and demonstrates strong performance across multiple downstream tasks, including depth estimation, camera pose estimation, and point tracking.
Poster
Yuan Liang · Yang Zhou · Ziming Sun · Tianyi Xiang · Guiqing Li · Shengfeng He
[ Exhibit Hall I ]
Abstract
Depth estimation in dynamic, multi-object scenes remains a major challenge, especially under severe occlusions. Existing monocular models, including foundation models, struggle with instance-wise depth consistency due to their reliance on global regression. We tackle this problem from two key aspects: data and methodology. First, we introduce the Group Instance Depth (GID) dataset, the first large-scale video depth dataset with instance-level annotations, featuring 101,500 frames from real-world activity scenes. GID bridges the gap between synthetic and real-world depth data by providing high-fidelity depth supervision for multi-object interactions. Second, we propose InstanceDepth, the first occlusion-aware depth estimation framework for multi-object environments. Our two-stage pipeline consists of (1) Holistic Depth Initialization, which assigns a coarse scene-level depth structure, and (2) Instance-Aware Depth Rectification, which refines instance-wise depth using object masks, shape priors, and spatial relationships. By enforcing geometric consistency across occlusions, our method sets a new state-of-the-art on the GID dataset and multiple benchmarks.
Poster
AO LI · Jinpeng Liu · Yixuan Zhu · Yansong Tang
[ Exhibit Hall I ]
Abstract
Joint reconstruction of human-object interaction marks a significant milestone in comprehending the intricate interrelations between humans and their surrounding environment. Nevertheless, previous optimization methods often struggle to achieve physically plausible reconstruction results due to the lack of prior knowledge about human-object interactions. In this paper, we introduce ScoreHOI, an effective diffusion-based optimizer that introduces diffusion priors for the precise recovery of human-object interactions. By harnessing the controllability within score-guided sampling, the diffusion model can reconstruct a conditional distribution of human and object pose given the image observation and object feature. During inference, ScoreHOI effectively improves the reconstruction results by guiding the denoising process with specific physical constraints. Furthermore, we propose a contact-driven iterative refinement approach to enhance the contact plausibility and improve the reconstruction accuracy. Extensive evaluations on standard benchmarks demonstrate ScoreHOI’s superior performance over state-of-the-art methods, highlighting its ability to achieve a precise and robust improvement in joint human-object interaction reconstruction.
Poster
Liang Qin · Min Wang · Peiwei Li · Wengang Zhou · Houqiang Li
[ Exhibit Hall I ]
Abstract
Object Goal Navigation (ObjectNav) in unknown environments presents significant challenges, particularly in Open-Vocabulary Mobile Manipulation (OVMM), where robots must efficiently explore large spaces, locate small objects, and accurately position themselves for subsequent manipulation. Existing approaches struggle to meet these demands: rule-based methods offer structured exploration but lack adaptability, while reinforcement learning (RL)-based methods enhance adaptability but fail to ensure effective long-term navigation. Moreover, both approaches often overlook precise stopping positions, which are critical for successful manipulation. To address these challenges, we propose APRR (Active Perception Meets Rule-Guided RL), a two-phase framework that designs a new rule-guided RL policy for the exploration phase and a novel active target perception policy for the last-mile navigation phase. Inspired by human search behavior, our rule-guided RL policy enables efficient and adaptive exploration by combining structured heuristics with learning-based decision-making. In the last-mile navigation phase, we introduce an RL-based policy enhanced with active target perception, allowing the robot to refine its position dynamically based on real-time detection feedback. Experimental results demonstrate that APRR improves the success rate by 13%, significantly outperforming existing methods. Furthermore, real-world experiments validate the practicality and effectiveness of APRR in real-world mobile manipulation scenarios, offering a robust and adaptable solution for precise …
Poster
Jan Skvrna · Lukas Neumann
[ Exhibit Hall I ]
Abstract
Inferring object 3D position and orientation from a single RGB camera is a foundational task in computer vision with many important applications. Traditionally, 3D object detection methods are trained in a fully-supervised setup, requiring LiDAR and vast amounts of human annotations, which are laborious, costly, and do not scale well with the ever-increasing amounts of data being captured. We present a novel method to train a 3D object detector from a single RGB camera without domain-specific human annotations, making orders of magnitude more data available for training. The method uses a newly proposed Local Object Motion Model to disentangle the source of object movement between subsequent frames, is approximately 700 times faster than previous work, and compensates for camera focal length differences to aggregate multiple datasets. The method is evaluated on three public datasets, where, despite using no human labels, it outperforms prior work by a significant margin. It also shows its versatility as a pre-training tool for fully-supervised training and shows that combining pseudo-labels from multiple datasets can achieve comparable accuracy to using human labels from a single dataset.
Poster
Haipeng Li · Tianhao Zhou · Zhanglei Yang · WuYi WuYi · Chen Yan · Zijing Mao · Shen Cheng · Bing Zeng · Shuaicheng Liu
[ Exhibit Hall I ]
Abstract
Estimating 2D camera motion is a fundamental task in computer vision, representing the non-linear projection of 3D rotation and translation onto a 2D plane. Current methods primarily rely on homography-based approaches, which model perspective transformations for planar scenes, or meshflow-based techniques, which utilize grid-based local homographies to accommodate non-linear motion. However, homography is restricted to dominant planes and meshflow’s nonlinear capacity remains limited. To address these challenges, we introduce **CamFlow**, a novel representation that captures non-linear 2D camera motion through the use of hybrid motion bases: 1) physical bases to model essential motion patterns and 2) noisy motion bases to enhance flexibility. In addition, we propose a hybrid probabilistic loss function, leveraging a Laplace distribution to improve robustness and facilitate efficient training. We also design a test-time adaptation strategy to refine motion estimates for video stabilization in unseen video contexts. To evaluate the camera motion, we propose a new benchmark by masking dynamic objects in existing optical flow datasets. Extensive experiments, including zero-shot evaluations across diverse conditions, demonstrate that CamFlow outperforms state-of-the-art homography and meshflow methods in terms of robustness and generalization. Code and dataset will be released upon publication.
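The abstract mentions a probabilistic loss built on a Laplace distribution but does not give its form. As a minimal sketch of why a Laplace likelihood is considered robust, the snippet below compares a plain Laplace negative log-likelihood with a Gaussian one on scalar residuals. It is not CamFlow's hybrid loss; the function names and the fixed scale parameters are assumptions made for illustration.

```python
import math


def laplace_nll(pred, target, scale):
    """Mean negative log-likelihood under Laplace(target, scale)."""
    if scale <= 0:
        raise ValueError("scale must be positive")
    terms = [math.log(2.0 * scale) + abs(p - t) / scale for p, t in zip(pred, target)]
    return sum(terms) / len(terms)


def gaussian_nll(pred, target, sigma):
    """Mean negative log-likelihood under Normal(target, sigma^2)."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    terms = [
        0.5 * math.log(2.0 * math.pi * sigma ** 2) + (p - t) ** 2 / (2.0 * sigma ** 2)
        for p, t in zip(pred, target)
    ]
    return sum(terms) / len(terms)


# Penalty added by a single residual of 10 relative to a perfect prediction.
laplace_penalty = laplace_nll([10.0], [0.0], 1.0) - laplace_nll([0.0], [0.0], 1.0)
gaussian_penalty = gaussian_nll([10.0], [0.0], 1.0) - gaussian_nll([0.0], [0.0], 1.0)
```

The Laplace penalty grows linearly with the residual while the Gaussian penalty grows quadratically, so a few large flow errors (for example from unmasked moving objects) influence the Laplace objective less.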
Poster
Risa Shinoda · Nakamasa Inoue · Hirokatsu Kataoka · Masaki Onishi · Yoshitaka Ushiku
[ Exhibit Hall I ]
Abstract
Precise automated understanding of agricultural tasks such as disease identification is essential for sustainable crop production. Recent advances in vision-language models (VLMs) are expected to further expand the range of agricultural tasks by facilitating human-model interaction through easy, text-based communication. Here, we introduce AgroBench (Agronomist AI Benchmark), a benchmark for evaluating VLMs across seven agricultural topics, covering key areas in agricultural engineering that are relevant to real-world farming. Unlike recent agricultural VLM benchmarks, AgroBench is annotated by expert agronomists. Our AgroBench covers a state-of-the-art range of categories, including 197 crop categories and 682 disease categories, to thoroughly evaluate VLM capabilities. In our evaluation on AgroBench, we reveal that VLMs have room for improvement in fine-grained identification tasks. Notably, in weed identification, most open-source VLMs perform close to random. With our wide range of topics and expert-annotated categories, we analyze the types of errors made by VLMs and suggest potential pathways for future VLM development. Our dataset and code will be available.
Poster
Karhan Kayan · Stamatis Alexandropoulos · Rishabh Jain · Yiming Zuo · Erich Liang · Jia Deng
[ Exhibit Hall I ]
Abstract
We introduce PosedVideo365, a diverse dataset of 365 videos with accurate camera pose. Our dataset bridges the gap between accuracy and data diversity in current SLAM benchmarks by introducing a novel ground truth collection framework that leverages calibration boards and a $360^{\circ}$ camera. We collect indoor, outdoor, and object scanning videos with synchronized monocular and stereo RGB video outputs as well as IMU. We further propose a new scene scale-aware evaluation metric for SLAM based on the optical flow induced by the camera pose estimation error. In contrast to existing metrics such as Average Trajectory Error (ATE), our new metric allows the performance of SLAM methods to be compared across scenes, helping researchers analyze the failure modes of their methods. We also propose a challenging Novel View Synthesis benchmark that covers cases not covered by current NVS benchmarks, such as fully non-Lambertian scenes with $360^{\circ}$ camera trajectories.
Poster
Heng Jia · Na Zhao · Linchao Zhu
[ Exhibit Hall I ]
Abstract
Despite recent advances in feed-forward 3DGS methods, generalizable 3D reconstruction remains challenging, particularly in multi-view correspondence modeling. We present a hybrid framework for multi-view correspondence modeling, which integrates volumetric latent fusion with Transformer-based feature aggregation. Our framework consists of two complementary components: a latent volume that encodes view-invariant correspondences through epipolar geometry, and a camera-aware Transformer conditioned on Plücker coordinates. By combining explicit and implicit feature aggregation mechanisms, our approach enhances generalization while demonstrating accelerated convergence, requiring only half the training steps to achieve results comparable to state-of-the-art methods. Additionally, through comprehensive evaluation, we show that Visual Foundation Models trained with pixel-aligned supervision are more suitable for 3D reconstruction tasks. Our approach supports variable input views, improving reconstruction quality as view count increases while demonstrating robust cross-dataset generalization. Extensive experiments show that our method achieves state-of-the-art performance across multiple benchmarks, with PSNR improvements of 0.59 dB, 1.06 dB, and 0.22 dB on the RealEstate10K, ACID, and DTU datasets, respectively. Code will be released.
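For readers unfamiliar with the Plücker coordinates mentioned above, here is a minimal sketch of one common convention for encoding a camera ray as a 6-vector: the unit direction d followed by the moment o × d. The abstract does not state which convention or normalization the authors use, so the layout and the helper names below are assumptions.

```python
import math


def cross(a, b):
    """Cross product of two 3-vectors given as tuples."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def plucker_ray(origin, direction):
    """Return (d, o x d) as a 6-tuple, with d normalized to unit length."""
    norm = math.sqrt(sum(c * c for c in direction))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    d = tuple(c / norm for c in direction)
    moment = cross(origin, d)
    return d + moment


# The same ray described from two different points along it.
ray_a = plucker_ray((1.0, 2.0, 3.0), (0.0, 0.0, 2.0))
ray_b = plucker_ray((1.0, 2.0, 8.0), (0.0, 0.0, 2.0))
```

Because the moment does not change when the origin slides along the ray, the encoding depends only on the ray itself, which is one reason it is a popular per-pixel camera conditioning signal.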
Poster
Bo Wang · Huiyuan Fu · Zhiye Huang · Siru Zhang · Xin Wang · Huadong Ma
[ Exhibit Hall I ]
Abstract
Exposure correction aims to restore over/under-exposed images to well-exposed ones using a single network. However, existing methods mainly handle non-extreme exposure conditions and struggle with the severe luminance and texture loss caused by extreme exposure. Through a thorough investigation, we find that the lack of high-quality benchmark datasets significantly limits progress in extreme exposure correction. To address this issue, we introduce the first Real-world Extreme Exposure Dataset, REED. By leveraging the burst shooting mode of cameras, we capture image sequences covering a luminance range from extremely dark to extremely bright. To prevent misalignment caused by camera motion and scene changes, we apply cropping and an improved SIFT algorithm to ensure precise alignment. We also propose a novel Context-Guided Luminance-Normalized Iterative Exposure Refinement Network. We employ Contrastive Loss and Luminance Normalizer to disentangle the coupled distribution of over/under-exposed images. In certain cases, luminance alone is insufficient for determining over/under-exposure, so we integrate semantic guidance into the Semantic-aware Exposure Diffusion Model to further enhance luminance and texture restoration. Inspired by the effectiveness of iterative correction in improving color and texture, we introduce the CLIP-Guided Iterative Refinement Strategy. Extensive experiments validate the superiority of our dataset and method. Our dataset and code will be publicly …
Poster
Hai Jiang · Binhao Guan · Zhen Liu · Xiaohong Liu · Jian Yu · Zheng Liu · Songchen Han · Shuaicheng Liu
[ Exhibit Hall I ]
Abstract
Learning-based methods have made promising advances in low-light RAW image enhancement, but their capability in extremely dark scenes, where the environmental illuminance drops as low as 0.0001 lux, remains to be explored due to the lack of corresponding datasets. To this end, we propose a paired-to-paired data synthesis pipeline capable of generating well-calibrated extremely low-light RAW images at three precise illuminance ranges of 0.01-0.1 lux, 0.001-0.01 lux, and 0.0001-0.001 lux, together with high-quality sRGB references, forming a large-scale paired dataset named See-in-the-Extremely-Dark (SIED) to benchmark low-light RAW image enhancement approaches. Furthermore, we propose a diffusion-based framework that leverages the generative ability and intrinsic denoising property of diffusion models to restore visually pleasing results from extremely low-SNR RAW inputs, in which an Adaptive Illumination Correction Module (AICM) and a color consistency loss are introduced to ensure accurate exposure correction and color restoration. Extensive experiments on the proposed SIED and publicly available benchmarks demonstrate the effectiveness of our method. The code and dataset will be released to facilitate future research.
Poster
Zhi Hou · Tianyi Zhang · Yuwen Xiong · Haonan Duan · Hengjun Pu · Ronglei Tong · Chengyang Zhao · Xizhou Zhu · Yu Qiao · Jifeng Dai · Yuntao Chen
[ Exhibit Hall I ]
Abstract
While recent vision-language-action models trained on diverse robot datasets exhibit promising generalization capabilities with limited in-domain data, their reliance on compact action heads to predict discretized or continuous actions constrains adaptability to heterogeneous action spaces. We present Dita, a scalable framework that leverages Transformer architectures to directly denoise continuous action sequences through a unified multimodal diffusion process. Departing from prior methods that condition denoising on fused embeddings via shallow networks, Dita employs in-context conditioning, enabling fine-grained alignment between denoised actions and raw visual tokens from historical observations. This design explicitly models action deltas and environmental nuances. By capitalizing on the Transformer's scalability, Dita effectively unifies cross-embodiment datasets spanning varying camera perspectives, tasks, and action spaces. Evaluations across extensive benchmarks demonstrate state-of-the-art or comparable performance in simulation. Notably, Dita achieves robust real-world adaptation to environmental variances and complex long-horizon tasks through 10-shot finetuning, using only third-person camera inputs. The architecture establishes a versatile, lightweight and open-source baseline for generalist robot policy learning. The code and website are included in the supplementary materials.
Poster
Fengrui Tian · Tianjiao Ding · Jinqi Luo · Hancheng Min · Rene Vidal
[ Exhibit Hall I ]
Abstract
This paper studies the problem of generating an unbounded dynamic scene from a single view, which has wide applications in augmented/virtual reality and robotics. Since the scene changes over time, different generated views need to be consistent with the underlying 3D motions. While previous works learn such consistency by training from multiple views, the generated scene regions are constrained to stay close to the training views, with limited camera movements. To address this issue, we propose DynamicVoyager, which reformulates dynamic scene generation as a scene outpainting process for new dynamic content. As 2D outpainting models can hardly generate 3D consistent motions from only 2D pixels at a single view, we consider pixels as rays to enrich the pixel input with the ray context, so that the 3D motion consistency can be learned from the ray information. More specifically, we first map the single-view video input to a dynamic point cloud with the estimated video depths. Then we render the partial video at a novel view and outpaint the video with ray contexts from the point cloud to generate 3D consistent motions. We employ the outpainted video to update the point cloud, which is used for scene outpainting from …
Poster
Haotian Wang · Aoran Xiao · Xiaoqin Zhang · Meng Yang · Shijian Lu
[ Exhibit Hall I ]
Abstract
Generalizable depth completion enables the acquisition of dense metric depth maps for unseen environments, offering robust perception capabilities for various downstream tasks. However, training such models typically requires large-scale datasets with metric depth labels, which are often labor-intensive to collect. This paper presents PacGDC, a label-efficient technique that enhances geometry diversity with minimal annotation effort for generalizable depth completion. PacGDC builds on insights into inherent 2D-to-3D projection ambiguities and consistencies in object shapes and positions, allowing the synthesis of numerous pseudo geometries for the same visual scene. This process greatly broadens data coverage by manipulating scene scales of the corresponding depth maps. To leverage this property, we propose a novel data synthesis pipeline built upon multiple depth foundation models. These models robustly provide pseudo depth labels with varied scene scales in both local objects and global layouts, while ensuring projection consistency that contributes to generalization. To further diversify geometries, we introduce interpolation and relocation strategies, as well as unlabeled images, extending the coverage beyond the individual use of foundation models. Extensive experiments show that PacGDC achieves remarkable generalizability across multiple benchmarks, excelling in diverse scene semantics/scales and depth sparsity/patterns under both zero-shot and few-shot settings.
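The 2D-to-3D projection ambiguity this abstract builds on can be shown with a small pinhole-camera sketch. This is not the PacGDC pipeline; it only demonstrates, under an assumed ideal pinhole model with hypothetical intrinsics, that uniformly rescaling a scene leaves its image unchanged while its metric depths change.

```python
def project(point, focal, cx, cy):
    """Project a 3D camera-space point (x, y, z) to pixel coordinates."""
    x, y, z = point
    if z <= 0:
        raise ValueError("point must be in front of the camera")
    return (focal * x / z + cx, focal * y / z + cy)


def rescale_scene(points, scale):
    """Uniformly scale every 3D point about the camera center."""
    return [(scale * x, scale * y, scale * z) for x, y, z in points]


# Hypothetical intrinsics and scene points, chosen only for illustration.
FOCAL, CX, CY = 500.0, 320.0, 240.0
scene = [(1.0, 2.0, 4.0), (-0.5, 0.25, 2.0)]
scaled_scene = rescale_scene(scene, 2.5)

pixels = [project(p, FOCAL, CX, CY) for p in scene]
scaled_pixels = [project(p, FOCAL, CX, CY) for p in scaled_scene]

depths = [p[2] for p in scene]
scaled_depths = [p[2] for p in scaled_scene]
```

Both scenes produce the same pixels even though every depth differs by the factor 2.5, so a single image is compatible with many metric depth maps. The abstract describes manipulating scene scales both for local objects and for global layouts; this sketch covers only the global case.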
Poster
mingze sun · Shiwei Mao · Keyi Chen · Yurun Chen · Shunlin Lu · Jingbo Wang · Junting Dong · Ruqi Huang
[ Exhibit Hall I ]
Abstract
Recent advancements in large-scale generative models have significantly improved the quality and diversity of 3D shape generation. However, most existing methods focus primarily on generating static 3D models, overlooking the potential dynamic nature of certain shapes, such as humanoids, animals, and insects. To address this gap, we focus on rigging, a fundamental task in animation that establishes skeletal structures and skinning for 3D models. In this paper, we introduce OmniRig, the first large-scale rigging dataset, comprising 79,499 meshes with detailed skeleton and skinning information. Unlike traditional benchmarks that rely on predefined standard poses (e.g., A-pose, T-pose), our dataset embraces diverse shape categories, styles, and poses. Leveraging this rich dataset, we propose ARMO, a novel rigging framework that utilizes an autoregressive model to predict both joint positions and connectivity relationships in a unified manner. By treating the skeletal structure as a complete graph and discretizing it into tokens, we encode the joints using an auto-encoder to obtain a latent embedding and use an autoregressive model to predict the tokens. A mesh-conditioned latent diffusion model is used to predict the latent embedding for conditional skeleton generation. Our method addresses the limitations of regression-based approaches, which often suffer from error accumulation and suboptimal connectivity estimation. …
Poster
Jiahao Ma · Tianyu Wang · Miaomiao Liu · David Ahmedt Aristizabal · Chuong Nguyen
[ Exhibit Hall I ]
Abstract
Multiview pedestrian detection typically involves two stages: human modeling and pedestrian localization. Human modeling represents pedestrians in 3D space by fusing multiview information, making its quality crucial for detection accuracy. However, existing methods often introduce noise and have low precision. While some approaches reduce noise by fitting on costly multiview 3D annotations, they often struggle to generalize across diverse scenes. To eliminate reliance on human-labeled annotations and accurately model humans, we propose Depth-Consistency Human Modeling (DCHM), a framework designed for consistent depth estimation and multiview fusion in global coordinates. Specifically, our proposed pipeline iteratively achieves multiview depth consistency in sparse-view, large-scale, and crowded scenarios, producing precise point clouds for pedestrian localization. Extensive experiments demonstrate that our method significantly reduces noise during human modeling, outperforming previous state-of-the-art baselines. Additionally, to the best of our knowledge, we are the first to reconstruct pedestrians and perform multiview segmentation in such a challenging setting.
Poster
Weihao Wang · Yu Lan · Mingyu You · Bin He
[ Exhibit Hall I ]
Abstract
3D assembly completion represents a fundamental task in 3D computer vision and robotics. This task aims to retrieve the missing parts from a set of candidates and predict their 6-DoF poses to make the partial assembly complete. However, due to the inherent uncertainty in completion and the similarity among candidates, even humans struggle to achieve precise completion without external guidance. To address this challenge, we introduce an auxiliary image depicting the complete assembly from a specific view. The primary challenge lies in the lack of correspondence or grounding between the partial assembly and the image, leading to ambiguities in identifying missing parts and ineffective guidance for completion. Moreover, this correspondence heavily depends on the view of the image, which, unfortunately, is often unknown in real-world scenarios. To this end, we propose a novel cross-modal 3D assembly completion framework. At its core is missing-oriented feature fusion augmented by self-supervised view alignment to establish view-consistent 2D-3D correspondence between the image and the partial assembly, which effectively captures clues of missing parts from the image and provides targeted guidance for completion. Extensive experiments demonstrate our state-of-the-art performance on the PartNet dataset and show its generalization capabilities in two downstream applications: component suggestion and furniture …
Poster
Ye Tao · jiawei zhang · Yahao Shi · Dongqing Zou · Bin Zhou
[ Exhibit Hall I ]
Abstract
Image-based 3D generation has vast applications in robotics and gaming, where high-quality, diverse outputs and consistent 3D representations are crucial. However, existing methods have limitations: 3D diffusion models are limited by dataset scarcity and the absence of strong pre-trained priors, while 2D diffusion-based approaches struggle with geometric consistency. We propose a method that leverages 2D diffusion models' implicit 3D reasoning ability while ensuring 3D consistency via Gaussian-splatting-based geometric distillation. Specifically, the proposed Gaussian Splatting Decoder enforces 3D consistency by transforming SV3D latent outputs into an explicit 3D representation. Unlike SV3D, which relies only on implicit 2D representations for video generation, Gaussian Splatting explicitly encodes spatial and appearance attributes, enabling multi-view consistency through geometric constraints. These constraints correct view inconsistencies, ensuring robust geometric consistency. As a result, our approach simultaneously generates high-quality, multi-view-consistent images and accurate 3D models, providing a scalable solution for single-image-based 3D generation and bridging the gap between 2D diffusion diversity and 3D structural coherence. Experimental results demonstrate state-of-the-art multi-view consistency and strong generalization across diverse datasets. The code will be made publicly available upon acceptance.
Poster
Xiaogang Xu · Jiafei Wu · Qingsen Yan · Jiequan Cui · Richang Hong · Bei Yu
[ Exhibit Hall I ]
Abstract
A major challenge in Low-Light Image Enhancement (LLIE) is its ill-posed nature: low-light images often lack sufficient information to align with normal-light ones (e.g., not all training data can be fully fitted to the ground truth). Numerous studies have attempted to bridge the gap between low- and normal-light data by introducing effective additional information, which is called "references" in this paper. However, existing methods overlook the valuable references hidden within the training dataset itself. In this work, we propose a novel LLIE strategy that simultaneously learns image-specific features by neural networks while formulating effective common features from the training data as the reference. These common features are correlated with the samples that are not fully fitted by the LLIE network itself, and they are represented as a set of Learnable Feature Patches and Vectors (LFPVs) in the hidden feature space. LFPVs are updated through two mechanisms: the sample-updater, which extracts useful features from training samples to refine LFPVs, and the mutual-updater, which propagates information across LFPVs to mutually update them. LFPVs can be adaptively aligned with image-specific features via our designed query-and-fusion procedure, boosting the LLIE performance. Our proposed method can be integrated into any LLIE framework, improving both enhancement …
Poster
Zesong Yang · Bangbang Yang · Wenqi Dong · Chenxuan Cao · Liyuan Cui · Yuewen Ma · Zhaopeng Cui · Hujun Bao
[ Exhibit Hall I ]
Abstract
Humans can naturally identify and mentally complete occluded objects in cluttered environments. However, imparting similar cognitive ability to robotics remains challenging even with advanced reconstruction techniques, which model scenes as undifferentiated wholes and fail to recognize complete objects from partial observations. In this paper, we propose InstaScene, a new paradigm towards holistic 3D perception of complex scenes with a primary goal: decomposing arbitrary instances while ensuring complete reconstruction. To achieve precise decomposition, we develop a novel spatial contrastive learning by tracing rasterization of each instance across views, significantly enhancing semantic supervision in cluttered scenes. To overcome incompleteness from limited observations, we introduce in-situ generation that harnesses valuable observations and geometric cues, effectively guiding 3D generative models to reconstruct complete instances that seamlessly align with the real world. Experiments on scene decomposition and object completion across complex real-world and synthetic scenes demonstrate that our method achieves superior decomposition accuracy while producing geometrically faithful and visually intact objects. Code will be released upon acceptance.
Poster
Mohammad Mohammadi · Ziyi Wu · Igor Gilitschenski
[ Exhibit Hall I ]
Abstract
Long-term temporal information is crucial for event-based perception tasks, as raw events only encode pixel brightness changes. Recent works show that when trained from scratch, recurrent models achieve better results than feedforward models in these tasks. However, when leveraging self-supervised pre-trained weights, feedforward models can outperform their recurrent counterparts. Current self-supervised learning (SSL) methods for event-based pre-training largely mimic RGB image-based approaches. They pre-train feedforward models on raw events within a short time interval, ignoring the temporal information of events. In this work, we introduce TESPEC, a self-supervised pre-training framework tailored for recurrent architectures. To unleash the power of recurrent models, TESPEC is the first method utilizing longer sequences of events in the pre-training stage. TESPEC employs the masked image modeling paradigm with a new reconstruction target. We design a novel method to accumulate events into pseudo grayscale videos containing high-level semantic information about the underlying scene, which is robust to sensor noise and reduces motion blur. Reconstructing this target thus requires the model to reason about long-term history of events. Extensive experiments demonstrate our state-of-the-art performance in downstream tasks, including object detection, semantic segmentation, and monocular depth estimation.
Poster
Gengze Zhou · Yicong Hong · Zun Wang · Chongyang Zhao · Mohit Bansal · Qi Wu
[ Exhibit Hall I ]
Abstract
The academic field of learning instruction-guided visual navigation can be generally categorized into high-level category-specific search and low-level language-guided navigation, depending on the granularity of language instruction, in which the former emphasizes the exploration process, while the latter concentrates on following detailed textual commands. Despite the differing focuses of these tasks, the underlying requirements of interpreting instructions, comprehending the surroundings, and inferring action decisions remain consistent. This paper consolidates diverse navigation tasks into a unified and generic framework -- we investigate the core difficulties of sharing general knowledge and exploiting task-specific capabilities in learning navigation and propose a novel State-Adaptive Mixture of Experts (SAME) model that effectively enables an agent to infer decisions based on different-granularity language and dynamic observations. Powered by SAME, we present a versatile agent capable of addressing seven navigation tasks simultaneously that outperforms or achieves highly comparable performance to task-specific agents.
Poster
Jan Ackermann · Jonas Kulhanek · Shengqu Cai · Haofei Xu · Marc Pollefeys · Gordon Wetzstein · Leonidas Guibas · Songyou Peng
[ Exhibit Hall I ]
Abstract
In dynamic 3D environments, accurately updating scene representations over time is crucial for applications in robotics, mixed reality, and embodied AI. As scenes evolve, efficient methods to incorporate changes are needed to maintain up-to-date, high-quality reconstructions without the computational overhead of re-optimizing the entire scene. This paper introduces CL-Splats, which incrementally updates Gaussian splatting-based 3D representations from sparse scene captures. CL-Splats integrates a robust change-detection module that segments updated and static components within the scene, enabling focused, local optimization that avoids unnecessary re-computation. Moreover, CL-Splats supports storing and recovering previous scene states, facilitating temporal segmentation and new scene-analysis applications. Our extensive experiments demonstrate that CL-Splats achieves efficient updates with improved reconstruction quality over the state-of-the-art. This establishes a robust foundation for future real-time adaptation in 3D scene reconstruction tasks. We will release our source code and the synthetic and real-world datasets we created to support further research in this area.
Poster
Ziqi Ma · Yisong Yue · Georgia Gkioxari
[ Exhibit Hall I ]
Abstract
Why don't we have foundation models in 3D yet? A key limitation is data scarcity. For 3D object part segmentation, existing datasets are small in size and lack diversity. We show that it is possible to break this data barrier by building a data engine powered by 2D foundation models. Our data engine automatically annotates any number of object parts: 1755x more unique part types than existing datasets combined. By training on our annotated data with a simple contrastive objective, we obtain an open-world model that generalizes to any part in any object based on any text query. Even when evaluated zero-shot, we outperform existing methods on the datasets they train on. We achieve 260% improvement in mIoU and boost speed by 6x to 300x. Our scaling analysis confirms that this generalization stems from the data scale, which underscores the impact of our data engine. Finally, to advance general-category open-world 3D part segmentation, we release a benchmark covering a wide range of objects and parts.
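For readers unfamiliar with the mIoU metric quoted above, the following is a minimal sketch of mean intersection-over-union on per-point part labels. It is an illustration only, not the authors' evaluation code: the function name `mean_iou`, the flat label lists, and the choice to skip classes absent from both prediction and ground truth are all assumptions.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes that occur in the prediction or the ground truth.

    pred and target are equal-length lists of integer part labels
    (one label per point). Classes absent from both are skipped,
    which is one common convention and an assumption here.
    """
    ious = []
    for c in range(num_classes):
        # Points labeled c in both lists (intersection) or in either (union).
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example with three part classes on six points.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
print(mean_iou(pred, target, 3))
```

In this toy case the per-class IoUs are 1/3, 2/3, and 1/2. Benchmarks differ on how they average over classes, shapes, and categories, so reported numbers depend on the protocol used.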
Poster
Junho Kim · Gwangtak Bae · Eun Sun Lee · Young Min Kim
[ Exhibit Hall I ]
Abstract
Understanding scene contexts is crucial for machines to perform tasks and adapt prior knowledge in unseen or noisy 3D environments. As data-driven learning is intractable to comprehensively encapsulate diverse ranges of layouts and open spaces, we propose teaching machines to identify relational commonalities in 3D spaces. Instead of focusing on point-wise or object-wise representations, we introduce 3D scene analogies, which are smooth maps between 3D scene regions that align spatial relationships. Unlike well-studied single instance-level maps, these scene-level maps smoothly link large scene regions, potentially enabling unique applications in trajectory transfer in AR/VR, long demonstration transfer for imitation learning, and context-aware object rearrangement. To find 3D scene analogies, we propose neural contextual scene maps, which extract descriptor fields summarizing semantic and geometric contexts, and holistically align them in a coarse-to-fine manner for map estimation. This approach reduces reliance on individual feature points, making it robust to input noise or shape variations. Experiments demonstrate the effectiveness of our approach in identifying scene analogies and transferring trajectories or object placements in diverse indoor scenes, indicating its potential for robotics and AR/VR applications.
Poster
Yidi Shao · Mu Huang · Chen Change Loy · Bo Dai
[ Exhibit Hall I ]
Abstract
We introduce GausSim, a novel neural network-based simulator designed to capture the dynamic behaviors of real-world elastic objects represented through Gaussian kernels. We leverage continuum mechanics and treat each kernel as a Center of Mass System (CMS) that describes a continuous piece of matter, accounting for realistic deformations without idealized assumptions. To improve computational efficiency and fidelity, we employ a hierarchical structure that further organizes kernels into CMSs with explicit formulations, enabling a coarse-to-fine simulation approach. This structure significantly reduces computational overhead while preserving detailed dynamics. In addition, GausSim incorporates explicit physics constraints, such as mass and momentum conservation, ensuring interpretable results and robust, physically plausible simulations. To validate our approach, we present a new dataset, READY, containing multi-view videos of real-world elastic deformations. Experimental results demonstrate that GausSim achieves superior performance compared to existing physics-driven baselines, offering a practical and accurate solution for simulating complex dynamic behaviors. Code and model will be released.
Poster
Zhenhua Ning · Zhuotao Tian · Shaoshuai Shi · Daojing He · Guangming Lu · Wenjie Pei · Li Jiang
[ Exhibit Hall I ]
Abstract
Recent advances in point cloud perception have demonstrated remarkable progress in scene understanding through vision-language alignment leveraging large language models (LLMs). However, existing methods may still encounter challenges in handling complex instructions that require accurate spatial reasoning, even if the 3D point cloud data provides detailed spatial cues such as size and position for identifying the targets. To tackle this issue, we propose Relevant Reasoning Segmentation (R$^2$S), a reasoning-based segmentation framework. The framework emulates human cognitive processes by decomposing spatial reasoning into two sequential stages: first identifying relevant elements, then processing instructions guided by their associated visual priors. Furthermore, acknowledging the inadequacy of existing datasets in complex reasoning tasks, we introduce 3D ReasonSeg, a reasoning-based segmentation dataset comprising 25,185 training samples and 3,966 validation samples with precise annotations. Both quantitative and qualitative experiments demonstrate that the R$^2$S and 3D ReasonSeg effectively endow 3D point cloud perception with stronger spatial reasoning capabilities, and we hope that they can serve as a new baseline and benchmark for future work.
Poster
Kailai Zhou · Fuqiang Yang · Shixian Wang · Bihan Wen · Chongde Zi · Linsen Chen · Qiu Shen · Xun Cao
[ Exhibit Hall I ]
Abstract
RGB-Thermal (RGBT) multispectral vision is essential for robust perception in complex environments. Most RGBT tasks follow a case-by-case research paradigm, relying on manually customized models to learn task-oriented representations. Nevertheless, this paradigm is inherently constrained by artificial inductive bias, modality bias, and data bottleneck. To address these limitations, we make the initial attempt to build a Generalized RGBT MultiSpectral foundation model (M-SpecGene), which aims to learn modality-invariant representations from large-scale broad data in a self-supervised manner. M-SpecGene provides new insights into multispectral fusion and integrates prior case-by-case studies into a unified paradigm. Considering the unique characteristic of information imbalance in RGBT data, we introduce the Cross-Modality Structural Sparsity (CMSS) metric to quantify the information density across two modalities. Then we develop the GMM-CMSS progressive masking strategy to facilitate a flexible, easy-to-hard, and object-centric pre-training process. Comprehensive experiments validate M-SpecGene's generalizability across eleven datasets for four RGBT downstream tasks.
Poster
Han Wang · Shengyang Li · Jian Yang · Yuxuan Liu · Yixuan Lv · Zhuang Zhou
[ Exhibit Hall I ]
Abstract
Detecting and tracking ground objects using earth observation imagery remains a significant challenge in the field of remote sensing. Continuous maritime ship tracking is crucial for applications such as maritime search and rescue, law enforcement, and shipping analysis. However, most current ship tracking methods rely on geostationary satellites or video satellites. The former offer low resolution and are susceptible to weather conditions, while the latter have short filming durations and limited coverage areas, making them less suitable for the real-world requirements of ship tracking. To address these limitations, we present the Hybrid Optical and Synthetic Aperture Radar (SAR) Ship Re-Identification Dataset (HOSS ReID dataset), designed to evaluate the effectiveness of ship tracking using low-Earth orbit constellations of optical and SAR sensors. This approach ensures shorter re-imaging cycles and enables all-weather tracking. The HOSS ReID dataset includes images of the same ship captured over extended periods under diverse conditions, using different satellites of different modalities at varying times and angles. Furthermore, we propose a baseline method for cross-modal ship re-identification, TransOSS, which is built on the Vision Transformer architecture. It refines the patch embedding structure to better accommodate cross-modal tasks, incorporates additional embeddings to introduce more reference information, and employs contrastive learning …
Poster
Junru Lin · Chirag Vashist · Mikaela Uy · Colton Stearns · Xuan Luo · Leonidas Guibas · Ke Li
[ Exhibit Hall I ]
Abstract
Existing dynamic scene interpolation methods typically assume that the motion between consecutive time steps is small enough so that displacements can be locally approximated by linear models. In practice, even slight deviations from this small-motion assumption can cause conventional techniques to fail. In this paper, we introduce Global Motion Corresponder (GMC), a novel approach that robustly handles large motion and achieves smooth transitions. GMC learns a unary potential field that predicts SE(3) mappings into a shared canonical space, balancing correspondence, spatial and semantic smoothness, and local rigidity. We demonstrate that our method significantly outperforms existing baselines on 3D scene interpolation when the two states undergo large global motions. Furthermore, our method enables extrapolation where other baseline methods cannot.
Poster
Shouwei Ruan · Hanqing Liu · Yao Huang · XIaoqi Wang · Caixin KANG · Hang Su · Yinpeng Dong · Xingxing Wei
[ Exhibit Hall I ]
Abstract
Vision Language Models (VLMs) have exhibited remarkable generalization capabilities, yet their robustness in dynamic real-world scenarios remains largely unexplored. To systematically evaluate VLMs' robustness to real-world 3D variations, we propose AdvDreamer, the first framework capable of generating physically reproducible Adversarial 3D Transformation (Adv-3DT) samples from single-view observations. In AdvDreamer, we integrate three key innovations: Firstly, to characterize real-world 3D variations with limited prior knowledge precisely, we design a zero-shot Monocular Pose Manipulation pipeline built upon generative 3D priors. Secondly, to ensure the visual quality of worst-case Adv-3DT samples, we propose a Naturalness Reward Model that provides continuous naturalness regularization during adversarial optimization, effectively preventing convergence to hallucinated or unnatural elements. Thirdly, to enable systematic evaluation across diverse VLM architectures and visual-language tasks, we introduce the Inverse Semantic Probability loss as the adversarial optimization objective, which operates solely in the fundamental visual-textual alignment space. Based on the captured Adv-3DT samples with high aggressiveness and transferability, we establish MM3DTBench, the first VQA benchmark dataset tailored to evaluate VLM robustness under challenging 3D variations. Extensive evaluations of representative VLMs with varying architectures reveal that real-world 3D variations can pose severe threats to model performance across various tasks.
Poster
Siqi Yang · Jinxiu Liang · Zhaojun Huang · Yeliduosi Xiaokaiti · Yakun Chang · Zhaofei Yu · Boxin Shi
[ Exhibit Hall I ]
Abstract
High-speed video reconstruction from neuromorphic spike cameras offers a promising alternative to traditional frame-based imaging, providing superior temporal resolution and dynamic range with reduced power consumption. Nevertheless, reconstructing high-quality colored videos from spikes captured in ultra-short time intervals remains challenging due to the noisy nature of spikes. While some existing methods extend the temporal capture window to improve reconstruction quality, they compromise the temporal resolution advantages of spike cameras. In this paper, we introduce SpikeDiff, the first zero-shot framework that leverages pretrained diffusion models to reconstruct high-quality colored videos from sub-millisecond chromatic spikes. By incorporating physics-based guidance into the diffusion sampling process, SpikeDiff bridges the domain gap between chromatic spikes and conventional images, enabling high-fidelity reconstruction without requiring domain-specific training data. Extensive experiments demonstrate that SpikeDiff achieves impressive reconstruction quality while maintaining ultra-high temporal resolution, outperforming existing methods across diverse challenging scenarios.
Poster
Hao Chen · Tao Han · Song Guo · Jie ZHANG · Yonghan Dong · Yunlong Yu · LEI BAI
[ Exhibit Hall I ]
Abstract
This paper presents Variables-Adaptive Mixture of Experts (VA-MoE), a novel framework for incremental weather forecasting that dynamically adapts to evolving spatiotemporal patterns in real-time data. Traditional weather prediction models often struggle with exorbitant computational expenditure and the need to continuously update forecasts as new observations arrive. VA-MoE addresses these challenges by leveraging a hybrid architecture of experts, where each expert specializes in capturing distinct sub-patterns of atmospheric variables (e.g., temperature, humidity, wind speed). Moreover, the proposed method employs a variable-adaptive gating mechanism to dynamically select and combine relevant experts based on the input context, enabling efficient knowledge distillation and parameter sharing. This design significantly reduces computational overhead while maintaining high forecast accuracy. Experiments on the real-world ERA5 dataset demonstrate that VA-MoE performs comparably to state-of-the-art models in both short-term (e.g., 1–3 days) and long-term (e.g., 5 days) forecasting tasks, with only about 25\% of the trainable parameters and 50\% of the initial training data.
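The general idea of a gate that selects and combines experts can be sketched as below. This is a generic top-k softmax mixture-of-experts forward pass, not VA-MoE's variable-adaptive gating, whose details the abstract does not give. The names `moe_forward` and `gate_weights`, the linear gate, and the toy experts are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(x, experts, gate_weights, top_k=2):
    """Route input x to the top_k experts and blend their outputs.

    experts: list of callables mapping a list of floats to a list of floats.
    gate_weights: one weight vector per expert (a linear gate; an assumption).
    """
    # Gate score per expert: dot product of the input with that expert's gate vector.
    scores = [sum(w * xi for w, xi in zip(ws, x)) for ws in gate_weights]
    # Keep only the top_k highest-scoring experts (sparse routing).
    top = sorted(range(len(experts)), key=lambda i: scores[i], reverse=True)[:top_k]
    # Renormalize the selected scores into mixing weights.
    probs = softmax([scores[i] for i in top])
    outputs = [experts[i](x) for i in top]
    dim = len(outputs[0])
    return [sum(p * out[d] for p, out in zip(probs, outputs)) for d in range(dim)]


# Three toy experts standing in for per-variable specialists.
experts = [
    lambda x: [2 * v for v in x],
    lambda x: [v + 1 for v in x],
    lambda x: [-v for v in x],
]
gate_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
print(moe_forward([2.0, 0.0], experts, gate_weights))
```

Only the selected experts are evaluated, which is the usual source of compute savings in sparse mixture-of-experts models. Whether VA-MoE routes per variable, per token, or per sample is not stated in the abstract.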
Poster
hyunjin cho · Giyun choi · Jongwon Choi
[ Exhibit Hall I ]
Abstract
Existing Human Mesh Recovery (HMR) methods typically assume a standard human body structure, overlooking diverse anatomical conditions such as limb loss or mobility impairments. This assumption biases the models when applied to individuals with disabilities—a shortcoming further exacerbated by the limited availability of suitable datasets. To address this gap, we propose Amputated Joint Aware Human Recovery (AJAHR), which is an adaptive pose estimation framework that enhances mesh reconstruction for individuals with impairments. Our model incorporates a body-part amputation classifier—jointly trained alongside human mesh recovery—to detect potential amputations. We also introduce Amputee 3D (A3D), a synthetic dataset offering a wide range of amputee poses for more robust training. While maintaining strong performance on non-amputees, our approach achieves state-of-the-art results for amputated individuals.
Poster
Zhexiong Wan · Jianqin Luo · Yuchao Dai · Gim Hee Lee
[ Exhibit Hall I ]
Abstract
Recent point tracking methods have made great strides in recovering the trajectories of any point (especially key points) in long video sequences associated with large motions. However, the spatial and temporal granularities of point trajectories remain constrained by limited motion estimation accuracy and video frame rate. Leveraging the high temporal resolution and motion sensitivity of event cameras, we introduce event data for the first time to recover spatially dense and temporally continuous trajectories of every point at any time. Specifically, we define the dense and continuous point trajectory representation as estimating multiple control points of curves for each pixel and model the movement of sparse events triggered along continuous point trajectories. Building on this, we propose a novel multi-frame iterative streaming framework that first estimates local inter-frame motion representations from two consecutive frames with inter-frame events, then aggregates them into a global long-term motion representation to utilize the full input video and event data with an arbitrary number of frames. Extensive experiments on simulated and real data demonstrate the significant improvement of our framework over state-of-the-art methods and the crucial role of introducing events to model continuous point trajectories.
Poster
Athinoulla Konstantinou · Georgios Leontidis · Mamatha Thota · Aiden Durrant
[ Exhibit Hall I ]
Abstract
Learning self-supervised representations that are invariant and equivariant to transformations is crucial for advancing beyond traditional visual classification tasks. However, many methods rely on predictor architectures to encode equivariance, despite evidence that architectural choices, such as capsule networks, inherently excel at learning interpretable pose-aware representations. To explore this, we introduce EquiCaps (Equivariant Capsule Network), a capsule-based approach to pose-aware self-supervision that eliminates the need for a specialised predictor for enforcing equivariance. Instead, we leverage the intrinsic pose-awareness capabilities of capsules to improve performance in pose estimation tasks. To further challenge our assumptions, we increase task complexity via multi-geometric transformations to enable a more thorough evaluation of invariance and equivariance by introducing 3DIEBench-T, an extension of a 3D object-rendering benchmark dataset. Empirical results demonstrate that EquiCaps outperforms prior state-of-the-art equivariant methods on geometric tasks, including rotation and translation, achieving a supervised-level $R^2$ of 0.78 on the 3DIEBench rotation prediction benchmark and improving upon SIE and CapsIE by 0.05 and 0.04 $R^2$, respectively. Moreover, in contrast to non-capsule-based equivariant approaches, EquiCaps maintains robust equivariant performance under combined geometric transformations, underscoring its generalisation capabilities and the promise of predictor-free capsule architectures. Code and dataset will be released.
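The $R^2$ figures above refer to the coefficient of determination. A minimal sketch of the standard scalar definition, $R^2 = 1 - SS_{res}/SS_{tot}$, follows. How the benchmark aggregates over multi-dimensional rotation targets is not stated in the abstract, so the scalar form and the function name `r2_score` are assumptions.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination for scalar regression targets.

    R^2 = 1 - SS_res / SS_tot, where SS_res is the squared error of the
    predictions and SS_tot is the squared deviation of the targets from
    their mean. Assumes y_true is not constant (SS_tot > 0).
    """
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]
print(r2_score(y_true, y_pred))
```

A perfect predictor scores 1 and always predicting the target mean scores 0.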
Poster
Ye Lu · Jie Wang · Jianjun Gao · Rui Gong · Chen Cai · Kim-Hui Yap
[ Exhibit Hall I ]
Abstract
Recent Mamba-based methods for the pose-lifting task tend to model joint dependencies by 2D-to-1D mapping with diverse scanning strategies. Though effective, they struggle to model intricate joint connections and uniformly process all joint motion trajectories while neglecting the intrinsic differences across motion characteristics. In this work, we propose a structure-aware and motion-adaptive framework, named SAMA, to capture spatial joint topology along with diverse motion dynamics independently. Specifically, SAMA consists of a Structure-aware State Integrator (SSI) and a Motion-adaptive State Modulator (MSM). The Structure-aware State Integrator is tasked with leveraging dynamic joint relationships to fuse information at both the joint feature and state levels in the state space, based on pose topology rather than sequential state transitions. The Motion-adaptive State Modulator is responsible for joint-specific motion characteristics recognition, thus applying tailored adjustments to diverse motion patterns across different joints. Through the above key modules, our algorithm enables structure-aware and motion-adaptive pose lifting. Extensive experiments across multiple benchmarks demonstrate that our algorithm achieves advanced results with lower computational cost.
Poster
Dehao Yuan · Levi Burner · Jiayi Wu · Minghui Liu · Jingxi Chen · Yiannis Aloimonos · Cornelia Fermuller
[ Exhibit Hall I ]
Abstract
Event-based motion field estimation is an important task. However, current optical flow methods face challenges: learning-based approaches, often frame-based and relying on CNNs, lack cross-domain transferability, while model-based methods, though more robust, are less accurate. To address the limitations of optical flow estimation, recent works have focused on normal flow, which can be more reliably measured in regions with limited texture or strong edges. However, existing normal flow estimators are predominantly model-based and suffer from high errors. In this paper, we propose a novel supervised point-based method for normal flow estimation that overcomes the limitations of existing event learning-based approaches. Using a local point cloud encoder, our method directly estimates per-event normal flow from raw events, offering multiple unique advantages: 1) It produces temporally and spatially sharp predictions. 2) It supports more diverse data augmentation, such as random rotation, to improve robustness across various domains. 3) It naturally supports uncertainty quantification via ensemble inference, which benefits downstream tasks. 4) It enables training and inference on undistorted data in normalized camera coordinates, improving transferability across cameras. Extensive experiments demonstrate our method achieves better and more consistent performance than state-of-the-art methods when transferred across different datasets. Leveraging this transferability, we train our model …
Poster
Shuang Guo · Friedhelm Hamann · Guillermo Gallego
[ Exhibit Hall I ]
Abstract
Event cameras rely on motion to obtain information about scene appearance. In other words, for event cameras, motion and appearance are either both observed or both unobserved, and both are encoded in the output event stream. Previous works consider recovering these two visual quantities as separate tasks, which does not fit with the nature of event cameras and neglects the inherent relations between both tasks. In this paper, we propose an unsupervised learning framework that jointly estimates optical flow (motion) and image intensity (appearance) with a single network. Starting from the event generation model, we newly derive the event-based photometric error as a function of optical flow and image intensity, which is further combined with the contrast maximization framework, yielding a comprehensive loss function that provides proper constraints for both flow and intensity estimation. Exhaustive experiments show that our model achieves state-of-the-art performance for both optical flow (achieving 20% and 25% improvement in EPE and AE respectively in the unsupervised learning category) and intensity estimation (producing competitive results with other baselines, particularly in high dynamic range scenarios). Last but not least, our model achieves shorter inference time than all the other optical flow models and many of the image reconstruction models, while they …
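The contrast maximization idea mentioned above can be illustrated with a toy objective: warp events to a reference time along a candidate flow, accumulate them into an image, and score the image by its variance. This sketch is not the paper's loss. It assumes a single global flow, ignores polarity, and uses nearest-pixel accumulation, and the name `warped_image_variance` is hypothetical.

```python
def warped_image_variance(events, flow, width, height, t_ref=0.0):
    """Contrast (variance) of the image of events warped by a candidate flow.

    events: list of (x, y, t) tuples; polarity is ignored in this sketch.
    flow: (vx, vy) in pixels per unit time, shared by all events (assumption).
    A flow that matches the true motion stacks events onto the same pixels,
    which yields a sharper image and therefore a higher variance.
    """
    vx, vy = flow
    image = [[0.0] * width for _ in range(height)]
    for x, y, t in events:
        # Transport each event back to the reference time along the flow.
        xw = round(x - (t - t_ref) * vx)
        yw = round(y - (t - t_ref) * vy)
        if 0 <= xw < width and 0 <= yw < height:
            image[yw][xw] += 1.0
    pixels = [v for row in image for v in row]
    mean = sum(pixels) / len(pixels)
    return sum((v - mean) ** 2 for v in pixels) / len(pixels)


# A single point moving at 2 px per time unit along x triggers one event per step.
events = [(2 + 2 * t, 3, t) for t in range(5)]
print(warped_image_variance(events, (2.0, 0.0), 16, 8))  # candidate = true flow
print(warped_image_variance(events, (0.0, 0.0), 16, 8))  # candidate = no motion
```

The true flow gives the larger variance because all five events land on one pixel. Practical methods typically use differentiable, per-pixel warps so the objective can be optimized by gradient descent.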
Poster
Adrian Chow · Evelien Riddell · Yimu Wang · Sean Sedwards · Krzysztof Czarnecki
[ Exhibit Hall I ]
Abstract
Open-vocabulary 3D object detection for autonomous driving aims to detect novel objects beyond the predefined training label sets in point cloud scenes. Existing approaches achieve this by connecting traditional 3D object detectors with vision-language models (VLMs) to regress 3D bounding boxes for novel objects and perform open-vocabulary classification through cross-modal alignment between 3D and 2D features. However, achieving robust cross-modal alignment remains a challenge due to semantic inconsistencies when generating corresponding 3D and 2D feature pairs. To overcome this challenge, we present OV-SCAN, an Open-Vocabulary 3D framework that enforces Semantically Consistent Alignment for Novel object discovery. OV-SCAN employs two core strategies: discovering precise 3D annotations and filtering out low-quality or corrupted alignment pairs (arising from 3D annotation, occlusion-induced, or resolution-induced noise). Extensive experiments on the nuScenes dataset demonstrate that OV-SCAN achieves state-of-the-art performance.
Poster
Atin Pothiraj · Jaemin Cho · Elias Stengel-Eskin · Mohit Bansal
[ Exhibit Hall I ]
Abstract
Recognizing and reasoning about occluded (partially or fully hidden) objects is vital to understanding visual scenes, as occlusions frequently occur in real-world environments and act as obstacles for spatial comprehension. To test models' ability to reason about multiple occluded objects, we introduce a novel task, **C**ounting **A**modally for **P**atterns **T**hrough **U**nseen **RE**gions (CAPTURe), which requires a model to count objects arranged in a pattern by inferring how the pattern continues behind an occluder (an object which blocks parts of the scene). CAPTURe requires both recognizing visual patterns and reasoning, making it an ideal testbed for evaluating vision-language models (VLMs) on whether they understand occluded patterns and possess spatial understanding skills. By requiring models to reason about occluded objects, CAPTURe also tests VLMs' ability to form world models, allowing them to fill in missing information. CAPTURe consists of two parts: (1) CAPTURe-real, with manually filtered images of real objects in patterns and (2) CAPTURe-synthetic, a controlled diagnostic with generated patterned images. We evaluate four strong VLMs -- GPT-4o, Intern-VL2-Llama3, Molmo, and Qwen2-VL -- on CAPTURe, finding that models struggle to count on both occluded and unoccluded patterns. Crucially, we find that models perform worse with occlusion, suggesting that VLMs are also deficient in …
Poster
Zhijian Huang · Chengjian Feng · Baihui Xiao · Feng yan · ZEQUN JIE · Yujie Zhong · Xiaodan Liang · Lin Ma
[ Exhibit Hall I ]
Abstract
Large Multimodal Models (LMMs) have demonstrated exceptional comprehension and interpretation capabilities in Autonomous Driving (AD) by incorporating large language models. Despite the advancements, current data-driven AD approaches tend to concentrate on a single dataset and specific tasks, neglecting their overall capabilities and ability to generalize. To bridge these gaps, we propose DriveMM, a general large multimodal model designed to process diverse data inputs, such as images and multi-view videos, while performing a broad spectrum of AD tasks, including perception, prediction, and planning. Initially, the model undergoes curriculum pre-training to process varied visual signals and perform basic visual comprehension and perception tasks. Subsequently, we augment and standardize various AD-related datasets to fine-tune the model, resulting in an all-in-one LMM for autonomous driving. To assess the general capabilities and generalization ability, we conduct evaluations on six public benchmarks and undertake zero-shot transfer on an unseen dataset, where DriveMM achieves state-of-the-art performance across all tasks. We hope DriveMM will serve as a promising solution for future end-to-end autonomous driving applications in the real world.
Poster
Yuyi Liu · Xinhang Song · Tianliang Qi · Shuqiang Jiang
[ Exhibit Hall I ]
Abstract
Towards visual room rearrangement for embodied agents, this paper tackles the intricate challenge of restoring a disarrayed scene configuration to its intended goal state. The task necessitates a range of sophisticated capabilities, including efficient spatial navigation, precise and accurate object interaction, sensitive scene change detection, and meticulous restoration techniques. The inherent complexity of this endeavor stems from the diverse nature of potential object changes, encompassing movements within the space, alterations in appearance, and changes in existence—where objects may be introduced or removed from the scene. Previous methods, either end-to-end reinforcement learning or modular approaches, struggle with handling these changes in a unified manner due to the heterogeneous nature of the inference spaces. To address this, this paper proposes a Trial-Oriented Visual Rearrangement (TOR) framework, which leverages the principles of stronger embodiment to prune the joint reasoning space and identify a smaller shared space for processing various object changes. TOR maintains a differential point cloud representation to capture environmental changes and uses two core mechanisms, assessment and refinement, to iteratively restore the scene to the goal state. Experimental results demonstrate the effectiveness of TOR in restoring both object movement and appearance changes and show its generalization to complex multi-room environments.
Poster
Yufeng Jin · Vignesh Prasad · Snehal Jauhri · Mathias Franzius · Georgia Chalvatzaki
[ Exhibit Hall I ]
Abstract
Efficient and accurate object pose estimation is an essential component for modern vision systems in many applications such as Augmented Reality, autonomous driving, and robotics. While research in model-based 6D object pose estimation has delivered promising results, model-free methods are hindered by the high computational load in rendering and inferring consistent poses of arbitrary objects in a live RGB-D video stream. To address this issue, we present 6DOPE-GS, a novel method for online 6D object pose estimation & tracking with a single RGB-D camera by effectively leveraging advances in Gaussian Splatting. Thanks to the fast differentiable rendering capabilities of Gaussian Splatting, 6DOPE-GS can simultaneously optimize for 6D object poses and 3D object reconstruction. To achieve the necessary efficiency and accuracy for live tracking, our method uses incremental 2D Gaussian Splatting with an intelligent dynamic keyframe selection procedure to achieve high spatial object coverage and prevent erroneous pose updates. We also propose an opacity statistic-based pruning mechanism for adaptive Gaussian density control, to ensure training stability and efficiency. We evaluate our method on the HO3D and YCBInEOAT datasets and show that 6DOPE-GS matches the performance of state-of-the-art baselines for model-free simultaneous 6D pose tracking and reconstruction while providing a 5x speedup. …
Poster
Javier Tirado-Garín · Javier Civera
[ Exhibit Hall I ]
Abstract
We present AnyCalib, a method for calibrating the intrinsic parameters of a camera from a single in-the-wild image, that is agnostic to the camera model. Current methods are predominantly tailored to specific camera models and/or require extrinsic cues, such as the direction of gravity, to be visible in the image. In contrast, we argue that the perspective and distortion cues inherent in images are sufficient for model-agnostic camera calibration. To demonstrate this, we frame the calibration process as the regression of the rays corresponding to each pixel. We show, for the first time, that this intermediate representation allows for a closed-form recovery of the intrinsics for a wide range of camera models, including but not limited to: pinhole, Brown-Conrady and Kannala-Brandt. Our approach also applies to edited---cropped and stretched---images. Experimentally, we demonstrate that AnyCalib consistently outperforms alternative methods, including 3D foundation models, despite being trained on orders of magnitude less data. We will make our code and weights publicly available.
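To make the idea of closed-form recovery from per-pixel rays concrete, here is a minimal sketch for the pinhole case only. It is an illustration under stated assumptions, not AnyCalib's implementation: it assumes rays are given in the camera frame with positive depth, and the function names `fit_axis` and `pinhole_from_rays` are hypothetical. For a pinhole camera, the pixel coordinate is linear in the normalized ray coordinate (u = fx * x/z + cx), so each axis reduces to an ordinary least-squares line fit.

```python
def fit_axis(normalized, pixels):
    """Least-squares fit of pixel = f * normalized + c along one image axis."""
    n = len(normalized)
    mean_x = sum(normalized) / n
    mean_u = sum(pixels) / n
    var = sum((x - mean_x) ** 2 for x in normalized)
    cov = sum((x - mean_x) * (u - mean_u) for x, u in zip(normalized, pixels))
    f = cov / var
    c = mean_u - f * mean_x
    return f, c


def pinhole_from_rays(pixels, rays):
    """Recover (fx, fy, cx, cy) from pixel coordinates and their rays.

    pixels: list of (u, v) image coordinates.
    rays:   list of (x, y, z) ray directions with z > 0 (any scale).
    Hypothetical helper: covers the pinhole model only.
    """
    xs = [x / z for x, _, z in rays]
    ys = [y / z for _, y, z in rays]
    fx, cx = fit_axis(xs, [u for u, _ in pixels])
    fy, cy = fit_axis(ys, [v for _, v in pixels])
    return fx, fy, cx, cy
```

Models with distortion, such as Brown-Conrady or Kannala-Brandt, would need additional terms beyond this two-parameter line fit per axis.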
Poster
Zukang Liao · Min Chen
[ Exhibit Hall I ]
Abstract
In many applications, machine-learned (ML) models are required to hold some invariance qualities, such as rotation, size, and intensity invariance. Among these, testing for background invariance presents a significant challenge due to the vast and complex data space it encompasses. To evaluate invariance qualities, we use a visualization-based testing framework which allows human analysts to assess and make informed decisions about the invariance properties of ML models. We show that such an informative testing framework is preferable, as ML models with the same global statistics (e.g., accuracy scores) can behave differently and have different visualized testing patterns. However, human analysts might not reach consistent decisions without a systematic sampling approach to select representative testing suites. In this work, we present a technical solution for selecting background scenes according to their semantic proximity to a target image that contains a foreground object being tested. We construct an ontology for storing knowledge about relationships among different objects using association analysis. This ontology enables efficient and meaningful search for background scenes of different semantic distances to a target image, enabling the selection of a test suite that is both diverse and reasonable. Compared with other testing techniques, e.g., random sampling, nearest neighbours, or …
Poster
Sung-Yeon Park · Can Cui · Yunsheng Ma · Ahmadreza Moradipari · Rohit Gupta · Kyungtae Han · Ziran Wang
[ Exhibit Hall I ]
Abstract
Recent advances in multi-modal large language models (MLLMs) have demonstrated strong performance across various domains; however, their ability to comprehend driving scenes remains less proven. The complexity of driving scenarios, which includes multi-view information, poses significant challenges for existing MLLMs. In this paper, we introduce NuPlanQA-Eval, a multi-view, multi-modal evaluation benchmark for driving scene understanding. To further support generalization to multi-view driving scenarios, we also propose NuPlanQA-1M, a large-scale dataset comprising 1M real-world visual question-answering (VQA) pairs. For context-aware analysis of traffic scenes, we categorize our dataset into nine subtasks across three core skills: Road Environment Perception, Spatial Relations Recognition, and Ego-Centric Reasoning. Furthermore, we present BEV-LLM, integrating Bird's-Eye-View (BEV) features from multi-view images into MLLMs. Our evaluation results reveal key challenges that existing MLLMs face in driving scene-specific perception and spatial reasoning from ego-centric perspectives. In contrast, BEV-LLM demonstrates remarkable adaptability to this domain, outperforming other models in six of the nine subtasks. These findings highlight how BEV integration enhances multi-view MLLMs while also identifying key areas that require further refinement for effective adaptation to driving scenes. To facilitate further research, we will publicly release both NuPlanQA-Eval and NuPlanQA-1M upon acceptance of this paper.
Poster
Byeongjun Kwon · Munchurl Kim
[ Exhibit Hall I ]
Abstract
Zero-shot depth estimation (DE) models exhibit strong generalization performance as they are trained on large-scale datasets. However, existing models struggle with high-resolution images due to the discrepancy in image resolutions of training (with smaller resolutions) and inference (for high resolutions). Processing them at full resolution leads to decreased estimation accuracy on depth with tremendous memory consumption, while downsampling to the training resolution results in blurred edges in the estimated depth images. Prevailing high-resolution depth estimation methods adopt a patch-based approach, which introduces depth discontinuity issues when reassembling the estimated depth patches and results in test-time inefficiency. Additionally, to obtain fine-grained depth details, these methods rely on synthetic datasets due to the real-world sparse ground truth depth, leading to poor generalizability. To tackle these limitations, we propose Patch Refine Once (PRO), an efficient and generalizable tile-based framework. Our PRO consists of two key components: (i) Grouped Patch Consistency Training that enhances test-time efficiency while mitigating the depth discontinuity problem by jointly processing four overlapping patches and enforcing a consistency loss on their overlapping regions within a single backpropagation step, and (ii) Bias-Free Masking that prevents the DE models from overfitting to dataset-specific biases, enabling better generalization to real-world datasets even after …
Poster
Xiao Fang · Minhyek Jeon · Zheyang Qin · Stanislav Panev · Celso de Melo · Shuowen Hu · Shayok Chakraborty · Fernando De la Torre
[ Exhibit Hall I ]
Abstract
Detecting vehicles in aerial imagery is a critical task with applications in traffic monitoring, urban planning, and defense intelligence. Deep learning methods have provided state-of-the-art (SOTA) results for this application. However, a significant challenge arises when models trained on data from one geographic region fail to generalize effectively to other areas. Variability in factors such as environmental conditions, urban layouts, road networks, vehicle types, and image acquisition parameters (e.g., resolution, lighting, and angle) leads to domain shifts that degrade model performance. This paper presents a novel approach to address this challenging problem by leveraging generative AI for the high-quality synthesis of aerial images and corresponding labels to enhance detector training through data augmentation. Our key contribution is the development of a multi-stage, multi-modal knowledge transfer framework utilizing fine-tuned latent diffusion models (LDMs) to mitigate the distribution gap between the source and target environments. Through extensive experiments across diverse aerial imagery domains, we demonstrate significant performance gains (more than 40% in some cases) over existing domain adaptation and weakly supervised learning methods. Our method also outperforms the baseline detectors trained on a source dataset by 4-12%. Furthermore, we introduce two newly annotated aerial datasets from New Zealand and Utah, which along …
Poster
Chong Cheng · Yu Hu · Sicheng Yu · Beizhen ZHAO · Zijian Wang · Hao Wang
[ Exhibit Hall I ]
Abstract
3D Gaussian Splatting (3DGS) has demonstrated its potential in reconstructing scenes from unposed images. However, optimization-based 3DGS methods struggle with sparse views due to limited prior knowledge. Meanwhile, feed-forward Gaussian approaches are constrained by input formats, making it challenging to incorporate more input views. To address these challenges, we propose RegGS, a 3D Gaussian registration-based framework for reconstructing unposed sparse views. RegGS aligns local 3D Gaussians generated by a feed-forward network into a globally consistent 3D Gaussian representation. Technically, we implement an entropy-regularized Sinkhorn algorithm to efficiently solve the optimal transport Mixture 2-Wasserstein $(\text{MW}_2)$ distance, which serves as an alignment metric for Gaussian mixture models (GMMs) in $\mathrm{Sim}(3)$ space. Furthermore, we design a joint 3DGS registration module that integrates the $\text{MW}_2$ distance, photometric consistency, and depth geometry. This enables a coarse-to-fine registration process while accurately estimating camera poses and aligning the scene. Experiments on the RE10K and ACID datasets demonstrate that RegGS effectively registers local Gaussians with high fidelity, achieving precise pose estimation and high-quality novel-view synthesis.
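For readers unfamiliar with the entropy-regularized Sinkhorn algorithm mentioned above, the following is a generic sketch of the standard iteration for discrete optimal transport. It is not RegGS's code: the cost matrix is assumed to be given (in a mixture-Wasserstein setting it would hold pairwise distances between Gaussian components), and the function name `sinkhorn` and its parameters `eps` and `iters` are illustrative choices.

```python
import math


def sinkhorn(a, b, cost, eps=0.1, iters=500):
    """Entropy-regularized optimal transport between two discrete distributions.

    a, b: marginal weights (each sums to 1).
    cost: len(a) x len(b) matrix of pairwise transport costs.
    eps:  regularization strength (larger = smoother plan, faster convergence).
    Returns the transport plan as a list of rows.
    """
    n, m = len(a), len(b)
    # Gibbs kernel derived from the cost matrix.
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]


def transport_cost(plan, cost):
    """Total cost of moving mass according to the plan."""
    return sum(p * c for prow, crow in zip(plan, cost) for p, c in zip(prow, crow))
```

Very small values of `eps` can underflow in this naive form; log-domain variants are commonly used in practice.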
Poster
Xiaoyang Hao · Han Li
[ Exhibit Hall I ]
Abstract
Monocular 3D human pose estimation (HPE) methods estimate the 3D positions of joints from individual images. Existing 3D HPE approaches often use the cropped image alone as input for their models. However, the relative depths of joints cannot be accurately estimated from cropped images without the corresponding camera intrinsics, which determine the perspective relationship between 3D objects and the cropped images. In this work, we introduce Perspective Encoding (PE) to encode the camera intrinsics of the cropped images. Moreover, since the human subject can appear anywhere within the original image, the perspective relationship between the 3D scene and the cropped image differs significantly, which complicates model fitting. Additionally, the further the human subject deviates from the image center, the greater the perspective distortions in the cropped image. To address these issues, we propose Perspective Rotation (PR), a transformation applied to the original image that centers the human subject, thereby reducing perspective distortions and alleviating the difficulty of model fitting. By incorporating PE and PR, we propose a novel 3D HPE framework, PersPose. Experimental results demonstrate that PersPose achieves state-of-the-art (SOTA) performance on the 3DPW, MPI-INF-3DHP, and Human3.6M datasets. For example, on the in-the-wild dataset 3DPW, PersPose achieves an MPJPE of 60.1 …
Poster
ZIYU ZHU · Xilin Wang · Yixuan Li · Zhuofan Zhang · Xiaojian Ma · Yixin Chen · Baoxiong Jia · Wei Liang · Qian Yu · Zhidong Deng · Siyuan Huang · Qing Li
[ Exhibit Hall I ]
Abstract
Embodied scene understanding requires not only comprehending visual-spatial information that has been observed but also determining where to explore next in the 3D physical world. Existing 3D Vision-Language (3D-VL) models primarily focus on grounding objects in static observations from 3D reconstruction, such as meshes and point clouds, but lack the ability to actively perceive and explore their environment. To address this limitation, we introduce **M**ove **t**o **U**nderstand (**MTU3D**), a unified framework that integrates active perception with **3D** vision-language learning, enabling embodied agents to effectively explore and understand their environment. This is achieved by three key innovations: 1) Online query-based representation learning, enabling direct spatial memory construction from RGB-D frames, eliminating the need for explicit 3D reconstruction. 2) A unified objective for grounding and exploration that represents unexplored locations as frontier queries and jointly optimizes object grounding and frontier selection. 3) End-to-end trajectory learning that combines **V**ision-**L**anguage-**E**xploration pre-training over a million diverse trajectories collected from both simulated and real-world RGB-D sequences. Extensive evaluations across various embodied navigation and question-answering benchmarks show that MTU3D outperforms state-of-the-art reinforcement learning and modular navigation approaches by 14%, 27%, 11%, and 3% in success rate on HM3D-OVON, GOAT-Bench, SG3D, and A-EQA, respectively. MTU3D's versatility enables navigation …
Poster
Yinuo Zhao · Jiale Yuan · Zhiyuan Xu · Xiaoshuai Hao · Xinyi Zhang · Kun Wu · Zhengping Che · Chi Liu · Jian Tang
[ Exhibit Hall I ]
Abstract
Recent advances in vision-language models (VLMs) have significantly improved performance in embodied tasks such as goal decomposition and visual comprehension. However, providing accurate rewards for robotic manipulation without fine-tuning VLMs remains challenging due to the absence of domain-specific robotic knowledge in pre-trained datasets and high computational costs that hinder real-time applicability. To address this, we propose **$T^2$-VLM**, a novel training-free, temporally consistent framework that generates accurate rewards through tracking the changes in VLM-derived subgoals. Specifically, our method first queries the VLM to establish spatially aware subgoals and an initial completion estimate before each round of interaction. We then employ a Bayesian tracking algorithm to update the goal completion status dynamically, using subgoal hidden states to generate structured rewards for reinforcement learning (RL) agents. This approach enhances long-horizon decision-making and improves failure recovery capabilities with RL. Extensive experiments indicate that **$T^2$-VLM** achieves state-of-the-art performance in two robot manipulation benchmarks, demonstrating superior reward accuracy with reduced computation consumption. We believe our approach not only advances reward generation techniques but also contributes to the broader field of embodied AI.
Poster
Hongjin Lyu · Bo Li · Paul Rosin · Yu-Kun Lai
[ Exhibit Hall I ]
Abstract
Image colorization is a typical ill-posed problem. Among various colorization methods, scribble-based methods have a unique advantage that allows users to accurately resolve ambiguities and modify the colors of any objects to suit their specific tastes. However, due to the time-consuming scribble drawing process, users tend to draw sparse scribbles instead of dense and detailed scribbles, which makes it challenging for existing methods, especially for regions with no immediate scribbles. Facing the above problems, this paper proposes a novel colorization algorithm named Local and Global Affinity Net (LGA-Net) that formulates the scribble-based colorization task as an affinity propagation process at both local and global levels. Instead of predicting color values directly, our neural network learns to predict local and global affinity relationships between pixels for a given grayscale input, describing how colors should be propagated, which are independent of the scribbles. Given reliable affinity relationships, the color propagation process is formulated as a maximum a posteriori problem. Both local and global affinities are represented using a weighted graph and enabled by a graph Laplacian regularizer to ensure accurate color propagation. Extensive experiments demonstrate that LGA-Net produces state-of-the-art colorization results when using sparse scribbles.
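The propagation step described above can be illustrated with a small sketch. It assumes one common form of the MAP objective, a quadratic data term on scribbled pixels plus a graph-Laplacian smoothness term, sum over scribbled i of (x_i - s_i)^2 plus lam times the sum over edges of w_ij (x_i - x_j)^2; the abstract does not state the exact objective, and LGA-Net predicts the affinities w_ij with a network rather than taking them as input. The function name `propagate_colors` and the Gauss-Seidel solver are illustrative choices.

```python
def propagate_colors(num_nodes, edges, scribbles, lam=1.0, iters=200):
    """Propagate scribble colors over a weighted affinity graph.

    edges:     list of (i, j, weight) affinities between pixels.
    scribbles: dict mapping pixel index -> user-provided color value.
    lam:       weight of the graph-Laplacian smoothness term (assumed form).
    Solves the quadratic objective by Gauss-Seidel sweeps.
    """
    neighbors = [[] for _ in range(num_nodes)]
    for i, j, w in edges:
        neighbors[i].append((j, w))
        neighbors[j].append((i, w))
    x = [scribbles.get(i, 0.0) for i in range(num_nodes)]
    for _ in range(iters):
        for i in range(num_nodes):
            data_weight = 1.0 if i in scribbles else 0.0
            denom = data_weight + lam * sum(w for _, w in neighbors[i])
            if denom == 0.0:
                continue  # isolated, unscribbled pixel: nothing constrains it
            numer = data_weight * scribbles.get(i, 0.0)
            numer += lam * sum(w * x[j] for j, w in neighbors[i])
            x[i] = numer / denom  # closed-form minimizer for x[i] alone
    return x
```

On a three-pixel chain with scribbles at both ends, the middle pixel settles at the average of its neighbors, which is the expected behavior of Laplacian smoothing.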
Poster
Aneel Damaraju · Dean Hazineh · Todd Zickler
[ Exhibit Hall I ]
Abstract
Vision benefits from grouping pixels into objects and understanding their spatial relationships, both laterally and in depth. This is captured by a scene representation comprising an occlusion-ordered stack of "object layers," each containing an isolated and amodally-completed object. To infer this representation from an image we introduce a diffusion-based architecture named Concurrent Object Layers (CObL). CObL generates a stack of object layers concurrently, using Stable Diffusion as a prior for natural objects, and using inference-time guidance to ensure the inferred layers composite back to the input image. We train CObL using a few thousand synthetically-generated images of multi-object tabletop scenes, and we find that it zero-shot generalizes to scenes of real-world tabletops with varying numbers of novel objects. In contrast to recent models for amodal object completion, CObL reconstructs multiple partially-occluded objects without any user prompting and without knowing the number of objects beforehand; and unlike previous models for object-centric representation learning, CObL is not limited to the closed world it was trained in.
Poster
Matthew Beveridge · Shree Nayar
[ Exhibit Hall I ]
Abstract
We introduce a taxonomy of solid materials for hierarchical material recognition from local appearance. Our taxonomy is motivated by vision applications, and is arranged according to the physical traits of materials. We contribute a diverse dataset of images and aligned depth maps of materials in the wild. The depth maps can be used to generate novel views to augment the dataset. Utilizing the taxonomy and dataset, we present a learning-based approach to hierarchical material recognition that uses graph neural networks. Our model leverages taxonomic proximity between material classes, and achieves state-of-the-art performance. We show that our model has the potential to generalize in few-shot learning settings. As a result, it achieves coarse classification of underrepresented materials.
Poster
Haoran Wang · Zekun Li · Jian Zhang · Lei Qi · Yinghuan Shi
[ Exhibit Hall I ]
Abstract
Large vision models like the Segment Anything Model (SAM) exhibit significant limitations when applied to downstream tasks in the wild. Consequently, reference segmentation, which leverages reference images and their corresponding masks to impart novel knowledge to the model, emerges as a promising new direction for adapting vision models. However, existing reference segmentation approaches predominantly rely on meta-learning, which still necessitates an extensive meta-training process and incurs massive data and computational costs. In this study, we propose a novel approach by representing the inherent correspondence between reference-target image pairs as a pseudo video. This perspective allows the latest version of SAM, known as SAM2, which is equipped with interactive video object segmentation (iVOS) capabilities, to be adapted to downstream tasks in a lightweight manner. We term this approach Correspondence As Video for SAM (CAV-SAM). CAV-SAM comprises two key modules: the Diffusion-Based Semantic Transition (DBST) module employs a diffusion model to construct a semantic transformation sequence, while the Test-Time Geometric Alignment (TTGA) module aligns the geometric changes within this sequence through test-time fine-tuning. We evaluated CAV-SAM on widely-used datasets, achieving segmentation performance improvements exceeding 5% over SOTA methods. Implementation is provided in the supplementary materials.
Poster
Vladislav Bargatin · Egor Chistov · Alexander Yakovenko · Dmitriy Vatolin
[ Exhibit Hall I ]
Abstract
Recent advances in optical flow estimation have prioritized accuracy at the cost of growing GPU memory consumption, particularly for high-resolution (FullHD) inputs. We introduce MEMFOF, a memory-efficient multi-frame optical flow method that identifies a favorable trade-off between multi-frame estimation and GPU memory usage. Notably, MEMFOF requires only 2.09 GB of GPU memory at runtime for 1080p inputs, and 28.5 GB during training, which uniquely positions our method to be trained at native 1080p without the need for cropping or downsampling. We systematically revisit design choices from RAFT-like architectures, integrating reduced correlation volumes and high-resolution training protocols alongside multi-frame estimation, to achieve state-of-the-art performance across multiple benchmarks while substantially reducing memory overhead. Our method outperforms more resource-intensive alternatives in both accuracy and runtime efficiency, validating its robustness for flow estimation at high resolutions. At the time of submission, our method ranks first on the Spring benchmark with a 1-pixel (1px) outlier rate of 3.289. On Sintel (clean), we share first place with the 5-frame VideoFlow-MOF, achieving an endpoint error (EPE) of 0.991, and on KITTI-2015, we place first with an Fl-all error of 2.94%. Ablation studies demonstrate the critical role of multi-frame strategies, correlation-volume scaling, and resolution-aware training in striking an optimal …
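The two headline metrics above can be sketched as follows. This is a simplified illustration, not any benchmark's official evaluation code: it assumes flow is given as a flat list of (u, v) vectors per pixel, and that the 1px outlier rate is the percentage of pixels whose endpoint error exceeds one pixel. The function names are hypothetical, and official protocols may add masking or other details.

```python
import math


def endpoint_errors(pred, gt):
    """Per-pixel Euclidean distance between predicted and true flow vectors."""
    return [math.hypot(pu - gu, pv - gv) for (pu, pv), (gu, gv) in zip(pred, gt)]


def mean_epe(pred, gt):
    """Average endpoint error (EPE) over all pixels."""
    errors = endpoint_errors(pred, gt)
    return sum(errors) / len(errors)


def outlier_rate(pred, gt, threshold=1.0):
    """Percentage of pixels whose endpoint error exceeds the threshold."""
    errors = endpoint_errors(pred, gt)
    return 100.0 * sum(e > threshold for e in errors) / len(errors)
```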
Poster
Qingwang Zhang · Yingying Zhu
[ Exhibit Hall I ]
Abstract
This paper addresses the limitations of existing cross-view object geo-localization schemes, which rely on rectangular proposals to localize irregular objects in satellite imagery. These "rectangular shackles" inherently struggle to precisely define objects with complex geometries, leading to incomplete coverage or erroneous localization. We propose a novel scheme, cross-view object segmentation (CVOS), which achieves fine-grained geo-localization by predicting pixel-level segmentation masks of query objects. CVOS enables accurate extraction of object shapes, sizes, and areas, which are critical for applications like urban planning and agricultural monitoring. We also created the CVOGL-Seg dataset specifically to support and evaluate CVOS. To tackle CVOS challenges, we introduce Transformer Object Geo-localization (TROGeo), a two-stage framework. First, the Heterogeneous Task Training Stage (HTTS) employs a single transformer encoder with a Cross-View Object Perception Module (CVOPM) and is trained by learning a heterogeneous task. Second, the SAM Prompt Stage (SPS) utilizes SAM’s zero-shot segmentation capability, guided by HTTS outputs, to generate precise masks. We extensively evaluate our method on CVOGL and CVOGL-Seg datasets and demonstrate state-of-the-art performance compared to existing models. Our work demonstrates that CVOS breaks the rectangular shackles and unlocks new potential for fine-grained object geo-localization.
Poster
Xianghui Xie · Jan Lenssen · Gerard Pons-Moll
[ Exhibit Hall I ]
Abstract
We propose MVGBench, a comprehensive benchmark for multi-view image generation models (MVGs) that evaluates 3D consistency in geometry and texture, image quality, and semantics (using vision language models). Recently, MVGs have been the main driving force in 3D object creation. However, existing metrics compare generated images against ground truth target views, which is not suitable for generative tasks where multiple solutions exist while differing from ground truth. Furthermore, different MVGs are trained on different view angles, synthetic data and specific lightings -- robustness to these factors and generalization to real data are rarely evaluated thoroughly. Without a rigorous evaluation protocol, it is also unclear what design choices contribute to the progress of MVGs. MVGBench evaluates three different aspects: best setup performance, generalization to real data and robustness. Instead of comparing against ground truth, we introduce a novel 3D self-consistency metric which compares 3D reconstructions from disjoint generated multi-views. We systematically compare 12 existing MVGs on 4 different curated real and synthetic datasets. With our analysis, we identify important limitations of existing methods, especially in terms of robustness and generalization, and we find the most critical design choices. Using the discovered best practices, we propose ViFiGen, a method that outperforms all evaluated …
Poster
Dongwoo Kang · Akhil Perincherry · Zachary Coalson · Aiden Gabriel · Stefan Lee · Sanghyun Hong
[ Exhibit Hall I ]
Abstract
An emerging paradigm in vision-and-language navigation (VLN) is the use of history-aware multi-modal transformer models. Given a language instruction, these models process observation and navigation history to predict the most appropriate action for an agent. While they have significantly improved performance, the scale of these models can be a bottleneck in practical settings with limited computational resources. In this work, we propose a novel input-adaptive navigation method to enhance VLN model efficiency. We first show that existing input-adaptive mechanisms fail to reduce computations without substantial performance degradation. To address this, we introduce three adaptive algorithms, each deployed at a different level: (1) To improve spatial efficiency, we selectively process panoramic views at each observation of an agent. (2) To improve intra-model efficiency, we propose importance-based adaptive thresholding for the early-exit methods. (3) To improve temporal efficiency, we implement a caching mechanism that prevents reprocessing of views previously seen by the agent. In evaluations on seven VLN benchmarks, we demonstrate over a 2$\times$ reduction in computation across three off-the-shelf agents in both standard and continuous environments.
Poster
Shaocong Dong · Lihe Ding · Xiao Chen · Yaokun Li · Yuxin WANG · Yucheng Wang · Qi WANG · Jaehyeok Kim · Chenjian Gao · Zhanpeng Huang · Zibin Wang · Tianfan Xue · Dan Xu
[ Exhibit Hall I ]
Abstract
To generate 3D objects, early research focused on multi-view-driven approaches relying solely on 2D renderings. Recently, the 3D native latent diffusion paradigm has demonstrated superior performance in 3D generation, because it fully leverages the geometric information provided in ground truth 3D data. Despite its fast development, 3D diffusion still faces three challenges. First, the majority of these methods represent a 3D object by one single latent, regardless of its complexity. This may lead to detail loss when generating 3D objects with multiple complicated parts. Second, most 3D assets are designed part by part, yet the current holistic latent representation overlooks the independence of these parts and their interrelationships, limiting the model's generative ability. Third, current methods rely on global conditions (e.g., text, image, point cloud) to control the generation process, lacking detailed controllability. Therefore, motivated by how 3D designers create a 3D object, we present a new part-based 3D generation framework, CoPart, which represents a 3D object with multiple contextual part latents and simultaneously generates coherent 3D parts. This part-based framework has several advantages, including: i) reduces the encoding burden of intricate objects by decomposing them into simpler parts, ii) facilitates part learning and part relationship modeling, and iii) naturally …
Poster
Xin Wang · Xinlin Wang · Shuiping Gou
[ Exhibit Hall I ]
Abstract
Vision-based geolocation techniques that establish spatial correspondences between smaller query images and larger georeferenced images have gained significant attention. Existing approaches typically employ a separate "retrieve-then-match" paradigm, but such paradigms suffer from computational inefficiency or precision limitations. To this end, we propose TopicGeo, a unified framework for direct and precise query-to-reference image matching via three key innovations. The textual object semantics, called topics, distilled from CLIP prompt learning are embedded into the geolocation framework to eliminate intra-class and inter-class distribution discrepancies while also enhancing processing efficiency. Center-based adaptive label assignment and outlier rejection mechanisms as a joint retrieval-matching optimization strategy ensure task-coherent feature learning and precise spatial correspondences. A multi-level fine matching pipeline is introduced to refine matching in terms of both quality and quantity. Evaluations on large-scale synthetic and real-world datasets illustrate that TopicGeo achieves state-of-the-art performance in retrieval recall and matching accuracy while maintaining a balance in computational efficiency.
Poster
Haoyu Wu · Jingyi Xu · Hieu Le · Dimitris Samaras
[ Exhibit Hall I ]
Abstract
Token merging can effectively accelerate various vision systems by processing groups of similar tokens only once and sharing the results across them. However, existing token grouping methods are often ad hoc and random, disregarding the actual content of the samples. We show that preserving high-information tokens during merging (those essential for semantic fidelity and structural details) significantly improves sample quality, producing finer details and more coherent, realistic generations. Despite being simple and intuitive, this approach remains underexplored. To this end, we propose an importance-based token merging method that prioritizes the most critical tokens in computational resource allocation, leveraging readily available importance scores, such as those from classifier-free guidance in diffusion models. Experiments show that our approach significantly outperforms baseline methods across multiple applications, including text-to-image synthesis, multi-view image generation, and video generation with various model architectures such as Stable Diffusion, Zero123++, AnimateDiff, or PixArt-$\alpha$.
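As a rough illustration of the general idea, the sketch below keeps the highest-importance tokens as merge destinations and averages every remaining token into its most similar destination. The function names, the cosine-similarity assignment rule, and the `keep` parameter are assumptions made for this example; the abstract does not specify the authors' actual procedure.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def merge_tokens(tokens, importance, keep):
    """Hypothetical importance-based merge.

    The `keep` most important tokens become destinations; every other token
    is averaged into the destination it is most similar to.
    Returns (merged_tokens, assignment), where assignment[i] is the index in
    merged_tokens that original token i maps to (used to share results back).
    """
    order = sorted(range(len(tokens)), key=lambda i: importance[i], reverse=True)
    dst = sorted(order[:keep])  # preserved token indices, in original order
    groups = {d: [tokens[d]] for d in dst}
    assignment = [None] * len(tokens)
    for slot, d in enumerate(dst):
        assignment[d] = slot
    for i in range(len(tokens)):
        if assignment[i] is not None:
            continue  # destinations are already assigned to themselves
        best = max(dst, key=lambda d: cosine(tokens[i], tokens[d]))
        groups[best].append(tokens[i])
        assignment[i] = dst.index(best)
    merged = [
        [sum(col) / len(groups[d]) for col in zip(*groups[d])]
        for d in dst
    ]
    return merged, assignment
```

After the expensive computation runs on `merged`, the `assignment` list lets each original position read back the result of the group it was merged into.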
Poster
Yunuo Chen · Zezheng Lyu · Bing He · Ning Cao · Gang chen · Guo Lu · Wenjun Zhang
[ Exhibit Hall I ]
Abstract
Recent learned image compression (LIC) models have achieved remarkable rate-distortion (RD) performance, yet their high computational complexity severely limits practical deployment. To overcome this challenge, we propose a novel Stage-wise Modular Distillation framework, SMoDi, which efficiently compresses LIC models while preserving RD performance. This framework treats each stage of LIC models as an independent sub-task, mirroring the teacher model's task decomposition in the student, thereby simplifying knowledge transfer. We identify two crucial factors determining the effectiveness of knowledge distillation: student model construction and loss function design. Specifically, we first propose Teacher-Guided Student Model Construction, a pruning-like method ensuring architectural consistency between teacher and student models. Next, we introduce Implicit End-to-end Supervision, facilitating adaptive energy compaction and bitrate regularization. Based on these insights, we develop KDIC, a lightweight student model derived from the state-of-the-art S2CFormer model. Experimental results demonstrate that KDIC achieves top-tier RD performance with significantly reduced computational complexity. To our knowledge, this work is among the first successful applications of knowledge distillation to learned image compression.
Poster
Zihan Wang · Jeff Tan · Tarasha Khurana · Neehar Peri · Deva Ramanan
[ Exhibit Hall I ]
Abstract
We address the problem of dynamic scene reconstruction from sparse-view videos. Prior work often requires dense multi-view captures with hundreds of calibrated cameras (e.g. Panoptic Studio) - such multi-view setups are prohibitively expensive to build and cannot capture diverse scenes in-the-wild. In contrast, we aim to reconstruct dynamic human behaviors, such as repairing a bike or dancing, from a small set of sparse-view cameras with complete scene coverage (e.g. four equidistant inward-facing static cameras). We find that dense multi-view reconstruction methods struggle to adapt to this sparse-view setup due to limited overlap between viewpoints. To address these limitations, we carefully align independent monocular reconstructions of each camera to produce time- and view-consistent dynamic scene reconstructions. Extensive experiments on PanopticStudio and Ego-Exo4D demonstrate that our method achieves higher quality reconstructions than prior art, particularly when rendering novel views.
Poster
Boqian Li · Zeyu Cai · Michael Black · Haiwen Feng · Yuliang Xiu
[ Exhibit Hall I ]
Abstract
Fitting a body to a 3D clothed human point cloud is a common yet challenging task. Traditional optimization-based approaches use multi-stage pipelines that are sensitive to pose initialization, while recent learning-based methods often struggle with generalization across diverse poses and garment types. We propose Equivariant Tightness Fitting for Clothed Humans, or ETCH, a novel pipeline that estimates cloth-to-body surface mapping through locally approximate SE(3) equivariance, encoding tightness as displacement vectors from the cloth surface to the underlying body. Following this mapping, pose-invariant body features regress sparse body markers, simplifying clothed human fitting into an inner-body marker fitting task. Extensive experiments on CAPE and 4D-Dress show that ETCH significantly outperforms state-of-the-art methods -- both tightness-agnostic and tightness-aware -- in body fitting accuracy on loose clothing (16.7% ~ 69.5%) and shape accuracy (average 49.9%). It also reduces directional errors by (67.2% ~ 89.8%) in few-shot settings (<1% data). Qualitative results demonstrate strong performance regardless of body shape, loose clothing, or challenging poses. We will release the code and models for research purposes.
Poster
David Serrano · Aditya Arora · Luis Herranz · Kosta Derpanis · Michael Brown · Javier Vazquez-Corral
[ Exhibit Hall I ]
Abstract
White balance (WB) correction in scenes with multiple illuminants remains a persistent challenge in computer vision. Recent methods explored fusion-based approaches, where a neural network linearly blends multiple sRGB versions of an input image, each processed with predefined WB presets. However, we demonstrate that these methods are suboptimal for common multi-illuminant scenarios. Additionally, existing fusion-based methods rely on sRGB WB datasets lacking dedicated multi-illuminant images, limiting both training and evaluation. To address these challenges, we introduce two key contributions. First, we propose an efficient transformer-based model that effectively captures spatial dependencies across sRGB WB presets, substantially improving upon linear fusion techniques. Second, we introduce a large-scale multi-illuminant dataset comprising over 16,000 sRGB images rendered with five different WB settings, along with WB-corrected images. Our method achieves up to 100% improvement over existing techniques on our new multi-illuminant image fusion dataset. We will release our code and dataset upon acceptance.
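For context, the linear fusion baseline mentioned above can be sketched as a per-pixel weighted sum of the preset renderings. This is a minimal illustration of that baseline only, not of the proposed transformer model; the function name and the nested-list image layout are assumptions, and the weight maps that a network would normally predict are passed in directly.

```python
def blend_presets(presets, weights):
    """Per-pixel linear blend of WB-preset renderings.

    presets: list of K images, each an H x W x 3 nested list (the same scene
             rendered with K different WB presets).
    weights: list of K weight maps, each H x W. In fusion-based methods a
             network predicts these; here they are supplied by the caller.
    """
    height, width = len(presets[0]), len(presets[0][0])
    out = [[[0.0, 0.0, 0.0] for _ in range(width)] for _ in range(height)]
    for image, wmap in zip(presets, weights):
        for y in range(height):
            for x in range(width):
                w = wmap[y][x]
                for c in range(3):
                    out[y][x][c] += w * image[y][x][c]
    return out
```

Because each output pixel depends only on the same pixel location in the K presets, this form of blending has no direct way to use spatial context, which is the limitation the abstract attributes to linear fusion.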
Poster
Siyu Chen · Ting Han · Changshe Zhang · Xin Luo · Meiliu Wu · Guorong Cai · Jinhe Su
[ Exhibit Hall I ]
Abstract
Vision Foundation Models (VFMs) have delivered remarkable performance in Domain Generalized Semantic Segmentation (DGSS). However, recent methods often overlook the fact that visual cues are susceptible, whereas the underlying geometry remains stable, rendering depth information more robust. In this paper, we investigate the potential of integrating depth information with features from VFMs, to improve the geometric consistency within an image and boost the generalization performance of VFMs. We propose a novel fine-tuning DGSS framework, named DepthForge, which integrates the visual cues from frozen DINOv2 or EVA02 and depth cues from frozen Depth Anything V2. In each layer of the VFMs, we incorporate depth-aware learnable tokens to continuously decouple domain-invariant visual and spatial information, thereby enhancing depth awareness and attention of the VFMs. Finally, we develop a depth refinement decoder and integrate it into the model architecture to adaptively refine multi-layer VFM features and depth-aware learnable tokens. Extensive experiments are conducted based on various DGSS settings and five different datasets as unseen target domains. The qualitative and quantitative results demonstrate that our method significantly outperforms alternative approaches with stronger performance, steadier visual-spatial attention, and superior generalization ability. In particular, DepthForge exhibits outstanding performance under extreme conditions (e.g., night and snow). Code …
Poster
Mingtao Feng · Longlong Mei · Zijie Wu · Jianqiao Luo · Fenghao Tian · Jie Feng · Weisheng Dong · Yaonan Wang
[ Exhibit Hall I ]
Abstract
Text to point cloud cross-modal localization is a crucial vision-language task for future human-robot collaboration. Existing coarse-to-fine frameworks assume that each query text precisely corresponds to the center area of a submap, limiting their applicability in real-world scenarios. This work redefines the task under a more realistic assumption, relaxing the one-to-one retrieval constraint by allowing partially matching query text and submap pairs. To address this challenge, we augment datasets with partially matching submaps and introduce an uncertainty-aware framework. Specifically, we model cross-modal ambiguity in fine-grained location regression by integrating uncertainty scores, represented as 2D Gaussian distributions, to mitigate the impact of challenging samples. Additionally, we propose an uncertainty-aware similarity metric that enhances similarity assessment between query text and submaps by propagating uncertainty into coarse place recognition, enabling the model to learn discriminative features, effectively handle partially matching samples, and improve task synergy. Extensive experiments on KITTI360Pose and CityRefer demonstrate that our method achieves state-of-the-art performance across both stages. Our code will be publicly available.
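One common way to let a regressor express uncertainty as a 2D Gaussian is to predict a mean and a per-axis standard deviation and train with the negative log-likelihood. The sketch below shows that generic loss under the assumption of a diagonal covariance; the function name and parameterization are illustrative and are not taken from the paper.

```python
import math


def gaussian_nll_2d(pred_xy, sigma_xy, target_xy):
    """Negative log-likelihood of target_xy under N(pred_xy, diag(sigma_xy^2)).

    Assumes an axis-aligned (diagonal-covariance) 2D Gaussian.
    """
    (px, py), (sx, sy), (tx, ty) = pred_xy, sigma_xy, target_xy
    zx = (tx - px) / sx
    zy = (ty - py) / sy
    return (
        0.5 * (zx * zx + zy * zy)
        + math.log(sx)
        + math.log(sy)
        + math.log(2.0 * math.pi)
    )
```

With this loss, a sample whose prediction is far from the target is penalized less when the model also reports a large standard deviation, which is one way that hard or ambiguous samples end up down-weighted; the log terms keep the model from inflating the uncertainty everywhere.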
Poster
Zhuoyuan Li · Jiahao Lu · Jiacheng Deng · Hanzhi Chang · Lifan Wu · Yanzhe Liang · Tianzhu Zhang
[ Exhibit Hall I ]
Abstract
The open vocabulary capability of 3D models is increasingly valued, as traditional methods with models trained on fixed categories fail to recognize unseen objects in complex dynamic 3D scenes. In this paper, we propose a simple yet effective approach, SAS, to integrate the open vocabulary capability of multiple 2D models and migrate it to the 3D domain. Specifically, we first propose Model Alignment via Text to map different 2D models into the same embedding space using text as a bridge. Then, we propose Annotation-Free Model Capability Construction to explicitly quantify the 2D model's capability of recognizing different categories using diffusion models. Following this, point cloud features from different 2D models are fused under the guidance of the constructed model capabilities. Finally, the integrated 2D open vocabulary capability is transferred to the 3D domain through feature distillation. SAS outperforms previous methods by a large margin across multiple datasets, including ScanNet v2, Matterport3D, and nuScenes, while its generalizability is further validated on downstream tasks, e.g., Gaussian segmentation and instance segmentation.
Poster
Yijun Yang · Zhao-Yang Wang · Qiuping Liu · Shu Wen Sun · Kang Wang · Rama Chellappa · Zongwei Zhou · Alan Yuille · Lei Zhu · Yu-Dong Zhang · Jieneng Chen
[ Exhibit Hall I ]
Abstract
Providing effective treatment and making informed decisions are essential goals of modern medicine and clinical care. We are interested in simulating disease dynamics for clinical decision-making, leveraging recent advances in large generative models. To this end, we introduce the Medical World Model (MeWM), the first world model in medicine that predicts future disease states based on clinical decisions. MeWM comprises (i) vision-language models to serve as policy models, and (ii) tumor generative models as dynamics models. The policy model generates action plans, such as clinical treatments, while the dynamics model simulates tumor progression or regression under given treatment conditions. Building on this, we propose the inverse dynamics model that applies survival analysis to the simulated post-treatment tumor, enabling the evaluation of treatment efficacy and the selection of the optimal clinical action plan. As a result, the proposed MeWM simulates disease dynamics by synthesizing post-treatment tumors, with state-of-the-art specificity in Turing tests evaluated by radiologists. Simultaneously, its inverse dynamics model outperforms medical-specialized GPTs in optimizing individualized treatment protocols across all metrics. Notably, MeWM improves clinical decision-making for interventional physicians, boosting F1-score in selecting the optimal TACE protocol by 13%, paving the way for future integration of medical world models as second readers.
Poster
Yanrui Bin · Wenbo Hu · Haoyuan Wang · Xinya Chen · Bing WANG
[ Exhibit Hall I ]
Abstract
Surface normal estimation serves as a cornerstone for a spectrum of computer vision applications. While numerous efforts have been devoted to static image scenarios, ensuring temporal coherence in video-based normal estimation remains a formidable challenge. Instead of merely augmenting existing methods with temporal components, we present NormalCrafter to leverage the inherent temporal priors of video diffusion models. To secure high-fidelity normal estimation across sequences, we propose Semantic Feature Regularization (SFR), which aligns diffusion features with semantic cues, encouraging the model to concentrate on the intrinsic semantics of the scene. Moreover, we introduce a two-stage training protocol that leverages both latent and pixel space learning to preserve spatial accuracy while maintaining long temporal context. Extensive evaluations demonstrate the efficacy of our method, showcasing superior performance in generating temporally consistent normal sequences with intricate details from diverse videos. Code and models will be publicly available.
Poster
Han Han · Wei Zhai · Yang Cao · Bin Li · Zheng-Jun Zha
[ Exhibit Hall I ]
Abstract
Tracking Any Point (TAP) plays a crucial role in motion analysis. Video-based approaches rely on iterative local matching for tracking, but they assume linear motion during the blind time between frames, which leads to point loss under large displacements or nonlinear motion. The high temporal resolution and motion blur-free characteristics of event cameras provide continuous, fine-grained motion information, capturing subtle variations with microsecond precision. This paper presents an event-based framework for tracking any point, which tackles the challenges posed by spatial sparsity and motion sensitivity in events through two tailored modules. Specifically, to resolve ambiguities caused by event sparsity, a motion-guidance module incorporates kinematic vectors into the local matching process. Additionally, a variable motion aware module is integrated to ensure temporally consistent responses that are insensitive to varying velocities, thereby enhancing matching precision. To validate the effectiveness of the approach, two event datasets for tracking any point are constructed by simulation. The method improves the $Survival_{50}$ metric by 17.9% over the event-only tracking-any-point baseline. Moreover, on standard feature tracking benchmarks, it outperforms all existing methods, even those that combine events and video frames.
Poster
Ruijie Zhu · Mulin Yu · Linning Xu · Lihan Jiang · Yixuan Li · Tianzhu Zhang · Jiangmiao Pang · Bo Dai
[ Exhibit Hall I ]
Abstract
3D Gaussian Splatting is renowned for its high-fidelity reconstructions and real-time novel view synthesis, yet its lack of semantic understanding limits object-level perception. In this work, we propose ObjectGS, an object-aware framework that unifies 3D scene reconstruction with semantic understanding. Instead of treating the scene as a unified whole, ObjectGS models individual objects as local anchors that generate neural Gaussians and share object IDs, enabling precise object-level reconstruction. During training, we dynamically grow or prune these anchors and optimize their features, while a one-hot ID encoding with a classification loss enforces clear semantic constraints. We show through extensive experiments that ObjectGS not only outperforms state-of-the-art methods on open-vocabulary and panoptic segmentation tasks, but also integrates seamlessly with applications like mesh extraction and scene editing.
Poster
Zhiqiang Yan · Zhengxue Wang · Haoye Dong · Jun Li · Jian Yang · Gim Hee Lee
[ Exhibit Hall I ]
Abstract
We introduce DuCos, a novel depth super-resolution framework grounded in Lagrangian duality theory, offering a flexible integration of multiple constraints and reconstruction objectives to enhance accuracy and robustness. Our DuCos is the first to significantly improve generalization across diverse scenarios with foundation models as prompts. The prompt design consists of two key components: Correlative Fusion (CF) and Gradient Regulation (GR). CF facilitates precise geometric alignment and effective fusion between prompt and depth features, while GR refines depth predictions by enforcing consistency with sharp-edged depth maps derived from foundation models. Crucially, these prompts are seamlessly embedded into the Lagrangian constraint term, forming a synergistic and principled framework. Extensive experiments demonstrate that DuCos outperforms existing state-of-the-art methods, achieving superior accuracy, robustness, and generalization. The source codes and pre-trained models will be publicly available.
Poster
Muhammad Usama Saleem · Ekkasit Pinyoanuntapong · Mayur Patel · Hongfei Xue · Ahmed Helmy · Srijan Das · Pu Wang
[ Exhibit Hall I ]
Abstract
Reconstructing a 3D hand mesh from a single RGB image is challenging due to complex articulations, self-occlusions, and depth ambiguities. Traditional discriminative methods, which learn a deterministic mapping from a 2D image to a single 3D mesh, often struggle with the inherent ambiguities in 2D-to-3D mapping. To address this challenge, we propose MaskHand, a novel generative masked model for hand mesh recovery that synthesizes plausible 3D hand meshes by learning and sampling from the probabilistic distribution of the ambiguous 2D-to-3D mapping process. MaskHand consists of two key components: (1) a VQ-MANO, which encodes 3D hand articulations as discrete pose tokens in a latent space, and (2) a Context-Guided Masked Transformer that randomly masks out pose tokens and learns their joint distribution, conditioned on corrupted token sequence, image context, and 2D pose cues. This learned distribution facilitates confidence-guided sampling during inference, producing mesh reconstructions with low uncertainty and high precision. Extensive evaluations on benchmark and real-world datasets demonstrate that MaskHand achieves state-of-the-art accuracy, robustness, and realism in 3D hand mesh reconstruction. Project website: https://anonymous-ml-model.github.io/MaskHand.
Poster
Ziyue Huang · Yongchao Feng · Ziqi Liu · Shuai Yang · Qingjie Liu · Yunhong Wang
[ Exhibit Hall I ]
Abstract
Remote sensing object detection has made significant progress, but most studies still focus on closed-set detection, limiting generalization across diverse datasets. Open-vocabulary object detection (OVD) provides a solution by leveraging multimodal associations between text prompts and visual features. However, existing OVD methods for remote sensing (RS) images are constrained by small-scale datasets and fail to address the unique challenges of remote sensing interpretation, including oriented object detection and the need for both high precision and real-time performance in diverse scenarios. To tackle these challenges, we propose OpenRSD, a universal open-prompt RS object detection framework. OpenRSD supports multimodal prompts and integrates multi-task detection heads to balance accuracy and real-time requirements. Additionally, we design a multi-stage training pipeline to enhance the generalization of the model. Evaluated on seven public datasets, OpenRSD demonstrates superior performance in oriented and horizontal bounding box detection, with real-time inference capabilities suitable for large-scale RS image analysis. Compared to YOLO-World, OpenRSD exhibits an 8.7% higher average precision and achieves an inference speed of 20.8 FPS. Code and models will be released.
Poster
Maolin Wei · Wanzhou Liu · Eshed Ohn-Bar
[ Exhibit Hall I ]
Abstract
If a Large Language Model (LLM) were to take a driving knowledge test today, would it pass? Beyond standard spatial and visual question answering (QA) tasks on current autonomous driving benchmarks, driving knowledge tests require a complete understanding of all traffic rules, signage, and right-of-way principles. To pass this test, human drivers must discern various edge cases that rarely appear in real-world datasets. In this work, we present RoadRules, an extensive open-source text and vision-based benchmark that exhaustively covers traffic regulations and scenarios. Through our experiments using RoadRules, we show that (1) state-of-the-art LLMs and Multimodal LLMs (MLLMs) perform well on basic traffic rules but exhibit significant weaknesses in numerical reasoning and complex right-of-way scenarios, traffic sign variations, and spatial layouts, (2) fine-tuning on RoadRules improves accuracy across multiple categories, particularly in regulatory sign recognition and intersection decision-making, (3) controlled variations in RoadRules-V provide insights into model sensitivity to environmental factors such as lighting, perspective, distance, and weather conditions, and (4) pretraining on RoadRules enhances downstream driving task performance, leading to improved results on real-world datasets such as nuScenes and DriveLM, while also demonstrating that models can internalize text and synthetic traffic knowledge to generalize effectively across downstream QA tasks. …
Poster
Huixin Sun · Yanjing Li · Linlin Yang · Xianbin Cao · Baochang Zhang
[ Exhibit Hall I ]
Abstract
Despite advances in generic object detection, there remains a performance gap in detecting small objects compared to normal-scale objects. We reveal that conventional object localization methods suffer from gradient instability in small objects due to sharper loss curvature, leading to a convergence challenge. To address the issue, we propose Uncertainty-Aware Gradient Stabilization (UGS), a framework that reformulates object localization as a classification task to stabilize gradients. UGS quantizes continuous labels into interval non-uniform discrete representations. Under a classification-based objective, the localization branch generates bounded and confidence-driven gradients, mitigating instability. Furthermore, UGS integrates an uncertainty minimization (UM) loss that reduces prediction variance and an uncertainty-guided refinement (UR) module that identifies and refines high-uncertainty regions via perturbations. Evaluated on four benchmarks, UGS consistently improves anchor-based, anchor-free, and state-of-the-art small object detectors. In particular, UGS boosts the prior art DNTR by 3.2% AP on the VisDrone dataset. The code will be released upon acceptance.
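The reformulation of localization as classification can be illustrated with a small sketch: a continuous target is mapped to one of several non-uniform bins, the model predicts logits over the bins, and a continuous value is recovered as the expected bin center. The bin edges, function names, and expectation-based decoding are assumptions chosen for this example; the abstract does not give UGS's actual quantization scheme.

```python
import bisect
import math

# Hypothetical non-uniform bin edges: finer near zero, coarser further out.
EDGES = [0.0, 0.5, 1.0, 2.0, 4.0, 8.0]
CENTERS = [(lo + hi) / 2.0 for lo, hi in zip(EDGES[:-1], EDGES[1:])]


def quantize(value):
    """Map a continuous target to a bin index, clipping to the covered range."""
    clipped = min(max(value, EDGES[0]), EDGES[-1])
    index = bisect.bisect_right(EDGES, clipped) - 1
    return min(index, len(CENTERS) - 1)  # the top edge belongs to the last bin


def softmax(logits):
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode(logits):
    """Recover a continuous value as the expected bin center."""
    return sum(p * c for p, c in zip(softmax(logits), CENTERS))


def cross_entropy(logits, target_bin):
    """Classification loss on the quantized target."""
    return -math.log(softmax(logits)[target_bin])
```

For softmax cross-entropy, the gradient with respect to each logit is the predicted probability minus the one-hot target, so every component lies in [-1, 1]. That property is one reason a classification objective yields bounded gradients regardless of how large the localization error is.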
Poster
Sangwon Baik · Hyeonwoo Kim · Hanbyul Joo
[ Exhibit Hall I ]
Abstract
We present a method for learning 3D spatial relationships between object pairs, referred to as object-object spatial relationships (OOR), by leveraging synthetically generated 3D samples from pre-trained 2D diffusion models. We hypothesize that images synthesized by 2D diffusion models inherently capture plausible and realistic OOR cues, enabling efficient ways to collect a 3D dataset to learn OOR for various unbounded object categories. Our approach begins by synthesizing diverse images that capture plausible OOR cues, which we then uplift into 3D samples. Leveraging our diverse collection of plausible 3D samples for the object pairs, we train a score-based OOR diffusion model to learn the distribution of their relative spatial relationships. Additionally, we extend our pairwise OOR to multi-object OOR by enforcing consistency across pairwise relations. Extensive experiments demonstrate the robustness of our method across various object-object spatial relationships, along with its applicability to real-world 3D scene arrangement tasks using the OOR diffusion model.
Poster
Yuru Jia · Valerio Marsocci · Ziyang Gong · Xue Yang · Maarten Vergauwen · Andrea Nascetti
[ Exhibit Hall I ]
Abstract
Self-supervised learning (SSL) has revolutionized representation learning in Remote Sensing (RS), advancing Geospatial Foundation Models (GFMs) to leverage vast unlabeled satellite imagery for diverse downstream tasks. Currently, GFMs primarily focus on discriminative objectives, such as contrastive learning or masked image modeling, owing to their proven success in learning transferable representations. However, generative diffusion models—which demonstrate the potential to capture multi-grained semantics essential for RS tasks during image generation—remain underexplored for discriminative applications. This prompts the question: can generative diffusion models also excel and serve as GFMs with sufficient discriminative power? In this work, we answer this question with SatDiFuser, a framework that transforms a diffusion-based generative geospatial foundation model into a powerful pretraining tool for discriminative RS. By systematically analyzing multi-stage, noise-dependent diffusion features, we develop three fusion strategies to effectively leverage these diverse representations. Extensive experiments on remote sensing benchmarks show that SatDiFuser outperforms state-of-the-art GFMs, achieving gains of up to +5.7% mIoU in semantic segmentation and +7.9% F1-score in classification, demonstrating the capacity of diffusion-based generative foundation models to rival or exceed discriminative GFMs.
Poster
Xiaoxiao Wang · Chunxiao Li · Peng Sun · Boming Miao · Yunjian Zhang · Yao Zhu
[ Exhibit Hall I ]
Abstract
Human keypoint detection is fundamental in computer vision, with applications in pose estimation and action recognition. However, existing evaluation metrics (e.g., OKS, PCP, PDJ) rely on human-annotated ground truth, a labor-intensive process that increases costs and limits scalability. To address this, we propose KPAScore (KeyPoint-Answering Score), an annotation-free metric independent of ground truth. It evaluates keypoint detection using a two-stage VLM-based question-answering process: first, the VLM identifies the presence of keypoints within the image, and second, visual prompts are introduced to query the likelihood of each keypoint being accurately localized within a predefined boundary. To validate the rationale behind KPAScore, we propose KPUBench (KeyPoint Understanding Benchmark), which comprehensively evaluates the VLM's ability to determine keypoint presence and localization. Extensive experiments demonstrate KPAScore's effectiveness from three perspectives: consistency to keypoint variation, correlation with traditional metrics, and alignment with human perception. We hope KPAScore will reduce reliance on manual annotations, facilitating broader adoption of keypoint detection in real-world applications.
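To make the dependence on annotations concrete, the sketch below computes Object Keypoint Similarity in the COCO style, one of the traditional metrics named above. It illustrates the existing metric, not KPAScore; the function signature and the plain-list inputs are choices made for this example.

```python
import math


def oks(pred, gt, visibility, area, k):
    """Object Keypoint Similarity in the COCO style.

    pred, gt:    lists of (x, y) keypoints (prediction and ground truth).
    visibility:  per-keypoint flags; values > 0 mean the keypoint is labeled.
    area:        object area, used as the squared scale s^2.
    k:           per-keypoint falloff constants.
    """
    total, count = 0.0, 0
    for (px, py), (gx, gy), v, ki in zip(pred, gt, visibility, k):
        if v <= 0:
            continue  # unlabeled keypoints are ignored
        d2 = (px - gx) ** 2 + (py - gy) ** 2
        total += math.exp(-d2 / (2.0 * area * ki * ki))
        count += 1
    return total / count if count else 0.0
```

Every argument other than `pred` (the ground-truth coordinates, visibility flags, and object area) comes from human annotation, which is the cost an annotation-free metric aims to avoid.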
Poster
Chengkai Hou · Yanjie Ze · Yankai Fu · Zeyu Gao · Songbo Hu · Yue Yu · Shanghang Zhang · Huazhe Xu
[ Exhibit Hall I ]
Abstract
General visual representations learned from web-scale datasets for robotics have achieved great success in recent years, enabling data-efficient robot learning on manipulation tasks; yet these pre-trained representations are mostly learned on 2D images, neglecting the inherent 3D nature of the world. However, due to the scarcity of large-scale 3D data, it is still hard to extract a universal 3D representation from web datasets. Instead, we seek a general visual pre-training framework that could improve all 3D representations. Our framework, called FVP, is a novel 4D Visual Pre-training framework for real-world robot learning. FVP frames the visual pre-training objective as a next-point-cloud-prediction problem, models the prediction model as a diffusion model, and pre-trains the model directly on larger public datasets. Across twelve real-world manipulation tasks, FVP boosts the average success rate of 3D Diffusion Policy (DP3) by 28%. The FVP pre-trained DP3 achieves state-of-the-art performance across imitation learning methods. Moreover, the efficacy of FVP carries over to various point cloud encoders and datasets. Finally, we apply FVP to RDT-1B, a larger Vision-Language-Action robotic model, enhancing its performance on various robot tasks.
Poster
Jiakai Zhang · Shouchen Zhou · Haizhao Dai · Xinhang Liu · Peihao Wang · Zhiwen Fan · Yuan Pei · Jingyi Yu
[ Exhibit Hall I ]
Abstract
Pose estimation from unordered images is fundamental for 3D reconstruction, robotics, and scientific imaging. Recent geometric foundation models, such as DUSt3R, enable end-to-end dense 3D reconstruction but remain underexplored in scientific imaging fields like cryo-electron microscopy (cryo-EM) for near-atomic protein reconstruction. In cryo-EM, pose estimation and 3D reconstruction from unordered particle images still depend on time-consuming iterative optimization, primarily due to challenges such as low signal-to-noise ratios (SNR) and distortions from the contrast transfer function (CTF). We introduce CryoFastAR, the first geometric foundation model that can directly predict poses from noisy cryo-EM images for Fast ab initio Reconstruction. By integrating multi-view features and training on large-scale simulated cryo-EM data with realistic noise and CTF modulations, CryoFastAR enhances pose estimation accuracy and generalization. To enhance training stability, we propose a progressive training strategy that first allows the model to extract essential features under simpler conditions before gradually increasing difficulty to improve robustness. Experiments show that CryoFastAR achieves comparable quality while significantly accelerating inference over traditional iterative approaches on both synthetic and real datasets. We will release our code, models, and datasets to stimulate further research.
Poster
Huachao Zhu · Zelong Liu · Zhichao Sun · Yuda Zou · Gui-Song Xia · Yongchao Xu
[ Exhibit Hall I ]
Abstract
Recognizing out-of-distribution (OoD) objects on roads is crucial for safe driving. Most existing methods rely on segmentation models' uncertainty as anomaly scores, often resulting in false positives, especially at ambiguous regions like boundaries, where segmentation models inherently exhibit high uncertainty. Additionally, it is challenging to define a suitable threshold to generate anomaly masks, especially with the inconsistencies in predictions across consecutive frames. We propose DetSeg, a novel paradigm that helps incorporate object-level understanding. DetSeg first detects all objects in the open world and then suppresses in-distribution (ID) bounding boxes, leaving only OoD proposals. These proposals can either help previous methods eliminate false positives (DetSeg-$\mathcal{R}$), or generate binary anomaly masks without complex threshold search when combined with a box-prompted segmentation module (DetSeg-$\mathcal{S}$). Additionally, we introduce vanishing point guided Hungarian matching (VPHM) to smooth the prediction results within a video clip, mitigating abrupt variations of predictions between consecutive frames. Comprehensive experiments on various benchmarks demonstrate that DetSeg significantly improves performance, reducing the FPR$_{95}$ of previous methods by up to 37.45%, offering a more robust and practical solution for this domain.
Poster
Yu Wang · Bo Dang · Wanchun Li · Wei Chen · Yansheng Li
[ Exhibit Hall I ]
Abstract
With the increasing resolution of remote sensing imagery (RSI), large-size RSI has emerged as a vital data source for high-precision vector mapping of geographic objects. Existing methods are typically constrained to processing small image patches, which often leads to the loss of contextual information and produces fragmented vector outputs. To address these issues, this paper introduces HoliTracer, the first framework designed to holistically extract vectorized geographic objects from large-size RSI. In HoliTracer, we enhance segmentation of large-size RSI using the Context Attention Net (CAN), which employs a local-to-global attention mechanism to capture contextual dependencies. Furthermore, we achieve holistic vectorization through a robust pipeline that leverages the Mask Contour Reformer (MCR) to reconstruct polygons and the Polygon Sequence Tracer (PST) to trace vertices. Extensive experiments on large-size RSI datasets, including buildings, water bodies, and roads, demonstrate that HoliTracer outperforms state-of-the-art methods. Our code will be made publicly available.
Poster
Tomasz Niewiadomski · Anastasios Yiannakidis · Hanz Cuevas Velasquez · Soubhik Sanyal · Michael Black · Silvia Zuffi · Peter Kulits
[ Exhibit Hall I ]
Abstract
The model-based estimation of 3D animal pose and shape from images enables computational modeling of animal behavior. Training models for this purpose requires large amounts of labeled image data with precise pose and shape annotations. However, capturing such data requires the use of multi-view or marker-based motion-capture systems, which are impractical to adapt to wild animals in situ and impossible to scale across a comprehensive set of animal species. Some have attempted to address the challenge of procuring training data by pseudo-labeling individual real-world images through manual 2D annotation, followed by 3D-parameter optimization to those labels. While this approach may produce silhouette-aligned samples, the obtained pose and shape parameters are often implausible due to the ill-posed nature of the monocular fitting problem. Sidestepping real-world ambiguity, others have designed complex synthetic-data-generation pipelines leveraging video-game engines and collections of artist-designed 3D assets. Such engines yield perfect ground-truth annotations but are often lacking in visual realism and require considerable manual effort to adapt to new species or environments. Motivated by these shortcomings, we propose an alternative approach to synthetic-data generation: rendering with a conditional image-generation model. We introduce a pipeline that samples a diverse set of poses and shapes for a variety of mammalian quadrupeds and generates realistic images with corresponding …
Poster
Haiwen Feng · Junyi Zhang · Qianqian Wang · Yufei Ye · Pengcheng Yu · Michael Black · Trevor Darrell · Angjoo Kanazawa
[ Exhibit Hall I ]
Abstract
Dynamic 3D reconstruction and point tracking in videos are typically treated as separate tasks, despite their deep connection. We propose St4RTrack, a feed-forward framework that simultaneously reconstructs and tracks dynamic video content in a world coordinate frame from RGB inputs. This is achieved by predicting two appropriately defined pointmaps for a pair of frames captured at different moments. Specifically, we predict both pointmaps at the same moment, in the same world, capturing both static and dynamic scene geometry while maintaining 3D correspondences. Chaining these predictions through the video sequence with respect to a reference frame naturally computes long-range correspondences, effectively combining 3D reconstruction with 3D tracking. Unlike prior methods that rely heavily on 4D ground-truth supervision, we employ a novel adaptation scheme based on a reprojection loss. We establish a new extensive benchmark for world-frame reconstruction and tracking, demonstrating the effectiveness and efficiency of our unified, data-driven framework.
Poster
Leekyeung Han · Hyunji Min · Gyeom Hwangbo · Jonghyun Choi · Paul Hongsuck Seo
[ Exhibit Hall I ]
Abstract
We introduce DialNav, a novel dialog-based navigation task, where an embodied agent (Navigator) collaborates with a remote guide (Guide) through multi-turn dialog to reach a goal location. Unlike prior work, our setting requires the Guide to infer the Navigator's location based on dialog, making dialog crucial for success. To support this task, we collect and release the Remote Assistance in Navigation (RAIN) dataset, which pairs human-human dialog with navigation trajectories in photorealistic environments. We design a comprehensive benchmark, evaluating navigation and dialog, and conduct extensive experiments analyzing the impact of different Navigator and Guide models. We highlight key challenges and publicly release the dataset, code, and evaluation framework to foster advancements in dialog-based embodied AI.
Poster
Taewoo Kim · Kuk-Jin Yoon
[ Exhibit Hall I ]
Abstract
In low-light environments, a longer exposure time is generally required to enhance image visibility; however, this setting inevitably causes motion blur. Even with a long exposure time, videos captured in low-light environments still suffer from issues such as low visibility, low contrast, and color distortion. Additionally, the long exposure time results in videos with a low frame rate. Therefore, videos captured in low light exhibit low visibility and motion blur, as well as low frame rates. To overcome these limitations, we propose a novel problem aimed at transforming motion-blurred, low-frame-rate videos with poor visibility in low-light environments into high-frame-rate videos while simultaneously enhancing their visibility. To tackle this challenge, we leverage the unique advantages of event cameras, which capture scene changes asynchronously, providing superior temporal resolution and a wider dynamic range compared to conventional frame-based cameras. These properties make event cameras particularly effective in reducing motion blur, compensating for low frame rates, and enhancing visibility in low-light conditions. To this end, we developed a hybrid camera system that integrates two RGB cameras and an event camera, captured a dedicated dataset for this task, and propose novel network architectures to effectively address this problem. For future work, we plan to release the …
Poster
Haoyi Zhu · Yifan Wang · Jianjun Zhou · Wenzheng Chang · Yang Zhou · Zizun Li · Junyi Chen · Chunhua Shen · Jiangmiao Pang · Tong He
[ Exhibit Hall I ]
Abstract
The integration of geometric reconstruction and generative modeling remains a critical challenge in developing AI systems capable of human-like spatial reasoning. This paper proposes Aether, a unified framework that enables geometry-aware reasoning in world models by jointly optimizing three core capabilities: (1) 4D dynamic reconstruction, (2) action-conditioned video prediction, and (3) goal-conditioned visual planning. Through task-interleaved feature learning, Aether achieves synergistic knowledge sharing across reconstruction, prediction, and planning objectives. Building upon video generation models, our framework demonstrates unprecedented synthetic-to-real generalization despite never observing real-world data during training. Furthermore, our approach achieves zero-shot generalization in both action following and reconstruction tasks, thanks to its intrinsic geometric modeling. Remarkably, even without real-world data, its reconstruction performance far exceeds that of domain-specific models. Additionally, Aether leverages a geometry-informed action space to seamlessly translate predictions into actions, enabling effective autonomous trajectory planning. We hope our work inspires the community to explore new frontiers in physically-reasonable world modeling and its applications.
Poster
Hyeongjin Nam · Donghwan Kim · Gyeongsik Moon · Kyoung Mu Lee
[ Exhibit Hall I ]
Abstract
The misaligned human texture across different human parts is one of the main limitations of existing 3D human reconstruction methods. Each human part, such as a jacket or pants, should maintain a distinct texture without blending into others. The structural coherence of human parts serves as a crucial cue to infer human textures in the invisible regions of a single image. However, most existing 3D human reconstruction methods do not explicitly exploit such part segmentation priors, leading to misaligned textures in their reconstructions. In this regard, we present PARTE, which uses 3D human part information as a key guide to reconstruct 3D human textures. Our framework comprises two core components. First, to infer 3D human part information from a single image, we propose a 3D part segmentation module (PartSegmenter) that initially reconstructs a textureless human surface and predicts human part labels based on the textureless surface. Second, to incorporate part information into texture reconstruction, we introduce a part-guided texturing module (PartTexturer), which acquires prior knowledge from a pre-trained image generation network on texture alignment of human parts. Our extensive experiments demonstrate that PARTE achieves state-of-the-art quality in 3D human reconstruction. We will release our code.
Poster
Kim Kiehn · Albin Ahlbäck · Kathlén Kohn
[ Exhibit Hall I ]
Abstract
We completely classify all minimal problems for Structure-from-Motion (SfM) where arrangements of points and lines are fully observed by multiple uncalibrated pinhole cameras. We find 291 minimal problems, 73 of which have unique solutions and can thus be solved linearly. Two of the linear problems allow an arbitrary number of views, while all other minimal problems have at most 9 cameras. All minimal problems have at most 7 points and at most 12 lines. We compute the number of solutions of each minimal problem, as this gives a measurement of the problem's intrinsic difficulty, and find that these numbers are relatively low (e.g., when compared with minimal problems for calibrated cameras). Finally, by exploring stabilizer subgroups of subarrangements, we develop a geometric and systematic way to 1) factorize minimal problems into smaller problems, 2) identify minimal problems in underconstrained problems, and 3) formally prove non-minimality.
Poster
Haoye Dong · Gim Hee Lee
[ Exhibit Hall I ]
Abstract
Human pose sequence refinement plays a crucial role in improving the accuracy, smoothness, and temporal coherence of pose estimation across a sequence of frames. Despite its importance in real-world applications, human pose sequence refinement has received less attention than human pose estimation. In this paper, we propose PS-Mamba, a novel framework that refines human pose sequences by integrating spatial-temporal graph learning with state space modeling. Specifically, we introduce the Spatial-Temporal Graph State Space (ST-GSS) block, which captures spatial and temporal dependencies across joints to smooth pose sequences while preserving structural integrity. The spatial-temporal graph models intricate joint interactions, while the state space component effectively manages temporal dynamics, reducing both short- and long-term pose instability. Additionally, we incorporate a dynamic graph weight matrix to adaptively model the relative influence of joint interactions, further mitigating pose ambiguity. Extensive experiments on challenging benchmarks demonstrate that our PS-Mamba outperforms SOTAs, achieving $\mathbf{-14.21}$ mm MPJPE (+18.5\%), $\mathbf{-13.59}$ mm PA-MPJPE (+22.1\%), and $\mathbf{-0.42}$ mm/s² ACCEL (+9.7\%) compared to SynSP on AIST++, significantly reducing jitters and enhancing pose stability. Our code has been submitted as supplementary and will be open-sourced upon acceptance.
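The metrics reported above can be made concrete with a short sketch. MPJPE is the mean Euclidean distance between predicted and ground-truth joints, and the acceleration error compares second-order finite differences of the two sequences. The code below is an illustrative assumption about the standard definitions, not the authors' evaluation script; it omits the Procrustes alignment used for PA-MPJPE and leaves acceleration in units per frame squared rather than scaling by the frame rate.

```python
import math


def mpjpe(pred, gt):
    """Mean per-joint position error.

    pred, gt: lists of frames; each frame is a list of (x, y, z) joints,
    in the same units (e.g. mm) and the same joint order.
    """
    if len(pred) != len(gt):
        raise ValueError("sequences must have the same number of frames")
    dists = [
        math.dist(p, g)
        for frame_p, frame_g in zip(pred, gt)
        for p, g in zip(frame_p, frame_g)
    ]
    return sum(dists) / len(dists)


def accel(seq, t, j):
    """Second-order finite difference of joint j at frame t."""
    return tuple(
        seq[t + 1][j][k] - 2 * seq[t][j][k] + seq[t - 1][j][k] for k in range(3)
    )


def accel_error(pred, gt):
    """Mean distance between predicted and ground-truth joint accelerations."""
    if len(pred) != len(gt) or len(pred) < 3:
        raise ValueError("need two sequences of equal length with >= 3 frames")
    errs = [
        math.dist(accel(pred, t, j), accel(gt, t, j))
        for t in range(1, len(gt) - 1)
        for j in range(len(gt[t]))
    ]
    return sum(errs) / len(errs)


# Toy single-joint sequence: the prediction jitters by 4 units at frame 1.
gt = [[(0.0, 0.0, 0.0)], [(1.0, 0.0, 0.0)], [(2.0, 0.0, 0.0)]]
pred = [[(0.0, 0.0, 0.0)], [(1.0, 0.0, 4.0)], [(2.0, 0.0, 0.0)]]
print(mpjpe(pred, gt), accel_error(pred, gt))
```

The toy example shows why both metrics are reported: a single-frame jitter changes the average position error only modestly but produces a large acceleration error, which is the kind of instability a sequence refiner is meant to remove.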
Poster
Amin Karimi Monsefi · Mridul Khurana · Rajiv Ramnath · Anuj Karpatne · Wei-Lun (Harry) Chao · Cheng Zhang
[ Exhibit Hall I ]
Abstract
We propose TaxaDiffusion, a taxonomy-informed training framework for diffusion models to generate fine-grained animal images with high morphological and identity accuracy. Unlike standard approaches that treat each species as an independent category, TaxaDiffusion incorporates domain knowledge that many species exhibit strong visual similarities, with distinctions often residing in subtle variations of shape, pattern, and color. To exploit these relationships, TaxaDiffusion progressively trains conditioned diffusion models across different taxonomic levels --- starting from broad classifications such as Class and Order, refining through Family and Genus, and ultimately distinguishing at the Species level. This hierarchical learning strategy first captures coarse-grained morphological traits shared by species with common ancestors, facilitating knowledge transfer, before refining fine-grained differences for species-level distinction. As a result, TaxaDiffusion enables accurate generation even with limited training samples per species. Extensive experiments on three fine-grained animal datasets demonstrate that TaxaDiffusion outperforms existing approaches, achieving superior fidelity in fine-grained animal image generation. Our model and code will be publicly available.
Poster
Zhenjun Yu · Wenqiang Xu · Pengfei Xie · Yutong Li · Brian Anthony · Zhuorui Zhang · Cewu Lu
[ Exhibit Hall I ]
Abstract
We present ViTaM-D, a novel visual-tactile framework for reconstructing dynamic hand-object interaction with distributed tactile sensing to enhance contact modeling. Existing methods, relying solely on visual inputs, often fail to capture occluded interactions and object deformation. To address this, we introduce DF-Field, a distributed force-aware contact representation leveraging kinetic and potential energy in hand-object interactions. ViTaM-D first reconstructs interactions using a visual network with contact constraint, then refines contact details through force-aware optimization, improving object deformation modeling. To evaluate deformable object reconstruction, we introduce the HOT dataset, featuring 600 hand-object interaction sequences in a high-precision simulation environment. Experiments on DexYCB and HOT datasets show that ViTaM-D outperforms state-of-the-art methods in reconstruction accuracy for both rigid and deformable objects. DF-Field also proves more effective in refining hand poses and enhancing contact modeling than previous refinement methods. The code, models, and datasets will be made public.
Poster
Shijie Zhou · Alexander Vilesov · Xuehai He · Ziyu Wan · Shuwang Zhang · Aditya Nagachandra · Di Chang · Dongdong Chen · Xin Wang · Achuta Kadambi
[ Exhibit Hall I ]
Abstract
Vision-language models (VLMs) have shown remarkable capabilities in integrating linguistic and visual reasoning but remain fundamentally limited in understanding dynamic spatiotemporal interactions. Humans effortlessly track and reason about object movements, rotations, and perspective shifts—abilities essential for robust real-world understanding yet notably lacking in current VLMs. In this paper, we introduce VLM4D, the first benchmark specifically designed to evaluate the spatiotemporal reasoning capabilities of VLMs. Our benchmark comprises diverse real-world and synthetic videos accompanied by carefully curated question-answer pairs emphasizing translational and rotational motions, perspective awareness, and motion continuity. Through comprehensive evaluations of state-of-the-art open and closed-source VLMs, we identify significant performance gaps compared to human baselines, highlighting fundamental deficiencies in existing models. Extensive analysis reveals that VLMs struggle particularly with integrating multiple visual cues and maintaining temporal coherence. We further explore promising directions, such as leveraging 4D feature field reconstruction and targeted spatiotemporal supervised fine-tuning, demonstrating their effectiveness in enhancing spatiotemporal comprehension. Our work aims to encourage deeper exploration into improving VLMs’ spatial and temporal grounding, paving the way towards more capable and reliable visual intelligence for dynamic environments.
Poster
Evan Casey · Tianyu Zhang · Shu Ishida · John Thompson · Amir Khasahmadi · Joseph Lambourne · Pradeep Kumar Jayaraman · Karl Willis
[ Exhibit Hall I ]
Abstract
We adapt alignment techniques from reasoning LLMs to the task of generating engineering sketch constraints found in computer-aided design (CAD) models. Engineering sketches consist of geometric primitives (e.g. points, lines) connected by constraints (e.g. perpendicular, tangent) that define the relationships between them. For a design to be easily editable, the constraints must effectively capture design intent, ensuring the geometry updates predictably when parameters change. Although current approaches can generate CAD designs, an open challenge remains to align model outputs with design intent; we label this problem `design alignment'. A critical first step towards aligning generative CAD models is to generate constraints which fully constrain all geometric primitives, without over-constraining or distorting sketch geometry. Using alignment techniques to train an existing constraint generation model with feedback from a constraint solver, we are able to fully constrain 93\% of sketches compared to 34\% when using a naïve supervised fine-tuning (SFT) baseline and only 8.9\% without alignment. Our approach can be applied to any existing constraint generation model and sets the stage for further research bridging alignment strategies between the language and design domains.
Poster
Qian Liang · Ruixu Geng · Jinbo Chen · Haoyu Wang · Yan Chen · Yang Hu
[ Exhibit Hall I ]
Abstract
Remote physiological measurement based on video and radar has made significant progress in recent years. However, unimodal methods based solely on a video or radar sensor have notable limitations due to their measurement principles, and multimodal remote photoplethysmography (rPPG) that combines these modalities has emerged as a promising direction. Despite its potential, the lack of large-scale multimodal data and the significant modality gap between video and radar pose substantial challenges in building robust video-radar rPPG models. To handle these problems, we suggest leveraging unimodal pre-training and present the Spatial alignment and Temporal Matching (SATM) Adapter to effectively fine-tune pre-trained unimodal backbones into a multimodal rPPG model. Given the distinct measurement principles of video- and radar-based methods, we propose Spatial Alignment to align the spatial distribution of their features. Furthermore, Temporal Matching is applied to mitigate waveform discrepancies between video and radar signals. By integrating these two modules into adapters, the unimodal backbones can retain their modality-specific knowledge while effectively extracting complementary features from each other. Extensive experiments across various challenging scenarios, including low-light conditions and head motions, demonstrate that our approach significantly surpasses the state-of-the-art methods. Code will be released upon acceptance.
Poster
Yusuke Hirota · Ryo Hachiuma · Boyi Li · Ximing Lu · Michael Boone · Boris Ivanovic · Yejin Choi · Marco Pavone · Yu-Chiang Frank Wang · Noa Garcia · Yuta Nakashima · Chao-Han Yang
[ Exhibit Hall I ]
Abstract
Gender bias in vision-language foundation models (VLMs) raises concerns about their safe deployment and is typically evaluated using benchmarks with gender annotations on real-world images. However, as these benchmarks often contain spurious correlations between gender and non-gender features, such as objects and backgrounds, we identify a critical oversight in gender bias evaluation: Do confounding features distort gender bias evaluation? To address this question, we systematically perturb non-gender features across four widely used benchmarks (COCO-gender, FACET, MIAP, and PHASE) and various VLMs to quantify their impact on bias measurements. Our findings reveal that even minimal perturbations, such as masking just 10% of objects or weakly blurring backgrounds, can dramatically alter bias scores, shifting metrics by up to 175% in generative VLMs and 43% in CLIP variants. This suggests that current bias evaluations often reflect model responses to confounders rather than true gender bias, undermining their reliability. Since creating confounder-free benchmarks is fundamentally challenging, we recommend reporting bias metrics alongside confounder-sensitivity measurements to enable a more reliable assessment of gender bias in VLMs.
Poster
Peizheng Li · Shuxiao Ding · You Zhou · Qingwen Zhang · Onat Inak · Larissa Triess · Niklas Hanselmann · Marius Cordts · Andreas Zell
[ Exhibit Hall I ]
Abstract
Open-world 3D semantic occupancy prediction aims to generate a voxelized 3D representation from sensor inputs while recognizing both known and unknown objects. Transferring open-vocabulary knowledge from vision-language models (VLMs) offers a promising direction but remains challenging. However, methods based on VLM-derived 2D pseudo-labels with traditional supervision are limited by a predefined label space and lack general prediction capabilities. Direct alignment with pretrained image embeddings, on the other hand, fails to achieve reliable performance due to often inconsistent image and text representations in VLMs. To address these challenges, we propose AGO, a novel 3D occupancy prediction framework with adaptive grounding to handle diverse open-world scenarios. AGO first encodes surrounding images and class prompts into 3D and text embeddings, respectively, leveraging similarity-based grounding training with 3D pseudo-labels. Additionally, a modality adapter maps 3D embeddings into a space aligned with VLM-derived image embeddings, reducing modality gaps. Experiments on Occ3D-nuScenes show that AGO improves unknown object prediction in zero-shot and few-shot transfer while achieving state-of-the-art closed-world self-supervised performance, surpassing prior methods by 4.09 mIoU. The code will be released upon acceptance.
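As a reminder of what the mIoU figure above measures, the sketch below computes mean intersection-over-union over flattened per-voxel class labels. It is a generic illustration of the metric, not the Occ3D-nuScenes evaluation protocol; the function name, the `ignore_label` parameter, and the toy labels are assumptions.

```python
def mean_iou(pred, gt, num_classes, ignore_label=None):
    """Mean IoU over classes that appear in the prediction or ground truth.

    pred, gt: flat sequences of integer class ids, one per voxel.
    ignore_label: ground-truth id to skip (e.g. unobserved voxels), if any.
    """
    inter = [0] * num_classes
    union = [0] * num_classes
    for p, g in zip(pred, gt):
        if g == ignore_label:
            continue
        if p == g:
            inter[g] += 1
            union[g] += 1
        else:
            # A mismatch enlarges the union of both classes involved.
            union[p] += 1
            union[g] += 1
    ious = [i / u for i, u in zip(inter, union) if u > 0]
    if not ious:
        raise ValueError("no valid voxels to evaluate")
    return sum(ious) / len(ious)


# Toy example with two classes over four voxels.
print(mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2))
```

Benchmarks usually report this value as a percentage, so a gain of 4.09 mIoU corresponds to 0.0409 on the 0 to 1 scale used here.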
Poster
Quanmin Liang · Qiang Li · Shuai Liu · Xinzi Cao · Jinyi Lu · Feidiao Yang · Wei Zhang · Kai Huang · Yonghong Tian
[ Exhibit Hall I ]
Abstract
Applying the pretraining-finetuning paradigm to event cameras presents significant challenges due to the scarcity of large-scale event datasets and the inherently sparse nature of event data, which increases the risk of overfitting during extensive pretraining. In this paper, we explore the transfer of pretrained image knowledge to the domain of event cameras to address this challenge. The key to our approach lies in adapting event data representations to align with image pretrained models while simultaneously integrating spatiotemporal information and mitigating data sparsity. To achieve this, we propose a lightweight SpatioTemporal information fusion Prompting (STP) method, which progressively fuses the spatiotemporal characteristics of event data through a dynamic perception module with multi-scale spatiotemporal receptive fields, enabling compatibility with image pretrained models. STP enhances event data representation by capturing local information within a large receptive field and performing global information exchange along the temporal dimension. This strategy effectively reduces sparse regions in event data while refining fine-grained details, all while preserving its inherent spatiotemporal structure. Our method significantly outperforms previous state-of-the-art approaches across classification, semantic segmentation, and optical flow estimation tasks. For instance, it achieves a top-1 accuracy of 68.87\% (+4.04\%) on N-ImageNet with only 1/10 of the pretraining parameters and 1/3 of the training …
Poster
Jiahao Xia · Yike Wu · Wenjian Huang · Jianguo Zhang · Jian Zhang
[ Exhibit Hall I ]
Abstract
Part-level features are crucial for image understanding, but few studies focus on them because of the lack of fine-grained labels. Although unsupervised part discovery can eliminate the reliance on labels, most such methods cannot maintain robustness across various categories and scenarios, which restricts their application range. To overcome this limitation, we present a more effective paradigm for unsupervised part discovery, named Masked Part Autoencoder (MPAE). It first learns part descriptors as well as a feature map from the inputs and produces patch features from a masked version of the original images. Then, the masked regions are filled with the learned part descriptors based on the similarity between the local features and descriptors. By restoring these masked patches using the part descriptors, they become better aligned with their part shapes, guided by appearance features from unmasked patches. Finally, MPAE robustly discovers meaningful parts that closely match the actual object shapes, even in complex scenarios. Moreover, several looser yet more effective constraints are proposed to enable MPAE to identify the presence of parts across various scenarios and categories in an unsupervised manner. This provides the foundation for addressing challenges posed by occlusion and for exploring part similarity across multiple categories. Extensive experiments …
Poster
Shaobo Zhang · Yuhang Huang · Wanqing Zhao · Wei Zhao · Ziyu Guan · Jinye Peng
[ Exhibit Hall I ]
Abstract
This paper introduces EA6D, a novel diffusion-based framework for 6D pose estimation that operates effectively in any environment. Traditional pose estimation methods struggle with the variability and complexity of real-world scenarios, often leading to overfitting on controlled datasets and poor generalization to new scenes. To address these challenges, we propose a generative pose estimation paradigm that generates environment-independent object representations for pose estimation, which are robust to environmental variations such as illumination, occlusion, and background clutter. Specifically, we propose the novel Environment Decoupling Diffusion Models (EDDM), which separate object representations from environmental factors while enabling efficient few-step sampling by leveraging input image priors instead of pure noise initialization. We validate our approach on four standard benchmarks and a self-made dataset, DiverseScenes. The results demonstrate that EA6D, trained using only synthetic data, can outperform state-of-the-art methods trained with both synthetic and realistic data. In particular, for fair comparisons with synthetic data, we exceed the previous SOTA by $18.1\%$ and $33.5\%$ on the LINEMOD and Linemod-Occluded datasets, respectively.
Poster
Peng-Hao Hsu · Ke Zhang · Fu-En Wang · Tao Tu · Ming-Feng Li · Yu-Lun Liu · Albert Y. C. Chen · Min Sun · Cheng-Hao Kuo
[ Exhibit Hall I ]
Abstract
Open-vocabulary (OV) 3D object detection is an emerging field, yet its exploration through image-based methods remains limited compared to 3D point cloud-based methods. We introduce OpenM3D, a novel open-vocabulary multi-view indoor 3D object detector trained without human annotations. In particular, OpenM3D is a single-stage detector adapting the 2D-induced voxel features from the ImGeoNet model. To support OV, it is jointly trained with a class-agnostic 3D localization loss requiring high-quality 3D pseudo boxes and a voxel-semantic alignment loss requiring diverse pre-trained CLIP features. We follow the training setting of OV-3DET where posed RGB-D images are given but no human annotations of 3D boxes or classes are available. We propose a 3D Pseudo Box Generation method using a graph embedding technique that combines 2D segments into coherent 3D structures. Our pseudo-boxes achieve higher precision and recall than other methods, including the method proposed in OV-3DET. We further sample diverse CLIP features from 2D segments associated with each coherent 3D structure to align with the corresponding voxel feature. Training a highly accurate single-stage detector requires both losses to be learned toward high-quality targets. At inference, OpenM3D, a highly efficient detector, requires only multi-view images for input and demonstrates superior accuracy …
Poster
Duo Wu · Jinghe Wang · Yuan Meng · Yanning Zhang · Le Sun · Zhi Wang
[ Exhibit Hall I ]
Abstract
Utilizing large language models (LLMs) for tool planning has emerged as a promising avenue for developing general AI systems, where LLMs automatically schedule external tools (e.g., vision models) to tackle complex tasks based on task descriptions. To push this paradigm toward practical applications, it is crucial for LLMs to consider tool execution costs (e.g., execution time) for tool planning. Unfortunately, prior studies overlook the tool execution costs, leading to the generation of expensive plans whose costs outweigh their benefits in terms of task performance. To fill this gap, we propose the Cost-Aware Tool Planning with LLMs (CATP-LLM) framework, which for the first time provides a coherent design to empower LLMs for cost-aware tool planning. Specifically, to facilitate efficient concurrent tool execution and cost reduction, we design a tool planning language to enhance the LLM for creating multi-branch non-sequential plans. Moreover, we propose a cost-aware offline reinforcement learning algorithm to fine-tune the LLM to optimize the performance-cost trade-off in tool planning. In the absence of public cost-related datasets, we further present OpenCATP, the first dataset for cost-aware planning, which comprises 11,100 evaluation samples from diverse tasks. Extensive experiments show that CATP-LLM outperforms GPT-4 even when using Llama2-7B as its backbone, with the …
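To illustrate why multi-branch, non-sequential plans can lower execution cost, the sketch below compares the latency of a small tool plan when independent branches run concurrently versus strictly one after another. It is a generic critical-path calculation, not the CATP-LLM planning language or cost model; the tool names, durations, and the function `plan_latency` are hypothetical.

```python
from graphlib import TopologicalSorter


def plan_latency(durations, deps):
    """Latency of a plan whose independent tools run concurrently.

    durations: tool name -> execution time in seconds (hypothetical values).
    deps: tool name -> set of tools whose outputs it needs.
    Returns the length of the longest dependency chain (the critical path).
    """
    graph = {tool: set(deps.get(tool, ())) for tool in durations}
    finish = {}
    # static_order() yields every tool after all of its dependencies.
    for tool in TopologicalSorter(graph).static_order():
        start = max((finish[d] for d in graph[tool]), default=0.0)
        finish[tool] = start + durations[tool]
    return max(finish.values(), default=0.0)


# Hypothetical plan: two independent vision tools feed one merge step.
durations = {"detect": 2.0, "caption": 3.0, "merge": 1.0}
deps = {"merge": {"detect", "caption"}}

concurrent = plan_latency(durations, deps)
sequential = sum(durations.values())
print(concurrent, sequential)
```

In this toy plan the two branches overlap, so the concurrent latency is set by the slower branch plus the merge step, while a sequential plan pays for every tool in turn.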
Poster
Qiaole Dong · Yanwei Fu
[ Exhibit Hall I ]
Abstract
Dense point tracking is a challenging task requiring the continuous tracking of every point in the initial frame throughout a substantial portion of a video, even in the presence of occlusions. Traditional methods use optical flow models to directly estimate long-range motion, but they often suffer from appearance drifting without considering temporal consistency. Recent point tracking algorithms usually depend on sliding windows for indirect information propagation from the first frame to the current one, which is slow and less effective for long-range tracking. To account for temporal consistency and enable efficient information propagation, we present a lightweight and fast model with $\textbf{S}$treaming memory for dense $\textbf{PO}$int $\textbf{T}$racking and online video processing. The $\textbf{SPOT}$ framework features three core components: a customized memory reading module for feature enhancement, a sensory memory for short-term motion dynamics modeling, and a visibility-guided splatting module for accurate information propagation. This combination enables SPOT to perform dense point tracking with state-of-the-art accuracy on the CVO benchmark, as well as comparable or superior performance to offline models on sparse tracking benchmarks such as TAP-Vid and RoboTAP. Notably, SPOT, with 10$\times$ fewer parameters, operates at least 2$\times$ faster than previous state-of-the-art models while maintaining the best performance on …
Poster
Dadong Jiang · Zhi Hou · Zhihui Ke · Xianghui Yang · Xiaobo Zhou · Tie Qiu
[ Exhibit Hall I ]
Abstract
Dynamic scene reconstruction is a long-term challenge in 3D vision. Recent methods extend 3D Gaussian Splatting to dynamic scenes via additional deformation fields and apply explicit constraints like motion flow to guide the deformation. However, they learn motion changes from individual timestamps independently, making it challenging to reconstruct complex scenes, particularly when dealing with violent movement, extreme-shaped geometries, or reflective surfaces. To address the above issue, we design a plug-and-play module called TimeFormer that equips existing deformable 3D Gaussian reconstruction methods with the ability to implicitly model motion patterns from a learning perspective. Specifically, TimeFormer includes a Cross-Temporal Transformer Encoder, which adaptively learns the temporal relationships of deformable 3D Gaussians. Furthermore, we propose a two-stream optimization strategy that transfers the motion knowledge learned from TimeFormer to the base stream during the training phase. This allows us to remove TimeFormer during inference, thereby preserving the original rendering speed. Extensive experiments on multi-view and monocular dynamic scenes validate the qualitative and quantitative improvements brought by TimeFormer. Project Page: https://anonymous-create-ui.github.io/TimeFormer
Poster
Ryan Po · Yotam Nitzan · Richard Zhang · Berlin Chen · Tri Dao · Eli Shechtman · Gordon Wetzstein · Xun Huang
[ Exhibit Hall I ]
Abstract
Video diffusion models have recently shown promise for world modeling through autoregressive frame prediction conditioned on actions. However, they struggle to maintain long-term memory due to the high computational cost associated with processing extended sequences in attention layers. To overcome this limitation, we propose a novel architecture leveraging state-space models (SSMs) to extend temporal memory without compromising computational efficiency. Unlike previous approaches that retrofit SSMs for non-causal vision tasks, our method fully exploits the inherent advantages of SSMs in causal sequence modeling. Central to our design is a block-wise SSM scanning scheme, which strategically trades off spatial consistency for extended temporal memory, combined with dense local attention to ensure coherence between consecutive frames. We evaluate the long-term memory capabilities of our model through spatial retrieval and reasoning tasks over extended horizons. Experiments on Memory Maze and Minecraft datasets demonstrate that our approach surpasses baselines in preserving long-range memory, while maintaining practical inference speeds suitable for interactive applications.
Poster
Shaojie Ma · Yawei Luo · Wei Yang · Yi Yang
[ Exhibit Hall I ]
Abstract
3D reconstruction and simulation, although interrelated, have distinct objectives: reconstruction requires a flexible 3D representation that can adapt to diverse scenes, while simulation needs a structured representation to model motion principles effectively. This paper introduces the Mesh-adsorbed Gaussian Splatting (MaGS) method to address this challenge. MaGS constrains 3D Gaussians to roam near the mesh, creating a mutually adsorbed mesh-Gaussian 3D representation. Such representation harnesses both the rendering flexibility of 3D Gaussians and the structured property of meshes. To achieve this, we introduce RMD-Net, a network that learns motion priors from video data to refine mesh deformations, alongside RGD-Net, which models the relative displacement between the mesh and Gaussians to enhance rendering fidelity under mesh constraints. To generalize to novel, user-defined deformations beyond input video without reliance on temporal data, we propose MPE-Net, which leverages inherent mesh information to bootstrap RMD-Net and RGD-Net. Due to the universality of meshes, MaGS is compatible with various deformation priors such as ARAP, SMPL, and soft physics simulation. Extensive experiments on the D-NeRF, DG-Mesh, and PeopleSnapshot datasets demonstrate that MaGS achieves state-of-the-art performance in both reconstruction and simulation.
Poster
Yunwei Lan · Zhigao Cui · Xin Luo · Chang Liu · Nian Wang · Menglin Zhang · Yanzhao Su · Dong Liu
[ Exhibit Hall I ]
Abstract
Recent advancements in unpaired dehazing, particularly those using GANs, show promising performance in processing real-world hazy images. However, these methods tend to face limitations due to the generator's limited transport mapping capability, which hinders the full exploitation of their effectiveness in unpaired training paradigms. To address these challenges, we propose DehazeSB, a novel unpaired dehazing framework based on the Schrödinger Bridge. By leveraging optimal transport (OT) theory, DehazeSB directly bridges the distributions between hazy and clear images. This enables optimal transport mappings from hazy to clear images in fewer steps, thereby generating high-quality dehazed results. To ensure the consistency of structural information and localized details in the restored images, we introduce detail-preserving regularization, which enforces pixel-level alignment between hazy inputs and dehazed outputs. Furthermore, we propose a novel prompt learning to leverage pre-trained CLIP models in distinguishing hazy images and clear ones, by learning a haze-aware vision-language alignment. Extensive experiments on multiple real-world datasets demonstrate our method's superiority. Our code will be open-sourced.
Poster
Minwen Liao · Hao Dong · Xinyi Wang · Kurban Ubul · Ziyang Yan · Yihua Shao
[ Exhibit Hall I ]
Abstract
Low-light enhancement has wide applications in autonomous driving, 3D reconstruction, remote sensing, surveillance, and so on, where it can significantly improve information utilization. However, most existing methods lack generalization and are limited to specific tasks such as image recovery. To address these issues, we propose Gated-Mechanism Mixture-of-Experts (GM-MoE), the first framework to introduce a mixture-of-experts network for low-light image enhancement. GM-MoE comprises a dynamic gated weight conditioning network and three sub-expert networks, each specializing in a distinct enhancement task. A self-designed gated mechanism dynamically adjusts the weights of the sub-expert networks for different data domains. Additionally, we integrate local and global feature fusion within the sub-expert networks to enhance image quality by capturing multi-scale features. Experimental results demonstrate that GM-MoE achieves superior generalization compared with 25 other approaches, reaching state-of-the-art performance in PSNR on 5 benchmarks and in SSIM on 4 benchmarks.
Poster
Abhinav Kumar · Yuliang Guo · Zhihao Zhang · Xinyu Huang · Liu Ren · Xiaoming Liu
[ Exhibit Hall I ]
Abstract
Monocular 3D object detectors, while effective on data from one ego camera height, struggle with unseen or out-of-distribution camera heights. Existing methods often rely on Plucker embeddings, image transformations or data augmentation. This paper takes a step towards this understudied problem by investigating the impact of camera height variations on state-of-the-art (SoTA) Mono3D models. With a systematic analysis on the extended CARLA dataset with multiple camera heights, we observe that depth estimation is a primary factor influencing performance under height variations. We mathematically prove and also empirically observe consistent negative and positive trends in mean depth error of regressed and ground-based depth models, respectively, under camera height changes. To mitigate this, we propose Camera Height Robust Monocular 3D Detector (CHARM3R), which averages both depth estimates within the model. CHARM3R significantly improves generalization to unseen camera heights, achieving SoTA performance on the CARLA dataset. Our code, models, and extended datasets will be publicly available.
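The averaging idea in the abstract above can be illustrated with a minimal sketch. It assumes one scalar depth per object and uses made-up numbers; the function name `fuse_depths` and all values are hypothetical and are not taken from CHARM3R itself.

```python
def fuse_depths(regressed_depth, ground_depth):
    """Average two depth estimates (in metres) for the same object."""
    return 0.5 * (regressed_depth + ground_depth)


# Hypothetical example: the true depth is 20 m. After a camera-height change,
# the regressed estimate drifts low and the ground-based estimate drifts high,
# matching the opposite error trends described in the abstract.
true_depth = 20.0
regressed = 18.5  # error of -1.5 m (illustrative value)
ground = 21.3     # error of +1.3 m (illustrative value)

fused = fuse_depths(regressed, ground)
fused_error = fused - true_depth
```

Because the two errors have opposite signs, they partly cancel in the average, so the fused error is smaller in magnitude than either individual error in this example.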
Poster
Tongshun Zhang · Pingping Liu · Yubing Lu · Mengen Cai · Zijian Zhang · Zhe Zhang · Qiuzhan Zhou
[ Exhibit Hall I ]
Abstract
Traditional Low-Light Image Enhancement (LLIE) methods primarily focus on uniform brightness adjustment, often neglecting instance-level semantic information and the inherent characteristics of different features. To address these limitations, we propose CWNet (Causal Wavelet Network), a novel architecture that leverages wavelet transforms for causal reasoning. Specifically, our approach comprises two key components: 1) Inspired by the concept of intervention in causality, we adopt a causal reasoning perspective to reveal the underlying causal relationships in low-light enhancement. From a global perspective, we employ a metric learning strategy to ensure causal embeddings adhere to causal principles, separating them from non-causal confounding factors while focusing on the invariance of causal factors. At the local level, we introduce an instance-level CLIP semantic loss to precisely maintain causal factor consistency. 2) Based on our causal analysis, we present a wavelet transform-based backbone network that models high-frequency information through an SS2D scanning strategy aligned with high-frequency components, enabling precise recovery of high-frequency details, while complex modeling of low-frequency information is achieved by combining the advantages of Fast Fourier Convolution and wavelet convolution. Extensive experiments demonstrate that CWNet significantly outperforms current state-of-the-art methods across multiple datasets, showcasing its robust performance across diverse scenes.
Poster
Hongyi Zhou · Xiaogang Wang · Yulan Guo · Kai Xu
[ Exhibit Hall I ]
Abstract
Accurately analyzing the motion parts and their motion attributes in dynamic environments is crucial for advancing key areas such as embodied intelligence. Addressing the limitations of existing methods that rely on dense multi-view images or detailed part-level annotations, we propose an innovative framework that can analyze 3D mobility from monocular videos in a zero-shot manner. This framework can precisely parse motion parts and motion attributes using only a monocular video, completely eliminating the need for annotated training data. Specifically, our method first constructs the scene geometry and roughly analyzes the motion parts and their initial motion attributes by combining depth estimation, optical flow analysis, and point cloud registration, then employs 2D Gaussian splatting for scene representation. Building on this, we introduce an end-to-end dynamic scene optimization algorithm specifically designed for articulated objects, refining the initial analysis results to ensure the system can handle ‘rotation’, ‘translation’, and even complex movements (‘rotation+translation’), demonstrating high flexibility and versatility. To validate the robustness and wide applicability of our method, we created a comprehensive dataset comprising both simulated and real-world scenarios. Experimental results show that our framework can effectively analyze articulated object motions in an annotation-free manner, showcasing its significant potential in future embodied intelligence …
Poster
Xinqi Fan · Xueli CHEN · Luoxiao Yang · Chuin Hong Yap · Rizwan Qureshi · Qi Dou · Moi Hoon Yap · Mubarak Shah
[ Exhibit Hall I ]
Abstract
Vision-language models (VLMs) have shown promise in test-time adaptation tasks due to their remarkable capabilities in understanding and reasoning about visual content through natural language descriptions. However, training VLMs typically demands substantial computational resources, and they often struggle to adapt efficiently to new domains or tasks. Additionally, dynamically estimating the test distribution from streaming data at test time remains a significant challenge. In this work, we propose a novel test-time retrieval-augmented adaption (TT-RAA) method that enables VLMs to maintain high performance across diverse visual recognition tasks without the need for task-specific training or large computational overhead. During inference, TT-RAA employs a streaming mixture of Gaussian database (SMGD) to continuously estimate test distributions, requiring minimal storage. Then, TT-RAA retrieves the most relevant information from the SMGD, enhancing the original VLM outputs. A key limitation of CLIP-based VLMs is their inter-modal vision-language optimization, which does not optimize vision-space similarity, leading to larger intra-modal variance. To address this, we propose a multimodal retrieval augmentation module that transforms the SMGD into a unified multimodal space, enabling retrieval that aligns both vision and language modalities. Extensive experiments across both cross-domain and out-of-distribution benchmarks comprising fourteen datasets demonstrate TT-RAA’s superior performance compared to state-of-the-art methods. Ablation …
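To make the "minimal storage" point concrete, here is a rough sketch of how per-class Gaussian statistics can be maintained over a stream without storing past samples. It is a simplification: it assumes one diagonal Gaussian per predicted label and uses Welford's running update. The class name `StreamingGaussians` and its methods are hypothetical and do not reproduce the SMGD described in the abstract.

```python
from collections import defaultdict


class StreamingGaussians:
    """Running per-label mean and variance (Welford's algorithm).

    Memory is O(num_labels * dim), independent of the stream length.
    """

    def __init__(self, dim):
        self.dim = dim
        self.count = defaultdict(int)
        self.mean = defaultdict(lambda: [0.0] * dim)
        self.m2 = defaultdict(lambda: [0.0] * dim)  # sum of squared deviations

    def update(self, label, feature):
        """Fold one feature vector into the statistics for `label`."""
        self.count[label] += 1
        n = self.count[label]
        mean, m2 = self.mean[label], self.m2[label]
        for i, x in enumerate(feature):
            delta = x - mean[i]
            mean[i] += delta / n
            m2[i] += delta * (x - mean[i])

    def variance(self, label):
        """Population variance per dimension (zeros if the label is unseen)."""
        n = self.count[label]
        if n == 0:
            return [0.0] * self.dim
        return [v / n for v in self.m2[label]]


# Illustrative stream of (pseudo-label, feature) pairs.
db = StreamingGaussians(dim=2)
db.update("cat", [1.0, 2.0])
db.update("cat", [3.0, 6.0])
cat_mean = db.mean["cat"]
cat_var = db.variance("cat")
```

Each update touches only the statistics of one label, so the cost per test sample is linear in the feature dimension.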
Poster
Jinxi Li · Ziyang Song · Bo Yang
[ Exhibit Hall I ]
Abstract
In this paper, we aim to model 3D scene geometry, appearance, and physical information just from dynamic multi-view videos in the absence of any human labels. By leveraging physics-informed losses as soft constraints or integrating simple physics models into neural networks, existing works often fail to learn complex motion physics, or doing so requires additional labels such as object types or masks. We propose a new framework named **TRACE** to model the motion physics of complex dynamic 3D scenes. The key novelty of our approach is that, by formulating each 3D point as a rigid particle with size and orientation in space, we directly learn a translation-rotation dynamics system for each particle, explicitly estimating a complete set of physical parameters to govern the particle's motion over time. Extensive experiments on three existing dynamic datasets and one newly created challenging synthetic dataset demonstrate the extraordinary performance of our method over baselines in the task of future frame extrapolation. A nice property of our framework is that multiple objects or parts can be easily segmented just by clustering the learned physical parameters. Our datasets and code will be released at https://github.com/.
Poster
Kevin Tandi · Xiang Dai · Chinmay Talegaonkar · Gal Mishne · Nicholas Antipa
[ Exhibit Hall I ]
Abstract
Compressive video capture encodes a short high-speed video into a single measurement using a low-speed sensor, then computationally reconstructs the original video. Prior implementations rely on expensive hardware and are restricted to imaging sparse scenes with empty backgrounds. We propose RnGCam, a system that fuses measurements from low-speed consumer-grade rolling-shutter (RS) and global-shutter (GS) sensors into video at kHz frame rates. The RS sensor is combined with a pseudorandom optic, called a diffuser, which spatially multiplexes scene information. The GS sensor is coupled with a conventional lens. The RS-diffuser provides low spatial detail and high temporal detail, complementing the GS-lens system's high spatial detail and low temporal detail. We propose a reconstruction method using implicit neural representations (INR) to fuse the measurements into a high-speed video. Our INR method separately models the static and dynamic scene components, while regularizing dynamics explicitly. In simulation, we show that our approach significantly outperforms previous RS compressive video methods, as well as state-of-the-art frame interpolators. We validate our approach in a dual-camera hardware setup, which generates 230 frames of video at 4,800 frames per second for dense scenes, using hardware that costs 10x less than previous compressive video systems.
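As a quick check of the figures quoted above, the time span covered by the reconstructed clip follows directly from the frame count and frame rate. The variable names below are illustrative only.

```python
frames = 230   # reconstructed frames, from the abstract
fps = 4800     # frames per second, from the abstract

# Total time span covered by the clip, in milliseconds.
duration_ms = frames / fps * 1000

# Time between consecutive reconstructed frames, in microseconds.
frame_interval_us = 1 / fps * 1_000_000
```

The clip therefore spans roughly 48 ms of scene time, with about 208 µs between consecutive frames.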
Poster
Shengjie Lin · Jiading Fang · Muhammad Zubair Irshad · Vitor Campagnolo Guizilini · Rares Ambrus · Greg Shakhnarovich · Matthew Walter
[ Exhibit Hall I ]
Abstract
Reconstructing articulated objects prevalent in daily environments is crucial for applications in augmented/virtual reality and robotics. However, existing methods face scalability limitations (requiring 3D supervision or costly annotations), robustness issues (being susceptible to local optima), and rendering shortcomings (lacking speed or photorealism). We introduce SplArt, a self-supervised, category-agnostic framework that leverages 3D Gaussian Splatting (3DGS) to reconstruct articulated objects and infer kinematics from two sets of posed RGB images captured at different articulation states, enabling real-time photorealistic rendering for novel viewpoints and articulations. SplArt augments 3DGS with a differentiable mobility parameter per Gaussian, achieving refined part segmentation. A multi-stage optimization strategy is employed to progressively handle reconstruction, part segmentation, and articulation estimation, significantly enhancing robustness and accuracy. SplArt exploits geometric self-supervision, effectively addressing challenging scenarios without requiring 3D annotations or category-specific priors. Evaluations on established and newly proposed benchmarks, along with applications to real-world scenarios using a handheld RGB camera, demonstrate SplArt's state-of-the-art performance and real-world practicality.
Poster
qiusheng huang · Xiaohui Zhong · Xu Fan · Hao Li
[ Exhibit Hall I ]
Abstract
Similar to conventional video generation, current deep learning-based weather prediction frameworks often lack explicit physical constraints, leading to unphysical outputs that limit their reliability for operational forecasting. Among various physical processes requiring proper representation, radiation plays a fundamental role as it drives Earth's weather and climate systems. However, accurate simulation of radiative transfer processes remains challenging for traditional numerical weather prediction (NWP) models due to their inherent complexity and high computational costs. Here, we propose FuXi-RTM, a hybrid physics-guided deep learning framework designed to enhance weather forecast accuracy while enforcing physical consistency. FuXi-RTM integrates a primary forecasting model (FuXi) with a fixed deep learning-based radiative transfer model (DLRTM) surrogate that efficiently replaces conventional radiation parameterization schemes. This represents the first deep learning-based weather forecasting framework to explicitly incorporate physical process modeling. Evaluated over a comprehensive 5-year dataset, FuXi-RTM outperforms its unconstrained counterpart in 88.51% of 3320 variable and lead time combinations, with improvements in radiative flux predictions. By incorporating additional physical processes, FuXi-RTM paves the way for next-generation weather forecasting systems that are both accurate and physically consistent.
Poster
Chengbo Yuan · Geng Chen · Li Yi · Yang Gao
[ Exhibit Hall I ]
Abstract
Egocentric videos provide valuable insights into human interactions with the physical world, which has sparked growing interest in the computer vision and robotics communities. A critical challenge in fully understanding the geometry and dynamics of egocentric videos is dense scene reconstruction. However, the lack of high-quality labeled datasets in this field has hindered the effectiveness of current supervised learning methods. In this work, we aim to address this issue by exploring a self-supervised dynamic scene reconstruction approach. We introduce **EgoMono4D**, a novel model that unifies the estimation of multiple variables necessary for *Ego*centric *Mono*cular *4D* reconstruction, including camera intrinsics, camera poses, and video depth, all within a fast feed-forward framework. Starting from a pretrained single-frame depth and intrinsics estimation model, we extend it with camera pose estimation and align multi-frame results on large-scale unlabeled egocentric videos. We evaluate EgoMono4D in both in-domain and zero-shot generalization settings, achieving superior performance in dense point cloud sequence reconstruction compared to all baselines. EgoMono4D represents the first attempt to apply self-supervised learning for point cloud sequence reconstruction to the label-scarce egocentric field, enabling fast, dense, and generalizable reconstruction. The code and trained models will be released in the future.
Poster
Ruofan Wang · Juncheng Li · Yixu Wang · Bo Wang · Xiaosen Wang · Yan Teng · Yingchun Wang · Xingjun Ma · Yu-Gang Jiang
[ Exhibit Hall I ]
Abstract
As large Vision-Language Models (VLMs) gain prominence, ensuring their safe deployment has become critical. Recent studies have explored VLM robustness against jailbreak attacks—techniques that exploit model vulnerabilities to elicit harmful outputs. However, the limited availability of diverse multimodal data has constrained current approaches to rely heavily on adversarial or manually crafted images derived from harmful text datasets, which often lack effectiveness and diversity across different contexts. In this paper, we propose IDEATOR, a novel jailbreak method that autonomously generates malicious image-text pairs for black-box jailbreak attacks. IDEATOR is grounded in the insight that VLMs themselves could serve as powerful red team models for generating multimodal jailbreak prompts. Specifically, IDEATOR leverages a VLM to create targeted jailbreak texts and pairs them with jailbreak images generated by a state-of-the-art diffusion model. Extensive experiments demonstrate IDEATOR’s high effectiveness and transferability, achieving a 94% attack success rate (ASR) in jailbreaking MiniGPT-4 with an average of only 5.34 queries, and high ASRs of 82%, 88%, and 75% when transferred to LLaVA, InstructBLIP, and Chameleon, respectively. Building on IDEATOR’s strong transferability and automated process, we introduce the VLBreakBench, a safety benchmark comprising 3,654 multimodal jailbreak samples. Our benchmark results on 11 recently released VLMs reveal significant …
Poster
Tatiana Zemskova · Dmitry Yudin
[ Exhibit Hall I ]
Abstract
A 3D scene graph is a compact scene model that stores information about objects and the semantic relationships between them, making it promising for robotic tasks. When interacting with a user, an embodied intelligent agent should be capable of responding to various queries about the scene formulated in natural language. Large Language Models (LLMs) are beneficial solutions for user-robot interaction due to their natural language understanding and reasoning abilities. Recent methods for creating learnable representations of 3D scenes have demonstrated the potential to improve the quality of LLM responses by adapting to the 3D world. However, the existing methods do not explicitly utilize information about the semantic relationships between objects, limiting themselves to information about their coordinates. In this work, we propose 3DGraphLLM, a method for constructing a learnable representation of a 3D scene graph. The learnable representation is used as input for LLMs to perform 3D vision-language tasks. In our experiments on the popular ScanRefer, Multi3DRefer, ScanQA, Sqa3D, and Scan2cap datasets, we demonstrate the advantage of this approach over baseline methods that do not use information about the semantic relationships between objects.
Poster
Qirui Wu · Denys Iliash · Daniel Ritchie · Manolis Savva · Angel Chang
[ Exhibit Hall I ]
Abstract
Reconstructing structured 3D scenes from RGB images using CAD objects unlocks efficient and compact scene representations that maintain compositionality and interactability. Existing works propose training-heavy methods relying on either expensive yet inaccurate real-world annotations or controllable yet monotonous synthetic data that do not generalize well to unseen objects or domains. We present Diorama, the first zero-shot open-world system that holistically models 3D scenes from single-view RGB observations without requiring end-to-end training or human annotations. We show the feasibility of our approach by decomposing the problem into subtasks and introduce better solutions to each: architecture reconstruction, 3D shape retrieval, object pose estimation, and scene layout optimization. We evaluate our system on both synthetic and real-world data to show we significantly outperform baselines from prior work. We also demonstrate generalization to real-world internet images and the text-to-scene task.
Poster
Tim Seizinger · Florin-Alexandru Vasluianu · Marcos Conde · Zongwei Wu · Radu Timofte
[ Exhibit Hall I ]
Abstract
Bokeh rendering methods play a key role in creating the visually appealing, softly blurred backgrounds seen in professional photography. While recent learning-based approaches show promising results, generating realistic Bokeh with controllable strength remains challenging. Existing methods require additional inputs and suffer from unrealistic Bokeh reproduction due to reliance on synthetic data. In this work, we propose Bokehlicious, a highly efficient network that provides intuitive control over Bokeh strength through an Aperture-Aware Attention mechanism, mimicking the physical lens aperture. To further address the lack of high-quality real-world data, we present RealBokeh, a novel dataset featuring 23,000 high-resolution (24-MP) images captured by professional photographers, covering diverse scenes with varied aperture and focal length settings. Evaluations on both our new RealBokeh and established Bokeh rendering benchmarks show that Bokehlicious consistently outperforms SOTA methods while significantly reducing computational cost and exhibiting strong zero-shot generalization. Our method and dataset further extend to defocus deblurring, achieving competitive results on the RealDOF benchmark. Our code and data will be public.
Poster
Yannick Burkhardt · Simon Schaefer · Stefan Leutenegger
[ Exhibit Hall I ]
Abstract
Event-based keypoint detection and matching holds significant potential, enabling the integration of event sensors into highly optimized Visual SLAM systems developed for frame cameras over decades of research. Unfortunately, existing approaches struggle with the motion-dependent appearance of keypoints and the complex noise prevalent in event streams, resulting in severely limited feature matching capabilities and poor performance on downstream tasks. To mitigate this problem, we propose SuperEvent, a data-driven approach to predict stable keypoints with expressive descriptors. Due to the absence of event datasets with ground truth keypoint labels, we leverage existing frame-based keypoint detectors on readily available event-aligned and synchronized gray-scale frames for self-supervision: we generate temporally sparse keypoint pseudo-labels considering that events are a product of both scene appearance and camera motion. Combined with our novel, information-rich event representation, we enable SuperEvent to effectively learn robust keypoint detection and description in event streams. Finally, we demonstrate the usefulness of SuperEvent by its integration into a modern sparse keypoint and descriptor-based SLAM framework originally developed for traditional cameras, surpassing the state-of-the-art in event-based SLAM by a wide margin. The source code and model weights will be published after acceptance.
Poster
Runyang Feng · Hyung Jin Chang · Tze Ho Elden Tse · Boeun Kim · Yi Chang · Yixing Gao
[ Exhibit Hall I ]
Abstract
Modeling high-resolution spatiotemporal representations, including both global dynamic contexts (e.g., holistic human motion tendencies) and local motion details (e.g., high-frequency changes of keypoints), is essential for video-based human pose estimation (VHPE). Current state-of-the-art methods typically unify spatiotemporal learning within a single type of modeling structure (convolution or attention-based blocks), which inherently have difficulties in balancing global and local dynamic modeling and may bias the network to one of them, leading to suboptimal performance. Moreover, existing VHPE models suffer from quadratic complexity when capturing global dependencies, limiting their applicability especially for high-resolution sequences. Recently, the state space models (known as Mamba) have demonstrated significant potential in modeling long-range contexts with linear complexity; however, they are restricted to 1D sequential data. In this paper, we present a novel framework that extends Mamba from two aspects to separately learn global and local high-resolution spatiotemporal representations for VHPE. Specifically, we first propose a Global Spatiotemporal Mamba, which performs 6D selective space-time scan and spatial- and temporal-modulated scan merging to efficiently extract global representations from high-resolution sequences. We further introduce a windowed space-time scan-based Local Refinement Mamba to enhance the high-frequency details of localized keypoint motions. Extensive experiments on four benchmark datasets demonstrate that the …
Poster
Xiaorong Qin · Xinhang Song · Sixian Zhang · Xinyao Yu · Xinmiao Zhang · Shuqiang Jiang
[ Exhibit Hall I ]
Abstract
Object navigation tasks require an agent to locate a target object using visual observations in unseen environments, where unfamiliar layouts and novel object appearances can hinder navigation. Most existing methods lack the adaptability needed to handle these uncertainties, as their navigation models remain fixed during testing. In this paper, we address this challenge by examining object-conditioned trajectory distribution shifts in navigation caused by changes in environmental dynamics. We propose learning a central conditional distribution as a prior that approximates the specific distributions of diverse environments. To retain environment-specific information during navigation, we allow each environment-specific distribution to approximate this central distribution rather than relying on it directly. To implement this, we introduce a meta-learning mechanism that integrates with traditional navigation methods, offering tailored solutions for various types of navigation approaches. Our approach, Learning on the Go (LOG), enables agents to learn on the go, allowing for flexible, adaptive, real-time learning during navigation. Our theoretical analysis highlights the benefits of learning a central distribution for effective generalization across environments, and empirical results confirm the proposed method’s effectiveness, demonstrating superior performance compared to existing approaches.
Poster
Zhihao ZHU · Yifan Zheng · Siyu Pan · Yaohui Jin · Yao Mu
[ Exhibit Hall I ]
Abstract
The fragmentation between high-level task semantics and low-level geometric features remains a persistent critical challenge in robotic manipulation. While vision-language models (VLMs) have demonstrated their potential in generating affordance-aware visual representations, the lack of semantic grounding in canonical spaces and the reliance on manual annotation severely limit their ability to capture dynamic semantic-affordance relationships. To address these limitations, we propose Primitive-Aware Semantic Grounding (PASG), a closed-loop framework that introduces: (1) automatic primitive extraction through geometric feature aggregation, enabling cross-category detection of keypoints and axes; (2) VLM-driven semantic anchoring that dynamically couples geometric primitives with functional affordances and task-relevant descriptions; (3) a spatial-semantic reasoning benchmark and a fine-tuned VLM (Qwen2.5VL-PA). Extensive experiments demonstrate that PASG achieves a finer-grained semantic-affordance understanding of objects, establishing a unified paradigm for bridging geometric primitives with task semantics in robotic manipulation.
Poster
Yihong Cao · Jiaming Zhang · Xu Zheng · Hao Shi · Kunyu Peng · Hang Liu · Kailun Yang · Hui Zhang
[ Exhibit Hall I ]
Abstract
Panoramic image processing is essential for omni-context perception, yet faces constraints like distortions, perspective occlusions, and limited annotations. Previous unsupervised domain adaptation methods transfer knowledge from labeled pinhole data to unlabeled panoramic images, but they require access to source pinhole data. To address these constraints, we introduce a more practical task, i.e., Source-Free Occlusion-Aware Seamless Segmentation (SFOASS), and propose its first solution, called UNconstrained Learning Omni-Context Knowledge (UNLOCK). Specifically, UNLOCK includes two key modules: Omni Pseudo-Labeling Learning and Amodal-Driven Context Learning. While adapting without relying on source data or target labels, this framework enhances models to achieve segmentation with 360° viewpoint coverage and occlusion-aware reasoning. Furthermore, we benchmark the proposed SFOASS task through both real-to-real and synthetic-to-real adaptation settings. Experimental results show that our source-free method achieves performance comparable to source-dependent methods, yielding SOTA scores of 10.9 in mAAP and 11.6 in mAP, along with an absolute improvement of +4.3 in mAPQ over the source-only method. All data and code will be made publicly available.
Poster
Mahmoud Ahmed · Junjie Fei · Jian Ding · Eslam Abdelrahman · Mohamed Elhoseiny
[ Exhibit Hall I ]
Abstract
In this paper, we introduce Part-Aware Point Grounded Description (PaPGD), a challenging task aimed at advancing 3D multimodal learning for fine-grained, part-aware segmentation grounding and detailed explanation of 3D objects. Existing 3D datasets largely focus on either vision-only part segmentation or vision-language scene segmentation, lacking the fine-grained multimodal segmentation needed for robotic navigation and interaction in real-world environments. To address this gap, we present the 3DCoMPaT Grounded Instructions (3DCoMPaT-GrIn) Dataset, a comprehensive resource that pairs rich point cloud descriptions with corresponding part-level segmentation masks. This dataset encompasses extensive samples designed for both PaPGD and fine-grained single-part grounding tasks. To tackle the inherent challenges of grounding objects and generating grounded descriptions at the part level, we propose Kestrel, a part-aware 3D multimodal large language model that integrates an advanced language model for nuanced language comprehension with multi-level point feature propagation and a query refinement mechanism to enhance spatial reasoning at the part level. Extensive experiments demonstrate that Kestrel effectively bridges the gap between part-aware language understanding and 3D segmentation grounding, paving the way for more robust and interpretable 3D object comprehension that meets the demands of real-world robotic applications.
Poster
Xiangyu Yin · Boyuan Yang · Weichen Liu · Qiyao Xue · Abrar Alamri · Goeran Fiedler · Wei Gao
[ Exhibit Hall I ]
Abstract
Prosthetic legs play a pivotal role in clinical rehabilitation, allowing individuals with lower-limb amputations to regain mobility and improve their quality of life. Gait analysis is fundamental for optimizing prosthesis design and alignment, directly impacting the mobility and life quality of individuals with lower-limb amputations. Vision-based machine learning (ML) methods offer a scalable and non-invasive solution to gait analysis, but face challenges in correctly detecting and analyzing prostheses, due to their unique appearances and new movement patterns. In this paper, we aim to bridge this gap by introducing a multi-purpose dataset, namely ProGait, to support multiple vision tasks including Video Object Segmentation, 2D Human Pose Estimation, and Gait Analysis (GA). ProGait provides 412 video clips from four above-knee amputees testing multiple newly-fitted prosthetic legs through walking trials, and depicts the presence, contours, poses, and gait patterns of human subjects with transfemoral prosthetic legs. Alongside the dataset itself, we also present benchmark tasks and fine-tuned baseline models to illustrate the practical application and performance of the ProGait dataset. We compared our baseline models against pre-trained vision models, demonstrating improved generalizability when applying the ProGait dataset for prosthesis-specific tasks.
Poster
Zhiyuan Yang · Anqi Cheng · Haiyue Zhu · Tianjiao Li · Pey Yuen Tao · Kezhi Mao
[ Exhibit Hall I ]
Abstract
Depth completion, the task of reconstructing dense depth maps from sparse depth and RGB images, plays a critical role in 3D scene understanding. However, existing methods often struggle to recover high-frequency details, such as regions with fine structures or weak signals, since depth sensors may fail to capture accurate depth maps in those regions, leading to imperfect supervision ground truth. To overcome this limitation, it is essential to introduce an alternative training source for the models. Emerging depth foundation models excel at producing high-frequency details from RGB images, yet their depth maps suffer from inconsistent scaling. Therefore, we propose a novel teacher-student framework that enhances depth completion by distilling high-frequency knowledge from depth foundation models across multiple scales. Our approach introduces two key innovations: Adaptive Local Wavelet Decomposition, which dynamically adjusts the wavelet decomposition level based on local complexity for efficient feature extraction, and Topological Constraints, which apply persistent homology to enforce structural coherence and suppress spurious depth edges. Experimental results demonstrate that our method outperforms state-of-the-art methods, preserving high-frequency details and overall depth fidelity.
Poster
Miroslav Purkrabek · Jiri Matas
[ Exhibit Hall I ]
Abstract
Human pose estimation methods work well on isolated people but struggle with multiple-bodies-in-proximity scenarios. Previous work has addressed this problem by conditioning pose estimation on detected bounding boxes or keypoints, but overlooked instance masks. We propose to iteratively enforce mutual consistency of bounding boxes, instance masks, and poses. The introduced BBox-Mask-Pose (BMP) method uses three specialized models that improve each other's output in a closed loop. All models are adapted for mutual conditioning, which improves robustness in multi-body scenes. MaskPose, a new mask-conditioned pose estimation model, is the best among top-down approaches on OCHuman. BBox-Mask-Pose pushes SOTA on the OCHuman dataset in all three tasks -- detection, instance segmentation, and pose estimation. It also achieves SOTA performance on COCO pose estimation. The method is especially good in scenes with large instance overlap, where it improves detection by 39% over the baseline detector. With small specialized models and faster runtime, BMP is an effective alternative to large human-centered foundational models. Code and models will be published.
Poster
Xinkuan Qiu · Meina Kan · Yongbin Zhou · Shiguang Shan
[ Exhibit Hall I ]
Abstract
Multimodal Large Language Models (MLLMs) have made significant strides in visual and language tasks. However, despite their impressive performance on standard datasets, these models encounter considerable robustness challenges when processing corrupted images, raising concerns about their reliability in safety-critical applications. To address this issue, we introduce the MLLM-IC benchmark, specifically designed to assess the performance of MLLMs under image corruption scenarios. MLLM-IC offers a more comprehensive evaluation of corruption robustness compared to existing benchmarks, enabling a multi-dimensional assessment of various MLLM capabilities across a broad range of corruption types. It includes 40 distinct corruption types and 34 low-level multimodal capabilities, each organized into a three-level hierarchical structure. Notably, it is the first corruption robustness benchmark designed to facilitate the evaluation of fine-grained MLLM capabilities. We further evaluate several prominent MLLMs and derive valuable insights into their characteristics. We believe the MLLM-IC benchmark will provide crucial insights into the robustness of MLLMs in handling corrupted images and contribute to the development of more resilient MLLMs.
Poster
Xinhang Liu · Jiawei Shi · Zheng Dang · Yuchao Dai
[ Exhibit Hall I ]
Abstract
We present MixRI, a lightweight network that solves the CAD-based novel object pose estimation problem in RGB images. It can be instantly applied to a novel object at test time without finetuning. We design our network to meet the demands of real-world applications, emphasizing reduced memory requirements and fast inference time. Unlike existing works that utilize many reference images and have many network parameters, we directly match points based on the multi-view information between the query and reference images with a lightweight network. Thanks to our reference image fusion strategy, we significantly decrease the number of reference images, thereby decreasing the time needed to process these images and the memory required to store them. Furthermore, with our lightweight network, our method requires less inference time. Even with fewer reference images, experiments on seven core datasets in the BOP challenge show that our method achieves results comparable to other methods that require more reference images and more network parameters.
Poster
Chenwei Lin · Hanjia Lyu · Xian Xu · Jiebo Luo
[ Exhibit Hall I ]
Abstract
Large Vision-Language Models (LVLMs) have demonstrated outstanding performance in various general multimodal applications and have shown promising potential in specialized domains. However, the application potential of LVLMs in the insurance domain—characterized by rich application scenarios and abundant multimodal data—has not been effectively explored. There is no systematic review of multimodal tasks in the insurance domain, nor a benchmark specifically designed to evaluate the capabilities of LVLMs in insurance. This gap hinders the development of LVLMs within the insurance domain. In this paper, we systematically review and distill multimodal tasks for 4 representative types of insurance: auto insurance, property insurance, health insurance, and agricultural insurance. We propose INS-MMBench, the first hierarchical LVLMs benchmark tailored for the insurance domain. INS-MMBench encompasses 22 fundamental tasks, 12 meta-tasks and 5 scenario tasks—enabling a comprehensive and progressive assessment from basic tasks to real-world insurance scenarios. Furthermore, we evaluate multiple representative LVLMs, including closed-source models such as GPT-4o and open-source models like LLaVA. Our evaluation not only validates the effectiveness of our benchmark but also provides an in-depth performance analysis of current LVLMs on various multimodal tasks in the insurance domain. We hope that INS-MMBench will facilitate the further application of LVLMs in the insurance domain …
Poster
ADEELA ISLAM · Stefano Fiorini · Stuart James · Pietro Morerio · ALESSIO DEL BUE
[ Exhibit Hall I ]
Abstract
The task of reassembly is a significant challenge across multiple domains, including archaeology, genomics, and molecular docking, requiring the precise placement and orientation of elements to reconstruct an original structure. In this work, we address key limitations in state-of-the-art Deep Learning methods for reassembly, namely i) scalability; ii) multimodality; and iii) real-world applicability: beyond square or simple geometric shapes, realistic and complex erosion, or other real-world problems. We propose ReassembleNet, a method that reduces complexity by representing each input piece as a set of contour keypoints and learning to select the most informative ones with techniques inspired by Graph Neural Network pooling. ReassembleNet effectively lowers computational complexity while enabling the integration of features from multiple modalities, including both geometric and texture data. The model is further enhanced through pretraining on a semi-synthetic dataset. We then apply diffusion-based pose estimation to recover the original structure. We improve on prior methods by 55% and 86% for RMSE Rotation and Translation, respectively.
Poster
Wenqi Wang · Reuben Tan · Pengyue Zhu · Jianwei Yang · Zhengyuan Yang · Lijuan Wang · Andrey Kolobov · Jianfeng Gao · Boqing Gong
[ Exhibit Hall I ]
Abstract
Spatial intelligence (SI) represents a cognitive ability encompassing the visualization, manipulation, and reasoning about spatial relationships, underpinning disciplines from neuroscience to robotics. We introduce SITE, a benchmark dataset towards SI Thorough Evaluation in a standardized format of multi-choice visual question-answering, designed to assess large vision-language models' spatial intelligence across diverse visual modalities (single-image, multi-image, and video) and SI factors (figural to environmental scales, spatial visualization and orientation, intrinsic and extrinsic, static and dynamic). Our approach to curating the benchmark combines a bottom-up survey of 31 existing datasets and a top-down strategy drawing upon three classification systems in cognitive science, which prompt us to design two novel types of tasks about view-taking and dynamic scenes. Extensive experiments reveal that leading models fall behind human experts, especially in spatial orientation, a fundamental SI factor. Moreover, we demonstrate a positive correlation between a model's spatial reasoning proficiency and its performance on an embodied AI task. Code and data will be publicly available.
Poster
Shuai Jin · Yuhua Qian · Feijiang Li · Guoqing Liu · Xinyan Liang
[ Exhibit Hall I ]
Abstract
Unsupervised low-light image enhancement presents the challenge of preserving both local texture details and global illumination consistency. Existing methods often rely on uniform, predefined strategies within fixed neighborhoods (e.g., fixed convolution kernels or average pooling), which are limited in their ability to adaptively capture the dynamic interdependencies between pixels during the enhancement process. As a result, these methods may lead to oversaturation or the loss of fine details. To address these issues, we introduce PASD, a novel pixel-adaptive adjustment approach inspired by swarm dynamics. PASD establishes inter-pixel cooperative constraints that adjust pixel intensities based on dynamic neighborhood interactions, thereby forming a population dynamics system for image enhancement that ensures a balance between local enhancement and global consistency. Furthermore, a distributed multi-agent reinforcement learning mechanism is employed to optimize the interactions within the dynamic system, while a multi-scale coordination framework ensures strategy consistency and stability. Experimental results demonstrate that PASD significantly outperforms existing state-of-the-art methods, providing a more flexible and efficient solution for low-light image enhancement.
Poster
Alexander Mai · Peter Hedman · George Kopanas · Dor Verbin · David Futschik · Qiangeng Xu · Falko Kuester · Jonathan Barron · Yinda Zhang
[ Exhibit Hall I ]
Abstract
We present Exact Volumetric Ellipsoid Rendering (EVER), a method for real-time 3D reconstruction. EVER accurately blends an unlimited number of overlapping primitives together in 3D space, eliminating the popping artifacts that 3D Gaussian Splatting (3DGS) and other related methods exhibit. EVER represents a radiance field as a set of constant-density volumetric ellipsoids, which are raytraced by intersecting each primitive twice (once upon ray entrance and another on ray exit) and accumulating the derivatives of the densities and colors along the ray. Because EVER is built around ray tracing, it also enables effects such as defocus blur and fish-eye camera distortion, while still achieving frame rates of ~30 FPS at 720p on an NVIDIA RTX4090. We show that our method is more accurate on the challenging large-scale scenes from the Zip-NeRF dataset, where it achieves state-of-the-art SSIM, even higher than Zip-NeRF.
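The entry/exit accumulation described in this abstract can be illustrated in one dimension. The sketch below is not the authors' implementation: it assumes each primitive has already been intersected with the ray and reduced to a hypothetical tuple `(t_in, t_out, density, color)` with a scalar color, and it uses the standard closed-form integral for a piecewise-constant medium. The function name `render_ray` is made up for illustration.

```python
import math

def render_ray(intervals):
    """Integrate overlapping constant-density intervals along one ray.

    intervals: list of (t_in, t_out, density, color), scalar color (assumed format).
    Returns (radiance, remaining_transmittance).
    """
    # Each primitive contributes two events: +density on entry, -density on exit.
    events = []
    for t_in, t_out, sigma, color in intervals:
        events.append((t_in, sigma, sigma * color))
        events.append((t_out, -sigma, -sigma * color))
    events.sort(key=lambda e: e[0])

    transmittance = 1.0
    radiance = 0.0
    sigma_sum = 0.0       # total density of the primitives currently overlapping
    weighted_color = 0.0  # sum of density * color over those primitives
    prev_t = None
    for t, d_sigma, d_weighted in events:
        if prev_t is not None and sigma_sum > 1e-12:
            # Between two events the medium is constant, so the segment
            # integral has a closed form.
            alpha = 1.0 - math.exp(-sigma_sum * (t - prev_t))
            radiance += transmittance * alpha * (weighted_color / sigma_sum)
            transmittance *= 1.0 - alpha
        sigma_sum += d_sigma
        weighted_color += d_weighted
        prev_t = t
    return radiance, transmittance
```

Because the result depends only on the sorted events and not on the order in which primitives are listed, swapping two overlapping primitives leaves the pixel value unchanged, which is the property that avoids popping in this simplified setting.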
Poster
Uranik Berisha · Jens Mehnert · Alexandru Condurache
[ Exhibit Hall I ]
Abstract
Increasingly expensive training of ever larger models such as Vision Transformers motivates reusing the vast library of already trained state-of-the-art networks. However, their latency, high computational costs and memory demands pose significant challenges for deployment, especially on resource-constrained hardware. While structured pruning methods can reduce these factors, they often require costly retraining, sometimes for up to hundreds of epochs, or even training from scratch to recover the lost accuracy resulting from the structural modifications. Maintaining the provided performance of trained models after structured pruning and thereby avoiding extensive retraining remains a challenge. To solve this, we introduce Variance-Based Pruning, a simple and structured one-shot pruning technique for efficiently compressing networks, with minimal finetuning. Our approach first gathers activation statistics, which are then used to select neurons for pruning. Simultaneously, the mean activations are integrated back into the model to preserve a high degree of performance. On ImageNet-1k recognition tasks, we demonstrate that directly after pruning DeiT-Base retains over 70% of its original performance and requires only 10 epochs of fine-tuning to regain 99% of the original accuracy while simultaneously reducing MACs by 35% and model size by 36%, thus speeding up the model by 1.44 times.
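A minimal sketch of the idea, under assumptions the abstract does not spell out: a single two-layer MLP block, hidden neurons ranked by activation variance on a small calibration set, the lowest-variance neurons removed, and their mean activation folded into the bias of the following layer. The function names, the `keep_ratio` parameter, and the choice of GELU are illustrative, not taken from the paper.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def mlp_forward(W1, b1, W2, b2, x):
    # x: (N, D), W1: (H, D), W2: (O, H)
    return gelu(x @ W1.T + b1) @ W2.T + b2

def variance_prune_mlp(W1, b1, W2, b2, calib_x, keep_ratio=0.5):
    # 1) Gather activation statistics of the hidden layer on calibration data.
    h = gelu(calib_x @ W1.T + b1)          # (N, H)
    mean, var = h.mean(axis=0), h.var(axis=0)

    # 2) Keep the highest-variance neurons (assumed selection rule).
    n_keep = max(1, int(round(keep_ratio * h.shape[1])))
    order = np.argsort(var)                # ascending variance
    pruned = np.sort(order[:-n_keep])
    keep = np.sort(order[-n_keep:])

    # 3) Fold the mean activation of the pruned neurons into the next bias,
    #    so their average contribution to the output is preserved.
    b2_new = b2 + W2[:, pruned] @ mean[pruned]
    return W1[keep], b1[keep], W2[:, keep], b2_new, keep
```

If a pruned neuron is exactly constant on the inputs, this replacement is lossless; for nearly constant neurons the output error scales with their activation variance, which is one way to motivate variance as the selection criterion.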
Poster
Zizhang Li · Hong-Xing Yu · Wei Liu · Yin Yang · Charles Herrmann · Gordon Wetzstein · Jiajun Wu
[ Exhibit Hall I ]
Abstract
WonderPlay is a novel framework integrating physics simulation with video generation for generating action-conditioned dynamic 3D scenes from a single image. Our hybrid generative simulator first uses a physics solver to simulate coarse 3D dynamics, which subsequently conditions a video generator to produce a video with finer, more realistic motion. The generated video is then used to update the simulated dynamic 3D scene, closing the loop between the physics solver and the video generator. This approach enables intuitive user control to be combined with the accurate dynamics of physics-based simulators and the expressivity of diffusion-based video generators. Experimental results demonstrate that WonderPlay enables users to interact with various scenes of diverse content, including cloth, sand, snow, liquid, smoke, elasticity, and rigid bodies -- all using a single image input. Code will be made public.
Poster
Kaixuan Jiang · Yang Liu · Weixing Chen · Jingzhou Luo · Ziliang Chen · Ling Pan · Guanbin Li · Liang Lin
[ Exhibit Hall I ]
Abstract
Embodied Question Answering (EQA) is a challenging task in embodied intelligence that requires agents to dynamically explore 3D environments, actively gather visual information, and perform multi-step reasoning to answer questions. However, current EQA approaches suffer from critical limitations in exploration efficiency, dataset design, and evaluation metrics. Moreover, existing datasets often introduce biases or prior knowledge, leading to disembodied reasoning, while frontier-based exploration strategies struggle in cluttered environments and fail to ensure fine-grained exploration of task-relevant areas. To address these challenges, we construct the EXPloration-awaRe Embodied queStion anSwering Benchmark (EXPRESS-Bench), the largest dataset designed specifically to evaluate both exploration and reasoning capabilities. EXPRESS-Bench consists of 777 exploration trajectories and 2,044 question-trajectory pairs. To improve exploration efficiency, we propose Fine-EQA, a hybrid exploration model that integrates frontier-based and goal-oriented navigation to guide agents toward task-relevant regions more effectively. Additionally, we introduce a novel evaluation metric, Exploration-Answer Consistency (EAC), which ensures faithful assessment by measuring the alignment between answer grounding and exploration reliability. Extensive experimental comparisons with state-of-the-art EQA models demonstrate the effectiveness of our EXPRESS-Bench in advancing embodied exploration and question reasoning.
Poster
Junwen Huang · Shishir Reddy Vutukur · Peter Yu · Nassir Navab · Slobodan Ilic · Benjamin Busam
[ Exhibit Hall I ]
Abstract
Typical template-based object pose pipelines first find the closest template and then align it to the current observation. The failure to find the closest template results in the wrong pose estimate. Instead, we reformulate object pose estimation with template images as a ray alignment problem where viewing directions from multiple posed template views need to mutually align with a non-posed object query. Inspired by recent advancements in denoising diffusion frameworks for camera pose estimation, we integrate this formulation into a diffusion transformer architecture capable of aligning a single query image of an object to a set of template views. Our method reparametrizes object rotation by introducing object-centered camera rays and object translation by extending Scale-Invariant Translation Estimation (SITE) to dense translation offsets. Our method leverages view priors from template images to enhance the model's ability to accurately infer query object poses. Using a coarse-to-fine training strategy with narrowed template sampling, our approach improves performance without modifying the network architecture, increasing robustness in 6D object pose estimation. Extensive evaluations on various benchmark datasets demonstrate the superiority of our method over state-of-the-art approaches in unseen object pose estimation. Our code will be made publicly available.
Poster
Mateusz Michalkiewicz · Xinyue Bai · Mahsa Baktashmotlagh · Varun Jampani · Guha Balakrishnan
[ Exhibit Hall I ]
Abstract
In this paper, we analyze the viewpoint stability of foundational models - specifically, their sensitivity to changes in viewpoint - and define instability as significant feature variations resulting from minor changes in viewing angle, leading to generalization gaps in 3D reasoning tasks. We investigate nine foundational models, focusing on their responses to viewpoint changes, including the often-overlooked accidental viewpoints where specific camera orientations obscure an object's true 3D structure. Our methodology enables recognizing and classifying accidental, stable and other viewpoints using feature representations alone, without accessing the actual images. Our findings indicate that while foundation models consistently encode accidental viewpoints, they vary in their interpretation of other viewpoints due to inherent biases, at times leading to object misclassifications based on geometric resemblance. Through quantitative and qualitative evaluations on three downstream tasks - classification, VQA, and 3D reconstruction - we illustrate the impact of viewpoint instability and underscore the importance of feature robustness across diverse viewing conditions.
Poster
Arindam Dutta · Meng Zheng · Zhongpai Gao · Benjamin Planche · Anwesa Choudhuri · Terrence Chen · Amit Roy-Chowdhury · Ziyan Wu
[ Exhibit Hall I ]
Abstract
Reconstructing clothed humans from a single image is a fundamental task in computer vision with wide-ranging applications. Although existing monocular clothed human reconstruction solutions have shown promising results, they often rely on the assumption that the human subject is in an occlusion-free environment. Thus, when encountering in-the-wild occluded images, these algorithms produce multiview inconsistent and fragmented reconstructions. Additionally, most algorithms for monocular 3D human reconstruction leverage geometric priors such as SMPL annotations for training and inference, which are extremely challenging to acquire in real-world applications. To address these limitations, we propose CHROME: Clothed Human Reconstruction with Occlusion-Resilience and Multiview-ConsistEncy from a Single Image, a novel pipeline designed to reconstruct occlusion-resilient 3D humans with multiview consistency from a single occluded image, without requiring either ground-truth geometric prior annotations or 3D supervision. Specifically, CHROME leverages a multiview diffusion model to first synthesize occlusion-free human images from the occluded input, compatible with off-the-shelf pose control to explicitly enforce cross-view consistency during synthesis. A 3D reconstruction model is then trained to predict a set of 3D Gaussians conditioned on both the occluded input and synthesized views, aligning cross-view details to produce a cohesive and accurate 3D representation. CHROME achieves significant improvements in terms of …
Poster
Yingying Zhang · Lixiang Ru · Kang Wu · Lei Yu · Lei Liang · Yansheng Li · Jingdong Chen
[ Exhibit Hall I ]
Abstract
The multi-modal remote sensing foundation model (MM-RSFM) has significantly advanced various Earth observation tasks, such as urban planning, environmental monitoring, and natural disaster management. However, most existing approaches require the training of separate backbone networks for each data modality, leading to redundancy and inefficient parameter utilization. Moreover, prevalent pre-training methods typically apply self-supervised learning (SSL) techniques from natural images without adequately accommodating the characteristics of remote sensing (RS) images, such as the complicated semantic distribution within a single RS image. In this work, we present SkySense V2, a unified MM-RSFM that employs a single transformer backbone to handle multiple modalities. This backbone is pre-trained with a novel SSL strategy tailored to the distinct traits of RS data. In particular, SkySense V2 incorporates an innovative adaptive patch merging module and learnable modality prompt tokens to address challenges related to varying resolutions and limited feature diversity across modalities. In addition, we incorporate the mixture of experts (MoE) module to further enhance the performance of the foundation model. SkySense V2 demonstrates impressive generalization abilities through an extensive evaluation involving 16 datasets over 7 tasks, outperforming SkySense by an average of 1.8 points.
Poster
Mengxue Qu · Yibo Hu · Kunyang Han · Yunchao Wei · Yao Zhao
[ Exhibit Hall I ]
Abstract
Recent advancements in Large Vision-Language Models (LVLMs) have greatly improved their ability to understand both visual and text information. However, a common problem in LVLMs is confirmation bias, where models tend to repeat previous assumptions and follow earlier viewpoints instead of reflecting and correcting themselves. This problem is more common in smaller-scale LVLMs, as they are usually fine-tuned with training data that is mostly positive, focusing on generating coherent dialogue. To address this issue, we introduce ReCoT, a method designed to mitigate confirmation bias in smaller-scale LVLMs through Reflective Self-Correction Training. The method follows a two-stage SFT-DPO paradigm: the first SFT stage aims to cultivate the model's reflective correction abilities, while the DPO stage focuses on enhancing the consistency between answers and reflections. Specifically, we construct dialogue-based reflective samples, which serve as adversarial samples during SFT. In this process, the model is initially presented with a potentially incorrect answer, followed by a reflection and correction phase to generate the final answer. To enhance answer-reflection consistency, we propose the consistency direct preference optimization. To comprehensively evaluate the effectiveness of our ReCoT, we introduce a set of novel metrics to measure the accuracy of the reflection and correction process. Extensive experiments show that …
Poster
Xingyu Chen · Yue Chen · Yuliang Xiu · Andreas Geiger · Anpei Chen
[ Exhibit Hall I ]
Abstract
Recent advances in DUSt3R have enabled robust estimation of dense point clouds and camera parameters of static scenes, leveraging Transformer network architectures and direct supervision on large-scale 3D datasets. In contrast, the limited scale and diversity of available 4D datasets present a major bottleneck for training a highly generalizable 4D model. This constraint has driven conventional 4D methods to fine-tune 3D models on scalable dynamic video data with additional geometric priors such as optical flow and depths. In this work, we take an opposite path and introduce Easi3R, a simple yet efficient training-free method for 4D reconstruction. Our approach applies attention adaptation during inference, eliminating the need for from-scratch pre-training or network fine-tuning. We find that the attention layers in DUSt3R inherently encode rich information about camera and object motion. By carefully disentangling these attention maps, we achieve accurate dynamic region segmentation, camera pose estimation, and 4D dense point map reconstruction. Extensive experiments on real-world dynamic videos demonstrate that our lightweight attention adaptation significantly outperforms previous state-of-the-art methods that are trained or fine-tuned on extensive dynamic datasets.
Poster
Ciyu Ruan · Ruishan Guo · Zihang GONG · Jingao Xu · Wenhan Yang · Xinlei Chen
[ Exhibit Hall I ]
Abstract
Event cameras excel in high temporal resolution and dynamic range but suffer from dense noise in rainy conditions. Existing event deraining methods face trade-offs between temporal precision, deraining effectiveness, and computational efficiency. In this paper, we propose PRE-Mamba, a novel point-based event camera deraining framework that fully exploits the spatiotemporal characteristics of raw events and rain. Our framework introduces a 4D event cloud representation that integrates dual temporal scales to preserve high temporal precision, a Spatio-Temporal Decoupling and Fusion module (STDF) that enhances deraining capability by enabling shallow decoupling and interaction of temporal and spatial information, and a Multi-Scale State Space Model (MS3M) that captures deeper rain dynamics across dual-temporal and multi-spatial scales with linear computational complexity. Enhanced by frequency-domain regularization, PRE-Mamba achieves superior performance (0.95 SR, 0.91 NR, and 0.4s/M events) with only 0.26M parameters on EventRain-27K, a comprehensive dataset with labeled synthetic and real-world sequences. Moreover, our method generalizes well across varying rain intensities, viewpoints, and even snowy conditions. Code and dataset will be publicly available upon acceptance.
Poster
Tianhao Wu · Chuanxia Zheng · Frank Guan · Andrea Vedaldi · Tat-Jen Cham
[ Exhibit Hall I ]
Abstract
Most image-based 3D object reconstructors assume that objects are fully visible, ignoring occlusions that commonly occur in real-world scenarios. In this paper, we introduce Amodal3R, a conditional 3D generative model designed to reconstruct 3D objects from partial observations. We start from a "foundation" 3D generative model and extend it to recover plausible 3D geometry and appearance from occluded objects. We introduce a mask-weighted multi-head cross-attention mechanism followed by an occlusion-aware attention layer that explicitly leverages occlusion priors to guide the reconstruction process. We demonstrate that, by training solely on synthetic data, Amodal3R learns to recover full 3D objects even in the presence of occlusions in real scenes. It substantially outperforms existing methods that independently perform 2D amodal completion followed by 3D reconstruction, thereby establishing a new benchmark for occlusion-aware 3D reconstruction.
Poster
Sixiang Chen · Tian Ye · Yunlong Lin · Yeying Jin · Yijun Yang · Haoyu Chen · Jianyu Lai · Song Fei · Zhaohu Xing · Fugee Tsung · Lei Zhu
[ Exhibit Hall I ]
Abstract
Real-world image dehazing is crucial for enhancing visual quality in computer vision applications. However, existing physics-based haze generation paradigms struggle to model the complexities of real-world haze and lack controllability, limiting the performance of existing baselines on real-world images. In this paper, we introduce GenHaze, a pioneering haze generation framework that enables the one-step generation of high-quality, reference-controllable hazy images. GenHaze leverages the pre-trained latent diffusion model (LDM) with a carefully designed clean-to-haze generation protocol to produce realistic hazy images. Additionally, by leveraging its fast, controllable generation of paired high-quality hazy images, we illustrate that existing dehazing baselines can be unleashed in a simple and efficient manner. Extensive experiments indicate that GenHaze achieves visually convincing and quantitatively superior hazy images. It also significantly improves multiple existing dehazing models across 7 non-reference metrics with minimal fine-tuning epochs. Our work demonstrates that LDM possesses the potential to generate realistic degradations, providing an effective alternative to prior generation pipelines.
Poster
Junwei Luo · Yingying Zhang · Xue Yang · Kang Wu · Qi Zhu · Lei Liang · Jingdong Chen · Yansheng Li
[ Exhibit Hall I ]
Abstract
Efficient vision-language understanding of large Remote Sensing Images (RSIs) is meaningful but challenging. Current Large Vision-Language Models (LVLMs) typically employ limited pre-defined grids to process images, leading to information loss when handling gigapixel RSIs. Conversely, using unlimited grids significantly increases computational costs. To preserve image details while reducing computational complexity, we propose a text-guided token pruning method with Dynamic Image Pyramid (DIP) integration. Our method introduces: (i) a Region Focus Module (RFM) that leverages text-aware region localization capability to identify critical vision tokens, and (ii) a coarse-to-fine image tile selection and vision token pruning strategy based on DIP, which is guided by RFM outputs and avoids directly processing the entire large imagery. Additionally, existing benchmarks for evaluating LVLMs' perception ability on large RSI suffer from limited question diversity and constrained image sizes. We construct a new benchmark named LRS-VQA, which contains 7,333 QA pairs across 8 categories, with image length up to 27,328 pixels. Our method outperforms existing high-resolution strategies on four datasets using the same data. Moreover, compared to existing token reduction methods, our approach demonstrates higher efficiency under high-resolution settings. The code and dataset will be made publicly available.
Poster
Florin-Alexandru Vasluianu · Tim Seizinger · Zongwei Wu · Radu Timofte
[ Exhibit Hall I ]
Abstract
Illumination in practical scenarios is inherently complex, involving colored light sources, occlusions, and diverse material interactions that produce intricate reflectance and shading effects. However, existing methods often oversimplify this challenge by assuming a single light source or uniform, white-balanced lighting, leaving many of these complexities unaddressed. In this paper, we introduce CL3AN, the first large-scale, high-resolution dataset of its kind designed to facilitate the restoration of images captured under multiple Colored Light sources to their Ambient-Normalized counterparts. Through benchmarking, we find that leading approaches often produce artifacts (such as illumination inconsistencies, texture leakage, and color distortion), primarily due to their limited ability to precisely disentangle illumination from reflectance. Motivated by this insight, we achieve such a desired decomposition through a novel learning framework that leverages explicit chromaticity-luminance components guidance, drawing inspiration from the principles of the Retinex model. Extensive evaluations on existing benchmarks and our dataset demonstrate the effectiveness of our approach, showcasing enhanced robustness under non-homogeneous color lighting and material-specific reflectance variations, all while maintaining a highly competitive computational cost. Our code and dataset will be made public upon acceptance.
Poster
Ruiyang Zhang · Hu Zhang · Zhedong Zheng
[ Exhibit Hall I ]
Abstract
Unsupervised 3D object detection aims to identify objects of interest from unlabeled raw data, such as LiDAR points. Recent approaches usually adopt pseudo 3D bounding boxes (3D bboxes) from clustering algorithms to initialize model training. However, pseudo bboxes inevitably contain noise, and such inaccuracies propagate to the final model, compromising its performance. Therefore, in an attempt to mitigate the negative impact of inaccurate pseudo bboxes, we introduce a new uncertainty-aware framework for unsupervised 3D object detection, dubbed UA3D. In particular, our method consists of two phases: uncertainty estimation and uncertainty regularization. (1) In the uncertainty estimation phase, we incorporate an extra auxiliary detection branch alongside the original primary detector. The prediction disparity between the primary and auxiliary detectors reflects fine-grained uncertainty at the box coordinate level. (2) Based on the assessed uncertainty, we adaptively adjust the weight of every 3D bbox coordinate via uncertainty regularization, refining the training process on pseudo bboxes. For pseudo bbox coordinates with high uncertainty, we assign a relatively low loss weight. Extensive experiments verify that the proposed method is robust against noisy pseudo bboxes, yielding substantial improvements on nuScenes and Lyft compared to existing approaches, with increases of +3.9\% AP$_{BEV}$ and +1.5\% …
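The two phases described in this abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so everything here is an assumption made for illustration: uncertainty is taken as the absolute disparity between the primary and auxiliary predictions for each box coordinate, the loss weight is `exp(-u)`, and the base loss is L1. The function names and box values are hypothetical.

```python
import math

def coordinate_uncertainty(primary, auxiliary):
    """Per-coordinate uncertainty: absolute disparity between two detectors' predictions."""
    return [abs(p - a) for p, a in zip(primary, auxiliary)]

def uncertainty_weighted_l1(pred, pseudo_box, uncertainty):
    """L1 loss against the pseudo box, down-weighting uncertain coordinates.

    The weight exp(-u) is an illustrative choice (an assumption, not the
    paper's formula): u = 0 gives weight 1, larger u gives a smaller weight.
    """
    weights = [math.exp(-u) for u in uncertainty]
    terms = [w * abs(p - t) for w, p, t in zip(weights, pred, pseudo_box)]
    return sum(terms) / len(terms)

# Hypothetical 7-DoF boxes: (x, y, z, length, width, height, yaw)
primary = [10.0, 2.0, 0.5, 4.2, 1.8, 1.6, 0.10]
auxiliary = [10.1, 2.0, 0.5, 5.4, 1.8, 1.6, 0.12]
pseudo = [10.0, 2.1, 0.5, 4.0, 1.9, 1.5, 0.10]

u = coordinate_uncertainty(primary, auxiliary)
loss = uncertainty_weighted_l1(primary, pseudo, u)
```

In this toy case the two detectors disagree most on the box length, so that coordinate contributes least to the loss, which is the coordinate-level behavior the abstract describes.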
Poster
Phillip Y. Lee · Jihyeon Je · Chanho Park · Mikaela Uy · Leonidas Guibas · Minhyuk Sung
[ Exhibit Hall I ]
Abstract
We present a framework for perspective-aware reasoning in vision-language models (VLMs) through mental imagery simulation. Perspective-taking - the ability to perceive an environment or situation from an alternative viewpoint - is a key benchmark for human-level visual understanding, essential for environmental interaction and collaboration with autonomous agents. Despite advancements in spatial reasoning within VLMs, recent research has shown that modern VLMs significantly lack perspective-aware reasoning capabilities and exhibit a strong bias toward egocentric interpretations. To bridge the gap between VLMs and human perception, we focus on the role of mental imagery, where humans perceive the world through abstracted representations that facilitate perspective shifts. Motivated by this, we propose a framework for perspective-aware reasoning that effectively leverages vision foundation models, such as object detection, segmentation, and orientation estimation, to construct scene abstractions and enable perspective transformations. Our experiments on synthetic and real-image benchmarks, tested across various VLMs, demonstrate consistent improvements in perspective-aware reasoning with our framework, outperforming fine-tuned spatial reasoning models and novel-view-synthesis-based approaches.
Poster
Jianzhe Gao · Rui Liu · Wenguan Wang
[ Exhibit Hall I ]
Abstract
The vision-language navigation (VLN) task requires an agent to traverse complex 3D environments based on natural language instructions, necessitating a thorough scene understanding. While existing works provide agents with various scene maps to enhance their spatial awareness, integrating 3D geometric priors and semantics into a unified map remains challenging. Moreover, these methods often neglect to account for the complex spatial relationships and the open nature of VLN scenarios in their map design, which limits their ability to generalize across diverse and unseen environments. To address these challenges, this work proposes a 3D Gaussian Map that represents the environment as a set of differentiable 3D Gaussians and accordingly develops a navigation strategy for VLN. Specifically, an Egocentric Gaussian Map is constructed online by initializing 3D Gaussians from sparse pseudo-lidar point clouds, providing informative geometric priors to boost spatial awareness. Each Gaussian primitive is further enriched through an Open-Set Semantic Grouping operation, which groups 3D Gaussians based on their membership in object instances or stuff categories within the open world. These processes result in a unified 3D Gaussian Map that integrates geometric priors with open-set semantics. Building on this map, a Multi-Level Action Prediction strategy, which combines spatial-semantic cues at multiple granularities, is designed to assist …
Poster
Guanxing Lu · Baoxiong Jia · Puhao Li · Yixin Chen · Ziwei Wang · Yansong Tang · Siyuan Huang
[ Exhibit Hall I ]
Abstract
Training robot policies within a learned world model is trending due to the inefficiency of real-world interactions. The established image-based world models and policies have shown prior success, but lack the robust geometric information needed for a consistent spatial and physical understanding of the three-dimensional world, even when pre-trained on internet-scale video sources. To this end, we propose a novel branch of world model named **Gaussian World Model (GWM)** for robotic manipulation, which reconstructs the future state by inferring the propagation of Gaussian primitives under the effect of robot actions. At its core is a latent Diffusion Transformer (DiT) combined with a 3D variational autoencoder, enabling fine-grained scene-level future state reconstruction with Gaussian Splatting. GWM can not only enhance the visual representation for imitation learning agents by self-supervised future prediction training, but can also serve as a neural simulator that supports model-based reinforcement learning. Both simulated and real-world experiments show that GWM can precisely predict future scenes conditioned on diverse robot actions, and can be further utilized to train policies that outperform the state-of-the-art by impressive margins, showcasing the initial data scaling potential of 3D world models.
Poster
Haochen Wang · Yucheng Zhao · Tiancai Wang · Haoqiang Fan · Xiangyu Zhang · Zhaoxiang Zhang
[ Exhibit Hall I ]
Abstract
The rapid development of Large Multimodal Models (LMMs) for 2D images and videos has spurred efforts to adapt these models for interpreting 3D scenes. However, the absence of large-scale 3D vision-language datasets has posed a significant obstacle. To address this issue, typical approaches focus on injecting 3D awareness into 2D LMMs by designing 3D input-level scene representations. This work provides a new perspective. We introduce reconstructive visual instruction tuning with 3D-awareness (ROSS3D), which integrates 3D-aware visual supervision into the training procedure. Specifically, it incorporates cross-view and global-view reconstruction. The former requires reconstructing masked views by aggregating overlapping information from other views. The latter aims to aggregate information from all available views to recover Bird’s-Eye-View images, contributing to a comprehensive overview of the entire scene. Empirically, ROSS3D achieves state-of-the-art performance across various 3D scene understanding benchmarks. More importantly, our semi-supervised experiments demonstrate significant potential in leveraging large amounts of unlabeled 3D vision-only data. The code will be made publicly available upon acceptance.
Poster
Yiming Zuo · Willow Yang · Zeyu Ma · Jia Deng
[ Exhibit Hall I ]
Abstract
Depth completion (DC) aims to predict a dense depth map from an RGB image and a sparse depth map. Existing DC methods generalize poorly to new datasets or unseen sparse depth patterns, limiting their real-world applications. We propose OMNI-DC, a highly robust DC model that generalizes well zero-shot to various datasets. The key design is a novel Multi-Resolution Depth Integrator, allowing our model to deal with very sparse depth inputs. We also introduce a novel Laplacian loss to model the ambiguity in the training process. Moreover, we train OMNI-DC on a mixture of high-quality datasets with a scale normalization technique and synthetic depth patterns. Extensive experiments on 7 datasets show consistent improvements over baselines, reducing errors by as much as 43%. Code and checkpoints will be made public.
Poster
Yichen Shen · Yijin Li · Shuo Chen · Guanglin Li · Zhaoyang Huang · Hujun Bao · Zhaopeng Cui · Guofeng Zhang
[ Exhibit Hall I ]
Abstract
Event cameras, known for their high temporal resolution and ability to capture asynchronous changes, have gained significant attention for their potential in feature tracking, especially in challenging conditions. However, event cameras lack the fine-grained texture information that conventional cameras provide, leading to error accumulation in tracking. To address this, we propose a novel framework, BlinkTrack, which integrates event data with grayscale images for high-frequency feature tracking. Our method extends the traditional Kalman filter into a learning-based framework, utilizing differentiable Kalman filters in both event and image branches. This approach improves single-modality tracking and effectively solves the data association and fusion from asynchronous event and image data. We also introduce new synthetic and augmented datasets to better evaluate our model. Experimental results indicate that BlinkTrack significantly outperforms existing methods, exceeding 80 FPS with multi-modality data and 100 FPS with preprocessed event data.
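For readers unfamiliar with the filter this abstract builds on, the following is a minimal sketch of the classical scalar Kalman predict/update cycle, fusing measurements from two sources with different noise levels. It is the textbook filter, not BlinkTrack's learned, differentiable variant; the noise variances, the measurement values, and the constant-position motion model are assumptions chosen for illustration.

```python
def kalman_predict(x, p, q):
    """Constant-position prediction: state unchanged, variance grows by process noise q."""
    return x, p + q

def kalman_update(x, p, z, r):
    """Fuse measurement z (variance r) into state estimate x (variance p)."""
    k = p / (p + r)          # Kalman gain
    x_new = x + k * (z - x)  # move toward the measurement
    p_new = (1.0 - k) * p    # variance shrinks after the update
    return x_new, p_new

# Hypothetical 1D feature position tracked from two asynchronous sources.
x, p = 0.0, 1.0
measurements = [
    ("event", 0.9, 0.5),   # (source, measurement, noise variance): assumed noisier
    ("event", 1.1, 0.5),
    ("image", 1.0, 0.1),   # assumed more precise
]
for source, z, r in measurements:
    x, p = kalman_predict(x, p, q=0.01)
    x, p = kalman_update(x, p, z, r)
```

Because the gain depends on the ratio of the state variance to the measurement variance, the lower-noise measurement pulls the estimate more strongly. A learned variant could, for example, predict those variances from data instead of fixing them by hand.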
Poster
Regine Hartwig · Dominik Muhle · Riccardo Marin · Daniel Cremers
[ Exhibit Hall I ]
Abstract
Recent advancements in feature computation have revealed that self-supervised feature extractors can recognize semantic correspondences. However, these features often lack an understanding of objects' underlying 3D geometry. In this paper, we focus on learning features capable of semantically characterizing parts distinguished by their geometric properties, e.g., left/right eyes or front/back legs. We propose GECO, a novel, optimal-transport-based learning method that obtains geometrically coherent features that characterize symmetric points well. GECO uses a lightweight model architecture that results in fast inference, capable of processing images at 30fps. Our method is interpretable and generalizes across datasets, achieving state-of-the-art performance on the PFPascal, APK, and CUB datasets, improving by 6.0%, 6.2%, and 4.1%, respectively. We achieve a speed-up of 98.2% compared to previous methods by using a smaller backbone and a more efficient training scheme. Finally, we find PCK insufficient to analyze the geometrical properties of the features. Hence, we expand our analysis, proposing novel metrics and insights that will be instrumental in developing more geometrically-aware methods.
Poster
Quankai Gao · Iliyan Georgiev · Tuanfeng Wang · Krishna Kumar Singh · Ulrich Neumann · Jae Shin Yoon
[ Exhibit Hall I ]
Abstract
3D generation has made significant progress; however, it still largely remains at the object level. Feedforward 3D scene-level generation has been rarely explored due to the lack of models capable of scaling up latent representation learning on 3D scene-level data. Unlike object-level generative models, which are trained on well-labeled 3D data in a bounded canonical space, scene-level generations with 3D scenes represented by 3D Gaussian Splatting (3DGS) are unbounded and exhibit scale inconsistency across different scenes, making unified latent representation learning for generative purposes extremely challenging. In this paper, we introduce Can3Tok, the first 3D scene-level variational autoencoder (VAE) capable of encoding a large number of Gaussian primitives into a low-dimensional latent embedding, which effectively captures both semantic and spatial information of the inputs. Beyond model design, we propose a general pipeline for 3D scene data processing to address the scale inconsistency issue. We validate our method on the recent scene-level 3D dataset DL3DV-10K, where we found that only Can3Tok successfully generalizes to novel 3D scenes, while compared methods fail to converge on even a few hundred scene inputs during training and exhibit zero generalization ability during inference. Finally, we demonstrate image-to-3DGS and text-to-3DGS generation as our applications to demonstrate its ability to …
Poster
Pradyumn Goyal · Dmitrii Petrov · Sheldon Andrews · Yizhak Ben-Shabat · Hsueh-Ti Derek Liu · Evangelos Kalogerakis
[ Exhibit Hall I ]
Abstract
We present GEOPARD, a transformer-based architecture for predicting articulation from a single static snapshot of a 3D shape. The key idea of our method is a pretraining strategy that allows our transformer to learn plausible candidate articulations for 3D shapes based on a geometric-driven search without manual articulation annotation. The search automatically discovers physically valid part motions that do not cause detachments or collisions with other shape parts. Our experiments indicate that this geometric pretraining strategy, along with carefully designed choices in our transformer architecture, yields state-of-the-art results in articulation inference in the popular shape Part-Mobility dataset.
Poster
Zengyu Wan · Wei Zhai · Yang Cao · Zheng-Jun Zha
[ Exhibit Hall I ]
Abstract
Visual 3D motion estimation aims to infer the motion of 2D pixels in 3D space based on visual cues. The key challenge arises from depth-variation-induced spatio-temporal motion inconsistencies, which disrupt the assumptions of local spatial or temporal motion smoothness in previous motion estimation frameworks. In contrast, event cameras offer new possibilities for 3D motion estimation through continuous adaptive pixel-level responses to scene changes. This paper presents EMoTive, a novel event-based framework that models spatio-temporal trajectories via event-guided non-uniform parametric curves, effectively characterizing locally heterogeneous spatio-temporal motion. Specifically, we first introduce Event Kymograph, an event projection method that leverages a continuous temporal projection kernel and decouples spatial observations to encode fine-grained temporal evolution explicitly. For motion representation, we introduce a density-aware adaptation mechanism to fuse spatial and temporal features under event guidance, coupled with a non-uniform rational curve parameterization framework to adaptively model heterogeneous trajectories. The final 3D motion estimation is achieved through multi-temporal sampling of parametric trajectories, yielding optical flow and depth motion fields. To facilitate evaluation, we introduce CarlaEvent3D, a multi-dynamic synthetic dataset for comprehensive validation. Extensive experiments on both this dataset and a real-world benchmark demonstrate the effectiveness of the proposed method.
Poster
Philipp Wulff · Felix Wimbauer · Dominik Muhle · Daniel Cremers
[ Exhibit Hall I ]
Abstract
Volumetric scene reconstruction from a single image is crucial for a broad range of applications like autonomous driving and robotics. Recent volumetric reconstruction methods achieve impressive results, but generally require expensive 3D ground truth or multi-view supervision. We propose to leverage pre-trained 2D diffusion models and depth prediction models to generate synthetic scene geometry from a single image. This can then be used to distill a feed-forward scene reconstruction model. Our experiments on the challenging KITTI-360 and Waymo datasets demonstrate that our method matches or outperforms state-of-the-art baselines that use multi-view supervision, and offers unique advantages, for example regarding dynamic scenes.
Poster
Chenhang Ying · Huiyu Yang · Jieyi Ge · Zhaodong Sun · Xu Cheng · Kui Ren · Xiaobai Li
[ Exhibit Hall I ]
Abstract
Remote physiological measurement using visible light cameras has emerged as a powerful tool for non-contact health monitoring, yet its reliability degrades under challenging conditions such as low-light environments or diverse skin tones. These limitations have motivated the exploration of alternative sensing modalities, such as near-infrared sensors and radar systems, which offer complementary physiological information due to their distinct sensing principles. However, existing methods fail to holistically integrate these heterogeneous data. Our key insight is that while visible light, near-infrared, and radar operate on distinct physical principles, they all capture temporally dynamic physiological signatures that can be represented as time-varying signals reflecting underlying physiological processes. Based on this insight, we propose FusionPhys, a novel framework that implements an adaptive integration mechanism to refine physiological information across complementary modalities. We further introduce a sub-modality embedding technique that extends fusion principles to single-modality videos. Extensive experiments across five benchmark datasets demonstrate that FusionPhys achieves competitive performance in diverse sensing configurations, representing a significant advancement toward more reliable and versatile remote physiological measurement systems.
Poster
Yuanhong Yu · Xingyi He · Chen Zhao · Junhao Yu · Jiaqi Yang · Ruizhen Hu · Yujun Shen · Xing Zhu · Xiaowei Zhou · Sida Peng
[ Exhibit Hall I ]
Abstract
This paper presents a generalizable RGB-based approach for object pose estimation, specifically designed to address challenges in sparse-view settings. While existing methods can estimate the poses of unseen objects, their generalization ability remains limited in scenarios involving occlusions and sparse reference views, restricting their real-world applicability. To overcome these limitations, we introduce corner points of the object bounding box as an intermediate representation of the object pose. The 3D object corners can be reliably recovered from sparse input views, while the 2D corner points in the target view are estimated through a novel reference-based point synthesizer, which works well even in scenarios involving occlusions. As object semantic points, object corners naturally establish 2D-3D correspondences for object pose estimation with a PnP algorithm. Extensive experiments on the YCB-Video and Occluded-LINEMOD datasets show that our approach outperforms state-of-the-art methods, highlighting the effectiveness of the proposed representation and significantly enhancing the generalization capabilities of object pose estimation, which is crucial for real-world applications. The code will be released for reproducibility.
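The corner-point representation in this abstract can be made concrete with a small sketch: enumerate the eight corners of a 3D bounding box, project them with a pinhole camera, and pair each 2D point with its 3D corner. This is only an illustration under stated assumptions; the box is axis-aligned, its corners are expressed directly in the camera frame, the intrinsics are made up, and the 2D points come from exact projection rather than from the paper's learned point synthesizer. All function names are hypothetical.

```python
from itertools import product

def box_corners(center, size):
    """Return the 8 corners of an axis-aligned 3D box given its center and (w, h, d) size."""
    cx, cy, cz = center
    w, h, d = size
    return [
        (cx + sx * w / 2, cy + sy * h / 2, cz + sz * d / 2)
        for sx, sy, sz in product((-1, 1), repeat=3)
    ]

def project(point, focal, principal):
    """Pinhole projection of a camera-frame 3D point to pixel coordinates."""
    x, y, z = point
    return (focal * x / z + principal[0], focal * y / z + principal[1])

# Hypothetical object 2 m in front of the camera; intrinsics are made up.
corners_3d = box_corners(center=(0.0, 0.0, 2.0), size=(0.2, 0.1, 0.4))
corners_2d = [project(c, focal=500.0, principal=(320.0, 240.0)) for c in corners_3d]
correspondences = list(zip(corners_2d, corners_3d))  # 2D-3D pairs for a PnP solver
```

Eight such 2D-3D pairs are more than the minimum a PnP solver needs, so a pose can be estimated from them with an off-the-shelf routine such as OpenCV's `cv2.solvePnP`.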
Poster
Shuo LIANG · Yiwu Zhong · Zi-Yuan Hu · Yeyao Tao · Liwei Wang
[ Exhibit Hall I ]
Abstract
Spatiotemporal video grounding aims to localize target entities in videos based on textual queries, yet existing studies predominantly focus on exocentric videos. In comparison, egocentric video grounding remains underexplored despite its wide applications such as augmented reality and robotics. In this work, we conduct a systematic analysis of the discrepancies between egocentric and exocentric videos, revealing key challenges such as shorter object durations, sparser trajectories, smaller object sizes, and larger positional shifts. Further, we introduce EgoMask, the first pixel-level benchmark for fine-grained spatiotemporal grounding in egocentric videos. It is constructed by our proposed automatic annotation pipeline which annotates referring expressions and object masks across short-, mid-, and long-term videos. Additionally, we create EgoMask-Train, a large-scale training dataset to facilitate model development. Experiments demonstrate that the state-of-the-art spatiotemporal grounding models perform poorly on our benchmark EgoMask, but fine-tuning on EgoMask-Train yields significant improvements, while preserving performance on exocentric datasets. Our work thus provides essential resources and insights for advancing egocentric video understanding.
Poster
Shunya Nagashima · Komei Sugiura
[ Exhibit Hall I ]
Abstract
Accurate, reliable solar flare prediction is crucial for mitigating potential disruptions to critical infrastructure, yet predicting solar flares remains a significant challenge. Existing methods based on heuristic physical features often lack representation learning from solar images. On the other hand, end-to-end learning approaches struggle to model long-range temporal dependencies in solar images. In this study, we propose Deep Space Weather Model (Deep SWM), which is based on multiple deep state space models for handling both ten-channel solar images and long-range spatio-temporal dependencies. Deep SWM also features a sparse masked autoencoder, a novel pretraining strategy that employs a two-phase masking approach to preserve crucial regions such as sunspots while compressing spatial information. Furthermore, we built FlareBench, a new public benchmark for solar flare prediction covering a full 11-year solar activity cycle, to validate our method. Our method outperformed baseline methods, and even human experts, on standard metrics of performance and reliability. The project page can be found at https://iccv25-6qrol.kinsta.page.
Poster
Youngho Kim · Hoonhee Cho · Kuk-Jin Yoon
[ Exhibit Hall I ]
Abstract
Human pose estimation is critical for applications such as rehabilitation, sports analytics, and AR/VR systems. However, rapid motion and low-light conditions often introduce motion blur, significantly degrading pose estimation due to the domain gap between sharp and blurred images. Most datasets assume stable conditions, making models trained on sharp images struggle in blurred environments. To address this, we introduce a novel domain adaptation approach that leverages event cameras, which capture high temporal resolution motion data and are inherently robust to motion blur. Using event-based augmentation, we generate motion-aware blurred images, effectively bridging the domain gap between sharp and blurred domains without requiring paired annotations. Additionally, we develop a student-teacher framework that iteratively refines pseudo-labels, leveraging mutual uncertainty masking to eliminate incorrect labels and enable more effective learning. Experimental results demonstrate that our approach outperforms conventional domain-adaptive human pose estimation methods, achieving robust pose estimation under motion blur without requiring annotations in the target domain. Our findings highlight the potential of event cameras as a scalable and effective solution for domain adaptation in real-world motion blur environments. We will make our code publicly available.
Poster
James Amato · Yunan Xie · Leonel Medina-Varela · Ammar Aljerwi · Adam McCutcheon · T. Rippentrop · Kristian Gonzalez · Jacques Delabrouille · Mustapha Ishak · Nicholas Ruozzi
[ Exhibit Hall I ]
Abstract
The Cosmic Microwave Background (CMB) radiation is a pillar of modern cosmology. This GHz-range signal gives rise to better understanding of the fundamental parameters of the universe, but requires sophisticated signal separation. While the astrophysics community has developed computational methods, the adoption of computer-vision methods for these tasks has been proposed by several groups. Results are difficult to compare, as the underlying datasets and evaluations are inconsistent and have not been made publicly available. We propose CMB-ML, a dataset and library that integrates dataset creation, model inference, and result evaluation into a pipeline. The library and links for data are available on GitHub at https://github.com/iccv-author-5412/cmb-ml.
Poster
Yue Li · Meng Tian · Zhenyu Lin · Jiangtong Zhu · Dechang Zhu · Haiqiang Liu · Yueyi Zhang · Zhiwei Xiong · Xinhai Zhao
[ Exhibit Hall I ]
Abstract
Existing benchmarks for Vision-Language Models (VLMs) on autonomous driving (AD) primarily assess interpretability through open-form visual question answering (QA) within coarse-grained tasks, which remain insufficient to assess capabilities in complex driving scenarios. To this end, we introduce $\textbf{VLADBench}$, a challenging and fine-grained dataset featuring closed-form QAs that progress from static foundational knowledge and elements to advanced reasoning for dynamic on-road situations. The elaborate $\textbf{VLADBench}$ spans 5 key domains: Traffic Knowledge Understanding, General Element Recognition, Traffic Graph Generation, Target Attribute Comprehension, and Ego Decision-Making and Planning. These domains are further broken down into 11 secondary aspects and 29 tertiary tasks for a granular evaluation. A thorough assessment of general and domain-specific (DS) VLMs on this benchmark reveals both their strengths and critical limitations in AD contexts. To further exploit the cognitive and reasoning interactions among the 5 domains for AD understanding, we start from a small-scale VLM and train the DS models on individual domain datasets (collected from 1.4M DS QAs across public sources). The experimental results demonstrate that the proposed benchmark provides a crucial step toward a more comprehensive assessment of VLMs in AD, paving the way for the development of more cognitively sophisticated and reasoning-capable AD systems.
Poster
Chanhwi Jeong · Inhwan Bae · Jin-Hwi Park · Hae-Gon Jeon
[ Exhibit Hall I ]
Abstract
Zero-shot depth completion with metric scales poses significant challenges, primarily due to performance limitations such as domain specificity and sensor characteristics. One recent emerging solution is to integrate monocular depth foundation models into depth completion frameworks, yet these efforts still face issues with suboptimal performance and often require further adaptation to the target task. Surprisingly, we find that simple test-time training, which fine-tunes monocular depth foundation models directly on sparse depth measurements from sensors, yields reasonable results. However, this test-time training incurs high computational costs and introduces biases towards specific conditions, making it impractical for real-world scenarios. In this paper, we introduce a new approach toward parameter-efficient zero-shot depth completion. Our key idea is to leverage visual prompt tuning, achieving sensor-specific depth scale adaptation without forgetting foundational knowledge. Experimental results on diverse datasets demonstrate that our approach outperforms relevant state-of-the-art methods, showing superior generalization and efficiency. Our source code is available in the supplementary materials.
Poster
Liuyi Wang · Xinyuan Xia · Hui Zhao · Hanqing Wang · Tai Wang · Yilun Chen · Chengju Liu · Qijun Chen · Jiangmiao Pang
[ Exhibit Hall I ]
Abstract
Recent Vision-and-Language Navigation (VLN) advancements are promising, but their idealized assumptions about robot movement and control fail to reflect physically embodied deployment challenges. To bridge this gap, we introduce VLN-PE, a physically realistic VLN platform supporting humanoid, quadruped, and wheeled robots. For the first time, we systematically evaluate several ego-centric VLN methods in physical robotic settings across different technical pipelines, including classification models for single-step discrete action prediction, a diffusion model for dense waypoint prediction, and a train-free, map-based large language model (LLM) integrated with path planning. Our results reveal significant performance degradation due to limited robot observation space, environmental lighting variations, and physical challenges like collisions and falls. This also exposes locomotion constraints for legged robots in complex environments. VLN-PE is highly extensible, allowing seamless integration of new scenes beyond MP3D, thereby enabling more comprehensive VLN evaluation. Despite the weak generalization of current models in physical deployment, VLN-PE provides a new pathway for improving overall cross-embodiment adaptability. We hope our findings and tools inspire the community to rethink VLN limitations and advance robust, practical VLN models. Our code will be publicly released.
Poster
Matan Kichler · Shai Bagon · Mark Sheinin
[ Exhibit Hall I ]
Abstract
Computer vision seeks to infer a wide range of information about scene objects and events. However, vision systems based on conventional imaging are limited to extracting information only from the visible surfaces of scene objects. For instance, a vision system can detect and identify a Coke can in the scene but cannot determine whether it is full or empty. In this paper, we seek to extend the scope of computer vision to include the novel task of inferring the hidden liquid levels of opaque containers by sensing the tiny vibrations on their surfaces. First, we propose a novel speckle-based vibration sensing system for capturing scene vibrations on a 2D grid of points, at once. We use our system to efficiently and remotely capture a dataset of vibration responses for a plurality of everyday liquid containers. Then, we develop a transformer-based approach for analyzing the captured vibrations and classifying the container type and its hidden liquid level at measurement time. Our architecture is invariant to the vibration source, yielding correct liquid level estimates for controlled and ambient scene sound sources. Moreover, we show that the model can generalize to unseen container instances and fluid levels. We demonstrate our method by recovering …
Poster
Ren-Jie Lu · Yu Zhou · Hao Cheng · Jingke Meng · Wei-Shi Zheng
[ Exhibit Hall I ]
Abstract
Vision and Language Navigation (VLN) requires agents to navigate 3D environments by following natural language instructions. While existing methods predominantly assume access to panoramic observations, many practical robots are equipped with monocular RGBD cameras, creating a significant configuration disparity. In this work, we address this critical gap by developing a novel 3DGS-based framework for monocular VLN agents, focusing on the intrinsic information incompleteness challenge. Our approach incorporates two key innovations: (1) an implicit partial completion module for inferring representations of missing regions in incompletely rendered panoramic feature maps, and (2) an uncertainty-aware active perception strategy that enables the agent to actively acquire visual observations when uncertain about its decision. Extensive experiments on the R2R-CE and RxR-CE datasets demonstrate that our monoVLN outperforms all existing monocular methods, improving the success rate on R2R-CE by 8\% over previous monocular methods. We also validate our monoVLN in real-world environments, providing a practical solution for real-world VLN. Furthermore, our findings challenge the conventional wisdom regarding panoramic observations, suggesting they may not be the optimal configuration and providing insights for future research directions in the VLN literature. Code will be released.
Poster
Seunggeun Chi · Pin-Hao Huang · Enna Sachdeva · Kwonjoon Lee
[ Exhibit Hall I ]
Abstract
Amodal completion, the task of inferring the complete appearance of objects despite partial occlusions, is crucial for understanding complex human–object interactions (HOI) in computer vision and robotics. Existing methods, including pre-trained diffusion models, often struggle to generate plausible completions in dynamic scenarios due to their limited understanding of HOI. To address this challenge, we propose a novel approach that leverages physical prior knowledge alongside a specialized multi-regional inpainting technique tailored for HOI. By incorporating physical constraints derived from human topology and contact information, we define two distinct regions: the primary region, where occluded object parts are most likely to reside, and the secondary region, where occlusions are less probable. Our multi-regional inpainting method employs customized denoising strategies across these regions within a diffusion model, thereby enhancing the accuracy and realism of generated completions in both shape and visual detail. Experimental results demonstrate that our approach substantially outperforms existing methods in HOI scenarios, advancing machine perception toward a more human-like understanding of dynamic environments. Furthermore, we show that our pipeline remains robust even without ground-truth contact annotations, broadening its applicability to tasks such as 3D reconstruction and novel view/pose synthesis. Code will be made publicly available upon acceptance.
Poster
Hiroyasu Akada · Jian Wang · Vladislav Golyanik · Christian Theobalt
[ Exhibit Hall I ]
Abstract
Egocentric 3D human pose estimation has been actively studied using cameras installed in front of a head-mounted device (HMD). While frontal placement is the optimal and the only option for some tasks, such as hand tracking, it remains unclear if the same holds for full-body tracking due to self-occlusion and limited field-of-view coverage. Notably, even the state-of-the-art methods often fail to estimate accurate 3D poses in many scenarios, such as when HMD users tilt their heads upward, a common motion in human activities. A key limitation of existing HMD designs is their neglect of the back of the body, despite its potential to provide crucial 3D reconstruction cues. Hence, this paper investigates the usefulness of rear cameras in the HMD design for full-body tracking. We also show that simply adding rear views to the frontal inputs is not optimal for existing methods due to their dependence on individual 2D joint detectors without effective multi-view integration. To address this issue, we propose a new transformer-based method that refines 2D joint heatmap estimation with multi-view information and heatmap uncertainty, thereby improving 3D pose tracking. Moreover, we introduce two new large-scale datasets, Ego4View-Syn and Ego4View-RW, for a rear-view evaluation. Our experiments show that the …
Poster
Chengxuan Zhu · Qingnan Fan · Qi Zhang · Jinwei Chen · Huaqi Zhang · Chao Xu · Boxin Shi
[ Exhibit Hall I ]
Abstract
We introduce a novel lens blur rendering approach that uses a generative diffusion prior to achieve physically accurate outcomes. Previous lens blur methods are bounded by the accuracy of depth estimation methods, and thus introduce artifacts at depth discontinuities. Our method employs a physics-inspired self-attention module that aligns with the image formation process, incorporating a depth-dependent circle-of-confusion constraint and self-occlusion effects. We adapt the diffusion model to a one-step inference scheme without introducing additional noise, and achieve results of high quality and fidelity. To address the lack of scalable paired training data, we propose to synthesize photorealistic foregrounds with transparency using diffusion models, balancing image authenticity and scene diversity.
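To make the "depth-dependent circle of confusion" mentioned above concrete, here is a minimal sketch of the standard thin-lens formula. It is not the paper's attention module; the function name and parameter names are assumptions chosen for illustration.

```python
def circle_of_confusion(depth_m, focus_m, focal_length_m, f_number):
    """Thin-lens circle-of-confusion diameter on the sensor, in metres.

    depth_m: distance of the scene point from the lens
    focus_m: distance at which the lens is focused
    focal_length_m: lens focal length
    f_number: aperture f-number (focal length / aperture diameter)
    """
    if depth_m <= 0 or f_number <= 0:
        raise ValueError("depth and f-number must be positive")
    if focus_m <= focal_length_m:
        raise ValueError("focus distance must exceed the focal length")
    aperture = focal_length_m / f_number  # aperture diameter
    # c = A * |d - d_f| / d * f / (d_f - f)
    return (aperture * focal_length_m * abs(depth_m - focus_m)
            / (depth_m * (focus_m - focal_length_m)))


if __name__ == "__main__":
    # A 50 mm lens at f/2 focused at 2 m: blur grows away from the focal plane.
    for d in (1.0, 2.0, 4.0, 8.0):
        print(d, circle_of_confusion(d, 2.0, 0.05, 2.0))
```

Points on the focal plane get a blur diameter of zero, and the diameter grows as the depth moves away from it, which is why errors in estimated depth translate directly into blur artifacts.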
Poster
Jiahao Wu · Rui Peng · Jianbo Jiao · Jiayu Yang · Luyang Tang · Kaiqiang Xiong · Jie Liang · Jinbo Yan · Runling Liu · Ronggang Wang
[ Exhibit Hall I ]
Abstract
Due to the complex and highly dynamic motions in the real world, synthesizing dynamic videos from multi-view inputs for arbitrary viewpoints is challenging. Previous works based on neural radiance field or 3D Gaussian splatting are limited to modeling fine-scale motion, greatly restricting their application. In this paper, we introduce \ourname, which consists of two parts to adapt our method to both large-scale and fine-scale motion scenes: 1) We decompose a complex dynamic scene into streamlined local spaces defined by seeds, enabling global modeling by capturing motion within each local space. 2) We decouple static and dynamic features for local space motion modeling. A static feature shared across time steps captures static information, while a dynamic residual field provides time-specific features. These are combined and decoded to generate Temporal Gaussians, modeling motion within each local space. As a result, we propose a novel dynamic scene reconstruction framework to model highly dynamic real-world scenes more realistically. Our method not only demonstrates competitive performance on various fine-scale datasets compared to state-of-the-art (SOTA) methods, but also represents the first attempt to model larger and more complex highly dynamic scenes. Code and models will be made publicly available.
Poster
Jiajin Tang · Zhengxuan Wei · Ge Zheng · Sibei Yang
[ Exhibit Hall I ]
Abstract
Humans can perform previously unexperienced interactions with novel objects simply by observing others engage with them. Weakly-supervised affordance grounding mimics this process by learning to locate object regions that enable actions on egocentric images, using exocentric interaction images with image-level annotations. However, extracting affordance knowledge solely from exocentric images and transferring it one-way to egocentric images limits the applicability of previous works in complex interaction scenarios. Instead, this study introduces LoopTrans, a novel closed-loop framework that not only transfers knowledge from exocentric to egocentric, but also transfers back to enhance exocentric knowledge extraction. Within LoopTrans, several innovative mechanisms are introduced, including unified cross-modal localization and denoising knowledge distillation, to bridge domain gaps between object-centered egocentric and interaction-centered exocentric images, while enhancing knowledge transfer. Experiments show that LoopTrans achieves consistent improvements across all metrics on image and video benchmarks, even handling challenging scenarios where object interaction regions are fully occluded by the human body.
Poster
Nahyuk Lee · Juhong Min · Junhong Lee · Chunghyun Park · Minsu Cho
[ Exhibit Hall I ]
Abstract
This paper introduces a new shape-matching methodology, combinative matching, to combine interlocking parts for geometric shape assembly. Previous methods for geometric assembly typically rely on aligning parts by finding identical surfaces between the parts as in conventional shape matching and registration. In contrast, we explicitly model two distinct properties of interlocking shapes: 'identical surface shape' and 'opposite volume occupancy.' Our method thus learns to establish correspondences across regions where their surface shapes appear identical but their volumes occupy the inverted space to each other. To facilitate this process, we also learn to align regions in rotation by estimating their shape orientations via equivariant neural networks. The proposed approach significantly reduces local ambiguities in matching and allows a robust combination of parts in assembly. Experimental results on geometric assembly benchmarks demonstrate the efficacy of our method, consistently outperforming the state of the art.
Poster
Yihan Cao · Jiazhao Zhang · Zhinan Yu · Shuzhen Liu · Zheng Qin · Qin Zou · Bo Du · Kai Xu
[ Exhibit Hall I ]
Abstract
Object goal navigation (ObjectNav) is a fundamental task in embodied AI, requiring an agent to locate a target object in previously unseen environments. This task is particularly challenging because it requires both perceptual and cognitive processes, including object recognition and decision-making. While substantial advancements in perception have been driven by the rapid development of visual foundation models, progress on the cognitive aspect remains constrained, primarily limited to either implicit learning through simulator rollouts or explicit reliance on predefined heuristic rules. Inspired by neuroscientific findings demonstrating that humans maintain and dynamically update fine-grained cognitive states during object search tasks in novel environments, we propose CogNav, a framework designed to mimic this cognitive process using large language models. Specifically, we model the cognitive process using a finite state machine comprising fine-grained cognitive states, ranging from exploration to identification. Transitions between states are determined by a large language model based on a dynamically constructed heterogeneous cognitive map, which contains spatial and semantic information about the scene being explored. Extensive evaluations on the HM3D, MP3D, and RoboTHOR benchmarks demonstrate that our cognitive process modeling improves the ObjectNav success rate by at least 14% (relative) over the state of the art. The code has been submitted …
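The finite state machine with model-decided transitions described above can be sketched roughly as follows. This is an illustrative toy, not CogNav's implementation: the state names, the allowed-transition table, the `cognitive_map` dictionary keys, and `toy_policy` (a fixed rule standing in for the large language model) are all hypothetical.

```python
# Hypothetical states; the abstract only says they range from
# exploration to identification.
ALLOWED = {
    "explore": {"explore", "verify"},
    "verify": {"explore", "identify"},
    "identify": {"identify"},
}


def step(state, cognitive_map, choose_next):
    """Ask the policy for the next state; ignore proposals the FSM forbids."""
    proposal = choose_next(state, cognitive_map)
    return proposal if proposal in ALLOWED[state] else state


def toy_policy(state, cognitive_map):
    """Stand-in for the language model: a fixed rule over a toy map."""
    if state == "explore":
        return "verify" if cognitive_map.get("candidate_seen") else "explore"
    if state == "verify":
        return "identify" if cognitive_map.get("candidate_confirmed") else "explore"
    return "identify"


def run(maps, state="explore"):
    """Feed a sequence of map snapshots through the FSM and record the states."""
    trace = [state]
    for snapshot in maps:
        state = step(state, snapshot, toy_policy)
        trace.append(state)
    return trace


if __name__ == "__main__":
    print(run([{}, {"candidate_seen": True}, {"candidate_confirmed": True}]))
```

Keeping the transition table outside the policy means an unreliable policy output can only delay progress, never move the agent into a state the machine does not permit from its current one.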
Poster
Xinggang Hu · Chenyangguang Zhang · Mingyuan Zhao · Yuanze Gui · Xiangkui Zhang · Xiangyang Ji
[ Exhibit Hall I ]
Abstract
In dynamic scenes, achieving accurate camera localization and reconstructing a long-term consistent map containing only the static background are two major challenges faced by Visual Simultaneous Localization and Mapping (VSLAM). In current traditional dynamic VSLAM systems, the methods used to handle dynamic objects are primarily designed for localization; if applied to reconstruction, they are prone to introducing motion artifacts. Meanwhile, mask compensation strategies in NeRF- or 3DGS-based dynamic VSLAM systems also face challenges, such as the inability to completely eliminate dynamic object artifacts and low real-time performance. To address these issues, we leverage object detection to extract semantic information and propose a dynamic feature detection algorithm based on both geometry and appearance. This algorithm accurately identifies known and unknown moving objects and determines their actual motion states. To mitigate the issue of insufficient detection box coverage, we design a dynamic object box correction algorithm based on clustering and Gaussian mixture models to comprehensively identify moving object regions. Furthermore, to overcome the limitations of sparse features in texture-scarce environments, we introduce a feature densification strategy based on image texture complexity, enhancing reconstruction quality while maintaining real-time performance. Extensive experimental evaluations demonstrate that our system achieves state-of-the-art localization and reconstruction performance in …
Poster
Sivan Doveh · Nimrod Shabtay · Eli Schwartz · Leonid Karlinsky · Raja Giryes · Hilde Kuehne · Rogerio Feris · James Glass · Assaf Arbelle · Shimon Ullman · Muhammad Jehanzeb Mirza
[ Exhibit Hall I ]
Abstract
Vision-Language Models (VLMs) have shown remarkable capabilities across diverse visual tasks, including image recognition, video understanding, and Visual Question Answering (VQA) when explicitly trained for these tasks. Despite these advances, we find that present-day VLMs (including the proprietary GPT-4o) lack a fundamental cognitive ability: learning to localize specific objects in a scene by taking into account the context. In this work, we focus on the task of few-shot personalized localization, where a model is given a small set of annotated images (in-context examples), each with a category label and bounding box, and is tasked with localizing the same object type in a query image. Personalized localization can be particularly important in cases of ambiguity of several related objects that can respond to a text or an object that is hard to describe with words. To provoke personalized localization abilities in models, we present a data-centric solution that fine-tunes them using carefully curated data from video object tracking datasets. By leveraging sequences of frames tracking the same object across multiple shots, we simulate instruction-tuning dialogues that promote context awareness. To reinforce this, we introduce a novel regularization technique that replaces object labels with pseudo-names, ensuring the model relies on visual context …
Poster
Jiasheng Guo · Xin Gao · Yuxiang Yan · Guanghao Li · Jian Pu
[ Exhibit Hall I ]
Abstract
Low-light object detection is crucial for many real-world applications but remains challenging due to degraded image quality. While recent studies have shown that RAW images offer superior potential over RGB images, existing approaches either use RAW-RGB images with information loss or employ complex frameworks. To address these issues, we propose a lightweight and self-adaptive Image Signal Processing (ISP) plugin, Dark-ISP, which directly processes Bayer RAW images in dark environments, enabling seamless end-to-end training for object detection. Our key innovations are: (1) We deconstruct conventional ISP pipelines into sequential linear (sensor calibration) and nonlinear (tone mapping) sub-modules, recasting them as differentiable components optimized through task-driven losses. Each module is equipped with content-aware adaptability and physics-informed priors, enabling automatic RAW-to-RGB conversion aligned with detection objectives. (2) By exploiting the ISP pipeline’s intrinsic cascade structure, we devise a self-boosting strategy that facilitates cooperation between sub-modules. Through extensive experiments on three RAW image datasets, we demonstrate that our method outperforms state-of-the-art RGB- and RAW-based detection approaches, achieving superior results with minimal parameters in challenging low-light environments.
Poster
Romain Thoreau · Valerio Marsocci · Dawa Derksen
[ Exhibit Hall I ]
Abstract
As large-scale heterogeneous data sets become increasingly available, adapting Foundation Models at low cost has become a key issue. Seminal works in natural language processing, e.g., Low-Rank Adaptation (LoRA), leverage the low "intrinsic rank" of parameter updates during adaptation. In this paper, we argue that stronger inductive biases on the data and on the models can improve the adaptation of Foundation Models pretrained on RGB satellite images to other sources of satellite data. The pretrained parameters of Geospatial Foundation Models (GFMs) indeed provide a strong prior on the spatial dimension of multispectral images. For this reason, we introduce DEFLECT (Deflecting Embeddings for Finetuning Latent representations for Earth and Climate Tasks), a novel strategy for adapting GFMs to multispectral satellite imagery with very few additional parameters. DEFLECT improves the representation capabilities of the extracted features, particularly enhancing spectral information, which is essential for geoscience and environmental-related tasks. We demonstrate the effectiveness of our method across three different GFMs and five diverse datasets, ranging from forest monitoring to marine environment segmentation. Compared to competing methods, DEFLECT achieves on-par or higher accuracy with 5-10x fewer parameters for classification and segmentation tasks. The code will be made publicly available.
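For readers unfamiliar with the low-rank idea that LoRA relies on, the sketch below shows the generic update W + BA in plain Python and the resulting parameter savings. It illustrates LoRA as commonly described, not DEFLECT itself; the helper names and the `scale` argument are assumptions for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def lora_weight(w, b, a, scale=1.0):
    """Return W + scale * (B @ A), where B is d_out x r and A is r x d_in."""
    delta = matmul(b, a)
    return [[wij + scale * dij for wij, dij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


def lora_param_counts(d_out, d_in, rank):
    """Trainable parameters: full fine-tuning versus a rank-r update."""
    return d_out * d_in, rank * (d_out + d_in)


if __name__ == "__main__":
    w = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]  # frozen pretrained weight
    b = [[1.0], [2.0]]                       # trainable, rank 1
    a = [[1.0, 0.0, -1.0]]                   # trainable, rank 1
    print(lora_weight(w, b, a))
    print(lora_param_counts(768, 768, 8))
```

Only B and A are trained while W stays frozen, so the number of trainable parameters scales with the rank rather than with the full weight matrix.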
Poster
Ran Zhao · Xinxin Dai · Pengpeng Hu · Vasile Palade · Adrian Munteanu
[ Exhibit Hall I ]
Abstract
While automatic anthropometric measurement extraction has witnessed growth in recent years, effective, non-contact, and precise measurement methods for dressed humans in arbitrary poses are still lacking, limiting the widespread application of this technology. The occlusion caused by clothing and the adverse influence of posture on body shape significantly increase the complexity of this task. Additionally, current methods often assume the availability of a complete 3D body mesh in a canonical pose (e.g., "A" or "T" pose), which is not always the case in practice. To address these challenges, we propose MeasureXpert, a novel learning-based model that requires only two unregistered, partial, and dressed body scans as input, and accommodates entirely independent and arbitrary poses for each scan. MeasureXpert computes a comprehensive representation of the naked body shape by synergistically fusing features from the front- and back-view partial point clouds. The comprehensive representation obtained is mapped onto a 3D undressed body shape space, assuming a canonical posture and incorporating predefined measurement landmarks. A point-based offset optimization is also developed to refine the reconstructed complete body shape, enabling accurate regression of measurement values. To train the proposed model, a new large-scale dataset, consisting of 300K samples, was synthesized. The proposed model was …
Poster
Dimitrije Antić · Georgios Paschalidis · Shashank Tripathi · Theo Gevers · Sai Kumar Dwivedi · Dimitrios Tzionas
[ Exhibit Hall I ]
Abstract
Recovering 3D object pose and shape from a single image is a challenging and highly ill-posed problem. This is due to strong (self-)occlusions, depth ambiguities, the vast intra- and inter-class shape variance, and lack of 3D ground truth for natural images. While existing methods train deep networks on synthetic datasets to predict 3D shapes, they often struggle to generalize to real-world scenarios, lack an explicit feedback loop for refining noisy estimates, and primarily focus on geometry without explicitly considering pixel alignment. To this end, we make two key observations: (1) a robust solution requires a model that imposes a strong category-specific shape prior to constrain the search space, and (2) foundational models embed 2D images and 3D shapes in joint spaces; both help resolve ambiguities. Hence, we propose SDFit, a novel optimization framework that is built on three key innovations: First, we use a learned morphable signed-distance-function (mSDF) model that acts as a strong shape prior, thus constraining the shape space. Second, we use foundational models to establish rich 2D-to-3D correspondences between image features and the mSDF. Third, we develop a fitting pipeline that iteratively refines both shape and pose, aligning the mSDF to the image. We evaluate SDFit on …
Poster
Sanghun Jung · Jingjing Zheng · Ke Zhang · Nan Qiao · Albert Y. C. Chen · Lu Xia · Chi Liu · Yuyin Sun · Xiao Zeng · Hsiang-Wei Huang · Byron Boots · Min Sun · Cheng-Hao Kuo
[ Exhibit Hall I ]
Abstract
Unlike closed-vocabulary 3D instance segmentation that is trained end-to-end, open-vocabulary 3D instance segmentation (OV-3DIS) leverages vision-language models (VLMs) to generate 3D instance proposals and classify them. While various concepts have been proposed in existing research, we observe that these individual concepts are not mutually exclusive but complementary. In this paper, we propose a new state-of-the-art solution for OV-3DIS by carefully designing a recipe to combine the concepts together and refining them to address key challenges. Our solution follows the two-stage scheme: 3D proposal generation and instance classification. We employ robust 3D tracking-based proposal aggregation to generate 3D proposals and remove overlapped or partial proposals by iterative merging/removal. For the classification stage, we replace the standard CLIP model with Alpha-CLIP, which incorporates object masks as an alpha channel to reduce background noise and obtain object-centric representation. Additionally, we introduce the standardized maximum similarity (SMS) score to normalize text-to-proposal similarity, effectively filtering out false positives and boosting precision. Our framework achieves state-of-the-art performance on ScanNet200, S3DIS, and Replica across all AP and AR metrics, even surpassing an end-to-end closed-vocabulary method.
Poster
Gene Chou · Wenqi Xian · Guandao Yang · Mohamed Abdelfattah · Bharath Hariharan · Noah Snavely · Ning Yu · Paul Debevec
[ Exhibit Hall I ]
Abstract
A versatile video depth estimation model should be consistent and accurate across frames, produce high-resolution depth maps, and support real-time streaming. We propose a method, FlashDepth, that satisfies all three requirements, performing depth estimation for a 2044x1148 streaming video at 24 FPS. We show that, with careful modifications to pretrained single-image depth models, these capabilities are enabled with relatively little data and training. We validate our approach across multiple unseen datasets against state-of-the-art depth models, and find that our method outperforms them in terms of boundary sharpness and speed by a significant margin, while maintaining competitive accuracy. We hope our model will enable various applications that require high-resolution depth, such as visual effects editing, and online decision-making, such as robotics.
Poster
Zhirui Gao · Renjiao Yi · Yuhang Huang · Wei Chen · Chenyang Zhu · Kai Xu
[ Exhibit Hall I ]
Abstract
Low-level 3D representations, such as point clouds, meshes, NeRFs and 3D Gaussians, are commonly used for modeling 3D objects and scenes. However, cognitive studies indicate that human perception operates at higher levels and interprets 3D environments by decomposing them into meaningful structural parts, rather than low-level elements like points or voxels. Structured geometric decomposition enhances scene interpretability and facilitates downstream tasks requiring component-level manipulation. In this work, we introduce PartGS, a self-supervised part-aware reconstruction framework that integrates 2D Gaussians and superquadrics to parse objects and scenes into an interpretable decomposition, leveraging multi-view image inputs to uncover 3D structural information. Our method jointly optimizes superquadric meshes and Gaussians by coupling their parameters within a hybrid representation. On one hand, superquadrics enable the representation of a wide range of shape primitives, facilitating flexible and meaningful decomposition. On the other hand, 2D Gaussians capture detailed texture and geometric details, ensuring high-fidelity appearance and geometry reconstruction. Operating in a self-supervised manner, our approach demonstrates superior performance compared to state-of-the-art methods across extensive experiments on the DTU, ShapeNet, and real-world datasets.
Poster
Qianqian Wang · Vickie Ye · Hang Gao · Weijia Zeng · Jake Austin · Zhengqi Li · Angjoo Kanazawa
[ Exhibit Hall I ]
Abstract
Monocular dynamic reconstruction is a challenging and long-standing vision problem due to the highly ill-posed nature of the task. Existing approaches depend on templates, are effective only in quasi-static scenes, or fail to model 3D motion explicitly. We introduce a method for reconstructing generic dynamic scenes, featuring explicit, persistent 3D motion trajectories in the world coordinate frame, from casually captured monocular videos. We tackle the problem with two key insights: First, we exploit the low-dimensional structure of 3D motion by representing scene motion with a compact set of SE(3) motion bases. Each point's motion is expressed as a linear combination of these bases, facilitating soft decomposition of the scene into multiple rigidly-moving groups. Second, we take advantage of off-the-shelf data-driven priors such as monocular depth maps and long-range 2D tracks, and devise a method to effectively consolidate these noisy supervisory signals, resulting in a globally consistent representation of the dynamic scene. Experiments show that our method achieves state-of-the-art performance for both long-range 3D/2D motion estimation and novel view synthesis on dynamic scenes.
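One simple way to read "each point's motion is a linear combination of SE(3) bases" is to apply every rigid transform to the point and blend the results with per-point weights. The sketch below does exactly that; it is an assumption about the general idea, not the paper's parameterization, and `apply_se3` and `blend_motion` are hypothetical names.

```python
def apply_se3(rotation, translation, point):
    """Apply a rigid transform (3x3 rotation, 3-vector translation) to a point."""
    return tuple(
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    )


def blend_motion(point, bases, weights):
    """Move a point by a weighted average of the rigid motions in `bases`.

    bases: list of (rotation, translation) pairs shared by the whole scene
    weights: non-negative per-point coefficients, normalized here to sum to 1
    """
    if len(bases) != len(weights):
        raise ValueError("need one weight per motion basis")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    moved = [0.0, 0.0, 0.0]
    for (rotation, translation), weight in zip(bases, weights):
        target = apply_se3(rotation, translation, point)
        for i in range(3):
            moved[i] += (weight / total) * target[i]
    return tuple(moved)


if __name__ == "__main__":
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    bases = [(identity, (0.0, 0.0, 0.0)), (identity, (1.0, 0.0, 0.0))]
    print(blend_motion((0.0, 0.0, 0.0), bases, [0.5, 0.5]))
```

A point whose weight is concentrated on one basis moves rigidly with that group, while mixed weights give the soft assignment between groups that the abstract describes.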
Poster
Zhenyu Li · Mykola Lavreniuk · Jian Shi · Shariq Bhat · Peter Wonka
[ Exhibit Hall I ]
Abstract
Amodal depth estimation aims to predict the depth of occluded (invisible) parts of objects in a scene. This task addresses the question of whether models can effectively perceive the geometry of occluded regions based on visible cues. Prior methods primarily rely on synthetic datasets and focus on metric depth estimation, limiting their generalization to real-world settings due to domain shifts and scalability challenges. In this paper, we propose a novel formulation of amodal depth estimation in the wild, focusing on relative depth prediction to improve model generalization across diverse natural images. We introduce a new large-scale dataset, Amodal Depth In the Wild (ADIW), created using a scalable pipeline that leverages segmentation datasets and compositing techniques. Depth maps are generated using large pre-trained depth models, and a scale-and-shift alignment strategy is employed to refine and blend depth predictions, ensuring consistency in ground-truth annotations. To tackle the amodal depth task, we present two complementary frameworks: Amodal-DAV2, a deterministic model based on Depth Anything V2, and Amodal-DepthFM, a generative model that integrates conditional flow matching principles. Our proposed frameworks effectively leverage the capabilities of large pre-trained models with minimal modifications to achieve high-quality amodal depth predictions. Experiments validate our design choices, demonstrating the …
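Scale-and-shift alignment of relative depth is usually posed as a least-squares fit of a scale s and shift t so that s * prediction + t matches a reference. The following is a minimal closed-form sketch under that assumption; it is not the ADIW pipeline, and the function names are hypothetical.

```python
def fit_scale_shift(pred, target):
    """Least-squares s, t minimizing sum((s * p + t - y) ** 2) over paired samples."""
    if len(pred) != len(target) or len(pred) < 2:
        raise ValueError("need at least two paired samples")
    n = len(pred)
    mean_p = sum(pred) / n
    mean_y = sum(target) / n
    var_p = sum((p - mean_p) ** 2 for p in pred)
    if var_p == 0:
        raise ValueError("prediction is constant; scale is undetermined")
    cov = sum((p - mean_p) * (y - mean_y) for p, y in zip(pred, target))
    scale = cov / var_p
    shift = mean_y - scale * mean_p
    return scale, shift


def align(pred, scale, shift):
    """Apply the fitted scale and shift to a list of depth values."""
    return [scale * p + shift for p in pred]


if __name__ == "__main__":
    relative = [1.0, 2.0, 3.0, 4.0]
    reference = [2.5, 4.5, 6.5, 8.5]  # equals 2 * relative + 0.5
    s, t = fit_scale_shift(relative, reference)
    print(s, t, align(relative, s, t))
```

In practice the fit would be computed only over pixels where both depth maps are valid, and the fitted pair is then applied to the whole prediction before blending.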
Poster
Deepayan Das · Davide Talon · Yiming Wang · Massimiliano Mancini · Elisa Ricci
[ Exhibit Hall I ]
Abstract
Vision Language Models (VLMs) have led to major improvements in multimodal reasoning, yet they still struggle to understand user-specific concepts. Existing personalization methods address this limitation but heavily rely on training procedures that can be either costly or unpleasant to individual users. We depart from existing work, and for the first time explore the training-free setting in the context of personalization. We propose a novel method, Retrieval and Reasoning for Personalization (R2P), leveraging internal knowledge of VLMs. First, we leverage VLMs to extract the concept fingerprint, i.e., key attributes uniquely defining the concept within its semantic class. When a query arrives, the most similar fingerprints are retrieved and scored via chain-of-thought reasoning. To reduce the risk of hallucinations, the scores are validated through cross-modal verification at the attribute level: in case of a discrepancy between the scores, R2P refines the concept association via pairwise multimodal matching, where the retrieved fingerprints and their images are directly compared with the query. We validate R2P on two publicly available benchmarks and a newly introduced dataset, Personal Concepts with Visual Ambiguity (PerVA), for concept identification highlighting challenges in visual ambiguity. R2P consistently outperforms state-of-the-art approaches on various downstream tasks across all benchmarks. Code will be available upon acceptance.
Poster
Artem Zholus · Carl Doersch · Yi Yang · Skanda Koppula · Viorica Patraucean · Xu He · Ignacio Rocco · Mehdi S. M. Sajjadi · Sarath Chandar · Ross Goroshin
[ Exhibit Hall I ]
Abstract
Tracking Any Point (TAP) in a video is a challenging computer vision problem with many demonstrated applications in robotics, video editing, and 3D reconstruction. Existing methods for TAP rely heavily on complex tracking-specific inductive biases and heuristics, limiting their generality and potential for scaling. To address these challenges, we present TAPNext, a new approach that casts TAP as sequential masked token decoding. Our model is causal, tracks in a purely online fashion, and removes tracking-specific inductive biases. This enables TAPNext to run with minimal latency, and removes the temporal windowing required by many existing state-of-the-art trackers. Despite its simplicity, TAPNext achieves new state-of-the-art tracking performance among both online and offline trackers. Finally, we present evidence that many widely used tracking heuristics emerge naturally in TAPNext through end-to-end training.
Poster
Minghua Liu · Mikaela Uy · Donglai Xiang · Hao Su · Sanja Fidler · Nicholas Sharp · Jun Gao
[ Exhibit Hall I ]
Abstract
We propose PartField, a feedforward approach for learning part-based 3D features, which captures the general concept of parts and their hierarchy without relying on predefined templates or text-based names, and can be applied to open-world 3D shapes across various modalities. PartField requires only a 3D feedforward pass at inference time, significantly improving runtime and robustness compared to prior approaches. Our model is trained by distilling 2D and 3D part proposals from a mix of labeled datasets and image segmentations on large unsupervised datasets, via a contrastive learning formulation. It produces a continuous feature field which can be clustered to yield a hierarchical part decomposition. Comparisons show that PartField is up to 20% more accurate and often orders of magnitude faster than other recent class-agnostic part-segmentation methods. Beyond single-shape part decomposition, consistency in the learned field emerges across shapes, enabling tasks such as co-segmentation and correspondence, which we demonstrate in several applications of these general-purpose, hierarchical, and consistent 3D feature fields.
Poster
Meiao Wang · Xuejing Kang · Yaxi Lu · Jie Xu
[ Exhibit Hall I ]
Abstract
Low-light video enhancement (LLVE) aims to restore videos degraded by insufficient illumination. While existing methods have demonstrated their effectiveness, they often face challenges with intra-frame noise, overexposure, and inter-frame inconsistency since they fail to exploit the temporal continuity across frames. Inspired by the progressive video understanding mechanism of humans, we propose a novel end-to-end two-stage memory controller (MC) dominated network (RetinexMCNet). Specifically, we first define the overall optimization objective for Retinex-based LLVE, and accordingly design our framework. In stage one, aided by a dual-perspective Lightness-Texture Stability (LTS) loss, we perform per-frame enhancement without the MC, which uses a channel-aware Illumination Adjustment Module (IAM) and an illumination-guided Reflectance Denoising Module (RDM) based on Retinex theory to mitigate intra-frame noise and overexposure. In stage two, we activate the MC to simulate human temporal memory and integrate it with high-quality single frames for global consistency. Extensive qualitative and quantitative experiments on common low-light sRGB datasets demonstrate that our method significantly outperforms state-of-the-art approaches. Code is available at xxx/xxx/xxx.
Poster
Ronggang Huang · Haoxin Yang · Yan Cai · Xuemiao Xu · Huaidong Zhang · Shengfeng He
[ Exhibit Hall I ]
Abstract
3D visual grounding aims to identify and localize objects in a 3D space based on textual descriptions. However, existing methods struggle with disentangling targets from anchors in complex multi-anchor queries and resolving inconsistencies in spatial descriptions caused by perspective variations. To tackle these challenges, we propose ViewSRD, a framework that formulates 3D visual grounding as a structured multi-view decomposition process. First, the Simple Relation Decoupling (SRD) module restructures complex multi-anchor queries into a set of targeted single-anchor statements, generating a structured set of perspective-aware descriptions that clarify positional relationships. These decomposed representations serve as the foundation for the Multi-view Textual-Scene Interaction (Multi-TSI) module, which integrates textual and scene features across multiple viewpoints using shared, Cross-modal Consistent View Tokens (CCVTs) to preserve spatial correlations. Finally, a Textual-Scene Reasoning module synthesizes multi-view predictions into a unified and robust 3D visual grounding. Experiments on 3D visual grounding datasets show that ViewSRD significantly outperforms state-of-the-art methods, particularly in complex queries requiring precise spatial differentiation.
Poster
Shunsuke Yasuki · Taiki Miyanishi · Nakamasa Inoue · Shuhei Kurita · Koya Sakamoto · Daichi Azuma · Masato Taki · Yutaka Matsuo
[ Exhibit Hall I ]
Abstract
The advancement of 3D language fields has enabled intuitive interactions with 3D scenes via natural language. However, existing approaches are typically limited to small-scale environments, lacking the scalability and compositional reasoning capabilities necessary for large, complex urban settings. To overcome these limitations, we propose GeoProg3D, a visual programming framework that enables natural language-driven interactions with city-scale high-fidelity 3D scenes. GeoProg3D consists of two key components: (i) a Geography-aware City-scale 3D Language Field (GCLF) that leverages a memory-efficient hierarchical 3D model to handle large-scale data, integrated with geographic information for efficiently filtering vast urban spaces using directional cues, distance measurements, elevation data, and landmark references; and (ii) Geographical Vision APIs (GV-APIs), specialized geographic vision tools such as area segmentation and object detection. Our framework employs large language models (LLMs) as reasoning engines to dynamically combine GV-APIs and operate GCLF, effectively supporting diverse geographic vision tasks. To assess performance in city-scale reasoning, we introduce GeoEval3D, a comprehensive benchmark dataset containing 952 query-answer pairs across five challenging tasks: grounding, spatial reasoning, comparison, counting, and measurement. Experiments demonstrate that GeoProg3D significantly outperforms existing 3D language fields and vision-language models across multiple tasks. To our knowledge, GeoProg3D is the first framework enabling compositional geographic reasoning …
Poster
Wajahat Khalid · Bin Liu · Xulin Li · Muhammad Waqas · Muhammad Sher Afgan
[ Exhibit Hall I ]
Abstract
Aerial-Ground Person Re-Identification (AG-ReID) is a practical yet challenging task that involves cross-platform matching between aerial and ground cameras. Existing person Re-Identification (Re-ID) methods are primarily designed for homogeneous camera settings, such as ground-to-ground or aerial-to-aerial matching. Therefore, these conventional Re-ID approaches underperform due to the significant viewpoint discrepancies introduced by cross-platform cameras in the AG-ReID task. To address this limitation, we propose a novel and efficient approach, termed View-Invariant Feature Learning for Aerial-Ground Person Re-Identification (VIF-AGReID), which explores view-invariant features without leveraging any auxiliary information. Our approach introduces two key components: (1) Patch-Level RotateMix (PLRM), an augmentation strategy that enhances rotational diversity within local regions of training samples, enabling the model to capture fine-grained view-invariant features, and (2) View-Invariant Angular Loss (VIAL), which mitigates the impact of perspective variations by imposing angular constraints that exponentially penalize large angular deviations, optimizing the similarity of positive pairs while enhancing dissimilarity for hard negatives. These components interact synergistically to drive view-invariant feature learning, enhancing robustness across diverse viewpoints. We conduct extensive experiments on benchmark AG-ReID datasets, including CARGO and AG-ReID, to evaluate the effectiveness of our proposed method. Experimental results demonstrate that VIF-AGReID significantly outperforms existing state-of-the-art methods, achieving superior performance in …
Poster
Jinming Li · Yichen Zhu · Zhibin Tang · Junjie Wen · Minjie Zhu · Xiaoyu Liu · Chengmeng Li · Ran Cheng · Yaxin Peng · Yan Peng · Feifei Feng
[ Exhibit Hall I ]
Abstract
Robot foundation models, particularly Vision-Language-Action (VLA) models, have garnered significant attention for their ability to enhance robot policy learning, greatly improving robots' generalization and robustness. OpenAI’s recent model, O1, showcased impressive capabilities in solving complex problems by utilizing extensive reasoning chains. This prompts an important question: can robot models achieve better performance in multi-task, complex environments by reviewing prior observations and then providing task-specific reasoning to guide action prediction? In this paper, we introduce \textbf{Chain-of-Affordance (CoA-VLA)}, a novel approach to scaling robot models by incorporating reasoning in the format of sequential robot affordances to facilitate task completion. Specifically, we prompt the model to consider the following four types of affordances before taking action: (1) \textit{object affordance}: what object to manipulate and where it is; (2) \textit{grasp affordance}: the specific object part to grasp; (3) \textit{spatial affordance}: the optimal space to place the object; and (4) \textit{movement affordance}: the collision-free path for movement. We further transform each affordance into two prompting formats: \textbf{\textit{visual affordance and textual affordance}}. We introduce a novel vision-language co-injection module that integrates this knowledge into the policy network. This allows the robot to leverage essential contextual information during action inference, resulting in improved precision …
Poster
Fengbo Lan · Chang Wen Chen
[ Exhibit Hall I ]
Abstract
Reflective flares are common artifacts in photography that degrade image quality, introducing in-focus flares, which appear as bright, regular spot patterns, and out-of-focus flares, which are diffuse and semi-transparent, obscuring the underlying scene. While previous methods have achieved some success in removing in-focus flares, they struggle with the diffuse nature of out-of-focus flares. The lack of an out-of-focus flare dataset has further hindered the development of effective flare removal models. In this work, we construct a large-scale out-of-focus flare dataset generated based on physical principles. We propose a novel color alignment approach using diffusion models to address the challenges of out-of-focus reflective flare removal. Rather than reconstructing flare-affected regions, our method adjusts the color distribution to reduce artifact visibility while preserving image content. Specifically, we introduce a differentiable histogram loss, derived from the Earth Mover's Distance (EMD), to effectively align color distributions. The proposed approach outperforms existing methods on both synthetic and real-world data, demonstrating improved performance in flare removal.
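The abstract above mentions a histogram loss derived from the Earth Mover's Distance. As a rough illustration of the underlying quantity only, the sketch below computes the EMD between two 1D histograms defined over the same unit-width bins, using the standard identity that it equals the L1 distance between their cumulative distributions. The function names `normalize` and `emd_1d` are hypothetical, and the sketch does not reproduce the paper's differentiable histogram construction or its use with diffusion models.

```python
from itertools import accumulate


def normalize(hist):
    """Scale a histogram so its bins sum to 1."""
    total = sum(hist)
    if total <= 0:
        raise ValueError("histogram must have positive total mass")
    return [h / total for h in hist]


def emd_1d(hist_a, hist_b):
    """EMD between two 1D histograms on identical unit-width bins.

    For 1D distributions, the EMD equals the sum of absolute
    differences between the two cumulative distributions.
    """
    if len(hist_a) != len(hist_b):
        raise ValueError("histograms must have the same number of bins")
    cdf_a = accumulate(normalize(hist_a))
    cdf_b = accumulate(normalize(hist_b))
    return sum(abs(a - b) for a, b in zip(cdf_a, cdf_b))


# Illustrative values only: all mass moves from bin 0 to bin 2.
distance = emd_1d([1, 0, 0], [0, 0, 1])
```

Unlike a bin-by-bin difference, this quantity grows with how far mass has to move between bins, which is one reason EMD-style measures are often used for comparing color distributions.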
Poster
Baicheng Li · Zike Yan · Dong Wu · Hongbin Zha
[ Exhibit Hall I ]
Abstract
Human behaviors are the major causes of scene dynamics and inherently contain rich cues regarding the dynamics. This paper formalizes a new task of proactive scene decomposition and reconstruction, an online approach that leverages human-object interactions to iteratively disassemble and reconstruct the environment. By observing these intentional interactions, we can dynamically refine the decomposition and reconstruction process, addressing inherent ambiguities in static object-level reconstruction. The proposed system effectively integrates multiple tasks in dynamic environments such as accurate camera and object pose estimation, instance decomposition, and online map updating, capitalizing on cues from human-object interactions in egocentric live streams for a flexible, progressive alternative to conventional object-level reconstruction methods. Aided by the Gaussian splatting technique, accurate and consistent dynamic scene modeling is achieved with photorealistic and efficient rendering. The efficacy is validated in multiple real-world scenarios with promising advantages.
Poster
Tom Fischer · Xiaojie Zhang · Eddy Ilg
[ Exhibit Hall I ]
Abstract
Recognizing objects in images is a fundamental problem in computer vision. While detecting objects in 2D images is common, many applications require determining their pose in 3D space. Traditional category-level methods rely on RGB-D inputs, which may not always be available, or employ two-stage approaches that use separate models and representations for detection and pose estimation. For the first time, we introduce a unified model that integrates detection and pose estimation into a single framework for RGB images by leveraging neural mesh models with learned features and multi-model RANSAC. Our approach achieves state-of-the-art results on RGB category-level pose on REAL275, outperforming the current state-of-the-art by 5.5\%, averaged across all scale-agnostic metrics. Finally, we demonstrate that our unified method exhibits significantly greater robustness compared to single-stage baselines.
Poster
Danila Rukhovich · Elona Dupont · Dimitrios Mallis · Kseniya Cherenkova · Anis Kacem · Djamila Aouada
[ Exhibit Hall I ]
Abstract
Computer-Aided Design (CAD) models are typically constructed by sequentially drawing parametric sketches and applying CAD operations to obtain a 3D model. The problem of 3D CAD reverse engineering consists of reconstructing the sketch and CAD operation sequences from 3D representations such as point clouds. In this paper, we address this challenge through novel contributions across three levels: CAD sequence representation, network design, and dataset. In particular, we represent CAD sketch-extrude sequences as Python code. The proposed CAD-Recode translates a point cloud into Python code that, when executed, reconstructs the CAD model. Taking advantage of the exposure of pre-trained Large Language Models (LLMs) to Python code, we leverage a relatively small LLM as a decoder for CAD-Recode and combine it with a lightweight point cloud projector. CAD-Recode is trained solely on a proposed synthetic dataset of one million diverse CAD sequences. CAD-Recode significantly outperforms existing methods across three datasets while requiring fewer input points. Notably, it achieves 10 times lower mean Chamfer distance than state-of-the-art methods on DeepCAD and Fusion360 datasets. Furthermore, we show that our CAD Python code output is interpretable by off-the-shelf LLMs, enabling CAD editing and CAD-specific question answering from point clouds.
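The abstract above reports results in terms of mean Chamfer distance. For readers unfamiliar with the metric, here is a minimal brute-force sketch of one common variant: the mean squared nearest-neighbor distance from each set to the other, summed over both directions. Conventions differ across papers (squared or unsquared distances, sums or means, scaling factors), and the exact variant used for the reported numbers is not stated in the abstract; the function name `chamfer_distance` is illustrative.

```python
import math


def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance between two non-empty point sets.

    Uses mean squared nearest-neighbor distance in each direction.
    Brute force, O(len(a) * len(b)); fine for small illustrative inputs.
    """
    if not points_a or not points_b:
        raise ValueError("both point sets must be non-empty")

    def one_way(src, dst):
        # For every source point, squared distance to its nearest target.
        return sum(min(math.dist(p, q) ** 2 for q in dst) for p in src) / len(src)

    return one_way(points_a, points_b) + one_way(points_b, points_a)


# Illustrative values only: a two-point set against a single point.
cd = chamfer_distance([(0, 0, 0), (1, 0, 0)], [(0, 0, 0)])
```

In practice, point clouds with thousands of points would call for a spatial index or a vectorized implementation rather than this nested loop.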
Poster
Dmitrii Torbunov · Yihui Ren · Animesh Ghose · Odera Dim · Yonggang Cui
[ Exhibit Hall I ]
Abstract
Event-based cameras (EBCs) have emerged as a bio-inspired alternative to traditional cameras, offering advantages in power efficiency, temporal resolution, and high dynamic range. However, the development of image analysis methods for EBCs is challenging due to the sparse and asynchronous nature of the data. This work addresses the problem of object detection for EBCs. The current approaches to EBC object detection focus on constructing complex data representations and rely on specialized architectures. We introduce I2EvDet (Image-to-Event Detection), a novel adaptation framework that bridges mainstream object detection with temporal event data processing. First, we demonstrate that a Real-Time DEtection TRansformer, or RT-DETR, a state-of-the-art natural image detector, trained on a simple image-like representation of the EBC data achieves performance comparable to specialized EBC methods. Next, as part of our framework, we develop an efficient adaptation technique that transforms image-based detectors into event-based detection models by modifying their frozen latent representation space through minimal architectural additions. The resulting EvRT-DETR model reaches state-of-the-art performance on the standard benchmark datasets Gen1 (mAP $+2.3$) and 1Mpx/Gen4 (mAP $+1.4$). These results demonstrate a fundamentally new approach to EBC object detection through principled adaptation of mainstream architectures, offering an efficient alternative with potential applications to other temporal visual domains.
Poster
Connor Malone · Somayeh Hussaini · Tobias Fischer · Michael Milford
[ Exhibit Hall I ]
Abstract
Visual Place Recognition (VPR) enables coarse localization by comparing query images to a reference database of geo-tagged images. Recent breakthroughs in deep learning architectures and training regimes have led to methods with improved robustness to factors like environment appearance change, but with the downside that the required training and/or matching compute scales with the number of distinct environmental conditions encountered. Here, we propose Hyperdimensional One Place Signatures (HOPS) to simultaneously improve the performance, compute and scalability of these state-of-the-art approaches by fusing the descriptors from multiple reference sets captured under different conditions. HOPS scales to any number of environmental conditions by leveraging the Hyperdimensional Computing framework. Extensive evaluations demonstrate that our approach is highly generalizable and consistently improves recall performance across all evaluated VPR methods and datasets by large margins. Arbitrarily fusing reference images without compute penalty enables numerous other useful possibilities, three of which we demonstrate here: descriptor dimensionality reduction with no performance penalty, stacking synthetic images, and coarse localization to an entire traverse or environmental section.
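The abstract above describes fusing descriptors from several reference sets with Hyperdimensional Computing. A basic operation in that framework is bundling, the element-wise sum of vectors, which yields a single vector that stays similar to each of its inputs. The sketch below shows bundling followed by cosine matching on toy descriptors. The names `bundle` and `cosine`, the toy values, and the absence of any normalization or weighting are assumptions for illustration; the abstract does not specify the exact fusion used by HOPS.

```python
import math


def bundle(descriptors):
    """Fuse equal-length descriptors by element-wise summation."""
    if not descriptors:
        raise ValueError("need at least one descriptor")
    length = len(descriptors[0])
    if any(len(d) != length for d in descriptors):
        raise ValueError("all descriptors must have the same length")
    return [sum(values) for values in zip(*descriptors)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)


# Hypothetical descriptors of one place under two conditions.
day = [1.0, 0.0, 1.0, 0.0]
night = [0.0, 1.0, 1.0, 0.0]
signature = bundle([day, night])

# A query resembling the night view still matches the fused signature.
score = cosine(signature, [0.0, 1.0, 1.0, 0.0])
```

Because the fused signature has the same length as a single descriptor, the matching cost per place does not grow with the number of reference conditions, which is consistent with the scalability claim in the abstract.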
Poster
Fangqi Zhu · Hongtao Wu · Song Guo · Yuxiao Liu · Chilam Cheang · Tao Kong
[ Exhibit Hall I ]
Abstract
World models allow autonomous agents to plan and explore by predicting the visual outcomes of different actions. However, for robot manipulation, it is challenging to accurately model fine-grained robot-object interaction within the visual space using existing methods, which overlook precise alignment between each action and the corresponding frame. In this paper, we present IRASim, a novel world model capable of generating videos with fine-grained robot-object interaction details, conditioned on historical observations and robot action trajectories. We train a diffusion transformer and introduce a novel frame-level action-conditioning module within each transformer block to explicitly model and strengthen the action-frame alignment. Extensive experiments show that: (1) the quality of the videos generated by our method surpasses that of all compared baseline methods and scales effectively with increased model size and computation; (2) policy evaluations using IRASim exhibit a strong correlation with those using the ground-truth simulator, highlighting its potential to accelerate real-world policy evaluation; (3) test-time scaling through model-based planning with IRASim significantly enhances policy performance, as evidenced by an improvement in the IoU metric on the Push-T benchmark from 0.637 to 0.961; (4) IRASim provides flexible action controllability, allowing virtual robotic arms in datasets to be controlled via a keyboard or VR controller. Video and code …
Poster
Zhiqiang Yuan · Ting Zhang · Yeshuang Zhu · Jiapei Zhang · Ying Deng · Zexi Jia · Peixiang Luo · Xiaoyue Duan · Jie Zhou · Jinchao Zhang
[ Exhibit Hall I ]
Abstract
Approximately 200 million individuals around the world suffer from varying degrees of visual impairment, making it crucial to leverage AI technology to offer walking assistance for these people. With the recent progress of vision-language models (VLMs), applying VLMs to offer walking guidance has become popular. However, the existing methods of walking guidance are mainly based on self-curated question-answering datasets that are not publicly accessible, without a standardized benchmark for training or evaluation. Moreover, walking assistance often requires real-time streaming video analysis and the generation of concise yet informative reminders, where VLMs struggle due to excessive responses and low inference efficiency. In this paper, we introduce the first large-scale dataset dedicated to walking assistance, comprising 12,000 video-annotation pairs, to provide a unified benchmark for training and evaluating systems to help visually-impaired individuals walk. Furthermore, a WalkVLM model is proposed, which employs chain of thought for hierarchical planning to generate concise but informative reminders and utilizes temporal-aware adaptive prediction to reduce the temporal redundancy of reminders. Finally, we have established a solid benchmark for the blind walking task and verified the advantages of WalkVLM in streaming video processing for this task compared to other VLMs.
Poster
Lei Tian · Xiaomin Li · Liqian Ma · Hao Yin · Zirui Zheng · Hefei Huang · Taiqing Li · Huchuan Lu · Xu Jia
[ Exhibit Hall I ]
Abstract
Recent advances in 3D reconstruction techniques and vision-language models have fueled significant progress in 3D semantic understanding—a capability critical to robotics, autonomous driving, and virtual/augmented reality. However, methods that rely on 2D priors are prone to a critical challenge: cross-view semantic inconsistencies induced by occlusion, image blur, and view-dependent variations. These inconsistencies, when propagated via projection supervision, deteriorate the quality of 3D Gaussian semantic fields and introduce artifacts in the rendered outputs. To mitigate this limitation, we propose CCL-LGS, a novel framework that enforces view-consistent semantic supervision by integrating multi-view semantic cues. Specifically, our approach first employs a zero-shot tracker to align a set of 2D masks—provided by SAM—to reliably identify their corresponding categories. Next, we utilize CLIP to extract robust semantic encodings across views. Finally, our Contrastive Codebook Learning (CCL) module distills discriminative semantic features by enforcing intra-class compactness and inter-class distinctiveness. In contrast to previous methods that directly apply CLIP to imperfect masks, our framework explicitly resolves semantic conflicts while preserving category discriminability. Extensive experiments demonstrate CCL-LGS's superiority over previous state-of-the-art methods.
Poster
Jiaying Ying · Heming Du · Kaihao Zhang · Lincheng Li · Xin Yu
[ Exhibit Hall I ]
Abstract
Human pose estimation aims to predict the location of body keypoints and enable various practical applications. However, existing research focuses solely on individuals with full physical bodies and overlooks those with limb deficiencies. As a result, current pose estimation annotation formats cannot be generalized to individuals with limb deficiencies. In this paper, we introduce the \textbf{Limb-Deficient Pose Estimation task}, which not only predicts the locations of standard human body keypoints, but also estimates the endpoints of missing limbs. To support this task, we present \textbf{Limb-Deficient Pose (LDPose), the first-ever human pose estimation dataset for individuals with limb deficiencies}. LDPose comprises over 28k images for approximately 100k individuals across diverse limb deficiency types and ethnic backgrounds. The annotation process is guided by internationally accredited para-athletics classifiers to ensure high precision. In addition, we propose a \textbf{Limb-Deficient Loss (LDLoss)} to better distinguish residual limb keypoints by contrasting residual limb keypoints and intact limb keypoints. Furthermore, we design a \textbf{Limb-Deficient Metric (LD Metrics)} to quantitatively measure the keypoint predictions of both residual and intact limbs and benchmark our dataset using state-of-the-art human pose estimation methods. Experimental results indicate that LDPose is a challenging dataset, and we believe that it will foster further research and ultimately support individuals with limb deficiencies …