

Poster Session

Poster Session 6 & Exhibit Hall with Coffee Break

Exhibit Hall I
Thu 23 Oct 5:30 p.m. PDT — 7:30 p.m. PDT

Thu 23 Oct. 17:30 - 19:30 PDT

#427
Stealthy Backdoor Attack in Federated Learning via Adaptive Layer-wise Gradient Alignment

Qingqian Yang · Peishen Yan · Xiaoyu Wu · Jiaru Zhang · Tao Song · Yang Hua · Hao Wang · Liangliang Wang · Haibing Guan

The distributed nature of federated learning exposes it to significant security threats, among which backdoor attacks are one of the most prevalent. However, existing backdoor attacks face a trade-off between attack strength and stealthiness: attacks that maximize strength are often detectable, while stealthier approaches significantly reduce the attack's effectiveness. Both cases result in ineffective backdoor injection. In this paper, we propose an adaptive layer-wise gradient alignment strategy that effectively evades various robust defense mechanisms while preserving attack strength. Without requiring additional knowledge, we leverage the previous global update as a reference for alignment to ensure stealthiness during dynamic FL training. This fine-grained alignment strategy applies appropriate constraints to each layer, which helps preserve the attack's impact. To demonstrate the effectiveness of our method, we conduct exhaustive evaluations across a wide range of datasets and networks. Our experimental results show that the proposed attack effectively bypasses eight state-of-the-art (SOTA) defenses and achieves high backdoor accuracy, outperforming existing attacks by up to 54.76%. Additionally, it significantly preserves attack strength and maintains robust performance across diverse scenarios, highlighting its adaptability and generalizability.
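
To make the alignment idea concrete, here is a minimal PyTorch-style sketch that blends each layer of a malicious update toward the previous global update; the single `alpha` knob is a hypothetical stand-in for the paper's adaptive per-layer schedule.

```python
import torch

def layerwise_align(malicious_update, prev_global_update, alpha=0.5):
    """Blend each layer of a malicious update toward the previous global
    update so the submitted update looks statistically closer to benign ones.

    malicious_update / prev_global_update: dicts of layer name -> tensor.
    alpha: alignment strength in [0, 1] (0 = no stealth, 1 = fully aligned).
    Illustrative constraint only, not the paper's exact adaptive rule.
    """
    aligned = {}
    for name, g_mal in malicious_update.items():
        g_ref = prev_global_update[name]
        # Project the malicious layer update onto the reference direction,
        # then interpolate between the raw and projected updates.
        ref_dir = g_ref / (g_ref.norm() + 1e-12)
        projected = (g_mal.flatten() @ ref_dir.flatten()) * ref_dir
        aligned[name] = (1 - alpha) * g_mal + alpha * projected
    return aligned
```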

Thu 23 Oct. 17:45 - 19:45 PDT

#1
X2-Gaussian: 4D Radiative Gaussian Splatting for Continuous-time Tomographic Reconstruction

Weihao Yu · Yuanhao Cai · Ruyi Zha · Zhiwen Fan · Chenxin Li · Yixuan Yuan

Four-dimensional computed tomography (4D CT) reconstruction is crucial for capturing dynamic anatomical changes but faces inherent limitations from conventional phase-binning workflows. Current methods discretize temporal resolution into fixed phases with respiratory gating devices, introducing motion misalignment and restricting clinical practicality. In this paper, we propose X$^2$-Gaussian, a novel framework that enables continuous-time 4D-CT reconstruction by integrating dynamic radiative Gaussian splatting with self-supervised respiratory motion learning. Our approach models anatomical dynamics through a spatiotemporal encoder-decoder architecture that predicts time-varying Gaussian deformations, eliminating phase discretization. To remove dependency on external gating devices, we introduce a physiology-driven periodic consistency loss that learns patient-specific breathing cycles directly from projections via differentiable optimization. Extensive experiments demonstrate state-of-the-art performance, achieving a 9.93 dB PSNR gain over traditional methods and 2.25 dB improvement against prior Gaussian splatting techniques. By unifying continuous motion modeling with hardware-free period learning, X$^2$-Gaussian advances high-fidelity 4D CT reconstruction for dynamic clinical imaging.
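
A hedged sketch of what a periodic consistency objective of this kind can look like, assuming a deformation network `deform_fn` and a learnable scalar `period`; the paper's actual loss and optimization details may differ.

```python
import torch

def periodic_consistency_loss(deform_fn, gaussian_centers, t, period):
    """Encourage the predicted deformation field to repeat after one
    (learnable) breathing period: D(x, t) ~= D(x, t + T).

    deform_fn: callable (positions, time) -> per-Gaussian offsets.
    period: learnable scalar tensor for the patient-specific cycle length,
            e.g. torch.tensor(4.0, requires_grad=True).
    A simplified stand-in for the physiology-driven loss described above.
    """
    d_now = deform_fn(gaussian_centers, t)
    d_next_cycle = deform_fn(gaussian_centers, t + period)
    return torch.mean((d_now - d_next_cycle) ** 2)
```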

Thu 23 Oct. 17:45 - 19:45 PDT

#2
Highlight
Inverse Image-Based Rendering for Light Field Generation from Single Images

Hyunjun Jung · Hae-Gon Jeon

The concept of light fields computed from multi-view images on regular grids has proven beneficial for scene representation, supporting realistic rendering of novel views and photographic effects such as refocusing and shallow depth of field. Despite the effectiveness of light-flow computation, obtaining light fields requires either high computational cost or specialized devices such as a bulky camera rig or a microlens array. To broaden its benefit and applicability, in this paper we propose a novel view synthesis method for light field generation from only single images, named $\textit{inverse image-based rendering}$. Unlike previous attempts to implicitly rebuild 3D geometry or to explicitly represent target scenes, our method reconstructs light flows in a space from image pixels, which behaves in the opposite way to image-based rendering. To accomplish this, we design a neural rendering pipeline that renders a target ray from an arbitrary viewpoint. Our neural renderer first stores the light flow of source rays from the input image, then computes the relationships among them through cross-attention, and finally predicts the color of the target ray based on these relationships. After the rendering pipeline generates the first novel view from a single input image, the generated out-of-view contents are added to the set of source rays, and this procedure is performed iteratively while ensuring consistent generation of occluded contents. We demonstrate that our inverse image-based rendering works well on various challenging datasets without any retraining or fine-tuning, once trained on a synthetic dataset. In addition, our method outperforms relevant state-of-the-art novel view synthesis methods.

Thu 23 Oct. 17:45 - 19:45 PDT

#3
HyperGCT: A Dynamic Hyper-GNN-Learned Geometric Constraint for 3D Registration

Xiyu Zhang · Jiayi Ma · Jianwei Guo · Wei Hu · Zhaoshuai Qi · Fei HUI · Jiaqi Yang · Yanning Zhang

Geometric constraints between feature matches are critical in 3D point cloud registration problems. Existing approaches typically model unordered matches as a consistency graph and sample consistent matches to generate hypotheses. However, explicit graph construction introduces noise, making it difficult for handcrafted geometric constraints to enforce consistency among matches. To overcome this, we propose HyperGCT, a flexible dynamic $\bf{Hyper}$-$\bf{G}$NN-learned geometric $\bf{C}$onstrain$\bf{T}$ that leverages high-order consistency among 3D correspondences. To our knowledge, HyperGCT is the first method that mines robust geometric constraints from dynamic hypergraphs for 3D registration. By dynamically optimizing the hypergraph through vertex and edge feature aggregation, HyperGCT effectively captures the correlations among correspondences, leading to accurate hypothesis generation. Extensive experiments on 3DMatch, 3DLoMatch, KITTI-LC, and ETH show that HyperGCT achieves state-of-the-art performance. Furthermore, our method is robust to graph noise, demonstrating a significant advantage in terms of generalization. The code will be released.

Thu 23 Oct. 17:45 - 19:45 PDT

#4
Highlight
CounterPC: Counterfactual Feature Realignment for Unsupervised Domain Adaptation on Point Clouds

Feng Yang · Yichao Cao · Xiu Su · Dan Niu · Xuanpeng Li

Understanding real-world 3D point clouds is challenging due to domain shifts, causing geometric variations like density changes, noise, and occlusions. The key challenge is disentangling domain-invariant semantics from domain-specific geometric variations, as point clouds exhibit local inconsistency and global redundancy, making direct alignment ineffective. To address this, we propose CounterPC, a counterfactual intervention-based domain adaptation framework, which formulates domain adaptation within a causal latent space, identifying category-discriminative features entangled with intra-class geometric variation confounders. Through counterfactual interventions, we generate counterfactual target samples that retain domain-specific characteristics while improving class separation, mitigating domain bias for optimal feature transfer. To achieve this, we introduce two key modules: i) Joint Distribution Alignment, which leverages 3D foundation models (3D-FMs) and a self-supervised autoregressive generative prediction task to unify feature alignment, and ii) Counterfactual Feature Realignment, which employs Optimal Transport (OT) to align category-relevant and category-irrelevant feature distributions, ensuring robust sample-level adaptation while preserving domain and category properties. CounterPC outperforms state-of-the-art methods on PointDA and GraspNetPC-10, achieving accuracy improvements of 4.7 and 3.6, respectively. Code and pre-trained weights will be publicly released.

Thu 23 Oct. 17:45 - 19:45 PDT

#5
AD-GS: Object-Aware B-Spline Gaussian Splatting for Self-Supervised Autonomous Driving

Jiawei Xu · Kai Deng · Zexin Fan · Shenlong Wang · Jin Xie · jian Yang

Modeling and rendering dynamic urban driving scenes is crucial for self-driving simulation. Current high-quality methods typically rely on costly manual object tracklet annotations, while self-supervised approaches fail to capture dynamic object motions accurately and decompose scenes properly, resulting in rendering artifacts. We introduce AD-GS, a novel self-supervised framework for high-quality free-viewpoint rendering of driving scenes from a single log. At its core is a novel learnable motion model that integrates locality-aware B-spline curves with global-aware trigonometric functions, enabling flexible yet precise dynamic object modeling. Rather than requiring comprehensive semantic labeling, AD-GS automatically segments scenes into objects and background with the simplified pseudo 2D segmentation, representing objects using dynamic Gaussians and bidirectional temporal visibility masks. Further, our model incorporates visibility reasoning and physically rigid regularization to enhance robustness. Extensive evaluations demonstrate that our annotation-free model significantly outperforms current state-of-the-art annotation-free methods and is competitive with annotation-dependent approaches.
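
As an illustration of combining locality-aware B-splines with global trigonometric terms, the following sketch evaluates a toy per-object motion model; the control-point and Fourier parameterization shown here is an assumption, not the paper's exact design.

```python
import torch

def bspline_trig_motion(ctrl_pts, fourier_a, fourier_b, t):
    """Evaluate a toy motion model mixing a uniform cubic B-spline (local)
    with low-frequency trigonometric terms (global).

    ctrl_pts:  (K, 3) learnable control points, K >= 4.
    fourier_a, fourier_b: (H, 3) learnable sine/cosine coefficients.
    t: scalar float in [0, 1] (normalized time).  Illustrative only.
    """
    K = ctrl_pts.shape[0]
    # Map t onto the K-3 uniform spline segments.
    s = torch.clamp(torch.as_tensor(t, dtype=ctrl_pts.dtype) * (K - 3),
                    min=0.0, max=K - 3 - 1e-6)
    i = int(s.floor())
    u = s - i
    # Uniform cubic B-spline blending weights for the local segment.
    b = torch.stack([(1 - u) ** 3,
                     3 * u ** 3 - 6 * u ** 2 + 4,
                     -3 * u ** 3 + 3 * u ** 2 + 3 * u + 1,
                     u ** 3]) / 6.0
    local = (b[:, None] * ctrl_pts[i:i + 4]).sum(dim=0)           # (3,)
    # Global low-frequency periodic component.
    H = fourier_a.shape[0]
    k = torch.arange(1, H + 1, dtype=ctrl_pts.dtype)[:, None]     # (H, 1)
    phase = 2 * torch.pi * k * t
    global_term = (fourier_a * torch.sin(phase)
                   + fourier_b * torch.cos(phase)).sum(dim=0)     # (3,)
    return local + global_term
```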

Thu 23 Oct. 17:45 - 19:45 PDT

#6
EvaGaussians: Event Stream Assisted Gaussian Splatting from Blurry Images

Wangbo Yu · Chaoran Feng · Jianing Li · Jiye Tang · Jiashu Yang · Zhenyu Tang · Meng Cao · Xu Jia · Yuchao Yang · Li Yuan · Yonghong Tian

3D Gaussian Splatting (3D-GS) has demonstrated exceptional capabilities in synthesizing novel views of 3D scenes. However, its training is heavily reliant on high-quality images and precise camera poses. Meeting these criteria can be challenging in non-ideal real-world conditions, where motion-blurred images frequently occur due to high-speed camera movements or low-light environments. To address these challenges, we introduce Event Stream Assisted Gaussian Splatting (EvaGaussians), a novel approach that harnesses event streams captured by event cameras to facilitate the learning of high-quality 3D-GS from blurred images. Capitalizing on the high temporal resolution and dynamic range offered by event streams, we seamlessly integrate them into the initialization and optimization of 3D-GS, thereby enhancing the acquisition of high-fidelity novel views with intricate texture details. We also contribute two novel datasets comprising RGB frames, event streams, and corresponding camera parameters, featuring a wide variety of scenes and various camera motions. The comparison results reveal that our approach not only excels in generating high-fidelity novel views, but also offers faster training and inference speeds. Video results are available at the supplementary project page.

Thu 23 Oct. 17:45 - 19:45 PDT

#7
Axis-level Symmetry Detection with Group-Equivariant Representation

Wongyun Yu · Ahyun Seo · Minsu Cho

Symmetry is a fundamental concept that has been studied extensively; however, its detection in complex scenes remains challenging in computer vision. Recent heatmap-based methods identify potential regions of symmetry axes but lack precision for individual axes. In this work, we introduce a novel framework for axis-level detection of the two most common symmetry types, reflection and rotation, representing them as explicit geometric primitives, i.e., lines and points. We formulate a dihedral group-equivariant dual-branch architecture, where each branch exploits the properties of dihedral group-equivariant features in a novel, specialized manner for each symmetry type. Specifically, for reflection symmetry, we propose orientational anchors aligned with group components to enable orientation-specific detection, and reflectional matching that computes similarity between patterns and their mirrored counterparts across potential reflection axes. For rotational symmetry, we propose rotational matching that computes the similarity between patterns at fixed angular intervals. Extensive experiments demonstrate that our method significantly outperforms state-of-the-art methods.
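
A toy illustration of rotational matching, restricted to exact 90-degree rotations of a square feature patch via `torch.rot90`; the paper's group-equivariant formulation operates on dihedral-group features and finer angular intervals.

```python
import torch
import torch.nn.functional as F

def four_fold_rotational_matching(feat):
    """Score a candidate rotation center by comparing a square feature patch
    with copies of itself rotated by 90, 180, and 270 degrees and averaging
    the cosine similarities.  Simplified stand-in for the rotational matching
    described above.

    feat: (C, H, W) local feature patch with H == W, centered on the candidate.
    """
    base = feat.flatten()
    scores = []
    for k in (1, 2, 3):
        rotated = torch.rot90(feat, k, dims=(1, 2)).flatten()
        scores.append(F.cosine_similarity(base, rotated, dim=0))
    # High value -> patterns repeat under rotation -> likely rotation center.
    return torch.stack(scores).mean()
```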

Thu 23 Oct. 17:45 - 19:45 PDT

#8
STD-GS: Exploring Frame-Event Interaction for SpatioTemporal-Disentangled Gaussian Splatting to Reconstruct High-Dynamic Scene

Hanyu Zhou · Haonan Wang · Haoyue Liu · Yuxing Duan · Luxin Yan · Gim Hee Lee

High-dynamic scene reconstruction aims to represent the static background with rigid spatial features and dynamic objects with continuously deforming spatiotemporal features. Typically, existing methods adopt a unified representation model (e.g., Gaussian) to directly match the spatiotemporal features of a dynamic scene captured by a frame camera. However, this unified paradigm fails on the potentially discontinuous temporal features of objects caused by frame imaging and on the heterogeneous spatial features between background and objects. To address this issue, we disentangle the spatiotemporal features into separate latent representations to alleviate the spatiotemporal mismatch between background and objects. In this work, we introduce an event camera to compensate for the frame camera, and propose a spatiotemporal-disentangled Gaussian splatting framework for high-dynamic scene reconstruction. For the dynamic scene, we observe that background and objects exhibit an appearance discrepancy in frame-based spatial features and a motion discrepancy in event-based temporal features, which motivates us to distinguish the spatiotemporal features between background and objects via clustering. For dynamic objects, we find that Gaussian representations and event data share consistent spatiotemporal characteristics, which can serve as a prior to guide the spatiotemporal disentanglement of object Gaussians. Within the Gaussian splatting framework, the cumulative scene-object disentanglement improves the spatiotemporal discrimination between background and objects to render the time-continuous dynamic scene. Extensive experiments verify the superiority of the proposed method.

Thu 23 Oct. 17:45 - 19:45 PDT

#9
Monocular Semantic Scene Completion via Masked Recurrent Networks

Xuzhi Wang · Xinran Wu · Song Wang · Lingdong Kong · Ziping Zhao

Monocular Semantic Scene Completion (MSSC) aims to predict the voxel-wise occupancy and semantic category from a single-view RGB image. Existing methods adopt a single-stage framework that aims to simultaneously achieve visible region segmentation and occluded region hallucination, while also being affected by inaccurate depth estimation. Such methods often achieve suboptimal performance, especially in complex scenes. We propose a novel two-stage framework that decomposes MSSC into coarse MSSC followed by the Masked Recurrent Network. Specifically, we propose the Masked Sparse Gated Recurrent Unit (MS-GRU) which concentrates on the occupied regions by the proposed mask updating mechanism, and a sparse GRU design is proposed to reduce the computation cost. Additionally, we propose the distance attention projection to reduce projection errors by assigning different attention scores according to the distance to the observed surface. Experimental results demonstrate that our proposed unified framework, MonoMRN, effectively supports both indoor and outdoor scenes and achieves state-of-the-art performance on the NYUv2 and SemanticKITTI datasets. Furthermore, we conduct robustness analysis under various disturbances, highlighting the role of the Masked Recurrent Network in enhancing the model's resilience to such challenges. The code will be publicly available.
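
A minimal sketch of a masked, sparse recurrent update in PyTorch: only voxels flagged as occupied are gathered, refined by a GRU cell, and scattered back. This illustrates the general mechanism only; the paper's MS-GRU and mask updating details differ.

```python
import torch
import torch.nn as nn

class MaskedSparseGRU(nn.Module):
    """Toy masked, sparse recurrent refinement over flattened voxel features."""

    def __init__(self, channels):
        super().__init__()
        self.cell = nn.GRUCell(channels, channels)

    def forward(self, voxel_feats, hidden, occupied_mask):
        # voxel_feats, hidden: (N, C) flattened voxel features.
        # occupied_mask: (N,) bool mask from the coarse first stage.
        idx = occupied_mask.nonzero(as_tuple=True)[0]
        # Recurrent update only on the sparse set of occupied voxels,
        # leaving the (dominant) empty voxels untouched to save compute.
        new_hidden = hidden.clone()
        new_hidden[idx] = self.cell(voxel_feats[idx], hidden[idx])
        return new_hidden
```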

Thu 23 Oct. 17:45 - 19:45 PDT

#10
ORION: A Holistic End-to-End Autonomous Driving Framework by Vision-Language Instructed Action Generation

Haoyu Fu · Diankun Zhang · Zongchuang Zhao · Jianfeng Cui · DINGKANG LIANG · Chong Zhang · Dingyuan Zhang · Hongwei Xie · BING WANG · Xiang Bai

End-to-end (E2E) autonomous driving methods still struggle to make correct decisions in interactive closed-loop evaluation due to limited causal reasoning capability. Current methods attempt to leverage the powerful understanding and reasoning abilities of Vision-Language Models (VLMs) to resolve this dilemma. However, few VLM-based E2E methods perform well in closed-loop evaluation, and the problem remains open due to the gap between the semantic reasoning space and the purely numerical trajectory output of the action space. To tackle this issue, we propose ORION, a holistic E2E autonomous driving framework driven by vision-language instructed action generation. ORION uniquely combines a QT-Former to aggregate long-term history context, a Large Language Model (LLM) for driving scenario reasoning, and a generative planner for precise trajectory prediction. ORION further aligns the reasoning space and the action space to implement a unified E2E optimization for both visual question-answering (VQA) and planning tasks. Our method achieves an impressive closed-loop performance of 77.47 Driving Score (DS) and 54.62\% Success Rate (SR) on the challenging Bench2Drive benchmark, outperforming state-of-the-art (SOTA) methods by a large margin of 14.28 DS and 28.08\% SR.

Thu 23 Oct. 17:45 - 19:45 PDT

#11
All in One: Visual-Description-Guided Unified Point Cloud Segmentation

Zongyan Han · Mohamed El Amine Boudjoghra · Jiahua Dong · Jinhong Wang · Rao Anwer

Unified segmentation of 3D point clouds is crucial for scene understanding, but is hindered by its sparse structure, limited annotations, and the challenge of distinguishing fine-grained object classes in complex environments. Existing methods often struggle to capture rich semantic and contextual information due to limited supervision and a lack of diverse multimodal cues, leading to suboptimal differentiation of classes and instances. To address these challenges, we propose VDG-Uni3DSeg, a novel framework that integrates pre-trained vision-language models (e.g., CLIP) and large language models (LLMs) to enhance 3D segmentation. By leveraging LLM-generated textual descriptions and reference images from the internet, our method incorporates rich multimodal cues, facilitating fine-grained class and instance separation. We further design a Semantic-Visual Contrastive Loss to align point features with multimodal queries and a Spatial Enhanced Module to model scene-wide relationships efficiently. Operating within a closed-set paradigm that utilizes multimodal knowledge generated offline, VDG-Uni3DSeg achieves state-of-the-art results in semantic, instance, and panoptic segmentation, offering a scalable and practical solution for 3D understanding. We will release our code and models.

Thu 23 Oct. 17:45 - 19:45 PDT

#12
Bolt3D: Generating 3D Scenes in Seconds

Stanislaw Szymanowicz · Jason Y. Zhang · Pratul Srinivasan · Ruiqi Gao · Arthur Brussee · Aleksander Holynski · Ricardo Martin Brualla · Jonathan Barron · Philipp Henzler

We present a latent diffusion model for fast feed-forward 3D scene generation. Given one or more images, our model Bolt3D directly samples a 3D scene representation in less than seven seconds on a single GPU. We achieve this by leveraging powerful and scalable existing 2D diffusion network architectures to produce consistent high-fidelity 3D scene representations. To train this model, we create a large-scale multiview-consistent dataset of 3D geometry and appearance by applying state-of-the-art dense 3D reconstruction techniques to existing multiview image datasets. Compared to prior multiview generative models that require per-scene optimization for 3D reconstruction, Bolt3D reduces the inference cost by a factor of 300.

Thu 23 Oct. 17:45 - 19:45 PDT

#13

A reliable, hard-landmark-sensitive loss is urgently needed in heatmap-based facial landmark detection, as existing standard regression losses are ineffective at capturing the small errors caused by peak mismatches and struggle to adaptively focus on hard-to-detect landmarks. These limitations can misguide model training, impacting both the coverage and accuracy of the model. To this end, we propose a novel POsition-aware and Sample-Sensitive Loss, named PossLoss, for reliable, hard-landmark-sensitive landmark detection. Specifically, our PossLoss is position-aware, incorporating relative positional information to accurately differentiate and locate the peak of the heatmap, while adaptively balancing the influence of landmarks and background pixels through self-weighting, addressing the extreme imbalance between landmarks and non-landmarks. Moreover, our PossLoss is sample-sensitive: it distinguishes easy and hard landmarks and adaptively makes the model focus more on hard landmarks. It also addresses the difficulty of accurately evaluating heatmap distributions, especially small errors due to peak mismatches. We analyze and evaluate our PossLoss on three challenging facial landmark detection tasks. The experimental results show that our PossLoss significantly improves the performance of landmark detection and outperforms state-of-the-art methods. The source code will be made available on GitHub.
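
A hedged sketch of a position-aware, sample-sensitive heatmap loss in PyTorch, using target-based self-weighting and an error-based focusing term; this is an illustrative stand-in, not the published PossLoss formula.

```python
import torch

def position_aware_heatmap_loss(pred, target, gamma=2.0, eps=1e-6):
    """Illustrative loss: pixels are self-weighted by the target heatmap so
    landmark peaks dominate the vast background, and per-landmark errors are
    re-weighted so harder landmarks contribute more.

    pred, target: (B, L, H, W) heatmaps in [0, 1].
    """
    per_pixel = (pred - target) ** 2
    # Position-aware self-weighting: emphasize pixels near the annotated peak.
    weights = target + eps
    per_landmark = (weights * per_pixel).sum(dim=(2, 3)) / weights.sum(dim=(2, 3))
    # Sample-sensitive focusing: landmarks with larger error get larger weight.
    focus = (per_landmark.detach() / (per_landmark.detach().mean() + eps)) ** gamma
    return (focus * per_landmark).mean()
```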

Thu 23 Oct. 17:45 - 19:45 PDT

#14
Unsupervised RGB-D Point Cloud Registration for Scenes with Low Overlap and Photometric Inconsistency

yejun Shou · Haocheng Wang · Lingfeng Shen · Qian Zheng · Gang Pan · Yanlong Cao

Point cloud registration is a fundamental task in 3D vision, playing a crucial role in various fields. With the rapid advancement of RGB-D sensors, unsupervised point cloud registration methods based on RGB-D sequences have demonstrated excellent performance. However, existing methods struggle in scenes with low overlap and photometric inconsistency. Low overlap results in numerous correspondence outliers, while photometric inconsistency hinders the model's ability to extract discriminative features. To address these challenges, we first propose the Overlapping Constraint for Inliers Detection (OCID) module, which filters and optimizes the initial correspondence set using an overlapping constraint. This module robustly selects reliable correspondences within the overlapping region while maintaining a balance between accuracy and efficiency. Additionally, we introduce a novel scene representation, 3DGS, which integrates both geometric and texture information, making it particularly well-suited for RGB-D registration tasks. Building on this, we propose the Gaussian Rendering for Photometric Adaptation (GRPA) module, which refines the geometric transformation and enhances the model's adaptability to scenes with inconsistent photometric information. Extensive experiments on ScanNet and ScanNet1500 demonstrate that our method achieves state-of-the-art performance.

Thu 23 Oct. 17:45 - 19:45 PDT

#15
Semantic Causality-Aware Vision-Based 3D Occupancy Prediction

Dubing Chen · Huan Zheng · Yucheng Zhou · Xianfei Li · Wenlong Liao · Tao He · Pai Peng · Jianbing Shen

Vision-based 3D semantic occupancy prediction is essential for autonomous systems, converting 2D camera data into 3D semantic grids. Current methods struggle to align 2D evidence with 3D predictions, undermining reliability and interpretability. This limitation drives a new exploration of the task’s causal foundations. We propose a novel approach that leverages causal principles to enhance semantic consistency in 2D-to-3D geometric transformation. Our framework introduces a causal loss that backpropagates 3D class features to 2D space for semantic alignment, ensuring 3D locations accurately reflect corresponding 2D regions. Building on this, we develop a Semantic Causality-Aware Lifting (SCA Lifting) method with three components, all guided by our causal loss: Channel-Grouped Lifting to adaptively map distinct semantics to appropriate positions, Learnable Camera Parameters to enhance camera perturbation robustness, and Normalized Convolution to propagate features to sparse regions. The evaluations demonstrate substantial gains in accuracy and robustness, positioning our method as a versatile solution for advancing 3D vision. Experimental results demonstrate that our approach significantly improves robustness to camera perturbations, enhances the semantic causal consistency in 2D-to-3D transformations, and yields substantial accuracy gains on the Occ3D dataset.

Thu 23 Oct. 17:45 - 19:45 PDT

#16
U-ViLAR: Uncertainty-Aware Visual Localization for Autonomous Driving via Differentiable Association and Registration

Xiaofan Li · Zhihao Xu · Chenming Wu · Zhao Yang · Yumeng Zhang · Jiang-Jiang Liu · Haibao Yu · Xiaoqing Ye · YuAn Wang · Shirui Li · Xun Sun · Ji Wan · Jun Wang

Accurate localization using visual information is a critical yet challenging task, especially in urban environments where nearby buildings and construction sites significantly degrade GNSS (Global Navigation Satellite System) signal quality. This issue underscores the importance of visual localization techniques in scenarios where GNSS signals are unreliable. This paper proposes U-ViLAR, a novel uncertainty-aware visual localization framework designed to address these challenges while enabling adaptive localization using high-definition (HD) maps or navigation maps. Specifically, our method first extracts features from the input visual data and maps them into Bird’s-Eye-View (BEV) space to enhance spatial consistency with the map input. Subsequently, we introduce: a) Perceptual Uncertainty-guided Association, which mitigates errors caused by perception uncertainty, and b) Localization Uncertainty-guided Registration, which reduces errors introduced by localization uncertainty. By effectively balancing the coarse-grained large-scale localization capability of association with the fine-grained precise localization capability of registration, our approach achieves robust and accurate localization. Experimental results demonstrate that our method achieves state-of-the-art performance across multiple localization tasks. Furthermore, our model has undergone rigorous testing on large-scale autonomous driving fleets and has demonstrated stable performance in various challenging urban scenarios. The source code will be released.

Thu 23 Oct. 17:45 - 19:45 PDT

#17
Benchmarking Burst Super-Resolution for Polarization Images: Noise Dataset and Analysis

Inseung Hwang · Kiseok Choi · Hyunho Ha · Min H. Kim

Snapshot polarization imaging calculates polarization states from linearly polarized subimages. To achieve this, a polarization camera employs a double Bayer-patterned sensor to capture both color and polarization. This design suffers from low light efficiency and low spatial resolution, resulting in increased noise and compromised polarization measurements. Although burst super-resolution effectively reduces noise and enhances spatial resolution, applying it to polarization imaging poses challenges due to the lack of tailored datasets and reliable ground truth noise statistics. To address these issues, we introduce PolarNS and PolarBurstSR, two innovative datasets developed specifically for polarization imaging. PolarNS provides characterization of polarization noise statistics, facilitating thorough analysis, while PolarBurstSR functions as a benchmark for burst super-resolution in polarization images. These datasets, collected under various real-world conditions, enable comprehensive evaluation. Additionally, we present a model for analyzing polarization noise to quantify noise propagation, tested on a large dataset captured in a darkroom environment. As part of our application, we compare the latest burst super-resolution models, highlighting the advantages of training tailored to polarization compared to RGB-based methods. This work establishes a benchmark for polarization burst super-resolution and offers critical insights into noise propagation, thereby enhancing polarization image reconstruction.
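
For background, the degree and angle of linear polarization that such a camera measures follow from the standard Stokes parameters of the four polarized subimages; the sketch below shows only this textbook computation, on top of which the paper's noise model operates.

```python
import numpy as np

def polarization_from_subimages(i0, i45, i90, i135, eps=1e-8):
    """Standard Stokes-based computation of the degree and angle of linear
    polarization from the four linearly polarized subimages captured by a
    division-of-focal-plane polarization camera."""
    s0 = 0.5 * (i0 + i45 + i90 + i135)               # total intensity
    s1 = i0 - i90
    s2 = i45 - i135
    dolp = np.sqrt(s1 ** 2 + s2 ** 2) / (s0 + eps)   # degree of linear polarization
    aolp = 0.5 * np.arctan2(s2, s1)                  # angle of linear polarization
    return s0, dolp, aolp
```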

Thu 23 Oct. 17:45 - 19:45 PDT

#18
Group Inertial Poser: Multi-Person Pose and Global Translation from Sparse Inertial Sensors and Ultra-Wideband Ranging

Ying Xue · Jiaxi Jiang · Rayan Armani · Dominik Hollidt · Yi-Chi Liao · Christian Holz

Tracking human motion using wearable inertial measurement units (IMUs) overcomes occlusion and environmental limitations inherent in vision-based approaches. However, such sparse IMU tracking also compromises translation estimates and accurate relative positioning between multiple individuals, as inertial cues are inherently self-referential and provide no direct spatial reference or relational information about others. In this paper, we present a novel approach that leverages the distances between the IMU sensors worn by one person as well as between those across multiple people. Our method Inter Inertial Poser derives these absolute inter-sensor distances from ultra-wideband ranging (UWB) and inputs them into structured state-space models to integrate temporal motion patterns for precise 3D pose estimation. Our novel coarse-to-fine optimization process further leverages these inter-sensor distances for accurately estimating the trajectories between individuals. To evaluate our method, we introduce Inter-UWB, the first IMU+UWB dataset for two-person tracking, which comprises 200 minutes of motion recordings from 14 participants. Our results show that Inter Inertial Poser outperforms the state-of-the-art methods in both accuracy and robustness across synthetic and real-world captures, demonstrating the promise of IMU+UWB-based multi-human motion capture in the wild.

Thu 23 Oct. 17:45 - 19:45 PDT

#19
DreamCube: RGB-D Panorama Generation via Multi-plane Synchronization

Yukun Huang · Yanning Zhou · Jianan Wang · Kaiyi Huang · Xihui Liu

3D panorama synthesis is a promising yet challenging task that demands high-quality and diverse visual appearance and geometry of the generated omnidirectional content. Existing methods leverage rich image priors from pre-trained 2D foundation models to circumvent the scarcity of 3D panoramic data, but the incompatibility between 3D panoramas and 2D single views limits their effectiveness. In this work, we demonstrate that by applying multi-plane synchronization to the operators from 2D foundation models, their capabilities can be seamlessly extended to the omnidirectional domain. Based on this design, we further introduce DreamCube, a multi-plane RGB-D diffusion model for 3D panorama generation, which maximizes the reuse of 2D foundation model priors to achieve diverse appearances and accurate geometry while maintaining multi-view consistency. Extensive experiments demonstrate the effectiveness of our approach in panoramic image generation, panoramic depth estimation, and 3D scene generation.

Thu 23 Oct. 17:45 - 19:45 PDT

#20
OcRFDet: Object-Centric Radiance Fields for Multi-View 3D Object Detection in Autonomous Driving

Mingqian Ji · Jian Yang · Shanshan Zhang

Current multi-view 3D object detection methods typically transfer 2D features into 3D space using depth estimation or a 3D position encoder, but in a fully data-driven and implicit manner, which limits the detection performance. Inspired by the success of radiance fields in 3D reconstruction, we assume they can be used to enhance the detector's ability of 3D geometry estimation. However, we observe a decline in detection performance when we directly use them for 3D rendering as an auxiliary task. From our analysis, we find the performance drop is caused by the strong responses on the background when rendering the whole scene. To address this problem, we propose object-centric radiance fields, focusing on modeling foreground objects while discarding background noise. Specifically, we employ object-centric radiance fields (OcRF) to enhance 3D voxel features via an auxiliary task of rendering foreground objects. We further use opacity, a side product of rendering, to enhance the 2D foreground BEV features via height-aware opacity-based attention (HOA), where attention maps at different height levels are generated separately via multiple networks in parallel. Extensive experiments on the nuScenes validation and test datasets demonstrate that our OcRFDet achieves superior performance, outperforming previous state-of-the-art methods with 57.2\% mAP and 64.8\% NDS on the nuScenes test benchmark.

Thu 23 Oct. 17:45 - 19:45 PDT

#21
GaussianFlowOcc: Sparse and Weakly Supervised Occupancy Estimation using Gaussian Splatting and Temporal Flow

Simon Boeder · Fabian Gigengack · Benjamin Risse

Occupancy estimation has become a prominent task in 3D computer vision, particularly within the autonomous driving community. In this paper, we present a novel approach to occupancy estimation, termed GaussianFlowOcc, which is inspired by Gaussian Splatting and replaces traditional dense voxel grids with a sparse 3D Gaussian representation. Our efficient model architecture based on a Gaussian Transformer significantly reduces computational and memory requirements by eliminating the need for expensive 3D convolutions used with inefficient voxel-based representations that predominantly represent empty 3D spaces. GaussianFlowOcc effectively captures scene dynamics by estimating temporal flow for each Gaussian during the overall network training process, offering a straightforward solution to a complex problem that is often neglected by existing methods. Moreover, GaussianFlowOcc is designed for scalability, as it employs weak supervision and does not require costly dense 3D voxel annotations based on additional data (e.g., LiDAR). Through extensive experimentation, we demonstrate that GaussianFlowOcc significantly outperforms all previous methods for weakly supervised occupancy estimation on the nuScenes dataset while featuring an inference speed that is 50 times faster than current SotA.

Thu 23 Oct. 17:45 - 19:45 PDT

#22
Highlight
RESCUE: Crowd Evacuation Simulation via Controlling SDM-United Characters

Xiaolin Liu · Tianyi zhou · Hongbo Kang · Jian Ma · Ziwen Wang · Jing Huang · Wenguo Weng · Yu-Kun Lai · Kun Li

Evacuation simulations are vital for improving safety, pinpointing risks, and refining emergency protocols. However, no existing methods can simulate realistic, personalized, and online 3D evacuation motions. In this paper, aligned with the sensory-decision-motor (SDM) flow of the human brain, we propose an online SDM-united 3D evacuation simulation framework with a 3D-adaptive Social Force Model and a proxemics-aware personalization method. Additionally, we introduce Part-level Force Visualization to assist in evacuation analysis. We experimentally validate that our framework supports online personalized dynamic path planning and behaviors throughout the evacuation process, and is compatible with uneven terrain. Visually, our method generates evacuation results that are more realistic and plausible, providing enhanced insights for evacuation strategy development. The code will be released for research purposes.
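
For context, a classic 2D Helbing-style Social Force Model step (goal attraction plus pairwise repulsion) is sketched below with common default parameters; the paper's 3D-adaptive, proxemics-aware SFM extends this baseline.

```python
import numpy as np

def social_force_step(pos, vel, goals, dt=0.1, v_desired=1.4, tau=0.5,
                      radius=0.3, A=2.0, B=0.3):
    """One explicit-Euler step of a basic Social Force Model for N agents in 2D.

    pos, vel, goals: (N, 2) arrays.  Parameter values are common defaults,
    not the paper's.
    """
    # Driving force toward each agent's goal at the desired speed.
    to_goal = goals - pos
    e = to_goal / (np.linalg.norm(to_goal, axis=1, keepdims=True) + 1e-8)
    force = (v_desired * e - vel) / tau
    # Pairwise exponential repulsion between agents (zeroed on the diagonal).
    diff = pos[:, None, :] - pos[None, :, :]                 # (N, N, 2)
    dist = np.linalg.norm(diff, axis=-1) + 1e-8
    n = diff / dist[..., None]
    rep = A * np.exp((2 * radius - dist) / B)[..., None] * n
    np.fill_diagonal(rep[..., 0], 0.0)
    np.fill_diagonal(rep[..., 1], 0.0)
    force += rep.sum(axis=1)
    # Integrate velocities and positions.
    vel = vel + force * dt
    pos = pos + vel * dt
    return pos, vel
```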

Thu 23 Oct. 17:45 - 19:45 PDT

#23
SG-LDM: Semantic-Guided LiDAR Generation via Latent-Aligned Diffusion

Zhengkang Xiang · Zizhao Li · Amir Khodabandeh · Kourosh Khoshelham

Lidar point cloud synthesis based on generative models offers a promising solution to augment deep learning pipelines, particularly when real-world data is scarce or lacks diversity. By enabling flexible object manipulation, this synthesis approach can significantly enrich training datasets and enhance discriminative models. However, existing methods focus on unconditional lidar point cloud generation, overlooking their potential for real-world applications. In this paper, we propose SG-LDM, a Semantic-Guided Lidar Diffusion Model that employs latent alignment to enable robust semantic-to-lidar synthesis. By directly operating in the native lidar space and leveraging explicit semantic conditioning, SG-LDM achieves state-of-the-art performance in generating high-fidelity lidar point clouds guided by semantic labels. Moreover, we propose the first diffusion-based lidar translation framework based on SG-LDM, which enables cross-domain translation as a domain adaptation strategy to enhance downstream perception performance. Systematic experiments demonstrate that SG-LDM significantly outperforms existing lidar diffusion models and the proposed lidar translation framework further improves data augmentation performance in the lidar segmentation task by addressing the domain gap between the synthetic and real data.

Thu 23 Oct. 17:45 - 19:45 PDT

#24
LookOut: Real-World Humanoid Egocentric Navigation

Boxiao Pan · Adam Harley · Francis Engelmann · Karen Liu · Leonidas Guibas

The ability to predict collision-free future trajectories from egocentric observations is crucial in applications such as humanoid robotics, VR / AR, and assistive navigation. In this work, we introduce the challenging problem of predicting a sequence of future 6D head poses from an egocentric video. In particular, we predict both head translations and rotations to learn the active information-gathering behavior expressed through head-turning events. To solve this task, we propose a framework that reasons over a temporally aggregated 3D latent space, which implicitly models the geometric constraints for both the static and dynamic parts of the environment. Motivated by the lack of training data in this space, we further contribute a data collection pipeline using the Project Aria glasses, and provide a dataset collected through this approach. Our dataset, dubbed Aria Navigation Dataset (AND), consists of 4 hours of recording of users navigating in real-world scenarios. It includes diverse situations and navigation behaviors, providing a valuable resource for learning real-world egocentric navigation policies. Extensive experiments show that our model learns human-like navigation behaviors such as waiting / slowing down, rerouting, and looking around for traffic while generalizing to unseen environments.

Thu 23 Oct. 17:45 - 19:45 PDT

#25
PointGAC: Geometric-Aware Codebook for Masked Point Modeling

Abiao Li · Chenlei Lv · Guofeng Mei · Yifan Zuo · Jian Zhang · Yuming Fang

Most masked point cloud modeling (MPM) methods follow a regression paradigm to reconstruct the coordinates or features of masked regions. However, they tend to over-constrain the model to learn the details of the masked region, resulting in failure to capture generalized features. To address this limitation, we propose \textbf{\textit{PointGAC}}, a novel clustering-based MPM method that aims to align the feature distribution of masked regions. Specifically, it features an online codebook-guided teacher-student framework. First, it presents a geometry-aware partitioning strategy to extract initial patches. Then, the teacher model updates a codebook via online $K$-means based on features extracted from the complete patches. This procedure facilitates codebook vectors becoming cluster centers. Afterward, we assign the unmasked features to their corresponding cluster centers, and the student model aligns the assignment for the reconstructed masked features. This strategy focuses on identifying the cluster centers to which the masked features belong, enabling the model to learn more generalized feature representations. Benefiting from the proposed codebook maintenance mechanism, codebook vectors are actively updated, which further increases the efficiency of semantic feature learning. Experiments validate the effectiveness of the proposed method on various downstream tasks.
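
A minimal sketch of an online (mini-batch) K-means codebook update with exponential moving averages, in the spirit of the codebook maintenance described above; the paper's exact mechanism may differ.

```python
import torch

def online_kmeans_update(features, codebook, ema_counts, ema_sums, decay=0.99, eps=1e-5):
    """Mini-batch K-means codebook update with exponential moving averages.

    features:  (N, D) teacher features from complete patches.
    codebook:  (K, D) current cluster centers.
    ema_counts (K,), ema_sums (K, D): running statistics kept across batches.
    """
    dists = torch.cdist(features, codebook)                      # (N, K)
    assign = dists.argmin(dim=1)                                 # nearest center per feature
    one_hot = torch.nn.functional.one_hot(assign, codebook.shape[0]).to(features.dtype)
    # Update running sufficient statistics, then recompute the centers.
    ema_counts = decay * ema_counts + (1 - decay) * one_hot.sum(dim=0)
    ema_sums = decay * ema_sums + (1 - decay) * (one_hot.t() @ features)
    new_codebook = ema_sums / (ema_counts + eps).unsqueeze(1)
    return new_codebook, ema_counts, ema_sums, assign
```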

Thu 23 Oct. 17:45 - 19:45 PDT

#26
Statistical Confidence Rescoring for Robust 3D Scene Graph Generation from Multi-View Images

Qi Xun Yeo · Yanyan Li · Gim Hee Lee

Modern 3D semantic scene graph estimation methods utilise ground truth 3D annotations to accurately predict target objects, predicates, and relationships. In the absence of given 3D ground truth representations, we explore leveraging only multi-view RGB images to tackle this task. To attain robust features for accurate scene graph estimation, we must overcome the noisy reconstructed pseudo point-based geometry from predicted depth maps and reduce the amount of background noise present in multi-view image features. The key is to enrich node and edge features with accurate semantic and spatial information and with neighbouring relations. We obtain semantic masks to guide feature aggregation to filter background features and design a novel method to incorporate neighbouring node information to aid the robustness of our scene graph estimates. Furthermore, we leverage explicit statistical priors calculated from the training summary statistics to refine node and edge predictions based on their one-hop neighbourhood. Our experiments show that our method outperforms current methods purely using multi-view images as the initial input. Our code will be open-sourced upon paper acceptance.

Thu 23 Oct. 17:45 - 19:45 PDT

#27
Highlight
PRM: Photometric Stereo based Large Reconstruction Model

Wenhang Ge · Jiantao Lin · Guibao SHEN · Jiawei Feng · Tao Hu · Xinli Xu · Ying-Cong Chen

We propose PRM, a novel photometric stereo based large reconstruction model to reconstruct high-quality meshes with fine-grained details. Previous large reconstruction models typically prepare training images under fixed and simple lighting, offering minimal photometric cues for precise reconstruction. Furthermore, images containing specular surfaces are treated as out-of-distribution samples, resulting in degraded reconstruction quality. To handle these challenges, PRM renders photometric stereo images by varying materials and lighting, which not only improves the local details by providing rich photometric cues but also increases the model’s robustness to variations in the appearance of input images. To offer enhanced flexibility, we incorporate a real-time physically-based rendering (PBR) method and mesh rasterization for ground-truth rendering. By using an explicit mesh as 3D representation, PRM ensures the application of differentiable PBR for predicted rendering. This approach models specular color more accurately for photometric stereo images than previous neural rendering methods and supports multiple supervisions for geometry optimization. Extensive experiments demonstrate that PRM significantly outperforms other models.

Thu 23 Oct. 17:45 - 19:45 PDT

#28
4D Gaussian Splatting SLAM

Yanyan Li · Youxu Fang · Zunjie Zhu · Kunyi Li · Yong Ding · Federico Tombari

Simultaneously localizing camera poses and constructing Gaussian radiance fields in dynamic scenes establish a crucial bridge between 2D images and the 4D real world. Instead of removing dynamic objects as distractors and reconstructing only static environments, this paper proposes an efficient architecture that incrementally tracks camera poses and establishes 4D Gaussian radiance fields in unknown scenarios from a sequence of RGB-D images. First, by generating motion masks, we obtain static and dynamic priors for each pixel. To eliminate the influence of static scenes and improve the efficiency of learning the motion of dynamic objects, we classify the Gaussian primitives into static and dynamic Gaussian sets, while sparse control points along with an MLP are utilized to model the transformation fields of the dynamic Gaussians. To more accurately learn the motion of dynamic Gaussians, a novel 2D optical flow map reconstruction algorithm is designed to render optical flows of dynamic objects between neighboring images, which are further used to supervise the 4D Gaussian radiance fields along with traditional photometric and geometric constraints. In experiments, qualitative and quantitative evaluation results show that the proposed method achieves robust tracking and high-quality view synthesis performance in real-world environments.

Thu 23 Oct. 17:45 - 19:45 PDT

#29
BillBoard Splatting (BBSplat): Learnable Textured Primitives for Novel View Synthesis

David Svitov · Pietro Morerio · Lourdes Agapito · ALESSIO DEL BUE

We present BillBoard Splatting (BBSplat), a novel approach for novel view synthesis based on textured geometric primitives. BBSplat represents the scene as a set of optimizable textured planar primitives with learnable RGB textures and alpha maps to control their shape. BBSplat primitives can be used in any Gaussian Splatting pipeline as drop-in replacements for Gaussians. The proposed primitives close the rendering quality gap between 2D and 3D Gaussian Splatting (GS), enabling the accurate extraction of 3D meshes as in the 2DGS framework. Additionally, the explicit nature of planar primitives enables the use of ray-tracing effects in rasterization. Our novel regularization term encourages textures to have a sparser structure, enabling an efficient compression that reduces the storage space of the model by up to $17\times$ compared to 3DGS. Our experiments show the efficiency of BBSplat on standard datasets of real indoor and outdoor scenes such as Tanks\&Temples, DTU, and Mip-NeRF-360. Namely, we achieve a state-of-the-art PSNR of 29.72 for DTU at Full HD resolution.

Thu 23 Oct. 17:45 - 19:45 PDT

#30
Generalizable Non-Line-of-Sight Imaging with Learnable Physical Priors

Shida Sun · Yue Li · Yueyi Zhang · Zhiwei Xiong

Non-line-of-sight (NLOS) imaging, recovering the hidden volume from indirect reflections, has attracted increasing attention due to its potential applications. Despite promising results, existing NLOS reconstruction approaches are constrained by the reliance on empirical physical priors, e.g., single fixed path compensation. Moreover, these approaches still possess limited generalization ability, particularly when dealing with scenes at a low signal-to-noise ratio (SNR). To overcome the above problems, we introduce a novel learning-based approach, comprising two key designs: Learnable Path Compensation (LPC) and Adaptive Phasor Field (APF). The LPC applies tailored path compensation coefficients to adapt to different objects in the scene, effectively reducing light wave attenuation, especially in distant regions. Meanwhile, the APF learns the precise Gaussian window of the illumination function for the phasor field, dynamically selecting the relevant spectrum band of the transient measurement. Experimental validations demonstrate that our proposed approach, only trained on synthetic data, exhibits the capability to seamlessly generalize across various real-world datasets captured by different imaging systems and characterized by low SNRs.

Thu 23 Oct. 17:45 - 19:45 PDT

#31
Hi3DGen: High-fidelity 3D Geometry Generation from Images via Normal Bridging

Chongjie Ye · Yushuang Wu · Ziteng Lu · Jiahao Chang · Xiaoyang Guo · Jiaqing Zhou · Hao Zhao · Xiaoguang Han

With the growing demand for high-fidelity 3D models from 2D images, existing methods still face significant challenges in accurately reproducing fine-grained geometric details due to limitations in domain gaps and inherent ambiguities in RGB images. To address these issues, we propose Hi3DGen, a novel framework for generating high-fidelity 3D geometry from images via normal bridging. Hi3DGen consists of three key components: (1) an image-to-normal estimator that decouples low- and high-frequency image patterns with noise injection and dual-stream training to achieve generalizable, stable, and sharp estimation; (2) a normal-to-geometry learning approach that uses normal-regularized latent diffusion learning to enhance 3D geometry generation fidelity; and (3) a 3D data synthesis pipeline that constructs a high-quality dataset to support training. Extensive experiments demonstrate the effectiveness and superiority of our framework in generating rich geometric details, outperforming state-of-the-art methods in terms of fidelity. Our work provides a new direction for high-fidelity 3D geometry generation from images by leveraging normal maps as an intermediate representation.

Thu 23 Oct. 17:45 - 19:45 PDT

#32
Baking Gaussian Splatting into Diffusion Denoiser for Fast and Scalable Single-stage Image-to-3D Generation and Reconstruction

Yuanhao Cai · He Zhang · Kai Zhang · Yixun Liang · Mengwei Ren · Fujun Luan · Qing Liu · Soo Ye Kim · Jianming Zhang · Zhifei Zhang · Yuqian Zhou · YULUN ZHANG · Xiaokang Yang · Zhe Lin · Alan Yuille

Existing feedforward image-to-3D methods mainly rely on 2D multi-view diffusion models that cannot guarantee 3D consistency. These methods easily collapse when changing the prompt view direction and mainly handle object-centric cases. In this paper, we propose a novel single-stage 3D diffusion model, DiffusionGS, for object generation and scene reconstruction from a single view. DiffusionGS directly outputs 3D Gaussian point clouds at each timestep to enforce view consistency and allow the model to generate robustly given prompt views of any direction, beyond object-centric inputs. To improve the capability and generality of DiffusionGS, we scale up the 3D training data by developing a scene-object mixed training strategy. Experiments show that DiffusionGS yields improvements of 2.20 dB/23.25 and 1.34 dB/19.16 in PSNR/FID for objects and scenes over the state-of-the-art methods, without using a 2D diffusion prior or depth estimator. Moreover, our method enjoys over 5$\times$ faster speed ($\sim$6s on an A100 GPU). Code will be released.

Thu 23 Oct. 17:45 - 19:45 PDT

#33
Discretized Gaussian Representation for Tomographic Reconstruction

Shaokai Wu · Yuxiang Lu · Yapan Guo · Wei Ji · Suizhi Huang · Fengyu Yang · Shalayiding Sirejiding · Qichen He · Jing Tong · Yanbiao Ji · Yue Ding · Hongtao Lu

Computed Tomography (CT) is a widely used imaging technique that provides detailed cross-sectional views of objects. Over the past decade, Deep Learning-based Reconstruction (DLR) methods have led efforts to enhance image quality and reduce noise, yet they often require large amounts of data and are computationally intensive. Inspired by recent advancements in scene reconstruction, some approaches have adapted NeRF and 3D Gaussian Splatting (3DGS) techniques for CT reconstruction. However, these methods are not ideal for direct 3D volume reconstruction. In this paper, we propose a novel Discretized Gaussian Representation (DGR) for CT reconstruction, which directly reconstructs the 3D volume using a set of discretized Gaussian functions in an end-to-end manner. To further enhance computational efficiency, we introduce a Fast Volume Reconstruction technique that aggregates the contributions of these Gaussians into a discretized volume in a highly parallelized fashion. Our extensive experiments on both real-world and synthetic datasets demonstrate that DGR achieves superior reconstruction quality and significantly improved computational efficiency compared to existing DLR and instance reconstruction methods. Our code is available in the supplementary material.

Thu 23 Oct. 17:45 - 19:45 PDT

#34
SuperMat: Physically Consistent PBR Material Estimation at Interactive Rates

Yijia Hong · Yuan-Chen Guo · Ran Yi · Yulong Chen · Yanpei Cao · Lizhuang Ma

Decomposing physically-based materials from images into their constituent properties remains challenging, particularly when maintaining both computational efficiency and physical consistency. While recent diffusion-based approaches have shown promise, they face substantial computational overhead due to multiple denoising steps and separate models for different material properties. We present SuperMat, a single-step framework that achieves high-quality material decomposition with one-step inference. This enables end-to-end training with perceptual and re-render losses while decomposing albedo, metallic, and roughness maps at millisecond-scale speeds. We further extend our framework to 3D objects through a UV refinement network, enabling consistent material estimation across viewpoints while maintaining efficiency. Experiments demonstrate that SuperMat achieves state-of-the-art PBR material decomposition quality while reducing inference time from seconds to milliseconds per image, and completes PBR material estimation for 3D objects in approximately 3 seconds.

Thu 23 Oct. 17:45 - 19:45 PDT

#35
RI3D: Few-Shot Gaussian Splatting With Repair and Inpainting Diffusion Priors

Avinash Paliwal · xilong zhou · Wei Ye · Jinhui Xiong · Rakesh Ranjan · Nima Kalantari

In this paper, we propose RI3D, a novel 3DGS-based approach that harnesses the power of diffusion models to reconstruct high-quality novel views given a sparse set of input images. Our key contribution is separating the view synthesis process into two tasks of reconstructing visible regions and hallucinating missing regions, and introducing two personalized diffusion models, each tailored to one of these tasks. Specifically, one model ('repair') takes a rendered image as input and predicts the corresponding high-quality image, which in turn is used as a pseudo ground truth image to constrain the optimization. The other model ('inpainting') primarily focuses on hallucinating details in unobserved areas. To integrate these models effectively, we introduce a two-stage optimization strategy: the first stage reconstructs visible areas using the repair model, and the second stage reconstructs missing regions with the inpainting model while ensuring coherence through further optimization. Moreover, we augment the optimization with a novel Gaussian initialization method that obtains per-image depth by combining 3D-consistent and smooth depth with highly detailed relative depth. We demonstrate that by separating the process into two tasks and addressing them with the repair and inpainting models, we produce results with detailed textures in both visible and missing regions that outperform state-of-the-art approaches on a diverse set of scenes with extremely sparse inputs.

Thu 23 Oct. 17:45 - 19:45 PDT

#36
Dual-S3D: Hierarchical Dual-Path Selective SSM-CNN for High-Fidelity Implicit Reconstruction

Luoxi Zhang · Pragyan Shrestha · Yu Zhou · Chun Xie · Itaru Kitahara

Single-view 3D reconstruction aims to recover the complete 3D geometry and appearance of objects from a single RGB image and its corresponding camera parameters. Yet, the task remains challenging due to incomplete image information and inherent ambiguity. Existing methods primarily encounter two issues: balancing the extraction of local details against the construction of global topology, and the interference caused by early fusion of RGB and depth features in high-texture regions, which destabilizes SDF optimization. We propose Dual-S3D, a novel single-view 3D reconstruction framework to address these challenges. Our method employs a hierarchical dual-path feature extraction strategy in which early stages utilize CNNs to anchor local geometric details, while subsequent stages leverage a Transformer integrated with selective SSM to capture global topology, enhancing scene understanding and feature representation. Additionally, we design an auxiliary branch that progressively fuses precomputed depth features with pixel-level features to decouple visual and geometric cues effectively. Extensive experiments on the 3D-FRONT and Pix3D datasets demonstrate that our approach significantly outperforms existing methods, reducing Chamfer distance by 51%, increasing F-score by 33.6%, and improving normal consistency by 10.3%, achieving state-of-the-art reconstruction quality.

Thu 23 Oct. 17:45 - 19:45 PDT

#37
FastPoint: Accelerating 3D Point Cloud Model Inference via Sample Point Distance Prediction

Donghyun Lee · Dawoon Jeong · Jae W. Lee · Hongil Yoon

Deep neural networks have revolutionized 3D point cloud processing, yet efficiently handling large and irregular point clouds remains challenging. To tackle this problem, we introduce FastPoint, a novel software-based acceleration technique that leverages the predictable distance trend between sampled points during farthest point sampling. By predicting the distance curve, we can efficiently identify subsequent sample points without exhaustively computing all pairwise distances. Our proposal substantially accelerates farthest point sampling and neighbor search operations while preserving sampling quality and model performance. By integrating FastPoint into state-of-the-art 3D point cloud models, we achieve 2.55x end-to-end speedup on NVIDIA RTX 3090 GPU without sacrificing accuracy.
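
A hedged sketch of the idea: baseline farthest point sampling with an optional early-accept hook driven by a predicted distance curve (`predicted_dist` is a hypothetical input), rather than the FastPoint implementation itself.

```python
import numpy as np

def fps_with_distance_prediction(points, n_samples, predicted_dist=None):
    """Farthest point sampling (FPS) with an optional predicted-distance hook:
    if a predicted distance curve is supplied, the first candidate whose
    min-distance already exceeds the predicted value for this step is accepted
    early, skipping the full argmax scan.

    points: (N, 3); predicted_dist: optional (n_samples,) predicted
    sample-to-set distances.
    """
    selected = [0]
    min_dist = np.linalg.norm(points - points[0], axis=1)
    for i in range(1, n_samples):
        if predicted_dist is not None:
            # Early-accept any candidate that clears the predicted distance.
            candidates = np.nonzero(min_dist >= predicted_dist[i])[0]
            nxt = int(candidates[0]) if len(candidates) else int(min_dist.argmax())
        else:
            nxt = int(min_dist.argmax())
        selected.append(nxt)
        min_dist = np.minimum(min_dist, np.linalg.norm(points - points[nxt], axis=1))
    return np.array(selected)
```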

Thu 23 Oct. 17:45 - 19:45 PDT

#38
Highlight
Self-Calibrating Gaussian Splatting for Large Field-of-View Reconstruction

Youming Deng · Wenqi Xian · Guandao Yang · Leonidas Guibas · Gordon Wetzstein · Steve Marschner · Paul Debevec

Large field-of-view (FOV) cameras can simplify and accelerate scene capture because they provide complete coverage with fewer views. However, existing reconstruction pipelines fail to take full advantage of large-FOV input data because they convert input views to perspective images, resulting in stretching that prevents the use of the full image. Additionally, they calibrate lenses using models that do not accurately fit real fisheye lenses in the periphery. We present a new reconstruction pipeline based on Gaussian Splatting that uses a flexible lens model and supports fields of view approaching 180 degrees. We represent lens distortion with a hybrid neural field based on an Invertible ResNet and use a cubemap to render wide-FOV images while retaining the efficiency of the Gaussian Splatting pipeline. Our system jointly optimizes lens distortion, camera intrinsics, camera poses, and scene representations using a loss measured directly against the original input pixels. We present extensive experiments on both synthetic and real-world scenes, demonstrating that our model accurately fits real-world fisheye lenses and that our end-to-end self-calibration approach provides higher-quality reconstructions than existing methods.

Thu 23 Oct. 17:45 - 19:45 PDT

#39
RobuSTereo: Robust Zero-Shot Stereo Matching under Adverse Weather

Yuran Wang · Yingping Liang · Yutao Hu · Ying Fu

Learning-based stereo matching models struggle in adverse weather conditions due to the scarcity of corresponding training data and the challenges in extracting discriminative features from degraded images. These limitations significantly hinder zero-shot generalization to out-of-distribution weather conditions. In this paper, we propose RobuSTereo, a novel framework that enhances the zero-shot generalization of stereo matching models under adverse weather by addressing both data scarcity and feature extraction challenges. First, we introduce a diffusion-based simulation pipeline with a stereo consistency module, which generates high-quality stereo data tailored for adverse conditions. By training stereo matching models on our synthetic datasets, we reduce the domain gap between clean and degraded images, significantly improving the models' robustness to unseen weather conditions. The stereo consistency module ensures structural alignment across synthesized image pairs, preserving geometric integrity and enhancing depth estimation accuracy. Second, we design a robust feature encoder that combines a specialized ConvNet with a denoising transformer to extract stable and reliable features from degraded images. The ConvNet captures fine-grained local structures, while the denoising transformer refines global representations, effectively mitigating the impact of noise, low visibility, and weather-induced distortions. This enables more accurate disparity estimation even under challenging visual conditions. Extensive experiments demonstrate that RobuSTereo significantly improves the robustness and generalization of stereo matching models across diverse adverse weather scenarios.

Thu 23 Oct. 17:45 - 19:45 PDT

#40
Transformer-based Tooth Alignment Prediction with Occlusion and Collision Constraints

DongZhenXing DongZhenXing · Jiazhou Chen

The planning of digital orthodontic treatment requires providing a tooth alignment, which relies heavily on clinical experience and consumes a great deal of time and labor when determined manually. In this work, we propose an automatic tooth alignment neural network based on the Swin Transformer. We first re-organize 3D point clouds based on dental arch lines and convert them into order-sorted multi-channel textures, improving both accuracy and efficiency. We then design two new orthodontic loss functions that quantitatively evaluate the occlusal relationship between the upper and lower jaws. These are important clinical constraints, introduced here for the first time, and they lead to cutting-edge prediction accuracy. To train our network, we collected a large digital orthodontic dataset over more than two years, including various complex clinical cases. We will release this dataset upon publication of the paper and believe it will benefit the community. Furthermore, we propose two new orthodontic dataset augmentation methods that consider tooth spatial distribution and occlusion. We compare our method with state-of-the-art methods on this dataset, and extensive ablation studies and experiments demonstrate the high accuracy and efficiency of our method.

Thu 23 Oct. 17:45 - 19:45 PDT

#41
Gaussian Splatting with Discretized SDF for Relightable Assets

Zuo-Liang Zhu · jian Yang · Beibei Wang

3D Gaussian splatting (3DGS) has shown detailed expressive ability and highly efficient rendering speed in the novel view synthesis (NVS) task. Its application to inverse rendering still faces several challenges, as the discrete nature of Gaussian primitives makes it difficult to apply geometry constraints. Recent works introduce the signed distance field (SDF) as an extra continuous representation to regularize the geometry defined by the Gaussian primitives. This improves decomposition quality, at the cost of increased memory usage and more complicated training. Unlike these works, we introduce a discretized SDF that represents the continuous SDF in a discrete manner by encoding a sampled value within each Gaussian. This approach allows us to link the SDF with the Gaussian opacity through an SDF-to-opacity transformation, enabling the SDF to be rendered via splatting and avoiding the computational cost of ray marching. The key challenge is to regularize the discrete samples to be consistent with the underlying SDF, as the discrete representation can hardly accommodate gradient-based constraints (e.g., the Eikonal loss). For this, we project Gaussians onto the zero-level set of the SDF and enforce alignment with the surface obtained from splatting, via a projection-based consistency loss. Thanks to the discretized SDF, our method achieves higher relighting quality while requiring no extra memory beyond 3DGS and avoiding complex, manually designed optimization. The experiments reveal that our method outperforms existing Gaussian-based inverse rendering methods. We will release the code upon acceptance.
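
A rough illustration of the SDF-to-opacity idea (the exponential form and the beta value below are assumptions for this sketch, not the paper's exact transform): each Gaussian's opacity is derived from its sampled SDF value so that opacity concentrates near the zero-level set and the surface can be rendered by ordinary splatting.

```python
import torch

def sdf_to_opacity(sdf: torch.Tensor, beta: float = 50.0) -> torch.Tensor:
    """Map per-Gaussian SDF samples to opacities in (0, 1]; a larger beta concentrates
    opacity more tightly around the zero-level set (i.e., the surface)."""
    return torch.exp(-beta * sdf.abs())
```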

Thu 23 Oct. 17:45 - 19:45 PDT

#42
MMGeo: Multimodal Compositional Geo-Localization for UAVs

Yuxiang Ji · Boyong He · Zhuoyue Tan · Liaoni Wu

Multimodal geo-localization methods can inherently overcome the limitations of unimodal sensor systems by leveraging complementary information from different modalities. However, existing retrieval-based methods rely on a comprehensive multimodal database, which is often challenging to fulfill in practice. In this paper, we introduce a more practical problem: localizing drone-view images by combining multimodal data against a satellite-view reference map, which integrates multimodal information while avoiding the need for an extensive multimodal database. We present \textsc{MMGeo}, which learns to push the composition of multimodal representations to the target reference map through a unified framework. By utilizing a comprehensive multimodal query (image, point cloud/depth/text), we can achieve more robust and accurate geo-localization, especially in unknown and complex environments. Additionally, we extend two visual geo-localization datasets, GTA-UAV and UAV-VisLoc, to multiple modalities, establishing the first UAV geo-localization datasets that combine image, point cloud, depth and text data. Experiments demonstrate the effectiveness of \textsc{MMGeo} for UAV multimodal compositional geo-localization, as well as its generalization capabilities to real-world scenarios.

Thu 23 Oct. 17:45 - 19:45 PDT

#43
AdaptiveAE: An Adaptive Exposure Strategy for HDR Capturing in Dynamic Scenes

Tianyi Xu · Fan Zhang · Boxin Shi · Tianfan Xue · Yujin Wang

Mainstream high dynamic range (HDR) imaging techniques typically rely on fusing multiple images captured with different exposure setups (shutter speed and ISO). A good balance between shutter speed and ISO is critical for high-quality HDR, as a high ISO introduces significant noise, whereas a long shutter speed may lead to noticeable motion blur. However, existing methods often overlook the complex interaction between shutter speed and ISO and fail to account for motion blur effects in dynamic scenes. In this work, we propose AdaptiveAE, a reinforcement learning-based method that optimizes the selection of shutter speed and ISO combinations to maximize HDR reconstruction quality in dynamic environments. AdaptiveAE integrates an image synthesis pipeline that incorporates motion blur and noise simulation into our training procedure, leveraging semantic information and the exposure histogram. It can adaptively select optimal ISO and shutter speed sequences based on a user-defined exposure time budget, finding a better exposure schedule than traditional fixed-exposure solutions. Experimental results across multiple datasets demonstrate that AdaptiveAE achieves state-of-the-art performance.

Thu 23 Oct. 17:45 - 19:45 PDT

#44
Large Scene Generation with Cube-Absorb Discrete Diffusion

Qianjiang Hu · Wei Hu

Generating realistic 3D outdoor scenes is essential for applications in autonomous driving, virtual reality, environmental science, and urban development. Traditional 3D generation approaches using single-layer diffusion methods can produce detailed scenes for individual objects but struggle with high-resolution, large-scale outdoor environments due to scalability limitations. Recent hierarchical diffusion models tackle this by progressively scaling up low-resolution scenes. However, they often sample fine details from total noise rather than from the coarse scene, which limits the efficiency. We propose a novel cube-absorb discrete diffusion (CADD) model, which deploys low-resolution scenes as the base state in the diffusion process to generate fine details, eliminating the need to sample entirely from noise. Moreover, we introduce the Sparse Cube Diffusion Transformer (SCDT), a transformer-based model with a sparse cube attention operator, optimized for generating large-scale sparse voxel scenes. Our method demonstrates state-of-the-art performance on the CarlaSC and KITTI360 datasets, supported by qualitative visualizations and extensive ablation studies that highlight the impact of the CADD process and sparse block attention operator on high-resolution 3D scene generation.

Thu 23 Oct. 17:45 - 19:45 PDT

#45
SynAD: Enhancing Real-World End-to-End Autonomous Driving Models through Synthetic Data Integration

Jongsuk Kim · Jae Young Lee · Gyojin Han · Dong-Jae Lee · Minki Jeong · Junmo Kim

Recent advancements in deep learning and the availability of high-quality real-world driving datasets have propelled end-to-end autonomous driving (E2E AD). Despite this progress, relying solely on real-world data limits the variety of driving scenarios for training. Synthetic scenario generation has emerged as a promising solution to enrich the diversity of training data; however, its application within E2E AD models remains largely unexplored. This is primarily due to the absence of a designated ego vehicle and the associated sensor inputs, such as camera or LiDAR, typically provided in real-world scenarios. To address this gap, we introduce SynAD, the first framework designed to enhance real-world E2E AD models using synthetic data. Our method designates the agent with the most comprehensive driving information as the ego vehicle in a multi-agent synthetic scenario. We further project path-level scenarios onto maps and employ a newly developed Map-to-BEV Network to derive bird's-eye-view features without relying on sensor inputs. Finally, we devise a training strategy that effectively integrates these map-based synthetic data with real driving data. Experimental results demonstrate that SynAD effectively integrates all components and notably enhances safety performance. By bridging synthetic scenario generation and E2E AD, SynAD paves the way for more comprehensive and robust autonomous driving models.

Thu 23 Oct. 17:45 - 19:45 PDT

#46
Highlight
Benchmarking Egocentric Visual-Inertial SLAM at City Scale

Anusha Krishnan · Shaohui Liu · Paul-Edouard Sarlin · Oscar Gentilhomme · David Caruso · Maurizio Monge · Richard Newcombe · Jakob Engel · Marc Pollefeys

Precise 6-DoF simultaneous localization and mapping (SLAM) from onboard sensors is critical for wearable devices capturing egocentric data, which exhibits specific challenges, such as a wider diversity of motions and viewpoints, prevalent dynamic visual content, or long sessions affected by time-varying sensor calibration. While recent progress on SLAM has been swift, academic research is still driven by benchmarks that do not reflect these challenges or do not offer sufficiently accurate ground truth poses. In this paper, we introduce a new dataset and benchmark for visual-inertial SLAM with egocentric, multi-modal data. We record hours and kilometers of trajectories through a city center with glasses-like devices equipped with various sensors. We leverage surveying tools to obtain control points as indirect pose annotations that are metric, centimeter-accurate, and available at city scale. This makes it possible to evaluate extreme trajectories that involve walking at night or traveling in a vehicle. We show that state-of-the-art systems developed by academia are not robust to these challenges and we identify components that are responsible for this. In addition, we design tracks with different levels of difficulty to ease in-depth analysis and evaluation of less mature approaches. The dataset and benchmark will be made publicly available.

Thu 23 Oct. 17:45 - 19:45 PDT

#47
Robust Unfolding Network for HDR Imaging with Modulo Cameras

Zhile Chen · Hui Ji

High Dynamic Range (HDR) imaging with modulo cameras involves solving a challenging inverse problem, where degradation occurs due to the modulo operation applied to the target HDR image. Existing methods operate directly in the image domain, overlooking the underlying properties of the modulo operation. Motivated by Itoh's continuity condition in optics, we reformulate modulo HDR reconstruction in the image gradient domain, leveraging the inherent properties of modulo-wrapped gradients to simplify the problem. Furthermore, to address possible ambiguities on large image gradients, we introduce an auxiliary variable with a learnable sparsity prior in an optimization formulation to absorb the related residuals. This is implemented within an unfolding network, where sparsity is enforced through a spiking neuron-based module. Experiments show that our method outperforms existing approaches while being among the most lightweight of existing models.
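
For intuition on the gradient-domain reformulation, the classic 1D analogue of Itoh's condition is sketched below (an illustration only, not the proposed unfolding network): if the true gradient magnitude stays below half the modulo range, it equals the re-wrapped gradient of the wrapped signal, so the unwrapped signal follows by cumulative summation.

```python
import numpy as np

def unwrap_1d(wrapped: np.ndarray, modulo: float) -> np.ndarray:
    """Recover a 1D signal from modulo-wrapped samples under Itoh's condition,
    i.e., |true gradient| < modulo / 2 everywhere."""
    g = np.diff(wrapped)
    g_true = (g + modulo / 2) % modulo - modulo / 2   # re-wrap gradients into [-modulo/2, modulo/2)
    return np.concatenate(([wrapped[0]], wrapped[0] + np.cumsum(g_true)))
```

The paper's auxiliary variable and learned sparsity prior exist precisely to handle the residuals where this condition fails on large image gradients.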

Thu 23 Oct. 17:45 - 19:45 PDT

#48
Neural Shell Texture Splatting: More Details and Fewer Primitives

Xin Zhang · Anpei Chen · Jincheng Xiong · Pinxuan Dai · Yujun Shen · Weiwei Xu

Gaussian splatting techniques have shown promising results in novel view synthesis, achieving high fidelity and efficiency. However, their high reconstruction quality comes at the cost of requiring a large number of primitives. We identify this issue as stemming from the entanglement of geometry and appearance in Gaussian Splatting. To address this, we introduce a neural shell texture, a global representation that encodes texture information around the surface. We use Gaussian primitives as both a geometric representation and texture field samplers, efficiently splatting texture features into image space. Our evaluation demonstrates that this disentanglement enables high parameter efficiency, fine texture detail reconstruction, and easy textured mesh extraction, all while using significantly fewer primitives.

In autonomous driving, accurately predicting occupancy and motion is crucial for safe navigation within dynamic environments. However, existing methods often suffer from difficulties in handling complex scenes and uncertainty arising from sensor data. To address these issues, we propose a new Gaussian-based World Model (GWM), seamlessly integrating raw multi-modal sensor inputs. In the first stage, a Gaussian representation learner utilizes self-supervised pretraining to learn a robust Gaussian representation. This representation integrates semantic and geometric information and establishes a robust probabilistic understanding of the environment. In the second stage, GWM seamlessly integrates learning, simulation, and planning into a unified framework, empowering the uncertainty-aware simulator and planner to jointly forecast future scene evolutions and vehicle trajectories. The simulator generates future scene predictions by modeling both static and dynamic elements, while the planner computes optimal paths to minimize collision risks, thus enhancing navigation safety. Overall, GWM employs a sensor-to-planning world model that directly processes raw sensor data, setting it apart from previous methods. Experiments show that GWM outperforms state-of-the-art approaches by 16.8% in semantic comprehension and 5.8% in motion prediction. Moreover, we provide an in-depth analysis of Gaussian representations under complex scenarios. Our code will be released.

Thu 23 Oct. 17:45 - 19:45 PDT

#50
Momentum-GS: Momentum Gaussian Self-Distillation for High-Quality Large Scene Reconstruction

JIXUAN FAN · Wanhua Li · Yifei Han · Tianru Dai · Yansong Tang

3D Gaussian Splatting has demonstrated notable success in large-scale scene reconstruction, but challenges persist due to high training memory consumption and storage overhead. Hybrid representations that integrate implicit and explicit features offer a way to mitigate these limitations. However, when applied in parallelized block-wise training, two critical issues arise: reconstruction accuracy deteriorates due to reduced data diversity when each block is trained independently, and parallel training restricts the number of divided blocks to the number of available GPUs. To address these issues, we propose Momentum-GS, a novel approach that leverages momentum-based self-distillation to promote consistency and accuracy across the blocks while decoupling the number of blocks from the physical GPU count. Our method maintains a teacher Gaussian decoder updated with momentum, ensuring a stable reference during training. This teacher provides each block with global guidance in a self-distillation manner, promoting spatial consistency in reconstruction. To further ensure consistency across the blocks, we incorporate block weighting, dynamically adjusting each block's weight according to its reconstruction accuracy. Extensive experiments on large-scale scenes show that our method consistently outperforms existing techniques, achieving an 18.7% improvement in LPIPS over CityGaussian with far fewer divided blocks and establishing a new state of the art.
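
The momentum-updated teacher at the core of this kind of self-distillation is typically an exponential moving average (EMA) of the student; below is a minimal sketch with generic modules and an illustrative momentum value (the paper's teacher is specifically a Gaussian decoder).

```python
import torch

@torch.no_grad()
def momentum_update(teacher: torch.nn.Module, student: torch.nn.Module, m: float = 0.999) -> None:
    """EMA update: the teacher drifts slowly toward the student, providing each
    block a stable global reference during parallel block-wise training."""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.data.mul_(m).add_(p_s.data, alpha=1.0 - m)
```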

Thu 23 Oct. 17:45 - 19:45 PDT

#51
JointDiT: Enhancing RGB-Depth Joint Modeling with Diffusion Transformers

Kwon Byung-Ki · Qi Dai · Lee Hyoseok · Chong Luo · Tae-Hyun Oh

We present JointDiT, a diffusion transformer that models the joint distribution of RGB and depth. By leveraging the architectural benefit and outstanding image prior of the state-of-the-art diffusion transformer, JointDiT not only generates high-fidelity images but also produces geometrically plausible and accurate depth maps. This solid joint distribution modeling is achieved through two simple yet effective techniques that we propose, i.e., adaptive scheduling weights, which depend on the noise levels of each modality, and the unbalanced timestep sampling strategy. With these techniques, we train our model across all noise levels for each modality, enabling JointDiT to naturally handle various combinatorial generation tasks, including joint generation, depth estimation, and depth-conditioned image generation by simply controlling the timestep of each branch. JointDiT demonstrates outstanding joint generation performance. Furthermore, it achieves comparable results in depth estimation and depth-conditioned image generation, suggesting that joint distribution modeling can serve as a replaceable alternative to conditional generation.

Thu 23 Oct. 17:45 - 19:45 PDT

#52
A Real-world Display Inverse Rendering Dataset

Seokjun Choi · Hoon-Gyu Chung · Yujin Jeon · Giljoo Nam · Seung-Hwan Baek

Inverse rendering aims to reconstruct geometry and reflectance from captured images. Display-camera imaging systems offer unique advantages for this task: each pixel can easily function as a programmable point light source, and the polarized light emitted by LCD displays facilitates diffuse-specular separation. Despite these benefits, there is currently no public real-world dataset captured using display-camera systems, unlike other setups such as light stages. This absence hinders the development and evaluation of display-based inverse rendering methods. In this paper, we introduce the first real-world dataset for display-based inverse rendering. To achieve this, we construct and calibrate an imaging system comprising an LCD display and stereo polarization cameras. We then capture a set of objects with diverse geometry and reflectance under one-light-at-a-time (OLAT) display patterns. We also provide high-quality ground-truth geometry. Our dataset enables the synthesis of captured images under arbitrary display patterns and different noise levels. Using this dataset, we evaluate the performance of existing photometric stereo and inverse rendering methods, and provide a simple yet effective baseline for display inverse rendering that outperforms state-of-the-art inverse rendering methods. The dataset and code will be publicly available.

Thu 23 Oct. 17:45 - 19:45 PDT

#53
DAA*: Deep Angular A Star for Image-based Path Planning

Zhiwei Xu

Path smoothness is often overlooked in path imitation learning from expert demonstrations. In this paper, we introduce a novel learning method, termed deep angular A$^\ast$ (DAA$^\ast$), by incorporating the proposed path angular freedom (PAF) into A$^\ast$ to improve path similarity through adaptive path smoothness. The PAF aims to explore the effect of move angles on path node expansion by finding the trade-off between their minimum and maximum values, allowing for high adaptiveness for imitation learning. DAA$^\ast$ improves path optimality by closely aligning with the reference path through joint optimization of path shortening and smoothing, which correspond to heuristic distance and PAF, respectively. Through comprehensive evaluations on 7 datasets, including 4 maze datasets, 2 video-game datasets, and a real-world drone-view dataset containing 2 scenarios, we demonstrate remarkable improvements of our DAA$^\ast$ over neural A$^\ast$ in path similarity between the predicted and reference paths, with a shorter path length when the shortest path is plausible, improving by **9.0\% SPR**, **6.9\% ASIM**, and **3.9\% PSIM**. Furthermore, when jointly learning pathfinding with both path loss and path probability map loss, DAA$^\ast$ significantly outperforms the state-of-the-art TransPath by **6.7\% SPR**, **6.5\% PSIM**, and **3.7\% ASIM**. We also discuss the minor trade-off between path optimality and search efficiency where applicable.
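
To make the length-versus-smoothness trade-off concrete, here is a small grid A* whose edge cost adds a turning-angle penalty; the 8-connected grid, the Euclidean heuristic, and the weight w_angle are illustrative choices rather than the paper's PAF formulation.

```python
import heapq
import itertools
import math

def angular_a_star(grid, start, goal, w_angle=0.5):
    """A* on an 8-connected grid where edge cost = step length + w_angle * turning angle.
    grid: 2D array of 0 (free) / 1 (blocked); start, goal: (row, col) tuples."""
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1), (-1, -1), (-1, 1), (1, -1), (1, 1)]
    h = lambda p: math.hypot(p[0] - goal[0], p[1] - goal[1])   # admissible Euclidean heuristic
    tie = itertools.count()                                    # unique tie-breaker for the heap
    open_set = [(h(start), 0.0, next(tie), start, None, [start])]
    best = {}
    while open_set:
        _, g, _, cur, d_in, path = heapq.heappop(open_set)
        if cur == goal:
            return path
        if best.get((cur, d_in), float("inf")) <= g:
            continue
        best[(cur, d_in)] = g
        for d in moves:
            nxt = (cur[0] + d[0], cur[1] + d[1])
            if not (0 <= nxt[0] < len(grid) and 0 <= nxt[1] < len(grid[0])):
                continue
            if grid[nxt[0]][nxt[1]]:
                continue
            step = math.hypot(*d)
            turn = 0.0
            if d_in is not None:                               # penalize changes of heading
                cos = (d[0] * d_in[0] + d[1] * d_in[1]) / (step * math.hypot(*d_in))
                turn = math.acos(max(-1.0, min(1.0, cos)))
            g_new = g + step + w_angle * turn
            heapq.heappush(open_set, (g_new + h(nxt), g_new, next(tie), nxt, d, path + [nxt]))
    return None
```

Raising w_angle trades a slightly longer path for fewer sharp turns, which is the behaviour the PAF term is designed to tune adaptively.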

Thu 23 Oct. 17:45 - 19:45 PDT

#54
Neural Compression for 3D Geometry Sets

Siyu Ren · Junhui Hou · Weiyao Lin · Wenping Wang

We present NeCGS, the first neural compression paradigm, which can compress a geometry set encompassing thousands of detailed and diverse 3D mesh models by up to 900 times with high accuracy and preservation of detailed geometric structures. Specifically, we first propose TSDF-Def, a new implicit representation that is capable of accurately representing irregular 3D mesh models with various structures into regular 4D tensors of uniform and compact size, where 3D surfaces can be extracted through the deformable marching cubes. Then we construct a quantization-aware auto-decoder network architecture to regress these 4D tensors to explore the local geometric similarity within each shape and across different shapes for redundancy removal, resulting in more compact representations, including an embedded feature of a smaller size associated with each 3D model and a network parameter shared by all models. We finally encode the resulting features and network parameters into bitstreams through entropy coding. Besides, our NeCGS can handle the dynamic scenario well, where new 3D models are constantly added to a compressed set. Extensive experiments and ablation studies demonstrate the significant advantages of our NeCGS over state-of-the-art methods both quantitatively and qualitatively. We have included the source code in the Supplemental Material.

Thu 23 Oct. 17:45 - 19:45 PDT

#55
Long-term Traffic Simulation with Interleaved Autoregressive Motion and Scenario Generation

Xiuyu Yang · Shuhan Tan · Philipp Kraehenbuehl

An ideal traffic simulator replicates the realistic long-term point-to-point trip that a self-driving system experiences during deployment. Prior models and benchmarks focus on closed-loop motion simulation for initial agents in a scene. This is problematic for long-term simulation. Agents enter and exit the scene as the ego vehicle enters new regions. We propose InfGen, a unified next-token prediction model that performs interleaved closed-loop motion simulation and scene generation. InfGen automatically switches between closed-loop motion simulation and scene generation mode. It enables stable long-term rollout simulation. InfGen performs at the state-of-the-art in short-term (9s) traffic simulation, and significantly outperforms all other methods in long-term (30s) simulation. The code and model of InfGen will be released upon acceptance.

Thu 23 Oct. 17:45 - 19:45 PDT

#56
RCTDistill: Cross-Modal Knowledge Distillation Framework for Radar-Camera 3D Object Detection with Temporal Fusion

Geonho Bang · Minjae Seong · Jisong Kim · Geunju Baek · Daye Oh · Junhyung Kim · Junho Koh · Jun Won Choi

Radar-camera fusion methods have emerged as a cost-effective approach for 3D object detection but still lag behind LiDAR-based methods in performance. Recent works have focused on employing temporal fusion and Knowledge Distillation (KD) strategies to overcome these limitations. However, existing approaches have not sufficiently accounted for uncertainties arising from object motion or sensor-specific errors inherent in radar and camera modalities. In this work, we propose RCTDistill, a novel cross-modal KD method based on temporal fusion, comprising three key modules: Range-Azimuth Knowledge Distillation (RAKD), Temporal Knowledge Distillation (TKD), and Region-Decoupled Knowledge Distillation (RDKD). RAKD is designed to consider the inherent errors in the range and azimuth directions, enabling effective knowledge transfer from LiDAR features to refine inaccurate BEV representations. TKD mitigates temporal misalignment caused by dynamic objects by aligning historical radar-camera BEV features with LiDAR representations. RDKD enhances feature discrimination by distilling relational knowledge from the teacher model, allowing the student to better understand and differentiate foreground and background features. RCTDistill achieves state-of-the-art radar–camera fusion performance on both the nuScenes and View-of-Delft (VoD) datasets, with the fastest inference speed of 26.2 FPS.

Thu 23 Oct. 17:45 - 19:45 PDT

#57
MagicCity: Geometry-Aware 3D City Generation from Satellite Imagery with Multi-View Consistency

Xingbo YAO · xuanmin Wang · Hao WU · Chengliang PING · ZHANG Doudou · Hui Xiong

Directly generating 3D cities from satellite imagery opens up new possibilities for gaming and mapping services. However, this task remains challenging due to the limited information in satellite views, making it difficult for existing methods to achieve both photorealistic textures and geometric accuracy. To address these challenges, we propose MagicCity, a novel large-scale generative model for photorealistic 3D city generation with geometric consistency. Given a satellite image, our framework first extracts 3D geometric information and encodes it alongside textural features using a dual encoder. These features then guide a multi-branch diffusion model to generate city-scale, geometrically consistent multi-view images. To further enhance texture consistency across different viewpoints, we propose an Inter-Frame Cross Attention mechanism that enables feature sharing across different frames. Additionally, we incorporate a Hierarchical Geometric-Aware Module and a Consistency Evaluator to improve overall scene consistency. Finally, the generated images are fed into our robust 3D reconstruction pipeline to produce 3D cities of high visual quality and geometric consistency. Moreover, we contribute CityVista, a high-quality dataset comprising 500 3D city scenes along with corresponding multi-view images and satellite imagery to advance research in 3D city generation. Experimental results demonstrate that MagicCity surpasses state-of-the-art methods in both geometric consistency and visual quality.

Thu 23 Oct. 17:45 - 19:45 PDT

#58
GCRayDiffusion: Pose-Free Surface Reconstruction via Geometric Consistent Ray Diffusion

Li-Heng Chen · Zi-Xin Zou · Chang Liu · Tianjiao Jing · Yanpei Cao · Shi-Sheng Huang · Hongbo Fu · Hua Huang

Accurate surface reconstruction from unposed images is crucial for efficient 3D object or scene creation. However, it remains challenging, particularly for the joint camera pose estimation. Previous approaches have achieved impressive pose-free surface reconstruction results in dense-view settings, but could easily fail for sparse-view scenarios without sufficient visual overlap. In this paper, we propose a new technique for pose-free surface reconstruction, which follows triplane-based signed distance field (SDF) learning but regularizes the learning by explicit points sampled from ray-based diffusion of camera pose estimation. Our key contribution is a novel Geometric Consistent Ray Diffusion model (GCRayDiffusion), where we represent camera poses as neural bundle rays and regress the distribution of noisy rays via a diffusion model. More importantly, we further condition the denoising process of GCRayDiffusion using the triplane-based SDF of the entire scene, which provides effective 3D consistent regularization to achieve multi-view consistent camera pose estimation. Finally, we incorporate GCRayDiffusion into the triplane-based SDF learning by introducing on-surface geometric regularization from the sampling points of the neural bundle rays, which leads to highly accurate pose-free surface reconstruction results even for sparse-view inputs. Extensive evaluations on public datasets show that our GCRayDiffusion achieves more accurate camera pose estimation than previous approaches, with geometrically more consistent surface reconstruction results, especially given sparse-view inputs.

Thu 23 Oct. 17:45 - 19:45 PDT

#59
GSRecon: Efficient Generalizable Gaussian Splatting for Surface Reconstruction from Sparse Views

Hang Yang · Le Hui · Jianjun Qian · Jin Xie · Jian Yang

Generalizable surface reconstruction aims to recover the scene surface from a sparse set of images in a feed-forward manner. Existing neural implicit representation-based methods evaluate numerous points along camera rays to infer the geometry, resulting in inefficient reconstruction. Recently, 3D Gaussian Splatting offers an alternative efficient scene representation and has inspired a series of surface reconstruction methods. However, these methods require dense views and cannot generalize to new scenes. In this paper, we propose a novel surface reconstruction method with Gaussian splatting, named GSRecon, which leverages the advantages of rasterization-based rendering to achieve efficient reconstruction. To obtain an accurate geometry representation, we propose a geometry-aware cross-view enhancement module to improve the unreliable geometry estimation in the current view by incorporating accurate geometric information from other views. To generate the fine-grained Gaussian primitives, we propose a hybrid cross-view feature aggregation module that integrates an efficient voxel branch and a fine-grained point branch to jointly capture cross-view geometric information. Extensive experiments on the DTU, BlendedMVS, and Tanks and Temples datasets validate that GSRecon achieves state-of-the-art performance and efficient reconstruction speed.

Thu 23 Oct. 17:45 - 19:45 PDT

#60
GeoDistill: Geometry-Guided Self-Distillation for Weakly Supervised Cross-View Localization

Shaowen Tong · Zimin Xia · Alexandre Alahi · Xuming He · Yujiao Shi

Cross-view localization, the task of estimating a camera's 3-degrees-of-freedom (3-DoF) pose by aligning ground-level images with satellite images, is crucial for large-scale outdoor applications like autonomous navigation and augmented reality. Existing methods often rely on fully supervised learning, which requires costly ground-truth pose annotations. In this work, we propose GeoDistill, a weakly supervised self-distillation framework that uses teacher-student learning with Field-of-View (FoV)-based masking to enhance local feature learning for robust cross-view localization. In GeoDistill, the teacher model localizes a panoramic image, while the student model predicts locations from a limited FoV counterpart created by FoV-based masking. By aligning the student's predictions with those of the teacher, the student focuses on key features like lane lines and ignores textureless regions, such as roads. This results in more accurate predictions and reduced uncertainty, regardless of whether the query images are panoramas or limited FoV images. Our experiments show that GeoDistill significantly improves localization performance across different frameworks. Additionally, we introduce a novel orientation estimation network that predicts relative orientation without requiring precise planar position ground truth. GeoDistill provides a scalable and efficient solution for real-world cross-view localization challenges.
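
A minimal sketch of FoV-based masking on an equirectangular panorama (assuming the image width spans 360 degrees of yaw; the window parameters are illustrative), which is one simple way to carve a limited-FoV student input out of the teacher's panorama:

```python
import numpy as np

def fov_mask(panorama: np.ndarray, center_deg: float, fov_deg: float) -> np.ndarray:
    """Zero out all columns outside a horizontal FoV window centered at center_deg."""
    H, W = panorama.shape[:2]
    yaw = (np.arange(W) / W) * 360.0                                # per-column yaw in degrees
    delta = np.abs((yaw - center_deg + 180.0) % 360.0 - 180.0)      # wrap-around angular distance
    masked = panorama.copy()
    masked[:, delta > fov_deg / 2.0] = 0
    return masked
```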

Thu 23 Oct. 17:45 - 19:45 PDT

#61
REPARO: Compositional 3D Assets Generation with Differentiable 3D Layout Alignment

Haonan Han · Rui Yang · Huan Liao · Haonan Han · Zunnan Xu · Xiaoming Yu · Junwei Zha · Xiu Li · Wanhua Li

Traditional image-to-3D models often struggle with scenes containing multiple objects due to biases and occlusion complexities. To address this challenge, we present REPARO, a novel approach for compositional 3D asset generation from single images. REPARO employs a two-step process: first, it extracts individual objects from the scene and reconstructs their 3D meshes using off-the-shelf image-to-3D models; then, it optimizes the layout of these meshes through differentiable rendering techniques, ensuring coherent scene composition. By integrating an optimal transport-based long-range appearance loss term and a high-level semantic loss term in the differentiable rendering, REPARO can effectively recover the layout of 3D assets. The proposed method significantly enhances object independence, detail accuracy, and overall scene coherence. Extensive evaluation of multi-object scenes demonstrates that REPARO offers a comprehensive approach to addressing the complexities of multi-object 3D scene generation from single images.

Thu 23 Oct. 17:45 - 19:45 PDT

#62
Towards Safer and Understandable Driver Intention Prediction

Mukilan Karuppasamy · Shankar Gangisetty · Shyam Nandan Rai · Carlo Masone · C.V. Jawahar

Autonomous driving (AD) systems are becoming increasingly capable of handling complex tasks, largely due to recent advances in deep learning and AI. As the interactions between autonomous systems and humans grow, the interpretability of driving system decision-making processes becomes crucial for safe driving. Successful human-machine interaction requires understanding the underlying representations of the environment and the driving task, which remains a significant challenge in deep learning-based systems. To address this, we introduce the task of interpretable maneuver prediction before maneuvers occur for driver safety, i.e., driver intent prediction (DIP), which plays a critical role in AD systems. To foster research in interpretable DIP, we curate the eXplainable Driving Action Anticipation Dataset (DAAD-X), a new multimodal, ego-centric video dataset that provides hierarchical, high-level textual explanations as causal reasoning for the driver's decisions. These explanations are derived from both the driver's eye gaze and the ego vehicle's perspective. Next, we propose the Video Concept Bottleneck Model (VCBM), a framework that generates spatio-temporally coherent explanations inherently, without relying on post-hoc techniques. Finally, through extensive evaluations of the proposed VCBM on the DAAD-X dataset, we demonstrate that transformer-based models exhibit greater interpretability compared to conventional CNN-based models. Additionally, we introduce a multilabel t-SNE visualization technique to illustrate the disentanglement and causal correlation among multiple explanations. The dataset and code will be released upon acceptance.

Thu 23 Oct. 17:45 - 19:45 PDT

#63
Revisiting Point Cloud Completion: Are We Ready For The Real-World?

Stuti Pathak · Prashant Kumar · Dheeraj Baiju · Nicholus Mboga · Gunther Steenackers · Rudi Penne

Point clouds acquired in constrained, challenging, uncontrolled, and multi-sensor real-world settings are noisy, incomplete, and non-uniformly sparse. This presents acute challenges for the vital task of point cloud completion. Using tools from algebraic topology and persistent homology ($\mathcal{PH}$), we demonstrate that current benchmark object point clouds lack the rich topological features that are an integral part of point clouds captured in realistic environments. To facilitate research in this direction, we contribute the first real-world industrial dataset for point cloud completion, RealPC - a diverse, rich and varied set of point clouds. It consists of $\sim$ 40,000 pairs across 21 categories of industrial structures in railway establishments. Benchmark results on several strong baselines reveal that existing methods fail in real-world scenarios. We make a striking observation: unlike current datasets, RealPC exhibits multiple 0- and 1-dimensional $\mathcal{PH}$-based topological features. We show that integrating these topological priors into existing works helps improve completion. We present how 0-dimensional $\mathcal{PH}$ priors extract the global topology of a complete shape in the form of a 3D skeleton and assist a model in generating topologically consistent complete shapes. Since computing homology is expensive, we present a simple yet effective Homology Sampler guided network, BOSHNet, that bypasses the homology computation by sampling proxy backbones akin to 0-dim $\mathcal{PH}$. These backbones provide similar benefits to 0-dim $\mathcal{PH}$ right from the start of training, unlike similar methods where accurate backbones are obtained only during later phases of training.
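
For readers unfamiliar with 0-dimensional persistent homology: under the Vietoris-Rips filtration, every 0-dim feature of a point cloud is born at 0 and dies when its connected component merges, and these death times are exactly the edge lengths of the Euclidean minimum spanning tree. A small sketch follows (dense distance matrix, so only suitable for modest point counts; this is not the paper's BOSHNet sampler).

```python
import numpy as np
from scipy.spatial import distance_matrix
from scipy.sparse.csgraph import minimum_spanning_tree

def zero_dim_persistence(points: np.ndarray) -> np.ndarray:
    """Death times of the 0-dimensional persistence diagram of a point cloud
    (all births are 0), i.e., the sorted edge lengths of its Euclidean MST."""
    mst = minimum_spanning_tree(distance_matrix(points, points))
    return np.sort(mst.data)
```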

Thu 23 Oct. 17:45 - 19:45 PDT

#64
V2XPnP: Vehicle-to-Everything Spatio-Temporal Fusion for Multi-Agent Perception and Prediction

Zewei Zhou · Hao Xiang · Zhaoliang Zheng · Zhihao Zhao · Mingyue Lei · Yun Zhang · Tianhui Cai · Xinyi Liu · Johnson Liu · Maheswari Bajji · Xin Xia · Zhiyu Huang · Bolei Zhou · Jiaqi Ma

Vehicle-to-everything (V2X) technologies offer a promising paradigm to mitigate the limitations of constrained observability in single-vehicle systems. Prior work primarily focuses on single-frame cooperative perception, which fuses agents' information across different spatial locations but ignores temporal cues and temporal tasks (e.g., temporal perception and prediction). In this paper, we focus on the spatio-temporal fusion in V2X scenarios and design one-step and multi-step communication strategies (when to transmit) as well as examine their integration with three fusion strategies - early, late, and intermediate (what to transmit), providing comprehensive benchmarks with 11 fusion models (how to fuse). Furthermore, we propose V2XPnP, a novel intermediate fusion framework within one-step communication for end-to-end perception and prediction. Our framework employs a unified Transformer-based architecture to effectively model complex spatio-temporal relationships across multiple agents, frames, and high-definition maps. Moreover, we introduce the V2XPnP Sequential Dataset that supports all V2X collaboration modes and addresses the limitations of existing real-world datasets, which are restricted to single-frame or single-mode cooperation. Extensive experiments demonstrate our framework outperforms state-of-the-art methods in both perception and prediction tasks. The codebase and dataset will be released to facilitate future V2X research.

Thu 23 Oct. 17:45 - 19:45 PDT

#65
InstaDrive: Instance-Aware Driving World Models for Realistic and Consistent Video Generation

Zhuoran Yang · Xi Guo · Chenjing Ding · Chiyu Wang · Wei Wu · Yanyong Zhang

Autonomous driving relies on robust models trained on high-quality, large-scale multi-view driving videos for tasks like perception and planning. While world models offer a cost-effective solution for generating realistic driving videos, they struggle to maintain instance-level temporal consistency and spatial geometric fidelity. To address these challenges, we propose InstaDrive, a novel framework that enhances driving video realism through two key advancements: (1) Instance Flow Guider module, which extracts and propagates instance features across frames to enforce temporal consistency, preserving instance identity over time. (2) Spatial Geometric Aligner module, which improves spatial reasoning, ensures precise instance positioning, and explicitly models occlusion hierarchies. By incorporating these instance-aware mechanisms, InstaDrive achieves state-of-the-art video generation quality and enhances downstream autonomous driving tasks on the nuScenes dataset. Additionally, we utilize CARLA's autopilot to procedurally and stochastically simulate rare yet safety-critical driving scenarios across diverse maps and regions, enabling rigorous safety evaluation for autonomous systems.

Thu 23 Oct. 17:45 - 19:45 PDT

#66
NormalLoc: Visual Localization on Textureless 3D Models using Surface Normals

Jiro Abe · Gaku Nakano · Kazumine Ogura

We propose NormalLoc, a novel visual localization method for estimating the 6-DoF pose of a camera using textureless 3D models. Existing methods often rely on color or texture information, limiting their applicability in scenarios where such information is unavailable. NormalLoc addresses this limitation by using rendered normal images generated from surface normals of 3D models to establish a training scheme for both global descriptor computation and matching. This approach enables robust visual localization even when geometric details are limited. Experimental results demonstrate that NormalLoc achieves state-of-the-art performance for visual localization on textureless 3D models, especially in scenarios with limited geometric detail.

Thu 23 Oct. 17:45 - 19:45 PDT

#67
EmbodiedSplat: Personalized Real-to-Sim-to-Real Navigation with Gaussian Splats from a Mobile Device

Gunjan Chhablani · Xiaomeng Ye · Muhammad Zubair Irshad · Zsolt Kira

The field of Embodied AI predominantly relies on simulation for training and evaluation, often using either fully synthetic environments that lack photorealism or high-fidelity real-world reconstructions captured with expensive hardware. As a result, sim-to-real transfer remains a major challenge. In this paper, we introduce EmbodiedSplat, a novel approach that personalizes policy training by efficiently capturing the deployment environment and fine-tuning policies within the reconstructed scenes. Our method leverages 3D Gaussian Splatting (GS) and the Habitat-Sim simulator to bridge the gap between realistic scene capture and effective training environments. Using iPhone-captured deployment scenes, we reconstruct meshes via GS, enabling training in settings that closely approximate real-world conditions. We conduct a comprehensive analysis of training strategies, pre-training datasets, and mesh reconstruction techniques, evaluating their impact on sim-to-real predictivity in real-world scenarios. Experimental results demonstrate that agents fine-tuned with EmbodiedSplat outperform both zero-shot baselines pre-trained on large-scale real-world datasets (HM3D) and synthetically generated datasets (HSSD), achieving absolute success rate improvements of 20% and 40% on real-world Image Navigation tasks. Moreover, our approach yields a high sim-vs-real correlation (0.87–0.97) for the reconstructed meshes, underscoring its effectiveness in adapting policies to diverse environments with minimal effort. Code and data will be released to facilitate further research.

Thu 23 Oct. 17:45 - 19:45 PDT

#68
FreeSplatter: Pose-free Gaussian Splatting for Sparse-view 3D Reconstruction

Jiale Xu · Shenghua Gao · Ying Shan

Sparse-view reconstruction models typically require precise camera poses, yet obtaining these parameters from sparse-view images remains challenging. We introduce \textbf{FreeSplatter}, a scalable feed-forward framework that generates high-quality 3D Gaussians from \textbf{uncalibrated} sparse-view images while estimating camera parameters within seconds. Our approach employs a streamlined transformer architecture where self-attention blocks facilitate information exchange among multi-view image tokens, decoding them into pixel-aligned 3D Gaussian primitives within a unified reference frame. This representation enables both high-fidelity 3D modeling and efficient camera parameter estimation using off-the-shelf solvers. We develop two specialized variants--for \textbf{object-centric} and \textbf{scene-level} reconstruction--trained on comprehensive datasets. Remarkably, FreeSplatter outperforms existing pose-dependent Large Reconstruction Models (LRMs) by a notable margin while achieving comparable or even better pose estimation accuracy than the state-of-the-art pose-free reconstruction approach MASt3R on challenging benchmarks. Beyond technical benchmarks, FreeSplatter streamlines text/image-to-3D content creation pipelines, eliminating the complexity of camera pose management while delivering exceptional visual fidelity.

Thu 23 Oct. 17:45 - 19:45 PDT

#69
UniMLVG: Unified Framework for Multi-view Long Video Generation with Comprehensive Control Capabilities for Autonomous Driving

Rui Chen · Zehuan Wu · Yichen Liu · Yuxin Guo · Jingcheng Ni · Haifeng Xia · Siyu Xia

The creation of diverse and realistic driving scenarios has become essential to enhance the perception and planning capabilities of autonomous driving systems. However, generating long-duration, surround-view consistent driving videos remains a significant challenge. To address this, we present UniMLVG, a unified framework designed to generate extended street multi-perspective videos under precise control. By integrating single- and multi-view driving videos into the training data, our approach updates a DiT-based diffusion model equipped with cross-frame and cross-view modules across three stages with multiple training objectives, substantially boosting the diversity and quality of generated visual content. Importantly, we propose an innovative explicit viewpoint modeling approach for multi-view video generation to effectively improve motion transition consistency. Capable of handling various input reference formats (e.g., text, images, or video), our UniMLVG generates high-quality multi-view videos according to the corresponding condition constraints such as 3D bounding boxes or frame-level text descriptions. Compared to the best models with similar capabilities, our framework achieves improvements of 48.2\% in FID and 35.2\% in FVD.

Thu 23 Oct. 17:45 - 19:45 PDT

#70
INSTINCT: Instance-Level Interaction Architecture for Query-Based Collaborative Perception

yunjiang xu · Yupeng Ouyang · Lingzhi Li · Jin Wang · Benyuan Yang

Collaborative perception systems overcome single-vehicle limitations in long-range detection and occlusion scenarios by integrating multi-agent sensory data, improving accuracy and safety. However, frequent cooperative interactions and real-time requirements impose stringent bandwidth constraints. Previous work shows that query-based instance-level interaction reduces bandwidth demands and manual priors; however, LiDAR-focused implementations in collaborative perception remain underdeveloped, with performance still trailing state-of-the-art approaches. To bridge this gap, we propose INSTINCT (instance-level interaction architecture), a novel collaborative perception framework featuring three core components: 1) a quality-aware filtering mechanism for high-quality instance feature selection; 2) a dual-branch detection routing scheme to decouple collaboration-irrelevant and collaboration-relevant instances; and 3) a Cross Agent Local Instance Fusion module to aggregate local hybrid instance features. Additionally, we enhance the ground truth (GT) sampling technique to facilitate training with diverse hybrid instance features. Extensive experiments across multiple datasets demonstrate that INSTINCT achieves superior performance. Specifically, our method improves accuracy by 13.23\%/32.24\% on DAIR-V2X and V2V4Real while reducing the communication bandwidth to 1/281 and 1/264 compared to state-of-the-art methods. The code will be released soon.

Thu 23 Oct. 17:45 - 19:45 PDT

#71
ForeSight: Multi-View Streaming Joint Object Detection and Trajectory Forecasting

Sandro Papais · Letian Wang · Brian Cheong · Steven Waslander

We introduce ForeSight, a novel joint detection and forecasting framework for vision-based 3D perception in autonomous vehicles. Traditional approaches treat detection and forecasting as separate sequential tasks, limiting their ability to leverage temporal cues from past forecasts. ForeSight addresses this limitation with a multi-task streaming and bidirectional learning approach, allowing detection and forecasting to share query memory and propagate information seamlessly. The forecast-aware detection transformer enhances spatial reasoning by integrating trajectory predictions from a multiple hypothesis forecast memory queue, while the streaming forecast transformer improves temporal consistency using past forecasts and refined detections. Unlike tracking-based methods, ForeSight eliminates the need for explicit object association, reducing error propagation with a tracking-free model that efficiently scales across multi-frame sequences. Experiments on the nuScenes dataset show that ForeSight achieves state-of-the-art performance, achieving an EPA of 54.9\%, surpassing previous methods by 9.3\%, while also attaining the highest mAP among multi-view detection models and maintaining competitive motion forecasting accuracy.

Thu 23 Oct. 17:45 - 19:45 PDT

#72
NGD: Neural Gradient Based Deformation for Monocular Garment Reconstruction

Soham Dasgupta · Shanthika Naik · Preet Savalia · Sujay Kumar Ingle · Avinash Sharma

Dynamic garment reconstruction from monocular video is an important yet challenging task due to the complex dynamics and unconstrained nature of the garments. Recent advancements in neural rendering have enabled high-quality geometric reconstruction with image/video supervision. However, implicit representation methods that use volume rendering often provide smooth geometry and fail to model high-frequency details. While template reconstruction methods model explicit geometry, they use vertex displacement for deformation which results in artifacts. Addressing these limitations, we propose NGD, a Neural Gradient-based Deformation method to reconstruct dynamically evolving textured garments from monocular videos. Additionally, we propose a novel adaptive remeshing strategy for modeling dynamically evolving surfaces like wrinkles and pleats of the skirt, leading to high-quality reconstruction. Finally, we learn dynamic texture maps to capture per-frame lighting and shadow effects. We provide extensive qualitative and quantitative evaluations to demonstrate significant improvements over existing SOTA methods and provide high-quality garment reconstructions.

Thu 23 Oct. 17:45 - 19:45 PDT

#73
Compression of 3D Gaussian Splatting with Optimized Feature Planes and Standard Video Codecs

Soonbin Lee · Fangwen Shu · Yago Sanchez de la Fuente · Thomas Schierl · Cornelius Hellge

3D Gaussian Splatting is a recognized method for 3D scene representation, known for its high rendering quality and speed. However, its substantial data requirements present challenges for practical applications. In this paper, we introduce an efficient compression technique that significantly reduces storage overhead by using a compact representation. We propose a unified architecture that combines point cloud data and feature planes through a progressive tri-plane structure. Our method utilizes 2D feature planes, enabling continuous spatial representation. To further optimize these representations, we incorporate entropy modeling in the frequency domain, specifically designed for standard video codecs. We also propose channel-wise bit allocation to achieve a better trade-off between bitrate consumption and feature plane representation. Consequently, our model effectively leverages spatial correlations within the feature planes to enhance rate-distortion performance using standard, non-differentiable video codecs. Experimental results demonstrate that our method outperforms existing methods in data compactness while maintaining high rendering quality.
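
As background on feature-plane representations in general (the shapes, plane pairing, and summation below are illustrative assumptions, not this paper's exact progressive tri-plane design), a 3D point is decoded by bilinearly sampling three axis-aligned 2D feature planes and combining the results:

```python
import torch
import torch.nn.functional as F

def triplane_sample(planes: torch.Tensor, xyz: torch.Tensor) -> torch.Tensor:
    """planes: (3, C, H, W) feature planes for the XY, XZ, and YZ slices;
    xyz: (N, 3) coordinates normalized to [-1, 1]; returns (N, C) features."""
    pairs = [xyz[:, [0, 1]], xyz[:, [0, 2]], xyz[:, [1, 2]]]
    feats = []
    for plane, uv in zip(planes, pairs):
        grid = uv.view(1, -1, 1, 2)                                    # (1, N, 1, 2)
        sampled = F.grid_sample(plane[None], grid, mode="bilinear",
                                align_corners=True)                    # (1, C, N, 1)
        feats.append(sampled[0, :, :, 0].t())                          # (N, C)
    return torch.stack(feats).sum(dim=0)
```

Because the planes are plain 2D feature maps, they can be quantized and passed to a standard video codec, which is what makes the rate-distortion machinery described above applicable.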

Thu 23 Oct. 17:45 - 19:45 PDT

#74
Spatially-Varying Autofocus

Yingsi Qin · Aswin Sankaranarayanan · Matthew O'Toole

A lens brings a $\textit{single}$ plane into focus on a planar sensor; hence, parts of the scene that are outside this planar focus plane are resolved on the sensor under defocus. Can we break this precept by enabling a lens that can change its depth-of-field arbitrarily? This work investigates the design and implementation of such a computational lens with spatially-selective focusing. Our design uses an optical arrangement of Lohmann lenses and phase spatial light modulators to allow each pixel to focus onto a different depth. We extend classical techniques used in autofocusing to the spatially-varying scenario where the depth map is iteratively estimated using contrast and disparity cues, enabling the camera to progressively shape its depth-of-field to the scene's depth. By obtaining an optical all-in-focus image, our technique advances upon a broad swathe of prior work ranging from depth-from-focus/defocus to coded aperture techniques in two key aspects: the ability to bring an entire scene in focus simultaneously, and the ability to maintain the highest possible spatial resolution.
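
The contrast cue used for spatially-varying focusing is close in spirit to classic depth-from-focus: for each pixel, pick the focus setting with the highest local sharpness. A minimal sketch with a Laplacian focus measure (the window size is an arbitrary choice, and this ignores the disparity cue entirely):

```python
import numpy as np
from scipy.ndimage import laplace, uniform_filter

def depth_from_focus(stack: np.ndarray, window: int = 9) -> np.ndarray:
    """stack: (D, H, W) grayscale focal stack; returns the per-pixel index of the
    sharpest slice, using locally averaged squared-Laplacian energy as the focus measure."""
    focus = np.stack([uniform_filter(laplace(s.astype(np.float64)) ** 2, size=window)
                      for s in stack])
    return np.argmax(focus, axis=0)
```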

Thu 23 Oct. 17:45 - 19:45 PDT

#75
Event-based Visual Vibrometry

Xinyu Zhou · Peiqi Duan · Yeliduosi Xiaokaiti · Chao Xu · Boxin Shi

Visual vibrometry has emerged as a powerful technique for remote acquisition of audio signals and the physical properties of materials. To capture high-frequency vibrations, frame-based visual vibrometry approaches often require a high-speed video camera and bright lighting to compensate for the short exposure time. In this paper, we introduce event-based visual vibrometry, a new high-speed visual vibration sensing method using an event camera. Exploiting the high temporal resolution, dynamic range, and low bandwidth characteristics of event cameras, event-based visual vibrometry achieves high-speed vibration sensing under common lighting conditions with enhanced data efficiency. Specifically, we leverage a hybrid camera system and propose an event-based subtle motion estimation framework that integrates an optimization-based approach for estimating coarse motion within short time intervals and a neural network to mitigate the inaccuracies in the coarse motion estimation. We demonstrate our method by capturing vibration caused by audio sources and estimating material properties for various objects.

Thu 23 Oct. 17:45 - 19:45 PDT

#76
Beyond One Shot, Beyond One Perspective: Cross-View and Long-Horizon Distillation for Better LiDAR Representations

Xiang Xu · Lingdong Kong · Song Wang · Chuanwei Zhou · Qingshan Liu

LiDAR representation learning aims to extract rich structural and semantic information from large-scale, readily available datasets, reducing reliance on costly human annotations. However, existing LiDAR representation strategies often overlook the inherent spatiotemporal cues in LiDAR sequences, limiting their effectiveness. In this work, we propose LiMA, a novel long-term image-to-LiDAR Memory Aggregation framework that explicitly captures longer range temporal correlations to enhance LiDAR representation learning. LiMA comprises three key components: 1) a Cross-View Aggregation module that aligns and fuses overlapping regions across neighboring camera views, constructing a more unified and redundancy-free memory bank; 2) a Long-Term Feature Propagation mechanism that efficiently aligns and integrates multi-frame image features, reinforcing temporal coherence during LiDAR representation learning; and 3) a Cross-Sequence Memory Alignment strategy that enforces consistency across driving sequences, improving generalization to unseen environments. LiMA maintains high pretraining efficiency and incurs no additional computational overhead during downstream tasks. Extensive experiments on mainstream LiDAR-based perception benchmarks demonstrate that LiMA significantly improves both LiDAR semantic segmentation and 3D object detection. We hope this work inspires more effective pretraining paradigms for autonomous driving. The code will be made publicly accessible for future research.

Thu 23 Oct. 17:45 - 19:45 PDT

#77
BézierGS: Dynamic Urban Scene Reconstruction with Bézier Curve Gaussian Splatting

Zipei Ma · Junzhe Jiang · Yurui Chen · Li Zhang

The realistic reconstruction of street scenes is critical for developing real-world simulators in autonomous driving. Most existing methods rely on object pose annotations, using these poses to reconstruct dynamic objects and move them during the rendering process. This dependence on high-precision object annotations limits large-scale and extensive scene reconstruction. To address this challenge, we propose Bézier curve Gaussian splatting (BézierGS), which represents the motion trajectories of dynamic objects using learnable Bézier curves. This approach fully leverages the temporal information of dynamic objects and, through learnable curve modeling, automatically corrects pose errors. By introducing additional supervision on dynamic object rendering and inter-curve consistency constraints, we achieve reasonable and accurate separation and reconstruction of scene elements. Extensive experiments on the Waymo Open Dataset and the nuPlan benchmark demonstrate that BézierGS outperforms state-of-the-art alternatives in the reconstruction of both dynamic and static scene components and in novel view synthesis.
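To make the trajectory representation concrete, here is a minimal sketch of evaluating a Bézier curve from learnable control points via the Bernstein basis, the generic construction behind learnable Bézier curves; the control-point count, tensor shapes, and usage below are assumptions and not BézierGS's actual parameterization.

```python
import torch
from math import comb

def bezier_trajectory(control_points: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    """Evaluate a Bezier curve at times t in [0, 1].
    control_points: (N+1, 3) learnable tensor; t: (T,) timestamps.
    Returns (T, 3) positions as a Bernstein-basis weighted sum."""
    n = control_points.shape[0] - 1
    # Bernstein basis B_{i,n}(t) = C(n, i) * t^i * (1 - t)^(n - i)
    basis = torch.stack(
        [comb(n, i) * t**i * (1 - t)**(n - i) for i in range(n + 1)], dim=-1)  # (T, N+1)
    return basis @ control_points

# A learnable cubic trajectory, differentiable w.r.t. its control points:
ctrl = torch.nn.Parameter(torch.randn(4, 3))
positions = bezier_trajectory(ctrl, torch.linspace(0, 1, 50))
```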

Thu 23 Oct. 17:45 - 19:45 PDT

#78
Lifting the Structural Morphing for Wide-Angle Images Rectification: Unified Content and Boundary Modeling

Wenting Luan · Siqi Lu · Yongbin Zheng · Wanying XU · Lang Nie · Zongtan Zhou · Kang Liao

The mainstream approach for correcting distortions in wide-angle images typically involves a cascading process of rectification followed by rectangling. These tasks address distorted image content and irregular boundaries separately, using two distinct pipelines. However, this independent optimization prevents the two stages from benefiting each other. It also increases susceptibility to error accumulation and misaligned optimization, ultimately degrading the quality of the rectified image and the performance of downstream vision tasks. In this work, we observe and verify that transformations based on motion representations (e.g., Thin-Plate Spline) exhibit structural continuity in both rectification and rectangling tasks. This continuity enables us to establish their relationships through the perspective of structural morphing, allowing for an optimal solution within a single end-to-end framework. To this end, we propose ConBo-Net, a unified Content and Boundary modeling approach for one-stage wide-angle image correction. Our method jointly addresses distortion rectification and boundary rectangling in an end-to-end manner. To further enhance the model’s structural recovery capability, we incorporate physical priors based on the wide-angle camera model during training and introduce an ordinal geometric loss to enforce curvature monotonicity. Extensive experiments demonstrate that ConBo-Net outperforms state-of-the-art two-stage solutions. The code and dataset will be made available.

Thu 23 Oct. 17:45 - 19:45 PDT

#79
Global Regulation and Excitation via Attention Tuning for Stereo Matching

Jiahao LI · Xinhong Chen · Zhengmin JIANG · Qian Zhou · Yung-Hui Li · Jianping Wang

Stereo matching achieves significant progress with iterative algorithms like RAFT-Stereo and IGEV-Stereo. However, these methods struggle in ill-posed regions with occlusions, textureless surfaces, or repetitive patterns, due to the lack of global context and geometric information for effective iterative refinement. To enable the existing iterative approaches to incorporate global context, we propose the Global Regulation and Excitation via Attention Tuning (GREAT) framework which encompasses three attention modules. Specifically, Spatial Attention (SA) captures the global context within the spatial dimension, Matching Attention (MA) extracts global context along epipolar lines, and Volume Attention (VA) works in conjunction with SA and MA to construct a more robust cost-volume excited by global context and geometric details. To verify the universality and effectiveness of this framework, we integrate it into several representative iterative stereo-matching methods and validate it through extensive experiments, collectively denoted as GREAT-Stereo. This framework demonstrates superior performance in challenging ill-posed regions. Applied to IGEV-Stereo, among all published methods, our GREAT-IGEV ranks first on the Scene Flow test set, KITTI 2015, and ETH3D leaderboards, and ranks second on the Middlebury benchmark. Code for reproducibility will be available in the future.
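As a rough illustration of attention restricted to epipolar lines (the idea behind Matching Attention in a rectified stereo setting), the sketch below applies self-attention within each image row. The module name, channel and head counts are assumptions; the actual GREAT modules and cost-volume excitation are not reproduced here.

```python
import torch
import torch.nn as nn

class EpipolarAttention(nn.Module):
    """Toy 'matching attention': self-attention restricted to each image row,
    i.e. along epipolar lines of a rectified stereo pair.
    channels must be divisible by num_heads."""
    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:    # feat: (B, C, H, W)
        B, C, H, W = feat.shape
        rows = feat.permute(0, 2, 3, 1).reshape(B * H, W, C)  # one token sequence per row
        out, _ = self.attn(rows, rows, rows)                  # attend within the row only
        return out.reshape(B, H, W, C).permute(0, 3, 1, 2)

# attn = EpipolarAttention(channels=64)
# refined = attn(torch.randn(2, 64, 32, 48))
```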

Thu 23 Oct. 17:45 - 19:45 PDT

#80
Global-Aware Monocular Semantic Scene Completion with State Space Models

Shijie Li · Zhongyao Cheng · Rong Li · Shuai Li · Juergen Gall · Xun Xu · Xulei Yang

Monocular Semantic Scene Completion (MonoSSC) reconstructs and interprets 3D environments from a single image, enabling diverse real-world applications. However, existing methods are often constrained by the local receptive field of Convolutional Neural Networks (CNNs), making it challenging to handle the non-uniform distribution of projected points and effectively reconstruct missing information caused by the 3D-to-2D projection. In this work, we introduce GA-MonoSSC, a hybrid architecture for MonoSSC that effectively captures global context in both the 2D image domain and 3D space. Specifically, we propose a Dual-Head Multi-Modality Encoder, which leverages a Transformer architecture to capture spatial relationships across all features in the 2D image domain, enabling more comprehensive 2D feature extraction. Additionally, we introduce the Frustum Mamba Decoder, built on the State Space Model (SSM), to efficiently capture long-range dependencies in 3D space. Furthermore, we propose a frustum reordering strategy within the Frustum Mamba Decoder to mitigate feature discontinuities in the reordered voxel sequence, ensuring better alignment with the scan mechanism of the State Space Model (SSM) for improved 3D representation learning. We conduct extensive experiments on the widely used Occ-ScanNet and NYUv2 datasets, demonstrating that our proposed method achieves state-of-the-art performance, validating its effectiveness. The code will be released upon acceptance.

Thu 23 Oct. 17:45 - 19:45 PDT

#81
UniOcc: A Unified Benchmark for Occupancy Forecasting and Prediction in Autonomous Driving

Yuping Wang · Xiangyu Huang · Xiaokang Sun · Mingxuan Yan · Shuo Xing · Zhengzhong Tu · Jiachen Li

We introduce UniOcc, a comprehensive, unified benchmark for occupancy forecasting (i.e., predicting future occupancies based on historical information) and current-frame occupancy prediction from camera images. UniOcc unifies data from multiple real-world datasets (i.e., nuScenes, Waymo) and high-fidelity driving simulators (i.e., CARLA, OpenCOOD), which provides 2D/3D occupancy labels with per-voxel flow annotations and support for cooperative autonomous driving. Unlike existing studies that rely on suboptimal pseudo labels for evaluation, UniOcc incorporates novel evaluation metrics that do not depend on ground-truth occupancy, enabling robust assessment on additional aspects of occupancy quality. Through extensive experiments on state-of-the-art models, we demonstrate that large-scale, diverse training data and explicit flow information significantly enhance occupancy prediction and forecasting performance. We will release UniOcc to facilitate research in safe and reliable autonomous driving.

Thu 23 Oct. 17:45 - 19:45 PDT

#82
Free4D: Tuning-free 4D Scene Generation with Spatial-Temporal Consistency

Tianqi Liu · Zihao Huang · Zhaoxi Chen · Guangcong Wang · Shoukang Hu · Liao Shen · Huiqiang Sun · Zhiguo Cao · Wei Li · Ziwei Liu

We present Free4D, a novel tuning-free framework for 4D scene generation from a single image. Existing methods either focus on object-level generation, making scene-level generation infeasible, or rely on large-scale multi-view video datasets for expensive training, with limited generalization ability due to the scarcity of 4D scene data. In contrast, our key insight is to distill pre-trained foundation models for consistent 4D scene representation, which offers promising advantages such as efficiency and generalizability. 1) To achieve this, we first animate the input image using image-to-video diffusion models followed by 4D geometric structure initialization. 2) To lift this coarse structure into spatial-temporal consistent multi-view videos, we design an adaptive guidance mechanism with a point-guided denoising strategy for spatial consistency and a novel latent replacement strategy for temporal coherence. 3) To turn these generated observations into consistent 4D representation, we propose a modulation-based refinement to mitigate inconsistencies while fully leveraging the generated information. The resulting 4D representation enables real-time, controllable temporal-spatial rendering, marking a significant advancement in single-image-based 4D scene generation. Code will be released.

Thu 23 Oct. 17:45 - 19:45 PDT

#83
MuGS: Multi-Baseline Generalizable Gaussian Splatting Reconstruction

Yaopeng Lou · Liao Shen · Tianqi Liu · Jiaqi Li · Zihao Huang · Huiqiang Sun · Zhiguo Cao

We present Multi-Baseline Gaussian Splatting (MuGS), a generalized feed-forward approach for novel view synthesis that effectively handles diverse baseline settings, including sparse input views with both small and large baselines. Specifically, we integrate features from Multi-View Stereo (MVS) and Monocular Depth Estimation (MDE) to enhance feature representations for generalizable reconstruction. Next, we propose a projection-and-sampling mechanism for deep depth fusion, which constructs a fine probability volume to guide the regression of the feature map. Furthermore, we introduce a reference-view loss to improve geometry and optimization efficiency. We leverage 3D Gaussian representations to accelerate training and inference time while enhancing rendering quality. MuGS achieves state-of-the-art performance across multiple baseline settings and diverse scenarios ranging from simple objects (DTU) to complex indoor and outdoor scenes (RealEstate10K). We also demonstrate promising zero-shot performance on the LLFF and Mip-NeRF 360 datasets. Code will be released.

Thu 23 Oct. 17:45 - 19:45 PDT

#84
S3R-GS: Streamlining the Pipeline for Large-Scale Street Scene Reconstruction

Guangting Zheng · Jiajun Deng · Xiaomeng Chu · Yu Yuan · Houqiang Li · Yanyong Zhang

Recently, 3D Gaussian Splatting (3DGS) has reshaped the field of photorealistic 3D reconstruction, achieving impressive rendering quality and speed. However, when applied to large-scale street scenes, existing methods suffer from rapidly escalating per-viewpoint reconstruction costs as scene size increases, leading to significant computational overhead. After revisiting the conventional pipeline, we identify three key factors accounting for this issue: unnecessary local-to-global transformations, excessive 3D-to-2D projections, and inefficient rendering of distant content. To address these challenges, we propose S3R-GS, a 3DGS framework that Streamlines the pipeline for large-scale Street Scene Reconstruction, effectively mitigating these limitations. Moreover, most existing street 3DGS methods rely on ground-truth 3D bounding boxes to separate dynamic and static components, but 3D bounding boxes are difficult to obtain, limiting real-world applicability. To address this, we propose an alternative solution with 2D boxes, which are easier to annotate or can be predicted by off-the-shelf vision foundation models. Such designs together make S3R-GS readily adapt to large, in-the-wild scenarios. Extensive experiments demonstrate that S3R-GS enhances rendering quality and significantly accelerates reconstruction. Remarkably, when applied to videos from the challenging Argoverse2 dataset, it achieves state-of-the-art PSNR and SSIM, reducing reconstruction time to below 50\% (and even 20\%) of competing methods. The code will be released to facilitate further exploration.

Thu 23 Oct. 17:45 - 19:45 PDT

#85
HiP-AD: Hierarchical and Multi-Granularity Planning with Deformable Attention for Autonomous Driving in a Single Decoder

Yingqi Tang · Zhuoran Xu · Zhaotie Meng · Erkang Cheng

Although end-to-end autonomous driving (E2E-AD) technologies have made significant progress in recent years, their performance on closed-loop evaluation remains unsatisfactory. The potential of leveraging planning in query design and interaction has not yet been fully explored. In this paper, we introduce a multi-granularity planning query representation that integrates heterogeneous waypoints, including spatial, temporal, and driving-style waypoints across various sampling patterns. It provides additional supervision for trajectory prediction, enhancing precise closed-loop control for the ego vehicle. Additionally, we explicitly utilize the geometric properties of planning trajectories to effectively retrieve relevant image features based on physical locations using deformable attention. By combining these strategies, we propose a novel end-to-end autonomous driving framework, termed HiP-AD, which simultaneously performs perception, prediction, and planning within a unified decoder. HiP-AD enables comprehensive interaction by allowing planning queries to iteratively interact with perception queries in the BEV space while dynamically extracting image features from perspective views. Experiments demonstrate that HiP-AD outperforms all existing end-to-end autonomous driving methods on the closed-loop benchmark Bench2Drive and achieves competitive performance on the real-world dataset nuScenes. The code will be available upon acceptance.

Thu 23 Oct. 17:45 - 19:45 PDT

#86
Highlight
RayletDF: Raylet Distance Fields for Generalizable 3D Surface Reconstruction from Point Clouds or Gaussians

Shenxing Wei · Jinxi Li · Yafei YANG · Siyuan Zhou · Bo Yang

In this paper, we present a generalizable method for 3D surface reconstruction from raw point clouds or 3D Gaussians pre-estimated by 3DGS from RGB images. Unlike existing coordinate-based methods which are often computationally intensive when rendering explicit surfaces, our proposed method, named RayletDF, introduces a new technique called raylet distance field, which aims to directly predict surface points from query rays. Our pipeline consists of three key modules: a raylet feature extractor, a raylet distance field predictor, and a multi-raylet blender. These components work together to extract fine-grained local geometric features, predict raylet distances, and aggregate multiple predictions to reconstruct precise surface points. We extensively evaluate our method on multiple public real-world datasets, demonstrating superior performance in surface reconstruction from point clouds or 3D Gaussians. Most notably, our method achieves exceptional generalization ability, successfully recovering 3D surfaces in a single forward pass across unseen datasets in testing.

Thu 23 Oct. 17:45 - 19:45 PDT

#87
GAP: Gaussianize Any Point Clouds with Text Guidance

Weiqi Zhang · Junsheng Zhou · Haotian Geng · Wenyuan Zhang · Liang Han

3D Gaussian Splatting (3DGS) has demonstrated its advantages in achieving fast and high-quality rendering. As point clouds serve as a widely-used and easily accessible form of 3D representation, bridging the gap between point clouds and Gaussians becomes increasingly important. Recent studies have explored how to convert colored points into Gaussians, but directly generating Gaussians from colorless 3D point clouds remains an unsolved challenge. In this paper, we propose GAP, a novel approach that gaussianizes raw point clouds into high-fidelity 3D Gaussians with text guidance. Our key idea is to design a multi-view optimization framework that leverages a depth-aware image diffusion model to synthesize consistent appearances across different viewpoints. To ensure geometric accuracy, we introduce a surface-anchoring mechanism that effectively constrains Gaussians to lie on the surfaces of 3D shapes during optimization. Furthermore, GAP incorporates a diffusion-based inpainting strategy that specifically targets completing hard-to-observe regions. We evaluate GAP on the Point-to-Gaussian generation task across varying complexity levels, from synthetic point clouds to challenging real-world scans, and even large-scale scenes.

Thu 23 Oct. 17:45 - 19:45 PDT

#88
Semantic-guided Camera Ray Regression for Visual Localization

Yesheng Zhang · Xu Zhao

This work presents a novel framework for Visual Localization (VL), that is, regressing camera rays from query images to derive camera poses. As an overparameterized representation of the camera pose, camera rays possess superior robustness in optimization. Of particular importance, Camera Ray Regression (CRR) is privacy-preserving, rendering it a viable VL approach for real-world applications. Thus, we introduce DINO-based Multi-Mappers, coined DIMM, to achieve VL by CRR. DIMM utilizes DINO as a scene-agnostic encoder to obtain powerful features from images. To mitigate ambiguity, the features integrate both local and global perception, as well as potential geometric constraints. Then, a scene-specific mapper head regresses camera rays from these features. It incorporates a semantic attention module for soft fusion of multiple mappers, utilizing the rich semantic information in DINO features. In extensive experiments on both indoor and outdoor datasets, our method showcases impressive performance, revealing a promising direction for advancements in VL.
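For background on the ray representation that CRR regresses, the sketch below converts a pinhole camera pose into per-pixel ray origins and directions. It only illustrates the pose-as-rays idea; the function and variable names are assumptions and not part of DIMM.

```python
import numpy as np

def camera_rays(K: np.ndarray, c2w: np.ndarray, H: int, W: int):
    """Per-pixel rays for a pinhole camera.
    K: (3, 3) intrinsics; c2w: (4, 4) camera-to-world pose.
    Returns ray origins (H*W, 3) and unit directions (H*W, 3) in world space."""
    u, v = np.meshgrid(np.arange(W) + 0.5, np.arange(H) + 0.5)      # pixel centers
    pix = np.stack([u.ravel(), v.ravel(), np.ones(H * W)], axis=0)  # (3, H*W) homogeneous
    dirs_cam = np.linalg.inv(K) @ pix                               # camera-frame directions
    dirs_world = c2w[:3, :3] @ dirs_cam                             # rotate into world frame
    dirs_world /= np.linalg.norm(dirs_world, axis=0, keepdims=True)
    origins = np.broadcast_to(c2w[:3, 3], (H * W, 3))               # all rays share the camera center
    return origins, dirs_world.T
```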

Thu 23 Oct. 17:45 - 19:45 PDT

#89
SketchSplat: 3D Edge Reconstruction via Differentiable Multi-view Sketch Splatting

Haiyang Ying · Matthias Zwicker

Edges are one of the most basic parametric primitives to describe structural information in 3D. In this paper, we study parametric 3D edge reconstruction from calibrated multi-view images. Previous methods usually reconstruct a 3D edge point set from multi-view 2D edge images, and then fit 3D edges to the point set. However, noise in the point set may cause gaps among fitted edges, and the recovered edges may not align with input multi-view images since the edge fitting depends only on the reconstructed 3D point set. To mitigate these problems, we propose SketchSplat, a method to reconstruct accurate, complete, and compact 3D edges via differentiable multi-view sketch splatting. We represent 3D edges as sketches, which are parametric lines and curves defined by attributes including control points, scales, and opacity. During edge reconstruction, we iteratively sample Gaussian points from a set of sketches and rasterize the Gaussians onto 2D edge images. Then the gradient of the image error with respect to the input 2D edge images can be back-propagated to optimize the sketch attributes. Our method bridges 2D edge images and 3D edges in a differentiable manner, which ensures that 3D edges align well with 2D images and leads to accurate and complete results. We also propose a series of adaptive topological operations and apply them along with the sketch optimization. The topological operations help reduce the number of sketches required while ensuring high accuracy, yielding a more compact reconstruction. Finally, we contribute an accurate 2D edge detector that improves the performance of both ours and existing methods. Experiments show that our method achieves state-of-the-art accuracy, completeness, and compactness on a benchmark CAD dataset.

Thu 23 Oct. 17:45 - 19:45 PDT

#90
Polarimetric Neural Field via Unified Complex-Valued Wave Representation

Chu Zhou · Yixin Yang · Junda Liao · Heng Guo · Boxin Shi · Imari Sato

Polarization has found applications in various computer vision tasks by providing additional physical cues. However, due to the limitations of current imaging systems, polarimetric parameters are typically stored in discrete form, which is non-differentiable and limits their applicability in polarization-based vision. While current neural field methods have shown promise for continuous signal reconstruction, they struggle to model the intrinsic physical interdependencies among polarimetric parameters. In this work, we propose a physics-grounded representation scheme to represent polarimetric parameters as a unified complex-valued wavefunction. Tailored to this scheme, we propose a tuning-free fitting strategy along with a lightweight complex-valued neural network, enabling property-preserved reconstruction. Experimental results show that our method achieves state-of-the-art performance and facilitates smooth polarized image rendering and flexible resolution adjustments.

Thu 23 Oct. 17:45 - 19:45 PDT

#91
High-Precision 3D Measurement of Complex Textured Surfaces Using Multiple Filtering Approach

Yuchong Chen · Jian Yu · Shaoyan Gai · Zeyu Cai · Feipeng Da

In structured light systems, measurement accuracy tends to decline significantly when evaluating complex textured surfaces, particularly at boundaries between different colors. To address this issue, this paper conducts a detailed analysis to develop an error model that illustrates the relationship between phase error and image characteristics, specifically the blur level, grayscale value, and grayscale gradient. Based on this model, a high-precision method for measuring complex textured targets is introduced, employing a multiple filtering approach. This approach first applies a sequence of filters to vary the blur level of the captured patterns, allowing calculation of phase differences under different blur conditions. Then, these phase differences are used in the constructed error model to identify the critical parameter causing phase errors. Finally, phase recovery is performed using the calibrated parameter, effectively reducing errors caused by complex textures. Experimental comparisons show that this method reduces the Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) by 40.31% and 40.78%, respectively. In multiple experiments, its performance generally surpassed that of existing methods, demonstrating improved accuracy and robustness.
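As background for the phase-error analysis, standard N-step phase-shifting profilometry recovers the wrapped phase from fringe images as sketched below. This shows only the conventional phase computation, not the paper's error model or multiple-filtering calibration, and the shift convention is an assumption.

```python
import numpy as np

def wrapped_phase(images: np.ndarray) -> np.ndarray:
    """Standard N-step phase shifting. images: (N, H, W) fringe captures with
    phase shifts 2*pi*n/N. Returns the wrapped phase map in (-pi, pi]."""
    N = images.shape[0]
    shifts = 2 * np.pi * np.arange(N) / N
    num = np.tensordot(np.sin(shifts), images, axes=1)  # sum_n I_n * sin(delta_n)
    den = np.tensordot(np.cos(shifts), images, axes=1)  # sum_n I_n * cos(delta_n)
    return np.arctan2(-num, den)
```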

Thu 23 Oct. 17:45 - 19:45 PDT

#92
OCSplats: Observation Completeness Quantification and Label Noise Separation in 3DGS

Han Ling · Yinghui Sun · Xian Xu · Quansen Sun

3D Gaussian Splatting (3DGS) has become one of the most promising 3D reconstruction technologies. However, label noise in real-world scenarios—such as moving objects, non-Lambertian surfaces, and shadows—often leads to reconstruction errors. Existing 3DGS-based anti-noise reconstruction methods either fail to separate noise effectively or require scene-specific fine-tuning of hyperparameters, making them difficult to apply in practice. This paper re-examines the problem of anti-noise reconstruction from the perspective of epistemic uncertainty, proposing a novel framework, OCSplats. By combining key techniques such as hybrid noise assessment and observation-based cognitive correction, OCSplats significantly improves the accuracy of noise classification in areas with cognitive differences. Moreover, to address the issue of varying noise proportions in different scenarios, we design a label noise classification pipeline based on dynamic anchor points. This pipeline enables OCSplats to be applied simultaneously to scenarios with vastly different noise proportions without adjusting parameters. Extensive experiments demonstrate that OCSplats consistently achieves leading reconstruction performance and precise label noise classification in scenes of different complexity levels. Code will be available.

Thu 23 Oct. 17:45 - 19:45 PDT

#93
Highlight
VMem: Consistent Interactive Video Scene Generation with Surfel-Indexed View Memory

Runjia Li · Philip Torr · Andrea Vedaldi · Tomas Jakab

We propose a novel approach for long-term autoregressive scene generation in the form of a camera-conditioned video stream. Existing methods either rely on explicit geometry estimation in inpainting-based approaches, which suffer from geometric inaccuracies, or use a limited context window in video-based approaches, which struggle with long-term coherence. To address these limitations, we introduce Surfel-Indexed Memory of Views (SIMView), a mechanism that anchors past views to surface elements (surfels) they previously observed. This allows us to retrieve and condition novel view generation on the most relevant past views rather than just the latest ones. By leveraging information about the scene's geometric structure, our method significantly enhances long-term scene consistency while reducing computational overhead. We evaluate our approach on challenging long-term scene synthesis benchmarks, demonstrating superior performance in scene coherence and camera control compared to existing methods.

Thu 23 Oct. 17:45 - 19:45 PDT

#94
AutoScape: Geometry-Consistent Long-Horizon Scene Generation

Jiacheng Chen · Ziyu Jiang · Mingfu Liang · Bingbing Zhuang · Jong-Chyi Su · Sparsh Garg · Ying Wu · Manmohan Chandraker

Video generation for driving scenes has gained increasing attention due to its broad range of applications, including autonomous driving, robotics, and mixed reality. However, generating high-quality, long-horizon, and 3D-consistent videos remains a challenge. We propose AutoScape, a framework designed for long-horizon driving scene generation. The framework comprises two stages: 1) Keyframe Generation, which anchors global scene appearance and geometry by autoregressively generating 3D-consistent keyframes using a joint RGB-D diffusion model, and 2) Interpolation, which employs a video diffusion model to generate dense frames conditioned on consecutive keyframes, ensuring temporal continuity and geometric consistency. With three innovative design choices to guarantee 3D consistency (RGB-D Diffusion, 3D Information Conditioning, and Warp Consistent Guidance), AutoScape achieves superior performance, generating realistic and geometrically consistent driving videos of up to 20 seconds at 12 FPS. Specifically, it improves the FID and FVD scores over the prior state-of-the-art by 48.6% and 43.0%, respectively, setting a new benchmark for long-horizon video generation in driving scenes.

Thu 23 Oct. 17:45 - 19:45 PDT

#95
From Gallery to Wrist: Realistic 3D Bracelet Insertion in Videos

Chenjian Gao · Lihe Ding · Rui Han · Zhanpeng Huang · Zibin Wang · Tianfan Xue

Inserting 3D objects into videos is a longstanding challenge in computer graphics with applications in augmented reality, virtual try-on, and video composition. Achieving both temporal consistency and realistic lighting remains difficult, particularly in dynamic scenarios with complex object motion, perspective changes, and varying illumination. While 2D diffusion models have shown promise for producing photorealistic edits, they often struggle with maintaining temporal coherence across frames. Conversely, traditional 3D rendering methods excel in spatial and temporal consistency but fall short in achieving photorealistic lighting. In this work, we propose a hybrid object insertion pipeline that combines the strengths of both paradigms. Specifically, we focus on inserting bracelets into dynamic wrist scenes, leveraging the high temporal consistency of 3D Gaussian Splatting (3DGS) for initial rendering and refining the results using a 2D diffusion-based enhancement model to ensure realistic lighting interactions. Our method introduces a shading-driven pipeline that separates intrinsic object properties (albedo, shading, reflectance) and refines both shading and sRGB images for photorealism. To maintain temporal coherence, we optimize the 3DGS model with multi-frame weighted adjustments. This is the first approach to synergize 3D rendering and 2D diffusion for video object insertion, offering a robust solution for realistic and consistent video editing.

Thu 23 Oct. 17:45 - 19:45 PDT

#96
Street Gaussians without 3D Object Tracker

Ruida Zhang · Chengxi Li · Chenyangguang Zhang · Xingyu Liu · Haili Yuan · Yanyan Li · Xiangyang Ji · Gim Hee Lee

Realistic scene reconstruction in driving scenarios poses significant challenges due to fast-moving objects. Most existing methods rely on labor-intensive manual labeling of object poses to reconstruct dynamic objects in canonical space and move them based on these poses during rendering. While some approaches attempt to use 3D object trackers to replace manual annotations, the limited generalization of 3D trackers - caused by the scarcity of large-scale 3D datasets - results in inferior reconstructions in real-world settings. In contrast, 2D foundation models demonstrate strong generalization capabilities. To eliminate the reliance on 3D trackers and enhance robustness across diverse environments, we propose a stable object tracking module by leveraging associations from 2D deep trackers within a 3D object fusion strategy. We address inevitable tracking errors by further introducing a motion learning strategy in an implicit feature space that autonomously corrects trajectory errors and recovers missed detections. Experimental results on Waymo-NOTR and KITTI show that our method outperforms existing approaches. Our code will be made publicly available.

Thu 23 Oct. 17:45 - 19:45 PDT

#97
RogSplat: Robust Gaussian Splatting via Generative Priors

Hanyang Kong · Xingyi Yang · Xinchao Wang

3D Gaussian Splatting (3DGS) has recently emerged as an efficient representation for high-quality 3D reconstruction and rendering. Despite its superior rendering quality and speed, 3DGS heavily relies on the assumption of geometric consistency among input images. In real-world scenarios, violations of this assumption—such as occlusions, dynamic objects, or camera blur—often lead to reconstruction artifacts and rendering inaccuracies. To address these challenges, we introduce RogSplat, a robust framework that leverages generative models to enhance the reliability of 3DGS. Specifically, RogSplat identifies and rectifies occluded regions during the optimization of unstructured scenes. Outlier regions are first detected using our proposed fused features and then accurately inpainted by the proposed RF-Refiner, ensuring reliable reconstruction of occluded areas while preserving the integrity of visible regions. Extensive experiments demonstrate that RogSplat achieves state-of-the-art reconstruction quality on the RobustNeRF and NeRF-on-the-go datasets, significantly outperforming existing methods in challenging real-world scenarios involving dynamic objects.

Thu 23 Oct. 17:45 - 19:45 PDT

#98
Highlight
HiNeuS: High-fidelity Neural Surface Mitigating Low-texture and Reflective Ambiguity

Yida Wang · Xueyang Zhang · Kun Zhan · Peng Jia · XianPeng Lang

Neural surface reconstruction faces critical challenges in achieving geometrically accurate and visually coherent results under complex real-world conditions. We present a unified framework that simultaneously resolves multi-view radiance inconsistencies, enhances low-textured surface recovery, and preserves fine structural details through three fundamental innovations. First, our SDF-guided visibility factor $\mathbb{V}$ establishes continuous occlusion reasoning to eliminate reflection-induced ambiguities in multi-view supervision. Second, we introduce local geometry constraints via ray-aligned patch analysis $\mathbb{P}$, enforcing planarity in textureless regions while maintaining edge sensitivity through adaptive feature weighting. Third, we reformulate Eikonal regularization with rendering-prioritized relaxation, enabling detail preservation by conditioning geometric smoothness on local radiance variations. Unlike prior works that address these aspects in isolation, our method achieves synergistic optimization where multi-view consistency, surface regularity, and structural fidelity mutually reinforce without compromise. Extensive experiments across synthetic and real-world datasets demonstrate state-of-the-art performance, with quantitative improvements of 21.4\% in Chamfer distance over reflection-aware baselines and 2.32 dB PSNR gains against neural rendering counterparts. Qualitative results showcase unprecedented reconstruction quality for challenging cases including specular instruments, urban layouts with thin structures, and Lambertian surfaces with sub-millimeter details. Our code will be publicly released to facilitate research in unified neural surface recovery.
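For reference, the standard Eikonal regularizer that this work reformulates penalizes deviations of the SDF gradient norm from one; a minimal autograd sketch is below. The rendering-prioritized relaxation conditioned on local radiance variation is not shown, and the network interface is an assumption.

```python
import torch

def eikonal_loss(sdf_net: torch.nn.Module, points: torch.Tensor) -> torch.Tensor:
    """Standard Eikonal term: penalize deviation of the SDF gradient norm from 1.
    sdf_net maps (B, 3) points to (B, 1) signed distances."""
    points = points.clone().requires_grad_(True)
    sdf = sdf_net(points)
    grad = torch.autograd.grad(outputs=sdf, inputs=points,
                               grad_outputs=torch.ones_like(sdf),
                               create_graph=True)[0]       # (B, 3) spatial gradients
    return ((grad.norm(dim=-1) - 1.0) ** 2).mean()
```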

Thu 23 Oct. 17:45 - 19:45 PDT

#99
RGE-GS: Reward-Guided Expansive Driving Scene Reconstruction via Diffusion Priors

Sicong Du · Jiarun Liu · Qifeng Chen · Hao-Xiang Chen · Tai-Jiang Mu · Sheng Yang

A single-pass driving clip frequently results in incomplete scanning of the road structure, making reconstructed scene expansion a critical requirement for sensor simulators to effectively regress driving actions. Although contemporary 3D Gaussian Splatting (3DGS) techniques achieve remarkable reconstruction quality, their direct extension through the integration of diffusion priors often introduces cumulative physical inconsistencies and compromises training efficiency. To address these limitations, we present RGE-GS, a novel expansive reconstruction framework that synergizes diffusion-based generation with reward-guided Gaussian integration. The RGE-GS framework incorporates two key innovations: First, we propose a reward network that learns to identify and prioritize consistently generated patterns prior to reconstruction phases, thereby enabling selective retention of diffusion outputs for spatial stability. Second, during the reconstruction process, we devise a differentiated training strategy that automatically adjusts Gaussian optimization progress according to scene convergence metrics, achieving better convergence than baseline methods. Extensive evaluations on publicly available datasets demonstrate that RGE-GS achieves state-of-the-art performance in reconstruction quality.

Thu 23 Oct. 17:45 - 19:45 PDT

#100
Scene Coordinate Reconstruction Priors

Wenjing Bian · Axel Barroso-Laguna · Tommaso Cavallari · Victor Prisacariu · Eric Brachmann

Scene coordinate regression (SCR) models have proven to be powerful implicit scene representations for 3D vision, enabling visual relocalization and structure-from-motion. SCR models are trained specifically for one scene. If training images imply insufficient multi-view constraints to recover the scene geometry, SCR models degenerate. We present a probabilistic reinterpretation of training SCR models, which allows us to infuse high-level reconstruction priors. We investigate multiple such priors, ranging from simple priors over the distribution of reconstructed depth values to learned priors over plausible scene coordinate configurations. For the latter, we train a 3D point cloud diffusion model on a large corpus of indoor scans. Our priors push predicted 3D scene points towards a more plausible geometry at each training step to increase their likelihood. On three indoor datasets our priors help learning better scene representations, resulting in more coherent scene point clouds, higher registration rates and better camera poses, with a positive effect on down-stream tasks such as novel view synthesis and camera relocalization.

Thu 23 Oct. 17:45 - 19:45 PDT

#101
Diff2I2P: Differentiable Image-to-Point Cloud Registration with Diffusion Prior

Juncheng Mu · Chengwei REN · Weixiang Zhang · Liang Pan · Xiao-Ping Zhang · Yue Gao

Learning cross-modal correspondences is essential for image-to-point cloud (I2P) registration. Existing methods achieve this mostly by utilizing metric learning to enforce feature alignment across modalities, disregarding the inherent modality gap between image and point data. Consequently, this paradigm struggles to ensure accurate cross-modal correspondences. To this end, inspired by the cross-modal generation success of recent large diffusion models, we propose **Diff$^2$I2P**, a fully **Diff**erentiable **I2P** registration framework, leveraging a novel and effective **Diff**usion prior for bridging the modality gap. Specifically, we propose a Control-Side Score Distillation (CSD) technique to distill knowledge from a depth-conditioned diffusion model to directly optimize the predicted transformation. However, the gradients on the transformation fail to backpropagate onto the cross-modal features due to the non-differentiability of correspondence retrieval and PnP solver. To this end, we further propose a Deformable Correspondence Tuning (DCT) module to estimate the correspondences in a differentiable way, followed by the transformation estimation using a differentiable PnP solver. With these two designs, the Diffusion model serves as a strong prior to guide the cross-modal feature learning of image and point cloud for forming robust correspondences, which significantly improves the registration. Extensive experimental results demonstrate that **Diff$^2$I2P** consistently outperforms state-of-the-art I2P registration methods, achieving over 7 \% improvement in registration recall on the 7-Scenes benchmark. Moreover, **Diff$^2$I2P** exhibits robust and superior scene-agnostic registration performance.

Thu 23 Oct. 17:45 - 19:45 PDT

#102
Resonance: Learning to Predict Social-Aware Pedestrian Trajectories as Co-Vibrations

Conghao Wong · Ziqian Zou · Beihao Xia

Learning to forecast trajectories of intelligent agents has attracted increasing attention recently. However, it remains a challenge to accurately account for agents' intentions and social behaviors when forecasting, and in particular, to simulate the unique randomness within each of those components in an explainable and decoupled way. Inspired by vibration systems and their resonance properties, we propose the Resonance (short for Re) model to encode and forecast pedestrian trajectories in the form of ``co-vibrations''. It decomposes trajectory modifications and randomnesses into multiple vibration portions to simulate agents' reactions to each single cause, and forecasts trajectories as the superposition of these independent vibrations. Also, benefiting from such vibrations and their spectral properties, representations of social interactions can be learned by emulating the resonance phenomena, further enhancing its explainability. Experiments on multiple datasets have verified its usefulness both quantitatively and qualitatively.

Thu 23 Oct. 17:45 - 19:45 PDT

#103
GaussianUpdate: Continual 3D Gaussian Splatting Update for Changing Environments

Lin Zeng · Boming Zhao · Jiarui Hu · Xujie Shen · Ziqiang Dang · Hujun Bao · Zhaopeng Cui

Novel view synthesis with neural models has advanced rapidly in recent years, yet adapting these models to scene changes remains an open problem. Existing methods are either labor-intensive, requiring extensive model retraining, or fail to capture detailed types of changes over time. In this paper, we present GaussianUpdate, a novel approach that combines 3D Gaussian representation with continual learning to address these challenges. Our method effectively updates the Gaussian radiance fields with current data while preserving information from past scenes. Unlike existing methods, GaussianUpdate explicitly models different types of changes through a novel multi-stage update strategy. Additionally, we introduce a visibility-aware continual learning approach with generative replay, enabling self-aware updating without the need to store images. Experiments on the benchmark dataset demonstrate that our method achieves superior, real-time rendering while visualizing changes across different times.

Thu 23 Oct. 17:45 - 19:45 PDT

#104
I2-World: Intra-Inter Tokenization for Efficient Dynamic 4D Scene Forecasting

Zhimin Liao · Ping Wei · Ruijie Zhang · Shuaijia Chen · Haoxuan Wang · Ziyang Ren

Forecasting the evolution of 3D scenes and generating unseen scenarios through occupancy-based world models offers substantial potential to enhance the safety of autonomous driving systems. While tokenization has revolutionized image and video generation, efficiently tokenizing complex 3D scenes remains a critical challenge for 3D world models. To address this, we propose $I^{2}$-World, an efficient framework for 4D occupancy forecasting. Our method decouples scene tokenization into intra-scene and inter-scene tokenizers. The intra-scene tokenizer employs a multi-scale residual quantization strategy to hierarchically compress 3D scenes while preserving spatial details. The inter-scene tokenizer residually aggregates temporal dependencies across timesteps. This dual design retains the compactness of 3D tokenizers while capturing the dynamic expressiveness of 4D approaches. Unlike decoder-only GPT-style autoregressive models, $I^{2}$-World adopts an encoder-decoder architecture. The encoder aggregates spatial context from the current scene and predicts a transformation matrix to guide future scene generation. The decoder, conditioned on this matrix and historical tokens, ensures temporal consistency during generation. Experiments demonstrate that $I^{2}$-World achieves state-of-the-art performance, surpassing existing approaches by $\textbf{41.8}$% in 4D occupancy forecasting with exceptional efficiency—requiring only $\textbf{2.9 GB}$ of training memory and achieving real-time inference at $\textbf{94.8 FPS}$.
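To illustrate the generic mechanism behind a multi-scale residual quantization tokenizer, the sketch below quantizes feature vectors with a cascade of codebooks, each encoding the residual left by the previous stage. Codebook sizes, shapes, and names are assumptions; this is not the $I^{2}$-World tokenizer itself.

```python
import torch

def residual_quantize(x, codebooks):
    """Cascade quantization of feature vectors x (B, D) with codebooks, each (K, D).
    Each stage encodes the residual left by the previous stage."""
    residual, codes, recon = x, [], torch.zeros_like(x)
    for cb in codebooks:
        idx = torch.cdist(residual, cb).argmin(dim=-1)  # nearest codeword per vector
        quantized = cb[idx]                             # (B, D)
        codes.append(idx)
        recon = recon + quantized
        residual = residual - quantized                 # hand the residual to the next stage
    return codes, recon

# Example: 3-stage quantization of 64-d features with 256-entry codebooks.
books = [torch.randn(256, 64) for _ in range(3)]
codes, recon = residual_quantize(torch.randn(8, 64), books)
```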

Thu 23 Oct. 17:45 - 19:45 PDT

#105
InsideOut: Integrated RGB-Radiative Gaussian Splatting for Comprehensive 3D Object Representation

Jungmin Lee · Seonghyuk Hong · Juyong Lee · Jaeyoon Lee · Jongwon Choi

Multi-modal data fusion plays a crucial role in integrating diverse physical properties. While RGB images capture external visual features, they lack internal features, whereas X-ray images reveal internal structures but lack external details. To bridge this gap, we propose \textit{InsideOut}, a novel 3DGS framework that integrates RGB and X-ray data to represent the structure and appearance of objects. Our approach consists of three key components: internal structure training, hierarchical fitting, and detail-preserving refinement. First, RGB and radiative Gaussian splats are trained to capture surface structure. Then, hierarchical fitting ensures scale and positional synchronization between the two modalities. Next, cross-sectional images are incorporated to learn internal structures and refine layer boundaries. Finally, the aligned Gaussian splats receive color from RGB Gaussians, and fine Gaussians are duplicated to enhance surface details. Experiments conducted on a newly collected dataset of paired RGB and X-ray images demonstrate the effectiveness of \textit{InsideOut} in accurately representing internal and external structures.

Thu 23 Oct. 17:45 - 19:45 PDT

#106
Serialization based Point Cloud Oversegmentation

chenghui Lu · Dilong Li · Jianlong Kwan · Ziyi Chen · Haiyan Guan

Point cloud oversegmentation, as a fundamental preprocessing step in point cloud understanding, is a challenging task due to its spatial proximity and semantic similarity requirements. Most existing works struggle to efficiently group semantically consistent points into superpoints while maintaining spatial proximity. In this paper, we propose a novel serialization-based point cloud oversegmentation method, which leverages serialization to avoid complex spatial queries, directly accessing neighboring points through sequence locality for similarity matching and superpoint clustering. Specifically, we first serialize point clouds along a Hilbert curve and spatially-continuously partition them into multiple initial segments. Then, to guarantee the internal semantic consistency of superpoints, we design an adaptive update algorithm that clusters superpoints by matching feature similarities between neighboring segments and updates features via Cross-Attention. Experimental results show that the proposed method achieves state-of-the-art performance in point cloud oversegmentation across multiple large-scale indoor and outdoor datasets. Moreover, the proposed method can be flexibly adapted to the semantic segmentation task, and achieves promising performance.
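The core serialization step can be illustrated with a space-filling curve ordering. The sketch below uses Morton (Z-order) bit interleaving as a simpler stand-in for the Hilbert curve used in the paper, so it preserves the idea that nearby 3D points land at nearby sequence positions, but not the exact curve; function name and grid resolution are assumptions.

```python
import numpy as np

def serialize_points(points: np.ndarray, bits: int = 10) -> np.ndarray:
    """Indices that order points (N, 3) along a Morton (Z-order) curve, so that
    nearby 3D points tend to land at nearby sequence positions."""
    mins, maxs = points.min(0), points.max(0)
    grid = ((points - mins) / np.maximum(maxs - mins, 1e-9)
            * (2**bits - 1)).astype(np.uint64)           # quantize to a 2^bits grid

    def spread(v):                                       # insert two zero bits between bits
        out = np.zeros_like(v)
        for b in range(bits):
            out |= ((v >> np.uint64(b)) & np.uint64(1)) << np.uint64(3 * b)
        return out

    code = (spread(grid[:, 0])
            | (spread(grid[:, 1]) << np.uint64(1))
            | (spread(grid[:, 2]) << np.uint64(2)))
    return np.argsort(code)

# points_sorted = points[serialize_points(points)]   # spatially continuous sequence
```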

Thu 23 Oct. 17:45 - 19:45 PDT

#107
StreamGS: Online Generalizable Gaussian Splatting Reconstruction for Unposed Image Streams

Yang LI · Jinglu Wang · Lei Chu · Xiao Li · Shiu-hong Kao · Ying-Cong Chen · Yan Lu

The advent of 3D Gaussian Splatting (3DGS) has advanced 3D scene reconstruction and novel view synthesis. With the growing interest in interactive applications that need immediate feedback, online 3DGS reconstruction in real-time is in high demand. However, none of the existing methods yet meets this demand due to three main challenges: the absence of predetermined camera parameters, the need for generalizable 3DGS optimization, and the necessity of reducing redundancy. We propose StreamGS, an online generalizable 3DGS reconstruction method for unposed image streams, which progressively transforms image streams into 3D Gaussian streams by predicting and aggregating per-frame Gaussians. Our method overcomes the limitation of the initial point reconstruction (DUSt3R) in tackling out-of-domain (OOD) issues by introducing a content adaptive refinement. The refinement enhances cross-frame consistency by establishing reliable pixel correspondences between adjacent frames. Such correspondences further aid in merging redundant Gaussians through cross-frame feature aggregation. The density of Gaussians is thereby reduced, empowering online reconstruction by significantly lowering computational and memory costs. Extensive experiments on diverse datasets have demonstrated that StreamGS achieves quality on par with optimization-based approaches but does so 150 times faster, and exhibits superior generalizability in handling OOD scenes.

Thu 23 Oct. 17:45 - 19:45 PDT

#108
RIOcc: Efficient Cross-Modal Fusion Transformer with Collaborative Feature Refinement for 3D Semantic Occupancy Prediction

Baojie Fan · Xiaotian Li · Yuhan Zhou · Yuyu Jiang · Jiandong Tian · Huijie Fan

The multi-modal 3D semantic occupancy task provides a comprehensive understanding of the scene and has received considerable attention in the field of autonomous driving. However, existing methods mainly focus on processing large-scale voxels, which bring high computational costs and degrade details. Additionally, they struggle to accurately capture occluded targets and distant information. In this paper, we propose a novel LiDAR-Camera 3D semantic occupancy prediction framework called RIOcc, with collaborative feature refinement and multi-scale cross-modal fusion transformer. Specifically, RIOcc encodes multi-modal data into a unified Bird's Eye View (BEV) space, which reduces computational complexity and enhances the efficiency of feature alignment. Then, multi-scale feature processing substantially expands the receptive fields. Meanwhile, in the LiDAR branch, we design the Dual-branch Pooling (DBP) to adaptively enhance geometric features across both the Channel and Grid dimensions. In the camera branch, the Wavelet and Semantic Encoders are developed to extract high-level semantic features with abundant edge and structural information. Finally, to facilitate effective cross-modal complementarity, we develop the Deformable Dual-Attention (DDA) module. Extensive experiments demonstrate that RIOcc achieves state-of-the-art performance, with 54.2 mIoU and 25.9 mIoU on the Occ3D-nuScenes and nuScenes-Occupancy datasets, respectively.

Thu 23 Oct. 17:45 - 19:45 PDT

#109
Lightweight Gradient-Aware Upscaling of 3D Gaussian Splatting Images

Simon Niedermayr · Christoph Neuhauser · Rüdiger Westermann

We introduce an image upscaling technique tailored for 3D Gaussian Splatting (3DGS) on lightweight GPUs. Compared to 3DGS, it achieves significantly higher rendering speeds and reduces artifacts commonly observed in 3DGS reconstructions. Our technique upscales low-resolution 3DGS renderings with a marginal increase in cost by directly leveraging the analytical image gradients of Gaussians for gradient-based bicubic spline interpolation. The technique is agnostic to the specific 3DGS implementation, achieving novel view synthesis at rates 3×–4× higher than the baseline implementation. Through extensive experiments on multiple datasets, we showcase the performance improvements and high reconstruction fidelity attainable with gradient-aware upscaling of 3DGS images. We further demonstrate the integration of gradient-aware upscaling into the gradient-based optimization of a 3DGS model and analyze its effects on reconstruction quality and performance.
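The idea of using analytical gradients during interpolation can be seen in a 1D toy example: cubic Hermite interpolation blends sample values with their derivatives, which is the spirit of gradient-based upscaling. This is not the paper's 2D splatting integration; the signal and scaling below are assumptions.

```python
import numpy as np

def cubic_hermite(y0, y1, d0, d1, t):
    """Interpolate between samples y0 and y1 using their derivatives d0, d1
    (expressed per unit sample step) at fractional position t in [0, 1]."""
    t2, t3 = t * t, t * t * t
    h00, h10 = 2 * t3 - 3 * t2 + 1, t3 - 2 * t2 + t
    h01, h11 = -2 * t3 + 3 * t2, t3 - t2
    return h00 * y0 + h10 * d0 + h01 * y1 + h11 * d1

# Double the sampling rate of a 1D signal using known values and derivatives.
x = np.linspace(0, 2 * np.pi, 16)
y, dy = np.sin(x), np.cos(x) * (x[1] - x[0])             # derivative per sample step
midpoints = cubic_hermite(y[:-1], y[1:], dy[:-1], dy[1:], 0.5)
```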

Thu 23 Oct. 17:45 - 19:45 PDT

#110
TeethGenerator: A two-stage framework for paired pre- and post-orthodontic 3D dental data generation

Changsong Lei · Yaqian Liang · Shaofeng Wang · Jiajia Dai · Yong-Jin Liu

Digital orthodontics represents a prominent and critical application of computer vision technology in the medical field. However, the labor-intensive process of collecting clinical data, particularly acquiring paired 3D orthodontic teeth models, constitutes a crucial bottleneck for developing tooth arrangement neural networks. Although numerous general 3D shape generation models have been proposed, most of them focus on single-object generation and are insufficient for generating anatomically structured teeth models, each comprising 24-32 segmented teeth. In this paper, we propose TeethGenerator, a novel two-stage framework designed to synthesize paired 3D teeth models pre- and post-orthodontic, aiming to facilitate the training of downstream tooth arrangement networks. Specifically, our approach consists of two key modules: (1) a teeth shape generation module that leverages a diffusion model to learn the distribution of morphological characteristics of teeth, enabling the generation of diverse post-orthodontic teeth models; and (2) a teeth style generation module that synthesizes corresponding pre-orthodontic teeth models by incorporating desired styles as conditional inputs. Extensive qualitative and quantitative experiments demonstrate that our synthetic dataset aligns closely with the distribution of real orthodontic data, and promotes tooth alignment performance significantly when combined with real data for training. The code and dataset will be made publicly available.

Thu 23 Oct. 17:45 - 19:45 PDT

#111
Online Language Splatting

Saimouli Katragadda · Cho-Ying Wu · Yuliang Guo · Xinyu Huang · Guoquan Huang · Liu Ren

To enable AI agents to interact seamlessly with both humans and 3D environments, they must not only perceive the 3D world accurately but also align human language with 3D spatial representations. While prior work has made significant progress by integrating language features into geometrically detailed 3D scene representations using 3D Gaussian Splatting (GS), these approaches rely on computationally intensive offline preprocessing of language features for each input image, limiting adaptability to new environments. In this work, we introduce Online Language Splatting, the first framework to achieve online, near real-time, open-vocabulary language mapping within a 3DGS-SLAM system without requiring pre-generated language features. The key challenge lies in efficiently fusing high-dimensional language features into 3D representations while balancing the computation speed, memory usage, rendering quality and open-vocabulary capability. To this end, we innovatively design: (1) a high-resolution CLIP embedding module capable of generating detailed language feature maps in 18ms per frame, (2) a two-stage online auto-encoder that compresses 768-dimensional CLIP features to 15 dimensions while preserving open-vocabulary capabilities, and (3) a color-language disentangled optimization approach to improve rendering quality. Experimental results show that our online method not only surpasses the state-of-the-art offline methods in accuracy but also achieves more than $40\times$ efficiency boost, demonstrating the potential for dynamic and interactive AI applications.
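As a toy illustration of compressing 768-dimensional CLIP features into a 15-dimensional latent, a small autoencoder is sketched below. The layer widths, loss, and single-stage training are assumptions, not the paper's two-stage online design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClipFeatureAE(nn.Module):
    """Toy autoencoder: 768-d CLIP features -> 15-d latent -> 768-d reconstruction."""
    def __init__(self, in_dim: int = 768, latent_dim: int = 15):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                                     nn.Linear(128, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                     nn.Linear(128, in_dim))

    def forward(self, feats):
        return self.decoder(self.encoder(feats))

model = ClipFeatureAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
feats = torch.randn(32, 768)                 # stand-in for per-pixel CLIP features
# A cosine loss preserves relative similarities, which open-vocabulary queries rely on.
loss = 1 - F.cosine_similarity(model(feats), feats, dim=-1).mean()
loss.backward()
opt.step()
```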

Thu 23 Oct. 17:45 - 19:45 PDT

#112
Probabilistic Inertial Poser (ProbIP): Uncertainty-aware Human Motion Modeling from Sparse Inertial Sensors

Min Kim · Younho Jeon · Sungho Jo

Wearable Inertial Measurement Units (IMUs) allow non-intrusive motion tracking, but limited sensor placements can introduce uncertainty in capturing detailed full-body movements. Existing methods mitigate this issue by selecting more physically plausible motion patterns but do not directly address inherent uncertainties in the data. We introduce the Probabilistic Inertial Poser (ProbIP), a novel probabilistic model that transforms sparse IMU data into human motion predictions without physical constraints. ProbIP utilizes RU-Mamba blocks to predict a matrix Fisher distribution over rotations, effectively estimating both rotation matrices and associated uncertainties. To refine motion distribution through layers, our Progressive Distribution Narrowing (PDN) technique enables stable learning across a diverse range of motions. Experimental results demonstrate that ProbIP achieves state-of-the-art performance on multiple public datasets with six IMU sensors and yields competitive outcomes even with fewer sensors. Our contributions include the development of ProbIP with RU-Mamba blocks for probabilistic motion estimation, applying PDN for uncertainty reduction, and evidence of superior results with six and reduced sensor configurations.
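For background on the matrix Fisher distribution over rotations that ProbIP predicts, the sketch below extracts the distribution's mode (the most likely rotation) from its 3x3 parameter matrix via an SVD projection onto SO(3). The network producing this parameter and the PDN scheme are not shown; the function name is an assumption.

```python
import numpy as np

def matrix_fisher_mode(F: np.ndarray) -> np.ndarray:
    """Mode of a matrix Fisher distribution over SO(3) with 3x3 parameter F:
    the proper rotation closest to F (special orthogonal Procrustes).
    The singular values of F act as concentration parameters; larger values
    indicate lower rotational uncertainty about the corresponding axes."""
    U, s, Vt = np.linalg.svd(F)
    D = np.diag([1.0, 1.0, np.linalg.det(U @ Vt)])  # flip a sign if needed so det = +1
    return U @ D @ Vt

# R_hat = matrix_fisher_mode(np.random.randn(3, 3))
```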

Thu 23 Oct. 17:45 - 19:45 PDT

#113
Towards Accurate and Efficient 3D Object Detection for Autonomous Driving: A Mixture of Experts Computing System on Edge

Linshen Liu · Boyan Su · Junyue Jiang · Guanlin Wu · Cong Guo · Ceyu Xu · Hao Frank Yang

This paper introduces the Edge-based Mixture-of-Experts (MoE) Collaborative Computing (EMC2) system, the first multimodal MoE framework designed to address the conflicting requirements of low latency and high accuracy in diverse traffic scenarios for autonomous driving safety. EMC2’s key innovation is its scenario-aware computing architecture optimized for edge devices, which adaptively fuses LiDAR and image inputs by leveraging the complementary strengths of sparse 3D point clouds and dense 2D pixel grids. Specifically, an adaptive multimodal data bridge is designed that preprocesses LiDAR and image data using customized multi-scale pooling. A scenario-adaptive dispatcher then routes these fused features to specialized experts based on object clarity and distance. Three collaborative expert models with complementary encoder-decoder architectures are designed and trained using a novel hierarchical multimodal loss and balanced sampling strategies. In the inference stage, EMC2 incorporates hardware-software co-optimization, spanning CPU thread allocation, GPU memory management, and computational graph optimization, to collaboratively enable efficient deployment on edge computing devices. Extensive evaluations conducted on open-source datasets demonstrate EMC2's superior performance, achieving an average accuracy improvement of 3.58% and an impressive 159.06% inference speedup compared to 15 leading methods on Jetson platforms. Such enhancements clearly meet the real-time operational expectations for autonomous vehicles, directly contributing to safer future transportation.

Thu 23 Oct. 17:45 - 19:45 PDT

#114
Highlight
Degradation-Modeled Multipath Diffusion for Tunable Metalens Photography

Jianing Zhang · Jiayi Zhu · Feiyu Ji · Xiaokang Yang · Xiaoyun Yuan

Metalenses offer significant potential for ultra-compact computational imaging but face challenges from complex optical degradation and computational restoration difficulties. Existing methods typically rely on precise optical calibration or massive paired datasets, which are non-trivial for real-world imaging systems. Furthermore, lack of control over the inference process often results in undesirable hallucinated artifacts. We introduce Degradation-Modeled Multipath Diffusion for tunable metalens photography, leveraging powerful natural image priors from pretrained models instead of large datasets. Our framework uses positive, neutral, and negative-prompt paths to balance high-frequency detail generation, structural fidelity, and suppression of metalens-specific degradation, alongside pseudo data augmentation. A tunable decoder enables controlled trade-offs between fidelity and perceptual quality. Additionally, a spatially varying degradation-aware attention (SVDA) module adaptively models complex optical and sensor-induced degradation. Finally, we design and build a millimeter-scale MetaCamera for real-world validation. Extensive results show that our approach outperforms state-of-the-art methods, achieving high-fidelity and sharp image reconstruction. More materials: https://dmdiff.github.io/.

Thu 23 Oct. 17:45 - 19:45 PDT

#115
GS-Occ3D: Scaling Vision-only Occupancy Reconstruction with Gaussian Splatting

Baijun Ye · Minghui Qin · Saining Zhang · Moonjun Gong · Shaoting Zhu · Hao Zhao · Hang Zhao

Occupancy is crucial for autonomous driving, providing essential geometric priors for perception and planning. However, existing methods predominantly rely on LiDAR-based occupancy annotations, which limits scalability and prevents leveraging vast amounts of potential crowdsourced data for auto-labeling. To address this, we propose GS-Occ3D, a scalable vision-only framework that directly reconstructs occupancy. Vision-only occupancy reconstruction poses significant challenges due to sparse viewpoints, dynamic scene elements, severe occlusions, and long-horizon motion. Existing vision-based methods primarily rely on mesh representations, which suffer from incomplete geometry and additional post-processing, limiting scalability. To overcome these issues, GS-Occ3D optimizes an explicit occupancy representation using an Octree-based Gaussian Surfel formulation, ensuring efficiency and scalability. Additionally, we decompose scenes into static background, ground, and dynamic objects, enabling tailored modeling strategies: (1) Ground is explicitly reconstructed as a dominant structural element, significantly improving large-area consistency; (2) Dynamic vehicles are separately modeled to better capture motion-related occupancy patterns. Extensive experiments on the Waymo dataset demonstrate that GS-Occ3D achieves state-of-the-art geometry reconstruction results. We successfully curate vision-only binary occupancy ground truth across diverse urban scenes and validate its effectiveness for downstream occupancy models on the Occ3D-Waymo dataset. Our results highlight the potential of large-scale vision-based occupancy reconstruction as a new paradigm for autonomous driving perception ground truth curation.

Thu 23 Oct. 17:45 - 19:45 PDT

#116
Highlight
MetaScope: Optics-Driven Neural Network for Ultra-Micro Metalens Endoscopy

Wuyang Li · Wentao Pan · Xiaoyuan Liu · Zhendong Luo · Chenxin Li · Hengyu Liu · Din Tsai · Mu Chen · Yixuan Yuan

Miniaturized endoscopy has advanced accurate visual perception within the human body. Prevailing research remains limited to conventional cameras employing convex lenses, whose physical constraints, such as millimetre-scale thickness, impose serious impediments on micro-level clinical applications. Recently, with the emergence of meta-optics, ultra-micro imaging based on metalenses (micron-scale) has garnered great attention, serving as a promising solution. However, due to the physical differences of metalenses, there is a large gap in data acquisition and algorithm research. In light of this, we aim to bridge this unexplored gap, advancing novel metalens endoscopy. First, we establish datasets for metalens endoscopy and conduct preliminary optical simulation, identifying two derived optical issues that physically adhere to strong optical priors. Second, we propose MetaScope, a novel optics-driven neural network tailored for metalens endoscopy driven by physical optics. MetaScope comprises two novel designs: Optics-informed Intensity Adjustment (OIA), rectifying intensity decay by learning optical embeddings, and Optics-informed Chromatic Correction (OCC), mitigating chromatic aberration by learning spatial deformations informed by learned Point Spread Function (PSF) distributions. To enhance joint learning, we deploy a gradient-guided distillation to adaptively transfer knowledge from the foundational model. Extensive experiments demonstrate that our method surpasses state-of-the-art methods in metalens segmentation and restoration by a large margin. Data, codes, and models will be made publicly available.

Thu 23 Oct. 17:45 - 19:45 PDT

#117
CoLMDriver: LLM-based Negotiation Benefits Cooperative Autonomous Driving

Changxing Liu · Genjia Liu · Zijun Wang · Jinchang Yang · Siheng Chen

Vehicle-to-vehicle (V2V) cooperative autonomous driving holds great promise for improving safety by addressing the perception and prediction uncertainties inherent in single-agent systems. However, traditional cooperative methods are constrained by rigid collaboration protocols and limited generalization to unseen interactive scenarios. While LLM-based approaches offer generalized reasoning capabilities, their challenges in spatial planning and unstable inference latency hinder their direct application in cooperative driving. To address these limitations, we propose CoLMDriver, the first full-pipeline LLM-based cooperative driving system, enabling effective language-based negotiation and real-time driving control. CoLMDriver features a parallel driving pipeline with two key components: (i) an LLM-based negotiation module under an actor-critic paradigm, which continuously refines cooperation policies through feedback from previous decisions of all vehicles; and (ii) an intention-guided waypoint generator, which translates negotiation outcomes into executable waypoints. Additionally, we introduce InterDrive, a CARLA-based simulation benchmark comprising 10 challenging interactive driving scenarios for evaluating V2V cooperation. Experimental results demonstrate that CoLMDriver significantly outperforms existing approaches, achieving an 11% higher success rate across diverse highly interactive V2V driving scenarios. The code will be released.

Thu 23 Oct. 17:45 - 19:45 PDT

#118
A View-consistent Sampling Method for Regularized Training of Neural Radiance Fields

Aoxiang Fan · Corentin Dumery · Nicolas Talabot · Pascal Fua

Neural Radiance Fields (NeRF) has emerged as a compelling framework for scene representation and 3D recovery. To improve its performance on real-world data, depth regularizations have proven to be the most effective ones. However, depth estimation models not only require expensive 3D supervision in training, but also suffer from generalization issues. As a result, the depth estimations can be erroneous in practice, especially for outdoor unbounded scenes. In this paper, we propose to employ view-consistent distributions instead of fixed depth value estimations to regularize NeRF training. Specifically, the distribution is computed by utilizing both low-level color features and high-level distilled features from foundation models at the projected 2D pixel-locations from per-ray sampled 3D points. By sampling from the view-consistency distributions, an implicit regularization is imposed on the training of NeRF. We also propose a novel depth-pushing loss that works in conjunction with the sampling technique to jointly provide effective regularizations for eliminating the failure modes. Extensive experiments conducted on various scenes from public datasets demonstrate that our proposed method can generate significantly better novel view synthesis results than state-of-the-art NeRF variants as well as different depth regularization methods.

Thu 23 Oct. 17:45 - 19:45 PDT

#119
Free-running vs Synchronous: Single-Photon Lidar for High-flux 3D Imaging

Ruangrawee Kitichotkul · Shashwath Bharadwaj · Joshua Rapp · Yanting Ma · Alexander Mehta · Vivek Goyal

Conventional wisdom suggests that single-photon lidar (SPL) should operate in low-light conditions to minimize dead-time effects. Many methods have been developed to mitigate these effects in synchronous SPL systems. However, solutions for free-running SPL remain limited despite the advantage of reduced histogram distortion from dead times. To improve the accuracy of free-running SPL, we propose a computationally efficient joint maximum likelihood estimator of the signal flux, the background flux, and the depth, along with a complementary regularization framework that incorporates a learned point cloud score model as a prior. Simulations and experiments demonstrate that free-running SPL yields lower estimation errors than its synchronous counterpart under identical conditions, with our regularization further improving accuracy.

Farthest Point Sampling (FPS) is widely used in existing point-based models because it effectively preserves structural integrity during downsampling. However, it incurs significant computational overhead, severely impacting the model's inference efficiency. Random sampling and grid sampling are considered \textbf{faster downsampling methods}; however, these fast downsampling methods may lead to the loss of geometric information during the downsampling process due to their overly simplistic and fixed rules, which can negatively affect model performance. To address this issue, we propose FastAdapter, which aggregates local contextual information through a small number of anchor points and facilitates interactions across spatial and layer dimensions, ultimately feeding this information back into the downsampled point cloud to mitigate the information degradation caused by fast downsampling methods. In addition to using FastAdapter to enhance model performance in methods that already employ fast downsampling, we aim to explore a more challenging yet valuable application scenario. Specifically, we focus on pre-trained models that utilize FPS, embedding FastAdapter and replacing FPS with random sampling for lightweight fine-tuning. This approach aims to significantly improve inference speed while maintaining relatively unchanged performance. Experimental results on ScanNet, S3DIS, and SemanticKITTI demonstrate that our method effectively mitigates the geometric information degradation caused by fast downsampling.

Thu 23 Oct. 17:45 - 19:45 PDT

#121
Noise2Score3D: Tweedie's Approach for Unsupervised Point Cloud Denoising

Xiangbin Wei · Yuanfeng Wang · Ao XU · Lingyu Zhu · Dongyong Sun · Keren Li · Yang Li · Qi Qin

Building on recent advances in Bayesian statistics and image denoising, we propose Noise2Score3D, a fully unsupervised framework for point cloud denoising. Noise2Score3D learns the score function of the underlying point cloud distribution directly from noisy data, eliminating the need for clean data during training. Using Tweedie's formula, our method performs denoising in a single step, avoiding the iterative processes used in existing unsupervised methods, thus improving both accuracy and efficiency. Additionally, we introduce Total Variation for Point Clouds as a denoising quality metric, which allows for the estimation of unknown noise parameters. Experimental results demonstrate that Noise2Score3D achieves state-of-the-art performance on standard benchmarks among unsupervised learning methods in Chamfer distance and point-to-mesh metrics. Noise2Score3D also demonstrates strong generalization ability beyond training datasets. Our method, by addressing the generalization issue and challenge of the absence of clean data in learning-based methods, paves the way for learning-based point cloud denoising methods in real-world applications.
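
For context, Tweedie's formula referenced above recovers the posterior mean of a Gaussian-corrupted sample in a single step from the learned score; the standard form is sketched below (the Gaussian noise model and notation are the usual assumptions, not taken verbatim from the paper).

$\hat{\mathbf{x}} = \mathbf{y} + \sigma^{2}\,\nabla_{\mathbf{y}}\log p(\mathbf{y}),$

where $\mathbf{y}$ is a noisy point with noise level $\sigma$ and $\nabla_{\mathbf{y}}\log p(\mathbf{y})$ is the score estimated by the network, which is why denoising requires no iterative refinement.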

Thu 23 Oct. 17:45 - 19:45 PDT

#122
ArchiSet: Benchmarking Editable and Consistent Single-View 3D Reconstruction of Buildings with Specific Window-to-Wall Ratios

Jun Yin · Pengyu Zeng · Licheng Shen · Miao Zhang · Jing Zhong · Yuxing Han · Shuai Lu

Image-based 3D reconstruction has made significant progress in typical scenarios, achieving high fidelity in capturing intricate textures. However, in the Architecture, Engineering, and Construction (AEC) design stages, existing technologies still face considerable challenges, particularly in handling specific window-to-wall ratios, ensuring window detail consistency, and enabling interactive editing. To address this research gap and encourage greater community attention on this practical architectural design problem, we propose a new task: Editable and Consistent Single-View 3D Reconstruction of Buildings with Specific Window-to-Wall Ratios. To accomplish this: 1) We introduce the ArchiSet dataset, the first public, real-world architectural design dataset, including 13,728 3D building forms in the format of point clouds, voxels, meshes, and window-to-wall ratio information, providing comprehensive support for 3D architectural design research. The dataset also contains over 1,482,624 images in three types—sketches, color block diagrams, and renderings—accompanied by paired window masks for detailed evaluation. 2) We evaluated state-of-the-art single-view 3D reconstruction algorithms on ArchiSet, identifying several limitations, such as the loss of volumetric detail, incomplete window details, and limited editability. 3) We introduce BuildingMesh, a diffusion model specifically designed for generating and editing 3D architectural forms from a single image with customizable window-to-wall ratios, suitable for dynamic architectural design workflows. We propose a regularized method to ensure window consistency. Our framework also includes an interactive module for easy further editing, enhancing platform efficiency and accuracy in professional architectural design workflows. Experimental results demonstrate that BuildingMesh achieves high-quality 3D generation with improved design flexibility and accuracy.

The development of aerial holistic scene understanding algorithms is hindered by the scarcity of comprehensive datasets that enable both semantic and geometric reconstruction. While synthetic datasets offer an alternative, existing options exhibit task-specific limitations, unrealistic scene compositions, and rendering artifacts that compromise real-world applicability. We introduce ClaraVid, a synthetic aerial dataset specifically designed to overcome these limitations. Comprising 16,917 high-resolution images captured at 4032×3024 from multiple viewpoints across diverse landscapes, ClaraVid provides dense depth maps, panoptic segmentation, sparse point clouds, and dynamic object masks, while mitigating common rendering artifacts. To further advance neural reconstruction, we introduce the Delentropic Scene Profile (DSP), a novel complexity metric derived from differential entropy analysis, designed to quantitatively assess scene difficulty and inform reconstruction tasks. Utilizing DSP, we systematically benchmark neural reconstruction methods, uncovering a consistent, measurable correlation between scene complexity and reconstruction accuracy. Empirical results indicate that higher delentropy strongly correlates with increased reconstruction errors, validating DSP as a reliable complexity prior. ClaraVid will be publicly released to support UAV research.

Thu 23 Oct. 17:45 - 19:45 PDT

#124
Highlight
Discontinuity-aware Normal Integration for Generic Central Camera Models

Francesco Milano · Manuel Lopez-Antequera · Naina Dhingra · Roland Siegwart · Robert Thiel

Recovering a 3D surface from its surface normal map, a problem known as normal integration, is a key component for photometric shape reconstruction techniques such as shape-from-shading and photometric stereo. The vast majority of existing approaches for normal integration handle only implicitly the presence of depth discontinuities and are limited to orthographic or ideal pinhole cameras. In this paper, we propose a novel formulation that allows modeling discontinuities explicitly and handling generic central cameras. Our key idea is based on a local planarity assumption, that we model through constraints between surface normals and ray directions. Compared to existing methods, our approach more accurately approximates the relation between depth and surface normals, achieves state-of-the-art results on the standard normal integration benchmark, and is the first to directly handle generic central camera models.

Thu 23 Oct. 17:45 - 19:45 PDT

#125
Outdoor Monocular SLAM with Global Scale-Consistent 3D Gaussian Pointmaps

Chong Cheng · Sicheng Yu · Zijian Wang · Yifan Zhou · Hao Wang

3D Gaussian Splatting (3DGS) has become a popular solution in SLAM due to its high-fidelity and real-time novel view synthesis performance. However, some previous 3DGS SLAM methods employ a differentiable rendering pipeline for tracking but lack geometric priors in outdoor scenes. Other approaches introduce separate tracking modules, but they accumulate errors with significant camera movement, leading to scale drift. To address these challenges, we propose a robust RGB-only outdoor 3DGS SLAM method: S3PO-GS. Technically, we establish a self-consistent tracking module anchored in the 3DGS pointmap, which avoids cumulative scale drift and achieves more precise and robust tracking with fewer iterations. Additionally, we design a patch-based pointmap dynamic mapping module, which introduces geometric priors while avoiding scale ambiguity. This significantly enhances tracking accuracy and the quality of scene reconstruction, making it particularly suitable for complex outdoor environments. Our experiments on the Waymo, KITTI, and DL3DV datasets demonstrate that S3PO-GS achieves state-of-the-art results in novel view synthesis and outperforms other 3DGS SLAM methods in tracking accuracy.

Thu 23 Oct. 17:45 - 19:45 PDT

#126
SEHDR: Single-Exposure HDR Novel View Synthesis via 3D Gaussian Bracketing

Yiyu Li · Haoyuan Wang · Ke Xu · Gerhard Hancke · Rynson W.H. Lau

This paper presents SeHDR, a novel high dynamic range 3D Gaussian Splatting (HDR-3DGS) approach for generating HDR novel views given multi-view LDR images. Unlike existing methods that typically require the multi-view LDR input images to be captured from different exposures, which are tedious to capture and more likely to suffer from errors (e.g., object motion blurs and calibration/alignment inaccuracies), our approach learns the HDR scene representation from multi-view LDR images of a single exposure. Our key insight to this ill-posed problem is that by first estimating Bracketed 3D Gaussians (i.e., with different exposures) from single-exposure multi-view LDR images, we may then be able to merge these bracketed 3D Gaussians into an HDR scene representation. Specifically, SeHDR first learns base 3D Gaussians from single-exposure LDR inputs, where the spherical harmonics parameterize colors in a linear color space. We then estimate multiple 3D Gaussians with identical geometry but varying linear colors conditioned on exposure manipulations. Finally, we propose the Differentiable Neural Exposure Fusion (NeEF) to integrate the base and estimated 3D Gaussians into HDR Gaussians for novel view rendering. Extensive experiments demonstrate that SeHDR outperforms existing methods as well as carefully designed baselines.

Information quantization has been widely adopted in multimedia content, such as images, videos, and point clouds. The goal of information quantization is to achieve efficient storage and transmission by reducing data precision or redundancy. However, the information distortion caused by quantization will lead to the degradation of signal fidelity and the performance of downstream tasks. This paper focuses on the geometry quantization distortion of point clouds and proposes a unified learning-based quality enhancement framework for omni-scene point clouds. Based on the characteristics of geometry quantization distortion, we analyze and find that existing upsampling methods are not competitive in dealing with point reduction and geometry displacement caused by coordinate quantization. Therefore, we design a general rooting-growing-pruning paradigm to efficiently perceive the geometry feature of quantized point clouds and improve the quality significantly. In addition, a novel loss constraint term related to the quantization step parameter is proposed to further improve quality and accelerate model convergence. To the best of our knowledge, this is the first unified quality enhancement framework for object and scene point clouds with coordinate quantization. Extensive experiments verify the superiority of the proposed method on multi-scale point clouds with different levels of quantization distortion, including object (ModelNet40, 8iVFB) and scene (S3DIS, KITTI). In particular, the enhanced point clouds improve the performance of downstream analysis tasks, including classification and 3D object detection.

Thu 23 Oct. 17:45 - 19:45 PDT

#128
SL2A-INR: Single-Layer Learnable Activation for Implicit Neural Representation

Reza Rezaeian · Moein Heidari · Reza Azad · Dorit Merhof · Hamid Soltanian-Zadeh · Ilker Hacihaliloglu

Implicit Neural Representation (INR), leveraging a neural network to transform coordinate input into corresponding attributes, has recently driven significant advances in several vision-related domains. However, the performance of INR is heavily influenced by the choice of the nonlinear activation function used in its multilayer perceptron (MLP) architecture. To date, multiple nonlinearities have been investigated, but current INRs still face limitations in capturing high-frequency components and diverse signal types. We show that these challenges can be alleviated by introducing a novel approach in INR architecture. Specifically, we propose SL$^{2}$A-INR, a hybrid network that combines a single-layer learnable activation function with an MLP that uses traditional ReLU activations. Our method achieves superior performance across diverse tasks, including image representation, 3D shape reconstruction, and novel view synthesis. Through comprehensive experiments, SL$^{2}$A-INR sets new benchmarks in accuracy, quality, and robustness for INR.

Thu 23 Oct. 17:45 - 19:45 PDT

#129
TARS: Traffic-Aware Radar Scene Flow Estimation

Jialong Wu · Marco Braun · Dominic Spata · Matthias Rottmann

Scene flow provides crucial motion information for autonomous driving. Recent LiDAR scene flow models utilize the rigid-motion assumption at the instance level, assuming objects are rigid bodies. However, these instance-level methods are not suitable for sparse radar point clouds. In this work, we present a novel Traffic-Aware Radar Scene flow estimation method, named TARS, which utilizes the motion rigidity at the traffic level. To address the challenges in radar scene flow, we perform object detection and scene flow jointly and boost the latter. We incorporate the feature map from the object detector, trained with detection losses, to make radar scene flow aware of the environment and road users. Therefrom, we construct a Traffic Vector Field (TVF) in the feature space, enabling a holistic traffic-level scene understanding in our scene flow branch. When estimating the scene flow, we consider both point-level motion cues from point neighbors and traffic-level consistency of rigid motion within the space. TARS outperforms the state of the art on a proprietary dataset and the View-of-Delft dataset, improving the benchmarks by 23% and 15%, respectively.

Thu 23 Oct. 17:45 - 19:45 PDT

#130
DoppDrive: Doppler-Driven Temporal Aggregation for Improved Radar Object Detection

Yuval Haitman · Oded Bialer

Radar-based object detection is essential for autonomous driving due to radar's long detection range. However, the sparsity of radar point clouds, especially at long range, poses challenges for accurate detection. Existing methods increase point density through temporal aggregation with ego-motion compensation, but this approach introduces scatter from dynamic objects, degrading detection performance. We propose DoppDrive, a novel Doppler-Driven temporal aggregation method that enhances radar point cloud density while minimizing scatter. Points from previous frames are shifted radially according to their dynamic Doppler component to eliminate radial scatter, with each point assigned a unique aggregation duration based on its Doppler and angle to minimize tangential scatter. DoppDrive is a point cloud density enhancement step applied before detection, compatible with any detector, and we demonstrate that it significantly improves object detection performance across various detectors and datasets.

Thu 23 Oct. 17:45 - 19:45 PDT

#131
FVGen: Accelerating Novel-View Synthesis with Adversarial Video Diffusion Distillation

Wenbin Teng · Gonglin Chen · Haiwei Chen · Yajie Zhao

Recent progress in 3D reconstruction has enabled realistic 3D models from dense image captures, yet challenges persist with sparse views, often leading to artifacts in unseen areas. Recent works leverage Video Diffusion Models (VDMs) to generate dense observations, filling the gaps when only sparse views are available for 3D reconstruction tasks. A significant limitation of these methods is their slow sampling speed when using VDMs. In this paper, we present FVGen, a novel framework that addresses this challenge by enabling fast novel view synthesis using VDMs in as few as 4 sampling steps. We propose a novel video diffusion model distillation method that distills a multi-step denoising teacher model into a few-step denoising student model using Generative Adversarial Networks (GANs) and softened reverse KL-divergence minimization. Extensive experiments on real-world datasets show that, compared to prior works, our framework generates the same number of novel views with similar (or even better) visual quality while reducing sampling time by more than 90\%. FVGen significantly improves time efficiency for downstream reconstruction tasks, particularly when working with sparse input views (more than 2) where pre-trained VDMs need to be run multiple times to achieve better spatial coverage. Our code will be released upon acceptance of the paper.

Thu 23 Oct. 17:45 - 19:45 PDT

#132
Simulating Dual-Pixel Images From Ray Tracing For Depth Estimation

Fengchen He · Dayang Zhao · Hao Xu · Tingwei Quan · Shaoqun zeng

Many studies utilize dual-pixel (DP) sensor phase characteristics for various applications, such as depth estimation and deblurring. However, since the DP image features are entirely determined by the camera hardware, DP-depth paired datasets are very scarce, especially when performing depth estimation on customized cameras. To overcome this, studies simulate DP images using ideal optical system models. However, these simulations often violate real optical propagation laws, leading to poor generalization to real DP data. To address this, we investigate the domain gap between simulated and real DP data, and propose solutions using the Simulating DP images from ray tracing (Sdirt) scheme. The Sdirt scheme generates realistic DP images via ray tracing and integrates them into the depth estimation training pipeline. Experimental results show that models trained with Sdirt-simulated images generalize better to real DP data. The code and simulated datasets will be available on GitHub.

Thu 23 Oct. 17:45 - 19:45 PDT

#133
Leaps and Bounds: An Improved Point Cloud Winding Number Formulation for Fast Normal Estimation and Surface Reconstruction

Chamin Hewa Koneputugodage · Dylan Campbell · Stephen Gould

Recent methods for point cloud surface normal estimation predominantly use the generalized winding number field induced by the normals. Optimizing the field towards satisfying desired properties, such as the input points being on the surface defined by the field, provides a principled way to obtain globally consistent surface normals. However, we show that the existing winding number formulation for point clouds is a poor approximation near the input surface points, diverging as the query point approaches a surface point. This is problematic for methods that rely on the accuracy and stability of this approximation, requiring heuristics to compensate. Instead, we derive a more accurate approximation that is properly bounded and converges to the correct value. We then examine two distinct approaches that optimize for globally consistent normals using point cloud winding numbers. We show how the original unbounded formulation influences key design choices in both methods and demonstrate that substituting our formulation yields substantive improvements with respect to normal estimation and surface reconstruction accuracy.
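
For orientation, the commonly used point-cloud winding number approximation that the paper analyzes is the discretized solid-angle sum below; it is this form that diverges as the query $\mathbf{q}$ approaches an input point, and the paper's bounded replacement is not reproduced here.

$w(\mathbf{q}) \approx \sum_{i=1}^{N} \frac{a_i\,(\mathbf{p}_i-\mathbf{q})\cdot\mathbf{n}_i}{4\pi\,\lVert\mathbf{p}_i-\mathbf{q}\rVert^{3}},$

where $\mathbf{p}_i$, $\mathbf{n}_i$, and $a_i$ are the point positions, unit normals, and local area weights; the $\lVert\mathbf{p}_i-\mathbf{q}\rVert^{-3}$ factor is what blows up near the samples.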

Thu 23 Oct. 17:45 - 19:45 PDT

#134
GVDepth: Zero-Shot Monocular Depth Estimation for Ground Vehicles based on Probabilistic Cue Fusion

Karlo Koledic · Luka Petrovic · Ivan Marković · Ivan Petrovic

Generalizing metric monocular depth estimation presents a significant challenge due to its ill-posed nature, while the entanglement between camera parameters and depth amplifies issues further, hindering multi-dataset training and zero-shot accuracy. This challenge is particularly evident in autonomous vehicles and mobile robotics, where data is collected with fixed camera setups, limiting the geometric diversity. Yet, this context also presents an opportunity: the fixed relationship between the camera and the ground plane imposes additional perspective geometry constraints, enabling depth regression via vertical image positions of objects. However, this cue is highly susceptible to overfitting, thus we propose a novel canonical representation that maintains consistency across varied camera setups, effectively disentangling depth from specific parameters and enhancing generalization across datasets. We also propose a novel architecture that adaptively and probabilistically fuses depths estimated via object size and vertical image position cues. A comprehensive evaluation demonstrates the effectiveness of the proposed approach on five autonomous driving datasets, achieving accurate metric depth estimation for varying resolutions, aspect ratios and camera setups. Notably, we achieve comparable accuracy to existing zero-shot methods, despite training on a single dataset with a single-camera setup.
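
As background for the vertical-position cue mentioned above, the classical flat-ground, level-camera pinhole relation is shown below; this is the standard geometric result only, with GVDepth's canonical representation and probabilistic fusion omitted.

$Z = \frac{f_y\,h}{v - v_0},$

where $h$ is the camera height above the ground plane, $f_y$ the vertical focal length, $v$ the image row of the object's ground-contact point, and $v_0$ the horizon row; because $Z$ depends directly on the camera setup through $h$ and $f_y$, this cue overfits easily unless the representation disentangles it, as the abstract argues.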

Thu 23 Oct. 17:45 - 19:45 PDT

#135
InterGSEdit: Interactive 3D Gaussian Splatting Editing with 3D Geometry-Consistent Attention Prior

Minghao Wen · Shengjie Wu · Kangkan Wang · Dong Liang

3D Gaussian Splatting (3DGS)-based 3D editing has demonstrated impressive performance in recent years. However, multi-view editing often exhibits significant local inconsistency, especially in areas of non-rigid deformation, which leads to local artifacts, texture blurring, or semantic variations in edited 3D scenes. We also found that existing editing methods, which rely entirely on text prompts, make the editing process a "one-shot deal", making it difficult for users to control the editing degree flexibly. In response to these challenges, we present InterGSEdit, a novel framework for high-quality 3DGS editing via interactively selecting key views according to users' preferences. We propose a CLIP-based Semantic Consistency Selection (CSCS) strategy to adaptively screen a group of semantically consistent reference views for each user-selected key view. Then, the cross-attention maps derived from the reference views are used in a weighted Gaussian Splatting unprojection to construct the 3D Geometry-Consistent Attention Prior ($GAP^{3D}$). We project $GAP^{3D}$ to obtain 3D-constrained attention, which is fused with 2D cross-attention via an Attention Fusion Network (AFN). AFN employs an adaptive attention strategy that prioritizes 3D-constrained attention for geometric consistency during early inference, and gradually prioritizes 2D cross-attention maps in diffusion for fine-grained features during later inference. Extensive experiments demonstrate that InterGSEdit achieves state-of-the-art performance, delivering consistent, high-fidelity 3DGS editing with an improved user experience.

Thu 23 Oct. 17:45 - 19:45 PDT

#136
From Gaze to Movement: Predicting Visual Attention for Autonomous Driving Human-Machine Interaction based on Programmatic Imitation Learning

Yexin Huang · Yongbin Lin · Lishengsa Yue · Zhihong Yao · Jie Wang

Human-machine interaction technology requires not only the distribution of human visual attention but also the prediction of the gaze point trajectory. We introduce $\textbf{PILOT}$, a programmatic imitation learning approach that predicts a driver’s eye movements based on a set of rule-based conditions. These conditions—derived from driving operations and traffic flow characteristics—define how gaze shifts occur. They are initially identified through incremental synthesis, a heuristic search method, and then refined via L-BFGS, a numerical optimization technique. These human-readable rules enable us to understand drivers’ eye movement patterns and make efficient and explainable predictions. We also propose $\textbf{DATAD}$, a dataset that covers 12 types of autonomous driving takeover scenarios, collected from 60 participants and comprising approximately 600,000 frames of gaze point data. Compared to existing eye-tracking datasets, DATAD includes additional driving metrics and surrounding traffic flow characteristics, providing richer contextual information for modeling gaze behavior. Experimental evaluations of PILOT on DATAD demonstrate superior accuracy and faster prediction speeds compared to four baseline models. Specifically, PILOT reduces the MSE of predicted trajectories by 39.91\% to 88.02\% and improves the accuracy of gaze object predictions by 13.99\% to 55.06\%. Moreover, PILOT achieves these gains with approximately 30\% lower prediction time, offering both more accurate and more efficient eye movement prediction.

Thu 23 Oct. 17:45 - 19:45 PDT

#138
Harnessing Text-to-Image Diffusion Models for Point Cloud Self-Supervised Learning

Yiyang Chen · Shanshan Zhao · Lunhao Duan · Changxing Ding · Dacheng Tao

Diffusion-based models, widely used in text-to-image generation, have proven effective in 2D representation learning. Recently, this framework has been extended to 3D self-supervised learning by constructing a conditional point generator for enhancing 3D representations. However, its performance remains constrained by the 3D diffusion model, which is trained on the available 3D datasets with limited size. We hypothesize that the robust capabilities of text-to-image diffusion models, particularly Stable Diffusion (SD), which is trained on large-scale datasets, can help overcome these limitations. To investigate this hypothesis, we propose PointSD, a framework that leverages the SD model for 3D self-supervised learning. By replacing the SD model's text encoder with a 3D encoder, we train a point-to-image diffusion model that allows point clouds to guide the denoising of rendered noisy images. With the trained point-to-image diffusion model, we use noise-free images as the input and point clouds as the condition to extract SD features. Next, we train a 3D backbone by aligning its features with these SD features, thereby facilitating direct semantic learning. Comprehensive experiments on downstream point cloud tasks and ablation studies demonstrate that the SD model can enhance point cloud self-supervised learning. Our code will be publicly available.

Thu 23 Oct. 17:45 - 19:45 PDT

#139
OD-RASE: Ontology-Driven Risk Assessment and Safety Enhancement for Autonomous Driving

Kota Shimomura · Masaki Nambata · Atsuya Ishikawa · Ryota Mimura · Takayuki Kawabuchi · Takayoshi Yamashita · Koki Inoue

Although autonomous driving systems demonstrate high perception performance, they still face limitations when handling rare situations or complex road structures. Since existing road infrastructures are designed for human drivers, safety improvements are typically introduced only after accidents occur. This reactive approach poses a significant challenge for autonomous systems, which require proactive risk mitigation. To address this issue, we propose OD-RASE, a framework for enhancing the safety of autonomous driving systems by detecting road structures that cause traffic accidents and connecting these findings to infrastructure development. First, we formalize an ontology based on specialized domain knowledge of road traffic systems. In parallel, we generate infrastructure improvement proposals using a large-scale visual language model (LVLM) and use ontology-driven data filtering to enhance their reliability. This process automatically annotates improvement proposals on pre-accident road images, leading to the construction of a new dataset. Furthermore, we introduce the Baseline approach (OD-RASE model), which leverages LVLM and a diffusion model to produce both infrastructure improvement proposals and generated images of the improved road environment. Our experiments demonstrate that ontology-driven data filtering enables highly accurate prediction of accident-causing road structures and the corresponding improvement plans. We believe that this work contributes to the overall safety of traffic environments and marks an important step toward the broader adoption of autonomous driving systems.

Thu 23 Oct. 17:45 - 19:45 PDT

#140
MDP-Omni: Parameter-free Multimodal Depth Prior-based Sampling for Omnidirectional Stereo Matching

Eunjin Son · HyungGi Jo · Wookyong Kwon · Sang Jun Lee

Omnidirectional stereo matching (OSM) estimates $360^\circ$ depth by performing stereo matching on multi-view fisheye images. Existing methods assume a unimodal depth distribution, matching each pixel to a single object. However, this assumption constrains the sampling range, causing over-smoothed depth artifacts, especially at object boundaries. To address these limitations, we propose MDP-Omni, a novel OSM network that leverages parameter-free multimodal depth priors. Specifically, we introduce a depth prior-based sampling method, which adjusts the sampling range without additional parameters. Furthermore, we present the azimuth-based multi-view volume fusion module to build a single cost volume. It mitigates false matches caused by occlusions in warped multi-view volumes. Experimental results demonstrate that MDP-Omni significantly improves existing methods, particularly in capturing fine details.

Thu 23 Oct. 17:45 - 19:45 PDT

#141
DistillDrive: End-to-End Multi-Mode Autonomous Driving Distillation by Isomorphic Hetero-Source Planning Model

Rui Yu · Xianghang Zhang · Runkai Zhao · Huaicheng Yan · Meng Wang

End-to-end autonomous driving has recently seen rapid development, exerting a profound influence on both industry and academia. However, existing work places excessive focus on ego-vehicle status as its sole learning objective and lacks planning-oriented understanding, which limits the robustness of the overall decision-making process. In this work, we introduce DistillDrive, an end-to-end knowledge distillation-based autonomous driving model that leverages diversified instance imitation to enhance multi-mode motion feature learning. Specifically, we employ a planning model based on structured scene representations as the teacher model, leveraging its diversified planning instances as multi-objective learning targets for the end-to-end model. Moreover, we incorporate reinforcement learning to enhance the optimization of state-to-decision mappings, while utilizing generative modeling to construct planning-oriented instances, fostering intricate interactions within the latent space. We validate our model on the nuScenes and NAVSIM datasets, achieving a 50\% reduction in collision rate and a 3-point improvement in closed-loop performance compared to the baseline model.

Thu 23 Oct. 17:45 - 19:45 PDT

#142
Highlight
EDM: Efficient Deep Feature Matching

Xi Li · Tong Rao · Cihui Pan

Recent feature matching methods have achieved remarkable performance but lack efficiency consideration. In this paper, we revisit the mainstream detector-free matching pipeline and improve all its stages considering both accuracy and efficiency. We propose an Efficient Deep feature Matching network, EDM. We first adopt a deeper CNN with fewer dimensions to extract multi-level features. Then we present a Correlation Injection Module that conducts feature transformation on high-level deep features, and progressively injects feature correlations from global to local for efficient multi-scale feature aggregation, improving both speed and performance. In the refinement stage, a novel lightweight bidirectional axis-based regression head is designed to directly predict subpixel-level correspondences from latent features, avoiding the significant computational cost of explicitly locating keypoints on high-resolution local feature heatmaps. Moreover, effective selection strategies are introduced to enhance matching accuracy. Extensive experiments show that our EDM achieves competitive matching accuracy on various benchmarks and exhibits excellent efficiency, offering valuable best practices for real-world applications. The code will be made publicly available.

Thu 23 Oct. 17:45 - 19:45 PDT

#143
Robust and Efficient 3D Gaussian Splatting for Urban Scene Reconstruction

Zhensheng Yuan · Haozhi Huang · Zhen Xiong · Di Wang · Guanghua Yang

We present a resource-efficient framework that enables fast reconstruction and real-time rendering of urban-level scenarios while maintaining robustness against appearance variations across multi-view captures. Our approach begins with scene partitioning for parallel training, employing a visibility-based image selection strategy to optimize resource utilization. A controllable level-of-detail (LOD) strategy regulates the Gaussian density during training and rendering to balance quality, memory efficiency, and performance. The appearance transformation module mitigates inconsistencies across images while enabling flexible adjustments. Additionally, we utilize enhancement modules, such as depth regularization, scale regularization, and anti-aliasing, to improve reconstruction fidelity. Experimental results demonstrate that our method effectively reconstructs urban-scale scenes and outperforms previous approaches in both efficiency and quality.

Thu 23 Oct. 17:45 - 19:45 PDT

#144
SuperDec: 3D Scene Decomposition with Superquadrics Primitives

Elisabetta Fedele · Boyang Sun · Francis Engelmann · Marc Pollefeys · Leonidas Guibas

We present SuperDec, an approach for compact 3D scene representations based on geometric primitives, namely superquadrics. While most recent works leverage geometric primitives to obtain photorealistic 3D scene representations, we propose to leverage them to obtain a compact yet expressive representation. We propose to solve the problem locally on individual objects and leverage the capabilities of instance segmentation methods to scale our solution to full 3D scenes. In doing so, we design a new architecture that efficiently decomposes point clouds of arbitrary objects into a compact set of superquadrics. We train our architecture on ShapeNet and demonstrate its generalization capabilities on object instances extracted from the ScanNet++ dataset as well as on full Replica scenes. Finally, we show how a compact representation based on superquadrics can be useful for a diverse range of downstream applications, including robotic tasks and controllable visual content generation and editing.
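
For reference, a superquadric primitive of the kind used above is commonly parameterized by the inside-outside function below (scales $a_{1,2,3}$, shape exponents $\varepsilon_{1},\varepsilon_{2}$, plus a rigid pose); this is the standard formulation rather than SuperDec's specific decoder output.

$F(x,y,z) = \Big(\big(\tfrac{x}{a_1}\big)^{2/\varepsilon_2} + \big(\tfrac{y}{a_2}\big)^{2/\varepsilon_2}\Big)^{\varepsilon_2/\varepsilon_1} + \big(\tfrac{z}{a_3}\big)^{2/\varepsilon_1},$

with $F<1$ inside, $F=1$ on the surface, and $F>1$ outside; varying $\varepsilon_1$ and $\varepsilon_2$ morphs the primitive between box-like, ellipsoidal, and cylinder-like shapes, which is what makes a small set of superquadrics so compact.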

Thu 23 Oct. 17:45 - 19:45 PDT

#145
CorrCLIP: Reconstructing Patch Correlations in CLIP for Open-Vocabulary Semantic Segmentation

Dengke Zhang · Fagui Liu · Quan Tang

Open-vocabulary semantic segmentation aims to assign semantic labels to each pixel without being constrained by a predefined set of categories. While Contrastive Language-Image Pre-training (CLIP) excels in zero-shot classification, it struggles to align image patches with category embeddings because of its incoherent patch correlations. This study reveals that inter-class correlations are the main reason for impairing CLIP's segmentation performance. Accordingly, we propose CorrCLIP, which reconstructs the scope and value of patch correlations. Specifically, CorrCLIP leverages the Segment Anything Model (SAM) to define the scope of patch interactions, reducing inter-class correlations. To mitigate the problem that SAM-generated masks may contain patches belonging to different classes, CorrCLIP incorporates self-supervised models to compute coherent similarity values, suppressing the weight of inter-class correlations. Additionally, we introduce two additional branches to strengthen patch features’ spatial details and semantic representation. Finally, we update segmentation maps with SAM-generated masks to improve spatial consistency. Based on the improvement across patch correlations, feature representations, and segmentation maps, CorrCLIP achieves superior performance across eight benchmarks.
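
A minimal sketch of the scope-and-value reconstruction described above, assuming per-patch SAM mask assignments and self-supervised (e.g., DINO-style) patch features are already computed; the function name, clamping, and normalization are illustrative choices, not the paper's exact procedure.

import torch

def reconstruct_patch_correlations(sam_mask_ids, ssl_feats):
    # Restrict patch correlations to SAM-defined regions (scope) and weight
    # them by self-supervised feature similarity (value).
    # sam_mask_ids: (N,) integer mask id per patch from SAM.
    # ssl_feats:    (N, D) patch features from a self-supervised model.
    same_mask = sam_mask_ids[:, None] == sam_mask_ids[None, :]       # scope: same segment only
    feats = torch.nn.functional.normalize(ssl_feats, dim=-1)
    sim = feats @ feats.T                                            # value: cosine similarity
    corr = torch.where(same_mask, sim.clamp(min=0.0), torch.zeros_like(sim))
    return corr / corr.sum(dim=-1, keepdim=True).clamp(min=1e-6)     # row-normalized correlations

# Toy usage with 6 patches split across 2 SAM masks.
ids = torch.tensor([0, 0, 0, 1, 1, 1])
corr = reconstruct_patch_correlations(ids, torch.randn(6, 64))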

Thu 23 Oct. 17:45 - 19:45 PDT

#146
GS-ID: Illumination Decomposition on Gaussian Splatting via Adaptive Light Aggregation and Diffusion-Guided Material Priors

Kang DU · Zhihao Liang · Yulin Shen · Zeyu Wang

Gaussian Splatting (GS) has become an effective representation for photorealistic rendering, but the information about geometry, material, and lighting is entangled and requires illumination decomposition for editing. Current GS-based approaches face significant challenges in disentangling complex light-geometry-material interactions under non-Lambertian conditions, particularly when handling specular reflections and shadows. We present GS-ID, a novel end-to-end framework that achieves comprehensive illumination decomposition by integrating adaptive light aggregation with diffusion-based material priors. In addition to a learnable environment map that captures ambient illumination, we model complex local lighting conditions by adaptively aggregating a set of anisotropic and spatially-varying spherical Gaussian mixtures during optimization. To better model shadow effects, we associate a learnable unit vector with each splat to represent how multiple light sources cause the shadow, further enhancing lighting and material estimation. Together with intrinsic priors from diffusion models, GS-ID significantly reduces light-geometry-material ambiguity and achieves state-of-the-art illumination decomposition performance. Experiments also show that GS-ID effectively supports various downstream applications such as relighting and scene composition.
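
For reference, each lobe in the aggregated local lighting model above can be viewed as a spherical Gaussian; the isotropic textbook form is given below, while GS-ID's anisotropic, spatially-varying extension and its aggregation weights are not reproduced here.

$G(\boldsymbol{\nu};\,\boldsymbol{\mu},\lambda,\mathbf{a}) = \mathbf{a}\,\exp\!\big(\lambda(\boldsymbol{\nu}\cdot\boldsymbol{\mu}-1)\big),$

where $\boldsymbol{\nu}$ is the query direction, $\boldsymbol{\mu}$ the lobe axis, $\lambda$ the sharpness, and $\mathbf{a}$ the amplitude; a mixture of such lobes gives a compact, differentiable approximation of local incident illumination.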

Thu 23 Oct. 17:45 - 19:45 PDT

#147
NeRF Is a Valuable Assistant for 3D Gaussian Splatting

Shuangkang Fang · I-Chao Shen · Takeo Igarashi · Yufeng Wang · ZeSheng Wang · Yi Yang · Wenrui Ding · Shuchang Zhou

We introduce NeRF-GS, a novel framework that jointly optimizes Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS). This framework leverages the inherent continuous spatial representation of NeRF to mitigate several limitations of 3DGS, including sensitivity to Gaussian initialization, limited spatial awareness, and weak inter-Gaussian correlations, thereby enhancing its performance. In NeRF-GS, we revisit the design of 3DGS and progressively align its spatial features with NeRF, enabling both representations to be optimized within the same scene through shared 3D spatial information. We further address the formal distinctions between the two approaches by optimizing residual vectors for both implicit features and Gaussian positions to enhance the personalized capabilities of 3DGS. Experimental results on benchmark datasets show that NeRF-GS surpasses existing methods and achieves state-of-the-art performance. This outcome confirms that NeRF and 3DGS are complementary rather than competing, offering new insights into hybrid approaches that combine 3DGS and NeRF for efficient 3D scene representation.

Thu 23 Oct. 17:45 - 19:45 PDT

#148
UniGS: Modeling Unitary 3D Gaussians for Novel View Synthesis from Sparse-view Images

Jiamin WU · Kenkun Liu · Xiaoke Jiang · Yuan Yao · Lei Zhang

In this work, we introduce UniGS, a novel 3D Gaussian reconstruction and novel view synthesis model that predicts a high-fidelity representation of 3D Gaussians from an arbitrary number of posed sparse-view images. Previous methods often regress 3D Gaussians locally on a per-pixel basis for each view, then transfer them to world space and merge them through point concatenation. In contrast, our approach models unitary 3D Gaussians in world space and updates them layer by layer. To leverage information from multi-view inputs for updating the unitary 3D Gaussians, we develop a DETR (DEtection TRansformer)-like framework, which treats 3D Gaussians as queries and updates their parameters by performing multi-view cross-attention (MVDFA) across multiple input images, which are treated as keys and values. This approach effectively avoids the 'ghosting' issue and allocates more 3D Gaussians to complex regions. Moreover, since the number of 3D Gaussians used as decoder queries is independent of the number of input views, our method allows an arbitrary number of multi-view images as input without causing memory explosion or requiring retraining. Extensive experiments validate the advantages of our approach, showcasing superior performance over existing methods quantitatively (improving PSNR by 4.2 dB when trained on Objaverse and tested on the GSO benchmark) and qualitatively.
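
A minimal sketch of the DETR-style update loop described above, treating learnable 3D Gaussian queries as query tokens and multi-view image features as keys and values; the dimensions, layer count, and parameter head are assumptions for illustration only, not the paper's architecture.

import torch
import torch.nn as nn

class GaussianQueryDecoder(nn.Module):
    # Hypothetical decoder: a fixed set of unitary 3D Gaussian queries is refined
    # layer by layer via cross-attention over features from all input views.
    def __init__(self, num_gaussians=1024, dim=256, num_layers=4, gauss_params=14):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_gaussians, dim))
        self.layers = nn.ModuleList(
            [nn.MultiheadAttention(dim, num_heads=8, batch_first=True) for _ in range(num_layers)]
        )
        self.head = nn.Linear(dim, gauss_params)  # e.g., position, scale, rotation, opacity, color

    def forward(self, view_tokens):
        # view_tokens: (B, V*HW, dim) features from an arbitrary number of posed views.
        q = self.queries.unsqueeze(0).expand(view_tokens.shape[0], -1, -1)
        for attn in self.layers:
            upd, _ = attn(q, view_tokens, view_tokens)  # Gaussians attend to all views at once
            q = q + upd
        return self.head(q)                             # per-Gaussian parameters in world space

tokens = torch.randn(2, 3 * 196, 256)                   # toy input: 3 views, 196 tokens each
params = GaussianQueryDecoder()(tokens)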

Thu 23 Oct. 17:45 - 19:45 PDT

#149
Hierarchy UGP: Hierarchy Unified Gaussian Primitive for Large-Scale Dynamic Scene Reconstruction

Hongyang Sun · Qinglin Yang · Jiawei Wang · Zhen Xu · Chen Liu · Yida Wang · Kun Zhan · Hujun Bao · Xiaowei Zhou · Sida Peng

Recent advances in differentiable rendering have significantly improved dynamic street scene reconstruction. However, the complexity of large-scale scenarios and dynamic elements, such as vehicles and pedestrians, remains a substantial challenge. Existing methods often struggle to scale to large scenes or accurately model arbitrary dynamics. To address these limitations, we propose Hierarchy UGP, which constructs a hierarchical structure consisting of a root level, sub-scenes level, and primitive level, using Unified Gaussian Primitive (UGP) defined in 4D space as the representation. The root level serves as the entry point to the hierarchy. At the sub-scenes level, the scene is spatially divided into multiple sub-scenes, with various elements extracted. At the primitive level, each element is modeled with UGPs, and its global pose is controlled by a motion prior related to time. This hierarchical design greatly enhances the model's capacity, enabling it to model large-scale scenes. Additionally, our UGP allows for the reconstruction of both rigid and non-rigid dynamics. We conducted experiments on Dynamic City, our proprietary large-scale dynamic street scene dataset, as well as the public Waymo dataset. Experimental results demonstrate that our method achieves state-of-the-art performance. We plan to release the accompanying code and the Dynamic City dataset as open resources to further research within the community.

Thu 23 Oct. 17:45 - 19:45 PDT

#150
TOTP: Transferable Online Pedestrian Trajectory Prediction with Temporal-Adaptive Mamba Latent Diffusion

Ziyang Ren · Ping Wei · Shangqi Deng · Haowen Tang · Jiapeng Li · Huan Li

Pedestrian trajectory prediction is crucial for many intelligent tasks. While existing methods predict future trajectories from fixed-frame historical observations, they are limited by the observational perspective and the need for extensive historical information, resulting in prediction delays and inflexible generalization in real-time systems. In this paper, we propose a novel task called Transferable Online Pedestrian Trajectory Prediction (TOTP), which synchronously predicts future trajectories with variable observations and enables effective task transfer under different observation constraints. To advance TOTP modeling, we propose a Temporal-Adaptive Mamba Latent Diffusion (TAMLD) model. It utilizes the Social-Implicit Mamba Synthesizer to extract motion states with social interaction and refine temporal representations through Temporal-Aware Distillation. A Trend-Conditional Mamba Decomposer generates the motion latent distribution of the future motion trends and predicts future motion trajectories through sampling decomposition. We utilize Motion-Latent Mamba Diffusion to reconstruct the latent space disturbed by imbalanced temporal noise. Our method achieves state-of-the-art results on multiple datasets and tasks, showcasing temporal adaptability and strong generalization.

Thu 23 Oct. 17:45 - 19:45 PDT

#151
AR-1-to-3: Single Image to Consistent 3D Object via Next-View Prediction

Xuying Zhang · Yupeng Zhou · Kai Wang · Yikai Wang · Zhen Li · Daquan Zhou · Shaohui Jiao · Qibin Hou · Ming-Ming Cheng

Multi-view synthesis serves as a fundamental component in creating high-quality 3D assets. We observe that the existing works represented by the Zero123 series typically struggle to maintain cross-view consistency, especially when handling views with significantly different camera poses. To overcome this challenge, we present AR-1-to-3, a novel paradigm to progressively generate the target views in an autoregressive manner. Rather than producing multiple discrete views of a 3D object from a single-view image and a set of camera poses or multiple views simultaneously under specified camera conditions, AR-1-to-3 starts from generating views closer to the input view, which is utilized as contextual information to prompt the generation of farther views. In addition, we propose two image conditioning strategies, termed as Stacked-LE and LSTM-GE, to encode previously generated sequence views and provide pixel-wise spatial guidance and high-level semantic information for the generation of current target views. Extensive experiments on several publicly available 3D datasets show that our method can synthesize more consistent 3D views and produce high-quality 3D assets that closely mirror the given image. Code and pre-trained weights will be open-sourced.

Thu 23 Oct. 17:45 - 19:45 PDT

#152
UnMix-NeRF: Spectral Unmixing Meets Neural Radiance Fields

Fabian Perez · Sara Rojas Martinez · Carlos Hinojosa · Hoover Rueda-Chacón · Bernard Ghanem

Neural Radiance Field (NeRF)-based segmentation methods focus on object semantics and rely solely on RGB data, lacking intrinsic material properties. This limitation restricts accurate material perception, which is crucial for robotics, augmented reality, simulation, and other applications. We introduce UnMix-NeRF, a framework that integrates spectral unmixing into NeRF, enabling joint hyperspectral novel view synthesis and unsupervised material segmentation. Our method models spectral reflectance via diffuse and specular components, where a learned dictionary of global endmembers represents pure material signatures, and per-point abundances capture their distribution. For material segmentation, we use spectral signature predictions along learned endmembers, allowing unsupervised material clustering. Additionally, UnMix-NeRF enables scene editing by modifying learned endmember dictionaries for flexible material-based appearance manipulation. Extensive experiments validate our approach, demonstrating superior spectral reconstruction and material segmentation to existing methods. The associated data and code for reproduction will be made publicly available.
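
For context, the linear spectral mixing model underlying the abstract is typically written as below; how UnMix-NeRF splits diffuse and specular components and learns the dictionary is not reproduced here.

$\mathbf{r}(\mathbf{x}) = \sum_{k=1}^{K} \alpha_k(\mathbf{x})\,\mathbf{e}_k, \qquad \alpha_k(\mathbf{x}) \ge 0, \quad \sum_{k=1}^{K}\alpha_k(\mathbf{x}) = 1,$

where $\mathbf{e}_k$ are the global endmember spectra (pure material signatures) and $\alpha_k(\mathbf{x})$ the per-point abundances; assigning each point to its dominant abundance then yields an unsupervised material segmentation.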

Thu 23 Oct. 17:45 - 19:45 PDT

#153
Highlight
MaterialMVP: Illumination-Invariant Material Generation via Multi-view PBR Diffusion

Zebin He · Mx Yang · Shuhui Yang · Yixuan Tang · Tao Wang · Kaihao Zhang · Guanying Chen · Lliu Yuhong · Jie Jiang · Chunchao Guo · Wenhan Luo

Physically-based rendering (PBR) has become a cornerstone in modern computer graphics, enabling realistic material representation and lighting interactions in 3D scenes. In this paper, we present MaterialMVP, a novel end-to-end model for generating PBR textures from 3D meshes and image prompts, addressing key challenges in multi-view material synthesis. Our approach leverages Reference Attention to extract and encode informative latent from the input reference images, enabling intuitive and controllable texture generation. We also introduce a Consistency-Regularized Training strategy to enforce stability across varying viewpoints and illumination conditions, ensuring illumination-invariant and geometrically consistent results. Additionally, we propose Dual-Channel Material Generation, which separately optimizes albedo and metallic-roughness (MR) textures while maintaining precise spatial alignment with the input images through Multi-Channel Aligned Attention. Learnable material embeddings are further integrated to capture the distinct properties of albedo and MR. Experimental results demonstrate that our model generates PBR textures with realistic behavior across diverse lighting scenarios, outperforming existing methods in both consistency and quality for scalable 3D asset creation.

Thu 23 Oct. 17:45 - 19:45 PDT

#154
PerLDiff: Controllable Street View Synthesis Using Perspective-Layout Diffusion Model

Jinhua Zhang · Hualian Sheng · Sijia Cai · Bing Deng · Qiao Liang · Wen Li · Ying Fu · Jieping Ye · Shuhang Gu

Controllable generation is considered a potentially vital approach to address the challenge of annotating 3D data, and the precision of such controllable generation becomes particularly imperative in the context of data production for autonomous driving. Existing methods focus on the integration of diverse generative information into controlling inputs, utilizing frameworks such as GLIGEN or ControlNet, to produce commendable outcomes in controllable generation. However, such approaches intrinsically restrict generation performance to the learning capacities of predefined network architectures. In this paper, we explore the innovative integration of controlling information and introduce PerLDiff (\textbf{Per}spective-\textbf{L}ayout \textbf{Diff}usion Models), a novel method for effective street view image generation that fully leverages perspective 3D geometric information. Our PerLDiff employs 3D geometric priors to guide the generation of street view images with precise object-level control within the network learning process, resulting in a more robust and controllable output. Moreover, it demonstrates superior controllability compared to alternative layout control methods. Empirical results justify that our PerLDiff markedly enhances the precision of controllable generation on the NuScenes and KITTI datasets.

Thu 23 Oct. 17:45 - 19:45 PDT

#155
7DGS: Unified Spatial-Temporal-Angular Gaussian Splatting

Zhongpai Gao · Benjamin Planche · Meng Zheng · Anwesa Choudhuri · Terrence Chen · Ziyan Wu

Real-time rendering of dynamic scenes with view-dependent effects remains a fundamental challenge in computer graphics. While recent advances in Gaussian Splatting have shown promising results separately handling dynamic scenes (4DGS) and view-dependent effects (6DGS), no existing method unifies these capabilities while maintaining real-time performance. We present 7D Gaussian Splatting (7DGS), a unified framework representing scene elements as seven-dimensional Gaussians spanning position (3D), time (1D), and viewing direction (3D). Our key contribution is an efficient conditional slicing mechanism that transforms 7D Gaussians into view- and time-conditioned 3D Gaussians, maintaining compatibility with existing 3D Gaussian Splatting pipelines while enabling joint optimization. Experiments demonstrate that 7DGS outperforms prior methods by up to 7.36 dB in PSNR while achieving real-time rendering (401 FPS) on challenging dynamic scenes with complex view-dependent effects.
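The "conditional slicing" of a seven-dimensional Gaussian into a view- and time-conditioned 3D Gaussian can be pictured with the standard closed form for conditioning a multivariate Gaussian on a subset of its dimensions. The sketch below is that generic formula in NumPy under an assumed dimension layout (position, time, view direction); it is not the paper's parameterization or renderer.

```python
import numpy as np

def condition_gaussian(mu, cov, cond_idx, cond_value):
    """Condition an N-D Gaussian on a subset of its dimensions.

    Standard closed form: mu_a|b = mu_a + S_ab S_bb^{-1} (x_b - mu_b),
                          S_a|b  = S_aa - S_ab S_bb^{-1} S_ba.
    """
    n = len(mu)
    keep_idx = np.setdiff1d(np.arange(n), cond_idx)
    mu_a, mu_b = mu[keep_idx], mu[cond_idx]
    S_aa = cov[np.ix_(keep_idx, keep_idx)]
    S_ab = cov[np.ix_(keep_idx, cond_idx)]
    S_bb = cov[np.ix_(cond_idx, cond_idx)]
    gain = S_ab @ np.linalg.inv(S_bb)
    mu_cond = mu_a + gain @ (cond_value - mu_b)
    S_cond = S_aa - gain @ S_ab.T
    return mu_cond, S_cond

# Toy 7D Gaussian: dims 0-2 position, dim 3 time, dims 4-6 view direction (illustrative layout).
rng = np.random.default_rng(1)
A = rng.normal(size=(7, 7))
cov7 = A @ A.T + 7 * np.eye(7)          # symmetric positive definite
mu7 = rng.normal(size=7)

t, view_dir = 0.3, np.array([0.0, 0.0, 1.0])
mu3, cov3 = condition_gaussian(mu7, cov7, cond_idx=np.array([3, 4, 5, 6]),
                               cond_value=np.concatenate([[t], view_dir]))
print(mu3.shape, cov3.shape)            # (3,), (3, 3): a time- and view-conditioned 3D Gaussian
```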

Thu 23 Oct. 17:45 - 19:45 PDT

#156
StochasticSplats: Stochastic Rasterization for Sorting-Free 3D Gaussian Splatting

Shakiba Kheradmand · Delio Vicini · George Kopanas · Dmitry Lagun · Kwang Moo Yi · Mark Matthews · Andrea Tagliasacchi

3D Gaussian splatting (3DGS) is a popular radiance field method, with many application-specific extensions. Most variants rely on the same core algorithm: depth-sorting of Gaussian splats then rasterizing in primitive order. This ensures correct alpha compositing, but can cause rendering artifacts due to built-in approximations. Moreover, for a fixed representation, sorted rendering offers little control over render cost and visual fidelity. For example, and counter-intuitively, rendering a lower-resolution image is not necessarily faster. In this work, we address the above limitations by combining 3D Gaussian splatting with stochastic rasterization. Concretely, we leverage an unbiased Monte Carlo estimator of the volume rendering equation. This removes the need for sorting, and allows for accurate 3D blending of overlapping Gaussians. The number of Monte Carlo samples further imbues 3DGS with a way to trade off computation time and quality. We implement our method using OpenGL shaders, enabling efficient rendering on modern GPU hardware. At a reasonable visual quality, our method renders more than four times faster than sorted rasterization.
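One way to see how Monte Carlo sampling removes the need for sorting is the classic stochastic-transparency estimator: keep each fragment opaque with probability equal to its alpha, take the nearest surviving fragment, and average over samples; in expectation this matches front-to-back alpha compositing. The sketch below illustrates that idea for a single pixel and is not the paper's actual OpenGL estimator.

```python
import numpy as np

# Order-independent stochastic transparency sketch for one pixel: no sort is ever
# performed in the estimator, yet its expectation equals the sorted "over" compositing.

rng = np.random.default_rng(2)
n = 50                                   # fragments covering one pixel (unsorted)
depth = rng.random(n)
alpha = rng.uniform(0.05, 0.6, size=n)
color = rng.random((n, 3))
background = np.zeros(3)

def sorted_reference():
    order = np.argsort(depth)            # the ground-truth compositing needs the sort
    out, trans = np.zeros(3), 1.0
    for i in order:
        out = out + trans * alpha[i] * color[i]
        trans *= 1.0 - alpha[i]
    return out + trans * background

def stochastic_estimate(num_samples=4096):
    acc = np.zeros(3)
    for _ in range(num_samples):
        kept = rng.random(n) < alpha     # Bernoulli visibility test per fragment
        if kept.any():
            nearest = np.argmin(np.where(kept, depth, np.inf))
            acc += color[nearest]
        else:
            acc += background
    return acc / num_samples

print(sorted_reference(), stochastic_estimate())   # agree up to Monte Carlo noise
```

The number of samples per pixel is exactly the quality/compute knob the abstract mentions: fewer samples render faster but noisier, more samples converge to the sorted result.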

Thu 23 Oct. 17:45 - 19:45 PDT

#157
MS3D: High-Quality 3D Generation via Multi-Scale Representation Modeling

Guan Luo · Jianfeng Zhang

High-quality textured mesh reconstruction from sparse-view images remains a fundamental challenge in computer graphics and computer vision. Traditional large reconstruction models operate in a single-scale manner, forcing the models to simultaneously capture global structure and local details, often resulting in compromised reconstructed shapes. In this work, we propose MS3D, a novel multi-scale 3D reconstruction framework. At its core, our method introduces a hierarchical structured latent representation for multi-scale modeling, coupled with a multi-scale feature extraction and integration mechanism. This enables progressive reconstruction, effectively decomposing the complex task of detailed geometry reconstruction into a sequence of easier steps. This coarse-to-fine approach effectively captures multi-frequency details, learns complex geometric patterns, and generalizes well across diverse objects while preserving fine-grained details. Extensive experiments demonstrate MS3D outperforms state-of-the-art methods and is broadly applicable to both image- and text-to-3D generation. The entire pipeline reconstructs high-quality textured meshes in under five seconds.

Thu 23 Oct. 17:45 - 19:45 PDT

#158
DASH: 4D Hash Encoding with Self-Supervised Decomposition for Real-Time Dynamic Scene Rendering

Jie Chen · Zhangchi Hu · Peixi Wu · Huyue Zhu · Hebei Li · Xiaoyan Sun

Dynamic scene reconstruction is a long-term challenge in 3D vision. Existing plane-based methods in dynamic Gaussian splatting suffer from an unsuitable low-rank assumption, causing feature overlap and poor rendering quality. Although 4D hash encoding provides a promising explicit representation without low-rank constraints, directly applying it to the entire dynamic scene leads to substantial hash collisions and redundancy. To address these challenges, we present DASH, a real-time dynamic scene rendering framework that employs 4D hash encoding coupled with self-supervised decomposition. Our approach begins with a self-supervised decomposition mechanism that separates dynamic and static components without manual annotations or precomputed masks. Next, we introduce a multiresolution 4D hash encoder for dynamic elements, providing an explicit representation that avoids the low-rank assumption. Finally, we present a spatio-temporal smoothness regularization strategy to mitigate unstable deformation artifacts. Comprehensive experiments on real-world datasets demonstrate that DASH achieves state-of-the-art dynamic rendering performance, exhibiting significantly enhanced visual quality at real-time speeds of 264 FPS on a single 4090 GPU. The code will be made publicly available.

Thu 23 Oct. 17:45 - 19:45 PDT

#159
EmbodiedOcc: Embodied 3D Occupancy Prediction for Vision-based Online Scene Understanding

Yuqi Wu · Wenzhao Zheng · Sicheng Zuo · Yuanhui Huang · Jie Zhou · Jiwen Lu

3D occupancy prediction provides a comprehensive description of the surrounding scenes and has become an essential task for 3D perception. Most existing methods focus on offline perception from one or a few views and cannot be applied to embodied agents that need to gradually perceive the scene through progressive embodied exploration. In this paper, we formulate an embodied 3D occupancy prediction task to target this practical scenario and propose a Gaussian-based EmbodiedOcc framework to accomplish it. We initialize the global scene with uniform 3D semantic Gaussians and progressively update local regions observed by the embodied agent. For each update, we extract semantic and structural features from the observed image and efficiently incorporate them via deformable cross-attention to refine the regional Gaussians. Finally, we employ Gaussian-to-voxel splatting to obtain the global 3D occupancy from the updated 3D Gaussians. Our EmbodiedOcc assumes an unknown (i.e., uniformly distributed) environment and maintains an explicit global memory of it with 3D Gaussians. It gradually gains knowledge through the local refinement of regional Gaussians, which is consistent with how humans understand new scenes through embodied exploration. We reorganize an EmbodiedOcc-ScanNet benchmark based on local annotations to facilitate the evaluation of the embodied 3D occupancy prediction task. Our EmbodiedOcc outperforms existing methods by a large margin and accomplishes embodied occupancy prediction with high accuracy and efficiency.

Thu 23 Oct. 17:45 - 19:45 PDT

#160
TurboReg: TurboClique for Robust and Efficient Point Cloud Registration

Shaocheng Yan · Pengcheng Shi · Zhenjun Zhao · Kaixin Wang · Kuang Cao · Ji Wu · Jiayuan Li

Robust estimation is essential in correspondence-based Point Cloud Registration (PCR). Existing methods using maximal clique search in compatibility graphs achieve high recall but suffer from exponential time complexity, limiting their use in time-sensitive applications. To address this challenge, we propose a fast and robust estimator, TurboReg, built upon a novel lightweight clique, TurboClique, and a highly parallelizable Pivot-Guided Search (PGS) algorithm. First, we define the TurboClique as a 3-clique within a highly-constrained compatibility graph. The lightweight nature of the 3-clique allows for efficient parallel searching, and the highly-constrained compatibility graph ensures robust spatial consistency for stable transformation estimation. Next, PGS selects matching pairs with high SC$^2$ scores as pivots, effectively guiding the search toward TurboCliques with higher inlier ratios. Moreover, the PGS algorithm has linear time complexity and is significantly more efficient than the maximal clique search with exponential time complexity. Extensive experiments show that TurboReg achieves state-of-the-art performance across multiple real-world datasets, with substantial speed improvements. For example, on the 3DMatch+FCGF dataset, TurboReg (1K) and TurboReg (0.5K) operate $208.22\times$ and $213.35\times$ faster than 3DMAC, respectively, while also enhancing recall. Our code is accessible at \href{https://anonymous.4open.science/r/TurboReg-FDB7/}{\texttt{TurboReg}}.
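The reason 3-cliques (triangles) are cheap to enumerate is that, for a chosen pivot edge, every common neighbor of its two endpoints closes a triangle, which amounts to a couple of vectorized boolean operations on the compatibility graph. The sketch below shows that idea on a random compatibility matrix; the score threshold, pivot selection, and variable names are illustrative, not the paper's code.

```python
import numpy as np

# Sketch: enumerate the 3-cliques that contain a high-scoring pivot edge of a
# boolean compatibility graph over putative correspondences.

rng = np.random.default_rng(3)
num_corr = 200
score = rng.random((num_corr, num_corr))          # pairwise compatibility scores (e.g. SC^2-like)
score = np.minimum(score, score.T)                # make the matrix symmetric
np.fill_diagonal(score, 0.0)
adj = score > 0.85                                # a highly constrained compatibility graph

def triangles_through_pivot(adj, i, j):
    """Return all 3-cliques containing the pivot edge (i, j)."""
    if not adj[i, j]:
        return []
    common = np.flatnonzero(adj[i] & adj[j])      # vertices adjacent to both endpoints
    return [(i, j, int(k)) for k in common if k not in (i, j)]

# Pick pivots as the highest-scoring compatible pairs and expand them to triangles.
ii, jj = np.triu_indices(num_corr, k=1)
order = np.argsort(score[ii, jj])[::-1][:5]
for i, j in zip(ii[order], jj[order]):
    print(int(i), int(j), len(triangles_through_pivot(adj, i, j)))
```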

Thu 23 Oct. 17:45 - 19:45 PDT

#161
ConsistentCity: Semantic Flow-guided Occupancy DiT for Temporally Consistent Driving Scene Synthesis

Benjin Zhu · Xiaogang Wang · Hongsheng Li

Scene synthesis plays a crucial role in autonomous driving by addressing data scarcity and closed-loop validation. Current approaches struggle to maintain temporal consistency in synthesized videos while preserving fine-grained details. We introduce ConsistentCity, a two-stage framework with a novel Semantic Flow-guided Diffusion Transformer (SF-DiT) that converts sequential BEV semantic maps into temporally consistent driving videos. Operating in a pretrained occupancy VQ-VAE latent space, our SF-DiT generates temporally consistent 3D occupancy, which provides guidance for controlled image and video diffusion for scene synthesis. To address temporal consistency, SF-DiT enhances standard DiT blocks with temporal semantic modeling through two designs: (1) a Semantic Flow Estimation module capturing scene motions (flow, uncertainty, and classification) from sequential BEV semantic maps, and (2) a Semantic Flow-Modulated Cross-Attention module that dynamically adapts attention based on semantic flow patterns. This integration of semantic flow modeling in DiT enables consistent scene evolution understanding. Evaluations of image and video synthesis on the nuScenes dataset demonstrate state-of-the-art performance with FID 8.3 and FVD 73.6, and superior temporal occupancy generation results on the nuCraft and OpenOccupancy benchmarks.

Thu 23 Oct. 17:45 - 19:45 PDT

#162
Efficient Spiking Point Mamba for Point Cloud Analysis

Peixi Wu · Bosong Chai · Menghua Zheng · Wei Li · Zhangchi Hu · Jie Chen · Zheyu Zhang · Hebei Li · Xiaoyan Sun

Bio-inspired Spiking Neural Networks (SNNs) provide an energy-efficient way to extract 3D spatio-temporal features. However, existing 3D SNNs have struggled with long-range dependencies until the recent emergence of Mamba, which offers superior computational efficiency and sequence modeling capability. In this work, we propose Spiking Point Mamba (SPM), the first Mamba-based SNN in the 3D domain. Since simply transferring Mamba to 3D SNNs performs poorly, SPM is designed to utilize both the sequence modeling capabilities of Mamba and the temporal feature extraction of SNNs. Specifically, we first introduce Hierarchical Dynamic Encoding (HDE), an improved direct encoding method that effectively introduces a dynamic temporal mechanism, thereby facilitating temporal interactions. Then, we propose a Spiking Mamba Block (SMB), which builds upon Mamba while learning inter-time-step features and minimizing information loss caused by spikes. Finally, to further enhance model performance, we adopt an asymmetric SNN-ANN architecture for spike-based pre-training and fine-tuning. Compared with the previous state-of-the-art SNN models, SPM improves OA by +6.2%, +6.1%, and +7.4% on three variants of ScanObjectNN, and boosts instance mIoU by +1.9% on ShapeNetPart. Meanwhile, its energy consumption is at least 3.5x lower than that of its ANN counterpart. The code will be made publicly available.

Thu 23 Oct. 17:45 - 19:45 PDT

#163
SpatialSplat: Efficient Semantic 3D from Sparse Unposed Images

Yu Sheng · Jiajun Deng · Xinran Zhang · Yu Zhang · Bei Hua · Yanyong Zhang · Jianmin Ji

A major breakthrough in 3D reconstruction is the feedforward paradigm that generates pixel-wise 3D points or Gaussian primitives from sparse, unposed images. To further incorporate semantics while avoiding the significant memory and storage costs of high-dimensional semantic features, existing methods extend this paradigm by associating each primitive with a compressed semantic feature vector. However, these methods have two major limitations: (a) the naively compressed feature compromises expressiveness, affecting the model's ability to capture fine-grained semantics, and (b) the pixel-wise primitive prediction introduces redundancy in overlapping areas, causing unnecessary memory overhead. To this end, we introduce SpatialSplat, a feedforward framework that produces redundancy-aware Gaussians and capitalizes on a dual-field semantic representation. Particularly, with the insight that primitives within the same instance exhibit high semantic consistency, we decompose the semantic representation into a coarse feature field that encodes uncompressed semantics with minimal primitives, and a fine-grained yet low-dimensional feature field that captures detailed inter-instance relationships. Moreover, we propose a selective Gaussian mechanism, which retains only essential Gaussians in the scene, effectively eliminating redundant primitives. Our proposed SpatialSplat learns accurate semantic information and detailed instance priors with more compact 3D Gaussians, making semantic 3D reconstruction more applicable. We conduct extensive experiments to evaluate our method, demonstrating a remarkable 60\% reduction in scene representation parameters while achieving superior performance over state-of-the-art methods. The code will be made available for future investigation.

Thu 23 Oct. 17:45 - 19:45 PDT

#164
CoMoGaussian: Continuous Motion-Aware Gaussian Splatting from Motion-Blurred Images

Jungho Lee · DongHyeong Kim · Dogyoon Lee · Suhwan Cho · Minhyeok Lee · Wonjoon Lee · Taeoh Kim · Dongyoon Wee · Sangyoun Lee

3D Gaussian Splatting (3DGS) has gained significant attention for its high-quality novel view rendering, motivating research to address real-world challenges. A critical issue is the camera motion blur caused by movement during exposure, which hinders accurate 3D scene reconstruction. In this study, we propose CoMoGaussian, a Continuous Motion-Aware Gaussian Splatting framework that reconstructs precise 3D scenes from motion-blurred images while maintaining real-time rendering speed. Considering the complex motion patterns inherent in real-world camera movements, we predict continuous camera trajectories using neural ordinary differential equations (ODEs). To ensure accurate modeling, we employ rigid body transformations, which preserve the shape and size of the object but rely on the discrete integration of sampled frames. To better approximate the continuous nature of motion blur, we introduce a continuous motion refinement (CMR) transformation that refines rigid transformations by incorporating additional learnable parameters. By revisiting fundamental camera theory and leveraging advanced neural ODE techniques, we achieve precise modeling of continuous camera trajectories, leading to improved reconstruction accuracy. Extensive experiments demonstrate state-of-the-art performance both quantitatively and qualitatively on benchmark datasets, which include a wide range of motion blur scenarios, from moderate to extreme blur.

Thu 23 Oct. 17:45 - 19:45 PDT

#165
Splat-based 3D Scene Reconstruction with Extreme Motion-blur

Hyeonjoong Jang · Dongyoung Choi · Donggun Kim · Woohyun Kang · Min H. Kim

We propose a splat-based 3D scene reconstruction method from RGB-D input that effectively handles extreme motion blur, a frequent challenge in low-light environments. Under dim illumination, RGB frames often suffer from severe motion blur due to extended exposure times, causing traditional camera pose estimation methods, such as COLMAP, to fail. This results in inaccurate camera pose and blurry color input, compromising the quality of 3D reconstructions. Although recent 3D reconstruction techniques like Neural Radiance Fields and Gaussian Splatting have demonstrated impressive results, they rely on accurate camera trajectory estimation, which becomes challenging under fast motion or poor lighting conditions. Furthermore, rapid camera movement and the limited field of view of depth sensors reduce point cloud overlap, limiting the effectiveness of pose estimation with the ICP algorithm. To address these issues, we introduce a method that combines camera pose estimation and image deblurring using a Gaussian Splatting framework, leveraging both 3D Gaussian splats and depth inputs for enhanced scene representation. Our method first aligns consecutive RGB-D frames through optical flow and ICP, then refines camera poses and 3D geometry by adjusting Gaussian positions for optimal depth alignment. To handle motion blur, we model camera movement during exposure and deblur images by comparing the input with a series of sharp, rendered frames. Experiments on a new RGB-D dataset with extreme motion blur show that our method outperforms existing approaches, enabling high-quality reconstructions even in challenging conditions. This approach has broad implications for 3D mapping applications in robotics, autonomous navigation, and augmented reality. Both code and dataset will be publicly available.

Thu 23 Oct. 17:45 - 19:45 PDT

#166
Generalized and Efficient 2D Gaussian Splatting for Arbitrary-scale Super-Resolution

Du Chen · Liyi Chen · Zhengqiang ZHANG · Lei Zhang

Implicit Neural Representation (INR) has been successfully employed for Arbitrary-scale Super-Resolution (ASR). However, INR-based models need to query the multi-layer perceptron module numerous times and render a pixel in each query, resulting in insufficient representation capability and computational efficiency. Recently, Gaussian Splatting (GS) has shown its advantages over INR in both visual quality and rendering speed in 3D tasks, which motivates us to explore whether GS can be employed for the ASR task. However, directly applying GS to ASR is exceptionally challenging because the original GS is an optimization-based method through overfitting each single scene, while in ASR we aim to learn a single model that can generalize to different images and scaling factors. We overcome these challenges by developing two novel techniques. Firstly, to generalize GS for ASR, we elaborately design an architecture to predict the corresponding image-conditioned Gaussians of the input low-resolution image in a feed-forward manner. Each Gaussian can fit the shape and direction of an area of complex textures, showing powerful representation capability. Secondly, we implement an efficient differentiable 2D GPU/CUDA-based scale-aware rasterization to render super-resolved images by sampling discrete RGB values from the predicted continuous Gaussians. Via end-to-end training, our optimized network, namely GSASR, can perform ASR for any image and unseen scaling factors. Extensive experiments validate the effectiveness of our proposed method. The code and models will be released.
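A toy version of rendering at arbitrary scales from 2D Gaussians: evaluate each predicted Gaussian (mean, covariance, color) at the continuous pixel centers of whatever target resolution is requested and accumulate. The sketch below is a plain NumPy illustration of that idea; the actual method predicts image-conditioned Gaussians with a network and rasterizes them with a scale-aware CUDA kernel, and every name and number here is made up for illustration.

```python
import numpy as np

# Render an image at any resolution by evaluating 2D anisotropic Gaussians at the
# continuous pixel centers of the target grid and normalizing the accumulated colors.

rng = np.random.default_rng(4)
num_gauss = 300
means = rng.uniform(0.0, 1.0, size=(num_gauss, 2))            # in normalized [0,1]^2 coords
colors = rng.random((num_gauss, 3))
weights = rng.uniform(0.5, 1.5, size=num_gauss)

# Random anisotropic covariances built as R diag(s^2) R^T.
theta = rng.uniform(0, np.pi, size=num_gauss)
scales = rng.uniform(0.01, 0.05, size=(num_gauss, 2))
cos, sin = np.cos(theta), np.sin(theta)
R = np.stack([np.stack([cos, -sin], -1), np.stack([sin, cos], -1)], -2)   # (N, 2, 2)
cov = R @ (scales[..., None] ** 2 * np.eye(2)) @ np.transpose(R, (0, 2, 1))
cov_inv = np.linalg.inv(cov)

def render(height, width):
    ys, xs = np.meshgrid(np.linspace(0, 1, height), np.linspace(0, 1, width), indexing="ij")
    pix = np.stack([xs, ys], -1).reshape(-1, 2)                # continuous pixel centers
    d = pix[:, None, :] - means[None, :, :]                    # (P, N, 2)
    mdist = np.einsum("pni,nij,pnj->pn", d, cov_inv, d)        # Mahalanobis distances
    w = weights * np.exp(-0.5 * mdist)                         # (P, N)
    img = (w @ colors) / (w.sum(-1, keepdims=True) + 1e-8)
    return img.reshape(height, width, 3)

print(render(32, 32).shape, render(48, 48).shape)              # any target scale, same Gaussians
```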

Thu 23 Oct. 17:45 - 19:45 PDT

#167
Visual Surface Wave Elastography: Revealing Subsurface Physical Properties via Visible Surface Waves

Alexander Ogren · Berthy Feng · Jihoon Ahn · Katherine Bouman · Chiara Daraio

Wave propagation on the surface of a material contains information about physical properties beneath its surface. We propose a method for inferring the thickness and stiffness of a structure from just a video of waves on its surface. Our method works by extracting a dispersion relation from the video and then solving a physics-based optimization problem to find the best-fitting thickness and stiffness parameters. We validate our method on both simulated and real data, in both cases showing strong agreement with ground-truth measurements. Our technique provides a proof-of-concept for at-home health monitoring of medically-informative tissue properties, and it is further applicable to fields such as human-computer interaction.
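The core measurement step, turning a space-time recording of surface waves into a dispersion relation, can be sketched with a 2D Fourier transform: spectral energy concentrates along omega(k), so the ridge of the magnitude spectrum gives the frequency associated with each wavenumber. The toy below synthesizes a 1D dispersive wave field and recovers its dispersion law; it only illustrates the general signal-processing idea, not the paper's pipeline or its physics-based optimization.

```python
import numpy as np

# Extract a dispersion relation from a space-time wave field with a 2D FFT: for each
# wavenumber bin, the temporal frequency carrying most energy traces the dispersion curve.

nx, nt, dx, dt = 256, 512, 0.01, 0.002            # 2.56 m observed for ~1 s
x = np.arange(nx) * dx
t = np.arange(nt) * dt

# Synthesize a dispersive wave field: phase speed grows with wavenumber, c(k) = c0 + b*k.
c0, b = 2.0, 0.05
field = np.zeros((nt, nx))
for k in np.linspace(20, 120, 12):                # superpose a few wavenumbers (rad/m)
    omega = (c0 + b * k) * k                      # dispersion law used for synthesis
    field += np.cos(k * x[None, :] - omega * t[:, None])

spec = np.abs(np.fft.rfft2(field))                # shape (nt, nx // 2 + 1)
wavenumbers = 2 * np.pi * np.fft.rfftfreq(nx, d=dx)   # rad/m, one per spatial bin
temporal = 2 * np.pi * np.fft.fftfreq(nt, d=dt)       # rad/s, signed

# Ridge extraction: for each wavenumber bin, the temporal frequency with most energy.
ridge = np.abs(temporal[np.argmax(spec, axis=0)])
for k_query in (40.0, 80.0):
    k_bin = int(np.argmin(np.abs(wavenumbers - k_query)))
    k_val = wavenumbers[k_bin]
    print(f"k={k_val:.1f} rad/m -> omega~{ridge[k_bin]:.1f} rad/s, "
          f"synthesis law gives {(c0 + b * k_val) * k_val:.1f}")
```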

Thu 23 Oct. 17:45 - 19:45 PDT

#168
GaRe: Relightable 3D Gaussian Splatting for Outdoor Scenes from Unconstrained Photo Collections

Haiyang Bai · Jiaqi Zhu · Songru Jiang · Wei Huang · Tao Lu · Yuanqi Li · Jie Guo · Runze Fu · Yanwen Guo · Lijun Chen

We propose a 3D Gaussian splatting-based framework for outdoor relighting that leverages intrinsic image decomposition to precisely integrate sunlight, sky radiance, and indirect lighting from unconstrained photo collections. Unlike prior methods that compress the per-image global illumination into a single latent vector, our approach simultaneously enables diverse shading manipulation and the generation of dynamic shadow effects. This is achieved through three key innovations: (1) a residual-based sun visibility extraction method to accurately separate direct sunlight effects, (2) a region-based supervision framework with a structural consistency loss for physically interpretable and coherent illumination decomposition, and (3) a ray-tracing-based technique for realistic shadow simulation. Extensive experiments demonstrate that our framework synthesizes novel views with competitive fidelity against state-of-the-art relighting solutions and produces more natural and multifaceted illumination and shadow effects.

Thu 23 Oct. 17:45 - 19:45 PDT

#169
Highlight
PolarAnything: Diffusion-based Polarimetric Image Synthesis

Kailong Zhang · Youwei Lyu · Heng Guo · Si Li · Zhanyu Ma · Boxin Shi

Polarization images facilitate image enhancement and 3D reconstruction tasks, but the limited accessibility of polarization cameras hinders their broader application. This gap drives the need for synthesizing photorealistic polarization images. The existing polarization simulator Mitsuba relies on a parametric polarization image formation model and requires extensive 3D assets covering shape and PBR materials, preventing it from generating large-scale photorealistic images. To address this problem, we propose PolarAnything, capable of synthesizing polarization images from a single RGB input with both photorealism and physical accuracy, eliminating the dependency on 3D asset collections. Drawing inspiration from the zero-shot performance of pretrained diffusion models, we introduce a diffusion-based generative framework with an effective representation strategy that preserves the fidelity of polarization properties. Extensive experiments show that our model not only generates high-quality polarization images but also effectively supports downstream tasks such as shape from polarization.
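For context on the quantities a polarization image encodes, the standard linear-polarization bookkeeping is compact: four captures behind a linear polarizer at 0/45/90/135 degrees give the Stokes components, from which the degree and angle of linear polarization follow. The sketch below is that textbook relation, independent of this paper's diffusion-based generator.

```python
import numpy as np

# Recover Stokes components and the degree/angle of linear polarization (DoLP/AoLP)
# from four polarizer-angle captures. Standard relations, not paper-specific code.

def linear_stokes(i0, i45, i90, i135):
    s0 = 0.5 * (i0 + i45 + i90 + i135)          # total intensity
    s1 = i0 - i90
    s2 = i45 - i135
    dolp = np.sqrt(s1**2 + s2**2) / np.clip(s0, 1e-8, None)   # degree of linear polarization
    aolp = 0.5 * np.arctan2(s2, s1)                            # angle of linear polarization
    return s0, s1, s2, dolp, aolp

# Synthetic check: a pixel with intensity I, DoLP rho, AoLP phi obeys
# I(theta) = 0.5 * I * (1 + rho * cos(2 * (theta - phi))) under an ideal polarizer.
I, rho, phi = 1.0, 0.4, np.deg2rad(30.0)
captures = [0.5 * I * (1 + rho * np.cos(2 * (np.deg2rad(a) - phi))) for a in (0, 45, 90, 135)]
print(linear_stokes(*captures)[3:])   # recovers dolp ~ 0.4 and aolp ~ 30 degrees (in radians)
```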

Thu 23 Oct. 17:45 - 19:45 PDT

#170
LightCity: An Urban Dataset for Outdoor Inverse Rendering and Reconstruction under Multi-illumination Conditions

Jingjing Wang · Qirui Hu · Chong Bao · Yuke Zhu · Hujun Bao · Zhaopeng Cui · Guofeng Zhang

We propose an outdoor scene dataset and a series of benchmarks based on it. Inverse rendering in urban scenes is pivotal for applications like autonomous driving and digital twins, yet it faces significant challenges due to complex illumination conditions, including multi-illumination and indirect light and shadow effects. However, the impact of these challenges on intrinsic decomposition and 3D reconstruction has not been explored due to the lack of appropriate datasets. In this paper, we present LightCity, a novel high-quality synthetic urban dataset featuring diverse illumination conditions with realistic indirect light and shadow effects. LightCity encompasses over 300 sky maps with highly controllable illumination, varying scales with both street-level and aerial perspectives over 50K images, and rich properties such as depth, normals, material components, and direct and indirect light. Besides, we leverage LightCity to benchmark three fundamental tasks in urban environments and conduct a comprehensive analysis of these benchmarks, laying a robust foundation for advancing related research.

Thu 23 Oct. 17:45 - 19:45 PDT

#171
3DRealCar: An In-the-wild RGB-D Car Dataset with 360-degree Views

Xiaobiao Du · Yida Wang · Haiyang Sun · Zhuojie Wu · Hongwei Sheng · Shuyun Wang · Jiaying Ying · Ming Lu · Tianqing Zhu · Kun Zhan · Xin Yu

3D cars are commonly used in self-driving systems, virtual/augmented reality, and games. However, existing 3D car datasets are either synthetic or low-quality, limiting their applications in practical scenarios and leaving a significant gap in high-quality real-world 3D car datasets. In this paper, we propose the first large-scale 3D real car dataset, termed 3DRealCar, offering three distinctive features. (1) \textbf{High-Volume}: 2,500 cars are meticulously scanned by smartphones, obtaining car images and point clouds with real-world dimensions; (2) \textbf{High-Quality}: each car is captured in an average of 200 dense, high-resolution 360-degree RGB-D views, enabling high-fidelity 3D reconstruction; (3) \textbf{High-Diversity}: the dataset contains various cars from over 100 brands, collected under three distinct lighting conditions, including reflective, standard, and dark. Additionally, we offer detailed car parsing maps for each instance to promote research in car parsing tasks. Moreover, we remove background point clouds and standardize the car orientation to a unified axis, so that reconstruction focuses on the cars alone and enables controllable rendering without background. We benchmark 3D reconstruction results with state-of-the-art methods across each lighting condition in 3DRealCar. Extensive experiments demonstrate that the standard lighting condition part of 3DRealCar can be used to produce a large number of high-quality 3D cars, improving various 2D and 3D tasks related to cars. Notably, our dataset brings insight into the fact that recent 3D reconstruction methods face challenges in reconstructing high-quality 3D cars under reflective and dark lighting conditions.

Thu 23 Oct. 17:45 - 19:45 PDT

#172
PCR-GS: COLMAP-Free 3D Gaussian Splatting via Pose Co-Regularizations

YU WEI · Jiahui Zhang · Xiaoqin Zhang · Ling Shao · Shijian Lu

COLMAP-free 3D Gaussian Splatting (3D-GS) has recently attracted increasing attention due to its remarkable performance in reconstructing high-quality 3D scenes from unposed images or videos. However, it often struggles to handle scenes with complex camera trajectories as featured by drastic rotation and translation across adjacent camera views, leading to degraded estimation of camera poses and further local minima in joint optimization of camera poses and 3D-GS. We propose PCR-GS, an innovative COLMAP-free 3DGS technique that achieves superior 3D scene modeling and camera pose estimation via camera pose co-regularization. PCR-GS achieves regularization from two perspectives. The first is feature reprojection regularization which extracts view-robust DINO features from adjacent camera views and aligns their semantic information for camera pose regularization. The second is wavelet-based frequency regularization which exploits discrepancy in high-frequency details to further optimize the rotation matrix in camera poses. Extensive experiments over multiple real-world scenes show that the proposed PCR-GS achieves superior pose-free 3D-GS scene modeling under dramatic changes of camera trajectories.

Thu 23 Oct. 17:45 - 19:45 PDT

#173
NuiScene: Exploring Efficient Generation of Unbounded Outdoor Scenes

Han-Hung Lee · Qinghong Han · Angel Chang

In this paper, we explore the task of generating expansive outdoor scenes, ranging from city skyscrapers to medieval castles and houses. Unlike indoor scene generation, which has been a primary focus of prior work, outdoor scene generation presents unique challenges, including the wide variation in scene heights and the need for an efficient approach capable of rapidly producing large landscapes. To address this, we introduce an efficient representation that encodes scene chunks as homogeneous vector sets, offering better compression than spatially structured latents used in prior methods. Furthermore, we train an outpainting model under four conditional patterns to generate scene chunks in a zig-zag manner, enabling more coherent generation compared to prior work that relies on inpainting methods. This provides richer context and speeds up generation by eliminating extra diffusion steps. Finally, to facilitate this task, we curate NuiScene43, a small but high-quality set of scenes and preprocess them for joint training. Interestingly, when trained on scenes of varying styles, our model can blend vastly different scenes, such as rural houses and city skyscrapers, within the same scene.

Thu 23 Oct. 17:45 - 19:45 PDT

#174
MinCD-PnP: Learning 2D-3D Correspondences with Approximate Blind PnP

Pei An · Jiaqi Yang · Muyao Peng · You Yang · Qiong Liu · Xiaolin Wu · Liangliang Nan

Image-to-point-cloud (I2P) registration is a fundamental problem in computer vision, focusing on establishing 2D-3D correspondences between an image and a point cloud. The differentiable perspective-n-point (PnP) solver has been widely used to supervise I2P registration networks by enforcing projective constraints on 2D-3D correspondences. However, differentiable PnP is highly sensitive to noise and outliers in the predicted correspondences, which hinders the effectiveness of correspondence learning. Inspired by the robustness of blind PnP against noise and outliers in correspondences, we propose an approximated blind-PnP-based correspondence learning approach. To mitigate the high computational cost of blind PnP, we simplify blind PnP to the more tractable task of minimizing the Chamfer distance between learned 2D and 3D keypoints, called MinCD-PnP. To effectively solve MinCD-PnP, we design a lightweight multi-task learning module, named MinCD-Net, which can be easily integrated into existing I2P registration architectures. Extensive experiments on 7-Scenes, RGBD-V2, ScanNet, and self-collected datasets demonstrate that MinCD-Net outperforms state-of-the-art methods and achieves a higher inlier ratio (IR) and registration recall (RR) in both cross-scene and cross-dataset settings. Source code will be released soon.
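The Chamfer-distance objective between a projected 3D keypoint set and a 2D keypoint set is easy to state: match each point to its nearest neighbor in the other set and average the squared distances, so that a better pose yields a lower cost. The sketch below illustrates that objective with a toy pinhole camera; the intrinsics, pose perturbation, and names are made-up placeholders, not the paper's network or solver.

```python
import numpy as np

# Symmetric Chamfer distance between projected 3D keypoints and 2D keypoints:
# a pose closer to the truth produces a lower cost.

def project(points_3d, K, R, t):
    """Pinhole projection of Nx3 points with intrinsics K and pose (R, t)."""
    cam = points_3d @ R.T + t
    uv = cam @ K.T
    return uv[:, :2] / uv[:, 2:3]

def chamfer_2d(a, b):
    """Symmetric Chamfer distance between two 2D point sets (Na,2) and (Nb,2)."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)     # (Na, Nb) squared distances
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()

rng = np.random.default_rng(5)
K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
R, t = np.eye(3), np.array([0.0, 0.0, 0.0])
pts3d = rng.uniform([-1, -1, 4], [1, 1, 8], size=(64, 3))    # points in front of the camera

kp2d_noisy = project(pts3d, K, R, t) + rng.normal(scale=2.0, size=(64, 2))

print(chamfer_2d(project(pts3d, K, R, t), kp2d_noisy))               # small: good pose
print(chamfer_2d(project(pts3d, K, R, t + [0.5, 0, 0]), kp2d_noisy)) # larger: pose error
```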

Thu 23 Oct. 17:45 - 19:45 PDT

#175
ETA: Efficiency through Thinking Ahead, A Dual Approach to Self-Driving with Large Models

Shadi Hamdan · Chonghao Sima · Zetong Yang · Hongyang Li · Fatma Guney

How can we benefit from large models without sacrificing inference speed, a common dilemma in self-driving systems? A prevalent solution is a dual-system architecture, employing a small model for rapid, reactive decisions and a larger model for slower but more informative analyses. Existing dual-system designs often implement parallel architectures where inference is either directly conducted using the large model at each current frame or retrieved from previously stored inference results. However, these works still struggle to enable large models for a timely response to every online frame. Our key insight is to shift intensive computations of the current frame to previous time steps and perform a batch inference of multiple time steps to make large models respond promptly to each time step. To achieve the shifting, we introduce Efficiency through Thinking Ahead (ETA), an asynchronous system designed to: (1) propagate informative features from the past to the current frame using future predictions from the large model, (2) extract current frame features using a small model for real-time responsiveness, and (3) integrate these dual features via an action mask mechanism that emphasizes action-critical image regions. Evaluated on the Bench2Drive CARLA Leaderboard-v2 benchmark, ETA advances state-of-the-art performance by 8\% with a driving score of 69.53 while maintaining a near-real-time inference speed at 50 ms.

Thu 23 Oct. 17:45 - 19:45 PDT

#176
MergeOcc: Bridge the Domain Gap between Different LiDARs for Robust Occupancy Prediction

Zikun Xu · Shaobing Xu

LiDAR-based 3D occupancy prediction algorithms have evolved rapidly with the advent of large-scale datasets. However, the full potential of the existing diverse datasets remains underutilized, as they are typically employed in isolation. Models trained on a single dataset often suffer considerable performance degradation when deployed to real-world scenarios or datasets involving disparate LiDARs. To address this limitation, we introduce \emph{MergeOcc}, a generalized pipeline designed to handle different LiDARs by leveraging multiple datasets concurrently. The gaps among LiDAR datasets primarily manifest in geometric disparities and semantic inconsistencies, which correspond to the fundamental components of datasets: data and labels. In response, MergeOcc incorporates a novel model architecture that features a geometric realignment and a semantic label mapping to facilitate multiple-dataset training (MDT). The effectiveness of MergeOcc is validated through extensive experiments on two prominent datasets for autonomous vehicles: OpenOccupancy-nuScenes and SemanticKITTI. The results demonstrate its enhanced robustness and performance improvements across both types of LiDARs, outperforming several SOTA methods. Additionally, despite using an identical model architecture and hyper-parameter set, MergeOcc can significantly surpass the baselines thanks to its ability to learn from diverse datasets. To the best of our knowledge, this work presents the first cross-dataset 3D occupancy prediction pipeline that effectively bridges the domain gap for seamless deployment across heterogeneous platforms.

Thu 23 Oct. 17:45 - 19:45 PDT

#177
RARE: Refine Any Registration of Pairwise Point Clouds via Zero-Shot Learning

Chengyu Zheng · Honghua Chen · Jin Huang · Mingqiang Wei

Recent research leveraging large-scale pretrained diffusion models has demonstrated the potential of using diffusion features to establish semantic correspondences in images. Inspired by advancements in diffusion-based techniques, we propose a novel zero-shot method for refining point cloud registration algorithms. Our approach leverages correspondences derived from depth images to enhance point feature representations, eliminating the need for a dedicated training dataset. Specifically, we first project the point cloud into depth maps from multiple perspectives and extract implicit knowledge from a pretrained diffusion network as depth diffusion features. These features are then integrated with geometric features obtained from existing methods to establish more accurate correspondences between point clouds. By leveraging these refined correspondences, our approach achieves significantly improved registration accuracy. Extensive experiments demonstrate that our method not only enhances the performance of existing point cloud registration techniques but also exhibits robust generalization capabilities across diverse datasets.

Thu 23 Oct. 17:45 - 19:45 PDT

#178
Feature Extraction and Representation of Pre-training Point Cloud Based on Diffusion Models

Chang Qiu · Feipeng Da · Zilei Zhang

The pretrain-finetune paradigm of pre-training a model on large amounts of image and text data and then fine-tuning it for a specific task has led to significant progress in many 2D image and natural language processing tasks. Similarly, the use of pre-training methods on point cloud data can also enhance the working performance and generalization ability of the model. Therefore, in this paper, we propose a pre-training framework based on a diffusion model, called PreDifPoint. It accomplishes the pre-training of the model's backbone network through a diffusion process of gradual denoising. We aggregate the potential features extracted from the backbone network, input them as conditions into the subsequent diffusion model, and direct the point-to-point mapping relationship of the noisy point clouds at neighboring time steps, so as to generate high-quality point clouds and at the same time better perform various downstream point cloud tasks. We also introduce a bi-directional covariate attention (DXCA-Attention) mechanism for capturing complex feature interactions, fusing local and global features, and improving the detail recovery of point clouds. In addition, we propose a density-adaptive sampling strategy, which helps the model dynamically adjust the sampling strategy between different time steps and guides the model to pay more attention to the denser regions in the point cloud, thus improving the effectiveness of the model in point cloud recovery. Our PreDifPoint framework achieves more competitive results on various real-world datasets. Specifically, PreDifPoint achieves an overall accuracy of 87.96%, which is 0.35% higher than PointDif, on the classification task on PB-T50-395RS, a variant of the ScanObjectNN dataset.

Thu 23 Oct. 17:45 - 19:45 PDT

#179
Occupancy Learning with Spatiotemporal Memory

Ziyang Leng · Jiawei Yang · Wenlong Yi · Bolei Zhou

3D occupancy becomes a promising perception representation for autonomous driving to model the surrounding environment at a fine-grained scale. However, it remains challenging to efficiently aggregate 3D occupancy over time across multiple input frames due to the high processing cost and the uncertainty and dynamics of voxels. To address this issue, we propose ST-Occ, a scene-level occupancy representation learning framework that effectively learns the spatiotemporal feature with temporal consistency. ST-Occ consists of two core designs: a spatiotemporal memory that captures comprehensive historical information and stores it efficiently through a scene-level representation, and a memory attention that conditions the current occupancy representation on the spatiotemporal memory with a model of uncertainty and dynamic awareness. Our method significantly enhances the spatiotemporal representation learned for 3D occupancy prediction tasks by exploiting the temporal dependency between multi-frame inputs. Experiments show that our approach outperforms the state-of-the-art methods by a margin of 3 mIoU and reduces the temporal inconsistency by 29%. The code and model will be made publicly available.

Thu 23 Oct. 17:45 - 19:45 PDT

#180
Towards Open-World Generation of Stereo Images and Unsupervised Matching

Feng Qiao · Zhexiao Xiong · Eric Xing · Nathan Jacobs

Stereo images are fundamental to numerous applications, including extended reality (XR) devices, autonomous driving, and robotics. Unfortunately, acquiring high-quality stereo images remains challenging due to the precise calibration requirements of dual-camera setups and the complexity of obtaining accurate, dense disparity maps. Existing stereo image generation methods typically focus on either visual quality for viewing or geometric accuracy for matching, but not both. We introduce GenStereo, a diffusion-based approach, to bridge this gap. The method includes two primary innovations: (1) conditioning the diffusion process on a disparity-aware coordinate embedding and a warped input image, allowing for more precise stereo alignment than previous methods, and (2) an adaptive fusion mechanism that intelligently combines the diffusion-generated image with a warped image, improving both realism and disparity consistency. Through extensive training on 11 diverse stereo datasets, GenStereo demonstrates strong generalization ability. GenStereo achieves state-of-the-art performance in both stereo image generation and unsupervised stereo matching tasks. Our framework eliminates the need for complex hardware setups while enabling high-quality stereo image generation, making it valuable for both real-world applications and unsupervised learning scenarios. The code will be made publicly available upon acceptance.
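The "warped input image" conditioning relies on the basic rectified-stereo relation: a pixel at column x in the right view corresponds to column x + d(x, y) in the left image, so the right view can be synthesized by backward-warping the left image with a disparity map. The snippet below is a minimal NumPy illustration of that warp without occlusion handling; array names and the constant disparity are made up for the example.

```python
import numpy as np

# Backward-warp a left image into the right view of a rectified stereo pair:
# right(x, y) samples left(x + d(x, y), y).

def warp_left_to_right(left, disparity):
    h, w = disparity.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    src_x = np.clip(np.round(xs + disparity).astype(int), 0, w - 1)
    return left[ys, src_x]

rng = np.random.default_rng(6)
left = rng.random((120, 160, 3))
disparity = np.full((120, 160), 8.0)        # constant 8-pixel shift for illustration
right = warp_left_to_right(left, disparity)
print(right.shape, np.allclose(right[:, :-8], left[:, 8:]))   # shifted copy, as expected
```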

Thu 23 Oct. 17:45 - 19:45 PDT

#181
DMesh++: An Efficient Differentiable Mesh for Complex Shapes

Sanghyun Son · Matheus Gadelha · Yang Zhou · Matthew Fisher · Zexiang Xu · Yi-Ling Qiao · Ming Lin · Yi Zhou

Recent probabilistic methods for 3D triangular meshes have shown promise in capturing diverse shapes by managing mesh connectivity in a differentiable manner. However, these methods are often limited by high computational costs that scale disproportionately with the level of detail, restricting their applicability for complex shapes requiring high face density. In this work, we introduce a novel differentiable mesh processing method that addresses these computational challenges in both 2D and 3D. Our method reduces time complexity from $O(N)$ to $O(\log N)$ and requires significantly less memory than previous approaches, enabling us to handle far more intricate structures. Building on this innovation, we present a reconstruction algorithm capable of generating complex 2D and 3D shapes from point clouds or multi-view images. We demonstrate its efficacy on various objects exhibiting diverse topologies and geometric details.

Thu 23 Oct. 17:45 - 19:45 PDT

#182
TAD-E2E: A Large-scale End-to-end Autonomous Driving Dataset

Chang Liu · mingxuzhu mingxuzhu · Zheyuan Zhang · Linna Song · xiao zhao · Luo Qingliang · Qi Wang · Chufan Guo · Kuifeng Su

End-to-end autonomous driving technology has recently become a focal point of research and application in autonomous driving. State-of-the-art (SOTA) methods are often trained and evaluated on the NuScenes dataset. However, the NuScenes dataset, introduced in 2019 for 3D perception tasks, has several limitations, such as insufficient scale, simple scenes, and homogeneous driving behaviors, that restrict the upper-bound development of end-to-end autonomous driving algorithms. In light of these issues, we propose a novel, large-scale real-world dataset specifically designed for end-to-end autonomous driving tasks, named TAD-E2E, which is 25x larger than NuScenes, has 1.7x its scene complexity, and features a highly diverse range of driving behaviors. We replicated SOTA methods on the TAD-E2E dataset and observed that, as expected, these methods no longer performed well. Additionally, in response to the challenging scenarios presented in the TAD-E2E dataset, we devised a multimodal sparse end-to-end method that significantly outperforms SOTA methods. Ablation studies demonstrate the effectiveness of our method, and we analyze the contributions of each module. The dataset and code will be made open source upon acceptance of the paper.

Thu 23 Oct. 17:45 - 19:45 PDT

#183
LoD-Loc v2: Aerial Visual Localization over Low Level-of-Detail City Models using Explicit Silhouette Alignment

Juelin Zhu · Shuaibang Peng · Long Wang · Hanlin Tan · Yu Liu · Maojun Zhang · Shen Yan

We propose a novel method for aerial visual localization over low Level-of-Detail (LoD) city models. The previous wireframe-alignment-based method LoD-Loc [97] has shown promising localization results leveraging LoD models. However, LoD-Loc mainly relies on high-LoD (LoD3 or LoD2) city models, whereas the majority of available models, and those that many countries plan to construct nationwide, are low-LoD (LoD1). Consequently, enabling localization on low-LoD city models could unlock drones' potential for global urban localization. To address these issues, we introduce LoD-Loc v2, which employs a coarse-to-fine strategy using explicit silhouette alignment to achieve accurate localization over low-LoD city models in the air. Specifically, given a query image, LoD-Loc v2 first applies a building segmentation network to extract building silhouettes. Then, in the coarse pose selection stage, we construct a pose cost volume by uniformly sampling pose hypotheses around a prior pose to represent the pose probability distribution. Each cost of the volume measures the degree of alignment between the projected and predicted silhouettes. We select the pose with the maximum value as the coarse pose. In the fine pose estimation stage, a particle filtering method incorporating a multi-beam tracking approach is used to efficiently explore the hypothesis space and obtain the final pose estimate. To further facilitate research in this field, we release two datasets with LoD1 city models covering 10.7 km$^2$, along with real RGB queries and ground-truth pose annotations. Experimental results show that LoD-Loc v2 improves estimation accuracy with high-LoD models and enables localization with low-LoD models for the first time. Moreover, it outperforms state-of-the-art baselines by large margins, even surpassing texture-model-based methods, and broadens the convergence basin to accommodate larger prior errors. The code and dataset will be made available upon publication.
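The coarse stage can be pictured as a brute-force search: each pose hypothesis produces a model silhouette, the hypothesis is scored by how well that silhouette overlaps the predicted building mask (IoU is one natural choice), and the best-scoring hypothesis becomes the coarse pose. The toy below searches only 2D translations of a binary mask to keep the idea visible; the real method samples full 6-DoF poses and projects an LoD1 model, and all names here are illustrative.

```python
import numpy as np

# Toy silhouette-alignment cost volume: score shifted model silhouettes against a
# predicted building mask with IoU and pick the best-scoring hypothesis.

def iou(a, b):
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 0.0

def shift_mask(mask, dx, dy):
    out = np.zeros_like(mask)
    h, w = mask.shape
    dst_x = slice(max(dx, 0), w + min(dx, 0))
    dst_y = slice(max(dy, 0), h + min(dy, 0))
    src_x = slice(max(-dx, 0), w + min(-dx, 0))
    src_y = slice(max(-dy, 0), h + min(-dy, 0))
    out[dst_y, dst_x] = mask[src_y, src_x]
    return out

pred_silhouette = np.zeros((64, 64), dtype=bool)
pred_silhouette[20:50, 25:45] = True                      # "predicted" building mask
true_offset = (4, -3)
model_silhouette = shift_mask(pred_silhouette, -true_offset[0], -true_offset[1])

# Cost volume over uniformly sampled hypotheses around a prior (here: the origin).
offsets = [(dx, dy) for dx in range(-8, 9) for dy in range(-8, 9)]
scores = [iou(shift_mask(model_silhouette, dx, dy), pred_silhouette) for dx, dy in offsets]
print("best hypothesis:", offsets[int(np.argmax(scores))], "expected:", true_offset)
```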

Thu 23 Oct. 17:45 - 19:45 PDT

#184
LANGTRAJ: Diffusion Model and Dataset for Language-Conditioned Trajectory Simulation

WEI-JER Chang · Masayoshi Tomizuka · Wei Zhan · Manmohan Chandraker · Francesco Pittaluga

Evaluating autonomous vehicles with controllability enables scalable testing in counterfactual or structured settings, enhancing both efficiency and safety. We introduce LangTraj, a language-conditioned scene-diffusion model that simulates the joint behavior of all agents in traffic scenarios. By conditioning on natural language inputs, LangTraj provides flexible and intuitive control over interactive behaviors, generating nuanced and realistic scenarios. Unlike prior approaches that depend on domain-specific guidance functions, LangTraj incorporates language conditioning during training, facilitating more intuitive traffic simulation control. We propose a novel closed-loop training strategy for diffusion models, explicitly tailored to enhance stability and realism during closed-loop simulation. To support language-conditioned simulation, we develop Inter-Drive, a large-scale dataset with diverse and interactive labels for training language-conditioned diffusion models. Our dataset is built upon a scalable pipeline for annotating agent-agent interactions and single-agent behaviors, ensuring rich and varied supervision. Validated on the Waymo Motion Dataset, LangTraj demonstrates strong performance in realism, language controllability, and language-conditioned safety-critical simulation, establishing a new paradigm for flexible and scalable autonomous vehicle testing.

Thu 23 Oct. 17:45 - 19:45 PDT

#185
EVT: Efficient View Transformation for Multi-Modal 3D Object Detection

Yongjin Lee · Hyeon-Mun Jeong · Yurim Jeon · Sanghyun Kim

Multi-modal sensor fusion in Bird’s Eye View (BEV) representation has become the leading approach for 3D object detection. However, existing methods often rely on depth estimators or transformer encoders to transform image features into BEV space, which reduces robustness or introduces significant computational overhead. Moreover, the insufficient geometric guidance in view transformation results in ray-directional misalignments, limiting the effectiveness of BEV representations. To address these challenges, we propose Efficient View Transformation (EVT), a novel 3D object detection framework that constructs a well-structured BEV representation, improving both accuracy and efficiency. Our approach focuses on two key aspects. First, Adaptive Sampling and Adaptive Projection (ASAP), which utilizes LiDAR guidance to generate 3D sampling points and adaptive kernels, enables more effective transformation of image features into BEV space and a refined BEV representation. Second, an improved query-based detection framework, incorporating group-wise mixed query selection and geometry-aware cross-attention, effectively captures both the common properties and the geometric structure of objects in the transformer decoder. On the nuScenes test set, EVT achieves state-of-the-art performance of 75.3\% NDS with real-time inference speed.

Thu 23 Oct. 17:45 - 19:45 PDT

#186
OccluGaussian: Occlusion-Aware Gaussian Splatting for Large Scene Reconstruction and Rendering

Shiyong Liu · Xiao Tang · Zhihao Li · Yingfan He · Chongjie Ye · Jianzhuang Liu · Binxiao Huang · Shunbo Zhou · Xiaofei Wu

In large-scale scene reconstruction using 3D Gaussian splatting, it is common to partition the scene into multiple smaller regions and reconstruct them individually. However, existing division methods are occlusion-agnostic, meaning that each region may contain areas with severe occlusions. As a result, the cameras within those regions are less correlated, leading to a low average contribution to the overall reconstruction. In this paper, we propose an occlusion-aware scene division strategy that clusters training cameras based on their positions and co-visibilities to acquire multiple regions. Cameras in such regions exhibit stronger correlations and a higher average contribution, facilitating high-quality scene reconstruction. We further propose a region-based rendering technique to accelerate large scene rendering, which culls Gaussians invisible to the region where the viewpoint is located. Such a technique significantly speeds up the rendering without compromising quality. Extensive experiments on multiple large scenes show that our method achieves superior reconstruction results with faster rendering speeds compared to existing state-of-the-art approaches.

Thu 23 Oct. 17:45 - 19:45 PDT

#187
Temporal Overlapping Prediction: A Self-supervised Pre-training Method for LiDAR Moving Object Segmentation

Ziliang Miao · Runjian Chen · Yixi Cai · Buwei He · Wenquan Zhao · Wenqi Shao · Bo Zhang · Fu Zhang

Moving object segmentation (MOS) on LiDAR point clouds is crucial for autonomous systems like self-driving vehicles. Previous supervised approaches rely heavily on costly manual annotations, while LiDAR sequences naturally capture temporal motion cues that can be leveraged for self-supervised learning. In this paper, we propose Temporal Overlapping Prediction (TOP), a self-supervised pre-training method that alleviates the labeling burden for MOS. TOP explores the temporal overlapping points that are commonly observed by the current and adjacent scans, and learns spatiotemporal representations by predicting the occupancy states of these temporal overlapping points. Moreover, we utilize current occupancy reconstruction as an auxiliary pre-training objective, which enhances the model's awareness of the current structure. We conduct extensive experiments and observe that the conventional metric Intersection-over-Union (IoU) shows a strong bias toward objects with more scanned points, which might neglect small or distant objects. To compensate for this bias, we introduce an additional metric called mIoU_obj to evaluate object-level performance. Experiments on nuScenes and SemanticKITTI show that TOP outperforms both the supervised training-from-scratch baseline and other self-supervised pre-training baselines by up to 28.77% relative improvement, demonstrating strong transferability across LiDAR setups and generalization to other tasks. Code and pre-trained models will be publicly available upon publication.

Thu 23 Oct. 17:45 - 19:45 PDT

#188
Highlight
Height-Fidelity Dense Global Fusion for Multi-modal 3D Object Detection

Hanshi Wang · Jin Gao · Weiming Hu · Zhipeng Zhang

We present the first work demonstrating that a pure Mamba block can achieve efficient Dense Global Fusion while guaranteeing top performance for camera-LiDAR multi-modal 3D object detection. Our motivation stems from the observation that existing fusion strategies are constrained by their inability to simultaneously achieve efficiency, long-range modeling, and retention of complete scene information. Inspired by recent advances in state-space models (SSMs) and linear attention, we leverage their linear complexity and long-range modeling capabilities to address these challenges. However, this is non-trivial, since our experiments reveal that simply adopting efficient linear-complexity methods does not necessarily yield improvements and may even degrade performance. We attribute this degradation to the loss of height information during multi-modal alignment, leading to deviations in sequence order. To resolve this, we propose height-fidelity LiDAR encoding that preserves precise height information through voxel compression in continuous space, thereby enhancing camera-LiDAR alignment. Subsequently, we introduce the Hybrid Mamba Block, which leverages the enriched height-informed features to conduct local and global contextual learning. By integrating these components, our method achieves state-of-the-art performance with a top-tier NDS score of 75.0 on the nuScenes validation benchmark, even surpassing methods that utilize high-resolution inputs. Meanwhile, our method maintains efficiency, achieving faster inference speed than most recent state-of-the-art methods. Code will be released.

Thu 23 Oct. 17:45 - 19:45 PDT

#189
Liberated-GS: 3D Gaussian Splatting Independent from SfM Point Clouds

Weihong Pan · Xiaoyu Zhang · Hongjia Zhai · Xiaojun Xiang · Hanqing Jiang · Guofeng Zhang

3D Gaussian Splatting (3DGS) has demonstrated impressive performance in novel view synthesis and real-time rendering. However, it heavily relies on high-quality initial sparse points from Structure-from-Motion (SfM), which often struggles in textureless regions, degrading the geometry and visual quality of 3DGS. To address this limitation, we propose a novel initialization pipeline that achieves high-fidelity reconstruction from dense image sequences without relying on SfM-derived point clouds. Specifically, we first propose an effective depth alignment method that aligns the estimated monocular depth with depth rendered from an under-optimized coarse Gaussian model using an unbiased depth rasterization approach, and then ensembles them. After that, to efficiently process dense image sequences, we incorporate a progressive segmented initialization process to generate the initial points. Extensive experiments demonstrate the superiority of our method over previous approaches. Notably, our method outperforms the SfM-based method by a 14.4% improvement in LPIPS on the Mip-NeRF360 dataset and a 30.7% improvement on the Tanks and Temples dataset.

Thu 23 Oct. 17:45 - 19:45 PDT

#190
S$^3$E: Self-Supervised State Estimation for Radar-Inertial System

Shengpeng Wang · Yulong Xie · Qing Liao · Wei Wang

Millimeter-wave radar for state estimation is gaining significant attention for its affordability and reliability in harsh conditions. Existing localization solutions typically rely on post-processed radar point clouds as landmark points. Nonetheless, the inherent sparsity of radar point clouds, ghost points from multi-path effects, and limited angle resolution in single-chirp radar severely degrade state estimation performance. To address these issues, we propose S$^3$E, a \textbf{S}elf-\textbf{S}upervised \textbf{S}tate \textbf{E}stimator that employs more richly informative radar signal spectra to bypass sparse points and fuses complementary inertial information to achieve accurate localization. S$^3$E fully explores the association between \textit{exteroceptive} radar and \textit{proprioceptive} inertial sensor to achieve complementary benefits. To deal with limited angle resolution, we introduce a novel cross-fusion technique that enhances spatial structure information by exploiting subtle rotational shift correlations across heterogeneous data. The experimental results demonstrate our method achieves robust and accurate performance without relying on localization ground truth supervision. To the best of our knowledge, this is the first attempt to achieve state estimation by fusing radar spectra and inertial data in a complementary self-supervised manner. Codes will be released on GitHub.

Thu 23 Oct. 17:45 - 19:45 PDT

#191
MamV2XCalib: V2X-based Target-less Infrastructure Camera Calibration with State Space Model

Yaoye Zhu · Zhe Wang · Yan Wang

As cooperative systems that leverage roadside cameras to assist autonomous vehicle perception become increasingly widespread, large-scale precise calibration of infrastructure cameras has become a critical issue. Traditional manual calibration methods are often time-consuming, labor-intensive, and may require road closures. This paper proposes MamV2XCalib, the first V2X-based infrastructure camera calibration method with the assistance of vehicle-side LiDAR. MamV2XCalib only requires autonomous vehicles equipped with LiDAR to drive near the cameras to be calibrated in the infrastructure, without the need for specific reference objects or manual intervention. We also introduce a new targetless LiDAR-camera calibration method, which combines multi-scale features and a 4D correlation volume to estimate the correlation between vehicle-side point clouds and roadside images. We model the temporal information and estimate the rotation angles with Mamba, effectively addressing calibration failures in V2X scenarios caused by defects in the vehicle-side data (such as occlusions) and large differences in viewpoint. We evaluate MamV2XCalib on the V2X-Seq and TUMTraf-V2X real-world datasets, demonstrating the effectiveness and robustness of our V2X-based automatic calibration approach. Compared to previous LiDAR-camera methods designed for calibration on one car, our approach achieves better and more stable calibration performance in V2X scenarios with fewer parameters. Code will be made publicly available.

Thu 23 Oct. 17:45 - 19:45 PDT

#192
VideoRFSplat: Direct Scene-Level Text-to-3D Gaussian Splatting Generation with Flexible Pose and Multi-View Joint Modeling

Hyojun Go · Byeongjun Park · Hyelin Nam · Byung-Hoon Kim · Hyungjin Chung · Changick Kim

We propose VideoRFSplat, a direct text-to-3D model leveraging a video generation model to generate realistic 3D Gaussian Splatting (3DGS) for unbounded real-world scenes. To generate diverse camera poses and unbounded spatial extent of real-world scenes, while ensuring generalization to arbitrary text prompts, previous methods fine-tune 2D generative models to jointly model camera poses and multi-view images. However, these methods suffer from instability when extending 2D generative models to joint modeling due to the modality gap, which necessitates additional models to stabilize training and inference. In this work, we propose an architecture and a sampling strategy to jointly model multi-view images and camera poses when fine-tuning a video generation model. Our core idea is a dual-stream architecture that attaches a dedicated pose generation model alongside a pre-trained video generation model via communication blocks, generating multi-view images and camera poses through separate streams. This design reduces interference between the pose and image modalities. Additionally, we propose an asynchronous sampling strategy that denoises camera poses faster than multi-view images, allowing rapidly denoised poses to condition multi-view generation, reducing mutual ambiguity and enhancing cross-modal consistency. Trained on multiple large-scale real-world datasets (RealEstate10K, MVImgNet, DL3DV-10K, ACID), VideoRFSplat outperforms existing text-to-3D direct generation methods that heavily depend on post-hoc refinement via score distillation sampling, achieving superior results without such refinement.

Thu 23 Oct. 17:45 - 19:45 PDT

#193
ReconDreamer++: Harmonizing Generative and Reconstructive Models for Driving Scene Representation

Guosheng Zhao · Xiaofeng Wang · Chaojun Ni · Zheng Zhu · Wenkang Qin · Guan Huang · Xingang Wang

Combining reconstruction models with generative models has emerged as a promising paradigm for closed-loop simulation in autonomous driving. For example, ReconDreamer has demonstrated remarkable success in rendering large-scale maneuvers. However, a significant gap remains between the generated data and real-world sensor observations, particularly in terms of fidelity for structured elements, such as the ground surface. To address these challenges, we propose ReconDreamer++, an enhanced framework that significantly improves the overall rendering quality by mitigating the domain gap and refining the representation of the ground surface. Specifically, ReconDreamer++ introduces the Novel Trajectory Deformable Network (NTDNet), which leverages learnable spatial deformation mechanisms to bridge the domain gap between synthesized novel views and original sensor observations. Moreover, for structured elements such as the ground surface, we preserve geometric prior knowledge in 3D Gaussians, and the optimization process focuses on refining appearance attributes while preserving the underlying geometric structure. Experimental evaluations conducted on multiple datasets (Waymo, nuScenes, PandaSet, and EUVS) confirm the superior performance of ReconDreamer++. Specifically, on Waymo, ReconDreamer++ achieves performance comparable to Street Gaussians for the original trajectory while significantly outperforming ReconDreamer on novel trajectories. In particular, it achieves substantial improvements, including a 6.1\% increase in NTA-IoU, a 23.0\% improvement in FID, and a remarkable 4.5\% gain in the ground surface metric NTL-IoU, highlighting its effectiveness in accurately reconstructing structured elements such as the road surface.

Thu 23 Oct. 17:45 - 19:45 PDT

#194
S²M²: Scalable Stereo Matching Model for Reliable Depth Estimation

JUNHONG MIN · YOUNGPIL JEON · Jimin Kim · Minyong Choi

Accurate and scalable stereo matching remains a critical challenge, particularly for high-resolution images requiring both fine-grained disparity estimation and computational efficiency. While recent methods have made progress, achieving global and local consistency alongside computational efficiency remains difficult. Transformer-based models effectively capture long-range dependencies but suffer from high computational overhead, while cost volume-based iterative methods rely on local correlations, limiting global consistency and scalability to high resolutions and large disparities. To address these issues, we introduce S$^2$M$^2$, a Scalable Stereo Matching Model that achieves high accuracy, efficiency, and generalization without compromise. Our approach integrates a multi-resolution transformer framework, enabling effective information aggregation across different scales. Additionally, we propose a new loss function that enhances disparity estimation by concentrating probability on feasible matches. Beyond disparity prediction, S$^2$M$^2$ jointly estimates occlusion and confidence maps, leading to more robust and interpretable depth estimation. Unlike prior methods that rely on dataset-specific tuning, S$^2$M$^2$ is trained from scratch without dataset-specific adjustments, demonstrating strong generalization across diverse benchmarks. Extensive evaluations on Middlebury v3, ETH3D, and our high-fidelity synthetic dataset establish new state-of-the-art results.

Thu 23 Oct. 17:45 - 19:45 PDT

#195
3DGS-LM: Faster Gaussian-Splatting Optimization with Levenberg-Marquardt

Lukas Höllein · Aljaz Bozic · Michael Zollhöfer · Matthias Nießner

We present 3DGS-LM, a new method that accelerates the reconstruction of 3D Gaussian Splatting (3DGS) by replacing its ADAM optimizer with a tailored Levenberg-Marquardt (LM) optimizer. Existing methods reduce the optimization time by decreasing the number of Gaussians or by improving the implementation of the differentiable rasterizer. However, they still rely on the ADAM optimizer to fit the Gaussian parameters of a scene over thousands of iterations, which can take up to an hour. To this end, we replace the optimizer with LM, which runs in conjunction with the 3DGS differentiable rasterizer. For efficient GPU parallelization, we propose a caching data structure for intermediate gradients that allows us to efficiently calculate Jacobian-vector products in custom CUDA kernels. In every LM iteration, we calculate update directions from multiple image subsets using these kernels and combine them in a weighted mean. Overall, our method is 20% faster than the original 3DGS while obtaining the same reconstruction quality. Our optimization is also agnostic to other methods that accelerate 3DGS, thus enabling even faster speedups compared to vanilla 3DGS.
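
For readers unfamiliar with the optimizer being swapped in, the sketch below shows a generic damped Gauss-Newton (Levenberg-Marquardt) loop on a toy curve-fitting problem. The residual function, damping schedule, and problem are assumptions for illustration only; the paper's variant additionally caches intermediate gradients in custom CUDA kernels and averages update directions over image subsets.

    import numpy as np

    def levenberg_marquardt(residual_fn, jacobian_fn, x0, iters=50, lam=1e-3):
        """Generic LM: solve (J^T J + lam * I) dx = -J^T r and adapt the damping lam."""
        x = x0.astype(float)
        for _ in range(iters):
            r = residual_fn(x)
            J = jacobian_fn(x)
            g = J.T @ r
            H = J.T @ J
            dx = np.linalg.solve(H + lam * np.eye(len(x)), -g)
            if np.sum(residual_fn(x + dx) ** 2) < np.sum(r ** 2):
                x, lam = x + dx, lam * 0.5      # accept step, trust the quadratic model more
            else:
                lam *= 10.0                      # reject step, damp harder
        return x

    # toy problem: fit y = a * exp(b * t) to samples
    t = np.linspace(0, 1, 30)
    y = 2.0 * np.exp(1.5 * t)
    res = lambda p: p[0] * np.exp(p[1] * t) - y
    jac = lambda p: np.stack([np.exp(p[1] * t), p[0] * t * np.exp(p[1] * t)], axis=1)
    print(levenberg_marquardt(res, jac, np.array([1.0, 1.0])))   # approaches [2.0, 1.5]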

Thu 23 Oct. 17:45 - 19:45 PDT

#196
ACE-G: Improving Generalization of Scene Coordinate Regression Through Query Pre-Training

Leonard Bruns · Axel Barroso-Laguna · Tommaso Cavallari · Áron Monszpart · Sowmya Munukutla · Victor Prisacariu · Eric Brachmann

Scene coordinate regression (SCR) has established itself as a promising learning-based approach to visual relocalization. After mere minutes of scene-specific training, SCR models estimate camera poses of query images with high accuracy. Still, SCR methods fall short of the generalization capabilities of more classical feature-matching approaches. When imaging conditions of query images, such as lighting or viewpoint, are too different from the training views, SCR models fail. Failing to generalize is an inherent limitation of previous SCR frameworks, since their training objective is to encode the training views in the weights of the coordinate regressor itself. The regressor essentially overfits to the training views, by design. We propose to separate the coordinate regressor and the map representation into a generic transformer and a scene-specific map code. This separation allows us to pre-train the transformer on tens of thousands of scenes. More importantly, it allows us to train the transformer to generalize from fixed map codes to unseen query images during pre-training. We demonstrate on multiple challenging relocalization datasets that our method, ACE-G, leads to significantly increased robustness while keeping the computational footprint attractive.

Thu 23 Oct. 17:45 - 19:45 PDT

#197
3D Test-time Adaptation via Graph Spectral Driven Point Shift

Xin Wei · Qin Yang · Yijie Fang · Mingrui Zhu · Nannan Wang

Test-time adaptation (TTA) methods effectively address domain shifts by dynamically adapting pre-trained models to target domain data during online inference. While effective for 2D images, TTA struggles with 3D point clouds due to their irregular and unordered nature. Existing 3D TTA methods often involve complex high-dimensional optimization tasks, such as patch reconstruction or per-point transformation learning in the spatial domain, which require access to additional training data. In contrast, we propose Graph Spectral Domain Test-Time Adaptation (GSDTTA), a novel approach for 3D point cloud classification that shifts adaptation to the graph spectral domain, enabling more efficient adaptation by capturing global structural properties with fewer parameters. Point clouds in the target domain are represented as outlier-aware graphs and transformed into the graph spectral domain by the Graph Fourier Transform (GFT). For efficiency, we only optimize the lowest 10\% of frequency components, which capture the majority of the point cloud’s energy. An inverse GFT (IGFT) is then applied to reconstruct the adapted point cloud with the graph spectral-driven point shift. Additionally, an eigenmap-guided self-training strategy is introduced to iteratively optimize both the spectral adjustment and the model parameters. Experimental results and ablation studies on benchmark datasets demonstrate the effectiveness of GSDTTA, outperforming existing TTA methods for 3D point cloud classification.
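
To make the spectral pipeline concrete, the toy sketch below builds a kNN graph over a point cloud, computes the Graph Fourier Transform of the xyz coordinates, perturbs only the lowest 10% of frequency components, and applies the inverse GFT. The uniform scaling of the low-frequency coefficients is a placeholder; in GSDTTA this adjustment is optimized by the eigenmap-guided self-training, and the outlier-aware graph construction is not reproduced here.

    import numpy as np
    from scipy.spatial import cKDTree

    def graph_fourier_basis(points, k=8):
        """Eigenbasis of the unnormalized Laplacian of a symmetric kNN graph."""
        n = len(points)
        _, idx = cKDTree(points).query(points, k=k + 1)      # first neighbor is the point itself
        W = np.zeros((n, n))
        for i, nbrs in enumerate(idx):
            W[i, nbrs[1:]] = 1.0
        W = np.maximum(W, W.T)                                # symmetrize the adjacency
        L = np.diag(W.sum(1)) - W
        eigvals, eigvecs = np.linalg.eigh(L)                  # columns sorted by frequency
        return eigvecs

    points = np.random.randn(256, 3)                          # stand-in target-domain point cloud
    U = graph_fourier_basis(points)
    coeffs = U.T @ points                                     # GFT of the xyz coordinates
    k_low = int(0.1 * len(points))                            # lowest 10% of frequencies
    coeffs[:k_low] *= 1.05                                    # placeholder spectral adjustment
    shifted = U @ coeffs                                      # inverse GFT -> shifted point cloud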

Thu 23 Oct. 17:45 - 19:45 PDT

#198
VistaDream: Sampling multiview consistent images for single-view scene reconstruction

Haiping Wang · Yuan Liu · Ziwei Liu · Wenping Wang · Zhen Dong · Bisheng Yang

In this paper, we propose VistaDream, a novel framework to reconstruct a 3D scene from a single-view image. Recent diffusion models enable generating high-quality novel-view images from a single-view input image. Most existing methods only concentrate on building the consistency between the input image and the generated images while losing the consistency among the generated images themselves. VistaDream addresses this problem with a two-stage pipeline. In the first stage, VistaDream builds a global coarse 3D scaffold by slightly zooming out from the input view and inpainting the boundaries, together with an estimated depth map. Then, on this global scaffold, we use iterative diffusion-based RGB-D inpainting to generate novel-view images that fill the holes of the scaffold. In the second stage, we further enhance the consistency among the generated novel-view images with a novel training-free Multiview Consistency Sampling (MCS) that introduces multi-view consistency constraints into the reverse sampling process of diffusion models. Experimental results demonstrate that, without training or fine-tuning existing diffusion models, VistaDream achieves high-quality scene reconstruction and novel view synthesis from a single-view image and outperforms baseline methods by a large margin.

Thu 23 Oct. 17:45 - 19:45 PDT

#199
Towards Visual Localization Interoperability: Cross-Feature for Collaborative Visual Localization and Mapping

Alberto Jaenal · Paula Carbó Cubero · Jose Araujo · André Mateus

The growing presence of vision-based systems in the physical world comes with a major requirement: highly accurate estimation of the pose, a task typically addressed through methods based on local features. Virtually all available feature-based localization solutions are designed under the assumption that the same feature is used for mapping and localization. However, as the implementation provided by each vendor is based on heterogeneous feature extraction algorithms, collaboration between different devices is not straightforward, or even impossible. Although there are some alternatives, such as re-extracting the features or reconstructing the image from them, these are impractical or costly to implement in a real pipeline. To overcome this, and inspired by the seminal work Cross-Descriptor [12], we propose Cross-Feature, a method that applies a patch-based training strategy to a simple MLP which projects features into a common embedding space. As a consequence, our proposal allows us to establish suitable correspondences between features computed by heterogeneous algorithms, e.g., SIFT [23] and SuperPoint [9]. We experimentally demonstrate the validity of Cross-Feature by evaluating it on tasks such as Image Matching, Visual Localization, and a new Collaborative Visual Localization and Mapping scenario. We believe this is the first step towards full Visual Localization interoperability. Code and data will be made available.
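
A minimal PyTorch sketch of the general idea, assuming small projection MLPs and mutual-nearest-neighbor cosine matching; the layer sizes, the patch-based training strategy, and the loss are not shown and are not the authors' exact design.

    import torch
    import torch.nn as nn

    class CrossProjector(nn.Module):
        """Projects descriptors of one feature type into a shared embedding space."""
        def __init__(self, in_dim, emb_dim=128):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                     nn.Linear(256, emb_dim))
        def forward(self, desc):
            return nn.functional.normalize(self.net(desc), dim=-1)

    proj_sift, proj_sp = CrossProjector(128), CrossProjector(256)
    sift_desc = torch.randn(500, 128)            # descriptors from image A (SIFT, 128-d)
    sp_desc = torch.randn(480, 256)              # descriptors from image B (SuperPoint, 256-d)

    sim = proj_sift(sift_desc) @ proj_sp(sp_desc).T            # cosine similarity matrix
    nn_ab, nn_ba = sim.argmax(1), sim.argmax(0)
    mutual = nn_ba[nn_ab] == torch.arange(len(sift_desc))      # mutual nearest neighbors
    matches = torch.stack([torch.arange(len(sift_desc))[mutual], nn_ab[mutual]], dim=1)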

Thu 23 Oct. 17:45 - 19:45 PDT

#200
MiDSummer: Multi-Guidance Diffusion for Controllable Zero-Shot Immersive Gaussian Splatting Scene Generation

Anjun Hu · Richard Tomsett · Valentin Gourmet · Massimo Camplani · Jas Kandola · Hanting Xie

We present MiDSummer, a two-stage framework for generating immersive Gaussian Splatting scenes that leverages multiple diffusion guidance signals to enable structured layout control, enhanced physical realism, and improved visual quality. While 3D scene generation has seen significant recent advances, current approaches could benefit from: (1) achieving precise, reliable layout control while preserving open-world generalization and physical plausibility, (2) balancing high-level semantic reasoning with low-level, directly controllable geometric constraints, and (3) effectively utilizing layout knowledge for visual refinement. Our work addresses these challenges through a structured two-stage planning-assembly framework. For planning, we introduce a dual layout diffusion guidance approach to bridge semantic reasoning and geometric controllability. Our approach uniquely integrates LLMs' open-vocabulary reasoning with Graph Diffusion Models' (GDM) geometric precision by incorporating multi-level self-consistency scores over scene graph structures and layout bounding box parameters. This fusion enables fine-grained control over scene composition while ensuring physical plausibility and faithful prompt interpretation. For assembly, we propose a layout-guided optimization technique for scene refinement. We effectively incorporate layout priors obtained during the planning stage into a Stable Diffusion (SD)-based refinement process that jointly optimizes camera trajectories and scene splats. This layout-aware joint optimization, constrained by multi-view consistency, produces visually compelling immersive scenes that are structurally coherent and controllable.

Thu 23 Oct. 17:45 - 19:45 PDT

#201
AG2aussian: Anchor-Graph Structured Gaussian Splatting for Instance-Level 3D Scene Understanding and Editing

Zhaonan Wang · Manyi Li · Changhe Tu

3D Gaussian Splatting (3DGS) has witnessed exponential adoption across diverse applications, driving a critical need for semantic-aware 3D Gaussian representations to enable scene understanding and editing tasks. Existing approaches typically attach semantic features to a collection of free Gaussians and distill the features via differentiable rendering, leading to noisy segmentation and a messy selection of Gaussians. In this paper, we introduce AG$^2$aussian, a novel framework that leverages an anchor-graph structure to organize semantic features and regulate Gaussian primitives. Our anchor-graph structure not only promotes compact and instance-aware Gaussian distributions, but also facilitates graph-based propagation, achieving a clean and accurate instance-level Gaussian selection. Extensive validation across four applications, i.e. interactive click-based query, open-vocabulary text-driven query, object removal editing, and physics simulation, demonstrates the advantages of our approach and its benefits to various applications. The experiments and ablation studies further evaluate the effectiveness of the key designs of our approach.

Thu 23 Oct. 17:45 - 19:45 PDT

#202
Hierarchical 3D Scene Graphs Construction Outdoors

Jon Nyffeler · Federico Tombari · Daniel Barath

Understanding and structuring outdoor environments in 3D is critical for numerous applications, including robotics, urban planning, and autonomous navigation. In this work, we propose a pipeline to construct hierarchical 3D scene graphs from outdoor data, consisting of posed images and 3D reconstructions. Our approach systematically extracts and organizes objects and their subcomponents, enabling representations that span from entire buildings to their facades and individual windows. By leveraging geometric and semantic relationships, our method efficiently groups objects into meaningful hierarchies while ensuring robust spatial consistency. We integrate efficient feature extraction, hierarchical object merging, and relationship inference to generate structured scene graphs that capture both global and local dependencies. Our approach scales to large outdoor environments while maintaining efficiency, and we demonstrate its effectiveness on real-world datasets. We also demonstrate that these constructed outdoor scene graphs are beneficial for downstream applications, such as 3D scene alignment. The code will be made public.

Thu 23 Oct. 17:45 - 19:45 PDT

#203
Highlight
Spatio-Spectral Pattern Illumination for Direct and Indirect Separation from a Single Hyperspectral Image

Shin Ishihara · Imari Sato

Hyperspectral imaging has proven effective for appearance inspection because it can identify material compositions and reveal hidden features. Similarly, direct/indirect separation provides essential information about surface appearance and internal conditions, including layer structures and scattering behaviors. This paper presents a novel illumination system incorporating dispersive optics to unify both advantages for scene analyses. In general, achieving distinct direct/indirect separation requires multiple images with varying patterns. In a hyperspectral scenario, using a hyperspectral camera or tunable filters extends exposure and measurement times, hindering practical application. Our proposed system enables the illumination of a wavelength-dependent, spatially shifted pattern. With proper consideration of reflectance differences, we demonstrate that robust separation of direct and indirect components for each wavelength can be achieved using a single hyperspectral image taken under one illumination pattern. Furthermore, we demonstrate that analyzing the observed differences across wavelengths contributes to estimating depth.

Thu 23 Oct. 17:45 - 19:45 PDT

#204
SDFormer: Vision-based 3D Semantic Scene Completion via SAM-assisted Dual-channel Voxel Transformer

Yujie Xue · Huilong Pi · Jiapeng Zhang · Qin Yunchuan · Zhuo Tang · Kenli Li · Ruihui Li

Vision-based semantic scene completion (SSC) is able to predict complex scene information from limited 2D images, which has attracted widespread attention. Current SSC methods typically construct unified voxel features containing both geometry and semantics, which leads to different depth positions in occluded regions sharing the same 2D semantic information, resulting in ambiguous semantic segmentation. To address this problem, we propose SDFormer, a novel SAM-assisted Dual-channel Voxel Transformer framework for SSC. We uncouple the task based on its multi-objective nature and construct two parallel sub-networks: a semantic constructor (SC) and a geometric refiner (GR). The SC utilizes the Segment Anything Model (SAM) to construct dense semantic voxel features from reliable visible semantic information in the image. The GR accurately predicts depth positions and then further adjusts the semantic output of SAM. Additionally, we design a Semantic Calibration Affinity to enhance semantic-aware transformations in the SC. Within the GR, a Shape Segments Interaction and a learnable mask generation module emphasize the spatial locations of semantics to obtain fine-grained voxel information. Extensive qualitative and quantitative results on the SemanticKITTI and SSCBench-KITTI-360 datasets show that our method outperforms state-of-the-art approaches.

Thu 23 Oct. 17:45 - 19:45 PDT

#205
Adversarial Exploitation of Data Diversity Improves Visual Localization

Sihang Li · Siqi Tan · Bowen Chang · Jing Zhang · Chen Feng · Yiming Li

Visual localization, which estimates a camera's pose within a known scene, is a fundamental capability for autonomous systems. While absolute pose regression (APR) methods have shown promise for efficient inference, they often struggle with generalization. Recent approaches attempt to address this through data augmentation with varied viewpoints, yet they overlook a critical factor: appearance diversity. In this work, we identify appearance variation as the key to robust localization. Specifically, we first lift real 2D images into 3D Gaussian Splats with varying appearance and deblurring capabilities, enabling the synthesis of diverse training data that varies not just in poses but also in environmental conditions such as lighting and weather. To fully unleash the potential of the appearance-diverse data, we build a two-branch joint training pipeline with an adversarial discriminator to bridge the syn-to-real gap. Extensive experiments demonstrate that our approach significantly outperforms state-of-the-art methods, reducing translation and rotation errors by 50% and 22% on indoor datasets, and 37% and 42% on outdoor datasets. Most notably, our method shows remarkable robustness in dynamic driving scenarios under varying weather conditions and in day-to-night scenarios, where previous APR methods fail.

Thu 23 Oct. 17:45 - 19:45 PDT

#206
SU-RGS: Relightable 3D Gaussian Splatting from Sparse Views under Unconstrained Illuminations

Qi Zhang · Chi Huang · Qian Zhang · Nan Li · Wei Feng

The latest advancements in scene relighting have been predominantly driven by inverse rendering with 3D Gaussian Splatting (3DGS). However, existing methods remain overly reliant on densely sampled images under static illumination conditions, which is prohibitively expensive and even impractical in real-world scenarios. In this paper, we propose SU-RGS, a relightable 3D Gaussian Splatting framework learned from Sparse views under Unconstrained illuminations, which addresses this challenge by jointly optimizing 3DGS representations, surface materials, and environment illuminations (i.e., unknown and varying lighting conditions during training) using only sparse input views. Firstly, SU-RGS presents a varying-appearance rendering strategy, enabling each 3D Gaussian to exhibit inconsistent colors under various lightings. Next, SU-RGS establishes multi-view semantic consistency by constructing hierarchical semantic pseudo-labels across views, providing extra supervision and facilitating sparse inverse rendering under unconstrained illuminations. Additionally, we introduce an adaptive transient object perception component that integrates scene geometry and semantics in a fine-grained manner to quantify and eliminate the uncertainty of the foreground. Extensive experiments on both synthetic and real-world challenging datasets demonstrate the effectiveness of SU-RGS, achieving state-of-the-art performance for scene inverse rendering by learning 3DGS from only sparse views under unconstrained illuminations.

Thu 23 Oct. 17:45 - 19:45 PDT

#207
GS-LIVM: Real-Time Photo-Realistic LiDAR-Inertial-Visual Mapping with Gaussian Splatting

Yusen XIE · Zhenmin Huang · Jin Wu · Jun Ma

In this paper, we introduce GS-LIVM, a real-time photo-realistic LiDAR-Inertial-Visual mapping framework with Gaussian Splatting tailored for outdoor scenes. Compared to existing methods based on Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS), our approach enables real-time photo-realistic mapping while ensuring high-quality image rendering in large-scale unbounded outdoor environments. In this work, Gaussian Process Regression (GPR) is employed to mitigate the issues resulting from sparse and unevenly distributed LiDAR observations. The voxel-based 3D Gaussians map representation facilitates real-time dense mapping in large outdoor environments with acceleration governed by custom CUDA kernels. Moreover, the overall framework is designed in a covariance-centered manner, where the estimated covariance is used to initialize the scale and rotation of 3D Gaussians, as well as update the parameters of the GPR. We evaluate our algorithm on several outdoor datasets, and the results demonstrate that our method achieves state-of-the-art performance in terms of mapping efficiency and rendering quality. The source code is available on GitHub.
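
As a rough illustration of how Gaussian Process Regression can densify sparse, unevenly distributed LiDAR returns and expose a covariance usable for initializing 3D Gaussians, consider the following scikit-learn sketch; the kernel choice, the 2.5D height-map formulation, and the query grid are simplifying assumptions rather than the paper's CUDA implementation.

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF, WhiteKernel

    # sparse, unevenly distributed LiDAR hits in one local neighbourhood (stand-in data)
    xy = np.random.uniform(-1, 1, size=(60, 2))
    z = 0.2 * np.sin(3 * xy[:, 0]) + 0.1 * xy[:, 1] + 0.01 * np.random.randn(60)

    gpr = GaussianProcessRegressor(kernel=RBF(length_scale=0.5) + WhiteKernel(1e-4),
                                   normalize_y=True).fit(xy, z)

    # query a dense grid: the mean densifies the geometry, the std reflects uncertainty
    gx, gy = np.meshgrid(np.linspace(-1, 1, 32), np.linspace(-1, 1, 32))
    query = np.stack([gx.ravel(), gy.ravel()], axis=1)
    mean, std = gpr.predict(query, return_std=True)
    # std could, for instance, seed the scale of a 3D Gaussian placed at each grid point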

Thu 23 Oct. 17:45 - 19:45 PDT

#208
GeoFormer: Geometry Point Encoder for 3D Object Detection with Graph-based Transformer

Xin Jin · Haisheng Su · Cong Ma · Kai Liu · Wei Wu · Fei HUI · Junchi Yan

Lidar-based 3D detection is one of the most popular research fields in autonomous driving. 3D detectors typically detect specific targets in a scene according to the pattern formed by the spatial distribution of point clouds. However, existing voxel-based methods usually adopt MLP and global pooling (e.g., PointNet, CenterPoint) as the voxel feature encoder, which makes it less effective to extract detailed spatial structure information from raw points, leading to information loss and inferior performance. In this paper, we propose a novel graph-based transformer that encodes voxel features by condensing the full and detailed point geometry, termed GeoFormer. We first represent the points within a voxel as a graph based on relative distances to capture its spatial geometry. Then, we introduce a geometry-guided transformer architecture to encode voxel features, where adjacent geometric clues are used to re-weight point feature similarities, enabling more effective extraction of geometric relationships between point pairs at varying distances. We highlight that GeoFormer is a plug-and-play module which can be seamlessly integrated to enhance the performance of existing voxel-based detectors. Extensive experiments conducted on three popular outdoor datasets demonstrate that our GeoFormer achieves state-of-the-art performance on both effectiveness and robustness comparisons.
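
The core idea of re-weighting point feature similarities with geometric clues can be illustrated with a toy, single-head attention over the points inside one voxel, where the attention logits are penalized by pairwise distance. The shared query/key/value projection and the scalar distance bias below are simplifications, not the GeoFormer architecture.

    import torch

    def geometry_biased_attention(points, feats, alpha=1.0):
        """Single-head attention over points in a voxel, biased by pairwise distance."""
        d = feats.shape[-1]
        q = k = v = feats                                        # shared projection for brevity
        logits = (q @ k.transpose(-1, -2)) / d ** 0.5            # feature similarity
        dist = torch.cdist(points, points)                       # relative geometry [N, N]
        weights = torch.softmax(logits - alpha * dist, dim=-1)   # nearer points weigh more
        return weights @ v

    pts = torch.randn(16, 3)        # raw points falling into one voxel
    fts = torch.randn(16, 32)       # their per-point features
    voxel_feat = geometry_biased_attention(pts, fts).max(dim=0).values   # pooled voxel feature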

Thu 23 Oct. 17:45 - 19:45 PDT

#209
DrivingGPT: Unifying Driving World Modeling and Planning with Multi-modal Autoregressive Transformers

Yuntao Chen · Yuqi Wang · Zhaoxiang Zhang

World model-based searching and planning are widely recognized as a promising path toward human-level physical intelligence. However, current driving world models primarily rely on video diffusion models, which specialize in visual generation but lack the flexibility to incorporate other modalities like action. In contrast, autoregressive transformers have demonstrated exceptional capability in modeling multimodal data. Our work aims to unify both driving model simulation and trajectory planning into a single sequence modeling problem. We introduce a multimodal driving language based on interleaved image and action tokens, and develop DrivingGPT to learn joint world modeling and planning through standard next-token prediction. Our DrivingGPT demonstrates strong performance in both action-conditioned video generation and end-to-end planning, outperforming strong baselines on large-scale nuPlan and NAVSIM benchmarks.

Thu 23 Oct. 17:45 - 19:45 PDT

#210
AlignDiff: Learning Physically-Grounded Camera Alignment via Diffusion

Liuyue Xie · Jiancong Guo · Ozan Cakmakci · Andre Araujo · Laszlo A. A. Jeni · zhiheng jia

Accurate camera calibration is a fundamental task for 3D perception, especially when dealing with real-world, in-the-wild environments where complex optical distortions are common. Existing methods often rely on pre-rectified images or calibration patterns, which limits their applicability and flexibility. In this work, we propose AlignDiff, a diffusion model conditioned on geometric priors that addresses these challenges by jointly modeling camera intrinsic and extrinsic parameters using a generic ray camera model, enabling the simultaneous estimation of camera distortions and scene geometry. Unlike previous approaches, AlignDiff shifts the focus from semantic to geometric features, enabling more accurate modeling of local distortions. To enhance distortion prediction, we incorporate edge-aware attention, focusing the model on geometric features around image edges rather than on semantic content. Furthermore, to enhance generalizability to real-world captures, we incorporate a large database of ray-traced lenses containing over three thousand samples, which characterizes the distortion inherent in a diverse variety of lens forms. Our experiments demonstrate that the proposed method significantly reduces the angular error of estimated ray bundles by $\sim 8.2^\circ$ and improves overall calibration accuracy, outperforming existing approaches on challenging, real-world datasets.

Thu 23 Oct. 17:45 - 19:45 PDT

#211
Towards a 3D Transfer-based Black-box Attack via Critical Feature Guidance

Shuchao Pang · Zhenghan Chen · Shen Zhang · Liming Lu · Siyuan Liang · Anan Du · Yongbin Zhou

Deep neural networks for 3D point clouds have been demonstrated to be vulnerable to adversarial examples. Previous 3D adversarial attack methods often exploit certain information about the target models, such as model parameters or outputs, to generate adversarial point clouds. However, in realistic scenarios, it is challenging to obtain any information about the target models under conditions of absolute security. Therefore, we focus on transfer-based attacks, where generating adversarial point clouds does not require any information about the target models. Based on our observation that the critical features used for point cloud classification are consistent across different DNN architectures, we propose CFG, a novel transfer-based black-box attack method that improves the transferability of adversarial point clouds via the proposed Critical Feature Guidance. Specifically, our method regularizes the search of adversarial point clouds by computing the importance of the extracted features, prioritizing the corruption of critical features that are likely to be adopted by diverse architectures. Further, we explicitly constrain the maximum deviation extent of the generated adversarial point clouds in the loss function to ensure their imperceptibility. Extensive experiments conducted on the ModelNet40 and ScanObjectNN benchmark datasets demonstrate that the proposed CFG outperforms the state-of-the-art attack methods by a large margin.
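
A rough sketch of the general recipe, on an untrained placeholder surrogate network: channel importance is estimated by gradient-times-activation, and the perturbation is optimized to suppress the most critical channels while staying small. The surrogate model, importance measure, channel count, and loss weighting are illustrative assumptions; the paper's exact regularization and deviation constraint are not reproduced.

    import torch
    import torch.nn as nn

    # placeholder surrogate classifier; a real attack would use a trained point-cloud DNN
    backbone = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 128), nn.ReLU())
    head = nn.Linear(128, 40)

    clean = torch.randn(1024, 3)                         # benign point cloud
    delta = torch.zeros_like(clean, requires_grad=True)  # adversarial perturbation
    opt = torch.optim.Adam([delta], lr=1e-2)

    for _ in range(50):
        feats = backbone(clean + delta).max(dim=0).values          # global feature
        logit = head(feats)[0]                                      # score of the true class (index 0)
        grads = torch.autograd.grad(logit, feats, create_graph=True)[0]
        importance = (feats * grads).abs()                          # grad x activation importance
        critical = importance.topk(k=32).indices                    # channels likely shared across DNNs
        # corrupt the critical channels while keeping the perturbation small
        loss = feats[critical].pow(2).sum() + 10.0 * delta.pow(2).sum()
        opt.zero_grad()
        loss.backward()
        opt.step()

    adversarial = (clean + delta).detach()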

Thu 23 Oct. 17:45 - 19:45 PDT

#212
Tile-wise vs. Image-wise: Random-Tile Loss and Training Paradigm for Gaussian Splatting

Xiaoyu Zhang · Weihong Pan · Xiaojun Xiang · Hongjia Zhai · Liyang Zhou · Hanqing Jiang · Guofeng Zhang

3D Gaussian Splatting (3DGS) has drawn significant attention for its advantages in rendering speed and quality. Most existing methods still rely on the image-wise loss and training paradigm because of its intuitive fit with the splatting algorithm. However, the image-wise loss lacks multi-view constraints, which are generally essential for optimizing 3D appearance and geometry. To address this, we propose RT-Loss along with a tile-based training paradigm, which uses randomly sampled tiles to integrate multi-view appearance and structural constraints into 3DGS. Additionally, we introduce a tile-based adaptive densification control strategy tailored for our training paradigm. Extensive experiments show that our approach consistently improves performance metrics while maintaining efficiency across various benchmark datasets.
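
A minimal sketch of a random-tile photometric loss, assuming tiles are drawn uniformly across a batch of views and compared with L1 only; the tile size, tile count, and the absence of the structural term and densification logic are simplifications relative to RT-Loss.

    import torch

    def random_tile_l1(rendered, target, tile=64, n_tiles=8, generator=None):
        """L1 loss averaged over randomly placed tiles, sampled across a batch of views."""
        b, _, h, w = rendered.shape
        loss = 0.0
        for _ in range(n_tiles):
            v = torch.randint(0, b, (1,), generator=generator).item()     # pick a view
            y = torch.randint(0, h - tile + 1, (1,), generator=generator).item()
            x = torch.randint(0, w - tile + 1, (1,), generator=generator).item()
            loss = loss + (rendered[v, :, y:y + tile, x:x + tile]
                           - target[v, :, y:y + tile, x:x + tile]).abs().mean()
        return loss / n_tiles

    # rendered and ground-truth images from several training views in one step
    rendered = torch.rand(4, 3, 256, 256, requires_grad=True)
    target = torch.rand(4, 3, 256, 256)
    random_tile_l1(rendered, target).backward()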

Thu 23 Oct. 17:45 - 19:45 PDT

#213
DriveArena: A Closed-loop Generative Simulation Platform for Autonomous Driving

Xuemeng Yang · Licheng Wen · Tiantian Wei · Yukai Ma · Jianbiao Mei · Xin Li · Wenjie Lei · Daocheng Fu · Pinlong Cai · Min Dou · Liang He · Yong Liu · Botian Shi · Yu Qiao

This paper introduces DriveArena, the first high-fidelity closed-loop simulation system designed for driving agents navigating real-world scenarios. DriveArena comprises two core components: Traffic Manager, a traffic simulator capable of generating realistic traffic flow on any global street map, and World Dreamer, a high-fidelity conditional generative model with infinite auto-regression. DriveArena supports closed-loop simulation using road networks from cities worldwide, enabling the generation of diverse traffic scenarios with varying styles. This powerful synergy empowers any driving agent capable of processing real-world images to navigate in DriveArena's simulated environment. Furthermore, DriveArena features a flexible, modular architecture, allowing for multiple implementations of its core components and driving agents. Serving as a highly realistic arena for these players, our work provides a valuable platform for developing and evaluating driving agents across diverse and challenging scenarios. DriveArena takes a significant leap forward in leveraging generative models for driving simulation platforms, opening new avenues for closed-loop evaluation of autonomous driving systems. Codes of DriveArena are attached to the supplementary material. Project Page: https://blindpaper.github.io/DriveArena/

Thu 23 Oct. 17:45 - 19:45 PDT

#214
Highlight
Explaining Human Preferences via Metrics for Structured 3D Reconstruction

Jack Langerman · Denis Rozumny · Yuzhong Huang · Dmytro Mishkin

"What cannot be measured cannot be improved", while likely never uttered by Lord Kelvin, effectively summarizes the purpose of this work. This paper presents a detailed evaluation of automated metrics for evaluating structured 3D reconstructions. Pitfalls of each metric are discussed, and a thorough analysis through the lens of expert 3D modelers' preferences is presented. A set of systematic "unit tests" is proposed to empirically verify desirable properties, and context-aware recommendations as to which metric to use depending on application are provided. Finally, a learned metric distilled from human expert judgments is proposed and analyzed.

Thu 23 Oct. 17:45 - 19:45 PDT

#215
Highlight
CoopTrack: Exploring End-to-End Learning for Efficient Cooperative Sequential Perception

Jiaru Zhong · Jiahao Wang · Jiahui Xu · Xiaofan Li · Zaiqing Nie · Haibao Yu

Cooperative perception aims to address the inherent limitations of single autonomous driving systems through information exchange among multiple agents. Previous research has primarily focused on single-frame perception tasks. However, the more challenging cooperative sequential perception tasks, such as cooperative 3D multi-object tracking, have not been thoroughly investigated. Therefore, we propose CoopTrack, a fully instance-level end-to-end framework for cooperative tracking, featuring learnable instance association, which fundamentally differs from existing approaches. CoopTrack transmits sparse instance-level features that significantly enhance perception capabilities while maintaining low transmission costs. Furthermore, the framework comprises three key components: Multi-Dimensional Feature Extraction (MDFE), Cross-Agent Alignment (CAA), and Graph-Based Association (GBA), which collectively enable comprehensive instance representation with semantic and motion features, and adaptive cross-agent association and fusion based on graph learning. Experiments on the V2X-Seq dataset demonstrate that, benefiting from its sophisticated design, CoopTrack achieves state-of-the-art performance, with 39.0\% mAP and 32.8\% AMOTA. Codes and visualization results are provided in the supplementary materials.

Thu 23 Oct. 17:45 - 19:45 PDT

#216
VLR-Driver: Large Vision-Language-Reasoning Models for Embodied Autonomous Driving

Fanjie Kong · Yitong Li · Weihuang Chen · Chen Min · Yizhe Li · Zhiqiang Gao · Haoyang Li · Zhongyu Guo · Hongbin Sun

The rise of embodied intelligence and multi-modal large language models has led to exciting advancements in the field of autonomous driving, establishing it as a prominent research focus in both academia and industry. However, when confronted with intricate and ambiguous traffic scenarios, the lack of logical reasoning and cognitive decision-making capabilities remains the primary challenge impeding the realization of embodied autonomous driving. Although Vision Language Models (VLMs) have enhanced the deep semantic understanding of autonomous driving systems, they exhibit notable limitations in decision explainability when handling rare and long-tail traffic scenarios. In this paper, we propose VLR-Driver, a novel multi-modal Vision-Language-Reasoning (VLR) framework based on Chain of Thought (CoT) for embodied autonomous driving. The framework employs a spatiotemporal CoT reasoning approach to recursively analyze potential safety risks and driving intentions of other agents, thereby delivering an efficient and transparent decision-making process. Furthermore, we construct a multi-modal reasoning-decision dataset to support the advancement of hierarchical reasoning of VLMs in autonomous driving. Closed-loop experiments conducted in CARLA demonstrate that the VLR-Driver significantly outperforms state-of-the-art end-to-end methods. Notably, key metrics such as driving score improved by 17.5\%, while the success rate improved by 22.2\%, offering a more transparent, reliable, and secure solution for autonomous driving systems. The code, dataset, and demonstration video will be open-sourced.

Thu 23 Oct. 17:45 - 19:45 PDT

#217
RoCo-Sim: Enhancing Roadside Collaborative Perception through Foreground Simulation

Yuwen Du · Anning Hu · Zichen Chao · Yifan Lu · Junhao Ge · Genjia Liu · Wei-Tao Wu · Lanjun Wang · Siheng Chen

Roadside Collaborative Perception refers to a system where multiple roadside units collaborate to pool their perceptual data, assisting vehicles in enhancing their environmental awareness. Existing roadside perception methods concentrate on model design but overlook data issues like calibration errors, sparse information, and multi-view consistency, leading to poor performance on recently published datasets. To significantly enhance roadside collaborative perception and address critical data issues, we present RoCo-Sim, the first simulation framework for roadside collaborative perception. RoCo-Sim is capable of generating diverse, multi-view consistent simulated roadside data through dynamic foreground editing and full-scene style transfer of a single image. RoCo-Sim consists of four components: (1) Camera Extrinsic Optimization ensures accurate 3D to 2D projection for roadside cameras; (2) a novel Multi-View Occlusion-Aware Sampler (MOAS) determines the placement of diverse digital assets within 3D space; (3) DepthSAM innovatively models foreground-background relationships from single-frame fixed-view images, ensuring multi-view consistency of the foreground; and (4) a Scalable Post-Processing Toolkit generates more realistic and enriched scenes through style transfer and other enhancements. RoCo-Sim significantly improves roadside 3D object detection, outperforming SOTA methods by \textbf{83.74\%} on Rcooper-Intersection and \textbf{83.12\%} on TUMTraf-V2X for AP70. RoCo-Sim fills a critical gap in roadside perception simulation. Code and pre-trained models will be released soon.

Thu 23 Oct. 17:45 - 19:45 PDT

#218
Highlight
Inverse 3D Microscopy Rendering for Cell Shape Inference with Active Mesh

Sacha Ichbiah · Anshuman Sinha · Fabrice Delbary · Hervé Turlier

Traditional methods for biological shape inference, such as deep learning (DL) and active contour models, face limitations in 3D. DL requires large labeled datasets, which are difficult to obtain, while active contour models rely on fine-tuned hyperparameters for intensity attraction and regularization. We introduce deltaMic, a novel 3D differentiable renderer for fluorescence microscopy. By leveraging differentiable Fourier-space convolution, deltaMic accurately models the image formation process, integrating a parameterized microscope point spread function and a mesh-based object representation. Unlike DL-based segmentation, it directly optimizes shape and microscopy parameters to fit real microscopy data, removing the need for large datasets or heuristic priors. To enhance efficiency, we develop a GPU-accelerated Fourier transform for triangle meshes, significantly improving speed. We demonstrate deltaMic’s ability to reconstruct cellular shapes from synthetic and real microscopy images, providing a robust tool for 3D segmentation and biophysical modeling. This work bridges physics-based rendering with modern optimization techniques, offering a new paradigm for microscopy image analysis and inverse biophysical modeling.

Thu 23 Oct. 17:45 - 19:45 PDT

#219
E-SAM: Training-Free Segment Every Entity Model

WEIMING ZHANG · Dingwen Xiao · Lei Chen · Lin Wang

Entity Segmentation (ES) aims at identifying and segmenting distinct entities within an image without the need for predefined class labels. This characteristic makes ES well-suited to open-world applications with adaptation to diverse and dynamically changing environments, where new and previously unseen entities may appear frequently. Existing ES methods either require large annotated datasets or high training costs, limiting their scalability and adaptability. Recently, the Segment Anything Model (SAM), especially in its Automatic Mask Generation (AMG) mode, has shown potential for holistic image segmentation. However, it struggles with over-segmentation and under-segmentation, making it less effective for ES. In this paper, we introduce E-SAM, a novel training-free framework that exhibits exceptional ES capability. Specifically, we first propose Multi-level Mask Generation (MMG) that hierarchically processes SAM's AMG outputs to generate reliable object-level masks while preserving fine details at other levels. Entity-level Mask Refinement (EMR) then refines these object-level masks into accurate entity-level masks. That is, it separates overlapping masks to address the redundancy issues inherent in SAM's outputs and merges similar masks by evaluating entity-level consistency. Lastly, Under-Segmentation Refinement (USR) addresses under-segmentation by generating additional high-confidence masks fused with EMR outputs to produce the final ES map. These three modules are seamlessly optimized to achieve the best ES without additional training overhead. Extensive experiments demonstrate that E-SAM achieves state-of-the-art performance compared to prior ES methods, with a significant improvement of +30.1 on benchmark metrics.

Thu 23 Oct. 17:45 - 19:45 PDT

#220
Online Reasoning Video Segmentation with Just-in-Time Digital Twins

Yiqing Shen · Bohan Liu · Chenjia Li · Lalithkumar Seenivasan · Mathias Unberath

Reasoning segmentation (RS) aims to identify and segment objects of interest based on implicit text queries. As such, RS is a catalyst for embodied AI agents, enabling them to interpret high-level commands without requiring explicit step-by-step guidance. However, current RS approaches rely heavily on the visual perception capabilities of multimodal large language models (LLMs), leading to several major limitations. First, they struggle with queries that require multiple steps of reasoning or those that involve complex spatial/temporal relationships. Second, they necessitate LLM fine-tuning, which may require frequent updates to maintain compatibility with contemporary LLMs and may increase risks of catastrophic forgetting during fine-tuning. Finally, being primarily designed for static images or offline video processing, they scale poorly to online video data. To address these limitations, we propose an agent framework that disentangles perception and reasoning for online video RS without LLM fine-tuning. Our innovation is the introduction of a just-in-time digital twin concept, where -- given an implicit query -- an LLM plans the construction of a low-level scene representation from high-level video using specialist vision models. We refer to this approach to creating a digital twin as "just-in-time" because the LLM planner will anticipate the need for specific information and only request this limited subset instead of always evaluating every specialist model. The LLM then performs reasoning on this digital twin representation to identify target objects. To evaluate our approach, we introduce a new comprehensive video reasoning segmentation benchmark comprising 200 videos with 895 implicit text queries. The benchmark spans three reasoning categories (semantic, spatial, and temporal) with three different reasoning chain complexities. Experimental results demonstrate that our method performs best across all reasoning categories, suggesting that our just-in-time digital twin can bridge the gap between high-level reasoning and low-level perception in embodied AI. The dataset is available at https://anonymous.4open.science/r/benchmark-271B/.

Thu 23 Oct. 17:45 - 19:45 PDT

#221
Disentangling Instance and Scene Contexts for 3D Semantic Scene Completion

Enyu Liu · En Yu · Sijia Chen · Wenbing Tao

3D Semantic Scene Completion (SSC) has gained increasing attention due to its pivotal role in 3D perception. Recent advancements have primarily focused on refining voxel-level features to construct 3D scenes. However, treating voxels as the basic interaction units inherently limits the utilization of class-level information, which is proven critical for enhancing the granularity of completion results. To address this, we propose Disentangling Instance and Scene Contexts (DISC), a novel dual-stream paradigm that enhances learning for both instance and scene categories through separated optimization. Specifically, we replace voxel queries with discriminative class queries, which incorporate class-specific geometric and semantic priors. Additionally, we exploit the intrinsic properties of classes to design specialized decoding modules, facilitating targeted interactions and efficient class-level information flow. Experimental results demonstrate that DISC achieves state-of-the-art (SOTA) performance on both SemanticKITTI and SSCBench-KITTI-360 benchmarks, with mIoU scores of 17.35 and 20.55, respectively. Remarkably, DISC even outperforms multi-frame SOTA methods using only single-frame input and significantly improves instance category performance, surpassing both single-frame and multi-frame SOTA instance mIoU by 17.9% and 11.9%, respectively, on the SemanticKITTI hidden test.

Thu 23 Oct. 17:45 - 19:45 PDT

#222
GaussRender: Learning 3D Occupancy with Gaussian Rendering

Loick Chambon · Eloi Zablocki · Alexandre Boulch · Mickael Chen · Matthieu Cord

Understanding the 3D geometry and semantics of driving scenes is critical for developing safe autonomous vehicles. Recent advances in 3D occupancy prediction have improved scene representation but often suffer from spatial inconsistencies, leading to floating artifacts and poor surface localization. Existing voxel-wise losses (e.g., cross-entropy) fail to enforce geometric coherence. In this paper, we propose GaussRender, a module that improves 3D occupancy learning by enforcing projective consistency. Our key idea is to project both predicted and ground-truth 3D occupancy into 2D camera views, where we apply supervision. Our method penalizes 3D configurations that produce inconsistent 2D projections, thereby enforcing a more coherent and geometrically plausible 3D structure. To achieve this efficiently, we leverage differentiable rendering with Gaussian splatting. GaussRender seamlessly integrates with existing architectures while maintaining efficiency and requiring no inference-time modifications. Extensive evaluations on multiple benchmarks (SurroundOcc-nuScenes, Occ3D nuScenes, SSCBench-KITTI360) demonstrate that GaussRender significantly improves geometric fidelity across various 3D occupancy models (TPVFormer, SurroundOcc, Symphonies), achieving state-of-the-art results, particularly on surface-sensitive metrics. The code and models will be open-sourced.
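
As a toy illustration of the principle of supervising 3D occupancy through 2D projections, the sketch below projects predicted and ground-truth occupancy along the height axis with a simple orthographic reduction and compares the projections with binary cross-entropy; the paper instead renders into the camera views with differentiable Gaussian splatting and supervises semantics as well.

    import torch
    import torch.nn.functional as F

    def topdown_projection_loss(pred_occ_prob, gt_occ):
        """Project occupancy along the height axis and supervise the 2D projections.

        pred_occ_prob: [X, Y, Z] predicted occupancy probabilities
        gt_occ:        [X, Y, Z] binary ground-truth occupancy
        """
        # probability that at least one voxel in each vertical column is occupied
        pred_2d = 1.0 - torch.prod(1.0 - pred_occ_prob, dim=-1)
        gt_2d = gt_occ.amax(dim=-1).float()
        return F.binary_cross_entropy(pred_2d.clamp(1e-6, 1 - 1e-6), gt_2d)

    pred = torch.rand(128, 128, 16, requires_grad=True)
    gt = (torch.rand(128, 128, 16) > 0.95).float()
    topdown_projection_loss(pred, gt).backward()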

Thu 23 Oct. 17:45 - 19:45 PDT

#223
SA-Occ: Satellite-Assisted 3D Occupancy Prediction in Real World

Chen Chen · Zhirui Wang · Taowei Sheng · Yi Jiang · Yundu Li · Peirui Cheng · Luning Zhang · Kaiqiang Chen · Yanfeng Hu · Xue Yang · Xian Sun

Existing vision-based 3D occupancy prediction methods are inherently limited in accuracy due to their exclusive reliance on street-view imagery, neglecting the potential benefits of incorporating satellite views. We propose SA-Occ, the first Satellite-Assisted 3D occupancy prediction model, which leverages GPS & IMU to integrate historical yet readily available satellite imagery into real-time applications, effectively mitigating limitations of ego-vehicle perception, such as occlusions and degraded performance in distant regions. To address the core challenges of cross-view perception, we propose: 1) Dynamic-Decoupling Fusion, which resolves inconsistencies in dynamic regions caused by the temporal asynchrony between satellite and street views; 2) 3D-Proj Guidance, a module that enhances 3D feature extraction from inherently 2D satellite imagery; and 3) Uniform Sampling Alignment, which aligns the sampling density between street and satellite views. Evaluated on Occ3D-nuScenes, SA-Occ achieves state-of-the-art performance, especially among single-frame methods, with a 39.05% mIoU (a 6.97% improvement), while incurring only 6.93 ms of additional latency per frame. Our code and newly curated dataset will be publicly available.

Thu 23 Oct. 17:45 - 19:45 PDT

#224
UniVerse: Unleashing the Scene Prior of Video Diffusion Models for Robust Radiance Field Reconstruction

Jin Cao · Hongrui Wu · Ziyong Feng · Hujun Bao · Xiaowei Zhou · Sida Peng

This paper tackles the challenge of robust reconstruction, i.e., the task of reconstructing a 3D scene from a set of inconsistent multi-view images. Some recent works have attempted to simultaneously remove image inconsistencies and perform reconstruction by integrating image degradation modeling into neural 3D scene representations. However, these methods rely heavily on dense observations for robustly optimizing model parameters. To address this issue, we propose to decouple robust reconstruction into two subtasks: restoration and reconstruction, which naturally simplifies the optimization process. To this end, we introduce UniVerse, a unified framework for robust reconstruction based on a video diffusion model. Specifically, UniVerse first converts inconsistent images into initial videos, then uses a specially designed video diffusion model to restore them into consistent images, and finally reconstructs the 3D scenes from these restored images. Compared with case-by-case per-view degradation modeling, the diffusion model learns a general scene prior from large-scale data, making it applicable to diverse image inconsistencies. Extensive experiments on both synthetic and real-world datasets demonstrate the strong generalization capability and superior performance of our method in robust reconstruction. Moreover, UniVerse can control the style of the reconstructed 3D scene. The code will be released for reproducibility.

Thu 23 Oct. 17:45 - 19:45 PDT

#225
ExploreGS: Explorable 3D Scene Reconstruction with Virtual Camera Samplings and Diffusion Priors

Minsu Kim · Subin Jeon · In Cho · Mijin Yoo · Seon Joo Kim

Recent advances in novel view synthesis (NVS) have enabled real-time rendering with 3D Gaussian Splatting (3DGS). However, existing methods struggle with artifacts and missing regions when rendering unseen viewpoints, limiting seamless scene exploration. To address this, we propose a 3DGS-based pipeline that generates additional training views to enhance reconstruction. We introduce an information-gain-driven virtual camera placement strategy to maximize scene coverage, followed by video diffusion priors to refine rendered results. Fine-tuning 3D Gaussians with these enhanced views significantly improves reconstruction quality. To evaluate our method, we present Wild-Explore, a benchmark designed for challenging scene exploration. Experiments demonstrate that our approach outperforms existing 3DGS-based methods, enabling high-quality, artifact-free rendering from arbitrary viewpoints.

Thu 23 Oct. 17:45 - 19:45 PDT

#226
LaneDiffusion: Improving Centerline Graph Learning via Prior Injected BEV Feature Generation

Zijie Wang · Weiming Zhang · Wei Zhang · Xiao Tan · hongxing liu · Yaowei Wang · Guanbin Li

Centerline graphs, crucial for path planning in autonomous driving, are traditionally learned using deterministic methods. However, these methods often lack spatial reasoning and struggle with occluded or invisible centerlines. Generative approaches, despite their potential, remain underexplored in this domain. We introduce LaneDiffusion, a novel generative paradigm for centerline graph learning. LaneDiffusion innovatively employs diffusion models to generate lane centerline priors at the Bird's Eye View (BEV) feature level, instead of directly predicting vectorized centerlines. Our method integrates a Lane Prior Injection Module (LPIM) and a Lane Prior Diffusion Module (LPDM) to effectively construct diffusion targets and manage the diffusion process. Furthermore, vectorized centerlines and topologies are then decoded from these prior-injected BEV features. Extensive evaluations on the nuScenes and Argoverse2 datasets demonstrate that LaneDiffusion significantly outperforms existing methods, achieving improvements of 4.2%, 4.6%, 4.7%, 6.4% and 1.8% on fine-grained point-level metrics (GEO F1, TOPO F1, JTOPO F1, APLS and SDA) and 2.3%, 6.4%, 6.8% and 2.1% on segment-level metrics (IoU, mAP$_{cf}$, DET$_{l}$ and TOP$_{ll}$). These results establish state-of-the-art performance in centerline graph learning, offering new insights into generative models for this task.

Thu 23 Oct. 17:45 - 19:45 PDT

#227
Bridging 3D Anomaly Localization and Repair via High-Quality Continuous Geometric Representation

Bozhong Zheng · Jinye Gan · Xiaohao Xu · Xintao Chen · Wenqiao Li · Xiaonan Huang · Na Ni · Yingna Wu

3D point cloud anomaly detection is essential for robust vision systems but is challenged by pose variations and complex geometric anomalies. Existing patch-based methods often suffer from geometric fidelity issues due to discrete voxelization or projection-based representations, limiting fine-grained anomaly localization. We introduce Pose-Aware Signed Distance Field (PASDF), a novel framework that integrates 3D anomaly detection and repair by learning a continuous, pose-invariant shape representation. PASDF leverages a Pose Alignment Module for canonicalization and a SDF Network to dynamically incorporate pose, enabling implicit learning of high-fidelity anomaly repair templates from the continuous SDF. This facilitates precise pixel-level anomaly localization through an Anomaly-Aware Scoring Module. Crucially, the continuous 3D representation in PASDF extends beyond detection, facilitating in-situ anomaly repair. Experiments on Real3D-AD and Anomaly-ShapeNet demonstrate state-of-the-art performance, achieving high object-level AUROC scores of 80.2% and 90.0%, respectively. These results highlight the effectiveness of continuous geometric representations in advancing 3D anomaly detection and facilitating practical anomaly region repair. Our code will be released to drive further research.

Thu 23 Oct. 17:45 - 19:45 PDT

#228
Video2BEV: Transforming Drone Videos to BEVs for Video-based Geo-localization

Hao Ju · Shaofei Huang · Si Liu · Zhedong Zheng

Existing approaches to drone visual geo-localization predominantly adopt the image-based setting, where a single drone-view snapshot is matched with images from other platforms. Such task formulation, however, underutilizes the inherent video output of the drone and is sensitive to occlusions and viewpoint disparity. To address these limitations, we formulate a new video-based drone geo-localization task and propose the Video2BEV paradigm. This paradigm transforms the video into a Bird's Eye View (BEV), simplifying the subsequent inter-platform matching process. In particular, we employ Gaussian Splatting to reconstruct a 3D scene and obtain the BEV projection. Different from existing transform methods, e.g., the polar transform, our BEVs preserve more fine-grained details without significant distortion. To facilitate discriminative intra-platform representation learning, our Video2BEV paradigm also incorporates a diffusion-based module for generating hard negative samples. To validate our approach, we introduce UniV, a new video-based geo-localization dataset that extends the image-based University-1652 dataset. UniV features flight paths at $30^\circ$ and $45^\circ$ elevation angles with increased frame rates of up to 10 frames per second (FPS). Extensive experiments on the UniV dataset show that our Video2BEV paradigm achieves competitive recall rates and outperforms conventional video-based methods. Compared to other competitive methods, our proposed approach exhibits robustness at lower elevations with more occlusions.

Thu 23 Oct. 17:45 - 19:45 PDT

#229
CuMPerLay: Learning Cubical Multiparameter Persistence Vectorizations

Caner Korkmaz · Brighton Nuwagira · Baris Coskunuzer · Tolga Birdal

We present CuMPerLay, a novel differentiable vectorization layer that enables the integration of Cubical Multiparameter Persistence (CMP) into deep learning pipelines. While CMP presents a natural and powerful way to work topologically with images, its use is hindered by the complexity of multifiltration structures as well as the vectorization of CMP. In the face of these challenges, we introduce a new algorithm for vectorizing MP homologies of cubical complexes. Our CuMPerLay decomposes the CMP into a combination of individual, learnable single-parameter persistences, where the bifiltration functions are jointly learned. Thanks to its differentiability, its robust topological feature vectors can be seamlessly used within state-of-the-art architectures such as Swin Transformers. We establish theoretical guarantees for the stability of our vectorization under generalized Wasserstein metrics. Our experiments on benchmark medical imaging datasets show the benefit of CuMPerLay on classification performance, particularly in limited-data scenarios. Overall, CuMPerLay offers a promising direction for integrating global structural information into deep networks for structured image analysis.

Thu 23 Oct. 17:45 - 19:45 PDT

#230
Highlight
SGAD: Semantic and Geometric-aware Descriptor for Local Feature Matching

Xiangzeng Liu · CHI WANG · Guanglu Shi · Xiaodong Zhang · Qiguang Miao · Miao Fan

Local feature matching remains a fundamental challenge in computer vision. Recent Area to Point Matching (A2PM) methods have improved matching accuracy. However, existing research based on this framework relies on inefficient pixel-level comparisons and complex graph matching that limit scalability. In this work, we introduce the Semantic and Geometric-aware Descriptor Network (SGAD), which fundamentally rethinks area-based matching by generating highly discriminative area descriptors that enable direct matching without complex graph optimization. This approach significantly improves both the accuracy and efficiency of area matching. We further boost area matching performance through a novel supervision strategy that decomposes the task into classification and ranking subtasks. Finally, we introduce the Hierarchical Containment Redundancy Filter (HCRF) to eliminate overlapping areas by analyzing containment graphs. SGAD demonstrates remarkable performance gains, reducing runtime by 60$\times$ (0.82s vs. 60.23s) compared to MESA. Extensive evaluations show consistent improvements across multiple point matchers: SGAD+LoFTR reduces runtime compared to DKM, while achieving higher accuracy (0.82s vs. 1.51s, 65.98 vs. 61.11) in outdoor pose estimation, and SGAD+ROMA delivers +7.39% AUC@5$^\circ$ in indoor pose estimation, establishing a new state-of-the-art.
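
The classification-plus-ranking supervision is described only at a high level above; purely as a hedged illustration, the snippet below combines a standard binary-classification loss with a standard margin ranking loss over hypothetical area-matching scores (the paper's actual loss terms and weighting are not given in the abstract).

```python
# Illustrative only: generic classification + ranking supervision over
# hypothetical area-matching scores, not SGAD's actual objective.
import torch
import torch.nn as nn

cls_loss = nn.BCEWithLogitsLoss()                  # "is this area pair a correct match?"
rank_loss = nn.MarginRankingLoss(margin=0.2)       # "score true matches above false ones"

scores_pos = torch.randn(16, requires_grad=True)   # scores for ground-truth area pairs
scores_neg = torch.randn(16, requires_grad=True)   # scores for incorrect area pairs
labels = torch.cat([torch.ones(16), torch.zeros(16)])

loss = cls_loss(torch.cat([scores_pos, scores_neg]), labels) \
     + rank_loss(scores_pos, scores_neg, torch.ones(16))
loss.backward()
```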

Thu 23 Oct. 17:45 - 19:45 PDT

#231
ScanEdit: Hierarchically-Guided Functional 3D Scan Editing

Mohamed El Amine Boudjoghra · Ivan Laptev · Angela Dai

With the growing ease of capture of real-world 3D scenes, effective editing becomes essential for the use of captured 3D scan data in various graphics applications. We present ScanEdit, which enables functional editing of complex, real-world 3D scans from natural language text prompts. By leveraging the high-level reasoning capabilities of large language models (LLMs), we construct a hierarchical scene graph representation for an input 3D scan given its instance decomposition. We develop a hierarchically-guided, multi-stage prompting approach using LLMs to decompose general language instructions (that can be vague, without referencing specific objects) into specific, actionable constraints that are applied to our scene graph. Our scene optimization integrates LLM-guided constraints along with 3D-based physical plausibility objectives, enabling the generation of edited scenes that align with a variety of input prompts, from abstract, functional-based goals to more detailed, specific instructions. This establishes a foundation for intuitive, text-driven 3D scene editing in real-world scenes.

Thu 23 Oct. 17:45 - 19:45 PDT

#232
Supercharging Floorplan Localization with Semantic Rays

Yuval Grader · Hadar Averbuch-Elor

Floorplans provide a compact representation of the building’s structure, revealing not only layout information but also detailed semantics such as the locations of windows and doors. However, contemporary floorplan localization techniques mostly focus on matching depth-based structural cues, ignoring the rich semantics communicated within floorplans. In this work, we introduce a semantic-aware localization framework that jointly estimates depth and semantic rays, consolidating over both for predicting a structural-semantic probability volume. Our probability volume is constructed in a coarse-to-fine manner: We first sample a small set of rays to obtain an initial low-resolution probability volume. We then refine these probabilities by performing a denser sampling only in high-probability regions and process the refined values for predicting a 2D location and orientation angle. We conduct an evaluation on two standard floorplan localization benchmarks. Our experiments demonstrate that our approach substantially outperforms state-of-the-art methods, achieving significant improvements in recall metrics compared to prior works. Moreover, we demonstrate that our framework can easily incorporate additional metadata such as room labels, enabling additional gains in both accuracy and efficiency. We will release our code and trained models.

Thu 23 Oct. 17:45 - 19:45 PDT

#233
RobustSplat: Decoupling Densification and Dynamics for Transient-Free 3DGS

Chuanyu Fu · Yuqi Zhang · Kunbin Yao · Guanying Chen · Yuan Xiong · Chuan Huang · Shuguang Cui · Xiaochun Cao

3D Gaussian Splatting (3DGS) has gained significant attention for its real-time, photo-realistic rendering in novel-view synthesis and 3D modeling. However, existing methods struggle with accurately modeling scenes affected by transient objects, leading to artifacts in the rendered images. We identify that the Gaussian densification process, while enhancing scene detail capture, unintentionally contributes to these artifacts by growing additional Gaussians that model transient disturbances. To address this, we propose RobustSplat, a robust solution based on two critical designs. First, we introduce a delayed Gaussian growth strategy that prioritizes optimizing static scene structure before allowing Gaussian splitting/cloning, mitigating overfitting to transient objects in early optimization. Second, we design a scale-cascaded mask bootstrapping approach that first leverages lower-resolution feature similarity supervision for reliable initial transient mask estimation, taking advantage of its stronger semantic consistency and robustness to noise, and then progresses to high-resolution supervision to achieve more precise mask prediction. Extensive experiments on multiple challenging datasets show that our method outperforms existing approaches, clearly demonstrating its robustness and effectiveness. Our code will be made publicly available.

Thu 23 Oct. 17:45 - 19:45 PDT

#234
End-to-End Driving with Online Trajectory Evaluation via BEV World Model

Yingyan Li · Yuqi Wang · Yang Liu · Jiawei He · Lue Fan · Zhaoxiang Zhang

End-to-end autonomous driving has achieved remarkable progress by integrating perception, prediction, and planning into a fully differentiable framework. Yet, to fully realize its potential, an effective online trajectory evaluation is indispensable to ensure safety. By forecasting the future outcomes of a given trajectory, trajectory evaluation becomes much more effective. This goal can be achieved by employing a world model to capture environmental dynamics and predict future states. Therefore, we propose an end-to-end driving framework WoTE, which leverages a BEV World model to predict future BEV states for Trajectory Evaluation. The proposed BEV world model is latency-efficient compared to image-level world models and can be seamlessly supervised using off-the-shelf BEV-space traffic simulators. We validate our framework on both the NAVSIM benchmark and the closed-loop Bench2Drive benchmark based on the CARLA simulator, achieving state-of-the-art performance. The code will be released.

Thu 23 Oct. 17:45 - 19:45 PDT

#235
Highlight
Planar Affine Rectification from Local Change of Scale and Orientation

Yuval Nissan · Marc Pollefeys · Daniel Barath

We propose a method for affine rectification of an image plane by leveraging changes in local scales and orientations under projective distortion. Specifically, we derive a novel linear constraint that directly relates pairs of points with orientations to the parameters of a projective transformation. This constraint is combined with an existing linear constraint on local scales, leading to highly robust rectification. The method reduces to solving a system of linear equations, enabling an efficient algebraic least-squares solution. It requires only two local scales and two local orientations, which can be extracted from, e.g., SIFT features. Unlike prior approaches, our method does not impose restrictions on individual features, does not require class segmentation, and makes no assumptions about feature interrelations. It is compatible with any feature detector that provides local scale or orientation. Furthermore, combining scaled and oriented points with line segments yields a highly robust algorithm that outperforms baselines. Extensive experiments show the effectiveness of our approach on real-world images, including repetitive patterns, building facades, and text-based content.
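
For readers unfamiliar with affine rectification, the standard final step is sketched below as background material (it is not the paper's contribution, which lies in estimating the constraints from local scales and orientations): once the imaged line at infinity has been estimated, a single homography maps it back to its canonical position, removing projective distortion up to an affinity.

```python
# Background sketch: given an estimated imaged line at infinity l = (l1, l2, l3),
# this homography maps it back to (0, 0, 1); the vanishing line used here is made up.
import numpy as np

def rectifying_homography(l):
    l = np.asarray(l, dtype=np.float64)
    l = l / l[2]                                    # normalize so the last entry is 1
    return np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0],
                     [l[0], l[1], l[2]]])

def apply_homography(H, pts):                       # pts: (N, 2) pixel coordinates
    ph = np.hstack([pts, np.ones((len(pts), 1))]) @ H.T
    return ph[:, :2] / ph[:, 2:3]

H = rectifying_homography([1e-4, 2e-4, 1.0])        # hypothetical vanishing line
print(apply_homography(H, np.array([[100.0, 200.0], [640.0, 480.0]])))
```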

Thu 23 Oct. 17:45 - 19:45 PDT

#236
ERNet: Efficient Non-Rigid Registration Network for Point Sequences

Guangzhao He · Yuxi Xiao · Zhen Xu · Xiaowei Zhou · Sida Peng

Registering an object shape to a sequence of point clouds undergoing non-rigid deformation is a long-standing challenge. The key difficulties stem from two factors: (i) the presence of local minima due to the non-convexity of registration objectives, especially under noisy or partial inputs, which hinders accurate and robust deformation estimation, and (ii) error accumulation over long sequences, leading to tracking failures. To address these challenges, we adopt a scalable data-driven approach and propose ERNet, an efficient feed-forward model trained on large deformation datasets. It is designed to handle noisy and partial inputs while effectively leveraging temporal information for accurate and consistent sequential registration. The key to our design is predicting a sequence of deformation graphs through a two-stage pipeline, which first estimates frame-wise coarse graph nodes for robust initialization, before refining their trajectories over time in a sliding-window fashion. Extensive experiments show that our proposed approach (i) outperforms the previous state of the art on both the DeformingThings4D and D-FAUST datasets, and (ii) achieves a more than 4x speedup compared to the previous best, offering a significant efficiency improvement.

Thu 23 Oct. 17:45 - 19:45 PDT

#237
SeqGrowGraph: Learning Lane Topology as a Chain of Graph Expansions

Mengwei Xie · Shuang Zeng · Xinyuan Chang · Xinran Liu · Zheng Pan · Mu Xu · Xing Wei

Accurate lane topology is essential for autonomous driving, yet traditional methods struggle to model the complex, non-linear structures—such as loops and bidirectional lanes—prevalent in real-world road structure. We present SeqGrowGraph, a novel framework that learns lane topology as a chain of graph expansions, inspired by human map-drawing processes. Representing the lane graph as a directed graph $G=(V,E)$, with intersections ($V$) and centerlines ($E$), SeqGrowGraph incrementally constructs this graph by introducing one node at a time. At each step, an adjacency matrix ($A$) expands from $n \times n$ to $(n+1) \times (n+1)$ to encode connectivity, while a geometric matrix ($M$) captures centerline shapes as quadratic Bézier curves. The graph is serialized into sequences, enabling a transformer model to autoregressively predict the chain of expansions, guided by a depth-first search ordering. Evaluated on nuScenes and Argoverse 2 datasets, SeqGrowGraph achieves state-of-the-art performance.
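
To make the chain-of-expansions idea concrete, here is a minimal sketch (hypothetical helper names, not the authors' code) of growing an adjacency matrix by one node and sampling a quadratic Bézier centerline for a new edge.

```python
# Minimal sketch of the graph-expansion step described above: each step grows an
# n x n adjacency matrix to (n+1) x (n+1) and stores a quadratic Bezier per new edge.
import numpy as np

def expand_graph(A, new_edges_in, new_edges_out):
    """Grow adjacency matrix A by one node; new_edges_* list indices of existing nodes."""
    n = A.shape[0]
    A_new = np.zeros((n + 1, n + 1), dtype=A.dtype)
    A_new[:n, :n] = A
    A_new[new_edges_in, n] = 1   # edges from existing nodes into the new node
    A_new[n, new_edges_out] = 1  # edges from the new node to existing nodes
    return A_new

def quadratic_bezier(p0, p1, p2, num=20):
    """Sample a quadratic Bezier centerline from three 2D control points."""
    t = np.linspace(0.0, 1.0, num)[:, None]
    return (1 - t) ** 2 * p0 + 2 * (1 - t) * t * p1 + t ** 2 * p2

A = np.zeros((1, 1), dtype=np.int64)          # start from a single intersection
A = expand_graph(A, new_edges_in=[0], new_edges_out=[])
centerline = quadratic_bezier(np.array([0., 0.]), np.array([5., 1.]), np.array([10., 0.]))
print(A, centerline.shape)
```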

Thu 23 Oct. 17:45 - 19:45 PDT

#238
InvRGB+L: Inverse Rendering of Complex Scenes with Unified Color and LiDAR Reflectance Modeling

Xiaoxue Chen · Bhargav Chandaka · Chih-Hao Lin · Ya-Qin Zhang · David Forsyth · Hao Zhao · Shenlong Wang

We present InvRGB+L, a novel inverse rendering model that reconstructs large, relightable, and dynamic scenes from a single RGB+LiDAR sequence. Conventional inverse graphics methods rely primarily on RGB observations and use LiDAR mainly for geometric information, often resulting in suboptimal material estimates due to visible light interference. We find that LiDAR’s intensity values—captured with active illumination in a different spectral range—offer complementary cues for robust material estimation under variable lighting. Inspired by this, InvRGB+L leverages LiDAR intensity cues to overcome challenges inherent in RGB-centric inverse graphics through two key innovations: (1) a novel physics-based LiDAR shading model and (2) RGB–LiDAR material consistency losses. The model produces novel-view RGB and LiDAR renderings of urban and indoor scenes and supports relighting, night simulations, and dynamic object insertions—achieving results that surpass current state-of-the-art methods in both scene-level urban inverse rendering and LiDAR simulation.

Thu 23 Oct. 17:45 - 19:45 PDT

#239
CityGS-X: A Scalable Architecture for Efficient and Geometrically Accurate Large-Scale Scene Reconstruction

Yuanyuan Gao · Hao Li · Jiaqi Chen · Zhihang Zhong · Zhengyu Zou · Dingwen Zhang · Xiao Sun · Junwei Han

Despite its significant achievements in large-scale scene reconstruction, 3D Gaussian Splatting still faces substantial challenges, including slow processing, high computational costs, and limited geometric accuracy. These core issues arise from its inherently unstructured design and the absence of efficient parallelization. To overcome these challenges simultaneously, we introduce \textbf{CityGS-\(\mathcal{X}\)}, a scalable architecture built on a novel parallelized hybrid hierarchical 3D representation (PH$^2$-3D). As an early attempt, CityGS-\(\mathcal{X}\) abandons the cumbersome merge-and-partition process and instead adopts a newly-designed batch-level multi-task rendering process. This architecture enables efficient multi-GPU rendering through dynamic Level-of-Detail voxel allocations, significantly improving scalability and performance. To further enhance both overall quality and geometric accuracy, CityGS-\(\mathcal{X}\) presents a progressive RGB-Depth-Normal training strategy. This approach enhances 3D consistency by jointly optimizing appearance and geometry representation through multi-view constraints and off-the-shelf depth priors within batch-level training. Through extensive experiments, CityGS-\(\mathcal{X}\) consistently outperforms existing methods in terms of faster training times, larger rendering capacities, and more accurate geometric details in large-scale scenes. Notably, CityGS-\(\mathcal{X}\) can train and render a scene with 5,000+ images in just 5 hours using only 4\(\times\)4090 GPUs, a task on which many alternative methods encounter Out-Of-Memory (OOM) issues and fail completely, making CityGS-\(\mathcal{X}\) a far more accessible and scalable solution for this task.

Thu 23 Oct. 17:45 - 19:45 PDT

#240
Doppler-Aware LiDAR-RADAR Fusion for Weather-Robust 3D Detection

Yujeong Chae · Heejun Park · Hyeonseong Kim · Kuk-Jin Yoon

Robust 3D object detection across diverse weather conditions is crucial for safe autonomous driving, and RADAR is increasingly leveraged for its resilience in adverse weather. Recent advancements have explored 4D RADAR and LiDAR-RADAR fusion to enhance 3D perception capabilities, specifically targeting weather robustness. However, existing methods often handle Doppler in ways that are not well-suited for multi-modal settings or lack tailored encoding strategies, hindering effective feature fusion and performance. To address these shortcomings, we propose a novel Doppler-aware LiDAR-4D RADAR fusion (DLRFusion) framework for robust 3D object detection. We introduce a multi-path iterative interaction module that integrates LiDAR, RADAR power, and Doppler, enabling a structured feature fusion process. Doppler highlights dynamic regions, refining RADAR power and enhancing LiDAR features across multiple stages, improving detection confidence. Extensive experiments on the K-RADAR dataset demonstrate that our approach effectively exploits Doppler information, achieving state-of-the-art performance in both normal and adverse weather conditions. The code will be made publicly available.

Thu 23 Oct. 17:45 - 19:45 PDT

#241
Egocentric Action-aware Inertial Localization in Point Clouds with Vision-Language Guidance

Mingfang Zhang · Ryo Yonetani · Yifei Huang · Liangyang Ouyang · Ruicong Liu · Yoichi Sato

This paper presents a novel inertial localization framework named Egocentric Action-aware Inertial Localization (EAIL), which leverages egocentric action cues from head-mounted IMU signals to localize the target individual within a 3D point cloud. Human inertial localization is challenging due to IMU sensor noise that causes trajectory drift over time. The diversity of human actions further complicates IMU signal processing by introducing various motion patterns. Nevertheless, we observe that some actions observed through the head-mounted IMU correlate with spatial environmental structures (e.g., bending down to look inside an oven, washing dishes next to a sink), thereby serving as spatial anchors to compensate for the localization drift. The proposed EAIL framework learns such correlations via hierarchical multi-modal alignment. By assuming that the 3D point cloud of the environment is available, it contrastively learns modality encoders that align short-term egocentric action cues in IMU signals with local environmental features in the point cloud. These encoders are then used in reasoning the IMU data and the point cloud over time and space to perform inertial localization. Interestingly, these encoders can further be utilized to recognize the corresponding sequence of actions as a by-product. Extensive experiments demonstrate the effectiveness of the proposed framework over state-of-the-art inertial localization and inertial action recognition baselines.

Thu 23 Oct. 17:45 - 19:45 PDT

#242
Epona: Autoregressive Diffusion World Model for Autonomous Driving

Kaiwen Zhang · Zhenyu Tang · Xiaotao Hu · Xingang Pan · Xiaoyang Guo · Yuan Liu · Jingwei Huang · Li Yuan · Qian Zhang · XIAOXIAO LONG · Xun Cao · Wei Yin

Diffusion models have demonstrated exceptional visual quality in video generation, making them promising for autonomous driving world modeling. However, existing video diffusion-based world models struggle with flexible-length, long-horizon predictions and integrating trajectory planning. This is because conventional video diffusion models rely on global joint distribution modeling of fixed-length frame sequences rather than sequentially constructing localized distributions at each timestep. In this work, we propose Epona, an autoregressive diffusion world model that enables localized spatiotemporal distribution modeling through two key innovations: 1) Decoupled spatiotemporal factorization that separates temporal dynamics modeling from fine-grained future world generation, and 2) Modular trajectory and video prediction that seamlessly integrate motion planning with visual modeling in an end-to-end framework. Our architecture enables high-resolution, long-duration generation while introducing a novel chain-of-forward training strategy to address error accumulation in autoregressive loops. Experimental results demonstrate state-of-the-art performance with a 7.4% FVD improvement and a prediction duration that is minutes longer than prior works. The learned world model further serves as a real-time motion planner, outperforming strong end-to-end planners on NAVSIM benchmarks.

Thu 23 Oct. 17:45 - 19:45 PDT

#243
DiST-4D: Disentangled Spatiotemporal Diffusion with Metric Depth for 4D Driving Scene Generation

Jiazhe Guo · Yikang Ding · Xiwu Chen · Shuo Chen · Bohan Li · Yingshuang Zou · Xiaoyang Lyu · Feiyang Tan · Xiaojuan Qi · Zhiheng Li · Hao Zhao

Current generative models struggle to synthesize dynamic 4D driving scenes that simultaneously support temporal extrapolation and spatial novel view synthesis (NVS) without per-scene optimization. A key challenge lies in finding an efficient and generalizable geometric representation that seamlessly connects temporal and spatial synthesis. To address this, we propose DiST-4D, the first disentangled spatiotemporal diffusion framework for 4D driving scene generation, which leverages metric depth as the core geometric representation. DiST-4D decomposes the problem into two diffusion processes: DiST-T, which predicts future metric depth and multi-view RGB sequences directly from past observations, and DiST-S, which enables spatial NVS by training only on existing viewpoints while enforcing cycle consistency. This cycle consistency mechanism introduces a forward-backward rendering constraint, reducing the generalization gap between observed and unseen viewpoints. Metric depth is essential for both reliable forecasting and accurate spatial NVS, as it provides a view-consistent geometric representation that generalizes well to unseen perspectives. Experiments demonstrate that DiST-4D achieves state-of-the-art performance in both temporal prediction and NVS tasks, while also delivering competitive performance in planning-related evaluations. The code is available in the supplementary.

Thu 23 Oct. 17:45 - 19:45 PDT

#244
MCOP: Multi-UAV Collaborative Occupancy Prediction

Zefu Lin · Wenbo Chen · Xiaojuan Jin · Yuran Yang · Lue Fan · YIXIN ZHANG · Yufeng Zhang · Zhaoxiang Zhang

Unmanned Aerial Vehicle (UAV) swarm systems necessitate efficient collaborative perception mechanisms for diverse operational scenarios. Current Bird's Eye View (BEV)-based approaches exhibit two main limitations: bounding-box representations fail to capture complete semantic and geometric information of the scene, and their performance significantly degrades when encountering undefined or occluded objects. To address these limitations, we propose a novel multi-UAV collaborative occupancy prediction framework. Our framework effectively preserves 3D spatial structures and semantics through integrating a Spatial-Aware Feature Encoder and Cross-Agent Feature Integration. To enhance efficiency, we further introduce Altitude-Aware Feature Reduction to compactly represent scene information, along with a Dual-Mask Perceptual Guidance mechanism to adaptively select features and reduce communication overhead. Due to the absence of suitable benchmark datasets, we extend three datasets for evaluation: two virtual datasets (Air-to-Pred-Occ and UAV3D-Occ) and one real-world dataset (GauUScene-Occ). Experimental results demonstrate that our method achieves state-of-the-art accuracy, significantly outperforming existing collaborative methods while reducing communication overhead to only a fraction of previous approaches.

Thu 23 Oct. 17:45 - 19:45 PDT

#245
Spectral Sensitivity Estimation with an Uncalibrated Diffraction Grating

Lilika Makabe · Hiroaki Santo · Fumio Okura · Michael Brown · Yasuyuki Matsushita

This paper introduces a practical and accurate calibration method for camera spectral sensitivity using a diffraction grating. Accurate calibration of camera spectral sensitivity is crucial for various computer vision tasks, including color correction, illumination estimation, and material analysis. Unlike existing approaches that require specialized narrow-band filters or reference targets with known spectral reflectances, our method only requires an uncalibrated diffraction grating sheet, readily available off-the-shelf. By capturing images of the direct illumination and its diffracted pattern through the grating sheet, our method estimates both the camera's spectral sensitivity and the diffraction grating parameters in a closed-form manner. Experiments on synthetic and real-world data demonstrate that our approach outperforms reference target-based methods, underscoring its effectiveness and practicality.

Thu 23 Oct. 17:45 - 19:45 PDT

#246
Leveraging Local Patch Alignment to Seam-cutting for Large Parallax Image Stitching

Tianli Liao · Chenyang Zhao · Lei Li · Heling Cao

Seam cutting has shown significant effectiveness in the composition phase of image stitching, particularly for scenarios involving parallax. However, conventional implementations typically position seam-cutting as a downstream process contingent upon successful image alignment. This approach inherently assumes the existence of locally aligned regions where visually plausible seams can be established. Current alignment methods frequently fail to satisfy this prerequisite in large parallax scenarios despite considerable research efforts dedicated to improving alignment accuracy. In this paper, we propose an alignment-compensation paradigm that dissociates seam quality from initial alignment accuracy by integrating a Local Patch Alignment Module (LPAM) into the seam-cutting pipeline. Concretely, given the aligned images with an estimated initial seam, our method first identifies low-quality pixels along the seam through a seam quality assessment, then performs localized SIFT-flow alignment on the critical patches enclosing these pixels. Finally, we recomposite the aligned patches using adaptive seam-cutting and merge them into the original aligned images to generate the final mosaic. Comprehensive experiments on large parallax stitching datasets demonstrate that LPAM significantly enhances stitching quality while maintaining computational efficiency.
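
As a hedged illustration of the seam quality assessment step (the paper's actual metric is not given in the abstract), one simple proxy is to compare small patches from the two aligned images at each seam pixel and flag large local differences for re-alignment.

```python
# Illustrative seam-quality proxy: flag seam pixels whose local patches disagree.
import numpy as np

def seam_quality(img_a, img_b, seam_xy, half=4, thresh=20.0):
    """img_a, img_b: aligned HxWx3 uint8 images; seam_xy: list of (row, col) seam pixels."""
    H, W = img_a.shape[:2]
    bad = []
    for r, c in seam_xy:
        if r < half or c < half or r >= H - half or c >= W - half:
            continue                                   # skip pixels too close to the border
        pa = img_a[r - half:r + half + 1, c - half:c + half + 1].astype(np.float32)
        pb = img_b[r - half:r + half + 1, c - half:c + half + 1].astype(np.float32)
        if np.abs(pa - pb).mean() > thresh:            # large local photometric mismatch
            bad.append((r, c))
    return bad                                         # candidates for local patch re-alignment
```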

Thu 23 Oct. 17:45 - 19:45 PDT

#247
InfiniCube: Unbounded and Controllable Dynamic 3D Driving Scene Generation with World-Guided Video Models

Yifan Lu · Xuanchi Ren · Jiawei Yang · Tianchang Shen · Jay Zhangjie Wu · Jun Gao · Yue Wang · Siheng Chen · Mike Chen · Sanja Fidler · Jiahui Huang

We present InfiniCube, a scalable and controllable method to generate unbounded and dynamic 3D driving scenes with high fidelity. Previous methods for scene generation are constrained either by their applicability to indoor scenes or by their lack of controllability. In contrast, we take advantage of recent advances in 3D and video generative models to achieve large dynamic scene generation that allows flexible controls through HD maps, vehicle bounding boxes, and text descriptions. First, we construct a map-conditioned 3D voxel generative model to unleash its power for unbounded voxel world generation. Then, we re-purpose a video model and ground it on the voxel world through a set of pixel-aligned guidance buffers, synthesizing a consistent appearance on long-video generation for large-scale scenes. Finally, we propose a fast feed-forward approach that employs both voxel and pixel branches to lift videos to dynamic 3D Gaussians with controllable objects. Our method can generate controllable and realistic 3D driving scenes, and extensive experiments validate the effectiveness of our model design. Code will be released upon acceptance.

Thu 23 Oct. 17:45 - 19:45 PDT

#248
PriorMotion: Generative Class-Agnostic Motion Prediction with Raster-Vector Motion Field Priors

Kangan Qian · Jinyu Miao · Xinyu Jiao · Ziang Luo · Zheng Fu · Yining Shi · Yunlong Wang · Kun Jiang · Diange Yang

Reliable spatial and motion perception is essential for safe autonomous navigation. Recently, class-agnostic motion prediction on bird's-eye view (BEV) cell grids derived from LiDAR point clouds has gained significant attention. However, existing frameworks typically perform cell classification and motion prediction on a per-pixel basis, neglecting important motion field priors such as rigidity constraints, temporal consistency, and future interactions between agents. These limitations lead to degraded performance, particularly in sparse and distant regions. To address these challenges, we introduce $\textbf{PriorMotion}$, an innovative generative framework designed for class-agnostic motion prediction that integrates essential motion priors by modeling them as distributions within a structured latent space. Specifically, our method captures structured motion priors using raster-vector representations and employs a variational autoencoder with distinct dynamic and static components to learn future motion distributions in the latent space. Experiments on the nuScenes dataset demonstrate that $\textbf{PriorMotion}$ outperforms state-of-the-art methods across both traditional metrics and our newly proposed evaluation criteria. Notably, we achieve improvements of approximately 15.24% in accuracy for fast-moving objects, a 3.59% increase in generalization, a reduction of 0.0163 in motion stability, and a 31.52% reduction in prediction errors in distant regions. Further validation on FMCW LiDAR sensors confirms the robustness of our approach.

Thu 23 Oct. 17:45 - 19:45 PDT

#249
MGSR: 2D/3D Mutual-boosted Gaussian Splatting for High-fidelity Surface Reconstruction under Various Light Conditions

Qingyuan Zhou · Yuehu Gong · Weidong Yang · Jiaze Li · Yeqi Luo · Baixin Xu · Shuhao Li · Ben Fei · Ying He

Novel view synthesis (NVS) and surface reconstruction (SR) are essential tasks in 3D Gaussian Splatting (3D-GS). Despite recent progress, these tasks are often addressed independently, with GS-based rendering methods struggling under diverse light conditions and failing to produce accurate surfaces, while GS-based reconstruction methods frequently compromise rendering quality. This raises a central question: must rendering and reconstruction always involve a trade-off? To address this, we propose $MGSR$, a 2D/3D $M$utual-boosted $G$aussian Splatting for $S$urface $R$econstruction that enhances both rendering quality and 3D reconstruction accuracy. MGSR introduces two branches—one based on 2D-GS and the other on 3D-GS. The 2D-GS branch excels in surface reconstruction, providing precise geometry information to the 3D-GS branch. Leveraging this geometry, the 3D-GS branch employs a geometry-guided illumination decomposition module that captures reflected and transmitted components, enabling realistic rendering under varied light conditions. Using the transmitted component as supervision, the 2D-GS branch also achieves high-fidelity surface reconstruction. Throughout the optimization process, the 2D-GS and 3D-GS branches undergo alternating optimization, providing mutual supervision. Prior to this, each branch completes an independent warm-up phase, with an early stopping strategy implemented to reduce computational costs. We evaluate MGSR on a diverse set of synthetic and real-world datasets, at both object and scene levels, demonstrating strong performance in rendering and surface reconstruction.

Thu 23 Oct. 17:45 - 19:45 PDT

#250
Hydra-NeXt: Robust Closed-Loop Driving with Open-Loop Training

Zhenxin Li · Shihao Wang · Shiyi Lan · Zhiding Yu · Zuxuan Wu · Jose M. Alvarez

End-to-end autonomous driving research currently faces a critical challenge in bridging the gap between open-loop training and closed-loop deployment. Current approaches are trained to predict trajectories in an open-loop environment; they struggle to react quickly to other agents in closed-loop environments and risk generating kinematically infeasible plans due to the gap between open-loop training and closed-loop driving. In this paper, we introduce Hydra-NeXt, a novel multi-branch planning framework that unifies trajectory prediction, control prediction, and a trajectory refinement network in one model. Unlike current open-loop trajectory prediction models that only handle general-case planning, Hydra-NeXt further utilizes a control decoder to focus on short-term actions, which enables faster responses to dynamic situations and reactive agents. Moreover, we propose the Trajectory Refinement module to augment and refine the planning decisions by effectively adhering to kinematic constraints in closed-loop environments. This unified approach bridges the gap between open-loop training and closed-loop driving, demonstrating superior performance of 65.89 Driving Score (DS) and a 48.20% Success Rate (SR) on the Bench2Drive dataset without relying on external experts for data collection. Hydra-NeXt surpasses the previous state-of-the-art by 22.98 DS and 17.49 SR, marking a significant advancement in autonomous driving.

Thu 23 Oct. 17:45 - 19:45 PDT

#251
IntrinsicControlNet: Cross-distribution Image Generation with Real and Unreal

Jiayuan Lu · Rengan Xie · Zixuan Xie · Zhizhen Wu · Dianbing Xi · Qi Ye · Rui Wang · Hujun Bao · Yuchi Huo

Realistic images are usually produced by simulating light transportation results of 3D scenes using rendering engines. This framework can precisely control the output but is usually weak at producing photo-like images. Alternatively, diffusion models have seen great success in photorealistic image generation by leveraging priors from large datasets of real-world images but lack affordance controls. Promisingly, the recent ControlNet enables flexible control of the diffusion model without degrading its generation quality. In this work, we introduce IntrinsicControlNet, an intrinsically controllable image generation framework that enables easily generating photorealistic images from precise and explicit control, similar to a rendering engine, by using intrinsic images such as material properties, geometric details, and lighting as network inputs. Beyond this, we notice that there is a domain gap between the synthetic and real-world datasets, and therefore, naively blending these datasets yields domain confusion. To address this problem, we present a cross-domain control architecture that extracts control information from synthetic datasets, and control and content information from real-world datasets. This bridges the domain gap between real-world and synthetic datasets, enabling the blending or editing of 3D assets and real-world photos to support various interesting applications. Experiments and user studies demonstrate that our method can generate explicitly controllable and highly photorealistic images based on the input intrinsic images.

Thu 23 Oct. 17:45 - 19:45 PDT

#252
SteerX: Creating Any Camera-Free 3D and 4D Scenes with Geometric Steering

Byeongjun Park · Hyojun Go · Hyelin Nam · Byung-Hoon Kim · Hyungjin Chung · Changick Kim

Recent progress in 3D/4D scene generation emphasizes the importance of physical alignment throughout video generation and scene reconstruction. However, existing methods improve the alignment separately at each stage, making it difficult to manage subtle misalignments arising from another stage. Here, we present SteerX, a zero-shot inference-time steering method that unifies scene reconstruction into the generation process, tilting data distributions toward better geometric alignment. To this end, we introduce two geometric reward functions for 3D/4D scene generation by using pose-free feed-forward scene reconstruction models. Through extensive experiments, we demonstrate the effectiveness of SteerX in improving 3D/4D scene generation.

Thu 23 Oct. 17:45 - 19:45 PDT

#253
WIPES: Wavelet-based Visual Primitives

Wenhao Zhang · Hao Zhu · Delong Wu · Di Kang · Linchao Bao · Xun Cao · Zhan Ma

Pursuing a continuous visual representation that offers flexible frequency modulation and fast rendering speed has recently garnered increasing attention in the fields of 3D vision and graphics. However, existing representations often rely on frequency guidance or complex neural network decoding, leading to spectrum loss or slow rendering. To address these limitations, we propose WIPES, a universal Wavelet-based vIsual PrimitivES for representing multi-dimensional visual signals. Building on the spatial-frequency localization advantages of wavelets, WIPES effectively captures both the low-frequency "forest" and the high-frequency "trees." Additionally, we develop a wavelet-based differentiable rasterizer to achieve fast visual rendering. Experimental results on various visual tasks, including 2D image representation, 5D static and 6D dynamic novel view synthesis, demonstrate that WIPES, as a visual primitive, offers higher rendering quality and faster inference than INR-based methods, and outperforms Gaussian-based representations in rendering quality.
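
The abstract does not specify the primitive itself; purely as an illustration of what a localized 2D wavelet primitive looks like, the sketch below sums random Ricker ("Mexican hat") primitives into an image. This is not the WIPES primitive or its differentiable rasterizer.

```python
# Illustrative only: render an image as a sum of localized 2D wavelet-like primitives.
import numpy as np

def ricker_2d(xx, yy, cx, cy, scale, amp):
    """Isotropic Ricker wavelet centered at (cx, cy) with the given scale and amplitude."""
    r2 = ((xx - cx) ** 2 + (yy - cy) ** 2) / scale ** 2
    return amp * (1.0 - r2) * np.exp(-0.5 * r2)

H = W = 128
yy, xx = np.mgrid[0:H, 0:W].astype(np.float64)
rng = np.random.default_rng(0)
image = np.zeros((H, W))
for _ in range(50):                      # accumulate 50 random primitives
    image += ricker_2d(xx, yy, rng.uniform(0, W), rng.uniform(0, H),
                       rng.uniform(3, 15), rng.uniform(-1, 1))
```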

Thu 23 Oct. 17:45 - 19:45 PDT

#254
DiffPCI: Large Motion Point Cloud frame Interpolation with Diffusion Model

tianyu zhang · Haobo Jiang · jian Yang · Jin Xie

Point cloud interpolation aims to recover intermediate frames for temporally smoothing a point cloud sequence. However, real-world challenges, such as uneven or large scene motions, cause existing methods to struggle with limited interpolation precision. To address this, we introduce DiffPCI, a novel diffusion interpolation model that formulates the frame interpolation task as a progressive denoising diffusion process. Training DiffPCI involves two key stages: a forward interpolation diffusion process and a reverse interpolation denoising process. In the forward process, the clean intermediate frame is progressively transformed into a noisy one through continuous Gaussian noise injection. The reverse process then focuses on training a denoiser to gradually refine this noisy frame back to the ground-truth frame. In particular, we derive a point cloud interpolation-specific variational lower bound as our optimization objective for denoiser training. Furthermore, to alleviate interpolation errors, especially in highly dynamic scenes, we develop a novel full-scale, dual-branch denoiser that enables more comprehensive front-back frame information fusion for robust bi-directional interpolation. Extensive experiments demonstrate that DiffPCI significantly outperforms current state-of-the-art frame interpolation methods (e.g., 27% and 860% reductions in Chamfer Distance and Earth Mover’s Distance on nuScenes).
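
The forward noising of the clean intermediate frame described above follows the usual diffusion recipe; a generic sketch under a standard DDPM-style schedule (the paper's exact parameterization may differ) is shown below.

```python
# Generic forward (noising) step applied to a point cloud frame under a linear
# beta schedule; illustrative of the described process, not the authors' code.
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)               # cumulative signal-retention factor

def noisy_frame(x0, t, rng):
    """x0: (N, 3) clean intermediate frame; returns its noised version at step t."""
    eps = rng.normal(size=x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps                                # the denoiser is trained to recover x0 (or eps)

rng = np.random.default_rng(0)
x0 = rng.normal(size=(2048, 3))
xt, eps = noisy_frame(x0, t=500, rng=rng)
```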

Thu 23 Oct. 17:45 - 19:45 PDT

#255
UPP: Unified Point-Level Prompting for Robust Point Cloud Analysis

Zixiang Ai · Zhenyu Cui · Yuxin Peng · Jiahuan Zhou

Pre-trained point cloud analysis models have shown promising advancements in various downstream tasks, yet their effectiveness typically suffers from low-quality point clouds (i.e., noise and incompleteness), a common issue in real-world data due to casual object occlusions and unsatisfactory data collected by 3D sensors. To this end, existing methods focus on enhancing point cloud quality by developing dedicated denoising and completion models. However, due to the isolation between the point cloud enhancement tasks and downstream tasks, these methods fail to work in various real-world domains. In addition, the conflicting objectives between point cloud denoising and completion tasks further limit the ensemble paradigm's ability to preserve critical geometric features in real scenarios. To tackle the above challenges, we propose a unified point-level prompting method that reformulates point cloud denoising and completion as a prompting mechanism, enabling robust analysis in a parameter-efficient manner. We start by introducing a Rectification Prompter to adapt to noisy points through the predicted rectification vector prompts, effectively filtering noise while preserving intricate geometric features essential for accurate analysis. Subsequently, we further incorporate a Completion Prompter to generate auxiliary point prompts based on the rectified point clouds, facilitating their robustness and adaptability. Finally, a Shape-Aware Unit module is exploited to efficiently unify and capture the filtered geometric features and the downstream task-aware detail information for point cloud analysis. Extensive experiments on four datasets demonstrate the superiority and robustness of our method when handling noisy and incomplete point cloud data against existing state-of-the-art methods. Our code will be released soon.

Thu 23 Oct. 17:45 - 19:45 PDT

#256
ArgMatch: Adaptive Refinement Gathering for Efficient Dense Matching

Yuxin Deng · Kaining Zhang · Linfeng Tang · Jiaqi Yang · Jiayi Ma

Establishing dense correspondences is crucial yet computationally expensive in many multi-view tasks. Although state-of-the-art dense matchers typically adopt a coarse-to-fine scheme to mitigate the computational cost, their efficiency is often compromised by the use of heavy models with redundant feature representations, which are essential for desirable results. In this work, we introduce adaptive refinement gathering, which significantly alleviates this computational burden without sacrificing much accuracy. The pipeline consists of (i) a context-aware offset estimator, which exploits content information of rough features to enhance offset decoding accuracy; (ii) a locally consistent match rectifier, which corrects erroneous initial matches using local consistency; and (iii) a locally consistent upsampler, which mitigates over-smoothing at depth-discontinuous edges. Additionally, we propose an adaptive gating strategy, combined with the nature of local consistency, to dynamically modulate the contribution of different components and pixels, enabling adaptive gradient backpropagation and fully unleashing the network's capacity. Compared to the state of the art, our lightweight network, termed ArgMatch, achieves competitive performance on MegaDepth, while using 90% fewer parameters, 73% less computation time, and 84% lower memory cost.

Thu 23 Oct. 17:45 - 19:45 PDT

#257
RoboTron-Sim: Improving Real-World Driving via Simulated Hard-Case

Baihui Xiao · Chengjian Feng · Zhijian Huang · Feng yan · Yujie Zhong · Lin Ma

Collecting real-world data for rare high-risk scenarios, long-tailed driving events, and complex interactions remains challenging, leading to poor performance of existing autonomous driving systems in these critical situations. In this paper, we propose SimBoost that improves real-world driving in critical situations by utilizing simulated hard cases. First, we develop a simulated dataset called Hard-case Augmented Synthetic Scenarios (HASS), which covers 13 high-risk edge-case categories, as well as balanced environmental conditions such as day/night and sunny/rainy. Secondly, we introduce Scenario-aware Prompt Engineering (SPE) and an Image-to-Ego Encoder (I2E Encoder) to enable multimodal large language models to effectively learn real-world challenging driving skills from HASS, via adapting to environmental deviations and hardware differences between real-world and simulated scenarios. Extensive experiments are conducted on nuScenes, where SimBoost improves driving performance in challenging scenarios by about 50%, achieving state-of-the-art results in real-world open-loop planning. Qualitative results further demonstrate the effectiveness of SimBoost in better managing rare high-risk driving scenarios.

Thu 23 Oct. 17:45 - 19:45 PDT

#258
Highlight
Thermal Polarimetric Multi-view Stereo

Takahiro Kushida · Kenichiro Tanaka

This paper introduces a novel method for detailed 3D shape reconstruction utilizing thermal polarization cues. Unlike state-of-the-art methods, the proposed approach is independent of illumination, material properties, and heating processes. In this paper, we formulate a general theory of polarization observation and show that long-wave infrared (LWIR) polarimetric imaging is free from the ambiguities that affect visible polarization analyses. Subsequently, we propose a method for recovering detailed 3D shapes using thermal polarimetric images, showing that our approach effectively reconstructs fine details on heterogeneous materials and outperforms existing techniques.
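
As background for readers (standard polarimetry, not the paper's LWIR-specific theory), the linear Stokes parameters, degree of linear polarization (DoLP), and angle of linear polarization (AoLP) can be computed from intensity images captured at four polarizer angles.

```python
# Standard linear Stokes computation from 0/45/90/135-degree polarizer images.
import numpy as np

def linear_stokes(i0, i45, i90, i135):
    s0 = 0.5 * (i0 + i45 + i90 + i135)            # total intensity
    s1 = i0 - i90
    s2 = i45 - i135
    dolp = np.sqrt(s1 ** 2 + s2 ** 2) / np.clip(s0, 1e-8, None)
    aolp = 0.5 * np.arctan2(s2, s1)
    return dolp, aolp

rng = np.random.default_rng(0)
imgs = [rng.uniform(0.2, 1.0, size=(64, 64)) for _ in range(4)]   # synthetic captures
dolp, aolp = linear_stokes(*imgs)
```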

Thu 23 Oct. 17:45 - 19:45 PDT

#259
StealthAttack: Robust 3D Gaussian Splatting Poisoning via Density-Guided Illusions

Bo-Hsu Ke · You-Zhe Xie · Yu-Lun Liu · Wei-Chen Chiu

3D scene representation methods like Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) have significantly advanced novel view synthesis. As these methods become prevalent, addressing their vulnerabilities becomes critical. We analyze 3DGS robustness against image-level poisoning attacks and propose a novel density-guided poisoning method. Our method strategically injects Gaussian points into low-density regions identified via Kernel Density Estimation (KDE), embedding viewpoint-dependent illusory objects clearly visible from poisoned views while minimally affecting innocent views. Additionally, we introduce an adaptive noise strategy to disrupt multi-view consistency, further enhancing attack effectiveness. We propose a KDE-based evaluation protocol to assess attack difficulty systematically, enabling objective benchmarking for future research. Extensive experiments demonstrate our method's superior performance compared to state-of-the-art techniques.
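
A minimal sketch of the density-guided idea (illustrative, not the authors' implementation): score candidate injection sites with a kernel density estimate over the existing Gaussian centers and keep the lowest-density ones.

```python
# Rank candidate 3D positions by KDE density and keep the sparsest regions.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
scene_points = rng.normal(size=(5000, 3))          # stand-in for existing Gaussian centers
candidates = rng.uniform(-3, 3, size=(2000, 3))    # candidate injection positions

kde = gaussian_kde(scene_points.T)                 # KDE over the existing point density
density = kde(candidates.T)

k = 100
low_density_sites = candidates[np.argsort(density)[:k]]
print(low_density_sites.shape)                     # (100, 3) lowest-density candidates
```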

Thu 23 Oct. 17:45 - 19:45 PDT

#260
LongSplat: Robust Unposed 3D Gaussian Splatting for Casual Long Videos

Chin-Yang Lin · Cheng Sun · Fu-En Yang · Min-Hung Chen · Yen-Yu Lin · Yu-Lun Liu

LongSplat addresses critical challenges in novel view synthesis (NVS) from casually captured long videos characterized by irregular camera motion, unknown camera poses, and expansive scenes. Current methods often suffer from pose drift, inaccurate geometry initialization, and severe memory limitations. To address these issues, we introduce LongSplat, a robust unposed 3D Gaussian Splatting framework featuring: (1) Incremental Joint Optimization that concurrently optimizes camera poses and 3D Gaussians to avoid local minima and ensure global consistency; (2) a Tracking and Alignment Module leveraging learned 3D priors, which combines correspondence-guided PnP initialization with photometric refinement for accurate camera tracking; and (3) an adaptive Octree Anchor Formation mechanism that dynamically adjusts anchor densities, significantly reducing memory usage. Extensive experiments on challenging benchmarks (Tanks and Temples, Free, and Hike datasets) demonstrate that LongSplat achieves state-of-the-art results, substantially improving rendering quality, pose accuracy, and computational efficiency compared to prior approaches.
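
The correspondence-guided PnP initialization mentioned above can be illustrated with OpenCV's standard RANSAC PnP solver; this is a generic sketch with synthetic correspondences, not the authors' pipeline.

```python
# Generic PnP-from-correspondences initialization using OpenCV's RANSAC solver.
import numpy as np
import cv2

rng = np.random.default_rng(0)
pts3d = rng.uniform(0.0, 4.0, size=(200, 3))                     # matched 3D points (prior)
K = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])
rvec_gt = np.array([0.1, -0.2, 0.05])
tvec_gt = np.array([0.3, 0.1, 2.0])
pts2d, _ = cv2.projectPoints(pts3d, rvec_gt, tvec_gt, K, None)   # synthetic 2D matches

ok, rvec, tvec, inliers = cv2.solvePnPRansac(pts3d, pts2d, K, None)
print(ok, rvec.ravel(), tvec.ravel())                            # initial pose estimate
```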

Thu 23 Oct. 17:45 - 19:45 PDT

#261
WonderTurbo: Generating Interactive 3D World in 0.72 Seconds

Chaojun Ni · Xiaofeng Wang · Zheng Zhu · Weijie Wang · Haoyun Li · Guosheng Zhao · Jie Li · Wenkang Qin · Guan Huang · Wenjun Mei

Interactive 3D generation is gaining momentum and capturing extensive attention for its potential to create immersive virtual experiences. However, a critical challenge in current 3D generation technologies lies in achieving real-time interactivity. To address this issue, we introduce WonderTurbo, the first real-time interactive 3D scene generation framework capable of generating novel perspectives of 3D scenes within 0.72 seconds. Specifically, WonderTurbo accelerates both geometric and appearance modeling in 3D scene generation. In terms of geometry, we propose StepSplat, an innovative method that constructs efficient 3D geometric representations through dynamic updates, each taking only 0.26 seconds. Additionally, we design QuickDepth, a lightweight depth completion module that provides consistent depth input for StepSplat, further enhancing geometric accuracy. For appearance modeling, we develop FastPaint, a two-step diffusion model tailored for instant inpainting, which focuses on maintaining spatial appearance consistency. Experimental results demonstrate that WonderTurbo achieves a remarkable 15$\times$ speedup compared to baseline methods, while preserving excellent spatial consistency and delivering high-quality output.

Thu 23 Oct. 17:45 - 19:45 PDT

#262
Geometric Alignment and Prior Modulation for View-Guided Point Cloud Completion on Unseen Categories

Jingqiao Xiu · Yicong Li · Na Zhao · Han Fang · Xiang Wang · Angela Yao

View-Guided Point Cloud Completion (VG-PCC) aims to reconstruct complete point clouds from partial inputs by referencing single-view images. While existing VG-PCC models perform well on in-class predictions, they exhibit significant performance drops when generalizing to unseen categories. We identify two key limitations underlying this challenge: (1) Current encoders struggle to bridge the substantial modality gap between images and point clouds. Consequently, their learned representations often lack robust cross-modal alignment and over-rely on superficial class-specific patterns. (2) Current decoders refine global structures holistically, overlooking local geometric patterns that are class-agnostic and transferable across categories. To address these issues, we present a novel generalizable VG-PCC framework for unseen categories based on Geometric Alignment and Prior Modulation (GAPM). First, we introduce a Geometry Aligned Encoder that lifts reference images into 3D space via depth maps for natural alignment with the partial point clouds. This reduces dependency on class-specific RGB patterns that hinder generalization to unseen classes. Second, we propose a Prior Modulated Decoder that incorporates class-agnostic local priors to reconstruct shapes on a regional basis. This allows the adaptive reuse of learned geometric patterns that promote generalization to unseen classes. Extensive experiments validate that GAPM consistently outperforms existing models on both seen and, notably, unseen categories, establishing a new benchmark for unseen-category generalization in VG-PCC. Our code can be found in the supplementary material.
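
The "lifting via depth maps" step can be illustrated with a standard pinhole unprojection; this is a generic sketch with made-up intrinsics and depth, and it does not reproduce the encoder itself.

```python
# Standard pinhole back-projection of a depth map into camera-frame 3D points.
import numpy as np

def unproject(depth, K):
    """depth: (H, W) metric depth; K: 3x3 intrinsics. Returns (H*W, 3) 3D points."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)
    rays = pix @ np.linalg.inv(K).T          # one camera ray per pixel
    return rays * depth.reshape(-1, 1)       # scale rays by depth

K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
depth = np.full((480, 640), 2.0)             # hypothetical constant-depth map
points = unproject(depth, K)
print(points.shape)                          # (307200, 3)
```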

Thu 23 Oct. 17:45 - 19:45 PDT

#263
AccidentalGS: 3D Gaussian Splatting from Accidental Camera Motion

Mao Mao · Xujie Shen · Guyuan Chen · Boming Zhao · Jiarui Hu · Hujun Bao · Zhaopeng Cui

Neural 3D modeling and novel view synthesis with Neural Radiance Fields (NeRF) or 3D Gaussian Splatting (3DGS) typically requires multi-view images with wide baselines and accurate camera poses as input. However, scenarios with accidental camera motions are rarely studied. In this paper, we propose AccidentalGS, the first method for neural 3D modeling and novel view synthesis from accidental camera motions. To achieve this, we present a novel joint optimization framework that considers geometric and photometric errors, using a simplified camera model for stability. We also introduce a novel online adaptive depth-consistency loss to prevent the overfitting of the Gaussian model to input images. Extensive experiments on both synthetic and real-world datasets show that AccidentalGS achieves more accurate camera poses and realistic novel views compared to existing methods, and supports 3D modeling and neural rendering even for the Moon with telescope-like images.
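
The depth-consistency idea can be illustrated generically (this does not reproduce the paper's online adaptive weighting): align a scale/shift-ambiguous monocular depth prior to the rendered depth by least squares, then penalize the remaining gap.

```python
# Generic depth-consistency term between rendered depth and a monocular prior.
import torch

def depth_consistency(rendered, prior):
    """rendered, prior: (H, W) depth maps; the prior is scale/shift ambiguous."""
    x, y = prior.reshape(-1), rendered.reshape(-1)
    A = torch.stack([x, torch.ones_like(x)], dim=1)          # fit rendered ~ a * prior + b
    sol = torch.linalg.lstsq(A, y.unsqueeze(1)).solution
    aligned = (A @ sol).reshape(rendered.shape)
    return (rendered - aligned).abs().mean()                 # penalize residual disagreement

loss = depth_consistency(torch.rand(120, 160) * 3.0, torch.rand(120, 160))
```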

Thu 23 Oct. 17:45 - 19:45 PDT

#264
MOSAIC: Generating Consistent, Privacy-Preserving Scenes from Multiple Depth Views in Multi-Room Environments

Zhixuan Liu · Haokun Zhu · Rui Chen · Jonathan Francis · Soonmin Hwang · Ji Zhang · Jean Oh

We introduce a novel diffusion-based approach for generating privacy-preserving digital twins of multi-room indoor environments from depth images only. Central to our approach is a novel Multi-view Overlapped Scene Alignment with Implicit Consistency (MOSAIC) model that explicitly considers cross-view dependencies within the same scene in the probabilistic sense. MOSAIC operates through a novel inference-time optimization that avoids the error accumulation common in sequential approaches and the single-room constraint of panorama-based approaches. MOSAIC scales to complex scenes with zero extra training and provably reduces the variance during denoising processes when more overlapping views are added, leading to improved generation quality. Experiments show that MOSAIC outperforms state-of-the-art baselines on image fidelity metrics in reconstructing complex multi-room environments.

Thu 23 Oct. 17:45 - 19:45 PDT

#265
Coordinate-based Speed of Sound Recovery for Aberration-Corrected Photoacoustic Computed Tomography

Tianao Li · Manxiu Cui · Cheng Ma · Emma Alexander

Photoacoustic computed tomography (PACT) is a non-invasive imaging modality, similar to ultrasound, with wide-ranging medical applications. Conventional PACT images are degraded by wavefront distortion caused by the heterogeneous speed of sound (SOS) in tissue. Accounting for these effects can improve image quality and provide medically useful information, but measuring the SOS directly is burdensome and the existing joint reconstruction method is computationally expensive. Traditional supervised learning techniques are currently inaccessible in this data-starved domain. In this work, we introduce an efficient, self-supervised joint reconstruction method that recovers SOS and high-quality images using a differentiable physics model to solve the semi-blind inverse problem. The SOS, parametrized by either a pixel grid or a neural field (NF), is updated directly by backpropagation. Our method removes SOS aberrations more accurately and 35x faster than the current SOTA. We demonstrate the success of our method quantitatively in simulation and qualitatively on experimentally-collected and in-vivo data.
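
The "SOS parametrized by a neural field and updated directly by backpropagation" can be sketched as follows; this is a toy stand-in where the differentiable physics forward model is replaced by a placeholder residual, and all names are hypothetical.

```python
# Toy coordinate-based SOS field updated by backpropagation through a surrogate loss.
import torch
import torch.nn as nn

class SOSField(nn.Module):
    def __init__(self, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )
    def forward(self, xy):                                    # xy: (N, 2) coords in [-1, 1]
        return 1500.0 + 100.0 * torch.tanh(self.net(xy))      # SOS values around 1500 m/s

field = SOSField()
opt = torch.optim.Adam(field.parameters(), lr=1e-3)
coords = torch.rand(1024, 2) * 2 - 1
target = torch.full((1024, 1), 1540.0)                        # placeholder "data-consistent" SOS

for _ in range(100):
    opt.zero_grad()
    loss = ((field(coords) - target) ** 2).mean()             # stands in for the physics residual
    loss.backward()                                           # gradients flow to the field weights
    opt.step()
```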

Thu 23 Oct. 17:45 - 19:45 PDT

#266
Highlight
Sparfels: Fast Reconstruction from Sparse Unposed Imagery

Shubhendu Jena · Amine Ouasfi · Mae Younes · Adnane Boukhayma

We present a method for sparse-view reconstruction with surface element splatting that runs within 2 minutes on a consumer-grade GPU. While few methods address sparse radiance field learning from noisy or unposed sparse cameras, shape recovery remains relatively underexplored in this setting. Several radiance and shape learning test-time optimization methods address the sparse posed setting by learning data priors or using combinations of external monocular geometry priors. Differently, we propose an efficient and simple pipeline harnessing a single recent 3D foundation model. We leverage its various task heads, notably point maps and camera initializations, to instantiate a bundle-adjusting 2D Gaussian Splatting (2DGS) model, and image correspondences to guide camera optimization during 2DGS training. Key to our contribution is a novel formulation of splatted color variance along rays, which can be computed efficiently. Reducing this moment in training leads to more accurate shape reconstructions. We demonstrate state-of-the-art performance in the sparse uncalibrated setting on reconstruction and novel view synthesis benchmarks based on established multi-view datasets.
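
One way to read "splatted color variance along rays" is as a weighted second moment of splat colors under the compositing weights; the sketch below computes that generic quantity (the paper's exact formulation and its efficient computation may differ).

```python
# Generic per-ray weighted color variance: E[c^2] - E[c]^2 under compositing weights.
import torch

def ray_color_variance(colors, weights):
    """colors: (N, 3) splat colors hit along one ray; weights: (N,) compositing weights."""
    w = weights / weights.sum().clamp_min(1e-8)       # normalize weights along the ray
    mean = (w[:, None] * colors).sum(dim=0)           # E[c]
    second = (w[:, None] * colors ** 2).sum(dim=0)    # E[c^2]
    return (second - mean ** 2).sum()                 # summed per-channel variance

var = ray_color_variance(torch.rand(32, 3), torch.rand(32))
```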

Thu 23 Oct. 17:45 - 19:45 PDT

#267
GenFlow3D: Generative Scene Flow Estimation and Prediction on Point Cloud Sequences

Hanlin Li · Wenming Weng · Yueyi Zhang · Zhiwei Xiong

Scene flow provides fundamental information about scene dynamics. Existing scene flow estimation methods typically rely on the correlation between a single pair of consecutive point clouds, limiting them to the instantaneous state of the scene and making them struggle in real-world scenarios with occlusion, noise, and diverse background and foreground motion. In this paper, we study joint sequential scene flow estimation and future scene flow prediction on point cloud sequences. The expanded sequential input introduces long-term and high-order motion information. We propose GenFlow3D, a recurrent neural network model that integrates diffusion in the decoder to better unify the two tasks and enhance the ability to extract general motion patterns. A transformer-based denoising network is adopted to help capture useful information. Depending on the input point clouds, discriminative condition signals are generated to guide the diffusion decoder to switch among different modes specific to scene flow estimation and prediction in a multi-scale manner. GenFlow3D is evaluated on the real-world nuScenes and Argoverse 2 datasets and demonstrates superior performance compared with existing methods.

Thu 23 Oct. 17:45 - 19:45 PDT

#268
Self-Supervised Sparse Sensor Fusion for Long Range Perception

Edoardo Palladin · Samuel Brucker · Filippo Ghilotti · Praveen Narayanan · Mario Bijelic · Felix Heide

Outside of urban hubs, autonomous cars and trucks have to master driving on intercity highways. Safe, long-distance highway travel at speeds exceeding 100 km/h demands perception distances of at least 250 m, about five times the 50–100 m typically addressed in city driving, to allow sufficient planning and braking margins. Increasing the perception range also makes it possible to extend autonomy from light two-ton passenger vehicles to large-scale forty-ton trucks, which need a longer planning horizon due to their high inertia. However, most existing perception approaches focus on shorter ranges and rely on Bird’s Eye View (BEV) representations, which incur quadratic increases in memory and compute costs as distance grows. To overcome this limitation, we build on a sparse representation and introduce an efficient 3D encoding of multi-modal and temporal features, along with a novel self-supervised pre-training scheme that enables large-scale learning from unlabeled camera-LiDAR data. Our approach extends perception distances to 250 meters and achieves a 26.6% improvement in object detection mAP and a 30.5% decrease in Chamfer Distance for LiDAR forecasting compared to existing methods.

Thu 23 Oct. 17:45 - 19:45 PDT

#269
Generative Gaussian Splatting: Generating 3D Scenes with Video Diffusion Priors

Katja Schwarz · Norman Müller · Peter Kontschieder

Synthesizing consistent and photorealistic 3D scenes is an open problem in computer vision. Video diffusion models generate impressive videos but cannot directly synthesize 3D representations, i.e., lack 3D consistency in the generated sequences. In addition, directly training generative 3D models is challenging due to a lack of 3D training data at scale. In this work, we present Generative Gaussian Splatting (GGS) – a novel approach that integrates a 3D representation with a pre-trained latent video diffusion model. Specifically, our model synthesizes a feature field parameterized via 3D Gaussian primitives. The feature field is then either rendered to feature maps and decoded into multi-view images, or directly upsampled into a 3D radiance field. We evaluate our approach on two common benchmark datasets for scene synthesis, RealEstate10K and ScanNet++, and find that our proposed GGS model significantly improves both the 3D consistency of the generated multi-view images, and the quality of the generated 3D scenes over all relevant baselines. Compared to a similar model without 3D representation, GGS improves FID on the generated 3D scenes by ~20% on both RealEstate10K and ScanNet++.

Thu 23 Oct. 17:45 - 19:45 PDT

#270
You Share Beliefs, I Adapt: Progressive Heterogeneous Collaborative Perception

hao si · Ehsan Javanmardi · Manabu Tsukada

Collaborative perception enables vehicles to overcome individual perception limitations by sharing information, allowing them to see further and through occlusions. In real-world scenarios, models on different vehicles are often heterogeneous due to manufacturer variations. Existing methods for heterogeneous collaborative perception address this challenge by fine-tuning adapters or the entire network to bridge the domain gap. However, these methods are impractical in real-world applications, as each new collaborator must undergo joint training with the ego vehicle on a dataset before inference, or the ego vehicle stores models for all potential collaborators in advance. Therefore, we pose a new question: Can we tackle this challenge directly during inference, eliminating the need for joint training? To answer this, we introduce Progressive Heterogeneous Collaborative Perception (PHCP), a novel framework that formulates the problem as few-shot unsupervised domain adaptation. Unlike previous work, PHCP dynamically aligns features by self-training an adapter during inference, eliminating the need for labeled data and joint training. Extensive experiments on the OPV2V dataset demonstrate that PHCP achieves strong performance across diverse heterogeneous scenarios. Notably, PHCP achieves performance comparable to SOTA methods trained on the entire dataset while using only a small amount of unlabeled data.

Thu 23 Oct. 17:45 - 19:45 PDT

#271
Curve-Aware Gaussian Splatting for 3D Parametric Curve Reconstruction

Zhirui Gao · Renjiao Yi · YaQiao Dai · Xuening Zhu · Wei Chen · Kai Xu · Chenyang Zhu

This paper presents an end-to-end framework for reconstructing 3D parametric curves directly from multi-view edge maps. Contrasting with existing two-stage methods that follow a sequential "edge point cloud reconstruction and parametric curve fitting" pipeline, our one-stage approach optimizes 3D parametric curves directly from 2D edge maps, eliminating error accumulation caused by the inherent optimization gap between disconnected stages. However, parametric curves inherently lack suitability for rendering-based multi-view optimization, necessitating a complementary representation that preserves their geometric properties while enabling differentiable rendering. We propose a novel bi-directional coupling mechanism between parametric curves and edge-oriented Gaussian components. This tight correspondence formulates a curve-aware Gaussian representation, CurveGaussian, that enables differentiable rendering of 3D curves, allowing direct optimization guided by multi-view evidence. Furthermore, we introduce a dynamically adaptive topology optimization framework during training to refine curve structures through linearization, merging, splitting, and pruning operations. Comprehensive evaluations on the ABC dataset and real-world benchmarks demonstrate our one-stage method's superiority over two-stage alternatives, particularly in producing cleaner and more robust reconstructions. Additionally, by directly optimizing parametric curves, our method significantly reduces the parameter count during training, achieving both higher efficiency and superior performance compared to existing approaches.

Thu 23 Oct. 17:45 - 19:45 PDT

#272
TransiT: Transient Transformer for Non-line-of-sight Videography

Ruiqian Li · Siyuan Shen · Suan Xia · Ziheng Wang · Xingyue Peng · Chengxuan Song · Yingsheng Zhu · Tao Wu · Shiying Li · Jingyi Yu

High-quality and high-speed videography using Non-Line-of-Sight (NLOS) imaging benefits autonomous navigation, collision prevention, and post-disaster search and rescue tasks. Current solutions have to balance frame rate against image quality. High frame rates, for example, can be achieved by reducing either per-point scanning time or scanning density, but at the cost of lowering the information density of individual frames. A fast scanning process further reduces the signal-to-noise ratio, and different scanning systems exhibit different distortion characteristics. In this work, we design and employ a new Transient Transformer architecture called TransiT to achieve real-time NLOS recovery under fast scans. TransiT directly compresses the temporal dimension of input transients to extract features, reducing computation costs and meeting high frame rate requirements. It further adopts a feature fusion mechanism and employs a spatial-temporal Transformer to help capture features of NLOS transient videos. Moreover, TransiT applies transfer learning to bridge the gap between synthetic and real-measured data. In real experiments, TransiT manages to reconstruct, from sparse $16 \times 16$ transients measured at an exposure time of 0.4 ms per point, NLOS videos at $64 \times 64$ resolution at 10 frames per second. We will make our code and dataset available to the community.

Thu 23 Oct. 17:45 - 19:45 PDT

#273
Neural Multi-View Self-Calibrated Photometric Stereo without Photometric Stereo Cues

Xu Cao · Takafumi Taketomi

We propose a neural inverse rendering approach to reconstruct 3D shape, spatially varying BRDF, and lighting parameters from multi-view images captured under varying lighting conditions. Unlike conventional multi-view photometric stereo (MVPS) methods, our approach does not rely on geometric, reflectance, or lighting cues derived from single-view photometric stereo. Instead, we jointly optimize all scene properties end-to-end to directly reproduce raw image observations. We represent both geometry and SVBRDF as neural implicit fields and incorporate shadow-aware volume rendering with physics-based shading. Experiments show that our method outperforms MVPS methods guided by high-quality normal maps and enables photorealistic rendering from novel viewpoints under novel lighting conditions. Our method reconstructs intricate surface details for objects with challenging reflectance properties using view-unaligned OLAT images, which conventional MVPS methods cannot handle.

Thu 23 Oct. 17:45 - 19:45 PDT

#274
Unified Multi-Agent Trajectory Modeling with Masked Trajectory Diffusion

songru Yang · Zhenwei Shi · Zhengxia Zou

Understanding movements in multi-agent scenarios is a fundamental problem in intelligent systems. Previous research assumes complete and synchronized observations. However, real-world partial observation caused by occlusions leads to inevitable model failure, which demands a unified framework for coexisting trajectory prediction, imputation, and recovery. Unlike previous attempts that handled observed and unobserved behaviors in a coupled manner, we explore a decoupled denoising diffusion modeling paradigm with a unidirectional information valve to separate out the interference from uncertain behaviors. Building on this, we propose a Unified Masked Trajectory Diffusion model (UniMTD) for arbitrary levels of missing observations. We design unidirectional attention as a valve unit to control the direction of information flow between the observed and masked areas, gradually refining the missing observations toward a real-world distribution. We build it into a unidirectional MoE structure to handle varying proportions of missing observations. A Cached Diffusion model is further designed to improve generation quality while reducing computation and time overhead. Our method achieves substantial improvements on both human motion and vehicle traffic data. UniMTD efficiently achieves a 65% improvement in minADE20 and reaches SOTA with margins of 98%, 50%, 73%, and 29% across four fidelity metrics covering out-of-boundary rate, velocity, and trajectory length. Our code will be released here.
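
The "unidirectional information valve" can be pictured as an attention mask in which masked (unobserved) positions may attend to observed ones, while observed positions never attend to masked ones, so uncertainty does not flow backward. The sketch below is my own minimal construction of such a mask, not the released UniMTD code.

```python
import torch

def valve_attention_mask(is_observed: torch.Tensor) -> torch.Tensor:
    """is_observed: (T,) bool, True where a timestep was actually observed.
    Returns a (T, T) bool mask where True means attention is allowed (query -> key)."""
    obs_q = is_observed[:, None]             # queries
    obs_k = is_observed[None, :]             # keys
    # observed queries only look at observed keys; masked queries may look at everything
    return torch.where(obs_q, obs_k, torch.ones_like(obs_k))

mask = valve_attention_mask(torch.tensor([True, True, False, False, True]))
print(mask.int())
```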

Thu 23 Oct. 17:45 - 19:45 PDT

#275
RayGaussX: Accelerating Gaussian-Based Ray Marching for Real-Time and High-Quality Novel View Synthesis

Hugo Blanc · Jean-Emmanuel Deschaud · Alexis Paljic

RayGauss has recently achieved state-of-the-art results on synthetic and indoor scenes, representing radiance and density fields with irregularly distributed elliptical basis functions rendered via volume ray casting using a Bounding Volume Hierarchy (BVH). However, its computational cost prevents real-time rendering on real-world scenes. Our approach, RayGaussX, builds on RayGauss by introducing key contributions that significantly accelerate both training and inference. Specifically, we incorporate volumetric rendering acceleration strategies such as empty-space skipping and adaptive sampling, enhance ray coherence, and introduce scale regularization to reduce false-positive intersections. Additionally, we propose a new densification criterion that improves density distribution in distant regions, leading to enhanced graphical quality on larger scenes. As a result, RayGaussX achieves 5× to 12× faster training and 50× to 80× higher rendering speeds (FPS) on real-world datasets while improving visual quality by up to +0.56 dB in PSNR. The code will soon be publicly available on GitHub.

Thu 23 Oct. 17:45 - 19:45 PDT

#276
SynCity: Training-Free Generation of 3D Cities

Paul Engstler · Aleksandar Shtedritski · Iro Laina · Christian Rupprecht · Andrea Vedaldi

In this paper, we address the challenge of generating 3D worlds from textual descriptions. We propose SynCity, a training-free and optimization-free approach, which leverages the geometric precision of pre-trained 3D generative models and the artistic versatility of 2D image generators to create large, high-quality 3D spaces. While most current 3D generative models are object-centric and cannot generate large-scale worlds, we show how 3D and 2D generators can be combined to generate ever-expanding scenes. Through a tile-based grid approach, we allow fine-grained control over the layout and the appearance of scenes. The world is generated tile-by-tile, and each new tile is generated within its world-context and then fused with the scene. SynCity generates compelling and immersive scenes that are rich in detail and diversity.

Thu 23 Oct. 17:45 - 19:45 PDT

#277
RadarSplat: Radar Gaussian Splatting for High-Fidelity Data Synthesis and 3D Reconstruction of Autonomous Driving Scenes

Pou-Chun Kung · Skanda Harisha · Ram Vasudevan · Aline Eid · Katherine A. Skinner

High-Fidelity 3D scene reconstruction plays a crucial role in autonomous driving by enabling novel data generation from existing datasets. This allows simulating safety-critical scenarios and augmenting training datasets without incurring further data collection costs. While recent advances in radiance fields have demonstrated promising results in 3D reconstruction and sensor data synthesis using cameras and LiDAR, their potential for radar remains largely unexplored. Radar is crucial for autonomous driving due to its robustness in adverse weather conditions like rain, fog, and snow, where optical sensors often struggle. Although the state-of-the-art radar-based neural representation shows promise for 3D driving scene reconstruction, it performs poorly in scenarios with significant radar noise, including receiver saturation and multipath reflection. Moreover, it is limited to synthesizing preprocessed, noise-excluded radar images, failing to address realistic radar data synthesis. To address these limitations, this paper proposes RadarSplat, which integrates Gaussian Splatting with novel radar noise modeling to enable realistic radar data synthesis and enhanced 3D reconstruction. Compared to the state-of-the-art, RadarSplat achieves superior radar image synthesis (+3.5 PSNR / 2.3x SSIM) and improved geometric reconstruction (-48% RMSE / 2.3x Accuracy), demonstrating its effectiveness in generating high-fidelity radar data and scene reconstruction.

Thu 23 Oct. 17:45 - 19:45 PDT

#278
Tree Skeletonization from 3D Point Clouds by Denoising Diffusion

Elias Marks · Lucas Nunes · Federico Magistri · Matteo Sodano · Rodrigo Marcuzzi · Lars Zimmermann · Jens Behley · Cyrill Stachniss

The natural world presents complex organic structures, such as tree canopies, that humans can interpret even when only partially visible. Understanding tree structures is key for forest monitoring, orchard management, and automated harvesting applications. However, reconstructing tree topologies from sensor data, called tree skeletonization, remains a challenge for computer vision approaches. Traditional methods for tree skeletonization rely on handcrafted features, regression, or generative models, whereas recent advances focus on deep learning approaches. Existing methods often struggle with occlusions caused by dense foliage, limiting their applicability over the annual vegetation cycle. Furthermore, the lack of real-world data with reference information limits the evaluation of these methods to synthetic datasets, which does not validate generalization to real environments. In this paper, we present a novel approach for tree skeletonization that combines a generative denoising diffusion probabilistic model for predicting node positions and branch directions with a classical minimum spanning tree algorithm to infer tree skeletons from 3D point clouds, even with strong occlusions. Additionally, we provide a dataset of an apple orchard with 280 trees scanned 10 times during the growing season with corresponding reference skeletons, enabling quantitative evaluation. Experiments show the superior performance of our approach on real-world data and competitive results compared to state-of-the-art approaches on synthetic benchmarks.

Thu 23 Oct. 17:45 - 19:45 PDT

#279
Point Cloud Self-supervised Learning via 3D to Multi-view Masked Learner

Zhimin Chen · Xuewei Chen · Xiao Guo · Yingwei Li · Longlong Jing · Liang Yang · Bing Li

Recently, multi-modal masked autoencoders (MAE) have been introduced in 3D self-supervised learning, offering enhanced feature learning by leveraging both 2D and 3D data to capture richer cross-modal representations. However, these approaches have two limitations: (1) they inefficiently require both 2D and 3D modalities as inputs, even though the inherent multi-view properties of 3D point clouds already contain the 2D modality; (2) the input 2D modality causes the reconstruction learning to unnecessarily rely on visible 2D information, hindering 3D geometric representation learning. To address these challenges, we propose a 3D to Multi-View Learner (Multi-View ML) that only uses 3D modalities as inputs and effectively captures rich spatial information in 3D point clouds. Specifically, we first project 3D point clouds to multi-view 2D images at the feature level based on 3D poses. Then, we introduce two components: (1) a 3D to multi-view autoencoder that reconstructs point clouds and multi-view images from 3D and projected 2D features; (2) a multi-scale multi-head (MSMH) attention mechanism that facilitates local-global information interactions in each decoder transformer block through attention heads at various scales. Additionally, a novel two-stage self-training strategy is proposed to align 2D and 3D representations. Empirically, our method significantly outperforms state-of-the-art counterparts across various downstream tasks, including 3D classification, part segmentation, and object detection. This performance superiority shows that Multi-View ML enriches the model's comprehension of geometric structures and the inherent multi-modal properties of point clouds.

Thu 23 Oct. 17:45 - 19:45 PDT

#280
Splat-LOAM: Gaussian Splatting LiDAR Odometry and Mapping

Emanuele Giacomini · Luca Di Giammarino · Lorenzo De Rebotti · Giorgio Grisetti · Martin R. Oswald

LiDARs provide accurate geometric measurements, making them valuable for ego-motion estimation and reconstruction tasks. Despite this success, maintaining an accurate and lightweight representation of the environment still poses challenges. Both classic and NeRF-based solutions have to trade off accuracy against memory and processing time. In this work, we build on recent advancements in Gaussian Splatting methods to develop a novel LiDAR odometry and mapping pipeline that relies exclusively on Gaussian primitives for its scene representation. Leveraging spherical projection, we drive the refinement of the primitives solely from LiDAR measurements. Experiments show that our approach matches current registration performance while achieving SOTA results for mapping tasks with minimal GPU requirements. This efficiency makes it a strong candidate for further exploration and potential adoption in real-time robotic estimation tasks.

Thu 23 Oct. 17:45 - 19:45 PDT

#281
Purge-Gate: Efficient Backpropagation-Free Test-Time Adaptation for Point Clouds via Token purging

Moslem Yazdanpanah · Ali Bahri · Mehrdad Noori · Sahar Dastani · Gustavo Vargas Hakim · David OSOWIECHI · Ismail Ayed · Christian Desrosiers

Test-time adaptation (TTA) is crucial for mitigating performance degradation caused by distribution shifts in 3D point cloud classification. In this work, we introduce Token Purging (PG), a novel backpropagation-free approach that removes tokens highly affected by domain shifts before they reach attention layers. Unlike existing TTA methods, PG operates at the token level, ensuring robust adaptation without iterative updates. We propose two variants: PG-SP, which leverages source statistics, and PG-SF, a fully source-free version relying on CLS-token-driven adaptation. Extensive evaluations on ModelNet40-C, ShapeNet-C, and ScanObjectNN-C demonstrate that PG-SP achieves an average of +10.3% higher accuracy than state-of-the-art backpropagation-free methods, while PG-SF sets new benchmarks for source-free adaptation. Moreover, PG is 12.4× faster and 5.5× more memory efficient than our baseline, making it suitable for real-world deployment.
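
A hedged sketch of the token-level idea for the source-statistics variant (PG-SP): score each token by how far it deviates from stored source feature statistics and drop the most deviating tokens before they reach the attention layers. The scoring rule and keep ratio below are illustrative assumptions, not the paper's exact procedure.

```python
import torch

def purge_tokens(tokens, src_mean, src_std, keep_ratio=0.8):
    """tokens: (B, N, D); src_mean/src_std: (D,) statistics collected on source data."""
    z = (tokens - src_mean) / (src_std + 1e-6)        # standardize with source statistics
    score = z.abs().mean(dim=-1)                      # per-token deviation score, (B, N)
    n_keep = max(1, int(keep_ratio * tokens.shape[1]))
    keep_idx = score.argsort(dim=1)[:, :n_keep]       # keep the least-deviating tokens
    batch_idx = torch.arange(tokens.shape[0]).unsqueeze(1)
    return tokens[batch_idx, keep_idx]                # (B, n_keep, D), fed to attention layers

tok = torch.randn(2, 128, 64)
out = purge_tokens(tok, src_mean=torch.zeros(64), src_std=torch.ones(64))
print(out.shape)  # torch.Size([2, 102, 64])
```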

Thu 23 Oct. 17:45 - 19:45 PDT

#282
Highlight
AAA-Gaussians: Anti-Aliased and Artifact-Free 3D Gaussian Rendering

Michael Steiner · Thomas Köhler · Lukas Radl · Felix Windisch · Dieter Schmalstieg · Markus Steinberger

Although 3D Gaussian Splatting (3DGS) has revolutionized 3D reconstruction, it still faces challenges such as aliasing, projection artifacts, and view inconsistencies, primarily due to the simplification of treating splats as 2D entities. We argue that incorporating full 3D evaluation of Gaussians throughout the 3DGS pipeline can effectively address these issues while preserving rasterization efficiency. Specifically, we introduce an adaptive 3D smoothing filter to mitigate aliasing and present a stable view-space bounding method that eliminates popping artifacts when Gaussians extend beyond the view frustum. Furthermore, we promote tile-based culling to 3D with screen-space planes, accelerating rendering and reducing sorting costs for hierarchical rasterization. Our method achieves state-of-the-art quality on in-distribution evaluation sets and significantly outperforms other approaches for out-of-distribution views. Our qualitative evaluations further demonstrate the effective removal of aliasing, distortions, and popping artifacts, ensuring real-time, artifact-free rendering.

Thu 23 Oct. 17:45 - 19:45 PDT

#283
Highlight
SAFT: Shape and Appearance of Fabrics from Template via Differentiable Physical Simulations from Monocular Video

David Stotko · Reinhard Klein

The reconstruction of three-dimensional dynamic scenes is a well-established yet challenging task within the domain of computer vision. In this paper, we propose a novel approach that combines the domains of 3D geometry reconstruction and appearance estimation for physically based rendering and present a system that is able to perform both tasks for fabrics, utilizing only a single monocular RGB video sequence as input. In order to obtain realistic and high-quality deformations and renderings, a physical simulation of the cloth geometry and differentiable rendering are employed. In this paper, we introduce two novel regularization terms for the 3D reconstruction task that improve the plausibility of the reconstruction. In comparison with the most recent methods in the field, we have reduced the error in the 3D reconstruction by a factor of $2.64$ while requiring a medium runtime of $30$ min per scene. Furthermore, the optimized motion achieves sufficient quality to perform an appearance estimation of the deforming object, recovering sharp details from this single monocular RGB video.

Thu 23 Oct. 17:45 - 19:45 PDT

#284
UNIS: A Unified Framework for Achieving Unbiased Neural Implicit Surfaces in Volume Rendering

Junkai Deng · Hanting Niu · Jiaze Li · Fei Hou · Ying He

Reconstruction from multi-view images is a fundamental challenge in computer vision that has been extensively studied over the past decades. Recently, neural radiance fields have driven significant advancements, especially through methods using implicit functions and volume rendering, achieving high levels of accuracy. A core component of these methods is the mapping that transforms an implicit function's output into corresponding volume densities. Despite its critical role, this mapping has received limited attention in existing literature. In this paper, we provide a comprehensive and systematic study of mapping functions, examining their properties and representations. We first outline the necessary conditions for the mapping function and propose a family of functions that meet these criteria, to ensure first-order unbiasedness. We further demonstrate that the mappings employed by NeuS and VolSDF, two representative neural implicit surface techniques, are special cases within this broader family. Building on our theoretical framework, we introduce several new mapping functions and evaluate their effectiveness through numerical experiments. Our approach offers a fresh perspective on this well-established problem, opening avenues for the development of new techniques in the field.
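
For concreteness, the VolSDF mapping cited here as a special case converts a signed distance d into a density via the CDF of a zero-mean Laplace distribution, sigma(x) = alpha * Psi_beta(-d(x)). The numpy snippet below implements that standard formula; the paper's broader family generalizes beyond it, and NeuS uses a logistic-based construction instead.

```python
import numpy as np

def laplace_cdf(s, beta):
    # CDF of a zero-mean Laplace distribution with scale beta
    return np.where(s <= 0, 0.5 * np.exp(s / beta), 1.0 - 0.5 * np.exp(-s / beta))

def volsdf_density(sdf, alpha=100.0, beta=0.01):
    # VolSDF mapping: sigma(x) = alpha * Psi_beta(-d(x)); density rises smoothly as the
    # signed distance crosses zero from outside (d > 0) to inside (d < 0)
    return alpha * laplace_cdf(-np.asarray(sdf), beta)

print(volsdf_density([0.05, 0.0, -0.05]))  # small outside, alpha/2 at the surface, ~alpha inside
```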

Thu 23 Oct. 17:45 - 19:45 PDT

#285
Highlight
BridgeDepth: Bridging Monocular and Stereo Reasoning with Latent Alignment

Tongfan Guan · Jiaxin Guo · Chen Wang · Yun-Hui Liu

Monocular and stereo depth estimation offer complementary strengths: monocular methods capture rich contextual priors but lack geometric precision, while stereo approaches leverage epipolar geometry but struggle with ambiguities such as reflective or textureless surfaces. Despite their synergies, these paradigms remain largely disjoint in practice. We introduce OmniDepth, a unified framework that bridges both through iterative bidirectional alignment of their latent representations. At its core, a novel cross-attentive alignment mechanism dynamically synchronizes monocular contextual cues with disparity hypothesis representations during stereo reasoning. This mutual alignment resolves stereo ambiguities (e.g., specular surfaces) by injecting monocular structure priors while refining monocular depth with stereo geometry. Extensive experiments demonstrate state-of-the-art results: OmniDepth reduces zero-shot generalization error by more than 40% on Middlebury and ETH3D compared to leading stereo methods, while addressing longstanding failure cases on transparent and reflective surfaces. By harmonizing multi-view geometry with monocular context, OmniDepth advances robust 3D perception that transcends modality-specific limitations. Code and models will be released.

Among structured-light methods, the phase-shifting approach enables high-resolution and high-accuracy measurements using a minimum of three patterns. However, its performance is significantly affected when dynamic and complex-shaped objects are measured, as motion artifacts and phase inconsistencies can degrade accuracy. In this study, we propose an enhanced phase-shifting method that incorporates neural inverse rendering to enable the 3D measurement of moving objects. To effectively capture object motion, we introduce a displacement field into the rendering model, which accurately represents positional changes and mitigates motion-induced distortions. Additionally, to achieve high-precision reconstruction with fewer phase-shifting patterns, we design a multi-view rendering framework that utilizes multiple cameras in conjunction with a single projector. Comparisons with state-of-the-art methods and various ablation studies demonstrate that our method accurately reconstructs the shapes of moving objects, even with a small number of patterns, using only simple, well-known phase-shifting patterns.

Thu 23 Oct. 17:45 - 19:45 PDT

#287
Towards Foundational Models for Single-Chip Radar

Tianshu Huang · Akarsh Prabhakara · Chuhan Chen · Jay Karhade · Deva Ramanan · Matthew O'Toole · Anthony Rowe

mmWave radars are compact, inexpensive, and durable sensors that are robust to occlusions and work regardless of environmental conditions, such as weather and darkness. However, this comes at the cost of poor angular resolution, especially for inexpensive single-chip radars, which are typically used in automotive and indoor sensing applications. Although many have proposed learning-based methods to mitigate this weakness, no standardized foundational models or large datasets for the mmWave radar have emerged, and practitioners have largely trained task-specific models from scratch using relatively small datasets. In this paper, we collect (to our knowledge) the largest available raw radar dataset with 1M samples (29 hours) and train a foundational model for 4D single-chip radar, which can predict 3D occupancy and semantic segmentation with quality that is typically only possible with much higher resolution sensors. We demonstrate that our Generalizable Radar Transformer (GRT) generalizes across diverse settings, can be fine-tuned for different tasks, and shows logarithmic data scaling of 20% per $10\times$ data. We also run extensive ablations on common design decisions, and find that using raw radar data significantly outperforms widely-used lossy representations, equivalent to a $10\times$ increase in training data. Finally, we estimate a total data requirement of $\approx$100M samples (3000 hours) to fully exploit the potential of GRT.

Thu 23 Oct. 17:45 - 19:45 PDT

#288
Diffusion Image Prior

Hamadi Chihaoui · Paolo Favaro

Zero-shot image restoration (IR) methods based on pretrained diffusion models have recently achieved significant success. These methods typically require at least a parametric form of the degradation model. However, in real-world scenarios, the degradation may be too complex to define explicitly. To handle this general case, we introduce the Diffusion Image Prior (DIIP). We take inspiration from the Deep Image Prior (DIP), since it can be used to remove artifacts without the need for an explicit degradation model. However, in contrast to DIP, we find that pretrained diffusion models offer a much stronger prior, despite being trained without any exposure to corrupted data. We show that the optimization process in DIIP first reconstructs a clean version of the image before eventually overfitting to the degraded input, but it does so for a broader range of degradations than DIP. In light of this result, we propose a blind image restoration (IR) method based on early stopping, which does not require prior knowledge of the degradation model. We validate DIIP on various degradation-blind IR tasks, including JPEG artifact removal, deblurring, denoising and super-resolution, with state-of-the-art results.
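
The early-stopping principle behind the blind restoration can be sketched with a DIP-style loop: fit the degraded observation and keep an intermediate reconstruction from before the fit starts reproducing the corruption. The tiny convolutional network and fixed iteration budget below are placeholders only; DIIP itself builds on a pretrained diffusion model rather than a randomly initialized network.

```python
import torch

y = torch.rand(1, 3, 64, 64)                        # degraded observation (placeholder)
net = torch.nn.Sequential(                          # tiny stand-in prior, not a diffusion model
    torch.nn.Conv2d(3, 32, 3, padding=1), torch.nn.ReLU(),
    torch.nn.Conv2d(32, 3, 3, padding=1))
z = torch.randn(1, 3, 64, 64)                       # fixed random input code
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

early_stop_iter = 300                               # heuristic budget chosen in advance
for it in range(early_stop_iter):                   # stop before the fit reproduces the corruption
    opt.zero_grad()
    loss = torch.nn.functional.mse_loss(net(z), y)
    loss.backward()
    opt.step()

restored = net(z).detach()                          # early-stopped (intermediate) reconstruction
print(restored.shape)
```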

Thu 23 Oct. 17:45 - 19:45 PDT

#289
Highlight
FlowR: Flowing from Sparse to Dense 3D Reconstructions

Tobias Fischer · Samuel Rota Bulò · Yung-Hsu Yang · Nikhil Keetha · Lorenzo Porzi · Norman Müller · Katja Schwarz · Jonathon Luiten · Marc Pollefeys · Peter Kontschieder

3D Gaussian splatting enables high-quality novel view synthesis (NVS) at real-time frame rates. However, its quality drops sharply as we depart from the training views. Thus, very dense captures involving many images are needed to match the high-quality expectations of some applications, e.g. Virtual Reality (VR). However, dense captures are very laborious and expensive to obtain. Existing works have explored using 2D generative models to alleviate this requirement by distillation or generating additional training views. These methods are often conditioned only on a handful of reference input views and thus do not fully exploit the available 3D information, leading to inconsistent generation results and reconstruction artifacts. To tackle this problem, we propose a multi-view, flow-matching model that learns a flow to connect novel views generated from possibly-sparse reconstructions to renderings that we expect from dense reconstructions. This enables augmenting scene captures with generated novel views to improve the overall reconstruction quality. Our model is trained on a novel dataset of 3.6M image pairs and can process up to 45 views at 540x960 resolution (91K tokens) on one H100 GPU in a single forward pass. Our pipeline consistently improves NVS in few-view and many-view scenarios, leading to higher-quality reconstructions than prior works across multiple, widely-used NVS benchmarks.

Thu 23 Oct. 17:45 - 19:45 PDT

#290
WorldScore: Unified Evaluation Benchmark for World Generation

Haoyi Duan · Hong-Xing Yu · Sirui Chen · Li Fei-Fei · Jiajun Wu

We introduce the WorldScore benchmark, the first unified benchmark for world generation. We decompose world generation into a sequence of next-scene generation tasks with explicit camera trajectory-based layout specifications, enabling unified evaluation of diverse approaches, from 3D and 4D scene generation to video generation models. The WorldScore benchmark encompasses a curated dataset of 3,000 test examples that span diverse worlds: indoor and outdoor, static and dynamic, photorealistic and stylized. The WorldScore metric evaluates generated worlds through three key aspects: controllability, quality, and dynamics. Through extensive evaluation of 19 representative models, including both open-source and closed-source implementations, we reveal key insights and challenges for each category of models. We will open-source WorldScore, including evaluation metrics, datasets, and generated videos.

Thu 23 Oct. 17:45 - 19:45 PDT

#291
Perspective-Invariant 3D Object Detection

Alan Liang · Lingdong Kong · Dongyue Lu · Youquan Liu · Jian Fang · Huaici Zhao · Wei Tsang Ooi

With the rise of robotics, LiDAR-based 3D object detection has garnered significant attention in both academia and industry. However, existing datasets and methods predominantly focus on vehicle-mounted platforms, leaving other autonomous platforms underexplored. To bridge this gap, we introduce Pi3DET, the first benchmark featuring LiDAR data and 3D bounding box annotations collected from multiple platforms: vehicle, quadruped, and drone, thereby facilitating research in 3D object detection for non-vehicle platforms as well as cross-platform 3D detection. Based on Pi3DET, we propose a novel cross-platform adaptation framework that transfers knowledge from the well-studied vehicle platform to other platforms. This framework achieves perspective-invariant 3D detection through robust alignment at both geometric and feature levels. Additionally, we establish a benchmark to evaluate the resilience and robustness of current 3D detectors in cross-platform scenarios, providing valuable insights for developing adaptive 3D perception systems. Extensive experiments validate the effectiveness of our approach on challenging cross-platform tasks, demonstrating substantial gains over existing adaptation methods. We hope this work paves the way for generalizable and unified 3D perception systems across diverse and complex environments. Our Pi3DET dataset, cross-platform benchmark suite, and annotation toolkit will be made publicly available.

Thu 23 Oct. 17:45 - 19:45 PDT

#292
CA-I2P: Channel-Adaptive Registration Network with Global Optimal Selection

Zhixin Cheng · Jiacheng Deng · Xinjun Li · Xiaotian Yin · Bohao Liao · Baoqun Yin · Wenfei Yang · Tianzhu Zhang

Detection-free methods typically follow a coarse-to-fine pipeline, extracting image and point cloud features for patch-level matching and refining dense pixel-to-point correspondences. However, differences in feature channel attention between images and point clouds may lead to degraded matching results, ultimately impairing registration accuracy. Furthermore, similar structures in the scene could lead to redundant correspondences in cross-modal matching. To address these issues, we propose the Channel Adaptive Adjustment Module (CAA) and Global Optimal Selection Module (GOS). CAA enhances intra-modal features and suppresses cross-modal sensitivity, while GOS replaces local selection with global optimization. Experiments on RGB-D Scenes V2 and 7-Scenes demonstrate the superiority of our method, achieving state-of-the-art performance in image-to-point cloud registration.

Thu 23 Oct. 17:45 - 19:45 PDT

#293
LightSwitch: Multi-view Relighting with Material-guided Diffusion

Yehonathan Litman · Fernando De la Torre · Shubham Tulsiani

Recent approaches for 3D relighting have shown promise in integrating 2D image relighting generative priors to alter the appearance of a 3D representation while preserving the underlying structure. Nevertheless, generative priors used for 2D relighting that relight directly from an input image either fail to exploit intrinsic properties of the subject that can be inferred or cannot consider multi-view data at scale, leading to subpar relighting. In this paper, we propose LightSwitch, a novel finetuned material-relighting diffusion framework that efficiently relights an arbitrary number of input images to a target lighting condition while incorporating cues from inferred intrinsic properties. By using multi-view and material information cues together with a scalable denoising scheme, our method consistently and efficiently relights dense multi-view data of objects with diverse material compositions. We show that our 2D relighting prediction quality exceeds previous state-of-the-art relighting priors that directly relight from images. We further demonstrate that LightSwitch matches or outperforms state-of-the-art diffusion inverse rendering methods in relighting synthetic and real objects in as little as 2 minutes. We will publicly release our model and code.

Thu 23 Oct. 17:45 - 19:45 PDT

#294
Decoupled Diffusion Sparks Adaptive Scene Generation

Yunsong Zhou · Naisheng Ye · William Ljungbergh · Tianyu Li · Jiazhi Yang · Zetong Yang · Hongzi Zhu · Christoffer Petersson · Hongyang Li

Controllable scene generation could substantially reduce the cost of collecting diverse data for autonomous driving. Prior works formulate traffic layout generation as a predictive process, either by denoising entire sequences at once or by iteratively predicting the next frame. However, full-sequence denoising hinders online reaction, while the latter's short-sighted next-frame prediction lacks precise goal-state guidance. Further, the learned model struggles to generate complex or challenging scenarios because open datasets are dominated by safe and ordinary driving behaviors. To overcome these issues, we introduce Nexus, a decoupled scene generation framework that improves reactivity and goal conditioning by simulating both ordinary and challenging scenarios from fine-grained tokens with independent noise states. At the core of the decoupled pipeline is the integration of a partial noise-masking training strategy and a noise-aware schedule that ensures timely environmental updates throughout the denoising process. To complement challenging scenario generation, we collect a dataset of complex corner cases. It covers 540 hours of simulated data, including high-risk interactions such as cut-ins, sudden braking, and collisions. Nexus achieves superior generation realism while preserving reactivity and goal orientation, with a 40% reduction in displacement error. We further demonstrate that Nexus improves closed-loop planning by 20% through data augmentation and showcase its capability in safety-critical data generation.

Thu 23 Oct. 17:45 - 19:45 PDT

#295
Recover Biological Structure from Sparse-View Diffraction Images with Neural Volumetric Prior

Renzhi He · Haowen Zhou · Yubei Chen · Yi Xue

Volumetric reconstruction of label-free living cells from non-destructive optical microscopic images reveals cellular metabolism in native environments. However, current optical tomography techniques require hundreds of 2D images to reconstruct a 3D volume, hindering them from intravital imaging of biological samples undergoing rapid dynamics. This poses a challenge of reconstructing the entire volume of semi-transparent biological samples from sparse views due to the restricted viewing angles of microscopes and the limited number of measurements. In this work, we develop Neural Volumetric Prior (NVP) for high-fidelity volumetric reconstruction of semi-transparent biological samples from sparse-view microscopic images. NVP integrates explicit and implicit neural representations and incorporates the physical prior of diffractive optics. We validate NVP on both simulated data and experimentally captured microscopic images. Compared to previous methods, NVP significantly reduces the required number of images by nearly 50-fold and processing time by 3-fold while maintaining state-of-the-art performance. NVP is the first technique to enable volumetric reconstruction of label-free biological samples from sparse-view microscopic images, paving the way for real-time 3D imaging of dynamically changing biological samples.

Thu 23 Oct. 17:45 - 19:45 PDT

#296
ReAL-AD: Towards Human-Like Reasoning in End-to-End Autonomous Driving

Yuhang Lu · Jiadong Tu · Yuexin Ma · Xinge Zhu

End-to-end autonomous driving has emerged as a promising approach to unify perception, prediction, and planning within a single framework, reducing information loss and improving adaptability. However, existing methods often rely on fixed and sparse trajectory supervision, limiting their ability to capture the hierarchical reasoning process that human drivers naturally employ. To bridge this gap, we propose ReAL-AD, a Reasoning-Augmented Learning framework that structures decision-making in autonomous driving based on the three-tier human cognitive model: Driving Strategy, Driving Decision, and Driving Operation, where Vision-Language Models (VLMs) are incorporated to enhance situational awareness and structured reasoning across these levels. Specifically, we introduce: (1) the Strategic Reasoning Injector, which formulates high-level driving strategies by interpreting complex traffic contexts from VLM-generated insights; (2) the Tactical Reasoning Integrator, which refines strategic intent into interpretable tactical choices such as lane changes, overtaking, and speed adjustments; and (3) the Hierarchical Trajectory Decoder, which progressively translates tactical decisions into precise control actions for smooth and human-like trajectory execution. Extensive evaluations show that integrating our framework improves planning accuracy and safety by over 30%, making end-to-end autonomous driving more interpretable and aligned with human-like hierarchical reasoning.

Thu 23 Oct. 17:45 - 19:45 PDT

#297
SpatialCrafter: Unleashing the Imagination of Video Diffusion Models for Scene Reconstruction from Limited Observations

Songchun Zhang · Huiyao Xu · Sitong Guo · Zhongwei Xie · Hujun Bao · Weiwei Xu · Changqing Zou

Novel view synthesis (NVS) boosts immersive experiences in computer vision and graphics. Existing techniques, despite their progress, rely on dense multi-view observations, restricting their application. This work takes on the challenge of reconstructing photorealistic 3D scenes from sparse or single-view inputs. We introduce SpatialCrafter, a framework that leverages the rich knowledge in video diffusion models to generate plausible additional observations, thereby alleviating reconstruction ambiguity. Through a trainable camera encoder and an epipolar attention mechanism for explicit geometric constraints, we achieve precise camera control and 3D consistency, further reinforced by a unified scale estimation strategy to handle scale discrepancies across datasets. Furthermore, by integrating monocular depth priors with semantic features in the video latent space, our framework directly regresses 3D Gaussian primitives and efficiently processes long-sequence features using a hybrid network structure. Extensive experiments show our method enhances sparse view reconstruction and restores the realistic appearance of 3D scenes.

Thu 23 Oct. 17:45 - 19:45 PDT

#298
MonoMVSNet: Monocular Priors Guided Multi-View Stereo Network

Jianfei Jiang · Qiankun Liu · Haochen Yu · Hongyuan Liu · Liyong Wang · Jiansheng Chen · Huimin Ma

Learning-based Multi-View Stereo (MVS) methods aim to predict depth maps for a sequence of calibrated images to recover dense point clouds. However, existing MVS methods often struggle with challenging regions, such as textureless regions and reflective surfaces, where feature matching fails. In contrast, monocular depth estimation inherently does not require feature matching, allowing it to achieve robust relative depth estimation in these regions. To bridge this gap, we propose MonoMVSNet, a novel monocular feature and depth guided MVS network that integrates powerful priors from a monocular foundation model into multi-view geometry. Firstly, the monocular feature of the reference view is integrated into source view features by the attention mechanism with a newly designed cross-view position embedding. Then, the monocular depth of the reference view is aligned to dynamically update the depth candidates for edge regions during the sampling procedure. Finally, a relative consistency loss is further designed based on the monocular depth to supervise the depth prediction. Extensive experiments demonstrate that MonoMVSNet achieves state-of-the-art performance on the DTU and Tanks-and-Temples datasets, ranking first on the Tanks-and-Temples Intermediate and Advanced benchmarks among all published methods.

Thu 23 Oct. 17:45 - 19:45 PDT

#299
HERMES: A Unified Self-Driving World Model for Simultaneous 3D Scene Understanding and Generation

Xin Zhou · DINGKANG LIANG · Sifan Tu · Xiwu Chen · Yikang Ding · Dingyuan Zhang · Feiyang Tan · Hengshuang Zhao · Xiang Bai

Driving World Models (DWMs) have become essential for autonomous driving by enabling future scene prediction. However, existing DWMs are limited to scene generation and fail to incorporate scene understanding, which involves interpreting and reasoning about the driving environment. In this paper, we present a unified Driving World Model named HERMES. We seamlessly integrate 3D scene understanding and future scene evolution (generation) through a unified framework in driving scenarios. Specifically, HERMES leverages a Bird's-Eye View (BEV) representation to consolidate multi-view spatial information while preserving geometric relationships and interactions. We also introduce world queries, which incorporate world knowledge into BEV features via causal attention in the Large Language Model, enabling contextual enrichment for understanding and generation tasks. We conduct comprehensive studies on nuScenes and OmniDrive-nuScenes datasets to validate the effectiveness of our method. HERMES achieves state-of-the-art performance, reducing generation error by 32.4% and improving understanding metrics such as CIDEr by 8.0%. The model and code will be made available.

Thu 23 Oct. 17:45 - 19:45 PDT

#300
Highlight
MEGA: Memory-Efficient 4D Gaussian Splatting for Dynamic Scenes

XINJIE ZHANG · Zhening Liu · Yifan Zhang · Xingtong Ge · Dailan He · Tongda Xu · Yan Wang · Zehong Lin · Shuicheng YAN · Jun Zhang

4D Gaussian Splatting (4DGS) has recently emerged as a promising technique for capturing complex dynamic 3D scenes with high fidelity. It utilizes a 4D Gaussian representation and a GPU-friendly rasterizer, enabling rapid rendering speeds. Despite its advantages, 4DGS faces significant challenges, notably the requirement of millions of 4D Gaussians, each with extensive associated attributes, leading to substantial memory and storage cost. This paper introduces a memory-efficient framework for 4DGS. We streamline the color attribute by decomposing it into a per-Gaussian direct color component with only 3 parameters and a shared lightweight alternating current color predictor. This approach eliminates the need for spherical harmonics coefficients, which typically involve up to 144 parameters in classic 4DGS, thereby creating a memory-efficient 4D Gaussian representation. Furthermore, we introduce an entropy-constrained Gaussian deformation technique that uses a deformation field to expand the action range of each Gaussian and integrates an opacity-based entropy loss to limit the number of Gaussians, thus forcing our model to use as few Gaussians as possible to fit a dynamic scene well. With simple half-precision storage and zip compression, our framework achieves a storage reduction by approximately 190$\times$ and 125$\times$ on the Technicolor and Neural 3D Video datasets, respectively, compared to the original 4DGS. Meanwhile, it maintains comparable rendering speeds and scene representation quality, setting a new standard in the field.
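
The memory saving comes from replacing per-Gaussian spherical-harmonics coefficients with a 3-parameter direct ("DC") color plus one small predictor shared by all Gaussians for the view/time-dependent ("AC") residual. The PyTorch sketch below illustrates that decomposition; the conditioning inputs (Gaussian position, view direction, timestamp), hidden size, and module names are my own assumptions rather than the paper's architecture.

```python
import torch

class SharedACPredictor(torch.nn.Module):
    """One lightweight MLP shared by all Gaussians for the view/time-dependent residual."""
    def __init__(self, hidden=32):
        super().__init__()
        self.mlp = torch.nn.Sequential(
            torch.nn.Linear(3 + 3 + 1, hidden), torch.nn.ReLU(),  # position, view dir, time
            torch.nn.Linear(hidden, 3))

    def forward(self, pos, view_dir, t):
        return self.mlp(torch.cat([pos, view_dir, t], dim=-1))

N = 10000
positions = torch.randn(N, 3)                        # Gaussian centers (stored anyway)
dc_color = torch.nn.Parameter(torch.rand(N, 3))      # the only per-Gaussian color parameters
ac = SharedACPredictor()

view_dir = torch.nn.functional.normalize(torch.randn(N, 3), dim=-1)
t = torch.full((N, 1), 0.25)                         # normalized timestamp
color = torch.clamp(dc_color + ac(positions, view_dir, t), 0.0, 1.0)
print(color.shape)                                   # (N, 3) colors from 3 params/Gaussian + one shared net
```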

Thu 23 Oct. 17:45 - 19:45 PDT

#301
Generative Active Learning for Long-tail Trajectory Prediction via Controllable Diffusion Model

Daehee Park · Monu Surana · Pranav Desai · Ashish Mehta · Reuben John · Kuk-Jin Yoon

Predicting future trajectories of dynamic traffic agents is crucial in autonomous systems. While data-driven methods enable large-scale training, they often underperform on rarely observed tail samples, yielding a long-tail problem. Prior works have tackled this by modifying model architectures, such as using a hypernetwork. In contrast, we propose refining the training procedure to unlock each model’s potential without altering its structure. To this end, we introduce the Generative Active Learning for Trajectory prediction (GALTraj), which iteratively identifies tail samples and augments them via a controllable generative diffusion model. By incorporating the augmented samples in each iteration, we directly mitigate dataset imbalance. To ensure effective augmentation, we design a new tail-aware generation method that categorizes agents (tail, head, relevant) and applies tailored guidance of the diffusion model. It enables producing diverse and realistic trajectories that preserve tail characteristics while respecting traffic constraints. Unlike prior traffic simulation methods focused on producing diverse scenarios, ours is the first to show how simulator-driven augmentation can benefit long-tail learning for trajectory prediction. Experiments on multiple trajectory datasets (WOMD, Argoverse2) with popular backbones (QCNet, MTR) confirm that our method significantly boosts performance on tail samples and also enhances accuracy on head samples.

Thu 23 Oct. 17:45 - 19:45 PDT

#302
QuickSplat: Fast 3D Surface Reconstruction via Learned Gaussian Initialization

Yueh-Cheng Liu · Lukas Höllein · Matthias Nießner · Angela Dai

Surface reconstruction is fundamental to computer vision and graphics, enabling applications in 3D modeling, mixed reality, robotics, and more. Existing approaches based on volumetric rendering obtain promising results, but optimize on a per-scene basis, resulting in a slow optimization that can struggle to model under-observed or textureless regions. We introduce QuickSplat, which learns data-driven priors to generate dense initializations for 2D Gaussian splatting optimization of large-scale indoor scenes. This provides a strong starting point for the reconstruction, which accelerates the convergence of the optimization and improves the geometry of flat wall structures. We further learn to jointly estimate the densification and update of the scene parameters during each iteration; our proposed densifier network predicts new Gaussians based on the rendering gradients of existing ones, removing the need for densification heuristics. Extensive experiments on large-scale indoor scene reconstruction demonstrate the superiority of our data-driven optimization. Concretely, we accelerate runtime by 8x, while decreasing depth errors by 48% in comparison to state-of-the-art methods.

Thu 23 Oct. 17:45 - 19:45 PDT

#303
EYE3: Turn Anything into Naked-eye 3D

Yingde Song · Zongyuan Yang · Baolin Liu · yongping xiong · Sai Chen · Lan Yi · Zhaohe Zhang · Xunbo Yu

Light Field Displays (LFDs), despite significant advances in hardware technology supporting larger fields of view and multiple viewpoints, still face a critical challenge of limited content availability. Producing autostereoscopic 3D content on these displays requires refracting multi-perspective images into different spatial angles, with strict demands for spatial consistency across views, which is technically challenging for non-experts. Existing image/video generation models and radiance field-based methods cannot directly generate display content that meets the strict requirements of light field display hardware from a single 2D resource. We introduce EYE$^{3}$, the first generative framework specifically designed for 3D light field displays, capable of converting any 2D images, videos, or texts into high-quality display content tailored for these screens. The framework employs a point-based representation rendered through off-axis perspective, ensuring precise light refraction and alignment with the hardware's optical requirements. To maintain consistent 3D coherence across multiple viewpoints, we finetune a video diffusion model to fill occluded regions based on the rendered masks. Experimental results demonstrate that our approach outperforms state-of-the-art methods, significantly simplifying content creation for LFDs. With broad potential in industries such as entertainment, advertising, and immersive display technologies, our method offers a robust solution to content scarcity and greatly enhances the visual experience on LFDs.

Thu 23 Oct. 17:45 - 19:45 PDT

#304
NATRA: Noise-Agnostic Framework for Trajectory Prediction with Noisy Observations

Rongqing Li · Changsheng Li · Ruilin Lv · Yuhang Li · Yang Gao · Xiaolu Zhang · JUN ZHOU

Trajectory prediction aims to forecast an agent's future trajectories based on its historical observed trajectories, which is a critical task for various applications such as autonomous driving, robotics, and surveillance systems. Most existing trajectory prediction methods assume that the observed trajectories collected for forecasting are clean. However, in real-world scenarios, noise is inevitably introduced into the observations, causing existing approaches to break down. Therefore, it is essential to perform robust trajectory prediction based on noisy observations, which is a more practical scenario. In this paper, we propose NATRA, a Noise-Agnostic framework capable of tackling the problem of TRAjectory prediction with arbitrary types of noisy observations. Specifically, we put forward a mutual information-based mechanism to denoise the original noisy observations. It optimizes the produced trajectories to exhibit a pattern that closely resembles the clean trajectory pattern while deviating from the noisy one. Since optimizing mutual information alone may destroy the trajectory structure, we introduce an additional reconstruction loss to preserve the structural information of the produced observed trajectories. Moreover, we propose a ranking loss to further enhance performance. Because NATRA does not rely on any module tailored to particular noise distributions, it can handle arbitrary types of noise in principle. Additionally, our proposed NATRA can be easily integrated into existing trajectory prediction models. Extensive experiments on both synthetic and real-world noisy datasets demonstrate the effectiveness of our method.

Thu 23 Oct. 17:45 - 19:45 PDT

#305
SP2T: Sparse Proxy Attention for Dual-stream Point Transformer

Jiaxu Wan · Hong Zhang · Ziqi He · Yangyan Deng · Qishu Wang · Ding Yuan · Yifan Yang

Point transformers have demonstrated remarkable progress in 3D understanding through expanded receptive fields (RF), but further expanding the RF leads to diluted group attention and reduced detailed feature extraction capability. Proxies, which serve as abstract representations that simplify feature maps, enable a global RF. However, existing proxy-based approaches face critical limitations: global proxies incur quadratic complexity for large-scale point clouds and suffer from positional ambiguity, while local proxy alternatives struggle with 1) unreliable sampling from geometrically diverse point clouds, 2) inefficient proxy interaction computation, and 3) imbalanced local-global information fusion. To address these challenges, we propose the Sparse Proxy Point Transformer (SP$^{2}$T) -- a local proxy-based dual-stream point transformer with three key innovations. First, for reliable sampling, spatial-wise proxy sampling with vertex-based associations enables robust sampling on geometrically diverse point clouds. Second, for efficient proxy interaction, sparse proxy attention with a table-based relative bias effectively achieves the interaction with efficient map-reduce computation. Third, for local-global information fusion, our dual-stream architecture maintains local-global balance through parallel branches. Comprehensive experiments reveal that SP$^{2}$T sets state-of-the-art results with acceptable latency on indoor and outdoor 3D comprehension benchmarks, demonstrating marked improvements (+3.8% mIoU vs. SPoTr on S3DIS, +22.9% mIoU vs. PointASNL on SemanticKITTI) compared to other proxy-based point cloud methods.

Thu 23 Oct. 17:45 - 19:45 PDT

#306
Instant GaussianImage: A Generalizable and Self-Adaptive Image Representation via 2D Gaussian Splatting

Zhaojie Zeng · Yuesong Wang · Chao Yang · Tao Guan · Lili Ju

Implicit Neural Representation (INR) has demonstrated remarkable advances in the field of image representation but demands substantial GPU resources. GaussianImage recently pioneered the use of Gaussian Splatting to mitigate this cost; however, the slow training process limits its practicality, and the fixed number of Gaussians per image limits its adaptability to varying information entropy. To address these issues, we propose in this paper a generalizable and self-adaptive image representation framework based on 2D Gaussian Splatting. Our method employs a network to quickly generate a coarse Gaussian representation, followed by minimal fine-tuning steps, achieving rendering quality comparable to GaussianImage while significantly reducing training time. Moreover, our approach dynamically adjusts the number of Gaussian points based on image complexity to further enhance flexibility and efficiency in practice. Experiments on the DIV2K and Kodak datasets show that our method matches or exceeds GaussianImage’s rendering performance with far fewer iterations and shorter training times. Specifically, our method reduces the training time by up to one order of magnitude while achieving superior rendering performance with the same number of Gaussians.
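A minimal sketch of one way a per-image Gaussian budget could be adapted to image complexity, assuming a simple histogram-entropy heuristic; the paper's actual adaptation rule is not specified here, and the budget bounds are illustrative.

```python
import numpy as np

def gaussian_budget(img_gray, n_min=2000, n_max=50000):
    """Hypothetical heuristic: scale the number of 2D Gaussians with the
    Shannon entropy of the grayscale histogram, so images with more
    information content receive more primitives."""
    hist, _ = np.histogram(img_gray, bins=256, range=(0, 256), density=True)
    hist = hist[hist > 0]
    entropy = -np.sum(hist * np.log2(hist))      # in [0, 8] bits
    frac = entropy / 8.0
    return int(n_min + frac * (n_max - n_min))

img = (np.random.rand(256, 256) * 255).astype(np.uint8)
print(gaussian_budget(img))
```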

Thu 23 Oct. 17:45 - 19:45 PDT

#307
CF3: Compact and Fast 3D Feature Fields

Hyunjoon Lee · Joonkyu Min · Jaesik Park

3D Gaussian Splatting (3DGS) has begun incorporating rich information from 2D foundation models. However, most approaches rely on a bottom-up optimization process that treats raw 2D features as ground truth, incurring increased computational costs and an excessive number of Gaussians. We propose a top-down pipeline for constructing compact and fast 3D feature fields, namely CF3. We first perform a weighted fusion of multi-view features with a pre-trained 3DGS. The aggregated feature captures spatial cues by integrating information across views, mitigating the ambiguity in 2D features. This top-down design enables a per-Gaussian autoencoder strategy to compress high-dimensional features into a 3D latent space, balancing feature expressiveness and memory efficiency. Finally, we introduce an adaptive sparsification method that merges Gaussians to reduce complexity, ensuring efficient representation without unnecessary detail. Our approach produces a competitive 3D feature field using only about 10\% of the Gaussians compared to existing feature-embedded 3DGS methods.
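A minimal sketch of the per-Gaussian autoencoder idea, assuming a small MLP encoder/decoder and the 3-dimensional latent mentioned in the abstract; layer sizes and the 512-D input are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

# Assumed sketch: compress a high-dimensional fused 2D foundation-model
# feature into a small latent stored per Gaussian, and decode it back for
# reconstruction supervision.

class PerGaussianAE(nn.Module):
    def __init__(self, feat_dim=512, latent_dim=3):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(),
                                 nn.Linear(64, latent_dim))
        self.dec = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(),
                                 nn.Linear(64, feat_dim))

    def forward(self, f):                  # f: (num_gaussians, feat_dim)
        z = self.enc(f)                    # compact latent kept per Gaussian
        return self.dec(z), z

ae = PerGaussianAE()
fused = torch.randn(1024, 512)             # fused multi-view features
recon, latent = ae(fused)
loss = nn.functional.mse_loss(recon, fused)
loss.backward()
```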

Thu 23 Oct. 17:45 - 19:45 PDT

#308
When Anchors Meet Cold Diffusion: A Multi-Stage Approach to Lane Detection

Bo-Lun Huang · Tzu-Hsiang Ni · Feng-Kai Huang · Hong-Han Shuai · Wen-Huang Cheng

Accurate and stable lane detection is crucial for the reliability of autonomous driving systems. A core challenge lies in predicting lane positions in complex scenarios, such as curved roads or when markings are ambiguous or absent. Conventional approaches leverage deep learning techniques to extract both high-level and low-level visual features, aiming to achieve a comprehensive understanding of the driving environment. However, these methods often rely on predefined anchors within a single-pass model, limiting their adaptability. The one-shot prediction paradigm struggles with precise lane estimation in challenging scenarios, such as curved roads or adverse conditions like low visibility at night. To address these limitations, we propose a novel cold diffusion-based framework that initializes lane predictions with predefined anchors and iteratively refines them. This approach retains the flexibility and progressive refinement capabilities of diffusion models while overcoming the constraints of traditional hot diffusion techniques. To further enhance the model’s coarse-to-fine refinement capabilities, we introduce a multi-resolution image processing strategy, where images are analyzed at different timesteps to capture both global and local lane structure details. In addition, we incorporate a learnable noise variance schedule, enabling the model to dynamically adjust its learning process based on multi-resolution inputs. Experimental results demonstrate that our method significantly improves detection accuracy across a variety of challenging scenarios, outperforming state-of-the-art lane detection methods.

Thu 23 Oct. 17:45 - 19:45 PDT

#309
2D Gaussian Splatting-based Sparse-view Transparent Object Depth Reconstruction via Physics Simulation for Scene Update

Jeongyun Kim · Seunghoon Jeong · Giseop Kim · Myung-Hwan Jeon · Eunji Jun · Ayoung Kim

Understanding the 3D geometry of transparent objects from RGB images is challenging due to their inherent physical properties, such as reflection and refraction. To address these difficulties, especially in scenarios with sparse views and dynamic environments, we introduce a novel 2D Gaussian Splatting-based depth reconstruction method for transparent objects. Our key insight lies in separating transparent objects from the background, enabling focused optimization of Gaussians corresponding to the object. We mitigate artifacts with an object‐aware loss that places Gaussians in obscured regions, ensuring coverage of invisible surfaces while reducing overfitting. Furthermore, we incorporate a physics-based simulation that refines the reconstruction in just a few seconds, effectively handling object removal and chain‐reaction movement of remaining objects without the need for rescanning. Our model was evaluated on both synthetic and real-world sequences, and it consistently demonstrated robust improvements over existing GS-based state-of-the-art methods. In comparison with baselines, our model reduces the mean absolute error by over 45\% for the synthetic TRansPose sequences. Furthermore, despite being updated using only one image, our model reaches a $\delta < 2.5$ cm accuracy of 48.46\%, nearly double that of baselines, which use six images.

Thu 23 Oct. 17:45 - 19:45 PDT

#310
Learning Neural Scene Representation from iToF Imaging

Wenjie Chang · Hanzhi Chang · Yueyi Zhang · Wenfei Yang · Tianzhu Zhang

Indirect Time-of-Flight (iToF) cameras are popular for 3D perception because they are cost-effective and easy to deploy. They emit modulated infrared signals to illuminate the scene and process the received signals to generate amplitude and phase images. The depth is calculated from the phase using the modulation frequency. However, the obtained depth often suffers from noise caused by multi-path interference (MPI), low signal-to-noise ratio (SNR), and depth wrapping. Building on recent advancements in neural scene representations, which have shown great potential in 3D modeling from multi-view RGB images, we propose leveraging this approach to reconstruct 3D representations from noisy iToF data. Our method utilizes the multi-view consistency of amplitude and phase maps, averaging information from all input views to generate an accurate scene representation. Considering the impact of infrared illumination, we propose a new rendering scheme for amplitude maps based on a signed distance function (SDF) and introduce a neural lighting function to model the appearance variations caused by active illumination. We also incorporate a phase-guided sampling strategy and a wrapping-aware phase-to-depth loss to utilize raw phase information and mitigate depth wrapping. Additionally, we add a noise-weight loss to prevent excessive smoothing of information across noisy multi-view measurements. Experiments conducted on synthetic and real-world datasets demonstrate that the proposed method outperforms state-of-the-art techniques.
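For context, the standard iToF phase-to-depth relation and its wrapping ambiguity look as follows; this is the textbook relation the abstract refers to, not code from the paper.

```python
import numpy as np

C = 299_792_458.0  # speed of light (m/s)

def phase_to_depth(phase, f_mod, wrap_index=0):
    """Standard iToF relation: the measured phase is only known modulo 2*pi,
    so depth is recovered up to an integer number of wrapping periods:
    depth = c * (phase + 2*pi*k) / (4*pi*f_mod)."""
    return C * (phase + 2.0 * np.pi * wrap_index) / (4.0 * np.pi * f_mod)

f_mod = 20e6                              # 20 MHz modulation (example value)
phase = np.array([1.0, 3.0, 6.0])         # radians, in [0, 2*pi)
print(phase_to_depth(phase, f_mod))       # candidate depths, k = 0
print(phase_to_depth(phase, f_mod, 1))    # same phases, next wrap
print("unambiguous range (m):", C / (2 * f_mod))
```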

Thu 23 Oct. 17:45 - 19:45 PDT

#311
Highlight
No Pose at All: Self-Supervised Pose-Free 3D Gaussian Splatting from Sparse Views

Ranran Huang · Krystian Mikolajczyk

We introduce SPFSplat, an efficient framework for 3D Gaussian Splatting from sparse multi-view images, requiring no ground-truth poses during either training or inference. Our method simultaneously predicts Gaussians and camera poses from unposed images in a canonical space within a single feed-forward step. During training, the pose head estimates the poses at target views, which are supervised through the image rendering loss. Additionally, a reprojection loss is introduced to ensure alignment between Gaussians and the estimated poses of input views, reinforcing geometric consistency. This pose-free training paradigm and efficient one-step feed-forward inference make SPFSplat well-suited for practical applications. Despite the absence of pose supervision, our self-supervised SPFSplat achieves state-of-the-art performance in novel view synthesis, even under significant viewpoint changes. Furthermore, it surpasses recent methods trained with geometry priors in relative pose estimation, demonstrating its effectiveness in both 3D scene reconstruction and camera pose learning.

Thu 23 Oct. 17:45 - 19:45 PDT

#312
Fusion Meets Diverse Conditions: A High-diversity Benchmark and Baseline for UAV-based Multimodal Object Detection with Condition Cues

Chen Chen · Kangcheng Bin · Hu Ting · Jiahao Qi · Xingyue Liu · Tianpeng Liu · Zhen Liu · Yongxiang Liu · Ping Zhong

Unmanned aerial vehicle (UAV)-based object detection with visible (RGB) and infrared (IR) images facilitates robust around-the-clock detection, driven by advancements in deep learning techniques and the availability of high-quality datasets. However, existing datasets struggle to fully capture real-world complexity due to limited imaging conditions. To this end, we introduce ATR-UMOD, a high-diversity dataset covering varying scenarios, spanning altitudes from 80m to 300m, angles from 0° to 75°, and all-day, all-year time variations in rich weather and illumination conditions. Moreover, each RGB-IR image pair is annotated with 6 condition attributes, offering valuable high-level contextual information. To meet the challenge posed by such diverse conditions, we propose a novel prompt-guided condition-aware dynamic fusion (PCDF) to adaptively reassign multimodal contributions by leveraging annotated condition cues. By encoding imaging conditions as text prompts, PCDF effectively models the relationship between conditions and multimodal contributions through a task-specific soft-gating transformation. A prompt-guided condition-decoupling module further ensures applicability in practice without condition annotations. Experiments on the ATR-UMOD dataset reveal the effectiveness of PCDF.

Thu 23 Oct. 17:45 - 19:45 PDT

#313
Faster and Better 3D Splatting via Group Training

Chengbo Wang · Guozheng Ma · Yizhen Lao · Yifei Xue

3D Gaussian Splatting (3DGS) has emerged as a powerful technique for novel view synthesis, demonstrating remarkable capability in high-fidelity scene reconstruction through its Gaussian primitive representations. However, the computational overhead induced by the massive number of primitives poses a significant bottleneck to training efficiency. To overcome this challenge, we propose Group Training, a simple yet effective strategy that organizes Gaussian primitives into manageable groups, optimizing training efficiency and improving rendering quality. This approach shows universal compatibility with existing 3DGS frameworks, including vanilla 3DGS and Mip-Splatting, consistently achieving accelerated training while maintaining superior synthesis quality. Extensive experiments reveal that our straightforward Group Training strategy achieves up to 30% faster convergence and improved rendering quality across diverse scenarios.
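A rough sketch of what grouped optimization of Gaussian primitives might look like, with a random grouping and a stand-in rendering loss; the paper's grouping criterion and schedule are not reproduced, and all names here are illustrative.

```python
import torch

# Illustrative only: primitives are split into random groups and each
# optimization step updates a single group while the rest stay frozen.

num_gaussians, num_groups = 10_000, 4
params = torch.nn.Parameter(torch.randn(num_gaussians, 59))  # toy per-Gaussian params
group_id = torch.randint(0, num_groups, (num_gaussians,))
opt = torch.optim.Adam([params], lr=1e-2)

def render_loss(p):
    # Stand-in for the differentiable rasterization + photometric loss.
    return (p ** 2).mean()

for step in range(8):
    g = step % num_groups
    mask = (group_id == g).float().unsqueeze(1)      # only this group gets gradients
    active = params * mask + (params * (1 - mask)).detach()
    loss = render_loss(active)
    opt.zero_grad()
    loss.backward()
    opt.step()
```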

Thu 23 Oct. 17:45 - 19:45 PDT

#314
Sat2City: 3D City Generation from A Single Satellite Image with Cascaded Latent Diffusion

Tongyan Hua · Lutao Jiang · Ying-Cong Chen · Wufan Zhao

Recent advancements in generative models have enabled 3D urban scene generation from satellite imagery, unlocking promising applications in gaming, digital twins, and beyond.However, most existing methods rely heavily on neural rendering techniques, which hinder their ability to produce detailed 3D structures on a broader scale, largely due to the inherent structural ambiguity derived from relatively limited 2D observations.To address this challenge, we propose Sat2City, a novel framework that synergizes the representational capacity of sparse voxel grids with latent diffusion models, tailored specifically for our novel 3D city dataset. Our approach is enabled by three key components: (1) A cascaded latent diffusion framework that progressively recovers 3D city structures from satellite imagery, (2) a Re-Hash operation at its Variational Autoencoder (VAE) bottleneck to compute multi-scale feature grids for stable appearance optimization and (3) an inverse sampling strategy enabling implicit supervision for smooth appearance transitioning.To overcome the challenge of collecting real-world city-scale 3D models with high-quality geometry and appearance, we introduce a dataset of synthesized large-scale 3D cities paired with satellite-view height maps. Validated on this dataset, our framework generates detailed 3D structures from a single satellite image, achieving superior fidelity compared to existing city generation models.

Thu 23 Oct. 17:45 - 19:45 PDT

#315
Recovering Parametric Scenes from Very Few Time-of-Flight Pixels

Carter Sifferman · Yiquan Li · Yiming Li · Fangzhou Mu · Michael Gleicher · Mohit Gupta · Yin Li

We aim to recover the geometry of 3D parametric scenes using very few depth measurements from low-cost, commercially available time-of-flight sensors. These sensors offer very low spatial resolution (i.e., a single pixel), but image a wide field-of-view per pixel and capture detailed time-of-flight data in the form of time-resolved photon counts. This time-of-flight data encodes rich scene information and thus enables recovery of simple scenes from sparse measurements. We investigate the feasibility of using a distributed set of few measurements (e.g., as few as 15 pixels) to recover the geometry of simple parametric scenes with a strong prior, such as estimating the 6D pose of a known object. To achieve this, we design a method that utilizes both feed-forward prediction to infer scene parameters, and differentiable rendering within an analysis-by-synthesis framework to refine the scene parameter estimate. We develop hardware prototypes and demonstrate that our method effectively recovers object pose given an untextured 3D model in both simulations and controlled real-world captures, and show promising initial results for other parametric scenes. We additionally conduct experiments to explore the limits and capabilities of our imaging solution.

Thu 23 Oct. 17:45 - 19:45 PDT

#316
Highlight
NeuFrameQ: Neural Frame Fields for Scalable and Generalizable Anisotropic Quadrangulation

Ying-Tian Liu · Jiajun Li · Yu-Tao Liu · Xin Yu · Yuan-Chen Guo · Yanpei Cao · Ding Liang · Ariel Shamir · Song-Hai Zhang

Quad meshes play a crucial role in computer graphics applications, yet automatically generating high-quality quad meshes remains challenging. Traditional quadrangulation approaches rely on local geometric features and manual constraints, often producing suboptimal mesh layouts that fail to capture global shape semantics. We introduce NeuFrameQ, a novel learning-based framework for scalable and generalizable mesh quadrangulation via frame field prediction. We first create a large-scale dataset of high-quality quad meshes with various shapes to serve as domain-knowledge priors. Empowered by this dataset, we employ a connectivity-agnostic learning approach that operates on point clouds with normals, enabling robust processing of complex mesh geometries. By decomposing frame field prediction into direction regression and magnitude estimation tasks, we effectively handle the ill-posed nature of frame field estimation. We also employ the polyvector representation and computation mechanism in both tasks to handle the inherent ambiguities in frame field representation. Extensive experiments demonstrate that NeuFrameQ produces high-quality quad meshes with superior semantic alignment, also for geometries derived from neural fields. Our method significantly advances the state of the art in automatic quad mesh generation, bridging the gap between neural content creation and production-ready geometric assets.

Thu 23 Oct. 17:45 - 19:45 PDT

#317
FB-Diff: Fourier Basis-guided Diffusion for Temporal Interpolation of 4D Medical Imaging

Xin You · Runze Yang · Chuyan Zhang · Zhongliang Jiang · JIE YANG · Nassir Navab

The temporal interpolation task for 4D medical imaging plays a crucial role in clinical practice of respiratory motion modeling. Following the simplified linear-motion hypothesis, existing approaches adopt optical flow-based models to interpolate intermediate frames. However, realistic respiratory motions should be nonlinear and quasi-periodic with specific frequencies. Motivated by this property, we resolve the temporal interpolation task from the frequency perspective, and propose a Fourier basis-guided Diffusion model, termed FB-Diff. Specifically, due to the regular motion pattern of respiration, physiological motion priors are introduced to describe general characteristics of temporal data distributions. Then a Fourier motion operator is elaborately devised to extract Fourier bases by incorporating physiological motion priors and case-specific spectral information in the feature space of a Variational Autoencoder. Well-learned Fourier bases can better simulate respiratory motions with motion patterns of specific frequencies. Conditioned on starting and ending frames, the diffusion model further leverages well-learned Fourier bases via the basis interaction operator, which promotes the temporal interpolation task in a generative manner. Extensive results demonstrate that FB-Diff achieves state-of-the-art (SOTA) perceptual performance with better temporal consistency while maintaining promising reconstruction metrics. Anonymous code is available.
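As a small illustration of the quasi-periodic prior, the snippet below estimates the dominant respiratory frequency of a 1D motion surrogate via an FFT peak; this is generic frequency analysis, not the paper's Fourier motion operator, and the sampling rate is a made-up example.

```python
import numpy as np

def dominant_breathing_freq(signal, fps):
    """Estimate the dominant respiratory frequency (Hz) of a 1D motion
    surrogate from the FFT magnitude peak (DC bin excluded)."""
    sig = signal - signal.mean()
    spec = np.abs(np.fft.rfft(sig))
    freqs = np.fft.rfftfreq(len(sig), d=1.0 / fps)
    return freqs[1:][np.argmax(spec[1:])]

fps = 4.0                                     # toy frame rate
t = np.arange(0, 30, 1.0 / fps)
surrogate = np.sin(2 * np.pi * 0.25 * t)      # ~15 breaths per minute
print(dominant_breathing_freq(surrogate, fps))    # ~0.25 Hz
```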

Thu 23 Oct. 17:45 - 19:45 PDT

#318
RTMap: Real-Time Recursive Mapping with Change Detection and Localization

Yuheng Du · Sheng Yang · Lingxuan Wang · Zhenghua Hou · Chengying Cai · Zhitao Tan · Mingxia Chen · Shi-Sheng Huang · Qiang Li

While recent online HD mapping methods relieve burdened offline pipelines and solve map freshness, they remain limited by perceptual inaccuracies, occlusion in dense traffic, and an inability to fuse multi-agent observations. We propose RTMap to enhance these single-traversal methods by persistently crowdsourcing a multi-traversal HD map as a self-evolving memory. On onboard agents, RTMap simultaneously addresses three core challenges in an end-to-end fashion: (1) uncertainty-aware positional modeling for HD map elements, (2) probabilistic-aware localization w.r.t. the crowdsourced prior map, and (3) real-time detection of possible road structural changes. Experiments on several public autonomous driving datasets demonstrate solid performance in both prior-aided map quality and localization accuracy, showing that RTMap robustly serves downstream prediction and planning modules while gradually and asynchronously improving the accuracy and freshness of the crowdsourced prior map.

Thu 23 Oct. 17:45 - 19:45 PDT

#319
CoDa-4DGS: Dynamic Gaussian Splatting with Context and Deformation Awareness for Autonomous Driving

Rui Song · Chenwei Liang · Yan Xia · Walter Zimmer · Hu Cao · Holger Caesar · Andreas Festag · Alois Knoll

Dynamic scene rendering opens new avenues in autonomous driving by enabling closed-loop simulations with photorealistic data, which is crucial for validating end-to-end algorithms. However, the complex and highly dynamic nature of traffic environments presents significant challenges in accurately rendering these scenes. In this paper, we introduce a novel 4D Gaussian Splatting (4DGS) approach, which incorporates context and temporal deformation awareness to improve dynamic scene rendering. Specifically, we employ a 2D semantic segmentation foundation model to self-supervise the 4D semantic features of Gaussians, ensuring meaningful contextual embedding. Simultaneously, we track the temporal deformation of each Gaussian across adjacent frames. By aggregating and encoding both semantic and temporal deformation features, each Gaussian is equipped with cues for potential deformation compensation within 3D space, facilitating a more precise representation of dynamic scenes. Experimental results show that our method improves 4DGS's ability to capture fine details in dynamic scene rendering for autonomous driving and outperforms other self-supervised methods in 4D reconstruction and novel view synthesis. Furthermore, CoDa-4DGS deforms semantic features with each Gaussian, enabling broader applications.

Thu 23 Oct. 17:45 - 19:45 PDT

#320
A Differentiable Wave Optics Model for End-to-End Computational Imaging System Optimization

Chi-Jui Ho · Yash Belhe · Steve Rotenberg · Ravi Ramamoorthi · Tzu-Mao Li · Nicholas Antipa

End-to-end optimization, which integrates differentiable optics simulators with computational algorithms, enables the joint design of hardware and software in data-driven imaging systems. However, existing methods usually compromise physical accuracy by neglecting wave optics or off-axis effects due to the high computational cost of modeling both aberration and diffraction. This limitation raises concerns about the robustness of optimized designs. In this paper, we propose a differentiable optics simulator that accurately and efficiently models aberration and diffraction in compound optics and allows us to analyze the role and impact of diffraction in end-to-end optimization. Experimental results demonstrate that compared with ray-optics-based optimization, diffraction-aware optimization improves system robustness to diffraction blur. Through accurate wave optics modeling, we also apply the simulator to optimize the Fizeau interferometer and free form optics elements. These findings underscore the importance of accurate wave optics modeling in robust end-to-end optimization.

Thu 23 Oct. 17:45 - 19:45 PDT

#321
Controllable 3D Outdoor Scene Generation via Scene Graphs

Yuheng Liu · Xinke Li · Yuning Zhang · Lu Qi · Xin Li · Wenping Wang · Chongshou Li · Xueting Li · Ming-Hsuan Yang

Three-dimensional scene generation is crucial in computer vision, with applications spanning autonomous driving, gaming and the metaverse. Current methods either lack user control or rely on imprecise, non-intuitive conditions. In this work, we propose a method that uses scene graphs—an accessible, user-friendly control format—to generate outdoor 3D scenes. We develop an interactive system that transforms a sparse scene graph into a dense BEV (Bird's Eye View) Embedding Map, which guides a conditional diffusion model to generate 3D scenes that match the scene graph description. During inference, users can easily create or modify scene graphs to generate large-scale outdoor scenes. We create a large-scale dataset with paired scene graphs and 3D semantic scenes to train the BEV embedding and diffusion models. Experimental results show that our approach consistently produces high-quality 3D urban scenes closely aligned with the input scene graphs. To the best of our knowledge, this is the first approach to generate 3D outdoor scenes conditioned on scene graphs. Code and dataset will be released upon acceptance.

Thu 23 Oct. 17:45 - 19:45 PDT

#322
Highlight
CasP: Improving Semi-Dense Feature Matching Pipeline Leveraging Cascaded Correspondence Priors for Guidance

Peiqi Chen · Lei Yu · Yi Wan · Yingying Pei · Xinyi Liu · Yongxiang Yao · Yingying Zhang · Lixiang Ru · Liheng Zhong · Jingdong Chen · Ming Yang · Yongjun Zhang

Semi-dense feature matching methods have shown strong performance in challenging scenarios. However, the existing pipeline relies on a global search across the entire feature map to establish coarse matches, limiting further improvements in accuracy and efficiency. Motivated by this limitation, we propose a novel pipeline, CasP, which leverages cascaded correspondence priors for guidance. Specifically, the matching stage is decomposed into two progressive phases, bridged by a region-based selective cross-attention mechanism designed to enhance feature discriminability. In the second phase, one-to-one matches are determined by restricting the search range to the one-to-many prior areas identified in the first phase. Additionally, this pipeline benefits from incorporating high-level features, which helps reduce the computational costs of low-level feature extraction. The acceleration gains of CasP increase with higher resolution, and our lite model achieves a speedup of $\sim2.2\times$ at a resolution of 1152 compared to the most efficient method, ELoFTR. Furthermore, extensive experiments demonstrate its superiority in geometric estimation, particularly with impressive cross-domain generalization. These advantages highlight its potential for latency-sensitive and high-robustness applications, such as SLAM and UAV systems.

Thu 23 Oct. 17:45 - 19:45 PDT

#323
PolGS: Polarimetric Gaussian Splatting for Fast Reflective Surface Reconstruction

Yufei Han · Bowen Tie · Heng Guo · Youwei Lyu · Si Li · Boxin Shi · Yunpeng Jia · Zhanyu Ma

Efficient shape reconstruction for surfaces with complex reflectance properties is crucial for real-time virtual reality. While 3D Gaussian Splatting (3DGS)-based methods offer fast novel view rendering by leveraging their explicit surface representation, their reconstruction quality lags behind that of implicit neural representations, particularly when recovering surfaces with complex reflectance. To address these problems, we propose PolGS, a $\underline{Pol}$arimetric $\underline{G}$aussian $\underline{S}$platting model allowing fast reflective surface reconstruction in 10 minutes. By integrating polarimetric constraints into the 3DGS framework, PolGS effectively separates specular and diffuse components, enhancing reconstruction quality for challenging reflective materials. Experimental results on synthetic and real-world datasets validate the effectiveness of our method.
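For reference, the standard four-angle polarimetric relations often used to separate polarized (specular-leaning) from unpolarized (diffuse-dominant) intensity are sketched below; this is textbook polarimetry, not the PolGS pipeline.

```python
import numpy as np

def stokes_from_polarizer(i0, i45, i90, i135):
    """Standard four-angle relations: linear Stokes components, degree and
    angle of linear polarization, and the common I_min / I_max split, where
    I_min (unpolarized) is often treated as diffuse-dominant."""
    s0 = 0.5 * (i0 + i45 + i90 + i135)
    s1 = i0 - i90
    s2 = i45 - i135
    lin = np.sqrt(s1 ** 2 + s2 ** 2)
    dolp = lin / np.maximum(s0, 1e-8)          # degree of linear polarization
    aolp = 0.5 * np.arctan2(s2, s1)            # angle of linear polarization
    i_min = 0.5 * (s0 - lin)                   # unpolarized part
    i_max = 0.5 * (s0 + lin)                   # polarized peak
    return s0, dolp, aolp, i_min, i_max

i0, i45, i90, i135 = 1.2, 1.0, 0.4, 0.6        # toy per-pixel intensities
print(stokes_from_polarizer(i0, i45, i90, i135))
```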

Thu 23 Oct. 17:45 - 19:45 PDT

#324
Driving View Synthesis on Free-form Trajectories with Generative Prior

Zeyu Yang · Zijie Pan · Yuankun Yang · Xiatian Zhu · Li Zhang

Driving view synthesis along free-form trajectories is essential for realistic driving simulations, enabling closed-loop evaluation of end-to-end driving policies. Existing methods excel at view interpolation along recorded paths but struggle to generalize to novel trajectories due to limited viewpoints in driving videos. To tackle this challenge, we propose DriveX, a novel free-form driving view synthesis framework that progressively distills generative priors into the 3D Gaussian model during its optimization. Within this framework, we utilize a video diffusion model to refine the degraded novel trajectory renderings from the in-training Gaussian model, while the restored videos in turn serve as additional supervision for optimizing the 3D Gaussians. Concretely, we craft an inpainting-based video restoration task, which disentangles the identification of degraded regions from the generative capability of the diffusion model and removes the need to simulate specific degradation patterns when training the diffusion model. To further enhance the consistency and fidelity of generated contents, the pseudo ground truth is progressively updated with gradually improved novel trajectory renderings, allowing both components to co-adapt and reinforce each other while minimizing disruption to the optimization. By tightly integrating 3D scene representation with generative priors, DriveX achieves high-quality view synthesis beyond recorded trajectories in real time, unlocking new possibilities for flexible and realistic driving simulations on free-form trajectories.

Thu 23 Oct. 17:45 - 19:45 PDT

#325
ResGS: Residual Densification of 3D Gaussian for Efficient Detail Recovery

Yanzhe Lyu · Kai Cheng · Kang Xin · Xuejin Chen

Recently, 3D Gaussian Splatting (3D-GS) has prevailed in novel view synthesis, achieving high fidelity and efficiency. However, it often struggles to capture rich details and complete geometry. Our analysis reveals that the 3D-GS densification operation lacks adaptiveness and faces a dilemma between geometry coverage and detail recovery. To address this, we introduce a novel densification operation, residual split, which adds a downscaled Gaussian as a residual. Our approach is capable of adaptively retrieving details and complementing missing geometry. To further support this method, we propose a pipeline named ResGS. Specifically, we integrate a Gaussian image pyramid for progressive supervision and implement a selection scheme that prioritizes the densification of coarse Gaussians over time. Extensive experiments demonstrate that our method achieves SOTA rendering quality. Consistent performance improvements can be achieved by applying our residual split to various 3D-GS variants, underscoring its versatility and potential for broader adoption in 3D-GS-based applications.
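A hypothetical sketch of what a residual split could look like: duplicate a flagged Gaussian with a shrunken scale and a zero-initialized residual color, so the rendered image is initially unchanged and the optimizer fills in missing detail. The parameter layout and shrink factor are assumptions, not the paper's.

```python
import torch

def residual_split(means, scales, colors, idx, shrink=0.5):
    """Assumed residual-split operation: append a downscaled copy of each
    selected Gaussian with zero residual color."""
    new_means = torch.cat([means, means[idx]], dim=0)
    new_scales = torch.cat([scales, scales[idx] * shrink], dim=0)
    new_colors = torch.cat([colors, torch.zeros_like(colors[idx])], dim=0)
    return new_means, new_scales, new_colors

means = torch.randn(100, 3)
scales = torch.rand(100, 3)
colors = torch.rand(100, 3)
idx = torch.tensor([3, 17, 42])            # Gaussians flagged for densification
means, scales, colors = residual_split(means, scales, colors, idx)
print(means.shape)                         # torch.Size([103, 3])
```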

Thu 23 Oct. 17:45 - 19:45 PDT

#326
SEGS-SLAM: Structure-enhanced 3D Gaussian Splatting SLAM with Appearance Embedding

Tianci Wen · Zhiang Liu · Yongchun Fang

3D Gaussian splatting (3D-GS) has recently revolutionized novel view synthesis in the simultaneous localization and mapping (SLAM) problem. However, most existing algorithms fail to fully capture the underlying structure, resulting in structural inconsistency. Additionally, they struggle with abrupt appearance variations, leading to inconsistent visual quality. To address these problems, we propose SEGS-SLAM, a structure-enhanced 3D Gaussian Splatting SLAM, which achieves high-quality photorealistic mapping. Our main contributions are two-fold. First, we propose a structure-enhanced photorealistic mapping (SEPM) framework that, for the first time, leverages highly structured point clouds to initialize structured 3D Gaussians, leading to significant improvements in rendering quality. Second, we propose Appearance-from-Motion embedding (AfME), enabling 3D Gaussians to better model image appearance variations across different camera poses. Extensive experiments on monocular, stereo, and RGB-D datasets demonstrate that SEGS-SLAM significantly outperforms state-of-the-art (SOTA) methods in photorealistic mapping quality, e.g., an improvement of $19.86\%$ in PSNR over MonoGS on the TUM RGB-D dataset for monocular cameras. The project page is available at https://segs-slam.github.io/.

Thu 23 Oct. 17:45 - 19:45 PDT

#327
Constraint-Aware Feature Learning for Parametric Point Cloud

Xi Cheng · Ruiqi Lei · Di Huang · Zhichao Liao · Fengyuan Piao · Yan Chen · Pingfa Feng · Long ZENG

Parametric point clouds are sampled from CAD shapes and are becoming increasingly common in industrial manufacturing. Most existing CAD-specific deep learning methods focus only on geometric features, while overlooking constraints that are inherent and important in CAD shapes. This limits their ability to discern CAD shapes with similar appearance but different constraints. To tackle this challenge, we first analyze the importance of constraints via a simple validation experiment. Then, we introduce a deep learning-friendly constraint representation with three vectorized components, and design a constraint-aware feature learning network (CstNet), which includes two stages. Stage 1 extracts constraint features from B-Rep data or point clouds based on local shape information, enabling better generalization to unseen datasets after model pre-training. Stage 2 employs attention layers to adaptively adjust the weights of the three constraint components, facilitating the effective utilization of constraints. In addition, we built the first multi-modal parametric-purpose dataset, i.e., Param20K, comprising about 20K shape instances of 75 classes. On this dataset, we performed classification and rotation robustness experiments, and CstNet achieved 3.52\% and 26.17\% absolute improvements in instance accuracy over the state-of-the-art methods, respectively. To the best of our knowledge, CstNet is the first constraint-aware deep learning method tailored for parametric point cloud analysis in the CAD domain.

Thu 23 Oct. 17:45 - 19:45 PDT

#328
PixelStitch: Structure-Preserving Pixel-Wise Bidirectional Warps for Unsupervised Image Stitching

Hengzhe Jin · Lang Nie · Chunyu Lin · Xiaomei Feng · Yao Zhao

We propose $\textit{PixelStitch}$, a pixel-wise bidirectional warp that learns to stitch images as well as preserve structure in an unsupervised paradigm. To produce natural stitched images, we first determine the middle plane through homography decomposition and globally project the original images toward the desired plane. Compared with unidirectional homography transformation, it evenly spreads projective distortion across two views and decreases the proportion of invalid pixels. Then, the bidirectional optical flow fields are established to carry out residual pixel-wise deformation with projection-weighted natural coefficients, encouraging pixel motions to be as unnoticeable as possible in non-overlapping regions while smoothly transitioning into overlapping areas. Crucially, this flexible deformation enables $\textit{PixelStitch}$ to align large-parallax images and preserve the structural integrity of non-overlapping contents. To obtain high-quality stitched images in the absence of labels, a comprehensive unsupervised objective function is proposed to simultaneously encourage content alignment, structure preservation, and bidirectional consistency. Finally, extensive experiments are conducted to show our superiority to existing state-of-the-art (SoTA) methods in the quantitative metric, qualitative appearance, and generalization ability. The code will be available.

Thu 23 Oct. 17:45 - 19:45 PDT

#329
MagicDrive-V2: High-Resolution Long Video Generation for Autonomous Driving with Adaptive Control

Ruiyuan Gao · Kai Chen · Bo Xiao · Lanqing HONG · Zhenguo Li · Qiang Xu

The rapid advancement of diffusion models has greatly improved video synthesis, especially in controllable video generation, which is vital for applications like autonomous driving. Although DiT with 3D VAE has become a standard framework for video generation, it introduces challenges in controllable driving video generation, especially for frame-wise geometric control, rendering existing methods ineffective. To address these issues, we propose MagicDrive-V2, a novel approach that integrates the MVDiT block and spatial-temporal conditional encoding to enable multi-view video generation and precise geometric control. Additionally, we introduce an efficient method for obtaining contextual descriptions for videos to support diverse textual control, along with a progressive training strategy using mixed video data to enhance training efficiency and generalizability. Consequently, MagicDrive-V2 enables multi-view driving video synthesis with 3.3× resolution and 4× frame count (compared to current SOTA), rich contextual control, and geometric controls. Extensive experiments demonstrate MagicDrive-V2’s ability, unlocking broader applications in autonomous driving. Project page: magicdrive-v2.github.io

Thu 23 Oct. 17:45 - 19:45 PDT

#330
FreqPDE: Rethinking Positional Depth Embedding for Multi-View 3D Object Detection Transformers

Junjie Zhang · Haisheng Su · Feixiang Song · Sanping Zhou · Wei Wu · Junchi Yan · Nanning Zheng

Detecting 3D objects accurately from multi-view 2D images is a challenging yet essential task in the field of autonomous driving. Current methods resort to integrating depth prediction to recover the spatial information for object query decoding, which necessitates explicit supervision from LiDAR points during the training phase. However, the predicted depth quality is still unsatisfactory, e.g., discontinuous depth at object boundaries and poor distinction of small objects, which are mainly caused by the sparse supervision of projected points and the use of high-level image features for depth prediction. Besides, cross-view consistency and scale invariance are also overlooked in previous methods. In this paper, we introduce Frequency-aware Positional Depth Embedding (FreqPDE) to equip 2D image features with spatial information for the 3D detection transformer decoder, which is obtained through three main modules. Specifically, the Frequency-aware Spatial Pyramid Encoder (FSPE) constructs a feature pyramid by combining high-frequency edge clues and low-frequency semantics from different levels respectively. Then the Cross-view Scale-invariant Depth Predictor (CSDP) estimates the pixel-level depth distribution with a cross-view and efficient channel attention mechanism. Finally, the Positional Depth Encoder (PDE) combines the 2D image features and 3D position embeddings to generate the 3D depth-aware features for query decoding. Additionally, hybrid depth supervision is adopted for complementary depth learning from both metric and distribution aspects. Extensive experiments conducted on the nuScenes dataset demonstrate the effectiveness and superiority of our proposed method.

Thu 23 Oct. 17:45 - 19:45 PDT

#331
Correspondence-Free Fast and Robust Spherical Point Pattern Registration

Anik Sarker · Alan Asbeck

Existing methods for rotation estimation between two spherical ($\mathbb{S}^2$) patterns typically rely on spherical cross-correlation maximization between two spherical functions. However, these approaches exhibit computational complexities greater than cubic $O(n^3)$ with respect to rotation space discretization and lack extensive evaluation under significant outlier contamination. To this end, we propose a rotation estimation algorithm between two spherical patterns with linear time complexity $O(n)$. Unlike existing spherical-function-based methods, we explicitly represent spherical patterns as discrete 3D point sets on the unit sphere, reformulating rotation estimation as a spherical point-set alignment (i.e., the Wahba problem for 3D unit vectors). Given the geometric nature of our formulation, our spherical pattern alignment algorithm naturally aligns with the Wahba problem framework for 3D unit vectors. Specifically, we introduce three novel algorithms: (1) SPMC (Spherical Pattern Matching by Correlation), (2) FRS (Fast Rotation Search), and (3) a hybrid approach (SPMC+FRS) that combines the advantages of the previous two methods. Our experiments demonstrate that in the $\mathbb{S}^2$ domain and in correspondence-free settings, our algorithms are over 10x faster and over 10x more accurate than current state-of-the-art methods for the Wahba problem with outliers. We validate our approach through extensive simulations on a new dataset of spherical patterns, the ``Robust Vector Alignment Dataset.'' Furthermore, we adapt our methods to two real-world tasks: (i) Point Cloud Registration (PCR) and (ii) rotation estimation for spherical images. In the PCR task, our approach successfully registers point clouds exhibiting overlap ratios as low as 65\%. In spherical image alignment, we show that our method robustly estimates rotations even under challenging conditions involving substantial clutter (over 19\%) and large rotational offsets. Our results highlight the effectiveness and robustness of our algorithms in realistic, complex scenarios. Our code is available at: https://anonymous.4open.science/r/RobustVectorAlignment-EC0E/README.md
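For reference, the classic SVD (Kabsch/Markley) solution of the Wahba problem with known correspondences is sketched below; the paper's SPMC and FRS algorithms target the harder correspondence-free, outlier-heavy setting, so this is only the textbook baseline.

```python
import numpy as np

def wahba_svd(a, b, weights=None):
    """Find the rotation R minimizing sum_i w_i * ||b_i - R a_i||^2 for unit
    vectors with KNOWN correspondences, via the SVD of the attitude profile
    matrix B = sum_i w_i b_i a_i^T."""
    w = np.ones(len(a)) if weights is None else weights
    B = (w[:, None, None] * b[:, :, None] @ a[:, None, :]).sum(axis=0)
    U, _, Vt = np.linalg.svd(B)
    d = np.sign(np.linalg.det(U) * np.linalg.det(Vt))   # enforce det(R) = +1
    return U @ np.diag([1.0, 1.0, d]) @ Vt

# Toy check: recover a known 30-degree rotation about the z-axis.
rng = np.random.default_rng(0)
a = rng.normal(size=(50, 3))
a /= np.linalg.norm(a, axis=1, keepdims=True)
angle = np.deg2rad(30.0)
R_true = np.array([[np.cos(angle), -np.sin(angle), 0],
                   [np.sin(angle),  np.cos(angle), 0],
                   [0, 0, 1]])
b = a @ R_true.T
print(np.allclose(wahba_svd(a, b), R_true, atol=1e-6))   # True
```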

Thu 23 Oct. 17:45 - 19:45 PDT

#332
Highlight
NeuraLeaf: Neural Parametric Leaf Models with Shape and Deformation Disentanglement

Yang Yang · Dongni Mao · Hiroaki Santo · Yasuyuki Matsushita · Fumio Okura

We develop a neural parametric model for 3D plant leaves for the modeling and reconstruction of plants, which are essential for agriculture and computer graphics. While parametric modeling has been actively studied for human and animal shapes, plant leaves present unique challenges due to their diverse shapes and flexible deformation, making common approaches inapplicable. To address this problem, we introduce a learning-based parametric model, NeuraLeaf, disentangling the leaves' geometry into their 2D base shapes and 3D deformations. Since the base shapes represent flattened 2D leaves, this allows learning from rich sources of 2D leaf image datasets, and also has the advantage of simultaneously learning texture aligned with the geometry. To model the 3D deformation, we propose a novel skeleton-free skinning model and a newly captured 3D leaf dataset called DeformLeaf. We establish a parametric deformation space by converting the sample-wise skinning parameters into a compact latent representation, allowing for flexible and efficient modeling of leaf deformations. We show that NeuraLeaf successfully generates a wide range of leaf shapes with deformation, resulting in accurate model fitting to 3D observations like depth maps and point clouds. Our implementation and datasets will be released upon acceptance.

Thu 23 Oct. 17:45 - 19:45 PDT

#333
ZeroStereo: Zero-shot Stereo Matching from Single Images

Xianqi Wang · Hao Yang · Gangwei Xu · Junda Cheng · Min Lin · Yong Deng · Jinliang Zang · Yurui Chen · Xin Yang

State-of-the-art supervised stereo matching methods have achieved remarkable performance on various benchmarks. However, their generalization to real-world scenarios remains challenging due to the scarcity of annotated real-world stereo data. In this paper, we propose ZeroStereo, a novel stereo image generation pipeline for zero-shot stereo matching. Our approach synthesizes high-quality right images from arbitrary single images by leveraging pseudo disparities generated by a monocular depth estimation model. Unlike previous methods that address occluded regions by filling missing areas with neighboring pixels or random backgrounds, we fine-tune a diffusion inpainting model to recover missing details while preserving semantic structure. Additionally, we propose Training-Free Confidence Generation, which mitigates the impact of unreliable pseudo labels without additional training, and Adaptive Disparity Selection, which ensures a diverse and realistic disparity distribution while preventing excessive occlusion and foreground distortion. Experiments demonstrate that models trained with our pipeline achieve state-of-the-art zero-shot generalization across multiple datasets with only a dataset volume comparable to Scene Flow.
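A minimal sketch of the underlying warping step: forward-warp the left image by a pseudo disparity to form a right view and flag dis-occluded pixels for inpainting. This nearest-pixel warp is a simplified stand-in, not ZeroStereo's actual generation pipeline.

```python
import numpy as np

def forward_warp_left_to_right(left, disparity):
    """Shift each left pixel by its (pseudo) disparity to build a right view,
    keeping the closest surface per target pixel, and return a mask of holes
    (dis-occlusions) that would be handed to an inpainting model."""
    h, w = disparity.shape
    right = np.zeros_like(left)
    filled = np.zeros((h, w), dtype=bool)
    depth_buf = np.full((h, w), -np.inf)
    ys, xs = np.mgrid[0:h, 0:w]
    xt = np.round(xs - disparity).astype(int)    # right view: x' = x - d
    valid = (xt >= 0) & (xt < w)
    for y, x, x2, d in zip(ys[valid], xs[valid], xt[valid], disparity[valid]):
        if d > depth_buf[y, x2]:                 # larger disparity = closer
            depth_buf[y, x2] = d
            right[y, x2] = left[y, x]
            filled[y, x2] = True
    return right, ~filled                        # holes need inpainting

left = np.random.rand(4, 8, 3)
disp = np.random.uniform(0, 2, size=(4, 8))
right, holes = forward_warp_left_to_right(left, disp)
print(holes.sum(), "pixels to inpaint")
```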

Thu 23 Oct. 17:45 - 19:45 PDT

#334
CVFusion: Cross-View Fusion of 4D Radar and Camera for 3D Object Detection

Hanzhi Zhong · Zhiyu Xiang · Ruoyu Xu · Jingyun Fu · Peng Xu · Shaohong Wang · Zhihao Zhihao · Tianyu Pu · Eryun Liu

4D radar has received significant attention in autonomous driving thanks to its robustness under adverse weather. Due to the sparse points and noisy measurements of 4D radar, most research completes the 3D object detection task by integrating camera images and performing modality fusion in BEV space. However, the potential of the radar and of the fusion mechanism is still largely unexplored, hindering performance improvement. In this study, we propose a cross-view two-stage fusion network called CVFusion. In the first stage, we design a radar guided iterative (RGIter) BEV fusion module to generate high-recall 3D proposal boxes. In the second stage, we aggregate features from multiple heterogeneous views including points, image, and BEV for each proposal. These comprehensive instance-level features greatly help refine the proposals and generate high-quality predictions. Extensive experiments on public datasets show that our method outperforms the previous state-of-the-art methods by a large margin, with 9.10\% and 3.68\% mAP improvements on View-of-Delft (VoD) and TJ4DRadSet, respectively. Our code will be made publicly available.

Thu 23 Oct. 17:45 - 19:45 PDT

#335
Highlight
Stochastic Gradient Estimation for Higher-Order Differentiable Rendering

Zican Wang · Michael Fischer · Tobias Ritschel

We derive methods to compute higher-order differentials (Hessians and Hessian-vector products) of the rendering operator. Our approach is based on importance sampling of a convolution that represents the differentials of rendering parameters, and is shown to be applicable to both rasterization and path tracing. We demonstrate that this information improves convergence when used in higher-order optimizers such as Newton or Conjugate Gradient, relative to a gradient descent baseline, in several inverse rendering tasks.

Thu 23 Oct. 17:45 - 19:45 PDT

#336
M2EIT: Multi-Domain Mixture of Experts for Robust Neural Inertial Tracking

Yan Li · Yang Xu · Changhao Chen · Zhongchen Shi · Wei Chen · Liang Xie · Hongbo Chen · Erwei Yin

Inertial tracking (IT), independent of the environment and external infrastructure, has long been the ideal solution for providing location services to humans. Despite significant strides in inertial tracking empowered by deep learning, prevailing neural inertial tracking predominantly utilizes conventional spatial-temporal features from inertial measurements. Unfortunately, the frequency domain is usually overlooked in the current literature. To this end, in this paper, we propose a Multi-Domain Mixture of Experts model for Neural Inertial Tracking, named M$^2$EIT. Specifically, M$^2$EIT first leverages ResNet as a spatial decomposition expert to capture spatial relationships between multivariate time series, and a State Space Model (SSM)-based Bi-Mamba as the other expert, which focuses on learning temporal correlations. For the frequency-domain mapping, we then introduce a Wavelet-based frequency decomposition expert, which decomposes IMU samples into low-frequency and high-frequency bands using the Haar wavelet transform to simulate motion patterns at different temporal scales. To bridge the semantic gap across multiple domains and integrate them adaptively, we design the Multi-Representation Alignment Router (MAR), which consists of a dual cross-domain translation layer followed by a dynamic router, to achieve multi-domain semantic alignment and optimize expert contributions. Extensive experiments conducted on three real-world datasets demonstrate that the proposed M$^2$EIT achieves SOTA results in neural inertial tracking.
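As a small illustration of the frequency-decomposition idea, the snippet below applies one level of the Haar wavelet transform to split an IMU channel into low- and high-frequency bands; this is a generic DWT step, not the paper's full expert design.

```python
import numpy as np

def haar_dwt_1d(x):
    """One level of the Haar wavelet transform: split a signal into a
    low-frequency approximation band and a high-frequency detail band."""
    x = x[: len(x) // 2 * 2]                    # truncate to even length
    pairs = x.reshape(-1, 2)
    low = (pairs[:, 0] + pairs[:, 1]) / np.sqrt(2.0)
    high = (pairs[:, 0] - pairs[:, 1]) / np.sqrt(2.0)
    return low, high

t = np.linspace(0, 2, 400)
imu_channel = np.sin(2 * np.pi * 1.5 * t) + 0.1 * np.random.randn(400)
low_band, high_band = haar_dwt_1d(imu_channel)
print(low_band.shape, high_band.shape)          # (200,) (200,)
```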

Thu 23 Oct. 17:45 - 19:45 PDT

#337
Orchid: Image Latent Diffusion for Joint Appearance and Geometry Generation

Akshay Krishnan · Xinchen Yan · Vincent Casser · Abhijit Kundu

We introduce Orchid, a unified latent diffusion model that learns a joint appearance-geometry prior to generate color, depth, and surface normal images in a single diffusion process. This unified approach is more efficient and coherent than current pipelines that use separate models for appearance and geometry. Orchid is versatile: it directly generates color, depth, and normal images from text, supports joint monocular depth and normal estimation with color-conditioned finetuning, and seamlessly inpaints large 3D regions by sampling from the joint distribution. It leverages a novel Variational Autoencoder (VAE) that jointly encodes RGB, relative depth, and surface normals into a shared latent space, combined with a latent diffusion model that denoises these latents. Our extensive experiments demonstrate that Orchid delivers competitive performance against SOTA task-specific geometry prediction methods, even surpassing them in normal-prediction accuracy and depth-normal consistency. It also inpaints color-depth-normal images jointly, with more qualitative realism than existing multi-step methods.

Thu 23 Oct. 17:45 - 19:45 PDT

#338
CATSplat: Context-Aware Transformer with Spatial Guidance for Generalizable 3D Gaussian Splatting from A Single-View Image

Wonseok Roh · Hwanhee Jung · JongWook Kim · Seunggwan Lee · Innfarn Yoo · Andreas Lugmayr · Seunggeun Chi · Karthik Ramani · Sangpil Kim

Recently, generalizable feed-forward methods based on 3D Gaussian Splatting have gained significant attention for their potential to reconstruct 3D scenes using finite resources. These approaches create a 3D radiance field, parameterized by per-pixel 3D Gaussian primitives, from just a few images in a single forward pass. Unlike multi-view methods that benefit from cross-view correspondences, 3D scene reconstruction with a single-view image remains an underexplored area. In this work, we introduce CATSplat, a novel generalizable transformer-based framework designed to break through the inherent constraints in monocular settings. First, we propose leveraging textual guidance from a visual-language model to complement insufficient information from single-view image features. By incorporating scene-specific contextual details from text embeddings through cross-attention, we pave the way for context-aware 3D scene reconstruction beyond relying solely on visual cues. Moreover, we advocate utilizing spatial guidance from 3D point features toward comprehensive geometric understanding under monocular settings. With 3D priors, image features can capture rich structural insights for predicting 3D Gaussians without multi-view techniques. Extensive experiments on large-scale datasets demonstrate the state-of-the-art performance of CATSplat in single-view 3D scene reconstruction with high-quality novel view synthesis.

Thu 23 Oct. 17:45 - 19:45 PDT

#339
Highlight
TPG-INR: Target Prior-Guided Implicit 3D CT Reconstruction for Enhanced Sparse-view Imaging

Qinglei Cao · Ziyao Tang · Xiaoqin Tang

X-ray imaging, based on penetration, enables detailed visualization of internal structures. Building on this capability, existing implicit 3D reconstruction methods have adapted the NeRF model and its variants for internal CT reconstruction. However, these approaches often neglect the significance of objects' anatomical priors for implicit learning, limiting both reconstruction precision and learning efficiency, particularly in ultra-sparse view scenarios. To address these challenges, we propose a novel 3D CT reconstruction framework that employs a 'target prior' derived from the object's projection data to enhance implicit learning. Our approach integrates positional and structural encoding to facilitate voxel-wise implicit reconstruction, utilizing the target prior to guide voxel sampling and enrich structural encoding. This dual strategy significantly boosts both learning efficiency and reconstruction quality. Additionally, we introduce a CUDA-based algorithm for rapid estimation of high-quality 3D target priors from sparse-view projections. Experiments utilizing projection data from a complex abdominal dataset demonstrate that the proposed model substantially enhances learning efficiency, outperforming the current leading model, NAF, by a factor of ten. In terms of reconstruction quality, it also exceeds the most accurate model, NeRP, achieving PSNR improvements of 3.57 dB, 5.42 dB, and 5.70 dB with 10, 20, and 30 projections, respectively. The code is available upon request.
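For context, the standard sinusoidal positional encoding used by NeRF-style implicit models, which this line of work builds on, can be written as follows; the paper's structural encoding and target prior are not shown here.

```python
import numpy as np

def positional_encoding(coords, num_freqs=6):
    """Standard NeRF-style sinusoidal encoding of 3D voxel coordinates
    (coords in [-1, 1], shape (N, 3)); returns shape (N, 2 * num_freqs * 3)."""
    freqs = 2.0 ** np.arange(num_freqs) * np.pi            # (F,)
    angles = coords[:, None, :] * freqs[None, :, None]     # (N, F, 3)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=1)
    return enc.reshape(coords.shape[0], -1)

voxels = np.random.uniform(-1, 1, size=(5, 3))
print(positional_encoding(voxels).shape)                   # (5, 36)
```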

Thu 23 Oct. 17:45 - 19:45 PDT

#340
TrackAny3D: Transferring Pretrained 3D Models for Category-unified 3D Point Cloud Tracking

Mengmeng Wang · Haonan Wang · Yulong Li · Xiangjie Kong · Jiaxin Du · Feng Xia · Guojiang Shen

3D LiDAR-based single object tracking (SOT) relies on sparse and irregular point clouds, posing challenges from geometric variations in scale, motion patterns, and structural complexity across object categories. Current category-specific approaches achieve good accuracy but are impractical for real-world use, requiring separate models for each category and showing limited generalization. To tackle these issues, we propose TrackAny3D, the first framework to transfer large-scale pretrained 3D models for category-agnostic 3D SOT. We first integrate parameter-efficient adapters to bridge the gap between pretraining and tracking tasks while preserving geometric priors. Then, we introduce a Mixture-of-Geometry-Experts (MoGE) architecture that adaptively activates specialized subnetworks based on distinct geometric characteristics. Additionally, we design a temporal context optimization strategy that incorporates learnable temporal tokens and a dynamic mask weighting module to propagate historical information and mitigate temporal drift. Experiments on three commonly-used benchmarks show that TrackAny3D establishes new state-of-the-art performance on category-agnostic 3D SOT, demonstrating strong generalization and competitiveness. We hope this work will enlighten the community on the importance of unified models and further expand the use of large-scale pretrained models in this field. The source code will be released.

Thu 23 Oct. 17:45 - 19:45 PDT

#341
Quadratic Gaussian Splatting: High Quality Surface Reconstruction with Second-order Geometric Primitives

ziyu zhang · Binbin Huang · Hanqing Jiang · Liyang Zhou · Xiaojun Xiang · Shuhan Shen

We propose Quadratic Gaussian Splatting (QGS), a novel representation that replaces static primitives with deformable quadric surfaces (e.g., ellipse, paraboloids) to capture intricate geometry. Unlike prior works that rely on Euclidean distance for primitive density modeling—a metric misaligned with surface geometry under deformation—QGS introduces geodesic distance-based density distributions. This innovation ensures that density weights adapt intrinsically to the primitive’s curvature, preserving consistency during shape changes (e.g., from planar disks to curved paraboloids). By solving geodesic distances in closed form on quadric surfaces, QGS enables surface-aware splatting, where a single primitive can represent complex curvature that previously required dozens of planar surfels, potentially reducing memory usage while maintaining real-time rendering via efficient ray-quadric intersection. Experiments on DTU, Tanks and Temples, and MipNeRF360 datasets demonstrate state-of-the-art surface reconstruction, with QGS reducing geometric error (chamfer distance) by 33% over 2DGS and 27% over GOF on the DTU dataset. Crucially, QGS retains competitive appearance quality, bridging the gap between geometric precision and visual fidelity for applications like robotics and immersive reality.

Thu 23 Oct. 17:45 - 19:45 PDT

#342
Uncertainty-Aware Diffusion-Guided Refinement of 3D Scenes

Sarosij Bose · Arindam Dutta · Sayak Nag · Junge Zhang · Jiachen Li · Konstantinos Karydis · Amit Roy-Chowdhury

Reconstructing 3D scenes from a single image is a fundamentally ill-posed task due to the severely under-constrained nature of the problem. Consequently, when the scene is rendered from novel camera views, particularly in unseen regions far away from the input camera, existing single image to 3D reconstruction methods render incoherent and blurry views. In this work, we address these inherent limitations in existing single image-to-3D scene feedforward networks. To alleviate the poor performance due to insufficient information beyond the input image’s view, we leverage a strong generative prior in the form of a pre-trained latent video diffusion model, for iterative refinement of a coarse scene represented by optimizable Gaussian parameters. To ensure that the style and texture of the generated images align with that of the input image, we incorporate on-the-fly Fourier-style transfer between the generated images and the input image. Additionally, we design a semantic uncertainty quantification module which calculates the per-pixel entropy and yields uncertainty maps which are used to guide the refinement process from the most confident pixels while discarding the remaining highly uncertain ones. We conduct extensive experiments on real-world scene datasets, including in-domain RealEstate-10K and out-of-domain KITTI-v2, showing that our approach can provide more realistic and high-fidelity novel view synthesis results compared to existing state-of-the-art methods.

Thu 23 Oct. 17:45 - 19:45 PDT

#343
VoxelKP: A Voxel-based Network Architecture for Human Keypoint Estimation in LiDAR Data

Jian Shi · Peter Wonka

We present \textit{VoxelKP}, a novel fully sparse network architecture tailored for human keypoint estimation in LiDAR data. The key challenge is that objects are distributed sparsely in 3D space, while human keypoint detection requires detailed local information wherever humans are present. First, we introduce a dual-branch \textit{fully sparse spatial-context block} where the spatial branch focuses on learning the local spatial correlations between keypoints within each human instance, while the context branch aims to retain the global spatial information. Second, we use a \textit{spatially aware multi-scale BEV fusion} technique to leverage absolute 3D coordinates when projecting 3D voxels to a 2D grid encoding a bird's eye view for better preservation of the global context of each human instance. We evaluate our method on the Waymo dataset and achieve an improvement of $27\%$ on the MPJPE metric compared to the state-of-the-art, \textit{HUM3DIL}, trained on the same data, and $12\%$ against the state-of-the-art, \textit{GC-KPL}, pretrained on a $25\times$ larger dataset. To the best of our knowledge, \textit{VoxelKP} is the first single-staged, fully sparse network that is specifically designed for addressing the challenging task of 3D keypoint estimation from LiDAR data, achieving state-of-the-art performance. Our code is available at \url{https://}.
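For reference, the MPJPE metric quoted in the abstract is simply the mean Euclidean distance over joints, as in the snippet below (standard definition, not code from the paper).

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean Per-Joint Position Error: average Euclidean distance between
    predicted and ground-truth 3D keypoints. pred, gt: (num_people, num_joints, 3)."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

pred = np.random.rand(2, 14, 3)
gt = pred + 0.01 * np.random.randn(2, 14, 3)
print(f"MPJPE: {mpjpe(pred, gt):.4f} m")
```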

Thu 23 Oct. 17:45 - 19:45 PDT

#344
χ: Symmetry Understanding of 3D Shapes via Chirality Disentanglement

Weikang Wang · Tobias Weißberg · Nafie El Amrani · Florian Bernard

Chirality information (i.e., information that allows distinguishing left from right) is ubiquitous across data modalities in computer vision, including images, videos, point clouds, and meshes. In contrast to symmetry, which has been studied extensively in the image domain, chirality information in shape analysis (point clouds and meshes) has remained underexplored. Although many shape vertex descriptors have shown appealing properties (e.g., robustness to rigid-body pose transformations), they are not able to disambiguate between left and right symmetric parts. Given the ubiquity of chirality information in shape analysis problems and the lack of chirality-aware features in current shape descriptors, developing a chirality feature extractor becomes necessary. In this paper, building on the recent Diff3f framework, we propose an unsupervised chirality feature extraction pipeline that equips shape vertices with chirality-aware information extracted from 2D foundation models. Quantitative and qualitative results from various experiments and downstream tasks, including left-right disentanglement, shape matching, and part segmentation on a variety of datasets, demonstrate the effectiveness and usefulness of our extracted chirality features. The code will be available once this work is accepted.

Thu 23 Oct. 17:45 - 19:45 PDT

#345
Foresight in Motion: Reinforcing Trajectory Prediction with Reward Heuristics

Muleilan Pei · Shaoshuai Shi · Xuesong Chen · Xu Liu · Shaojie Shen

Motion forecasting for on-road traffic agents presents both a significant challenge and a critical necessity for ensuring safety in autonomous driving systems. In contrast to most existing data-driven approaches that directly predict future trajectories, we rethink this task from a planning perspective, advocating a "First Reasoning, Then Forecasting" strategy that explicitly incorporates behavior intentions as spatial guidance for trajectory prediction. To achieve this, we introduce an interpretable, reward-driven intention reasoner grounded in a novel query-centric Inverse Reinforcement Learning (IRL) scheme. Our method first encodes traffic agents and scene elements into a unified vectorized representation, then aggregates contextual features through a query-centric paradigm. This enables the derivation of a reward distribution, a compact yet informative representation of the target agent's behavior within the given scene context via IRL. Guided by this reward heuristic, we perform policy rollouts to reason about multiple plausible intentions, providing valuable priors for subsequent trajectory generation. Finally, we develop a hierarchical DETR-like decoder integrated with bidirectional selective state space models to produce accurate future trajectories along with their associated probabilities. Extensive experiments on the large-scale Argoverse and nuScenes motion forecasting datasets demonstrate that our approach significantly enhances trajectory prediction confidence, achieving highly competitive performance relative to state-of-the-art methods.

Thu 23 Oct. 17:45 - 19:45 PDT

#346
MAESTRO: Task-Relevant Optimization via Adaptive Feature Enhancement and Suppression for Multi-task 3D Perception

ChangWon Kang · Jisong Kim · Hongjae Shin · Junseo Park · Jun Won Choi

Multi-task learning (MTL) has emerged as a promising approach to jointly optimize multiple perception tasks in autonomous driving, but existing methods suffer from feature interference and inefficient task-specific learning. In this paper, we introduce MAESTRO, a novel query-based framework that explicitly generates task-specific features to mitigate feature interference and improve efficiency in multi-task 3D perception. Our model consists of three key components: Semantic Query Generator (SQG), Task-Specific Feature Generator (TSFG), and Scene Query Aggregator (SQA). SQG generates query features and decomposes them into foreground and background queries to facilitate selective feature sharing. TSFG refines task-specific features by integrating decomposed queries with voxel features while suppressing irrelevant information. The detection and map heads generate task-aware queries, which SQA aggregates with the initially extracted queries from SQG to enhance semantic occupancy prediction. Extensive evaluations on the nuScenes benchmark show that MAESTRO achieves state-of-the-art performance across all tasks. Our model overcomes the performance trade-off among tasks in multi-task learning, where improving one task often hinders others, and sets a new benchmark in multi-task 3D perception.

Thu 23 Oct. 17:45 - 19:45 PDT

#347
Domain-aware Category-level Geometry Learning Segmentation for 3D Point Clouds

Pei He · Lingling Li · Licheng Jiao · Ronghua Shang · Fang Liu · Shuang Wang · Xu Liu · wenping ma

Domain generalization in 3D segmentation is a critical challenge in deploying models to unseen environments. Current methods mitigate the domain shift by augmenting the data distribution of point clouds. However, such models learn global geometric patterns in point clouds while ignoring category-level distribution and alignment. In this paper, a category-level geometry learning framework is proposed to explore domain-invariant geometric features for domain generalized 3D semantic segmentation. Specifically, Category-level Geometry Embedding (CGE) is first proposed to perceive the fine-grained geometric properties of point cloud features, constructing the geometric properties of each class and coupling geometric embedding to semantic learning. Second, Geometric Consistent Learning (GCL) is proposed to simulate the latent 3D distribution and align the category-level geometric embeddings, allowing the model to focus on geometry-invariant information to improve generalization. Experimental results verify the effectiveness of the proposed method, which achieves highly competitive segmentation accuracy compared with state-of-the-art domain-generalized point cloud methods. The code will be available.

Thu 23 Oct. 17:45 - 19:45 PDT

#348
Event-boosted Deformable 3D Gaussians for Dynamic Scene Reconstruction

Wenhao Xu · Wenming Weng · Yueyi Zhang · Ruikang Xu · Zhiwei Xiong

Deformable 3D Gaussian Splatting (3D-GS) is limited by missing intermediate motion information due to the low temporal resolution of RGB cameras. To address this, we introduce the first approach combining event cameras, which capture high-temporal-resolution, continuous motion data, with deformable 3D-GS for dynamic scene reconstruction. We observe that threshold modeling for events plays a crucial role in achieving high-quality reconstruction. Therefore, we propose a GS-Threshold Joint Modeling strategy, creating a mutually reinforcing process that greatly improves both 3D reconstruction and threshold modeling. Moreover, we introduce a Dynamic-Static Decomposition strategy that first identifies dynamic areas by exploiting the inability of static Gaussians to represent motions, then applies a buffer-based soft decomposition to separate dynamic and static areas. This strategy accelerates rendering by avoiding unnecessary deformation in static areas, and focuses on dynamic areas to enhance fidelity. Additionally, we contribute the first event-inclusive 4D benchmark with synthetic and real-world dynamic scenes, on which our method achieves state-of-the-art performance.
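
For background on the event threshold modeling discussed above: an event camera fires an event at a pixel when the log-intensity change exceeds a contrast threshold. A simplified sketch of that generation model (not the paper's joint GS-threshold strategy), with two hypothetical grayscale frames:

```python
import numpy as np

def simulate_events(prev_frame: np.ndarray, curr_frame: np.ndarray,
                    threshold: float = 0.2, eps: float = 1e-3) -> np.ndarray:
    """Return an (H, W) map of event polarities (-1, 0, +1) from a simple
    log-intensity contrast-threshold model."""
    delta = np.log(curr_frame + eps) - np.log(prev_frame + eps)
    events = np.zeros_like(delta, dtype=np.int8)
    events[delta > threshold] = 1
    events[delta < -threshold] = -1
    return events

prev_frame = np.random.rand(256, 256)
curr_frame = np.clip(prev_frame + 0.1 * np.random.randn(256, 256), 0.0, 1.0)
print(np.abs(simulate_events(prev_frame, curr_frame)).sum(), "pixels fired an event")
```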

Thu 23 Oct. 17:45 - 19:45 PDT

#349
ToF-Splatting: Dense SLAM using Sparse Time-of-Flight Depth and Multi-Frame Integration

Andrea Conti · Matteo Poggi · Valerio Cambareri · Martin R. Oswald · Stefano Mattoccia

Time-of-Flight (ToF) sensors provide efficient active depth sensing at relatively low power budgets; among such designs, only very sparse measurements from low-resolution sensors are considered in order to meet the increasingly tight power constraints of mobile and AR/VR devices. However, such extreme sparsity levels limit the seamless usage of ToF depth in SLAM. In this work, we propose ToF-Splatting, the first 3D Gaussian Splatting-based SLAM pipeline tailored to make effective use of very sparse ToF input data. Our approach improves upon the state of the art by introducing a multi-frame integration module, which produces dense depth maps by merging cues from extremely sparse ToF depth, monocular color, and multi-view geometry. Extensive experiments on both synthetic and real sparse ToF datasets demonstrate the viability of our approach, which achieves state-of-the-art tracking and mapping performance on reference datasets.

Thu 23 Oct. 17:45 - 19:45 PDT

#350
Joint Semantic and Rendering Enhancements in 3D Gaussian Modeling with Anisotropic Local Encoding

Jingming He · Chongyi Li · Shiqi Wang · Sam Kwong

Recent works propose extending 3DGS with semantic feature vectors for simultaneous semantic segmentation and image rendering. However, these methods often treat the semantic and rendering branches separately, relying solely on 2D supervision while ignoring the 3D Gaussian geometry. Moreover, current adaptive strategies adjust the Gaussian set based solely on rendering gradients, which can be insufficient in subtle or textureless regions. In this work, we propose a joint enhancement framework for 3D semantic Gaussian modeling that synergizes both semantic and rendering branches. First, unlike conventional point cloud shape encoding, we introduce an anisotropic 3D Gaussian Chebyshev descriptor using the Laplace–Beltrami operator to capture fine-grained 3D shape details, thereby distinguishing objects with similar appearances and reducing reliance on potentially noisy 2D guidance. In addition, rather than relying solely on rendering gradients, we adaptively adjust Gaussian allocation and spherical harmonics (SH) with local semantic and shape signals, enhancing rendering efficiency through selective resource allocation. Finally, we employ a cross-scene knowledge transfer module to continuously update learned shape patterns, enabling faster convergence and robust representations without relearning shape information from scratch for each new scene. Experiments on multiple datasets demonstrate improvements in segmentation accuracy and rendering quality while maintaining high rendering frame rates.
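
The Chebyshev descriptor mentioned above builds on Chebyshev polynomial expansions of the Laplace–Beltrami (graph Laplacian) operator. A minimal sketch of the standard Chebyshev recurrence applied to a rescaled Laplacian and a per-vertex signal (the paper's anisotropic descriptor construction is not reproduced; `L`, `x`, and the coefficients below are hypothetical):

```python
import numpy as np

def chebyshev_filter(L: np.ndarray, x: np.ndarray, coeffs) -> np.ndarray:
    """Apply sum_k c_k T_k(L_hat) x, where L_hat is the Laplacian rescaled to
    [-1, 1] and T_k follows the recurrence T_{k+1} = 2 L_hat T_k - T_{k-1}."""
    lmax = max(np.linalg.eigvalsh(L).max(), 1e-8)
    L_hat = (2.0 / lmax) * L - np.eye(L.shape[0])
    t_prev, t_curr = x, L_hat @ x                    # T_0 x and T_1 x
    out = coeffs[0] * t_prev + coeffs[1] * t_curr
    for c in coeffs[2:]:
        t_prev, t_curr = t_curr, 2.0 * (L_hat @ t_curr) - t_prev
        out += c * t_curr
    return out

# Hypothetical usage on a tiny random mesh-like graph.
A = np.random.rand(50, 50)
A = ((A + A.T) > 1.4).astype(float)
np.fill_diagonal(A, 0.0)
L = np.diag(A.sum(axis=1)) - A                       # combinatorial graph Laplacian
x = np.random.rand(50, 16)                           # 16-dim per-vertex signal
filtered = chebyshev_filter(L, x, coeffs=[0.5, 0.3, 0.2])
```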

Thu 23 Oct. 17:45 - 19:45 PDT

#351
Unsupervised Imaging Inverse Problems with Diffusion Distribution Matching

Giacomo Meanti · Thomas Ryckeboer · Michael Arbel · Julien Mairal

Inverse problems provide a fundamental framework for image reconstruction tasks such as deblurring, calibration, and low-light enhancement. While widely used, they often assume full knowledge of the forward model---an unrealistic expectation---while collecting ground truth and measurement pairs is time-consuming and labor-intensive. Without paired supervision or an invertible forward model, solving inverse problems becomes significantly more challenging and error-prone. To address this, strong priors have traditionally been introduced to regularize the problem, enabling solutions from single images alone. In this work, however, we demonstrate that with minimal assumptions on the forward model and by leveraging small, unpaired clean and degraded datasets, we can achieve good estimates of the true degradation. We employ conditional flow matching to efficiently model the degraded data distribution and explicitly learn the forward model using a tailored distribution-matching loss. Through experiments on uniform and non-uniform deblurring tasks, we show that our method outperforms both single-image blind and unsupervised approaches, narrowing the gap to non-blind methods. We also showcase the effectiveness of our method with a proof of concept for automatic lens calibration---a real-world application traditionally requiring time-consuming experiments and specialized equipment. In contrast, our approach achieves this with minimal data acquisition effort.
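
Conditional flow matching, mentioned above, trains a network to regress the velocity of an interpolation path between noise and data samples. A generic, linear-path PyTorch sketch of that training objective (not the authors' tailored distribution-matching loss; the tiny network and tensor shapes are hypothetical):

```python
import torch
import torch.nn as nn

def flow_matching_loss(velocity_net: nn.Module, x1: torch.Tensor) -> torch.Tensor:
    """Linear-path conditional flow matching loss for a batch x1 of shape (B, C, H, W)."""
    x0 = torch.randn_like(x1)                      # noise endpoint of the path
    t = torch.rand(x1.shape[0], 1, 1, 1, device=x1.device)
    xt = (1.0 - t) * x0 + t * x1                   # sample a point on the straight path
    target_v = x1 - x0                             # constant velocity of that path
    pred_v = velocity_net(xt, t.flatten())
    return ((pred_v - target_v) ** 2).mean()

class TinyVelocityNet(nn.Module):
    """Toy velocity network; a real model would also condition on t."""
    def __init__(self, channels: int = 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, 32, 3, padding=1), nn.SiLU(),
            nn.Conv2d(32, channels, 3, padding=1),
        )

    def forward(self, x, t):
        return self.net(x)

net = TinyVelocityNet()
loss = flow_matching_loss(net, torch.randn(4, 3, 32, 32))
loss.backward()
```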

Thu 23 Oct. 17:45 - 19:45 PDT

#352
R-LiViT: A LiDAR-Visual-Thermal Dataset Enabling Vulnerable Road User Focused Roadside Perception

Jonas Mirlach · Lei Wan · Andreas Wiedholz · Hannan Keen · Andreas Eich

In autonomous driving, the integration of roadside perception systems is essential for overcoming occlusion challenges and enhancing the safety of vulnerable road users (VRUs). While LiDAR and visual (RGB) sensors are commonly used, thermal imaging remains underrepresented in datasets, despite its acknowledged advantages for VRU detection in extreme lighting conditions. In this paper, we present R-LiViT, the first dataset to combine LiDAR, RGB, and thermal imaging from a roadside perspective, with a strong focus on VRUs. R-LiViT captures three intersections during both day and night, ensuring a diverse dataset. It includes 10,000 LiDAR frames and 2,400 temporally and spatially aligned RGB and thermal images across over 150 traffic scenarios, with 6 and 8 annotated classes, respectively, providing a comprehensive resource for tasks such as object detection and tracking. The dataset and the code for reproducing our evaluation results are made publicly available.

Thu 23 Oct. 17:45 - 19:45 PDT

#353
V2XScenes: A Multiple Challenging Traffic Conditions Dataset for Large-Range Vehicle-Infrastructure Collaborative Perception

Bowen Wang · Yafei Wang · Wei Gong · Siheng Chen · Genjia Liu · Minhao Xiong · Chin Long Ng

Whether autonomous driving can effectively handle challenging scenarios such as bad weather and complex traffic environments is still in doubt. One of the critical difficulties is that single-view perception makes it hard to obtain complementary perceptual information in multi-condition scenes, such as those involving occlusion and congestion. To investigate the advantages of collaborative perception in high-risk driving scenarios, we construct a dataset covering multiple challenging conditions for large-range vehicle-infrastructure cooperative perception, called V2XScenes, which includes seven typical multi-modal layouts along successive road sections. In particular, each selected scene is labeled with a specific condition description, and we provide unique object tracking numbers across the entire road section and sequential frames to ensure consistency. Comprehensive cooperative perception benchmarks for 3D object detection and tracking in large-range roadside scenes are summarized, and the quantitative results based on state-of-the-art methods demonstrate the effectiveness of collaborative perception in challenging scenes. The data and benchmark code of V2XScenes will be released.

Thu 23 Oct. 17:45 - 19:45 PDT

#354
mmCooper: A Multi-agent Multi-stage Communication-efficient and Collaboration-robust Cooperative Perception Framework

Bingyi Liu · Jian Teng · Hongfei Xue · Enshu Wang · Chuanhui Zhu · Pu Wang · Libing Wu

Collaborative perception significantly enhances individual vehicle perception performance through the exchange of sensory information among agents. However, real-world deployment faces challenges due to bandwidth constraints and inevitable calibration errors during information exchange. To address these issues, we propose mmCooper, a novel multi-agent, multi-stage, communication-efficient, and collaboration-robust cooperative perception framework. Our framework leverages a multi-stage collaboration strategy that dynamically and adaptively balances the intermediate- and late-stage information shared among agents, enhancing perceptual performance while maintaining communication efficiency. To support robust collaboration despite potential misalignments and calibration errors, our framework prevents misleading low-confidence sensing information from being transmitted and refines the received detection results from collaborators to improve accuracy. Extensive evaluation results on both real-world and simulated datasets demonstrate the effectiveness of the mmCooper framework and its components.

Thu 23 Oct. 17:45 - 19:45 PDT

#355
SC-Lane: Slope-aware and Consistent Road Height Estimation Framework for 3D Lane Detection

Chaesong Park · Eunbin Seo · Jihyeon Hwang · Jongwoo Lim

In this paper, we introduce SC-Lane, a novel slope-aware and temporally consistent heightmap estimation framework for 3D lane detection. Unlike previous approaches that rely on fixed slope anchors, SC-Lane adaptively determines the optimal fusion of slope-specific height features, improving robustness to diverse road geometries. To achieve this, we propose a Slope-Aware Adaptive Feature module that dynamically predicts the optimal weights for integrating multi-slope representations into a unified heightmap. Additionally, a Height Consistency Module enforces temporal coherence, ensuring stable and accurate height estimation across consecutive frames, which is crucial for real-world driving scenarios. To evaluate the effectiveness of SC-Lane, we introduce a LiDAR-derived heightmap dataset and adopt standard evaluation metrics, including Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and threshold-based accuracy. While these metrics are widely used in surface and depth estimation, their application to road height estimation has been underexplored. Extensive experiments on the OpenLane benchmark demonstrate that SC-Lane significantly improves both height estimation and 3D lane detection, achieving state-of-the-art performance with an F-score of 64.3% — outperforming existing methods by a notable margin. These results highlight SC-Lane’s potential for enhancing the reliability of autonomous driving perception. The code and dataset used in this study will be made publicly available upon publication.
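
The MAE, RMSE, and threshold-based accuracy metrics adopted above are the standard depth-style measures; a minimal NumPy sketch with hypothetical predicted and ground-truth height maps (the 1.25 ratio threshold is the common convention, assumed here rather than taken from the paper):

```python
import numpy as np

def height_metrics(pred: np.ndarray, gt: np.ndarray, delta: float = 1.25) -> dict:
    """MAE, RMSE, and ratio-threshold accuracy for two positive height maps."""
    err = pred - gt
    ratio = np.maximum(pred / gt, gt / pred)
    return {
        "mae": float(np.abs(err).mean()),
        "rmse": float(np.sqrt((err ** 2).mean())),
        f"acc<{delta}": float((ratio < delta).mean()),
    }

gt = np.random.uniform(0.5, 3.0, size=(128, 128))
pred = np.clip(gt + 0.1 * np.random.randn(128, 128), 0.1, None)
print(height_metrics(pred, gt))
```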

Thu 23 Oct. 17:45 - 19:45 PDT

#356
ForestFormer3D: A Unified Framework for End-to-End Segmentation of Forest LiDAR 3D Point Clouds

Binbin Xiang · Maciej Wielgosz · Stefano Puliti · Kamil Král · Martin Krůček · Azim Missarov · Rasmus Astrup

The segmentation of forest LiDAR 3D point clouds, including both individual tree and semantic segmentation, is fundamental for advancing forest management and ecological research. However, current approaches often struggle with the complexity and variability of natural forest environments. We present ForestFormer3D, a new unified and end-to-end framework designed for precise individual tree and semantic segmentation. ForestFormer3D incorporates ISA-guided query point selection, a score-based block merging strategy during inference, and a one-to-many association mechanism for effective training. By combining these new components, our model achieves state-of-the-art performance for individual tree segmentation on the newly introduced FOR-instanceV2 dataset, which spans diverse forest types and regions. Additionally, ForestFormer3D generalizes well to unseen test sets (Wytham woods and LAUTx), showcasing its robustness across different forest conditions and sensor modalities. The FOR-instanceV2 dataset and the ForestFormer3D code will be released post-acceptance.

Thu 23 Oct. 17:45 - 19:45 PDT

#357
Easy3D: A Simple Yet Effective Method for 3D Interactive Segmentation

Andrea Simonelli · Norman Müller · Peter Kontschieder

The increasing availability of digital 3D environments, whether through image reconstruction, generation, or scans obtained via lasers or robots, is driving innovation across various fields. Among the numerous applications, there is a significant demand for those that enable 3D interaction, such as 3D Interactive Segmentation, which is useful for tasks like object selection and manipulation. Additionally, there is a persistent need for solutions that are efficient, precise, and consistently perform well across diverse settings, particularly in unseen environments and with unfamiliar objects. In this work, we introduce a method that consistently surpasses previous state-of-the-art techniques on both in-domain and out-of-domain datasets. Our simple approach integrates a voxel-based sparse encoder with a lightweight transformer-based decoder that implements implicit click fusion, achieving superior performance and maximizing efficiency. Our method demonstrates substantial improvements on benchmark datasets, including ScanNet, ScanNet++, S3DIS, and KITTI-360, and also on unseen geometric distributions such as Gaussian Splatting.

Thu 23 Oct. 17:45 - 19:45 PDT

#358
Robust 3D Object Detection using Probabilistic Point Clouds from Single-Photon LiDARs

Bhavya Goyal · Felipe Gutierrez-Barragan · Wei Lin · Andreas Velten · Yin Li · Mohit Gupta

LiDAR-based 3D sensors provide point clouds, a canonical 3D representation used in various 3D scene understanding tasks. Modern LiDARs face key challenges in various real-world scenarios such as long-distance or low-albedo objects, producing sparse or erroneous point clouds. These errors, which are rooted in the noisy raw LiDAR measurements, get propagated to downstream perception models, resulting in potentially severe loss of accuracy. This is because conventional 3D processing pipelines used to construct point clouds from raw LiDAR measurements do not retain the uncertainty information available in the raw sensor data. We propose a novel 3D scene representation called Probabilistic Point Clouds (PPC) where each point is augmented with a probability attribute that encapsulates the measurement uncertainty (confidence) in raw data. We further introduce inference approaches that leverage PPC for robust 3D object detection; these methods are versatile and can be used as computationally lightweight drop-in modules in 3D inference pipelines. We demonstrate, via both simulations and real captures, that PPC-based 3D inference methods outperform several baselines with LiDAR as well as camera-LiDAR fusion models, across challenging indoor and outdoor scenarios involving small, distant, and low-albedo objects, as well as strong ambient light.
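
As a hedged illustration of the Probabilistic Point Cloud idea of attaching a per-point confidence attribute (the paper's inference modules are not shown), one can carry a fourth channel alongside the coordinates and filter or weight by it; all names below are hypothetical:

```python
import numpy as np

def make_ppc(xyz: np.ndarray, confidence: np.ndarray) -> np.ndarray:
    """Stack (N, 3) coordinates with (N,) per-point confidences into an (N, 4) array."""
    return np.concatenate([xyz, confidence[:, None]], axis=1)

def filter_ppc(ppc: np.ndarray, min_conf: float = 0.5) -> np.ndarray:
    """Drop points whose measurement confidence falls below a threshold."""
    return ppc[ppc[:, 3] >= min_conf]

xyz = np.random.randn(5000, 3)
conf = np.random.rand(5000)
ppc = make_ppc(xyz, conf)
print(filter_ppc(ppc).shape)
```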

Thu 23 Oct. 17:45 - 19:45 PDT

#359
SViM3D: Stable Video Material Diffusion for Single Image 3D Generation

Andreas Engelhardt · Mark Boss · Vikram Voleti · Chun-Han Yao · Hendrik Lensch · Varun Jampani

We present Stable Video Materials 3D (SViM3D), a framework to predict multi-view consistent physically based rendering (PBR) materials, given a single image. Recently, video diffusion models have been successfully used to reconstruct 3D objects from a single image efficiently. However, reflectance is still represented by simple material models or needs to be estimated in additional pipeline steps to enable relighting and controlled appearance edits. We extend a latent video diffusion model to output spatially-varying PBR parameters and surface normals jointly with each generated RGB view based on explicit camera control. This unique setup allows for direct relighting in a 2.5D setting, and for generating a 3D asset using our model as neural prior. We introduce various mechanisms to this pipeline that improve quality in this ill-posed setting. We show state-of-the-art relighting and novel view synthesis performance on multiple object-centric datasets. Our method generalizes to diverse image inputs, enabling the generation of relightable 3D assets useful in AR/VR, movies, games and other visual media.

Thu 23 Oct. 17:45 - 19:45 PDT

#360
Highlight
HouseCrafter: Lifting Floorplans to 3D Scenes with 2D Diffusion Models

YIWEN CHEN · Hieu (Hayden) Nguyen · Vikram Voleti · Varun Jampani · Huaizu Jiang

We introduce HouseCrafter, a novel approach that can lift a 2D floorplan into a complete large 3D indoor scene (e.g., a house). Our key insight is to adapt a 2D diffusion model, which is trained on web-scale images, to generate consistent multi-view color (RGB) and depth (D) images across different locations of the scene. Specifically, the RGB-D images are generated autoregressively in batches along sampled locations derived from the floorplan. At each step, the diffusion model conditions on previously generated images to produce new images at nearby locations. The global floorplan and attention design in the diffusion model ensures the consistency of the generated images, from which a 3D scene can be reconstructed. Through extensive evaluation on the 3D-FRONT dataset, we demonstrate that HouseCrafter can generate high-quality house-scale 3D scenes. Ablation studies also validate the effectiveness of different design choices. We will release our code and model weights.

Thu 23 Oct. 17:45 - 19:45 PDT

#361
Leveraging BEV Paradigm for Ground-to-Aerial Image Synthesis

Junyan Ye · Jun He · Weijia Li · Zhutao Lv · Yi Lin · Jinhua Yu · Haote Yang · Conghui He

Ground-to-aerial image synthesis focuses on generating realistic aerial images from corresponding ground street view images while maintaining consistent content layout, simulating a top-down view. The significant viewpoint difference leads to domain gaps between views, and dense urban scenes limit the visible range of street views, making this cross-view generation task particularly challenging. In this paper, we introduce SkyDiffusion, a novel cross-view generation method for synthesizing aerial images from street view images, utilizing a diffusion model and the Bird’s-Eye View (BEV) paradigm. The Curved-BEV method in SkyDiffusion converts street-view images into a BEV perspective, effectively bridging the domain gap, and employs a "multi-to-one" mapping strategy to address occlusion issues in dense urban scenes. SkyDiffusion then employs a BEV-guided diffusion model to generate content-consistent and realistic aerial images. Additionally, we introduce a novel dataset, Ground2Aerial-3, designed for diverse ground-to-aerial image synthesis applications, including disaster scene aerial synthesis, low-altitude UAV image synthesis, and historical high-resolution satellite image synthesis tasks. Experimental results demonstrate that SkyDiffusion outperforms state-of-the-art methods on cross-view datasets across natural (CVUSA), suburban (CVACT), urban (VIGOR-Chicago), and various application scenarios (G2A-3), achieving realistic and content-consistent aerial image generation. The code, datasets, and more information about this work can be found at https://skydiffusion0307.github.io/.

Thu 23 Oct. 17:45 - 19:45 PDT

#362
EMD: Explicit Motion Modeling for High-Quality Street Gaussian Splatting

Xiaobao Wei · Qingpo Wuwu · Zhongyu Zhao · Zhuangzhe Wu · Nan Huang · Ming Lu · ningning ma · Shanghang Zhang

Photorealistic reconstruction of street scenes is essential for developing real-world simulators in autonomous driving. While recent methods based on 3D/4D Gaussian Splatting (GS) have demonstrated promising results, they still encounter challenges in complex street scenes due to the unpredictable motion of dynamic objects. Current methods typically decompose street scenes into static and dynamic objects, learning the Gaussians in either a supervised manner (e.g., w/ 3D bounding-box) or a self-supervised manner (e.g., w/o 3D bounding-box). However, these approaches do not effectively model the motions of dynamic objects (e.g., the motion speed of pedestrians is clearly different from that of vehicles), resulting in suboptimal scene decomposition. To address this, we propose Explicit Motion Decomposition (EMD), which models the motions of dynamic objects by introducing learnable motion embeddings to the Gaussians, enhancing the decomposition in street scenes. The proposed plug-and-play EMD module compensates for the lack of motion modeling in self-supervised street Gaussian splatting methods. We also introduce tailored training strategies to extend EMD to supervised approaches. Comprehensive experiments demonstrate the effectiveness of our method, achieving state-of-the-art novel view synthesis performance in self-supervised settings. The code will be released.
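
As a hedged illustration of attaching learnable motion embeddings to Gaussians (the details differ from the paper's EMD module), each Gaussian can carry an embedding vector, and a small MLP can map embedding plus time to a positional offset; everything below is a hypothetical toy model:

```python
import torch
import torch.nn as nn

class MotionEmbeddedGaussians(nn.Module):
    """Toy model: Gaussian centers = base position + MLP(per-Gaussian embedding, t)."""
    def __init__(self, num_gaussians: int, embed_dim: int = 16):
        super().__init__()
        self.base_xyz = nn.Parameter(torch.randn(num_gaussians, 3))
        self.motion_embed = nn.Parameter(torch.zeros(num_gaussians, embed_dim))
        self.deform = nn.Sequential(
            nn.Linear(embed_dim + 1, 64), nn.ReLU(), nn.Linear(64, 3),
        )

    def forward(self, t: float) -> torch.Tensor:
        t_col = torch.full((self.base_xyz.shape[0], 1), float(t))
        offset = self.deform(torch.cat([self.motion_embed, t_col], dim=-1))
        return self.base_xyz + offset

model = MotionEmbeddedGaussians(num_gaussians=1000)
xyz_t = model(t=0.3)   # deformed Gaussian centers at normalized time 0.3
```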

Thu 23 Oct. 17:45 - 19:45 PDT

#363
Learning Null Geodesics for Gravitational Lensing Rendering in General Relativity

Mingyuan Sun · Zheng Fang · Jiaxu Wang · Kun-Yi Zhang · Qiang Zhang · Renjing Xu

We present GravlensX, an innovative method for rendering black holes with gravitational lensing effects using neural networks. The methodology involves training neural networks to fit the spacetime around black holes and then employing these trained models to generate the paths of light rays affected by gravitational lensing. This enables efficient and scalable simulations of black holes, significantly decreasing the time required for rendering compared to traditional methods. We validate our approach through extensive rendering of multiple black hole systems with superposed Kerr metric, demonstrating its capability to produce accurate visualizations with a significant $15\times$ reduction in computational time. Our findings suggest that neural networks offer a promising alternative for rendering complex astrophysical phenomena, potentially paving a new path toward astronomical visualization. Our code will be open-sourced soon.

Thu 23 Oct. 17:45 - 19:45 PDT

#364
Highlight
Kaleidoscopic Background Attack: Disrupting Pose Estimation with Multi-Fold Radial Symmetry Textures

Xinlong Ding · Hongwei Yu · Jiawei Li · Feifan Li · Yu Shang · Bochao Zou · Huimin Ma · Jiansheng Chen

Camera pose estimation is a fundamental computer vision task that is essential for applications like visual localization and multi-view stereo reconstruction. In the object-centric scenarios with sparse inputs, the accuracy of pose estimation can be significantly influenced by background textures that occupy major portions of the images across different viewpoints. In light of this, we introduce the Kaleidoscopic Background Attack (KBA), which uses identical segments to form discs with multi-fold radial symmetry. These discs maintain high similarity across different viewpoints, enabling effective attacks on pose estimation models even with natural texture segments. Additionally, a projected orientation consistency loss is proposed to optimize the kaleidoscopic segments, leading to significant enhancement in the attack effectiveness. Experimental results show that adversarial kaleidoscopic backgrounds optimized by KBA can effectively attack various camera pose estimation models.

Thu 23 Oct. 17:45 - 19:45 PDT

#365
Top2Pano: Learning to Generate Indoor Panoramas from Top-Down View

Zitong Zhang · Suranjan Gautam · Rui Yu

Generating immersive 360° indoor panoramas from 2D top-down views has applications in virtual reality, interior design, real estate, and robotics. This task is challenging due to the lack of explicit 3D structure and the need for geometric consistency and photorealism. We propose Top2Pano, an end-to-end model for synthesizing realistic indoor panoramas from top-down views. Our method estimates volumetric occupancy to infer 3D structures, then uses volumetric rendering to generate coarse color and depth panoramas. These guide a diffusion-based refinement stage using ControlNet, enhancing realism and structural fidelity. Evaluations on two datasets show Top2Pano outperforms baselines, effectively reconstructing geometry, occlusions, and spatial arrangements. It also generalizes well, producing high-quality panoramas from schematic floorplans. Our results highlight Top2Pano's potential in bridging top-down views with immersive indoor synthesis.

Thu 23 Oct. 17:45 - 19:45 PDT

#366
Perspective-aware 3D Gaussian Inpainting with Multi-view Consistency

Yuxin CHENG · Binxiao Huang · Taiqiang Wu · Wenyong Zhou · Chenchen Ding · Zhengwu Liu · Graziano Chesi · Ngai Wong

3D Gaussian inpainting, a critical technique for numerous applications in virtual reality and multimedia, has made significant progress with pretrained diffusion models. However, ensuring multi-view consistency, an essential requirement for high-quality inpainting, remains a key challenge. In this work, we present PAInpainter, a novel approach designed to advance 3D Gaussian inpainting by leveraging perspective-aware content propagation and consistency verification across multi-view inpainted images. Our method iteratively refines inpainting and optimizes the 3D Gaussian representation with multiple views adaptively sampled from a perspective graph. By propagating inpainted images as prior information and verifying consistency across neighboring views, PAInpainter substantially enhances global consistency and texture fidelity in restored 3D scenes. Extensive experiments demonstrate the superiority of PAInpainter over existing methods. Our approach achieves superior 3D inpainting quality, with PSNR scores of 26.03 dB and 29.51 dB on the SPIn-NeRF and NeRFiller datasets, respectively, highlighting its effectiveness and generalization capability.
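
PSNR, used above to report inpainting quality, is a simple function of mean squared error; a minimal NumPy sketch assuming hypothetical images scaled to [0, 1]:

```python
import numpy as np

def psnr(img_a: np.ndarray, img_b: np.ndarray, max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB for images scaled to [0, max_val]."""
    mse = np.mean((img_a - img_b) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)

clean = np.random.rand(256, 256, 3)
noisy = np.clip(clean + 0.05 * np.random.randn(256, 256, 3), 0.0, 1.0)
print(f"{psnr(clean, noisy):.2f} dB")
```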

Thu 23 Oct. 17:45 - 19:45 PDT

#367
SparseRecon: Neural Implicit Surface Reconstruction from Sparse Views with Feature and Depth Consistencies

Liang Han · Xu Zhang · Haichuan Song · Kanle Shi · Liang Han · Zhizhong Han

Surface reconstruction from sparse views aims to reconstruct a 3D shape or scene from a few RGB images. However, existing generalization-based methods do not generalize well to views unseen during training, while the reconstruction quality of overfitting-based methods is still limited by scarce geometry clues. To address this issue, we propose SparseRecon, a novel neural implicit reconstruction method for sparse views with volume rendering-based feature consistency and an uncertainty-guided depth constraint. Firstly, we introduce a feature consistency loss across views to constrain the neural implicit field. This design alleviates the ambiguity caused by insufficient cross-view consistency information and ensures completeness and smoothness in the reconstruction results. Secondly, we employ an uncertainty-guided depth constraint to back up the feature consistency loss in areas with occlusion and insignificant features, which recovers geometry details for better reconstruction quality. Experimental results demonstrate that our method outperforms state-of-the-art methods, producing high-quality geometry from sparse-view input, especially in scenarios with small overlapping views.

Thu 23 Oct. 17:45 - 19:45 PDT

#368
SurfaceSplat: Connecting Surface Reconstruction and Gaussian Splatting

Zihui Gao · Jia-Wang Bian · Guosheng Lin · Hao Chen · Chunhua Shen

Surface reconstruction and novel view rendering from sparse-view images are challenging. Signed Distance Function (SDF)-based methods struggle with fine details, while 3D Gaussian Splatting (3DGS)-based approaches lack global geometry coherence. We propose a novel hybrid method that combines both strengths: SDF captures coarse geometry to enhance 3DGS-based rendering, while newly rendered images from 3DGS refine SDF details for accurate surface reconstruction. As a result, our method surpasses state-of-the-art approaches in surface reconstruction and novel view synthesis on the DTU and MobileBrick datasets. The code will be released upon acceptance.

Thu 23 Oct. 17:45 - 19:45 PDT

#369
SAM4D: Segment Anything in Camera and LiDAR Streams

Jianyun Xu · Song Wang · Ziqian Ni · Chunyong Hu · Sheng Yang · Jianke Zhu · Qiang Li

We present SAM4D, a multi-modal and temporal foundation model designed for promptable segmentation across camera and LiDAR streams. Unified Multi-modal Positional Encoding (UMPE) is introduced to align camera and LiDAR features in a shared 3D space, enabling seamless cross-modal prompting and interaction. Additionally, we propose Motion-aware Cross-modal Memory Attention (MCMA), which leverages ego-motion compensation to enhance temporal consistency and long-horizon feature retrieval, ensuring robust segmentation across dynamically changing autonomous driving scenes. To avoid annotation bottlenecks, we develop a multi-modal automated data engine that synergizes VFM-driven video masklets, spatiotemporal 4D reconstruction, and cross-modal masklet fusion. This framework generates camera-LiDAR aligned pseudo-labels at a speed orders of magnitude faster than human annotation while preserving VFM-derived semantic fidelity in point cloud representations. We conduct extensive experiments on the constructed Waymo-4DSeg, which demonstrate the powerful cross-modal segmentation ability and great potential in data annotation of proposed SAM4D.

Thu 23 Oct. 17:45 - 19:45 PDT

#370
StruMamba3D: Exploring Structural Mamba for Self-supervised Point Cloud Representation Learning

Chuxin Wang · Yixin Zha · Wenfei Yang · Tianzhu Zhang

Recently, Mamba-based methods have demonstrated impressive performance in point cloud representation learning by leveraging the State Space Model (SSM), with its efficient context modeling ability and linear complexity. However, these methods still face two key issues that limit the potential of SSM: destroying the adjacency of 3D points during SSM processing and failing to retain long-sequence memory as the input length increases in downstream tasks. To address these issues, we propose StruMamba3D, a novel paradigm for self-supervised point cloud representation learning. It enjoys several merits. First, we design spatial states and use them as proxies to preserve spatial dependencies among points. Second, we enhance the SSM with a state-wise update strategy and incorporate a lightweight convolution to facilitate interactions between spatial states for efficient structure modeling. Third, our method reduces the sensitivity of pre-trained Mamba-based models to varying input lengths by introducing a sequence length-adaptive strategy. Experimental results across four downstream tasks showcase the superior performance of our method. In addition, our method attains the SOTA 95.1\% accuracy on ModelNet40 and 92.75\% accuracy on the most challenging split of ScanObjectNN without a voting strategy.

Thu 23 Oct. 17:45 - 19:45 PDT

#371
Representing 3D Shapes With 64 Latent Vectors for 3D Diffusion Models

In Cho · Youngbeom Yoo · Subin Jeon · Seon Joo Kim

Constructing a compressed latent space through a variational autoencoder (VAE) is the key for efficient 3D diffusion models. This paper introduces COD-VAE, a VAE that encodes 3D shapes into a COmpact set of 1D latent vectors without sacrificing quality. COD-VAE introduces a two-stage autoencoder scheme to improve compression and decoding efficiency. First, our encoder block progressively compresses point clouds into compact latent vectors via intermediate point patches. Second, our triplane-based decoder reconstructs dense triplanes from latent vectors instead of directly decoding neural fields, significantly reducing the computational overhead of neural field decoding. Finally, we propose uncertainty-guided token pruning, which allocates resources adaptively by skipping computations in simpler regions and improves the decoder efficiency. Experimental results demonstrate that COD-VAE achieves 16$\times$ compression compared to the baseline while maintaining quality. This enables a $20.8\times$ speedup in generation, highlighting that a large number of latent vectors is not a prerequisite for high-quality reconstruction and generation.

Thu 23 Oct. 17:45 - 19:45 PDT

#372
Scaling Transformer-Based Novel View Synthesis Models with Token Disentanglement and Synthetic Data

Nithin Gopalakrishnan Nair · Srinivas Kaza · Xuan Luo · Jungyeon Park · Stephen Lombardi · Vishal Patel

Large transformer-based models have made significant progress in generalizable novel view synthesis (NVS) from sparse input views, generating novel viewpoints without the need for test-time optimization. However, these models are constrained by the limited diversity of publicly available scene datasets, making most real-world (in-the-wild) scenes out-of-distribution. To overcome this, we incorporate synthetic training data generated from diffusion models, which improves generalization across unseen domains. While synthetic data offers scalability, we identify artifacts introduced during data generation as a key bottleneck affecting reconstruction quality. To address this, we propose a token disentanglement process within the transformer architecture, enhancing feature separation and ensuring more effective learning. This refinement not only improves reconstruction quality over standard transformers but also enables scalable training with synthetic data. As a result, our method outperforms existing models on both in-dataset and cross-dataset evaluations, achieving state-of-the-art results across multiple benchmarks while significantly reducing computational costs.

Thu 23 Oct. 17:45 - 19:45 PDT

#373
LINR-PCGC: Lossless Implicit Neural Representations for Point Cloud Geometry Compression

Wenjie Huang · Qi Yang · Shuting Xia · He Huang · Yiling Xu · Zhu Li

Existing AI-based point cloud compression methods struggle with dependence on specific training data distributions, which limits their real-world deployment. Implicit Neural Representation (INR) methods address this problem by encoding overfitted network parameters into the bitstream, yielding more distribution-agnostic results. However, due to limitations in encoding time and decoder size, current INR-based methods only consider lossy geometry compression. In this paper, we propose the first lossless INR-based point cloud geometry compression method, called Lossless Implicit Neural Representations for Point Cloud Geometry Compression (LINR-PCGC). To accelerate encoding, we design a group-level coding framework over groups of point clouds with an effective network initialization strategy, which reduces encoding time by around 60%. A lightweight coding network based on multiscale SparseConv, consisting of scale context extraction, child node prediction, and model compression modules, is proposed to realize fast inference and a compact decoder size. Experimental results show that our method consistently outperforms traditional and AI-based methods: for example, at convergence on the MVUB dataset, our method reduces the bitstream by approximately 21.21% compared to G-PCC TMC13v23 and 21.95% compared to SparsePCGC.

Thu 23 Oct. 17:45 - 19:45 PDT

#374
CanFields: Consolidating Diffeomorphic Flows for Non-Rigid 4D Interpolation from Arbitrary-Length Sequences

Miaowei Wang · Changjian Li · Amir Vaxman

We introduce Canonical Consolidation Fields (CanFields). This novel method interpolates arbitrary-length sequences of independently sampled 3D point clouds into a unified, continuous, and coherent deforming shape. Unlike prior methods that oversmooth geometry or produce topological and geometric artifacts, CanFields optimizes fine-detailed geometry and deformation jointly in an unsupervised fitting with two novel bespoke modules. First, we introduce a dynamic consolidator module that adjusts the input and assigns confidence scores, balancing the optimization of the canonical shape and its motion. Second, we represent the motion as a diffeomorphic flow parameterized by a smooth velocity field. We have validated the robustness and accuracy of CanFields on more than 50 diverse sequences, demonstrating superior performance even with missing regions, noisy raw scans, and sparse data. The code is available in the supplementary material and will be made publicly available upon publication.
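
The diffeomorphic flow above is parameterized by a smooth velocity field; warping points then amounts to integrating that field over time. A generic sketch using forward Euler steps (the paper's parameterization and consolidator module are not reproduced; the tiny velocity network is hypothetical):

```python
import torch
import torch.nn as nn

def integrate_flow(velocity_net: nn.Module, points: torch.Tensor,
                   t0: float = 0.0, t1: float = 1.0, steps: int = 20) -> torch.Tensor:
    """Warp (N, 3) points by Euler-integrating dx/dt = v(x, t) from t0 to t1."""
    dt = (t1 - t0) / steps
    x = points
    for i in range(steps):
        t = torch.full((x.shape[0], 1), t0 + i * dt)
        x = x + dt * velocity_net(torch.cat([x, t], dim=-1))
    return x

# Hypothetical smooth velocity field taking (x, y, z, t) and returning a 3D velocity.
velocity_net = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 3))
warped = integrate_flow(velocity_net, torch.randn(2048, 3))
```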

Thu 23 Oct. 17:45 - 19:45 PDT

#375
DriveX: Omni Scene Modeling for Learning Generalizable World Knowledge in Autonomous Driving

Chen Shi · Shaoshuai Shi · Kehua Sheng · Bo Zhang · Li Jiang

Data-driven learning has advanced autonomous driving, yet task-specific models struggle with out-of-distribution scenarios due to their narrow optimization objectives and reliance on costly annotated data. We present DriveX, a self-supervised world model that learns generalizable scene dynamics and holistic representations (geometric, semantic, and motion) from large-scale driving videos. DriveX introduces Panoptic Scene Modeling (PSM), a module that unifies multimodal supervision—3D point cloud forecasting, 2D semantic representation, and image generation—to capture comprehensive scene evolution. To simplify learning complex dynamics, we propose a decoupled latent world modeling strategy that separates world representation learning from future state decoding, augmented by dynamic-aware ray sampling to enhance motion modeling. For downstream adaptation, we design Future Spatial Attention (FSA), a unified paradigm that dynamically aggregates spatiotemporal features from DriveX’s predictions to enhance task-specific inference. Extensive experiments demonstrate DriveX’s effectiveness: it achieves significant improvements in 3D future point cloud prediction over prior work, while attaining state-of-the-art results on diverse tasks including occupancy prediction, flow estimation, and end-to-end driving. These results validate DriveX’s capability as a general-purpose world model, paving the way for robust and unified autonomous driving frameworks.

Thu 23 Oct. 17:45 - 19:45 PDT

#376
Highlight
Interaction-Merged Motion Planning: Effectively Leveraging Diverse Motion Datasets for Robust Planning

Giwon Lee · Wooseong Jeong · Daehee Park · Jaewoo Jeong · Kuk-Jin Yoon

Motion planning is a crucial component of autonomous robot driving. While various trajectory datasets exist, effectively utilizing them for a target domain remains challenging due to differences in agent interactions and environmental characteristics. Conventional approaches, such as domain adaptation or ensemble learning, leverage multiple source datasets but suffer from domain imbalance, catastrophic forgetting, and high computational costs. To address these challenges, we propose Interaction-Merged Motion Planning (IMMP), a novel approach that leverages parameter checkpoints trained on different domains during adaptation to the target domain. IMMP follows a two-step process: pre-merging to capture agent behaviors and interactions, sufficiently extracting diverse information from the source domain, followed by merging to construct an adaptable model that efficiently transfers diverse interactions to the target domain. Our method is evaluated on various planning benchmarks and models, demonstrating superior performance compared to conventional approaches.
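
As a hedged illustration of merging parameter checkpoints trained on different source domains (IMMP's actual pre-merging and merging steps are more involved), a weighted average of PyTorch state dicts looks roughly like the sketch below; the checkpoint file names are hypothetical:

```python
import torch

def merge_checkpoints(state_dicts, weights=None):
    """Weighted average of model state dicts that share identical keys and shapes."""
    if weights is None:
        weights = [1.0 / len(state_dicts)] * len(state_dicts)
    merged = {}
    for key in state_dicts[0]:
        merged[key] = sum(w * sd[key].float() for w, sd in zip(weights, state_dicts))
    return merged

# Hypothetical usage with two domain-specific planners of the same architecture
# (the checkpoint file names are assumptions, not from the paper).
ckpt_a = torch.load("planner_domain_a.pt")
ckpt_b = torch.load("planner_domain_b.pt")
merged = merge_checkpoints([ckpt_a, ckpt_b], weights=[0.6, 0.4])
```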

Thu 23 Oct. 17:45 - 19:45 PDT

#377
Communication-Efficient Multi-Vehicle Collaborative Semantic Segmentation via Sparse 3D Gaussian Sharing

Tianyu Hong · Xiaobo Zhou · Wenkai Hu · Qi Xie · Zhihui Ke · Tie Qiu

Collaborative perception is considered a promising approach to address the inherent limitations of single-vehicle systems by sharing data among vehicles, thereby enhancing performance in perception tasks such as bird’s‐eye view (BEV) semantic segmentation. However, existing methods share the entire dense, scene-level BEV feature, which contains significant redundancy and lacks height information, ultimately leading to unavoidable bandwidth waste and performance degradation. To address these challenges, we present $\textit{GSCOOP}$, the first collaborative semantic segmentation framework that leverages sparse, object-centric 3D Gaussians to fundamentally overcome communication bottlenecks. By representing scenes with compact Gaussians that preserve complete spatial information, $\textit{GSCOOP}$ achieves both high perception accuracy and communication efficiency. To further optimize transmission, we introduce the Priority-Based Gaussian Selection (PGS) module to adaptively select critical Gaussians and a Semantic Gaussian Compression (SGC) module to compress Gaussian attributes with minimal overhead. Extensive experiments on OPV2V and V2X-Seq demonstrate that GSCOOP achieves state-of-the-art performance, even with more than $500\times$ lower communication volume.

Thu 23 Oct. 17:45 - 19:45 PDT

#378
World4Drive: End-to-End Autonomous Driving via Intention-aware Physical Latent World Model

Yupeng Zheng · Pengxuan Yang · Zebin Xing · Qichao Zhang · Yuhang Zheng · Yinfeng Gao · Pengfei Li · Teng Zhang · Zhongpu Xia · Peng Jia · XianPeng Lang · Dongbin Zhao

End-to-end autonomous driving directly generates planning trajectories from raw sensor data, yet it typically relies on costly perception supervision to extract scene information. A critical research challenge arises: constructing an informative driving world model to enable perception annotation-free, end-to-end planning via self-supervised learning. In this paper, we present World4Drive, an end-to-end autonomous driving framework that employs vision foundation models to build latent world models for generating and evaluating multi-modal planning trajectories. Specifically, World4Drive first extracts scene features, including driving intention and world latent representations enriched with spatial-semantic priors provided by vision foundation models. It then generates multi-modal planning trajectories based on current scene features and driving intentions and predicts multiple intention-driven future states within the latent space. Finally, it introduces a world model selector module to evaluate and select the best trajectory. We achieve perception annotation-free, end-to-end planning through self-supervised alignment between actual future observations and predicted observations reconstructed from the latent space. World4Drive achieves state-of-the-art performance without manual perception annotations on both the open-loop nuScenes and closed-loop NavSim benchmarks, demonstrating an 18.1% relative reduction in L2 error, a 46.7% lower collision rate, and a 3.75× speedup.

Thu 23 Oct. 17:45 - 19:45 PDT

#379
DATA: Domain-And-Time Alignment for High-Quality Feature Fusion in Collaborative Perception

Chengchang Tian · Jianwei Ma · Yan Huang · Zhanye Chen · Honghao Wei · Hui Zhang · Wei Hong

Feature-level fusion shows promise in collaborative perception (CP) through balanced performance and communication bandwidth trade-off. However, its effectiveness critically relies on input feature quality. The acquisition of high-quality features faces domain gaps from hardware diversity and deployment conditions, alongside temporal misalignment from transmission delays. These challenges degrade feature quality with cumulative effects throughout the collaborative network. In this paper, we present the Domain-And-Time Alignment (DATA) network, designed to systematically align features while maximizing their semantic representations for fusion. Specifically, we propose a Consistency-preserving Domain Alignment Module (CDAM) that reduces domain gaps through proximal-region hierarchical downsampling and observability-constrained discriminator. We further propose a Progressive Temporal Alignment Module (PTAM) to handle transmission delays via multi-scale motion modeling and two-stage compensation. Building upon the aligned features, an Instance-focused Feature Aggregation Module (IFAM) is developed to enhance semantic representations. Extensive experiments demonstrate that DATA achieves state-of-the-art performance on three typical datasets, maintaining robustness with severe communication delays and pose errors.

Thu 23 Oct. 17:45 - 19:45 PDT

#380
GauUpdate: New Object Insertion in 3D Gaussian Fields with Consistent Global Illumination

Chengwei REN · Fan Zhang · Liangchao Xu · Liang Pan · Ziwei Liu · Wenping Wang · Xiao-Ping Zhang · Yuan Liu

3D Gaussian Splatting (3DGS) is a prevailing technique for reconstructing large-scale 3D scenes, such as a room, a block, or even a city, from multiview images for novel view synthesis. Such large-scale scenes are not static: changes constantly happen, like a new building being built or a new decoration being set up. To keep the reconstructed 3D Gaussian fields up to date, a naive approach is to reconstruct the whole scene after each change, which is extremely costly and inefficient. In this paper, we propose a new method called GauUpdate that allows partially updating an old 3D Gaussian field with new objects from a new 3D Gaussian field. However, simply inserting the new objects leads to inconsistent appearances, because the old and new Gaussian fields may have different lighting environments. GauUpdate addresses this problem by applying inverse rendering techniques in 3DGS to recover both the materials and the environment lighting. Based on the recovered materials and lighting, we relight the new objects in the old 3D Gaussian field for consistent global illumination. For an accurate estimation of materials and lighting, we impose additional constraints, namely that the two fields share the same materials but have different environment lights, to improve estimation quality. Experiments on both synthetic and real-world scenes demonstrate that GauUpdate achieves realistic object insertion in 3D Gaussian fields with consistent appearances.

Thu 23 Oct. 17:45 - 19:45 PDT

#381
Hi-Gaussian: Hierarchical Gaussians under Normalized Spherical Projection for Single-View 3D Reconstruction

Binjian Xie · Pengju Zhang · Hao Wei · Yihong Wu

Single-view 3D reconstruction is a fundamental problem in computer vision, having a significant impact on downstream tasks such as autonomous driving, virtual reality and augmented reality. However, existing single-view reconstruction methods are unable to reconstruct the regions outside the input field-of-view or the areas occluded by visible parts. In this paper, we propose Hi-Gaussian, which employs feed-forward 3D Gaussians for efficient and generalizable single-view 3D reconstruction. A Normalized Spherical Projection module is introduced following an Encoder-Decoder network in our model, assigning a larger range to the transformed spherical coordinates, which can enlarge the field of view during scene reconstruction. Besides, to reconstruct occluded regions behind the visible part, we introduce a novel Hierarchical Gaussian Sampling strategy, utilizing two layers of Gaussians to hierarchically represent 3D scenes. We first use a pre-trained monocular depth estimation model to provide depth initialization for $leader$ Gaussians, and then leverage the $leader$ Gaussians to estimate the distribution followed by $follower$ Gaussians, which can flexibly move into occluded areas. Extensive experiments show that our method outperforms other methods for scene reconstruction and novel view synthesis, on both outdoor and indoor datasets.
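
The Normalized Spherical Projection module above builds on a Cartesian-to-spherical change of coordinates; a basic version of that projection is sketched below (the normalization and enlarged coordinate range used in the paper are not reproduced):

```python
import numpy as np

def to_spherical(xyz: np.ndarray) -> np.ndarray:
    """Convert (N, 3) Cartesian points to (radius, azimuth, elevation)."""
    x, y, z = xyz[:, 0], xyz[:, 1], xyz[:, 2]
    r = np.linalg.norm(xyz, axis=1)
    azimuth = np.arctan2(y, x)                                    # in [-pi, pi]
    elevation = np.arcsin(np.clip(z / np.maximum(r, 1e-8), -1.0, 1.0))
    return np.stack([r, azimuth, elevation], axis=1)

points = np.random.randn(1000, 3)
sph = to_spherical(points)
```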

Thu 23 Oct. 17:45 - 19:45 PDT

#382
VisHall3D: Monocular Semantic Scene Completion from Reconstructing the Visible Regions to Hallucinating the Invisible Regions

Haoang Lu · Yuanqi Su · Xiaoning Zhang · Longjun Gao · Yu Xue · Le Wang

This paper introduces VisHall3D, a novel two-stage framework for monocular semantic scene completion that aims to address the issues of feature entanglement and geometric inconsistency prevalent in existing methods. VisHall3D decomposes the scene completion task into two stages: reconstructing the visible regions (vision) and inferring the invisible regions (hallucination). In the first stage, VisFrontierNet, a visibility-aware projection module, is introduced to accurately trace the visual frontier while preserving fine-grained details. In the second stage, OcclusionMAE, a hallucination network, is employed to generate plausible geometries for the invisible regions using a noise injection mechanism. By decoupling scene completion into these two distinct stages, VisHall3D effectively mitigates feature entanglement and geometric inconsistency, leading to significantly improved reconstruction quality. The effectiveness of VisHall3D is validated through extensive experiments on two challenging benchmarks: SemanticKITTI and SSCBench-KITTI-360. VisHall3D achieves state-of-the-art performance, outperforming previous methods by a significant margin and paving the way for more accurate and reliable scene understanding in autonomous driving and other applications.

Thu 23 Oct. 17:45 - 19:45 PDT

#383
TrafficLoc: Localizing Traffic Surveillance Cameras in 3D Scenes

Yan Xia · Yunxiang Lu · Rui Song · Oussema Dhaouadi · Joao F. Henriques · Daniel Cremers

We tackle the problem of localizing traffic cameras within a 3D reference map and propose a novel image-to-point cloud registration (I2P) method, TrafficLoc, in a coarse-to-fine matching fashion. To overcome the lack of large-scale real-world intersection datasets, we first introduce Carla Intersection, a new simulated dataset with 75 urban and rural intersections in Carla. We find that current I2P methods struggle with cross-modal matching under large viewpoint differences, especially at traffic intersections. TrafficLoc thus employs a novel Geometry-guided Attention Loss (GAL) to focus only on the corresponding geometric regions under different viewpoints during 2D-3D feature fusion. To address feature inconsistency in paired image patch-point groups, we further propose Inter-intra Contrastive Learning (ICL) to enhance the separation of 2D patch / 3D group features within each modality and introduce Dense Training Alignment (DTA) with soft-argmax for improving position regression. Extensive experiments show that TrafficLoc greatly improves performance over SOTA I2P methods (up to 86%) on Carla Intersection and generalizes well to real-world data. TrafficLoc also achieves new SOTA performance on the KITTI and NuScenes datasets, demonstrating its superiority across both in-vehicle and traffic cameras. The code and dataset will be available upon acceptance.

Point cloud learning, especially in a self-supervised way without manual labels, has gained growing attention in both the vision and learning communities due to its potential utility in a wide range of applications. Most existing generative approaches for point cloud self-supervised learning focus on recovering masked points from visible ones within a single view. A two-view pre-training paradigm inherently introduces greater diversity and variance, and may thus enable more challenging and informative pre-training. Inspired by this, we explore the potential of two-view learning in this domain. In this paper, we propose Point-PQAE, a cross-reconstruction generative paradigm that first generates two decoupled point clouds/views and then reconstructs one from the other. To achieve this goal, we develop a crop mechanism for point cloud view generation for the first time and further propose a novel positional encoding to represent the 3D relative position between the two decoupled views. The cross-reconstruction significantly increases the difficulty of pre-training compared to self-reconstruction, which enables our method to surpass previous single-modal self-reconstruction methods in 3D self-supervised learning. Specifically, it outperforms the self-reconstruction baseline (Point-MAE) by 6.5%, 7.0%, and 6.7% on three variants of ScanObjectNN with the MLP-Linear evaluation protocol. Source code will be released.
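To make the two-view idea concrete, here is a minimal sketch of generating two decoupled crops of a point cloud plus a relative-position signal between them. The spherical-crop heuristic, the keep ratio, and the centroid-based relative offset are assumptions for illustration, not the paper's actual crop mechanism or positional encoding.

```python
import numpy as np

def spherical_crop(points, center, keep_ratio=0.6):
    """Keep the fraction of points nearest to a crop center (one 'view')."""
    d = np.linalg.norm(points - center, axis=1)
    idx = np.argsort(d)[: int(len(points) * keep_ratio)]
    return points[idx]

def make_two_views(points, keep_ratio=0.6):
    """Generate two decoupled crops and the relative offset between them.
    Each crop is re-centered to its own centroid, so the 3D relative position
    must be supplied separately when reconstructing one view from the other."""
    c1, c2 = points[np.random.choice(len(points), 2, replace=False)]
    v1 = spherical_crop(points, c1, keep_ratio)
    v2 = spherical_crop(points, c2, keep_ratio)
    mu1, mu2 = v1.mean(0), v2.mean(0)
    rel_pos = mu2 - mu1            # relative positional signal between the views
    return v1 - mu1, v2 - mu2, rel_pos

# toy usage: the cross-reconstruction target is "predict view2 given view1 + rel_pos"
cloud = np.random.randn(2048, 3)
view1, view2, rel = make_two_views(cloud)
print(view1.shape, view2.shape, rel)
```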

Thu 23 Oct. 17:45 - 19:45 PDT

#385
A Lesson in Splats: Teacher-Guided Diffusion for 3D Gaussian Splats Generation with 2D Supervision

Chensheng Peng · Ido Sobol · Masayoshi Tomizuka · Kurt Keutzer · Chenfeng Xu · Or Litany

We present a novel framework for training 3D image-conditioned diffusion models using only 2D supervision. Recovering 3D structure from 2D images is inherently ill-posed due to the ambiguity of possible reconstructions, making generative models a natural choice. However, most existing 3D generative models rely on full 3D supervision, which is impractical due to the scarcity of large-scale 3D datasets. To address this, we propose leveraging sparse-view supervision as a scalable alternative. While recent reconstruction models use sparse-view supervision with differentiable rendering to lift 2D images to 3D, they are predominantly deterministic, failing to capture the diverse set of plausible solutions and producing blurry predictions in uncertain regions. A key challenge in training 3D diffusion models with 2D supervision is that the standard training paradigm requires both the denoising process and supervision to be in the same modality. We address this by decoupling the noisy samples being denoised from the supervision signal, allowing the former to remain in 3D while the latter is provided in 2D. Our approach leverages suboptimal predictions from a deterministic image-to-3D model—acting as a "teacher"—to generate noisy 3D inputs, enabling effective 3D diffusion training without requiring full 3D ground truth. We validate our framework on both object-level and scene-level datasets, using two different 3D Gaussian Splat (3DGS) teachers. Our results show that our approach consistently improves upon these deterministic teachers, demonstrating its effectiveness in scalable and high-fidelity 3D generative modeling.

Thu 23 Oct. 17:45 - 19:45 PDT

#386
Extrapolated Urban View Synthesis Benchmark

Xiangyu Han · Zhen Jia · Boyi Li · Yan Wang · Boris Ivanovic · Yurong You · Lingjie Liu · Yue Wang · Marco Pavone · Chen Feng · Yiming Li

Photorealistic simulators are essential for the training and evaluation of vision-centric autonomous vehicles (AVs). At their core is Novel View Synthesis (NVS), a crucial capability that generates diverse unseen viewpoints to accommodate the broad and continuous pose distribution of AVs. Recent advances in radiance fields, such as 3D Gaussian Splatting, achieve photorealistic rendering at real-time speeds and have been widely used in modeling large-scale driving scenes. However, their performance is commonly evaluated using an interpolated setup with highly correlated training and test views. In contrast, extrapolation, where test views largely deviate from training views, remains underexplored, limiting progress in generalizable simulation technology. To address this gap, we leverage publicly available AV datasets with multiple traversals, multiple vehicles, and multiple cameras to build the first Extrapolated Urban View Synthesis (EUVS) benchmark. Meanwhile, we conduct both quantitative and qualitative evaluations of state-of-the-art NVS methods across different evaluation settings. Our results show that current NVS methods are prone to overfitting to training views. Besides, incorporating diffusion priors and improving geometry cannot fundamentally improve NVS under large view changes, highlighting the need for more robust approaches and large-scale training. We will release the data to help advance self-driving and urban robotics simulation technology.

Thu 23 Oct. 17:45 - 19:45 PDT

#387
Heatmap Regression without Soft-Argmax for Facial Landmark Detection

Chiao-An Yang · Raymond A. Yeh

Facial landmark detection is an important task in computer vision with numerous downstream applications, such as head pose estimation, expression analysis, face swapping, etc. Heatmap regression-based methods have been a strong contender in achieving state-of-the-art results in this task. These methods involve computing the argmax over the heatmaps to predict a landmark. As argmax is not differentiable, to enable end-to-end training of deep networks, these methods rely on a differentiable approximation of argmax, namely Soft-argmax. In this work, we revisit this long-standing choice of using Soft-argmax and find that it may not be necessary. Instead, we propose an alternative training objective based on the classic structured prediction framework. Empirically, our method achieves state-of-the-art performance on three facial landmark benchmarks (WFLW, COFW, and 300W) while converging roughly $2.2\times$ faster in training and maintaining intuitive design choices in our model.
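For reference, the sketch below contrasts the standard Soft-argmax decoding with one generic structured alternative, a categorical negative log-likelihood over heatmap locations. The latter is only a hedged illustration of the kind of objective meant here, not the paper's exact loss.

```python
import numpy as np

def soft_argmax(heatmap, beta=10.0):
    """Differentiable approximation of argmax: the expected (x, y) under a softmax."""
    H, W = heatmap.shape
    p = np.exp(beta * (heatmap - heatmap.max()))
    p /= p.sum()
    ys, xs = np.mgrid[0:H, 0:W]
    return (p * xs).sum(), (p * ys).sum()

def heatmap_nll(heatmap, target_xy):
    """One generic alternative: treat the heatmap as unnormalized log-scores over
    pixel locations and minimize the negative log-likelihood of the ground-truth
    location (a simple structured / categorical objective)."""
    logits = heatmap.reshape(-1)
    log_z = np.log(np.exp(logits - logits.max()).sum()) + logits.max()
    tx, ty = target_xy
    return -(heatmap[ty, tx] - log_z)

hm = np.random.randn(64, 64)
print(soft_argmax(hm), heatmap_nll(hm, (10, 20)))
```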

Thu 23 Oct. 17:45 - 19:45 PDT

#388
Demeter: A Parametric Model of Crop Plant Morphology from the Real World

Tianhang Cheng · Albert Zhai · Evan Chen · Rui Zhou · Yawen Deng · Zitong Li · Kejie Zhao · Janice Shiu · Qianyu Zhao · Yide Xu · Xinlei Wang · Yuan Shen · Sheng Wang · Lisa Ainsworth · Kaiyu Guan · Shenlong Wang

Learning 3D parametric shape models of objects has gained popularity in vision and graphics and has shown broad utility in 3D reconstruction, generation, understanding, and simulation. While powerful models exist for humans and animals, equally expressive approaches for modeling plants are lacking. In this work, we present Demeter, a data-driven parametric model that encodes key factors of plant morphology, including topology, shape, articulation, and deformation, into a compact learned representation. Unlike previous parametric models, Demeter handles varying shape topology across species and models three sources of shape variation: articulation, subcomponent shape variation, and non-rigid deformation. To advance crop plant modeling, we collected a large-scale, ground-truthed dataset from a soybean farm as a testbed. Experiments show that Demeter effectively synthesizes shapes, reconstructs structures, and simulates biophysical processes. Code and data will be open-sourced.

Thu 23 Oct. 17:45 - 19:45 PDT

#389
3D Gaussian Splatting Driven Multi-View Robust Physical Adversarial Camouflage Generation

Tianrui Lou · Xiaojun Jia · Siyuan Liang · Jiawei Liang · Ming Zhang · Yanjun Xiao · Xiaochun Cao

Physical adversarial attack methods expose the vulnerabilities of deep neural networks and pose a significant threat to safety-critical scenarios such as autonomous driving. Camouflage-based physical attacks are more promising than patch-based attacks, offering stronger adversarial effectiveness in complex physical environments. However, most prior work relies on mesh priors of the target object and virtual environments constructed by simulators, which are time-consuming to obtain and inevitably differ from the real world. Moreover, due to the limitations of the backgrounds in training images, previous methods often fail to produce multi-view robust adversarial camouflage and tend to fall into sub-optimal solutions. For these reasons, prior work lacks adversarial effectiveness and robustness across diverse viewpoints and physical environments. We propose a physical attack framework based on 3D Gaussian Splatting (3DGS), named PGA, which provides rapid and precise reconstruction with few images, along with photo-realistic rendering capabilities. Our framework further enhances cross-view robustness and adversarial effectiveness by preventing mutual and self-occlusion among Gaussians and employing a min-max optimization approach that adjusts the imaging background of each viewpoint, helping the algorithm filter out non-robust adversarial features. Extensive experiments validate the effectiveness and superiority of PGA.

Thu 23 Oct. 17:45 - 19:45 PDT

#390
Mixed Signals: A Diverse Point Cloud Dataset for Heterogeneous LiDAR V2X Collaboration

Katie Luo · Minh-Quan Dao · Zhenzhen Liu · Mark Campbell · Wei-Lun (Harry) Chao · Kilian Weinberger · Ezio Malis · Vincent FREMONT · Bharath Hariharan · Mao Shan · Stewart Worrall · Julie Stephany Berrio Perez

Vehicle-to-everything (V2X) collaborative perception has emerged as a promising solution to address the limitations of single-vehicle perception systems. However, existing V2X datasets are limited in scope, diversity, and quality. To address these gaps, we present Mixed Signals, a comprehensive V2X dataset featuring 45.1k point clouds and 240.6k bounding boxes collected from three connected autonomous vehicles (CAVs) equipped with two different configurations of LiDAR sensors, plus a roadside unit with dual LiDARs. Our dataset provides point clouds and bounding box annotations across 10 classes, ensuring reliable data for perception training. We provide detailed statistical analysis on the quality of our dataset and extensively benchmark existing V2X methods on it. Mixed Signals is ready-to-use, with precise alignment and consistent annotations across time and viewpoints. We hope our work advances research in the emerging, impactful field of V2X perception.

Thu 23 Oct. 17:45 - 19:45 PDT

#391
GroundFlow: A Plug-in Module for Temporal Reasoning on 3D Point Cloud Sequential Grounding

Zijun Lin · Shuting He · Cheston Tan · Bihan Wen

Sequential grounding in 3D point clouds (SG3D) refers to locating sequences of objects by following text instructions for a daily activity with detailed steps. Current 3D visual grounding (3DVG) methods treat text instructions with multiple steps as a whole, without extracting useful temporal information from each step. However, the instructions in SG3D often contain pronouns such as "it", "here" and "the same" to make language expressions concise. This requires grounding methods to understand the context and retrieve relevant information from previous steps to correctly locate object sequences. Due to the lack of an effective module for collecting related historical information, state-of-the-art 3DVG methods face significant challenges in adapting to the SG3D task. To fill this gap, we propose GroundFlow — a plug-in module for temporal reasoning on 3D point cloud sequential grounding. Firstly, we demonstrate that integrating GroundFlow improves the task accuracy of 3DVG baseline methods by a large margin (+7.5\% and +10.2\%) in the SG3D benchmark, even outperforming a 3D large language model pre-trained on various datasets. Furthermore, we selectively extract both short-term and long-term step information based on its relevance to the current instruction, enabling GroundFlow to take a comprehensive view of historical information and maintain its temporal understanding advantage as step counts increase. Overall, our work introduces temporal reasoning capabilities to existing 3DVG models and achieves state-of-the-art performance in the SG3D benchmark across five datasets.

Thu 23 Oct. 17:45 - 19:45 PDT

#392
RealCam-I2V: Real-World Image-to-Video Generation with Interactive Complex Camera Control

Teng Li · Guangcong Zheng · Rui Jiang · Shuigenzhan Shuigenzhan · Tao Wu · Yehao Lu · Yining Lin · Chuanyun Deng · Yepan Xiong · Min Chen · Lin Cheng · Xi Li

Recent advancements in camera-trajectory-guided image-to-video generation offer higher precision and better support for complex camera control compared to text-based approaches. However, they also introduce significant usability challenges, as users often struggle to provide precise camera parameters when working with arbitrary real-world images without knowledge of their depth or scene scale. To address these real-world application issues, we propose RealCam-I2V, a novel diffusion-based video generation framework that integrates monocular metric depth estimation to establish 3D scene reconstruction in a preprocessing step. During training, the reconstructed 3D scene enables scaling camera parameters from relative to metric scales, ensuring compatibility and scale consistency across diverse real-world images. At inference, RealCam-I2V offers an intuitive interface where users can precisely draw camera trajectories by dragging within the 3D scene. To further enhance precise camera control and scene consistency, we propose scene-constrained noise shaping, which shapes high-level noise and also allows the framework to maintain dynamic and coherent video generation in lower noise stages. RealCam-I2V achieves significant improvements in controllability and video quality on RealEstate10K and on out-of-domain images. It further enables applications like camera-controlled looping video generation and generative frame interpolation.
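As a rough illustration of aligning relative-scale camera geometry to metric scale with a monocular metric depth estimate, the sketch below computes one robust global scale from per-pixel depth ratios and applies it to camera translations. The median-ratio estimator and the 4x4 pose format are assumptions for illustration, not the paper's procedure.

```python
import numpy as np

def relative_to_metric_scale(relative_depth, metric_depth, valid=None):
    """Estimate one global scale mapping relative-scale geometry to metres,
    using a robust median of per-pixel depth ratios."""
    if valid is None:
        valid = (relative_depth > 1e-6) & (metric_depth > 1e-6)
    return np.median(metric_depth[valid] / relative_depth[valid])

def rescale_trajectory(cam_to_world, scale):
    """Apply the scale to the translation part of 4x4 camera-to-world poses."""
    poses = cam_to_world.copy()
    poses[:, :3, 3] *= scale
    return poses

# toy usage with a hypothetical scale gap of 2.5x
rel = np.random.uniform(1, 10, (240, 320))
met = rel * 2.5
poses = np.tile(np.eye(4), (8, 1, 1))
poses[:, :3, 3] = np.random.randn(8, 3)
s = relative_to_metric_scale(rel, met)
print(s)                                        # ~2.5
print(rescale_trajectory(poses, s)[0, :3, 3])
```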

Thu 23 Oct. 17:45 - 19:45 PDT

#393
Exploiting Vision Language Model for Training-Free 3D Point Cloud OOD Detection via Graph Score Propagation

Tiankai Chen · Yushu Li · Adam Goodge · Fei Teng · Xulei Yang · Tianrui Li · Xun Xu

Out-of-distribution (OOD) detection in 3D point cloud data remains a challenge, particularly in applications where safe and robust perception is critical. While existing OOD detection methods have shown progress for 2D image data, extending these to 3D environments involves unique obstacles. This paper introduces a training-free framework that leverages Vision-Language Models (VLMs) for effective OOD detection in 3D point clouds. By constructing a graph based on class prototypes and testing data, we exploit the data manifold structure to enhance the effectiveness of VLMs for 3D OOD detection. We propose a novel Graph Score Propagation (GSP) method that incorporates prompt clustering and self-training negative prompting to improve OOD scoring with VLMs. Our method is also adaptable to few-shot scenarios, providing options for practical applications. We demonstrate that GSP consistently outperforms state-of-the-art methods on synthetic and real-world datasets for 3D point cloud OOD detection.
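One way to picture score propagation over a graph of class prototypes and test samples is classic closed-form label propagation on a kNN affinity graph, sketched below. This generic recipe omits the paper's prompt clustering and negative prompting, and the graph construction and hyperparameters here are assumptions.

```python
import numpy as np

def knn_affinity(feats, k=10, sigma=1.0):
    """Symmetric kNN Gaussian affinity matrix over L2-normalized features."""
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    d2 = np.maximum(2 - 2 * f @ f.T, 0)
    W = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(W, 0)
    keep = np.argsort(-W, axis=1)[:, :k]
    M = np.zeros_like(W)
    rows = np.arange(len(W))[:, None]
    M[rows, keep] = W[rows, keep]
    return np.maximum(M, M.T)

def propagate_scores(W, init_scores, alpha=0.9):
    """Closed-form propagation (I - alpha * D^-1/2 W D^-1/2)^-1 y of initial
    per-node scores over the graph manifold."""
    d = W.sum(1)
    Dm = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-8)))
    S = Dm @ W @ Dm
    return np.linalg.solve(np.eye(len(W)) - alpha * S, init_scores)

# toy usage: rows = class prototypes followed by test samples,
# init_scores = zero-shot VLM similarity of each node to in-distribution prompts
feats = np.random.randn(64, 512)
scores0 = np.random.rand(64)
print(propagate_scores(knn_affinity(feats), scores0)[:5])
```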

Thu 23 Oct. 17:45 - 19:45 PDT

#394
ScenePainter: Semantically Consistent Perpetual 3D Scene Generation with Concept Relation Alignment

Chong Xia · Shengjun Zhang · Fangfu Liu · Chang Liu · Khodchaphun Hirunyaratsameewong · Yueqi Duan

Perpetual 3D scene generation aims to produce long-range and coherent 3D view sequences, which is applicable to long-term video synthesis and 3D scene reconstruction. Existing methods follow a "navigate-and-imagine" fashion and rely on outpainting for successive view expansion. However, the generated view sequences suffer from a semantic drift issue arising from the accumulated deviation of the outpainting module. To tackle this challenge, we propose ScenePainter, a new framework for semantically consistent 3D scene generation, which aligns the outpainter's scene-specific prior with the comprehension of the current scene. To be specific, we introduce a hierarchical graph structure dubbed SceneConceptGraph to construct relations among multi-level scene concepts, which directs the outpainter toward consistent novel views and can be dynamically refined to enhance diversity. Extensive experiments demonstrate that our framework overcomes the semantic drift issue and generates more consistent and immersive 3D view sequences.

Thu 23 Oct. 17:45 - 19:45 PDT

#395
FROSS: Faster-Than-Real-Time Online 3D Semantic Scene Graph Generation from RGB-D Images

Hao-Yu Hou · Chun-Yi Lee · Motoharu Sonogashira · Yasutomo Kawanishi

The ability to abstract complex 3D environments into simplified and structured representations is crucial across various domains. 3D semantic scene graphs (SSGs) achieve this by representing objects as nodes and their interrelationships as edges, facilitating high-level scene understanding. Existing methods for 3D SSG generation, however, face significant challenges, including high computational demands and non-incremental processing that hinder their suitability for real-time open-world applications. To address these issues, we propose FROSS (Faster-than-Real-Time Online 3D Semantic Scene Graph Generation), an innovative approach for online, faster-than-real-time 3D SSG generation that leverages the direct lifting of 2D scene graphs to 3D space and represents objects as 3D Gaussian distributions. This framework eliminates the dependency on precise and computationally-intensive point cloud processing. Furthermore, we extend the Replica dataset with inter-object relationship annotations, creating the ReplicaSSG dataset for comprehensive evaluation of FROSS. The experimental results from evaluations on the ReplicaSSG and 3DSSG datasets show that FROSS can achieve superior performance while being orders of magnitude faster than prior 3D SSG generation methods.
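To illustrate the "lift 2D objects to 3D Gaussians" step, the sketch below back-projects the depth pixels inside a 2D object mask and summarizes them by a mean and covariance. The fitting rule and the use of a dense depth map are assumptions for illustration, not FROSS's actual pipeline.

```python
import numpy as np

def lift_object_to_gaussian(mask, depth, K, cam_to_world=None):
    """Back-project the depth pixels inside a 2D object mask and summarize the
    object as a 3D Gaussian (mean, covariance) in world coordinates."""
    if cam_to_world is None:
        cam_to_world = np.eye(4)
    vs, us = np.nonzero(mask)
    z = depth[vs, us]
    valid = z > 0
    us, vs, z = us[valid], vs[valid], z[valid]
    x = (us - K[0, 2]) / K[0, 0] * z
    y = (vs - K[1, 2]) / K[1, 1] * z
    pts_cam = np.stack([x, y, z], axis=1)
    pts_h = np.concatenate([pts_cam, np.ones((len(pts_cam), 1))], axis=1)
    pts_w = (cam_to_world @ pts_h.T).T[:, :3]
    return pts_w.mean(0), np.cov(pts_w.T)

# toy usage: a square object mask over a synthetic depth map
depth = np.full((120, 160), 3.0)
mask = np.zeros_like(depth, dtype=bool)
mask[40:80, 60:100] = True
K = np.array([[200.0, 0, 80], [0, 200.0, 60], [0, 0, 1]])
mu, cov = lift_object_to_gaussian(mask, depth, K)
print(mu, cov.shape)                      # 3D mean, (3, 3) covariance
```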

Thu 23 Oct. 17:45 - 19:45 PDT

#396
Learning Normals of Noisy Points by Local Gradient-Aware Surface Filtering

Qing Li · Huifang Feng · Xun Gong · Liang Han

Estimating normals for noisy point clouds is a persistent challenge in 3D geometry processing, particularly for end-to-end oriented normal estimation. Existing methods generally address relatively clean data and rely on supervised priors to fit local surfaces within specific neighborhoods. In this paper, we propose a novel approach for learning normals from noisy point clouds through local gradient-aware surface filtering. Our method projects noisy points onto the underlying surface by utilizing normals and distances derived from an implicit function constrained by local gradients. We start by introducing a distance measurement operator for global surface fitting on noisy data, which integrates projected distances along normals. Following this, we develop an implicit field-based filtering approach for surface point construction, adding projection constraints on these points during filtering. To address issues of over-smoothing and gradient degradation, we further incorporate local gradient consistency constraints, as well as local gradient orientation and aggregation. Comprehensive experiments on normal estimation, surface reconstruction, and point cloud denoising demonstrate the state-of-the-art performance of our method. The source code and trained models will be made publicly available.

Thu 23 Oct. 17:45 - 19:45 PDT

#397
HUG: Hierarchical Urban Gaussian Splatting with Block-Based Reconstruction for Large-Scale Aerial Scenes

Mai Su · Zhongtao Wang · Huishan Au · Yilong Li · Xizhe Cao · Chengwei Pan · Yisong Chen · Guoping Wang

3DGS is an emerging and increasingly popular technology in the field of novel view synthesis. Its highly realistic rendering quality and real-time rendering capabilities make it promising for various applications. However, when applied to large-scale aerial urban scenes, 3DGS methods suffer from issues such as excessive memory consumption, slow training times, prolonged partitioning processes, and significant degradation in rendering quality due to the increased data volume. To tackle these challenges, we introduce $\textbf{HUG}$, a novel approach that enhances data partitioning and reconstruction quality by leveraging a hierarchical neural Gaussian representation. We first propose a visibility-based data partitioning method that is simple yet highly efficient, significantly outperforming existing methods in speed. Then, we introduce a novel hierarchical weighted training approach, combined with other optimization strategies, to substantially improve reconstruction quality. Our method achieves state-of-the-art results on one synthetic dataset and four real-world datasets.

Thu 23 Oct. 17:45 - 19:45 PDT

#398
AMD: Adaptive Momentum and Decoupled Contrastive Learning Framework for Robust Long-Tail Trajectory Prediction

Bin Rao · Haicheng Liao · Yanchen Guan · Chengyue Wang · Bonan Wang · Jiaxun Zhang · Zhenning Li

Accurately predicting the future trajectories of traffic agents is essential in autonomous driving. However, due to the inherent imbalance in trajectory distributions, tail data in natural datasets often represents more complex and hazardous scenarios. Existing studies typically rely solely on a base model’s prediction error, without considering the diversity and uncertainty of long-tail trajectory patterns. We propose an adaptive momentum and decoupled contrastive learning framework (AMD), which integrates unsupervised and supervised contrastive learning strategies. By leveraging an improved momentum contrast learning (MoCo-DT) and decoupled contrastive learning (DCL) module, our framework enhances the model’s ability to recognize rare and complex trajectories. Additionally, we design four types of trajectory random augmentation methods and introduce an online iterative clustering strategy, allowing the model to dynamically update pseudo-labels and better adapt to the distributional shifts in long-tail data. We propose three different criteria to define long-tail trajectories and conduct extensive comparative experiments on the nuScenes and ETH/UCY datasets. The results show that AMD not only achieves optimal performance in long-tail trajectory prediction but also demonstrates outstanding overall prediction accuracy.

Thu 23 Oct. 17:45 - 19:45 PDT

#399
Unraveling the Effects of Synthetic Data on End-to-End Autonomous Driving

Junhao Ge · Zuhong Liu · Longteng Fan · Yifan Jiang · Jiaqi Su · Yiming Li · Zhejun Zhang · Siheng Chen

End-to-end (E2E) autonomous driving (AD) models require diverse, high-quality data to perform well across various driving scenarios. However, collecting large-scale real-world data is expensive and time-consuming, making high-fidelity synthetic data essential for enhancing data diversity and model robustness. Existing driving simulators for synthetic data generation have significant limitations: game-engine-based simulators struggle to produce realistic sensor data, while NeRF-based and diffusion-based methods face efficiency challenges. Additionally, recent simulators designed for closed-loop evaluation provide limited interaction with other vehicles, failing to simulate complex real-world traffic dynamics. To address these issues, we introduce SceneCrafter, a realistic, interactive, and efficient AD simulator based on 3D Gaussian Splatting (3DGS). SceneCrafter not only efficiently generates realistic driving logs across diverse traffic scenarios but also enables robust closed-loop evaluation of end-to-end models. Experimental results demonstrate that SceneCrafter serves as both a reliable evaluation platform and an efficient data generator that significantly improves end-to-end model generalization.

Thu 23 Oct. 17:45 - 19:45 PDT

#400
BANet: Bilateral Aggregation Network for Mobile Stereo Matching

Gangwei Xu · Jiaxin Liu · Xianqi Wang · Junda Cheng · Yong Deng · Jinliang Zang · Yurui Chen · Xin Yang

State-of-the-art stereo matching methods typically use costly 3D convolutions to aggregate a full cost volume, but their computational demands make mobile deployment challenging. Directly applying 2D convolutions for cost aggregation often results in edge blurring, detail loss, and mismatches in textureless regions. Some complex operations, like deformable convolutions and iterative warping, can partially alleviate this issue; however, they are not mobile-friendly, limiting their deployment on mobile devices. In this paper, we present a novel bilateral aggregation network (BANet) for mobile stereo matching that produces high-quality results with sharp edges and fine details using only 2D convolutions. Specifically, we first separate the full cost volume into detailed and smooth volumes using a spatial attention map, then perform detailed and smooth aggregations accordingly, ultimately fusing both to obtain the final disparity map. Additionally, to accurately identify high-frequency detailed regions and low-frequency smooth/textureless regions, we propose a new scale-aware spatial attention module. Experimental results demonstrate that our BANet-2D significantly outperforms other mobile-friendly methods, achieving 35.3\% higher accuracy on the KITTI 2015 leaderboard than MobileStereoNet-2D, with faster runtime on mobile devices. The extended 3D version, BANet-3D, achieves the highest accuracy among all real-time methods on high-end GPUs.
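The split-aggregate-fuse idea can be sketched with ordinary 2D convolutions as below. The attention module, branch kernel sizes, and the final soft-argmin readout are hypothetical stand-ins for illustration, not the BANet architecture.

```python
import torch
import torch.nn as nn

class BilateralAggregation2D(nn.Module):
    """Toy sketch: a spatial attention map splits the cost volume into 'detail'
    and 'smooth' parts that are aggregated by separate 2D conv branches and fused."""
    def __init__(self, max_disp=48):
        super().__init__()
        self.attn = nn.Sequential(nn.Conv2d(max_disp, 1, 3, padding=1), nn.Sigmoid())
        self.detail_agg = nn.Conv2d(max_disp, max_disp, 3, padding=1)
        self.smooth_agg = nn.Conv2d(max_disp, max_disp, 7, padding=3)
        self.fuse = nn.Conv2d(2 * max_disp, max_disp, 1)

    def forward(self, cost):                      # cost: (B, D, H, W)
        a = self.attn(cost)                       # high values = detailed regions
        detail = self.detail_agg(cost * a)
        smooth = self.smooth_agg(cost * (1 - a))
        fused = self.fuse(torch.cat([detail, smooth], dim=1))
        # soft-argmin over the disparity dimension gives the disparity map
        prob = torch.softmax(-fused, dim=1)
        disps = torch.arange(fused.shape[1], device=cost.device, dtype=cost.dtype)
        return (prob * disps.view(1, -1, 1, 1)).sum(dim=1)

disp = BilateralAggregation2D()(torch.randn(1, 48, 64, 128))
print(disp.shape)                                  # torch.Size([1, 64, 128])
```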

Thu 23 Oct. 17:45 - 19:45 PDT

#401
Puzzle Similarity: A Perceptually-guided Cross-Reference Metric for Artifact Detection in 3D Scene Reconstructions

Nicolai Hermann · Jorge Condor · Piotr Didyk

Modern reconstruction techniques can effectively model complex 3D scenes from sparse 2D views. However, automatically assessing the quality of novel views and identifying artifacts is challenging due to the lack of ground truth images and the limitations of No-Reference image metrics in predicting reliable artifact maps. The absence of such metrics hinders assessment of the quality of novel views and limits the adoption of post-processing techniques, such as inpainting, to enhance reconstruction quality. To tackle this, recent work has established a new category of metrics (Cross-Reference), predicting image quality solely by leveraging context from alternate viewpoint captures. In this work, we propose a new Cross-Reference metric, Puzzle Similarity, which is designed to localize artifacts in novel views. Our approach utilizes image patch statistics from the input views to establish a scene-specific distribution, later used to identify poorly reconstructed regions in the novel views. Given the lack of good measures to evaluate Cross-Reference methods in the context of 3D reconstruction, we collected a novel human-labeled dataset of artifact and distortion maps in unseen reconstructed views. Through this dataset, we demonstrate that our method achieves state-of-the-art localization of artifacts in novel views, correlating with human assessment, even without aligned references. We can leverage our new metric to enhance applications like automatic image restoration, guided acquisition, or 3D reconstruction from sparse inputs.
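A bare-bones version of a cross-reference, patch-statistics artifact map is sketched below: patches from the input views form a reference set and each novel-view patch is scored by its distance to the nearest reference patch. The patch size, normalization, and nearest-neighbor distance are simplifying assumptions rather than the proposed metric.

```python
import numpy as np

def extract_patches(img, size=8, stride=8):
    """Collect non-overlapping flattened patches from a grayscale image."""
    H, W = img.shape
    patches = [img[i:i + size, j:j + size].ravel()
               for i in range(0, H - size + 1, stride)
               for j in range(0, W - size + 1, stride)]
    return np.stack(patches)

def artifact_map(novel_view, reference_views, size=8):
    """Score each patch of the novel view by its distance to the nearest patch
    seen in the reference captures; large distances flag likely artifacts."""
    ref = np.concatenate([extract_patches(r, size) for r in reference_views])
    ref = (ref - ref.mean(1, keepdims=True)) / (ref.std(1, keepdims=True) + 1e-6)
    H, W = novel_view.shape
    out = np.zeros((H // size, W // size))
    for pi, i in enumerate(range(0, H - size + 1, size)):
        for pj, j in enumerate(range(0, W - size + 1, size)):
            p = novel_view[i:i + size, j:j + size].ravel()
            p = (p - p.mean()) / (p.std() + 1e-6)
            out[pi, pj] = np.min(np.linalg.norm(ref - p, axis=1))
    return out

refs = [np.random.rand(64, 64) for _ in range(3)]
print(artifact_map(np.random.rand(64, 64), refs).shape)    # (8, 8)
```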

Thu 23 Oct. 17:45 - 19:45 PDT

#402
Authentic 4D Driving Simulation with a Video Generation Model

Lening Wang · Wenzhao Zheng · Dalong Du · Yunpeng Zhang · Yilong Ren · Han Jiang · Zhiyong Cui · Haiyang Yu · Jie Zhou · Shanghang Zhang

Simulating driving environments in 4D is crucial for developing accurate and immersive autonomous driving systems. Despite progress in generating driving scenes, challenges in transforming views and modeling the dynamics of space and time remain. To tackle these issues, we propose a fresh methodology that reconstructs real-world driving environments and utilizes a generative network to enable 4D simulation. This approach builds continuous 4D point cloud scenes by leveraging surround-view data from autonomous vehicles. By separating the spatial and temporal elements, it creates smooth keyframe sequences. Furthermore, video generation techniques are employed to produce lifelike 4D simulation videos from any given perspective. To extend the range of possible viewpoints, we incorporate training using decomposed camera poses, which allows for enhanced modeling of distant scenes. Additionally, we merge camera trajectory data to synchronize 3D points across consecutive frames, fostering a richer understanding of the evolving scene. With training across multiple scene levels, our method is capable of simulating scenes from any viewpoint and offers deep insight into the evolution of scenes over time in a consistent spatial-temporal framework. In comparison with current methods, this approach excels in maintaining consistency across views, background coherence, and overall accuracy, significantly contributing to the development of more realistic autonomous driving simulations.

Thu 23 Oct. 17:45 - 19:45 PDT

#403
DONUT: A Decoder-Only Model for Trajectory Prediction

Markus Knoche · Daan de Geus · Bastian Leibe

Predicting the motion of other agents in a scene is highly relevant for autonomous driving, as it allows a self-driving car to anticipate. Inspired by the success of decoder-only models for language modeling, we propose DONUT, a Decoder-Only Network for Unrolling Trajectories. Different from existing encoder-decoder forecasting models, we encode historical trajectories and predict future trajectories with a single autoregressive model. This allows the model to make iterative predictions in a consistent manner, and ensures that the model is always provided with up-to-date information, enhancing the performance. Furthermore, inspired by multi-token prediction for language modeling, we introduce an 'overprediction' strategy that gives the network the auxiliary task of predicting trajectories at longer temporal horizons. This allows the model to better anticipate the future, and further improves the performance. With experiments, we demonstrate that our decoder-only approach outperforms the encoder-decoder baseline, and achieves new state-of-the-art results on the Argoverse 2 single-agent motion forecasting benchmark. Code will be made available upon acceptance.
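The decoder-only formulation with an auxiliary overprediction head can be sketched as follows. The tiny causal Transformer, the single-agent 2D waypoint interface, and the fixed overprediction offset are illustrative assumptions, not the DONUT model.

```python
import torch
import torch.nn as nn

class DecoderOnlyTrajectory(nn.Module):
    """Toy decoder-only forecaster: a causal Transformer over 2D waypoints with a
    main next-step head and an auxiliary 'overprediction' head that targets a
    waypoint several steps further into the future."""
    def __init__(self, d_model=64, horizon_skip=4):
        super().__init__()
        self.embed = nn.Linear(2, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.next_head = nn.Linear(d_model, 2)   # trained against step t+1
        self.over_head = nn.Linear(d_model, 2)   # trained against step t+1+horizon_skip
        self.horizon_skip = horizon_skip

    def forward(self, traj):                     # traj: (B, T, 2)
        T = traj.shape[1]
        causal = torch.triu(torch.full((T, T), float('-inf')), diagonal=1).to(traj.device)
        h = self.backbone(self.embed(traj), mask=causal)
        return self.next_head(h), self.over_head(h)

    @torch.no_grad()
    def unroll(self, history, steps):
        """Autoregressively append predicted waypoints to the history."""
        traj = history
        for _ in range(steps):
            nxt, _ = self.forward(traj)
            traj = torch.cat([traj, nxt[:, -1:, :]], dim=1)
        return traj[:, history.shape[1]:]

model = DecoderOnlyTrajectory()
future = model.unroll(torch.randn(2, 10, 2), steps=6)
print(future.shape)                              # torch.Size([2, 6, 2])
```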

Thu 23 Oct. 17:45 - 19:45 PDT

#404
Highlight
Lidar Waveforms are Worth 40x128x33 Words

Dominik Scheuble · Hanno Holzhüter · Steven Peters · Mario Bijelic · Felix Heide

Lidar has become crucial for autonomous driving, providing high-resolution 3D scans that are key for accurate scene understanding. To this end, lidar sensors measure the time-resolved full waveforms from the returning laser light, which a subsequent digital signal processor (DSP) converts to point clouds by identifying peaks in the waveform. Conventional automotive lidar DSP pipelines process each waveform individually, ignoring potentially valuable context from neighboring waveforms. As a result, lidar point clouds are prone to artifacts from low signal-to-noise ratio (SNR) regions, highly reflective objects, and environmental conditions like fog. While leveraging neighboring waveforms has been investigated extensively in transient imaging, the application has been limited to scientific or experimental hardware. In this work, we propose a learned DSP that directly processes full waveforms using a transformer architecture, leveraging features from adjacent waveforms to generate high-fidelity multi-echo point clouds. To assess our method, we modify a conventional automotive lidar and capture data in real-world driving scenarios. Furthermore, we collect dedicated test sets in a weather chamber to assess our method in different environmental conditions. Trained on both synthetic and real data, the method improves Chamfer distance by 32 cm and 20 cm compared to on-device peak finding methods and existing transient imaging approaches, respectively.

Thu 23 Oct. 17:45 - 19:45 PDT

#405
Spherical Epipolar Rectification for Deep Two-View Absolute Depth Estimation

Pierre-André Brousseau · Sébastien Roy

Absolute depth estimation from single camera sequence of images is a relevant task given that mobile machines increasingly rely on vision to navigate. Deep learning for stereo matching has been demonstrated to improve performance for stereo rectified depth estimation but these methods require straightforward left-right camera setups. This work proposes to introduce deep stereo matching to two views of a monocular image sequence obtained from a camera in motion in a static scene. This paper introduces a novel and principled spherical epipolar rectification model, which handles all camera motions. This rectification model is differentiable and allows self-supervised deep stereo matching algorithms to compute disparity and recover depth, given known camera pose. This paper also introduces a spherical crop operation which limits rectified image size and allows for competitive absolute depth estimation performance. This results in a spherical rectification model that is demonstrated to provide metric depth and compete favorably with a current state-of-the-art monocular depth estimator.

Thu 23 Oct. 17:45 - 19:45 PDT

#406
ContraGS: Codebook-Condensed and Trainable Gaussian Splatting for Fast, Memory-Efficient Reconstruction

Sankeerth Durvasula · Sharanshangar Muhunthan · Zain Moustafa · Richard Chen · Ruofan Liang · Yushi Guan · Nilesh Ahuja · Nilesh Jain · Selvakumar Panneer · Nandita Vijaykumar

3D Gaussian Splatting (3DGS) is a state-of-the-art technique for modeling real-world scenes with high quality and real-time rendering. Typically, a higher quality representation can be achieved by using a large number of 3D Gaussians. However, using large 3D Gaussian counts significantly increases the GPU device memory for storing model parameters. A large model thus requires powerful GPUs with high memory capacities for training and has slower training/rendering latencies due to the inefficiencies of memory access and data movement. In this work, we introduce ContraGS, a method to enable training directly on compressed 3DGS representations without reducing the Gaussian count, and thus with little loss in model quality. ContraGS leverages codebooks to compactly store a set of Gaussian parameter vectors throughout the training process, thereby significantly reducing memory consumption. While codebooks have been demonstrated to be highly effective at compressing fully trained 3DGS models, directly training using codebook representations is an unsolved challenge. ContraGS solves the problem of learning non-differentiable parameters in codebook-compressed representations by posing parameter estimation as a Bayesian inference problem. To this end, ContraGS provides a framework that effectively uses MCMC sampling to sample from a posterior distribution over these compressed representations. We demonstrate that ContraGS significantly reduces peak memory during training (by 3.49X on average) and accelerates training and rendering (by 1.36X and 1.88X on average, respectively), while retaining quality close to the state of the art.

Thu 23 Oct. 17:45 - 19:45 PDT

#407
UAVScenes: A Multi-Modal Dataset for UAVs

Sijie Wang · Siqi Li · Yawei Zhang · Shangshu Yu · Shenghai Yuan · Rui She · Quanjiang Guo · JinXuan Zheng · Ong Howe · Leonrich Chandra · Shrivarshann Srijeyan · Aditya Sivadas · Toshan Aggarwal · Heyuan Liu · Hongming Zhang · CHEN CHUJIE · JIANG JUNYU · Lihua Xie · Wee Peng Tay

Multi-modal perception is essential for unmanned aerial vehicle (UAV) operations, as it enables a comprehensive understanding of the UAVs' surrounding environment. However, most existing multi-modal UAV datasets are primarily biased toward localization and 3D reconstruction tasks, or only support map-level semantic segmentation due to the lack of frame-wise annotations for both camera images and LiDAR point clouds. This limitation prevents them from being used for high-level scene understanding tasks. To address this gap and advance multi-modal UAV perception, we introduce UAVScenes, a large-scale dataset designed to benchmark various tasks across both 2D and 3D modalities. Our benchmark dataset is built upon the well-calibrated multi-modal UAV dataset MARS-LVIG, originally developed only for simultaneous localization and mapping (SLAM). We enhance this dataset by providing manually labeled semantic annotations for both frame-wise images and LiDAR point clouds, along with accurate 6-degree-of-freedom (6-DoF) poses. These additions enable a wide range of UAV perception tasks, including segmentation, depth estimation, 6-DoF localization, place recognition, and novel view synthesis (NVS).

Thu 23 Oct. 17:45 - 19:45 PDT

#408
PanoSplatt3R: Leveraging Perspective Pretraining for Generalized Unposed Wide-Baseline Panorama Reconstruction

Jiahui Ren · Mochu Xiang · Jiajun Zhu · Yuchao Dai

Wide-baseline panorama reconstruction has emerged as a highly effective and pivotal approach for not only achieving geometric reconstruction of the surrounding 3D environment, but also generating highly realistic and immersive novel views. Although existing methods have shown remarkable performance across various benchmarks, they are predominantly reliant on accurate pose information. In practical real-world scenarios, the acquisition of precise pose often requires additional computational resources and is highly susceptible to noise. These limitations hinder the broad applicability and practicality of such methods. In this paper, we present PanoSplatt3R, an unposed wide-baseline panorama reconstruction method. We extend and adapt the foundational reconstruction pretrainings from the perspective domain to the panoramic domain, thus enabling powerful generalization capabilities. To ensure a seamless and efficient domain-transfer process, we introduce RoPE rolling that spans rolled coordinates in rotary positional embeddings across different attention heads, maintaining a minimal modification to RoPE's mechanism, while modeling the horizontal periodicity of panorama images. Comprehensive experiments demonstrate that PanoSplatt3R, even in the absence of pose information, significantly outperforms current state-of-the-art methods. This superiority is evident in both the generation of high-quality novel views and the accuracy of depth estimation, thereby showcasing its great potential for practical applications.
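A minimal reading of per-head coordinate rolling in rotary embeddings is sketched below: each head applies standard 1D RoPE to horizontal positions that are cyclically shifted by a head-specific offset, which respects the panorama's horizontal wrap-around. The rolling schedule and the query-only application are assumptions for illustration, not the paper's exact mechanism.

```python
import torch

def rope_rotate(x, pos, base=10000.0):
    """Standard 1D rotary embedding: rotate feature pairs by angles pos * freq."""
    d = x.shape[-1]
    freqs = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)     # (d/2,)
    ang = pos[..., None] * freqs                                          # (L, d/2)
    cos, sin = torch.cos(ang), torch.sin(ang)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def rolled_rope(q, width, num_heads):
    """'RoPE rolling' sketch: each head sees horizontal coordinates cyclically
    rolled by a different offset, so together the heads cover the wrap-around
    (periodic) structure of a panorama. q: (B, H, L, D) with L = width."""
    B, H, L, D = q.shape
    base_pos = torch.arange(L, dtype=torch.float32)
    out = torch.empty_like(q)
    for h in range(H):
        shift = h * width // num_heads                # per-head roll offset
        pos = (base_pos + shift) % width              # rolled, still periodic in width
        out[:, h] = rope_rotate(q[:, h], pos)
    return out

q = torch.randn(1, 4, 32, 16)
print(rolled_rope(q, width=32, num_heads=4).shape)    # torch.Size([1, 4, 32, 16])
```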

Thu 23 Oct. 17:45 - 19:45 PDT

#409
Seam360GS: Seamless 360° Gaussian Splatting from Real-World Omnidirectional Images

Changha Shin · Woong Oh Cho · Seon Joo Kim

360° visual content is widely shared on platforms such as YouTube and plays a central role in virtual reality, robotics, and autonomous navigation. However, consumer-grade dual-fisheye systems consistently yield imperfect panoramas due to inherent lens separation and angular distortions. In this work, we introduce a novel calibration framework that incorporates a dual-fisheye camera model into the 3D Gaussian Splatting pipeline. Our approach not only simulates the realistic visual artifacts produced by dual-fisheye cameras but also enables the synthesis of seamlessly rendered 360° images. By jointly optimizing 3D Gaussian parameters alongside calibration variables that emulate lens gaps and angular distortions, our framework transforms imperfect omnidirectional inputs into flawless novel view synthesis. Extensive evaluations on real-world datasets confirm that our method produces seamless renderings—even from imperfect images—and outperforms existing 360° rendering models.

Thu 23 Oct. 17:45 - 19:45 PDT

#410
GaussianOcc: Fully Self-supervised and Efficient 3D Occupancy Estimation with Gaussian Splatting

Wanshui Gan · Fang Liu · Hongbin Xu · Ningkai Mo · Naoto Yokoya

We introduce GaussianOcc, a systematic method that investigates Gaussian splatting for fully self-supervised and efficient 3D occupancy estimation in surround views. First, traditional methods for self-supervised 3D occupancy estimation still require ground-truth 6D poses from sensors during training. To address this limitation, we propose the Gaussian Splatting for Projection (GSP) module to provide accurate scale information for fully self-supervised training from adjacent view projection. Additionally, existing methods rely on volume rendering for final 3D voxel representation learning using 2D signals (depth maps, semantic maps), which is both time-consuming and less effective. We propose Gaussian Splatting from Voxel space (GSV) to leverage the fast rendering properties of Gaussian splatting. As a result, the proposed GaussianOcc method enables fully self-supervised (no ground-truth pose) 3D occupancy estimation with competitive performance and low computational cost (2.7 times faster in training and 5 times faster in rendering).

Thu 23 Oct. 17:45 - 19:45 PDT

#411
GeoSplatting: Towards Geometry Guided Gaussian Splatting for Physically-based Inverse Rendering

Kai Ye · Chong Gao · Guanbin Li · Wenzheng Chen · Baoquan Chen

Recent 3D Gaussian Splatting (3DGS) representations have demonstrated remarkable performance in novel view synthesis; further, material-lighting disentanglement on 3DGS warrants relighting capabilities and its adaptability to broader applications. While the general approach to the latter operation lies in integrating differentiable physically-based rendering (PBR) techniques to jointly recover BRDF materials and environment lighting, achieving a precise disentanglement remains an inherently difficult task due to the challenge of accurately modeling light transport. Existing approaches typically approximate Gaussian points' normals, which constitute an implicit geometric constraint. However, they usually suffer from inaccuracies in normal estimation that subsequently degrade light transport, resulting in noisy material decomposition and flawed relighting results. To address this, we propose GeoSplatting, a novel approach that augments 3DGS with explicit geometry guidance for precise light transport modeling. By differentiably constructing a surface-grounded 3DGS from an optimizable mesh, our approach leverages well-defined mesh normals and the opaque mesh surface, and additionally facilitates the use of mesh-based ray tracing techniques for efficient, occlusion-aware light transport calculations. This enhancement ensures precise material decomposition while preserving the efficiency and high-quality rendering capabilities of 3DGS. Comprehensive evaluations across diverse datasets demonstrate the effectiveness of GeoSplatting, highlighting its superior efficiency and state-of-the-art inverse rendering performance.

Thu 23 Oct. 17:45 - 19:45 PDT

#412
Wide2Long: Learning Lens Compression and Perspective Adjustment for Wide-Angle to Telephoto Translation

Soumyadipta Banerjee · Jiaul Paik · Debashis Sen

A translation framework that produces images as if they were captured with a telephoto lens, from images captured with a wide-angle lens, will help in reducing the necessity of complex, expensive and bulky lenses on smartphones. To this end, we propose an image-to-image translation pipeline to simulate the lens compression and perspective adjustment associated with this reconstruction, where the size of the main subject in the images remains the same. We judiciously design depth-based image layering, layer-wise in-painting, redundancy reduction and layer scaling modules to construct the desired telephoto image, where the pipeline parameters are estimated by a convolutional network. Our approach is compatible with the related optical transformation, and hence, content behind the main subject is enlarged and content in front of it is diminished, achieving lens compression with appropriate perspective adjustment. Our pipeline performs well qualitatively and quantitatively on several source-target image pairs we have captured solely for this task, and also on images in the wild. We show that it can simulate the different amounts of lens compression associated with targeted $2\times$, $4\times$, $8\times$ changes in the focal length. Further, the pipeline is demonstrated to be effective for a sub-class of the lens-compression problem: portrait perspective distortion correction. We also provide an ablation study to show the significance of the various components in the pipeline.

Thu 23 Oct. 17:45 - 19:45 PDT

#413
LangScene-X: Reconstruct Generalizable 3D Language-Embedded Scenes with TriMap Video Diffusion

Fangfu Liu · Hao Li · Jiawei Chi · Hanyang Wang · Minghui Yang · Fudong Wang · Yueqi Duan

Recovering 3D structures with open-vocabulary scene understanding from 2D images is a fundamental but daunting task. Recent developments have achieved this by performing per-scene optimization with embedded language information. However, they heavily rely on the calibrated dense-view reconstruction paradigm, thereby suffering from severe rendering artifacts and implausible semantic synthesis when limited views are available. In this paper, we introduce a novel generative framework, coined LangScene-X, to unify and generate 3D consistent multi-modality information for reconstruction and understanding. Powered by the generative capability of creating more consistent novel observations, we can build generalizable 3D language-embedded scenes from only sparse views. Specifically, we first train a TriMap video diffusion model that can generate appearance (RGBs), geometry (normals), and semantics (segmentation maps) from sparse inputs through progressive knowledge integration. Furthermore, we propose a Language Quantized Compressor (LQC), trained on large-scale image datasets, to efficiently encode language embeddings, enabling cross-scene generalization without per-scene retraining. Finally, we reconstruct the language surface fields by aligning language information onto the surface of 3D scenes, enabling open-ended language queries. Extensive experiments on real-world data demonstrate the superiority of our LangScene-X over state-of-the-art methods in terms of quality and generalizability.

Thu 23 Oct. 17:45 - 19:45 PDT

#414
MagShield: Towards Better Robustness in Sparse Inertial Motion Capture Under Magnetic Disturbances

Yunzhe Shao · Xinyu Yi · Lu Yin · Shihui Guo · Jun-Hai Yong · Feng Xu

This paper proposes a novel method called MagShield, designed to address the issue of magnetic interference in sparse inertial motion capture (MoCap) systems. Existing Inertial Measurement Unit (IMU) systems are prone to orientation estimation errors in magnetically disturbed environments, limiting their practical application in real-world scenarios. To address this problem, MagShield employs a "detect-then-correct" strategy, first detecting magnetic disturbances through multi-IMU joint analysis, and then correcting orientation errors using human motion priors. MagShield can be integrated with most existing sparse inertial MoCap systems, improving their performance in magnetically disturbed environments. Experimental results demonstrate that MagShield significantly enhances the accuracy of motion capture under magnetic interference and exhibits good compatibility across different sparse inertial MoCap systems. Code will be released.

Thu 23 Oct. 17:45 - 19:45 PDT

#415
Focal Plane Visual Feature Generation and Matching on a Pixel Processor Array

Hongyi Zhang · Laurie Bose · Jianing Chen · Piotr Dudek · Walterio Mayol-Cuevas

Pixel Processor Arrays (PPAs) are vision sensors that embed data and processing into every pixel element. PPAs can execute visual processing directly at the point of light capture, and output only sparse, high-level information. This is in sharp contrast with the conventional visual pipeline, where whole images must be transferred from sensor to processor. This sparse data readout also provides several major benefits such as higher frame rate, lower energy consumption and lower bandwidth requirements. In this work, we demonstrate generation, matching and storage of binary descriptors for visual keypoint features, entirely upon PPA with no need to output images to external processing, making our approach inherently privacy-aware. Our method spreads descriptors across multiple pixel-processors, which allows for significantly larger descriptors than any prior pixel-processing works. These large descriptors can be used for a range of tasks such as place and object recognition. We demonstrate the accuracy of our in-pixel feature matching up to $\sim$94.5%, at $\sim$210fps, across a range of datasets, with a greater than $100\times$ reduction in data transfer and bandwidth requirements over traditional cameras.

Thu 23 Oct. 17:45 - 19:45 PDT

#416
IM360: Large-scale Indoor Mapping with 360 Cameras

Dongki Jung · Jaehoon Choi · Yonghan Lee · Dinesh Manocha

We present a novel 3D mapping pipeline for large-scale indoor environments. To address the significant challenges in large-scale indoor scenes, such as prevalent occlusions and textureless regions, we propose IM360, a novel approach that leverages the wide field of view of omnidirectional images and integrates the spherical camera model into the Structure-from-Motion (SfM) pipeline. Our SfM utilizes dense matching features specifically designed for 360$^\circ$ images, demonstrating superior capability in image registration. Furthermore, with the aid of mesh-based neural rendering techniques, we introduce a texture optimization method that refines texture maps and accurately captures view-dependent properties by combining diffuse and specular components. We evaluate our pipeline on large-scale indoor scenes, demonstrating its effectiveness in real-world scenarios. In practice, IM360 demonstrates superior performance, achieving a 3.5 PSNR increase in textured mesh reconstruction. We attain state-of-the-art performance in terms of camera localization and registration on Matterport3D and Stanford2D3D.
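As background for plugging a spherical camera model into an SfM pipeline, the sketch below shows the standard equirectangular pixel-to-ray mapping and its inverse. The axis convention (y pointing down, z forward) is an assumption and may differ from the one used in IM360.

```python
import numpy as np

def pixel_to_ray(u, v, width, height):
    """Map equirectangular pixel coordinates to unit ray directions on the sphere.
    u in [0, W) covers longitude [-pi, pi), v in [0, H) covers latitude [pi/2, -pi/2]."""
    lon = (u + 0.5) / width * 2 * np.pi - np.pi
    lat = np.pi / 2 - (v + 0.5) / height * np.pi
    x = np.cos(lat) * np.sin(lon)
    y = -np.sin(lat)
    z = np.cos(lat) * np.cos(lon)
    return np.stack([x, y, z], axis=-1)

def ray_to_pixel(d, width, height):
    """Inverse mapping: project unit ray directions back to equirectangular pixels."""
    lon = np.arctan2(d[..., 0], d[..., 2])
    lat = np.arcsin(np.clip(-d[..., 1], -1, 1))
    u = (lon + np.pi) / (2 * np.pi) * width - 0.5
    v = (np.pi / 2 - lat) / np.pi * height - 0.5
    return u, v

# round-trip sanity check on a 2048x1024 panorama grid
u, v = np.meshgrid(np.arange(0, 2048, 256), np.arange(0, 1024, 128))
d = pixel_to_ray(u.astype(float), v.astype(float), 2048, 1024)
u2, v2 = ray_to_pixel(d, 2048, 1024)
print(np.abs(u - u2).max(), np.abs(v - v2).max())      # ~0
```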

Thu 23 Oct. 17:45 - 19:45 PDT

#417
Leveraging 2D Priors and SDF Guidance for Urban Scene Rendering

Siddharth Tourani · Jayaram Reddy · Akash Kumbar · Satyajit Tourani · Nishant Goyal · Madhava Krishna · Dinesh Reddy Narapureddy · Muhammad Haris Khan

Dynamic scene rendering and reconstruction play a crucial role in computer vision and augmented reality. Recent methods based on 3D Gaussian Splatting (3DGS) have enabled accurate modeling of dynamic urban scenes, but they require both camera and LiDAR data, ground-truth 3D segmentations, and motion data in the form of tracklets or pre-defined object templates such as SMPL. In this work, we explore whether a combination of 2D object-agnostic priors in the form of depth and point tracking, coupled with a signed distance function (SDF) representation for dynamic objects, can be used to relax some of these requirements. We present a novel approach that integrates Signed Distance Functions (SDFs) with 3D Gaussian Splatting (3DGS) to create a more robust object representation by harnessing the strengths of both methods. Our unified optimization framework enhances the geometric accuracy of 3D Gaussian splatting and improves deformation modeling within the SDF, resulting in a more adaptable and precise representation. We demonstrate that our method achieves near state-of-the-art performance in rendering metrics even without LiDAR data on urban scenes. Furthermore, when incorporating LiDAR, our approach surpasses existing methods in reconstructing and generating novel views across diverse object categories, without ground-truth 3D motion annotation. Additionally, our method enables various scene editing tasks, including scene decomposition and scene composition.

Thu 23 Oct. 17:45 - 19:45 PDT

#418
CULTURE3D: A Large-Scale and Diverse Dataset of Cultural Landmarks and Terrains for Gaussian-Based Scene Rendering

xinyi zheng · Steve Zhang · Weizhe Lin · Fan Zhang · Walterio Mayol-Cuevas · Yunze Liu · Junxiao Shen

Current state-of-the-art 3D reconstruction models face limitations in building extra-large scale outdoor scenes, primarily due to the lack of sufficiently large-scale and detailed datasets. In this paper, we present an extra-large, fine-grained dataset with 10 billion points composed of 41,006 drone-captured high-resolution aerial images, covering 20 diverse and culturally significant scenes from worldwide locations such as Cambridge campus, the Pyramids, and the Forbidden City. Compared to existing datasets, ours offers significantly larger scale and higher detail, uniquely suited for fine-grained 3D applications. Each scene contains an accurate spatial layout and comprehensive structural information, supporting detailed 3D reconstruction tasks. By reconstructing environments using these detailed images, our dataset supports multiple applications, including outputs in the widely adopted COLMAP format, establishing a novel benchmark for evaluating state-of-the-art large-scale Gaussian Splatting methods. The dataset's flexibility encourages innovations and supports model plug-ins, paving the way for future 3D breakthroughs. All datasets and code will be open-sourced for community use.

Thu 23 Oct. 17:45 - 19:45 PDT

#419
Φ-GAN:Physics-Inspired GAN for Generating SAR Images Under Limited Data

Xidan Zhang · Yihan Zhuang · Qian Guo · Haodong Yang · Xuelin Qian · Gong Cheng · Junwei Han · Zhongling Huang

Approaches for improving generative adversarial networks (GANs) training under a few samples have been explored for natural images. However, these methods have limited effectiveness for synthetic aperture radar (SAR) images, as they do not account for the unique electromagnetic scattering properties of SAR. To remedy this, we propose a physics-inspired regularization method dubbed $\Phi$-GAN, which incorporates the ideal point scattering center (PSC) model of SAR with two physical consistency losses. The PSC model approximates SAR targets using physical parameters, ensuring that $\Phi$-GAN generates SAR images consistent with real physical properties while preventing discriminator overfitting by focusing on PSC-based decision cues. To embed the PSC model into GANs for end-to-end training, we introduce a physics-inspired neural module capable of estimating the physical parameters of SAR targets efficiently. This module retains the interpretability of the physical model and can be trained with limited data. We propose two physical loss functions: one for the generator, guiding it to produce SAR images with physical parameters consistent with real ones, and one for the discriminator, enhancing its robustness by basing decisions on PSC attributes. We evaluate $\Phi$-GAN across several conditional GAN (cGAN) models, demonstrating state-of-the-art performance in data-scarce scenarios on three SAR image datasets.

Thu 23 Oct. 17:45 - 19:45 PDT

#420
Highlight
LBM: Latent Bridge Matching for Fast Image-to-Image Translation

Clément Chadebec · Onur Tasar · Sanjeev Sreetharan · Benjamin Aubin

In this paper, we introduce Latent Bridge Matching (LBM), a new, versatile and scalable method that relies on Bridge Matching in a latent space to achieve fast image-to-image translation. We show that the method can reach state-of-the-art results for various image-to-image tasks using only a single inference step. In addition to its efficiency, we also demonstrate the versatility of the method across different image translation tasks such as object removal, normal and depth estimation, and object relighting. We also derive a conditional framework of LBM and demonstrate its effectiveness by tackling the tasks of controllable image relighting and shadow generation.
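For orientation, the sketch below shows the generic bridge-matching recipe in a latent space: sample a point on a Brownian bridge pinned at the source and target latents, regress the bridge drift, and translate in a single step by integrating the learned drift from t=0. The drift parameterization, noise level, and the tiny MLP are assumptions, not the LBM formulation.

```python
import torch
import torch.nn as nn

class DriftNet(nn.Module):
    """Tiny stand-in for the latent drift network v_theta(x_t, t)."""
    def __init__(self, dim=16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 128), nn.SiLU(), nn.Linear(128, dim))

    def forward(self, x, t):
        return self.net(torch.cat([x, t[:, None]], dim=1))

def bridge_matching_loss(model, x0, x1, sigma=0.1):
    """Sample a point on the Brownian bridge pinned at (x0, x1) and regress the
    bridge drift (x1 - x_t) / (1 - t). Generic recipe, assuming x0/x1 are source
    and target latents from a frozen autoencoder."""
    B = x0.shape[0]
    t = torch.rand(B).clamp(max=0.95)
    xt = (1 - t[:, None]) * x0 + t[:, None] * x1
    xt = xt + sigma * torch.sqrt(t * (1 - t))[:, None] * torch.randn_like(x0)
    target = (x1 - xt) / (1 - t[:, None])
    return ((model(xt, t) - target) ** 2).mean()

@torch.no_grad()
def one_step_translate(model, x0):
    """Single-inference-step translation: integrate the learned drift from t=0 to 1."""
    return x0 + model(x0, torch.zeros(x0.shape[0]))

model = DriftNet()
x0, x1 = torch.randn(8, 16), torch.randn(8, 16)
print(bridge_matching_loss(model, x0, x1).item(), one_step_translate(model, x0).shape)
```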

Thu 23 Oct. 17:45 - 19:45 PDT

#421
SparseLaneSTP: Leveraging Spatio-Temporal Priors with Sparse Transformers for 3D Lane Detection

Maximilian Pittner · Joel Janai · Mario Faigle · Alexandru Condurache

3D lane detection has emerged as a critical challenge in autonomous driving, encompassing identification and localization of lane markings and the 3D road surface. Conventional 3D methods detect lanes from dense Birds-Eye-View (BEV) features, though erroneous transformations often result in a poor feature representation misaligned with the true 3D road surface. While recent sparse lane detectors have outperformed dense BEV approaches, they remain simple adaptations of the standard detection transformer, completely ignoring valuable lane-specific priors. Furthermore, existing methods fail to utilize historic lane observations, which have the potential to resolve ambiguities in situations of poor visibility. To address these challenges, we present SparseLaneSTP, a novel method that integrates both geometric properties of the lane structure and temporal information into a sparse lane transformer. It introduces a new lane-specific spatio-temporal attention mechanism, a continuous lane representation tailored for sparse architectures, as well as temporal regularization. Identifying the weaknesses of existing 3D lane datasets, we further introduce a precise and consistent 3D lane dataset using a simple yet effective auto-labeling strategy. Our experimental section proves the benefits of our contributions and demonstrates state-of-the-art performance across all detection and error metrics on existing 3D lane detection benchmarks as well as on our novel dataset. We aim to release code and data by the publication date.

Thu 23 Oct. 17:45 - 19:45 PDT

#422
Relative Illumination Fields: Learning Medium and Light Independent Underwater Scenes

Mengkun She · Felix Seegräber · David Nakath · Patricia Schöntag · Kevin Köser

We address the challenge of constructing a consistent and photorealistic Neural Radiance Field (NeRF) in inhomogeneously illuminated, scattering environments with unknown, co-moving light sources. While most existing works on underwater scene representation focus on homogeneous, globally illuminated scattering media, limited attention has been given to scenarios such as a robot exploring water deeper than a few tens of meters, where sunlight becomes insufficient. To address this, we propose a novel illumination field that is locally attached to the camera, enabling the capture of uneven lighting effects within the viewing frustum. We combine this with a volumetric representation of the medium into an overall method that effectively handles the interaction between the dynamic illumination field and the static scattering medium. Evaluation results demonstrate the effectiveness and flexibility of our approach. We release our code and dataset at link.
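The core idea of a camera-attached illumination field can be pictured with the small sketch below, which queries the light in the camera frame while the scattering medium stays in world coordinates; illum_field, medium_field, and their interfaces are assumptions for illustration only.

import torch

def shade_sample(illum_field, medium_field, x_world, view_dir, world_to_cam):
    # Express the sample point in the camera frame so the learned light
    # moves rigidly with the camera/robot.
    x_cam = (world_to_cam[:3, :3] @ x_world.T).T + world_to_cam[:3, 3]
    light = illum_field(x_cam)                       # local, uneven illumination
    sigma, albedo = medium_field(x_world, view_dir)  # static scattering medium
    return sigma, light * albedo                     # density and in-scattered radiance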

Thu 23 Oct. 17:45 - 19:45 PDT

#423
GaSLight: Gaussian Splats for Spatially-Varying Lighting in HDR

Christophe Bolduc · Yannick Hold-Geoffroy · Jean-Francois Lalonde

We present GaSLight, a method that generates spatially-varying lighting from regular images. Our method proposes using HDR Gaussian Splats as the light source representation, marking the first time regular images can serve as light sources in a 3D renderer. Our two-stage process first enhances the dynamic range of images plausibly and accurately by leveraging the priors embedded in diffusion models. Next, we employ Gaussian Splats to model 3D lighting, achieving spatially-varying lighting. Our approach yields state-of-the-art results on HDR estimation and its applications in illuminating virtual objects and scenes. To facilitate the benchmarking of images as light sources, we introduce a novel dataset of calibrated and unsaturated HDR images. We assess our method using a combination of this novel dataset and an existing dataset from the literature. The code to reproduce our method will be available upon acceptance.
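A toy sketch of how HDR Gaussian splats might act as light sources when shading a virtual point is given below; the per-splat attributes and the simple cosine/inverse-square falloff are illustrative assumptions rather than the paper's renderer.

import torch

def incident_radiance(splat_means, splat_radiance, shading_point, normal):
    # splat_means: (N, 3) splat positions; splat_radiance: (N, 3) HDR emission
    # shading_point, normal: (3,) for the virtual surface point being lit
    to_light = splat_means - shading_point
    dist2 = (to_light ** 2).sum(-1, keepdim=True).clamp(min=1e-6)
    cos = (to_light / dist2.sqrt() * normal).sum(-1, keepdim=True).clamp(min=0)
    return (splat_radiance * cos / dist2).sum(0)   # cosine-weighted, inverse-square falloff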

Thu 23 Oct. 17:45 - 19:45 PDT

#424
HumanOLAT: A Large-Scale Dataset for Full-Body Human Relighting and Novel-View Synthesis

Timo Teufel · xilong zhou · Umar Iqbal · Pramod Rao · Pulkit Gera · Jan Kautz · Vladislav Golyanik · Christian Theobalt

Simultaneous relighting and novel-view rendering of digital human representations is an important yet challenging task with numerous applications. However, progress in this area has been significantly limited due to the lack of publicly available, high-quality datasets, especially for full-body human captures. To address this critical gap, we introduce the HumanOLAT dataset, the first publicly accessible large-scale dataset providing multi-view One-Light-at-a-Time (OLAT) captures of full-body humans. The dataset includes HDR RGB frames under various illumination conditions, such as white light, environment maps, color gradients and fine-grained OLAT illuminations. Our evaluations on state-of-the-art relighting and novel-view synthesis methods underscore both the dataset's value and the significant challenges still present in accurately modeling complex human-centric appearance and lighting interactions. We believe that HumanOLAT will significantly facilitate future research, enabling rigorous benchmarking and advancements in both general and human-specific relighting and rendering techniques.

Thu 23 Oct. 17:45 - 19:45 PDT

#425
Highlight
Super Resolved Imaging with Adaptive Optics

Robin Swanson · Esther Y. H. Lin · Masen Lamb · Suresh Sivanandam · Kiriakos N. Kutulakos

Astronomical telescopes suffer from a tradeoff between field-of-view (FoV) and image resolution: increasing the FoV leads to an optical field that is under-sampled by the science camera. This work presents a novel computational imaging approach to overcome this tradeoff by leveraging the existing adaptive optics (AO) systems in modern ground-based telescopes. Our key idea is to use the AO system’s deformable mirror to apply a series of learned, precisely controlled distortions to the optical wavefront, producing a sequence of images that exhibit distinct, high-frequency, sub-pixel shifts. These images can then be jointly upsampled to yield the final super-resolved image. Crucially, we show this can be done while simultaneously maintaining the core AO operation: correcting for the unknown and rapidly changing wavefront distortions caused by Earth's atmosphere. To achieve this, we incorporate end-to-end optimization of both the induced mirror distortions and the upsampling algorithm, such that telescope-specific optics and temporal statistics of atmospheric wavefront distortions are accounted for. Our experimental results with a hardware prototype, as well as simulations, demonstrate significant SNR improvements of up to 12 dB over non-AO super-resolution baselines, using only existing telescope optics and no hardware modifications. Moreover, by using a precise bench-top replica of a complete telescope and AO system, we show that our methodology can be readily transferred to an operational telescope.
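The joint upsampling step can be pictured with the classical shift-and-add baseline sketched below, which averages warped copies of sub-pixel-shifted frames on a finer grid; the paper instead learns the mirror patterns and the upsampler end-to-end, so this is only a simplified stand-in that assumes the shifts are known.

import torch
import torch.nn.functional as F

def shift_and_add_superres(frames, shifts, scale=2):
    # frames: (T, 1, H, W) low-res burst; shifts: (T, 2) sub-pixel offsets
    # (in low-res pixels) induced by the deformable mirror.
    T, _, H, W = frames.shape
    accum = torch.zeros(1, 1, H * scale, W * scale)
    for t in range(T):
        dx, dy = float(shifts[t, 0]), float(shifts[t, 1])
        # translation expressed in normalized [-1, 1] grid coordinates
        theta = torch.tensor([[1.0, 0.0, -2.0 * dx / W],
                              [0.0, 1.0, -2.0 * dy / H]]).unsqueeze(0)
        grid = F.affine_grid(theta, (1, 1, H * scale, W * scale), align_corners=False)
        accum += F.grid_sample(frames[t:t + 1], grid, align_corners=False)
    return accum / T   # averaged, aligned high-resolution estimate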

Thu 23 Oct. 17:45 - 19:45 PDT

#426
HVPUNet: Hybrid-Voxel Point-cloud Upsampling Network

Juhyung Ha · Vibhas Vats · Alimoor Reza · Soon-heung Jung · David Crandall

Point-cloud upsampling aims to generate dense point sets from sparse or incomplete 3D data while preserving geometric fidelity. Most existing works follow a point-to-point (P2P) framework to produce denser point sets through iterative, fixed-scale upsampling, which can limit flexibility in handling various levels of detail in 3D models. Alternatively, voxel-based methods can dynamically upsample point density in voxel space but often struggle to preserve precise point locations due to quantization effects. In this work, we introduce the Hybrid-Voxel Point-cloud Upsampling Network (HVPUNet), an efficient framework for dynamic point-cloud upsampling that addresses the limitations of both point-based and voxel-based methods. HVPUNet integrates two key modules: (1) a Shape Completion Module that restores missing geometry by filling empty voxels, and (2) a Super-Resolution Module that enhances spatial resolution and captures finer surface details. Moreover, we adopt progressive refinement, operational voxel expansion, and implicit learning to improve efficiency in 3D reconstruction. Experimental results demonstrate that HVPUNet effectively upscales large scenes and reconstructs intricate geometry at significantly lower computational cost, providing a scalable and versatile solution for 3D reconstruction, super-resolution, and high-fidelity surface generation.
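The two-module pipeline can be summarized with the schematic sketch below; complete_net and superres_net are placeholders for the Shape Completion and Super-Resolution modules, and the plain occupancy-grid voxelization is a deliberate simplification of the hybrid-voxel representation.

import torch

def upsample_points(points, complete_net, superres_net, res=64, scale=2):
    # points: (N, 3) normalized to [0, 1); build a coarse occupancy grid
    idx = (points * res).long().clamp(0, res - 1)
    grid = torch.zeros(1, 1, res, res, res)
    grid[0, 0, idx[:, 0], idx[:, 1], idx[:, 2]] = 1.0

    grid = complete_net(grid)      # fill in missing geometry (empty voxels)
    grid = superres_net(grid)      # raise resolution to (res * scale)^3

    occ = (grid[0, 0] > 0.5).nonzero().float()
    return (occ + 0.5) / (res * scale)   # voxel centers back to point coordinates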