Oral
Oral 2A: View Synthesis and Scene Reconstruction
Exhibit Hall III
RayZer: A Self-supervised Large View Synthesis Model
Hanwen Jiang · Hao Tan · Peng Wang · Haian Jin · Yue Zhao · Sai Bi · Kai Zhang · Fujun Luan · Kalyan Sunkavalli · Qixing Huang · Georgios Pavlakos
We present RayZer, a self-supervised multi-view 3D vision model trained without any 3D supervision, i.e., camera poses and scene geometry, while exhibiting emerging 3D awareness. Concretely, RayZer takes unposed and uncalibrated images as input, recovers camera parameters, reconstructs a scene representation, and synthesizes novel views. During training, RayZer relies solely on its self-predicted camera poses to render target views, eliminating the need for any ground-truth camera annotations and allowing RayZer to be trained with 2D image supervision. The emerging 3D awareness of RayZer is attributed to two key factors. First, we design a self-supervised framework, which achieves 3D-aware auto-encoding of input images by disentangling camera and scene representations. Second, we design a transformer-based model in which the only 3D prior is the ray structure, connecting camera, pixel, and scene simultaneously. RayZer demonstrates novel view synthesis performance comparable to, or even better than, "oracle" methods that rely on pose annotations in both training and testing.
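A minimal sketch of the training signal described above, assuming hypothetical `model` and `renderer` interfaces rather than the authors' code: cameras and a scene representation are predicted jointly from unposed inputs, target views are rendered with the self-predicted cameras, and a 2D photometric loss provides the only supervision.

```python
import torch.nn.functional as F

def rayzer_style_step(model, renderer, input_views, target_views):
    # Predict camera parameters and a latent scene representation from
    # unposed, uncalibrated input images (no pose or geometry labels).
    cameras, scene = model(input_views)

    # Render the held-out target views using the self-predicted target cameras.
    pred_images = renderer(scene, cameras["target"])

    # Photometric reconstruction of raw 2D images is the only training signal.
    return F.mse_loss(pred_images, target_views)
```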
EVER: Exact Volumetric Ellipsoid Rendering for Real-time View Synthesis
Alexander Mai · Peter Hedman · George Kopanas · Dor Verbin · David Futschik · Qiangeng Xu · Falko Kuester · Jonathan Barron · Yinda Zhang
We present Exact Volumetric Ellipsoid Rendering (EVER), a method for real-time 3D reconstruction. EVER accurately blends an unlimited number of overlapping primitives together in 3D space, eliminating the popping artifacts that 3D Gaussian Splatting (3DGS) and other related methods exhibit. EVER represents a radiance field as a set of constant-density volumetric ellipsoids, which are ray traced by intersecting each primitive twice (once on ray entry and once on ray exit) and accumulating the derivatives of the densities and colors along the ray. Because EVER is built around ray tracing, it also enables effects such as defocus blur and fish-eye camera distortion, while still achieving frame rates of ~30 FPS at 720p on an NVIDIA RTX 4090. We show that our method is more accurate on the challenging large-scale scenes from the Zip-NeRF dataset, where it achieves state-of-the-art SSIM, even higher than Zip-NeRF.
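The exactness claim follows from the rendering integral: with constant-density primitives, density and color are piecewise constant between the sorted entry/exit points along a ray, so each segment's contribution has a closed form and no stochastic sampling or per-view sorting approximation is needed. The sketch below illustrates this with a hypothetical event layout; it is not the paper's ray tracer.

```python
import numpy as np

def render_ray(events):
    """events: (t, d_sigma, d_weighted_color) tuples sorted by ray distance t;
    entering a primitive adds its density and density-weighted color,
    exiting subtracts them (hypothetical layout)."""
    sigma, wcol = 0.0, np.zeros(3)           # piecewise-constant state along the ray
    color, transmittance = np.zeros(3), 1.0
    prev_t = events[0][0]
    for t, d_sigma, d_wcol in events:
        seg = t - prev_t
        if seg > 0.0 and sigma > 0.0:
            alpha = 1.0 - np.exp(-sigma * seg)           # exact for constant density
            color += transmittance * alpha * (wcol / sigma)
            transmittance *= 1.0 - alpha
        sigma += d_sigma                                  # accumulate density/color deltas
        wcol += np.asarray(d_wcol, dtype=float)
        prev_t = t
    return color, transmittance
```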
Self-Ensembling Gaussian Splatting for Few-Shot Novel View Synthesis
Chen Zhao · Xuan Wang · Tong Zhang · Saqib Javed · Mathieu Salzmann
3D Gaussian Splatting (3DGS) has demonstrated remarkable effectiveness in novel view synthesis (NVS). However, 3DGS tends to overfit when trained with sparse views, limiting its generalization to novel viewpoints. In this paper, we address this overfitting issue by introducing Self-Ensembling Gaussian Splatting (SE-GS). We achieve self-ensembling by incorporating an uncertainty-aware perturbation strategy during training. A $\mathbf{\Delta}$-model and a $\mathbf{\Sigma}$-model are jointly trained on the available images. The $\mathbf{\Delta}$-model is dynamically perturbed based on rendering uncertainty across training steps, generating diverse perturbed models with negligible computational overhead. Discrepancies between the $\mathbf{\Sigma}$-model and these perturbed models are minimized throughout training, forming a robust ensemble of 3DGS models. This ensemble, represented by the $\mathbf{\Sigma}$-model, is then used to generate novel-view images during inference. Experimental results on the LLFF, Mip-NeRF360, DTU, and MVImgNet datasets demonstrate that our approach enhances NVS quality under few-shot training conditions, outperforming existing state-of-the-art methods.
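A minimal sketch of the self-ensembling objective, with hypothetical interfaces (`render`, `gaussians`, a per-parameter `uncertainty` dict) that are not the released implementation: the $\mathbf{\Delta}$-model is perturbed in proportion to rendering uncertainty, and the $\mathbf{\Sigma}$-model is regularized toward agreement with the perturbed renderings on top of the usual photometric loss.

```python
import torch
import torch.nn.functional as F

def self_ensembling_loss(sigma_model, delta_model, uncertainty, camera, gt_image):
    # Noise a copy of the Delta-model's Gaussian parameters; parameters with
    # higher rendering uncertainty receive larger perturbations.
    perturbed = {
        name: p + torch.randn_like(p) * uncertainty[name]
        for name, p in delta_model.gaussians.items()
    }

    img_sigma = sigma_model.render(camera)                       # ensemble model
    img_perturbed = delta_model.render(camera, params=perturbed)

    photometric = F.l1_loss(img_sigma, gt_image)                 # standard 3DGS term
    consistency = F.l1_loss(img_sigma, img_perturbed.detach())   # ensemble agreement
    return photometric + consistency
```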
Back on Track: Bundle Adjustment for Dynamic Scene Reconstruction
Weirong Chen · Ganlin Zhang · Felix Wimbauer · Rui Wang · Nikita Araslanov · Andrea Vedaldi · Daniel Cremers
Traditional SLAM systems, which rely on bundle adjustment, often struggle with the highly dynamic scenes commonly found in casual videos. Such videos entangle camera motion with the motion of dynamic elements, undermining the assumption of static environments required by traditional systems. Existing techniques either filter out dynamic elements or model their motion independently. However, the former often results in incomplete reconstructions, whereas the latter can lead to inconsistent motion estimates. This work proposes a novel approach that leverages a 3D point tracker to decouple static and dynamic motion, effectively separating the camera-induced motion from the motion of dynamic objects. Bundle adjustment can therefore operate reliably on only the camera-induced component of the observed motion. We further ensure depth consistency across video frames with lightweight post-processing based on scale maps. Our framework combines the core of traditional SLAM -- bundle adjustment -- with a robust learning-based 3D tracker front-end. By integrating motion decomposition, bundle adjustment, and depth refinement into a unified framework, our method accurately tracks the camera motion and produces temporally coherent and scale-consistent dense reconstructions, accommodating both static and dynamic elements. Our experiments on challenging datasets reveal significant improvements in camera pose estimation and 3D reconstruction accuracy.
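A minimal sketch of the decomposition idea, under assumed array layouts (not the paper's code): bundle adjustment only receives reprojection residuals from tracks the 3D point tracker labels as static, so the optimized residual reflects camera-induced motion alone.

```python
import numpy as np

def static_reprojection_residuals(points_3d, tracks_2d, is_static, poses, K):
    """points_3d: (N, 3) triangulated track points, tracks_2d: (F, N, 2) observations,
    is_static: (N,) boolean labels from the 3D tracker, poses: (F, 4, 4) world-to-camera,
    K: (3, 3) intrinsics. Returns the residual vector a BA solver would minimize."""
    residuals = []
    for f, T in enumerate(poses):
        for n in np.flatnonzero(is_static):            # dynamic tracks are excluded from BA
            p_cam = T[:3, :3] @ points_3d[n] + T[:3, 3]
            uv = (K @ p_cam)[:2] / p_cam[2]            # pinhole projection
            residuals.append(tracks_2d[f, n] - uv)
    return np.concatenate(residuals)
```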
SceneSplat: Gaussian Splatting-based Scene Understanding with Vision-Language Pretraining
Yue Li · Qi Ma · Runyi Yang · Huapeng Li · Mengjiao Ma · Bin Ren · Nikola Popovic · Nicu Sebe · Ender Konukoglu · Theo Gevers · Luc Van Gool · Martin R. Oswald · Danda Pani Paudel
Recognizing arbitrary or previously unseen categories is essential for comprehensive real-world 3D scene understanding. Currently, all existing methods rely on 2D or textual modalities, either during training or at inference. This highlights a clear absence of a model capable of processing 3D data alone to learn semantics end-to-end, along with the data necessary to train such a model. Meanwhile, 3D Gaussian Splatting (3DGS) has emerged as the de facto standard for 3D scene representation across various vision tasks. However, effectively integrating semantic reasoning into 3DGS in a generalizable fashion remains an open challenge. To address these limitations, we introduce SceneSplat, to our knowledge the first large-scale 3D indoor scene understanding approach that operates natively on 3DGS. Furthermore, we propose a self-supervised learning scheme that unlocks rich 3D feature learning from unlabeled scenes. To power the proposed methods, we introduce SceneSplat-7K, the first large-scale 3DGS dataset for indoor scenes, comprising 6868 scenes derived from seven established datasets such as ScanNet and Matterport3D. Generating SceneSplat-7K required computational resources equivalent to 119 GPU-days on an L4 GPU, enabling standardized benchmarking of 3DGS-based reasoning for indoor scenes. Our exhaustive experiments on SceneSplat-7K demonstrate the significant benefit of the proposed methods over the established baselines. Our code, model, and datasets will be released to facilitate further research.
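As an illustration of the kind of open-vocabulary query a language-aligned 3DGS representation supports (assumed feature layout, not the paper's implementation), per-Gaussian semantic features can be matched against a free-form text embedding by cosine similarity:

```python
import torch
import torch.nn.functional as F

def query_gaussians(gaussian_features, text_embedding, threshold=0.25):
    """gaussian_features: (N, D) per-Gaussian semantic features,
    text_embedding: (D,) embedding of an arbitrary category name."""
    sims = F.cosine_similarity(gaussian_features, text_embedding.unsqueeze(0), dim=-1)
    return sims > threshold          # Gaussians matching the open-vocabulary query
```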