Oral 6A: Physical Scene Perception
Exhibit Hall III
SuperDec: 3D Scene Decomposition with Superquadric Primitives
Elisabetta Fedele · Boyang Sun · Francis Engelmann · Marc Pollefeys · Leonidas Guibas
We present SuperDec, an approach for compact 3D scene representations based on geometric primitives, namely superquadrics. While most recent works leverage geometric primitives to obtain photorealistic 3D scene representations, we propose to leverage them to obtain a compact yet expressive representation. We solve the problem locally on individual objects and leverage the capabilities of instance segmentation methods to scale our solution to full 3D scenes. To do so, we design a new architecture that efficiently decomposes point clouds of arbitrary objects into a compact set of superquadrics. We train our architecture on ShapeNet and demonstrate its generalization capabilities on object instances extracted from the ScanNet++ dataset as well as on full Replica scenes. Finally, we show how a compact superquadric-based representation is useful for a diverse range of downstream applications, including robotic tasks and controllable visual content generation and editing.
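For context, a minimal sketch of the standard superquadric inside-outside function that such decompositions fit; the function and parameter names below are generic illustrations, not the paper's architecture or code.

```python
import numpy as np

def superquadric_inside_outside(points, scale, eps):
    """Inside-outside function of a superquadric in its canonical frame.

    F < 1: point is inside, F = 1: on the surface, F > 1: outside.
    `scale` = (a1, a2, a3) axis lengths; `eps` = (eps1, eps2) shape
    exponents (eps near 0 gives box-like shapes, eps = 1 an ellipsoid).
    `points` is an (N, 3) array in the primitive's local coordinates.
    """
    x, y, z = (np.abs(points) / np.asarray(scale, dtype=np.float64)).T
    e1, e2 = eps
    return (x ** (2 / e2) + y ** (2 / e2)) ** (e2 / e1) + z ** (2 / e1)
```

A fitting objective typically penalizes the deviation of F from 1 over the object's points; a learned decomposer like the one described here would instead regress the pose, scale, and exponents of several such primitives directly from the point cloud.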
Zero-shot image restoration (IR) methods based on pretrained diffusion models have recently achieved significant success. These methods typically require at least a parametric form of the degradation model. However, in real-world scenarios, the degradation may be too complex to define explicitly. To handle this general case, we introduce the Diffusion Image Prior (DIIP). We take inspiration from the Deep Image Prior (DIP), since it can be used to remove artifacts without an explicit degradation model. In contrast to DIP, however, we find that pretrained diffusion models offer a much stronger prior, despite being trained without any knowledge of the corrupted data. We show that the optimization process in DIIP first reconstructs a clean version of the image before eventually overfitting to the degraded input, and that it does so for a broader range of degradations than DIP. In light of this result, we propose a blind IR method based on early stopping, which does not require prior knowledge of the degradation model. We validate DIIP on various degradation-blind IR tasks, including JPEG artifact removal, deblurring, denoising, and super-resolution, achieving state-of-the-art results.
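To make the early-stopping idea concrete, here is a generic DIP-style reconstruction loop with a plateau-based stop; this is a sketch under assumptions, not the paper's DIIP procedure, and the `model`/`latent_dim` interface and the stopping criterion are hypothetical stand-ins.

```python
import torch
import torch.nn.functional as F

def early_stopped_restoration(model, degraded, steps=2000, patience=100):
    """DIP-style loop: fit a generator to the degraded image and stop
    before it overfits to the artifacts.

    `model`: maps a fixed latent to an image (hypothetical stand-in for
    the diffusion-based generator). Stopping on a loss plateau is a
    simple proxy; the abstract does not specify the actual criterion.
    """
    z = torch.randn(1, model.latent_dim)  # fixed input latent
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    best_loss, best_img, stall = float("inf"), None, 0
    for _ in range(steps):
        opt.zero_grad()
        out = model(z)
        loss = F.mse_loss(out, degraded)
        loss.backward()
        opt.step()
        if loss.item() < best_loss - 1e-5:
            best_loss, best_img, stall = loss.item(), out.detach(), 0
        else:
            stall += 1
            if stall >= patience:
                break  # stop before the output reproduces the degradation
    return best_img
```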
A lens brings a $\textit{single}$ plane into focus on a planar sensor; hence, parts of the scene outside this focus plane are resolved on the sensor under defocus. Can we break this precept with a lens whose depth of field can be changed arbitrarily? This work investigates the design and implementation of such a computational lens with spatially-selective focusing. Our design uses an optical arrangement of Lohmann lenses and phase spatial light modulators to allow each pixel to focus onto a different depth. We extend classical autofocusing techniques to the spatially-varying scenario, in which the depth map is iteratively estimated using contrast and disparity cues, enabling the camera to progressively shape its depth of field to the scene's depth. By obtaining an optical all-in-focus image, our technique advances upon a broad swathe of prior work, ranging from depth-from-focus/defocus to coded-aperture techniques, in two key aspects: the ability to bring an entire scene into focus simultaneously, and the ability to maintain the highest possible spatial resolution.
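As a rough analogue of the contrast cue, the classic depth-from-focus recipe picks, per pixel, the focal setting that maximizes local sharpness; a minimal sketch follows, assuming a precaptured focal stack rather than the paper's iterative, optically programmable system, which also uses disparity cues.

```python
import numpy as np
from scipy.ndimage import laplace, uniform_filter

def depth_from_contrast(stack, depths, window=9):
    """Contrast-cue depth estimation over a focal stack.

    `stack`: (D, H, W) grayscale images, where stack[d] is focused at
    depths[d]. Returns the per-pixel depth that maximizes a local
    sharpness measure (locally averaged squared Laplacian).
    """
    sharpness = np.stack([
        uniform_filter(laplace(img.astype(np.float64)) ** 2, size=window)
        for img in stack
    ])
    return np.asarray(depths)[np.argmax(sharpness, axis=0)]
```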
mmWave radars are compact, inexpensive, and durable sensors that are robust to occlusions and work regardless of environmental conditions such as weather and darkness. However, this comes at the cost of poor angular resolution, especially for inexpensive single-chip radars, which are typically used in automotive and indoor sensing applications. Although many learning-based methods have been proposed to mitigate this weakness, no standardized foundation models or large datasets for mmWave radar have emerged, and practitioners have largely trained task-specific models from scratch on relatively small datasets. In this paper, we collect (to our knowledge) the largest available raw radar dataset, with 1M samples (29 hours), and train a foundation model for 4D single-chip radar that can predict 3D occupancy and semantic segmentation with quality typically only possible with much higher-resolution sensors. We demonstrate that our Generalizable Radar Transformer (GRT) generalizes across diverse settings, can be fine-tuned for different tasks, and exhibits logarithmic data scaling of 20\% per $10\times$ increase in data. We also run extensive ablations on common design decisions and find that using raw radar data significantly outperforms widely-used lossy representations, equivalent to a $10\times$ increase in training data. Finally, we estimate a total data requirement of $\approx$100M samples (3000 hours) to fully exploit the potential of GRT.
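A back-of-the-envelope sketch of the logarithmic scaling claim, using only the numbers stated above (the abstract does not specify the performance metric, so the gain is expressed as a relative improvement over the 1M-sample baseline):

```python
import numpy as np

def scaled_gain(n_samples, n_base=1e6, gain_per_decade=0.20):
    """Relative improvement over the base dataset size, assuming the
    reported ~20% gain per 10x increase in training data."""
    return gain_per_decade * np.log10(n_samples / n_base)

print(scaled_gain(1e7))  # +0.20: one decade more data
print(scaled_gain(1e8))  # +0.40: two decades, ~the estimated 100M-sample requirement
```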
Event-based Visual Vibrometry
Xinyu Zhou · Peiqi Duan · Yeliduosi Xiaokaiti · Chao Xu · Boxin Shi
Visual vibrometry has emerged as a powerful technique for remote acquisition of audio signals and the physical properties of materials. To capture high-frequency vibrations, frame-based visual vibrometry approaches often require a high-speed video camera and bright lighting to compensate for the short exposure time. In this paper, we introduce event-based visual vibrometry, a new high-speed visual vibration sensing method using an event camera. Exploiting the high temporal resolution, high dynamic range, and low bandwidth of event cameras, event-based visual vibrometry achieves high-speed vibration sensing under common lighting conditions with enhanced data efficiency. Specifically, we leverage a hybrid camera system and propose an event-based subtle motion estimation framework that combines an optimization-based approach for estimating coarse motion within short time intervals with a neural network that mitigates inaccuracies in the coarse motion estimates. We demonstrate our method by capturing vibrations caused by audio sources and estimating material properties of various objects.
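Once a per-frame motion signal has been recovered, turning it into an audio-band spectrum is standard signal processing; a minimal sketch follows, assuming a 1D displacement trace as input. This is a generic post-processing step, not the paper's estimation pipeline.

```python
import numpy as np

def vibration_spectrum(displacement, fs):
    """Compute the magnitude spectrum of a recovered vibration signal.

    `displacement`: 1D motion estimates (e.g., averaged over the object)
    sampled at `fs` Hz; the high temporal resolution of event cameras
    makes large `fs` attainable without bright lighting.
    """
    x = displacement - displacement.mean()              # remove DC offset
    spec = np.abs(np.fft.rfft(x * np.hanning(len(x))))  # windowed FFT
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    return freqs, spec
```

Peaks in the returned spectrum correspond to the dominant vibration frequencies, which is what links the recovered motion to audio content or to material resonances.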