

Timezone: Pacific/Honolulu

Registration Desk: Registration/Badge Pickup Thu 23 Oct 07:30 a.m.  


Oral 5B: Applications and evaluation Thu 23 Oct 08:00 a.m.  

Oral
Hanyi Wang · Han Fang · Shi-Lin Wang · Ee-Chien Chang

[ Kalakaua Ballroom ]

Abstract
Generative image watermarking enables the proactive detection and traceability of generated images. Among existing methods, inversion-based frameworks achieve highly concealed watermark embedding by injecting watermarks into the latent representation before the diffusion process. The robustness of this approach hinges on both the embedding mechanism and inversion accuracy. However, prior works have predominantly focused on optimizing the embedding process while overlooking inversion errors, which significantly affect extraction fidelity. In this paper, we address the challenge of inversion errors and propose ROAR, a dual-domain optimization-based framework designed to mitigate errors arising from two key sources: 1) latent-domain errors, which accumulate across inversion steps due to inherent approximation assumptions, and 2) pixel-domain errors, which result from channel distortions such as JPEG compression. To tackle these issues, we introduce two novel components: a \textbf{Regeneration-based Optimization (RO)} mechanism, which incorporates an optimizable starting latent to minimize latent-domain errors, and a Mixture of Experts (MoE)-based \textbf{distortion-adaptive restoration (AR)} network, which effectively recovers watermarked distributions from pixel-level distortions. Extensive experiments demonstrate that ROAR significantly reduces inversion errors and enhances watermark extraction robustness, thereby improving the reliability of generative image watermarking.
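A minimal sketch of the regeneration-based optimization idea described above, assuming a frozen, differentiable operator `generate_fn` that maps a starting latent to the latent it should reproduce, and a target latent `z_target`; these names, the MSE objective, and the optimizer settings are illustrative stand-ins, not the authors' implementation.

```python
import torch

def optimize_starting_latent(generate_fn, z_target, z_init, steps=200, lr=1e-2):
    """Gradient-based refinement of a starting latent (illustrative only).

    generate_fn: frozen, differentiable map from the starting latent to the
                 latent that should match z_target (hypothetical stand-in).
    z_target:    latent recovered by inversion that we want to reproduce.
    z_init:      initial guess for the starting latent.
    """
    z = z_init.clone().requires_grad_(True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = torch.nn.functional.mse_loss(generate_fn(z), z_target)
        loss.backward()
        opt.step()
    return z.detach()
```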
Oral
Yi Chen · Yuying Ge · Weiliang Tang · Yizhuo Li · Yixiao Ge · Mingyu Ding · Ying Shan · Xihui Liu

[ Kalakaua Ballroom ]

Abstract
Recent developments in Large Language Models (LLMs) pre-trained on extensive corpora have shown significant success in various natural language processing (NLP) tasks with minimal fine-tuning. This success offers new promise for robotics, which has long been constrained by the high cost of action-labeled data. We ask: given the abundant video data containing interaction-related knowledge available as a rich "corpus", **can a similar generative pre-training approach be effectively applied to enhance robot learning?** The key challenge is to identify an effective representation for autoregressive pre-training that benefits robot manipulation tasks. Inspired by the way humans learn new skills through observing dynamic environments, we propose that effective robotic learning should emphasize motion-related knowledge, which is closely tied to low-level actions and is hardware-agnostic, facilitating the transfer of learned motions to actual robot actions. To this end, we introduce **Moto**, which converts video content into latent **Mo**tion **To**ken sequences by a Latent Motion Tokenizer, learning a bridging "language" of motion from videos in an unsupervised manner. We pre-train Moto-GPT through motion token autoregression, enabling it to capture diverse visual motion knowledge. After pre-training, Moto-GPT demonstrates the promising ability to produce semantically interpretable motion tokens, predict plausible motion trajectories, and assess trajectory rationality through output …
Oral
Seungju Yoo · Hyuk Kwon · Joong-Won Hwang · Kibok Lee

[ Kalakaua Ballroom ]

Abstract
Object detection is a fundamental task in computer vision that has received significant attention in recent years. Despite advances in training object detection models, evaluating their performance in real-world applications remains challenging due to the substantial costs associated with image annotation. To address this issue, we propose Prediction Consistency and Reliability (PCR) as an automated model evaluation (AutoEval) method for object detection. Our method is motivated by the observation that most existing object detection models generate many candidate predictions, which are subsequently filtered through non-maximum suppression (NMS). Specifically, we analyze 1) the consistency between the final and redundant predictions and 2) the reliability of these predictions determined by their confidence scores, and propose PCR by examining their relationships with object detection performance. Furthermore, to facilitate a more realistic assessment of AutoEval methods for object detection, we construct meta-datasets incorporating various corruptions. Experimental results demonstrate the superior performance of PCR compared to the existing AutoEval methods.
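A rough sketch of the kind of consistency and reliability signals the abstract describes, assuming candidate boxes and confidence scores from a detector; the exact PCR statistic and how the two signals are combined with detection performance are not specified in the abstract, so the quantities below are only illustrative.

```python
import torch
from torchvision.ops import nms, box_iou

def pcr_style_scores(boxes, scores, iou_thr=0.5):
    """Illustrative consistency/reliability signals from raw detections.

    boxes:  (N, 4) candidate boxes in (x1, y1, x2, y2) format.
    scores: (N,) confidence scores.
    Returns a crude (consistency, reliability) pair; the actual PCR
    statistic in the paper may be defined differently.
    """
    keep = nms(boxes, scores, iou_thr)          # indices of final predictions
    mask = torch.ones(len(scores), dtype=torch.bool)
    mask[keep] = False
    redundant = mask.nonzero(as_tuple=True)[0]  # suppressed (redundant) predictions

    if len(redundant) == 0:
        consistency = torch.tensor(1.0)
    else:
        # how well redundant boxes agree with the final ones (max IoU per redundant box)
        consistency = box_iou(boxes[redundant], boxes[keep]).max(dim=1).values.mean()

    reliability = scores[keep].mean()           # confidence of the kept predictions
    return consistency.item(), reliability.item()
```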
Oral
Corentin Dumery · Noa Ette · Aoxiang Fan · Ren Li · Jingyi Xu · Hieu Le · Pascal Fua

[ Kalakaua Ballroom ]

Abstract
Visual object counting is a fundamental computer vision task underpinning numerous real-world applications, from cell counting in biomedicine to traffic and wildlife monitoring. However, existing methods struggle to handle the challenge of stacked 3D objects in which most objects are hidden by those above them. To address this important yet underexplored problem, we propose a novel 3D counting approach that decomposes the task into two complementary subproblems - estimating the 3D geometry of the object stack and the occupancy ratio from multi-view images. By combining geometric reconstruction and deep learning-based depth analysis, our method can accurately count identical objects within containers, even when they are irregularly stacked. We validate our 3D Counting pipeline on diverse real-world and large-scale synthetic datasets, which we will release publicly to facilitate further research.
Oral
George Ciubotariu · Zhuyun Zhou · Zongwei Wu · Radu Timofte

[ Kalakaua Ballroom ]

Abstract
We introduce MIORe and VAR-MIORe, novel multi-task datasets that address critical limitations in current benchmarks for motion restoration tasks. Our datasets capture a broad spectrum of motion scenarios—including complex ego-camera movements, dynamic multi-subject interactions, and depth-dependent blur effects—using high-frame-rate (1000 FPS) acquisition and professional-grade optics. By averaging variable numbers of frames based on computed optical flow metrics, MIORe generates consistent motion blur while preserving sharp inputs for video frame interpolation and optical flow estimation. VAR-MIORe further extends this framework by spanning a variable range of motion magnitudes, from minimal to extreme, establishing the first benchmark of its kind. Together, these datasets provide high-resolution, scalable ground truth that challenges existing algorithms under both controlled and adverse conditions, paving the way for next-generation research in non-uniform deblurring, video interpolation, and optical flow analysis.
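A hedged sketch of the frame-averaging idea described above, assuming a stack of sharp high-frame-rate frames and a precomputed per-frame optical-flow magnitude (e.g., from an off-the-shelf flow network); the stopping criterion, the choice of the first frame as the sharp reference, and `target_motion` are assumptions for illustration, not the datasets' exact protocol.

```python
import numpy as np

def synthesize_blur(frames, flow_mag, target_motion=8.0):
    """Average a variable number of consecutive frames to emulate motion blur.

    frames:        list of HxWx3 float arrays from a high-frame-rate capture.
    flow_mag:      per-interval mean optical-flow magnitude between consecutive
                   frames (hypothetical); len(flow_mag) == len(frames) - 1.
    target_motion: accumulated motion (in pixels) at which to stop averaging.
    """
    n, acc = 1, 0.0
    for m in flow_mag:
        n += 1                 # include the next frame
        acc += m               # accumulated motion spanned by frames[:n]
        if acc >= target_motion:
            break
    blurred = np.mean(np.stack(frames[:n]), axis=0)  # blur synthesized from n frames
    sharp = frames[0]                                # sharp counterpart (assumed reference)
    return blurred, sharp, n
```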
Oral
Ziv Weiss Haddad · Oren Barkan · Yehonatan Elisha · Noam Koenigstein

[ Kalakaua Ballroom ]

Abstract
Completeness is a widely discussed property in explainability research, requiring that the attributions sum to the model’s response to the input. While completeness intuitively suggests that the model’s prediction is "completely explained" by the attributions, its global formulation alone is insufficient to ensure meaningful explanations. We contend that promoting completeness locally within attribution subregions, in a soft manner, can serve as a standalone guiding principle for producing high quality attributions. To this end, we introduce the concept of the completeness gap as a flexible measure of completeness and propose an optimization procedure that minimizes this gap across subregions within the attribution map. Extensive evaluations across various model architectures demonstrate that our method outperforms state-of-the-art explanation methods on multiple benchmarks.
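As a hedged illustration only (the abstract does not give the exact definition), global completeness and a soft, subregion-level completeness gap might be written as follows, where $a_i$ are the attributions, $f(x)$ is the model response, and $c_R$ is an assumed regional reference value:

```latex
% Global completeness: attributions sum to the model response
\sum_{i} a_i = f(x)
% Assumed subregion-level relaxation: a completeness gap g(R) over a
% region R, minimized in a soft manner over a collection of subregions
g(R) = \Big|\sum_{i \in R} a_i - c_R\Big|, \qquad
\min_{a}\; \sum_{R \in \mathcal{R}} g(R)
```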

Oral 5A: Content Generation Thu 23 Oct 08:00 a.m.  

Oral
Xiaohang Zhan · Dingming Liu

[ Exhibit Hall III ]

Abstract
We propose a novel training-free image generation algorithm that precisely controls the occlusion relationships between objects in an image. Existing image generation methods typically rely on prompts to influence occlusion, which often lack precision. While layout-to-image methods provide control over object locations, they fail to address occlusion relationships explicitly. Given a pre-trained image diffusion model, our method leverages volume rendering principles to ``render'' the scene in latent space, guided by occlusion relationships and the estimated transmittance of objects. This approach does not require retraining or fine-tuning the image diffusion model, yet it enables accurate occlusion control due to its physics-grounded foundation. In extensive experiments, our method significantly outperforms existing approaches in terms of occlusion accuracy. Furthermore, we demonstrate that by adjusting the opacities of objects or concepts during rendering, our method can achieve a variety of effects, such as altering the transparency of objects, the density of mass (e.g., forests), the concentration of particles (e.g., rain, fog), the intensity of light, and the strength of lens effects, etc.
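The volume-rendering principle referenced above corresponds to standard front-to-back alpha compositing; the sketch below shows it in pixel space with per-object opacities (the paper applies the idea in the diffusion model's latent space, which is not reproduced here).

```python
import numpy as np

def composite_layers(colors, alphas):
    """Front-to-back compositing of object layers with transmittance.

    colors: list of HxWx3 arrays, one per object, ordered front to back.
    alphas: list of HxW arrays with per-pixel opacity in [0, 1].
    Lowering an object's alpha makes it more transparent and reveals
    the objects (or background) behind it.
    """
    h, w, _ = colors[0].shape
    out = np.zeros((h, w, 3))
    transmittance = np.ones((h, w, 1))      # fraction of light not yet absorbed
    for c, a in zip(colors, alphas):
        a = a[..., None]
        out += transmittance * a * c        # contribution weighted by T_i * alpha_i
        transmittance *= (1.0 - a)
    return out
```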
Oral
Jiaxu Zhang · Xianfang Zeng · Xin Chen · Wei Zuo · Gang YU · Zhigang Tu

[ Exhibit Hall III ]

Abstract
We propose MikuDance, a diffusion-based pipeline incorporating mixed motion dynamics to animate stylized character art. MikuDance consists of two key techniques: Mixed Motion Modeling and Mixed-Control Diffusion, to address the challenges of high-dynamic motion and reference-guidance misalignment in character art animation. Specifically, a Scene Motion Tracking strategy is presented to explicitly model the dynamic camera in pixel-wise space, enabling unified character-scene motion modeling. Building on this, the Mixed-Control Diffusion implicitly aligns the scale and body shape of diverse characters with motion guidance, allowing flexible control of local character motion. Subsequently, a Motion-Adaptive Normalization module is incorporated to effectively inject global scene motion, paving the way for comprehensive character art animation. Through extensive experiments, we demonstrate the effectiveness and generalizability of MikuDance across various character art and motion guidance, consistently producing high-quality animations with remarkable motion dynamics.
Oral
Peng Du · Hui Li · Han Xu · Paul Jeon · Dongwook Lee · Daehyun Ji · Ran Yang · Feng Zhu

[ Exhibit Hall III ]

Abstract
Discrete Wavelet Transform (DWT) has been widely explored to enhance the performance of image super-resolution (SR). Despite some DWT-based methods improving SR by capturing fine-grained frequency signals, most existing approaches neglect the interrelations among multi-scale frequency sub-bands, resulting in inconsistencies and unnatural artifacts in the reconstructed images. To address this challenge, we propose a Diffusion Transformer model based on image Wavelet spectra for SR (DTWSR). DTWSR incorporates the superiority of diffusion models and transformers to capture the interrelations among multi-scale frequency sub-bands, leading to a more consistent and realistic SR image. Specifically, we use a Multi-level Discrete Wavelet Transform (MDWT) to decompose images into wavelet spectra. A pyramid tokenization method is proposed which embeds the spectra into a sequence of tokens for the transformer model, facilitating the capture of features from both the spatial and frequency domains. A dual-decoder is elaborately designed to handle the distinct variances in low-frequency (LF) and high-frequency (HF) sub-bands, without omitting their alignment in image generation. Extensive experiments on multiple benchmark datasets demonstrate the effectiveness of our method, with high performance on both perception quality and fidelity.
Oral
Federico Girella · Davide Talon · Ziyue Liu · Zanxi Ruan · Yiming Wang · Marco Cristani

[ Exhibit Hall III ]

Abstract
Fashion design is a complex creative process that blends visual and textual expressions. Designers convey ideas through sketches, which define spatial structure and design elements, and textual descriptions, capturing material, texture, and stylistic details. In this paper, we present LOcalized Text and Sketch for fashion image generation (LOTS), an approach for compositional sketch-text based generation of complete fashion outlooks. LOTS leverages a global description with paired localized sketch + text information for conditioning and introduces a novel step-based merging strategy for diffusion adaptation. First, a Modularized Pair-Centric representation encodes sketches and text into a shared latent space while preserving independent localized features; then, a Diffusion Pair Guidance phase integrates both local and global conditioning via attention-based guidance within the diffusion model’s multi-step denoising process. To validate our method, we build on Fashionpedia to release Sketchy, the first fashion dataset where multiple text-sketch pairs are provided per image. Quantitative results show LOTS achieves state-of-the-art image generation performance on both global and localized metrics, while qualitative examples and a human evaluation study highlight its unprecedented level of design customization.
Oral
Vladimir Kulikov · Matan Kleiner · Inbar Huberman-Spiegelglas · Tomer Michaeli

[ Exhibit Hall III ]

Abstract
Editing real images using a pre-trained text-to-image (T2I) diffusion/flow model often involves inverting the image into its corresponding noise map. However, inversion by itself is typically insufficient for obtaining satisfactory results, and therefore many methods additionally intervene in the sampling process. Such methods achieve improved results but are not seamlessly transferable between model architectures. Here, we introduce FlowEdit, a text-based editing method for pre-trained T2I flow models, which is inversion-free, optimization-free and model agnostic. Our method constructs an ODE that directly maps between the source and target distributions (corresponding to the source and target text prompts) and achieves a lower transport cost than the inversion approach. This leads to state-of-the-art results, as we illustrate with Stable Diffusion 3 and FLUX.
Oral
Yiren Song · Danze Chen · Mike Zheng Shou

[ Exhibit Hall III ]

Abstract
Generating cognitive-aligned layered SVGs remains challenging due to existing methods’ tendencies toward either oversimplified single-layer outputs or optimization-induced shape redundancies. We propose LayerTracer, a DiT-based framework that bridges this gap by learning designers’ layered SVG creation processes from a novel dataset of sequential design operations. Our approach operates in two phases: First, a text-conditioned DiT generates multi-phase rasterized construction blueprints that simulate human design workflows. Second, layer-wise vectorization with path deduplication produces clean, editable SVGs. For image vectorization, we introduce a conditional diffusion mechanism that encodes reference images into latent tokens, guiding hierarchical reconstruction while preserving structural integrity. Extensive experiments show that LayerTracer surpasses optimization-based and neural baselines in generation quality and editability.

Invited Talk: Linda B Smith

The efficiency of learner generated experiences

Much of the information in the world is latent, not revealed without some action by the perceiver. What we see, for example, depends on our posture, on where we turn our heads and eyes, what we do with our hands, where we move and how we move. In human infants and toddlers, the tight tie between momentary behavior and the momentary properties of the visual input leads to highly biased training data at multiple levels: edge statistics, mid-level properties, similarity distributions, semantic-level properties and temporal properties. I will present findings from our analyses of the visual statistics of infant ego-centric images (collected at the scale of daily life in the home) and argue that the quality of the training data is a key factor in the efficient visual learning of infants and toddlers. When the efficiency of human learning exceeds current understanding of learning mechanisms, theorists often posit intrinsic “inductive biases” in the learning machinery that constrain learning outcomes, enabling faster and more certain learning from complex, variable, and noisy training data. The visual statistics generated by infants and toddlers interacting with their everyday world reveal intrinsic constraints that directly bias not learned inferences from noisy training data, but the training data itself. The findings provide insights into potential principles of designing training data that may support efficient learning even by machines with learning mechanisms unlike those of humans.

Linda B Smith

 

Linda B. Smith, Distinguished Professor at Indiana University Bloomington, is an internationally recognized leader in cognitive science and cognitive development. Taking a complex systems perspective, she seeks to understand the interdependencies among perceptual, motor and cognitive developments during the first three years of post-natal life. Using wearable sensors, including head-mounted cameras, she studies how the young learner’s own behavior creates the statistical structure of the learning environments with a current focus on developmentally changing visual statistics at the scale of everyday life and their role in motor, perceptual, and language development. The work extended through collaborations has led to new insights in artificial intelligence and education. Smith received her PhD from the University of Pennsylvania in 1977 and immediately joined the faculty at Indiana University. Her work has been continuously funded by the National Science Foundation and/or the National Institutes of Health since 1978. She won the David E. Rumelhart Prize for Theoretical Contributions to Cognitive Science, the American Psychological Association Award for Distinguished Scientific Contributions, the William James Fellow Award from the American Psychological Society, the Norman Anderson Lifetime Achievement Award, and the Koffka Medal. She is an elected member of both the National Academy of Sciences and the American Academy of Arts and Sciences.



Poster Session 5 & Exhibit Hall Thu 23 Oct 10:45 a.m.  

Poster
Bo Peng · Jie Lu · Guangquan Zhang · Zhen Fang

[ Exhibit Hall I ]

Abstract
This paper investigates the recently emerged problem of Language-assisted Image Clustering (LaIC), where textual semantics are leveraged to improve the discriminability of visual representations to facilitate image clustering. Due to the unavailability of true class names, one of the core challenges of LaIC lies in how to filter positive nouns, i.e., those semantically close to the images of interest, from unlabeled wild corpus data. Existing filtering strategies are predominantly based on the off-the-shelf feature space learned by CLIP; however, despite being intuitive, these strategies lack a rigorous theoretical foundation. To fill this gap, we propose a novel gradient-based framework, termed GradNorm, which is theoretically guaranteed and shows strong empirical performance. In particular, we measure the positiveness of each noun based on the magnitude of gradients back-propagated from the cross-entropy between the predicted target distribution and the softmax output. Theoretically, we provide a rigorous error bound to quantify the separability of positive nouns by GradNorm and prove that GradNorm naturally subsumes existing filtering strategies as extreme special cases of itself. Empirically, extensive experiments show that GradNorm achieves state-of-the-art clustering performance in various real-world settings.
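A minimal sketch of the gradient-magnitude scoring described above, assuming CLIP-style normalized image and noun embeddings; the construction of the predicted target distribution (a sharpened, detached softmax here) and the temperatures are assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def noun_grad_norms(image_feats, text_feats, temperature=0.01):
    """Score each candidate noun by the gradient magnitude of a cross-entropy
    between a (detached) target distribution and the softmax output.

    image_feats: (B, D) L2-normalized image embeddings.
    text_feats:  (K, D) L2-normalized noun embeddings (K candidate nouns).
    """
    text_feats = text_feats.clone().requires_grad_(True)
    logits = image_feats @ text_feats.t() / temperature   # (B, K)
    log_probs = F.log_softmax(logits, dim=-1)
    target = F.softmax(logits.detach() / 0.5, dim=-1)     # assumed sharpened, detached target
    loss = -(target * log_probs).sum(dim=-1).mean()       # cross-entropy
    loss.backward()
    return text_feats.grad.norm(dim=-1)                   # one score per noun
```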
Poster
You Huang · Lichao Chen · Jiayi Ji · Liujuan Cao · Shengchuan Zhang · Rongrong Ji

[ Exhibit Hall I ]

Abstract
Interactive segmentation (IS) improves annotation efficiency by segmenting target regions from user prompts, with widespread applications in real-world scenarios. Current approaches face a critical trade-off: dense-token methods achieve superior accuracy and detail preservation but suffer from prohibitively slow processing on CPU devices, while the Segment Anything Model (SAM) advances the field with sparse prompt tokens for fast inference but compromises segmentation quality. In this paper, we propose Inter2Former to address this challenge by optimizing computation allocation in dense-token processing, which introduces four key enhancements. First, we propose Dynamic Prompt Embedding (DPE) that adaptively processes only regions of interest while avoiding additional overhead from background tokens. Second, we introduce Dynamic Hybrid Attention (DHA), which leverages previous segmentation masks to route tokens through either full attention ($O(N^2)$) for boundary regions or our proposed efficient BSQ attention ($O(N)$) for non-boundary regions. Third, we develop Hybrid Mixture of Experts (HMoE), which applies similar adaptive computation strategies in FFN modules with CPU-optimized parallel processing. Finally, we present Dynamic Local Upsampling (DLU), a reverse operation of DPE, which localizes objects with a lightweight MLP and performs fine-grained upsampling only in detected regions. Experimental results on high-precision IS benchmarks demonstrate that Inter2Former achieves SOTA performance with high …
Poster
Yuting He · Shuo Li

[ Exhibit Hall I ]

Abstract
Contrastive learning (CL) has become a cornerstone of self-supervised pretraining (SSP) in foundation models; however, extending CL to pixel-wise representation—crucial for medical vision—remains an open problem. Standard CL formulates SSP as a binary optimization problem (binary CL) where the excessive pursuit of feature dispersion leads to an ``over-dispersion'' problem, breaking pixel-wise feature correlation and thus disrupting the intra-class distribution. Our vector CL reformulates CL as a vector regression problem, enabling dispersion quantification in pixel-wise pretraining via modeling feature distances in regressing displacement vectors. To implement this novel paradigm, we propose the COntrast in VEctor Regression (\textbf{COVER}) framework. COVER establishes an extendable vector-based self-learning, enforces a consistent optimization flow from vector regression to distance modeling, and leverages a vector pyramid architecture for granularity adaptation, thus preserving pixel-wise feature correlations in SSP. Extensive experiments across 8 tasks, spanning 2 dimensions and 4 modalities, show that COVER significantly improves pixel-wise SSP, advancing generalizable medical visual foundation models. Codes will be publicly available at [GitHub].
Poster
Ming Hu · Kun yuan · Yaling Shen · feilong tang · Xiaohao Xu · Lin Zhou · Wei Li · Ying Chen · Zhongxing Xu · Zelin Peng · Siyuan Yan · Vinkle Srivastav · Diping Song · Tianbin Li · Danli Shi · Jin Ye · Nicolas Padoy · Nassir Navab · Junjun He · Zongyuan Ge

[ Exhibit Hall I ]

Abstract
Vision-language pretraining (VLP) enables open-world generalization beyond predefined labels, a critical capability in surgery due to the diversity of procedures, instruments, and patient anatomies. However, applying VLP to ophthalmic surgery presents unique challenges, including limited vision-language data, intricate procedural workflows, and the need for hierarchical understanding, ranging from fine-grained surgical actions to global clinical reasoning. To address these, we introduce OphVL, a large-scale, hierarchically structured dataset containing over 375K video-text pairs, making it 15× larger than existing surgical VLP datasets. OphVL captures a diverse range of ophthalmic surgical attributes, including surgical phases, operations, actions, instruments, medications, disease causes, surgical objectives, and postoperative care recommendations. By aligning short clips with detailed narratives and full-length videos with structured titles, OphVL provides both fine-grained surgical details and high-level procedural context. Building on OphVL, we propose OphCLIP, a hierarchical retrieval-augmented VLP framework. OphCLIP leverages silent surgical videos as a knowledge base, retrieving semantically relevant content to enhance narrated procedure learning. This enables OphCLIP to integrate explicit linguistic supervision with implicit visual knowledge, improving ophthalmic workflow modeling. Evaluations across 11 benchmark datasets for surgical phase recognition and multi-instrument identification demonstrate OphCLIP’s robust generalization and superior performance, establishing it as a foundation model for ophthalmic surgery.
Poster
Xiaokun Feng · Shiyu Hu · Xuchen Li · Dailing Zhang · Meiqi Wu · Jing Zhang · Xiaotang Chen · Kaiqi Huang

[ Exhibit Hall I ]

Abstract
Vision-language tracking aims to locate the target object in the video sequence using a template patch and a language description provided in the initial frame. To achieve robust tracking, especially in complex long-term scenarios that reflect real-world conditions as recently highlighted by MGIT, it is essential not only to characterize the target features but also to utilize the context features related to the target. However, the visual and textual target-context cues derived from the initial prompts generally align only with the initial target state. Due to their dynamic nature, target states are constantly changing, particularly in complex long-term sequences. It is intractable for these cues to continuously guide Vision-Language Trackers (VLTs). Furthermore, for the text prompts with diverse expressions, our experiments reveal that existing VLTs struggle to discern which words pertain to the target or the context, complicating the utilization of textual cues. In this work, we present a novel tracker named ATCTrack, which can obtain multimodal cues Aligned with the dynamic target states through comprehensive Target-Context feature modeling, thereby achieving robust tracking. Specifically, (1) for the visual modality, we propose an effective temporal visual target-context modeling approach that provides the tracker with timely visual cues. (2) For the textual …
Poster
Simone Alberto Peirone · Francesca Pistilli · Giuseppe Averta

[ Exhibit Hall I ]

Abstract
Human activities are particularly complex and variable, and this makes it challenging for deep learning models to reason about them. However, we note that such variability does have an underlying structure, composed of a hierarchy of patterns of related actions. We argue that such structure can emerge naturally from unscripted videos of human activities, and can be leveraged to better reason about their content. We present HiERO, a weakly-supervised method to enrich video segment features with the corresponding hierarchical activity threads. By aligning video clips with their narrated descriptions, HiERO infers contextual, semantic and temporal reasoning with a hierarchical architecture. We prove the potential of our enriched features with multiple video-text alignment benchmarks (EgoMCQ, EgoNLQ) with minimal additional training, and in zero-shot for procedure learning tasks (EgoProceL and Ego4D Goal-Step). Notably, HiERO achieves state-of-the-art performance in all the benchmarks, and for procedure learning tasks it outperforms fully-supervised methods by a large margin (+12.5% F1 on EgoProceL) in zero-shot. Our results prove the relevance of using knowledge of the hierarchy of human activities for multiple reasoning tasks in egocentric vision.
Poster
Kuniaki Saito · Donghyun Kim · Kwanyong Park · Atsushi Hashimoto · Yoshitaka Ushiku

[ Exhibit Hall I ]

Abstract
An image captioning model that flexibly switches its language pattern, e.g., descriptiveness and length, should be useful since it can be applied to diverse applications. However, despite the dramatic improvement in generative vision-language models, fine-grained control over the properties of generated captions is not easy due to two reasons: (i) existing models are not given the properties as a condition during training and (ii) existing models cannot smoothly transition their language patterns from one state to another. Given this challenge, we propose a new approach, CaptionSmiths, to acquire a single captioning model that can handle diverse language patterns. First, our approach quantifies three properties of each caption (length, descriptiveness, and uniqueness of a word) as continuous scalar values, without human annotation. Given the values, we represent the conditioning via interpolation between two endpoint vectors corresponding to the extreme states, e.g., one for a very short caption and one for a very long caption. Empirical results demonstrate that the resulting model can smoothly change the properties of the output captions and shows higher lexical alignment than baselines. For instance, CaptionSmiths reduces the error in controlling caption length by 506% despite better lexical alignment.
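The conditioning-by-interpolation step is simple enough to sketch directly; `v_low` and `v_high` stand for the learned endpoint vectors of one property (e.g., a very short vs. a very long caption), and the clamping of the property value is an assumption for illustration.

```python
import torch

def property_condition(v_low: torch.Tensor, v_high: torch.Tensor, t: float) -> torch.Tensor:
    """Interpolate between two endpoint vectors for one caption property.

    t in [0, 1] is the desired (normalized) property value. Conditions for
    length, descriptiveness, and word uniqueness can be built the same way
    and combined before being fed to the captioning model.
    """
    t = float(min(max(t, 0.0), 1.0))
    return (1.0 - t) * v_low + t * v_high
```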
Poster
Guangyu Ren · Hengyan Liu · Michalis Lazarou · Tania Stathaki

[ Exhibit Hall I ]

Abstract
Camouflaged scenes, where objects blend seamlessly into their environments, pose significant challenges to both human observers and computer vision systems. These objects match the background in color, texture, and shape, making them difficult to detect. To this end, we propose leveraging the Segment Anything Model (SAM) to tackle this challenging task effectively. Specifically, we show how to exploit SAM without requiring any manual prompts. At the core of our method lies the rich information extracted through multi-modal prompts. First, we generate an image caption using the BLIP model and obtain its text embedding through a text encoder. We then generate a visual embedding through the vision encoder of the BLIP model and use both as inputs to SAM to provide additional semantic information about the image. Finally, we propose two architectural novelties: a) we effectively integrate the multi-modal information in SAM through a multi-level adapter, and b) we replace the dense embedding of SAM with the image embedding of its image encoder. Our method achieves new state-of-the-art performance in 11 out of 12 metrics on three benchmark datasets for camouflaged object detection. Additionally, our method can be successfully adapted to other …
Poster
Li Caoshuo · Zengmao Ding · Xiaobin Hu · Bang Li · Donghao Luo · AndyPianWu AndyPianWu · Chaoyang Wang · Chengjie Wang · Taisong Jin · SevenShu SevenShu · Yunsheng Wu · Yongge Liu · Rongrong Ji

[ Exhibit Hall I ]

Abstract
As one of the earliest ancient languages, Oracle Bone Script (**OBS**) encapsulates the cultural records and intellectual expressions of ancient civilizations. Despite the discovery of approximately 4,500 OBS characters, only about 1,600 have been deciphered. The remaining undeciphered ones, with their complex structure and abstract imagery, pose significant challenges for interpretation. To address these challenges, this paper proposes a novel two-stage semantic typography framework, named **OracleFusion**. In the first stage, this approach leverages the Multimodal Large Language Model (MLLM) with enhanced Spatial Awareness Reasoning (SAR) to analyze the glyph structure of the OBS character and perform visual localization of key components. In the second stage, we introduce Oracle Structural Vector Fusion (**OSVF**), incorporating glyph structure constraints and glyph maintenance constraints to ensure the accurate generation of semantically enriched vector fonts. This approach preserves the objective integrity of the glyph structure, offering visually enhanced representations that assist experts in deciphering OBS. Extensive qualitative and quantitative experiments demonstrate that OracleFusion outperforms state-of-the-art baseline models in terms of semantics, visual appeal, and glyph maintenance, significantly enhancing both readability and aesthetic quality. Furthermore, OracleFusion provides expert-like insights on unseen oracle characters, making it a valuable tool for advancing the decipherment of OBS.
Poster
Haoning Wu · Ziheng Zhao · Ya Zhang · Yanfeng Wang · Weidi Xie

[ Exhibit Hall I ]

Abstract
Training medical image segmentation models for rare yet clinically significant imaging modalities is challenging due to the scarcity of annotated data, and manual mask annotations can be costly and labor-intensive to acquire. This paper investigates **leveraging generative models to synthesize training data, to train segmentation models for underrepresented modalities**, particularly on annotation-scarce MRI. Concretely, our contributions are threefold: (i) we introduce **MRGen-DB**, a large-scale radiology image-text dataset comprising extensive samples with rich metadata, including modality labels, attributes, regions, and organ information, with a subset having pixelwise mask annotations; (ii) we present **MRGen**, a diffusion-based data engine for controllable medical image synthesis, conditioned on text prompts and segmentation masks. MRGen can generate realistic images for diverse MRI modalities lacking mask annotations, facilitating segmentation training in low-resource domains; (iii) extensive experiments across multiple modalities demonstrate that MRGen significantly improves segmentation performance on unannotated modalities by providing high-quality synthetic data. We believe that our method bridges a critical gap in medical image analysis, extending segmentation capabilities to scenarios where manual annotations are challenging to acquire. The codes, models, and data will be publicly available.
Poster
Fanhong Zeng · Huanan LI · Juntao Guan · Rui Fan · Tong Wu · Xilong Wang · Lai Rui

[ Exhibit Hall I ]

Abstract
To enable the deployment of Vision Transformers on resource-constrained mobile and edge devices, the development of efficient ViT models has attracted significant attention. Researchers have achieved remarkable improvements in accuracy and speed by optimizing attention mechanisms and integrating lightweight CNN modules. However, existing designs often overlook runtime overhead from memory-bound operations and the shift in feature characteristics from spatial-dominant to semantic-dominant as networks deepen. This work introduces TinyNeXt, a family of efficient hybrid ViTs for TinyML, featuring Lean Single-Head Self-Attention to minimize memory-bound operations, and a macro design tailored to feature characteristics at different stages. TinyNeXt strikes a better accuracy-speed trade-off across diverse tasks and hardware platforms, outperforming state-of-the-art models of comparable scale. For instance, our TinyNeXt-T achieves a remarkable 71.5\% top-1 accuracy with only 1.0M parameters on ImageNet-1K. Furthermore, compared to recent efficient models like MobileViT-XXS and MobileViT-XS, TinyNeXt-S and TinyNeXt-M achieve 3.7\%/0.5\% higher accuracy, respectively, while running 2.1$\times$/2.6$\times$ faster on Nvidia Jetson Nano.
Poster
Yuntao Shou · Xiangyong Cao · PeiqiangYan PeiqiangYan · Qiaohui Qiaohui · Qian Zhao · Deyu Meng

[ Exhibit Hall I ]

Abstract
In recent years, whole slide image (WSI)-based survival analysis has attracted much attention. In practice, WSIs usually come from different hospitals (or domains) and may have significant differences. These differences generally result in large gaps in distribution between different WSI domains and thus, the survival analysis models trained on one domain may fail to transfer to another. To address this issue, we propose a Dual-branch Encoder and Two-level Alignment (DETA) framework to explore both feature and category-level alignment between different WSI domains. Specifically, we first formulate the concerned problem as graph domain adaptation (GDA) using the graph representation of WSIs. Then, we construct a dual-branch graph encoder, including the message passing (MP) and the shortest path (SP) branches, to explicitly and implicitly extract semantic information from the graph-represented WSIs. To realize GDA, we propose a two-level alignment approach: at the category level, we develop a coupling technique by virtue of the dual-branch structure, leading to reduced divergence between the category distributions of the two domains; at the feature level, we introduce an adversarial perturbation strategy to better augment source domain feature, resulting in improved alignment in feature distribution. Extensive experiments have demonstrated the effectiveness of our proposed DETA framework in …
Poster
Ming Dai · Wenxuan Cheng · Jiang-Jiang Liu · Sen Yang · Wenxiao Cai · Yanpeng Sun · Wankou Yang

[ Exhibit Hall I ]

Abstract
Referring Image Segmentation (RIS) is a challenging task that aims to segment objects in an image based on natural language expressions. While prior studies have predominantly concentrated on improving vision-language interactions and achieving fine-grained localization, a systematic analysis of the fundamental bottlenecks in existing RIS frameworks remains underexplored. To bridge this gap, we propose $\textbf{DeRIS}$, a novel framework that decomposes RIS into two key components: $\textit{perception}$ and $\textit{cognition}$. This modular decomposition facilitates a systematic analysis of the primary bottlenecks impeding RIS performance. Our findings reveal that the predominant limitation lies not in perceptual deficiencies, but in the insufficient multi-modal cognitive capacity of current models. To mitigate this, we propose a Loopback Synergy mechanism, which enhances the synergy between the perception and cognition modules, thereby enabling precise segmentation while simultaneously improving robust image-text comprehension. Additionally, we analyze and introduce a simple non-referent sample conversion data augmentation to address the long-tail distribution issue related to target existence judgement in general scenarios. Notably, $\textbf{DeRIS}$ demonstrates inherent adaptability to both non- and multi-referents scenarios without requiring specialized architectural modifications, enhancing its general applicability.
Poster
Hongwei Lin · Dongyu Pan · Qiming Xia · Hai Wu · Cheng Wang · Siqi Shen · Chenglu Wen

[ Exhibit Hall I ]

Abstract
Recently, learning-based multi-agent cooperative perception has garnered widespread attention. However, the inherent vulnerabilities of neural networks, combined with the risks posed by cooperative communication as a wide-open backdoor, render these systems highly susceptible to adversarial attacks. Existing attack methods lack stealth as they perturb transmitted information indiscriminately, producing numerous false positives that are readily detected by consensus-based defenses. This paper proposes Pretend Benign (PB), a novel stealthy adversarial attack method that exploits vulnerabilities in cooperative perception to enable the attacker to disguise itself as a benign cooperator. To achieve this, we first introduce the Attack Region Selection (ARS) module, which divides the perception area into sub-regions based on confidence levels to pinpoint optimal attack locations. Then, we propose Multi-target Adversarial Perturbation Generation (MAPG), which maintains consensus, gains the victim’s trust, and thereby reverses the normal cooperative role of perception. To mitigate the latency in adversarial signal generation and communication, we further propose a real-time attack by predicting future information through historical feature flow. Extensive experiments on the OPV2V and V2XSet datasets demonstrate that PB effectively bypasses state-of-the-art defense methods, underscoring its stealth and efficacy.
Poster
Sunghyun Park · Jungsoo Lee · Shubhankar Borse · Munawar Hayat · Sungha Choi · Kyuwoong Hwang · Fatih Porikli

[ Exhibit Hall I ]

Abstract
While open-vocabulary semantic segmentation (OVSS) can segment an image into semantic regions based on arbitrarily given text descriptions even for classes unseen during training, it fails to understand personal texts (e.g., 'my mug cup') for segmenting regions of specific interest to users. This paper addresses challenges like recognizing 'my mug cup' among 'multiple mug cups'. To overcome this challenge, we introduce a novel task termed personalized open-vocabulary semantic segmentation and propose a text prompt tuning-based plug-in method designed to recognize personal visual concepts using a few pairs of images and masks, while maintaining the performance of the original OVSS. Based on the observation that reducing false predictions is essential when applying text prompt tuning to this task, our proposed method employs 'negative mask proposal' that captures visual concepts other than the personalized concept. We further improve the performance by enriching the representation of text prompts by injecting visual embeddings of the personal concept into them. This approach enhances personalized OVSS without compromising the original OVSS performance. We demonstrate the superiority of our method on our newly established benchmarks for this task, including FSS$^{per}$, CUB$^{per}$, and ADE$^{per}$.
Poster
Zheng Ziqiang · Wong Kwan · Binh-Son Hua · Jianbo Shi · Sai-Kit Yeung

[ Exhibit Hall I ]

Abstract
We investigate coral reef semantic segmentation, in which coral reefs are governed by multifaceted factors, like genes, environmental changes, and internal interactions. Unlike segmenting structural units/instances, which are predictable and follow a set pattern, also referred to as commonsense or prior, segmenting coral reefs involves modeling the \textit{self-repeated}, \textit{asymmetric}, and \textit{amorphous} distribution of elements, \emph{e.g.}, corals can grow in almost any shape and appearance. We revisited existing segmentation approaches and found that both the computer vision and coral reef research communities failed to incorporate the intrinsic properties of corals into model design. In this work, we propose a simple formulation for coral reef semantic segmentation: \textit{segment} as the basis to model both \textit{within-segment} and \textit{cross-segment} affinities. We propose \textbf{CoralSRT}, a feature rectification module via self-supervised guidance, to reduce the stochasticity of coral features extracted by powerful foundation models (FMs), as demonstrated in Fig.~\ref{fig:teaser}. We incorporate the intrinsic properties of corals to strengthen within-segment affinity by guiding the features within the self-supervised segments to align with the centrality. We find that features from FMs, optimized by various pretext tasks on significantly large-scale unlabeled or labeled data, already contain rich information for modeling both within-segment and cross-segment affinity, enabling the adaptation …
Poster
Olaf Dünkel · Artur Jesslen · Jiahao Xie · Christian Theobalt · Christian Rupprecht · Adam Kortylewski

[ Exhibit Hall I ]

Abstract
An important challenge when using computer vision models in the real world is to evaluate their performance in potential out-of-distribution (OOD) scenarios. While simple synthetic corruptions are commonly applied to test OOD robustness, they mostly do not capture nuisance shifts that occur in the real world. Recently, diffusion models have been applied to generate realistic images for benchmarking, but they are restricted to binary nuisance shifts. In this work, we introduce CNS-Bench, a Continuous Nuisance Shift Benchmark to quantify OOD robustness of image classifiers for continuous and realistic generative nuisance shifts. CNS-Bench allows generating a wide range of individual nuisance shifts in continuous severities by applying LoRA adapters to diffusion models. To remove failure cases, we propose a filtering mechanism that outperforms previous methods and hence enables a reliable benchmarking with generative models. With the proposed benchmark, we perform a large-scale study to evaluate the robustness of more than 40 classifiers under various nuisance shifts. Through carefully designed comparisons and analyses, we find that model rankings can change for varying shifts and shift scales, which cannot be captured when applying common binary shifts. Additionally, we show that evaluating the model performance on a continuous scale allows the identification of model …
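A minimal, framework-agnostic sketch of how a LoRA adapter can be blended into a frozen weight with a continuous scale to realize continuous shift severities; the function and argument names are hypothetical, and CNS-Bench's actual adapter handling and severity schedule may differ.

```python
import torch

def apply_scaled_lora(weight, lora_A, lora_B, severity):
    """Blend a LoRA adapter into a frozen weight with a continuous scale.

    weight:   (out, in) frozen base weight of a diffusion-model layer.
    lora_A:   (r, in) and lora_B: (out, r) low-rank factors of an adapter
              trained for the maximal nuisance shift.
    severity: scalar in [0, 1]; intermediate values yield intermediate
              shift strengths (the mechanism assumed here).
    """
    delta = lora_B @ lora_A               # full update represented by the adapter
    return weight + severity * delta
```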
Poster
Tiange Luo · Lajanugen Logeswaran · Justin Johnson · Honglak Lee

[ Exhibit Hall I ]

Abstract
We introduce RegionFocus, a visual test-time scaling approach that enhances GUI-based AI agents by leveraging visual cues to navigate the complexity of modern web interfaces. Understanding webpages is challenging due to the visual complexity of GUI images and the large number of interface elements, making accurate action selection difficult. Our approach dynamically zooms in on relevant regions, reducing background clutter and improving action accuracy without relying on extensive text-based reasoning. To support this process, we propose an image-as-history mechanism that visualizes key landmarks at each step, providing a transparent action record and enabling the agent to effectively choose among action candidates. Even with a simple region selection strategy, we observe significant performance gains of 31.7\% on Screenspot-pro and 34.9\% on WebVoyager benchmarks on top of a state-of-the-art open Vision Language Model Agent, highlighting the effectiveness of visual test-time scaling in interactive settings. Our code will be released publicly.
Poster
Jiangming Shi · Xiangbo Yin · yeyunchen yeyunchen · Yachao Zhang · zhizhong zhang · Yuan Xie · Yanyun Qu

[ Exhibit Hall I ]

Abstract
Composed Image Retrieval (CIR) aims to retrieve a target image using a query that combines a reference image and a textual description, enabling users to express their intent more effectively. Despite significant advances in CIR methods, two unresolved problems remain: 1) existing methods overlook multi-schema interaction due to the lack of fine-grained explicit visual supervision, which hinders the capture of complex correspondences, and 2) existing methods overlook noisy negative pairs formed by potential corresponding query-target pairs, which increases confusion. To address these problems, we propose a Multi-schemA Proximity Network (MAPNet) for CIR, consisting of two key components: Multi-Schema Interaction (MSI) and Relaxed Proximity Loss (RPLoss). Specifically, MSI leverages textual descriptions as an implicit guide to establish correspondences between multiple objects and attributes in the reference and target images, enabling multi-schema interactions. Then, RPLoss further aligns the query and target features while avoiding the poison of noisy negative pairs through a denoising and reweighting strategy. Comprehensive experiments conducted on CIRR, FashionIQ, and LaSCo demonstrate that MAPNet achieves competitive results against state-of-the-art CIR methods. The source code will be made publicly available after the paper is accepted.
Poster
Tianming Liang · Kun-Yu Lin · Chaolei Tan · Jianguo Zhang · Wei-Shi Zheng · Jian-Fang Hu

[ Exhibit Hall I ]

Abstract
Referring video object segmentation (RVOS) aims to segment target objects throughout a video based on a text description. This is challenging as it involves deep vision-language understanding, pixel-level dense prediction and spatiotemporal reasoning. Despite notable progress in recent years, existing methods still exhibit a noticeable gap when considering all these aspects. In this work, we propose \textbf{ReferDINO}, a strong RVOS model that inherits region-level vision-language alignment from foundational visual grounding models, and is further endowed with pixel-level dense perception and cross-modal spatiotemporal reasoning. In detail, ReferDINO integrates two key components: 1) a grounding-guided deformable mask decoder that utilizes location prediction to progressively guide mask prediction through differentiable deformation mechanisms; 2) an object-consistent temporal enhancer that injects pretrained time-varying text features into inter-frame interaction to capture object-aware dynamic changes. Moreover, a confidence-aware query pruning strategy is designed to accelerate object decoding without compromising model performance. Extensive experimental results on five benchmarks demonstrate that our ReferDINO significantly outperforms previous methods (e.g., +3.9\% \(\mathcal{J}\&\mathcal{F}\) on Ref-YouTube-VOS) while maintaining real-time inference speed (51 FPS). Code and models will be released.
Poster
Saarthak Kapse · Pushpak Pati · Srikar Yellapragada · Srijan Das · Rajarsi Gupta · Joel Saltz · Dimitris Samaras · Prateek Prasanna

[ Exhibit Hall I ]

Abstract
Pretraining a Multiple Instance Learning (MIL) aggregator enables the derivation of Whole Slide Image (WSI)-level embeddings from patch-level representations without supervision. While recent multimodal MIL pretraining approaches leveraging auxiliary modalities have demonstrated performance gains over unimodal WSI pretraining, the acquisition of these additional modalities necessitates extensive clinical profiling. This requirement increases costs and limits scalability in existing WSI datasets lacking such paired modalities. To address this, we propose Gigapixel Vision-Concept Knowledge Contrastive pretraining (GECKO), which aligns WSIs with a Concept Prior derived from the available WSIs. First, we derive an inherently interpretable concept prior by computing the similarity between each WSI patch and textual descriptions of predefined pathology concepts. GECKO then employs a dual-branch MIL network: one branch aggregates patch embeddings into a WSI-level deep embedding, while the other aggregates the concept prior into a corresponding WSI-level concept embedding. Both aggregated embeddings are aligned using a contrastive objective, thereby pretraining the entire dual-branch MIL model. Moreover, when auxiliary modalities such as transcriptomics data are available, GECKO seamlessly integrates them. Across five diverse tasks, GECKO consistently outperforms prior unimodal and multimodal pretraining approaches while also delivering clinically meaningful interpretability that bridges the gap between computational models and pathology expertise.
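A small sketch of the concept-prior computation described above, assuming patch embeddings and concept-text embeddings produced by a shared vision-language encoder; the temperature and softmax normalization are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def concept_prior(patch_embs, concept_text_embs, temperature=0.07):
    """Interpretable concept prior from patch-to-concept similarities.

    patch_embs:        (N, D) embeddings of the N patches of one WSI.
    concept_text_embs: (C, D) embeddings of textual descriptions of C
                       predefined pathology concepts.
    Returns an (N, C) matrix of per-patch concept scores that the concept
    branch can aggregate into a WSI-level concept embedding.
    """
    patch_embs = F.normalize(patch_embs, dim=-1)
    concept_text_embs = F.normalize(concept_text_embs, dim=-1)
    sim = patch_embs @ concept_text_embs.t() / temperature
    return sim.softmax(dim=-1)
```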
Poster
Qi Qin · Le Zhuo · Yi Xin · Ruoyi Du · Zhen Li · Bin Fu · Yiting Lu · Xinyue Li · Dongyang Liu · Xiangyang Zhu · Will Beddow · Erwann Millon · Victor Perez · Wenhai Wang · Yu Qiao · Bo Zhang · Xiaohong Liu · Hongsheng Li · Chang Xu · Peng Gao

[ Exhibit Hall I ]

Abstract
We introduce **Lumina-Image 2.0**, an advanced text-to-image (T2I) model that surpasses previous state-of-the-art methods across multiple benchmarks. Lumina-Image 2.0 is characterized by two key features: (1) *Unification* – it adopts a unified architecture (Unified Next-DiT) that treats text and image tokens as a joint sequence, enabling natural cross-modal interactions and allowing seamless task expansion. Besides, since high-quality captioners can provide semantically well-aligned text-image training pairs, we introduce a unified captioning system, Unified Captioner (UniCap), which can generate detailed and accurate multilingual captions for our model. This not only accelerates model convergence, but also enhances prompt adherence, multi-granularity prompt handling, and task expansion with customized prompt templates. (2) *Efficiency* – to improve the efficiency of our proposed model, we develop multi-stage progressive training strategies to optimize our model, alongside inference-time acceleration strategies without compromising image quality. We evaluate our model on academic benchmarks and T2I arenas, with results confirming that it matches or exceeds existing state-of-the-art models across various metrics, highlighting the effectiveness of our methods.
Poster
Doriand Petit · Steve Bourgeois · Vincent Gay-Bellile · Florian Chabot · Loïc Barthe

[ Exhibit Hall I ]

Abstract
3D semantic segmentation provides high-level scene understanding for applications in robotics, autonomous systems, etc. Traditional methods adapt exclusively to either task-specific goals (open-vocabulary segmentation) or scene content (unsupervised semantic segmentation). We propose DiSCO-3D, the first method addressing the broader problem of 3D Open-Vocabulary Sub-concepts Discovery, which aims to provide a 3D semantic segmentation that adapts to both the scene and user queries. We build DiSCO-3D on Neural Fields representations, combining unsupervised segmentation with weak open-vocabulary guidance. Our evaluations demonstrate that DiSCO-3D achieves effective performance in Open-Vocabulary Sub-concepts Discovery and exhibits state-of-the-art results in the edge cases of both open-vocabulary and unsupervised segmentation.
Poster
Sheng Ye · Xin Chen · Yan Zhang · Xianming Lin · Liujuan Cao

[ Exhibit Hall I ]

Abstract
Camouflaged object detection (COD) faces unique challenges where target boundaries are intrinsically ambiguous due to their textural similarity to backgrounds. Existing methods relying on single-modality features often produce fragmented predictions due to insufficient boundary constraints. To address this, we propose ESCNet with dynamically coupled edge-texture perception. Our framework introduces three core innovations that work in concert: 1) Adaptive Edge-Texture Perceptor (AETP), which creates an edge prediction behaviour where edge and texture information are mutually reinforcing, based on the multi-scale features of the image integrated with the global semantic context of the Transformer; 2) Dual-Stream Feature Augmentor (DSFA), which dynamically adjusts the kernel sampling position according to the local texture complexity and edge orientation, thus accurately enhancing the feature information at fractal boundaries and amorphous texture locations; 3) Multi-Feature Modulation Module (MFMM), which establishes incremental fine-grained improvements for feature calibration and model prediction through enhanced characterisation of edge perception and hierarchical integration of multiple textures. This interconnected system forms a feedback loop where enhanced representations of edge perception improve the model's texture prediction and vice versa. Our ESCNet demonstrates significant performance advantages on all three authoritative datasets. On the $F^w_\beta$ metric, ESCNet achieves 0.859 and 0.843 on the NC4K and CAMO datasets, respectively.
Poster
Zhenwei Shao · Mingyang Wang · Zhou Yu · Wenwen Pan · Yan Yang · Tao Wei · Hongyuan Zhang · Ning Mao · Chen Wei · Jun Yu

[ Exhibit Hall I ]

Abstract
Large vision-language models (VLMs) have demonstrated remarkable capabilities in open-world multimodal understanding, yet their high computational overheads pose great challenges for practical deployment. Some recent works have proposed methods to accelerate VLMs by pruning redundant visual tokens guided by the attention maps of VLM's early layers. Despite the success of these token pruning methods, they still suffer from two major shortcomings: (i) considerable accuracy drop due to insensitive attention signals in early layers, and (ii) limited speedup when generating long responses (e.g., 30 tokens). To address the limitations above, we present TwigVLM---a simple and general architecture by ``growing'' a lightweight twig upon an early layer of the base VLM. Compared with most existing VLM acceleration methods purely based on visual token pruning, our TwigVLM not only achieves better accuracy retention by employing a twig-guided token pruning (TTP) strategy, but also yields higher generation speed by utilizing a self-speculative decoding (SSD) strategy. Taking LLaVA-1.5-7B as the base VLM, experimental results show that TwigVLM preserves 96% of the original performance after pruning 88.9% of visual tokens and achieves 154% speedup in generating long responses, delivering significantly better performance in terms of both accuracy and speed over the state-of-the-art VLM acceleration methods.
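A generic sketch of attention-guided visual token pruning, the operation that TwigVLM's twig-guided token pruning (TTP) builds on; how the attention scores are obtained from the lightweight twig and the exact keep-ratio schedule are not reproduced here.

```python
import torch

def prune_visual_tokens(visual_tokens, attn_to_visual, keep_ratio=0.111):
    """Keep the visual tokens that receive the most attention.

    visual_tokens:  (B, N, D) visual token embeddings.
    attn_to_visual: (B, N) attention mass each visual token receives from the
                    guiding tokens (assumed to come from the twig here).
    keep_ratio:     fraction of tokens to keep (e.g., ~11% to prune 88.9%).
    """
    b, n, d = visual_tokens.shape
    k = max(1, int(n * keep_ratio))
    idx = attn_to_visual.topk(k, dim=1).indices                       # (B, k)
    return torch.gather(visual_tokens, 1, idx.unsqueeze(-1).expand(b, k, d))
```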
Poster
Kecheng Chen · Xinyu Luo · Tiexin Qin · Jie Liu · Hui Liu · Victor Ho Fun Lee · Hong Yan · Haoliang Li

[ Exhibit Hall I ]

Abstract
Foundation medical segmentation models, with MedSAM being the most popular, have achieved promising performance across organs and lesions. However, MedSAM still suffers from compromised performance on specific lesions with intricate structures and appearance, as well as bounding box prompt-induced perturbations. Although current test-time adaptation (TTA) methods for medical image segmentation may tackle this issue, partial (e.g., batch normalization) or whole parametric updates restrict their effectiveness due to limited update signals or catastrophic forgetting in large models. Meanwhile, these approaches ignore the computational complexity during adaptation, which is particularly significant for modern foundation models. To this end, our theoretical analyses reveal that directly refining image embeddings is feasible to approach the same goal as parametric updates under the MedSAM architecture, which enables us to realize high computational efficiency and segmentation performance without the risk of catastrophic forgetting. Under this framework, we propose to encourage maximizing factorized conditional probabilities of the posterior prediction probability using a proposed distribution-approximated latent conditional random field loss combined with an entropy minimization loss. Experiments show that we achieve about 3% Dice score improvements across three datasets while reducing computational complexity by over 7 times.
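A minimal sketch of refining the image embedding at test time with the entropy-minimization term mentioned above; `mask_decoder` abstracts the frozen MedSAM decoder plus prompt, and the distribution-approximated latent conditional random field loss from the paper is omitted.

```python
import torch

def refine_image_embedding(embedding, mask_decoder, steps=10, lr=1e-3):
    """Test-time refinement of the image embedding instead of model weights.

    embedding:    frozen encoder's image embedding for one case.
    mask_decoder: callable mapping an embedding to mask logits (hypothetical
                  wrapper around the frozen decoder and its prompt).
    """
    z = embedding.clone().requires_grad_(True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        probs = torch.sigmoid(mask_decoder(z))
        # binary entropy of the predicted foreground probabilities
        entropy = -(probs * torch.log(probs + 1e-6)
                    + (1 - probs) * torch.log(1 - probs + 1e-6)).mean()
        entropy.backward()
        opt.step()
    return z.detach()
```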
Poster
Nicholas DiBrita · Jason Han · Tirthak Patel

[ Exhibit Hall I ]

Abstract
Research in quantum machine learning has recently proliferated due to the potential of quantum computing to accelerate machine learning. An area of machine learning that has not yet been explored is neural ordinary differential equation (neural ODE) based residual neural networks (ResNets), which aim to improve the effectiveness of neural networks using the principles of ordinary differential equations. In this work, we present our insights about why analog Rydberg atom quantum computers are especially well-suited for ResNets. We also introduce ResQ, a novel framework to optimize the dynamics of Rydberg atom quantum computers to solve classification problems in machine learning using analog quantum neural ODEs.
Poster
hahyeon choi · Junhoo Lee · Nojun Kwak

[ Exhibit Hall I ]

Abstract
Audio-Visual Localization (AVL) aims to identify sound-emitting sources within a visual scene. However, existing studies focus on image-level audio-visual associations, failing to capture temporal dynamics. Moreover, they assume simplified scenarios where sound sources are always visible and involve only a single object. To address these limitations, we propose AVATAR, a video-centric AVL benchmark that incorporates high-resolution temporal information. AVATAR introduces four distinct scenarios -- Single-sound, Mixed-sound, Multi-entity, and Off-screen -- enabling a more comprehensive evaluation of AVL models. Additionally, we present TAVLO, a novel video-centric AVL model that explicitly integrates temporal information. Experimental results show that conventional methods struggle to track temporal variations due to their reliance on global audio features and frame-level mappings. In contrast, TAVLO achieves robust and precise audio-visual alignment by leveraging high-resolution temporal modeling. Our work empirically demonstrates the importance of temporal dynamics in AVL and establishes a new standard for video-centric audio-visual localization.
Poster
Youngeun Kim · Seunghwan Lee · Aecheon Jung · Bogon Ryu · Sungeun Hong

[ Exhibit Hall I ]

Abstract
Model merging enables efficient multi-task models by combining task-specific fine-tuned checkpoints. However, storing multiple task-specific checkpoints requires significant memory, limiting scalability and restricting model merging to larger models and diverse tasks. In this paper, we propose quantizing task vectors (i.e., the difference between pre-trained and fine-tuned checkpoints) instead of quantizing fine-tuned checkpoints. We observe that task vectors exhibit a narrow weight range, enabling low-precision quantization (≤ 4 bit) within existing task vector merging frameworks. To further mitigate quantization errors within ultra-low bit precision (e.g., 2 bit), we introduce Residual Task Vector Quantization, which decomposes the task vector into a base vector and offset component. We allocate bits based on quantization sensitivity, ensuring precision while minimizing error within a memory budget. Experiments on image classification and dense prediction show our method maintains or improves model merging performance while using only 8% of the memory required for full-precision checkpoints. Code and quantized task vectors will be released.
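A minimal sketch of residual task vector quantization under simple assumptions (symmetric uniform quantizers; not the paper's exact bit-allocation scheme): the task vector is quantized coarsely into a base vector, and its residual is quantized again as an offset.

```python
# Illustrative sketch; quantizer choices here are assumptions.
import torch

def uniform_quantize(x, bits):
    """Symmetric uniform quantization; returns the dequantized tensor."""
    qmax = 2 ** (bits - 1) - 1
    scale = x.abs().max().clamp_min(1e-12) / qmax
    return (x / scale).round().clamp(-qmax - 1, qmax) * scale

def residual_task_vector_quantize(w_pretrained, w_finetuned, base_bits=2, offset_bits=2):
    task_vector = w_finetuned - w_pretrained              # narrow-range delta
    base = uniform_quantize(task_vector, base_bits)       # coarse base vector
    offset = uniform_quantize(task_vector - base, offset_bits)  # residual offset
    return base, offset

w0 = torch.randn(1000)                 # stand-in pre-trained weights
w1 = w0 + 0.05 * torch.randn(1000)     # stand-in fine-tuned weights
base, offset = residual_task_vector_quantize(w0, w1)
print("reconstruction error:", (w1 - (w0 + base + offset)).abs().mean().item())
```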
Poster
Jiacheng Lu · Hui Ding · Shiyu Zhang · Guoping Huo

[ Exhibit Hall I ]

Abstract
MRI tumor segmentation remains a critical challenge in medical imaging, where volumetric analysis faces unique computational demands due to the complexity of 3D data. The spatially sequential arrangement of adjacent MRI slices provides valuable information that enhances segmentation continuity and accuracy, yet this characteristic remains underutilized in many existing models. The spatial correlations between adjacent MRI slices can be regarded as “temporal-like” data, similar to frame sequences in video segmentation tasks. To bridge this gap, we propose M-Net, a flexible framework specifically designed for sequential image segmentation. M-Net introduces the novel Mesh-Cast mechanism, which seamlessly integrates arbitrary sequential models into the processing of both channel and temporal information, thereby systematically capturing the inherent “temporal-like” spatial correlations between MRI slices and ensuring consistent segmentation across sequences. Additionally, we define an MRI sequential input pattern and design a Two-Phase Sequential (TPS) training strategy, which first focuses on learning common patterns across sequences before refining slice-specific feature extraction. This approach leverages temporal modeling techniques to preserve volumetric contextual information while avoiding the high computational cost of full 3D convolutions, thereby enhancing the generalizability and robustness of M-Net in sequential segmentation tasks. Experiments on the BraTS2019 and BraTS2023 datasets demonstrate that M-Net outperforms existing …
Poster
Yupeng Hu · Changxing Ding · Chang Sun · Shaoli Huang · Xiangmin Xu

[ Exhibit Hall I ]

Abstract
Open vocabulary Human-Object Interaction (HOI) detection is a challenging task that detects all <human, verb, object> triplets of interest in an image, even those that are not pre-defined in the training set. Existing approaches typically rely on output features generated by large Vision-Language Models (VLMs) to enhance the generalization ability of interaction representations. However, the visual features produced by VLMs are holistic and coarse-grained, which contradicts the nature of detection tasks. To address this issue, we propose a novel Bilateral Collaboration framework for open vocabulary HOI detection (BC-HOI). This framework includes an Attention Bias Guidance (ABG) component, which guides the VLM to produce fine-grained instance-level interaction features according to the attention bias provided by the HOI detector. It also includes a Large Language Model (LLM)-based Supervision Guidance (LSG) component, which provides fine-grained token-level supervision for the HOI detector via the LLM component of the VLM. LSG enhances the ability of ABG to generate high-quality attention bias. We conduct extensive experiments on two popular benchmarks: HICO-DET and V-COCO, consistently achieving superior performance in the open vocabulary and closed settings. The code will be released on GitHub.
Poster
Xiaolong Sun · Le Wang · Sanping Zhou · Liushuai Shi · Kun Xia · Mengnan Liu · Yabing Wang · Gang Hua

[ Exhibit Hall I ]

Abstract
Video temporal grounding is a critical video understanding task, which aims to localize moments relevant to a language description. The challenge of this task lies in distinguishing relevant and irrelevant moments. Previous methods, which focus on learning continuous features, exhibit weak differentiation between foreground and background features. In this paper, we propose a novel Moment-Quantization based Video Temporal Grounding method (MQVTG), which quantizes the input video into various discrete vectors to enhance the discrimination between relevant and irrelevant moments. Specifically, MQVTG maintains a learnable moment codebook, where each video moment matches a codeword. Considering the visual diversity, i.e., various visual expressions for the same moment, MQVTG treats moment-codeword matching as a clustering process without using discrete vectors, avoiding the loss of useful information from direct hard quantization. Additionally, we employ effective prior-initialization and joint-projection strategies to enhance the maintained moment codebook. With its simple implementation, the proposed method can be integrated into existing temporal grounding models as a plug-and-play component. Extensive experiments on six popular benchmarks demonstrate the effectiveness and generalizability of MQVTG, significantly outperforming state-of-the-art methods. Further qualitative analysis shows that our method effectively groups relevant features and separates irrelevant ones, aligning with our goal of enhancing discrimination. The code …
Poster
Yongkun Du · Zhineng Chen · Hongtao Xie · Caiyan Jia · Yu-Gang Jiang

[ Exhibit Hall I ]

Abstract
Connectionist temporal classification (CTC)-based scene text recognition (STR) methods, e.g., SVTR, are widely employed in OCR applications, mainly due to their simple architecture, which only contains a visual model and a CTC-aligned linear classifier, and therefore fast inference. However, they generally exhibit worse accuracy than encoder-decoder-based methods (EDTRs) because they struggle with text irregularity and lack linguistic modeling. To address these challenges, we propose SVTRv2, a CTC model endowed with the ability to handle text irregularities and model linguistic context. First, a multi-size resizing strategy is proposed to resize text instances to appropriate predefined sizes, effectively avoiding severe text distortion. Meanwhile, we introduce a feature rearrangement module to ensure that visual features accommodate the requirement of CTC, thus alleviating the alignment puzzle. Second, we propose a semantic guidance module. It integrates linguistic context into the visual features, allowing the CTC model to leverage language information for improved accuracy. Moreover, this module can be omitted at the inference stage and thus does not increase inference time. We extensively evaluate SVTRv2 in both standard and recent challenging benchmarks, where SVTRv2 is fairly compared to mainstream STR models across multiple scenarios, including different types of text irregularity, languages, long text, and whether employing pretraining. The …
Poster
Yuan Gao · Sangwook Kim · Jianzhong You · Chris Mcintosh

[ Exhibit Hall I ]

Abstract
Medical decision-making requires integrating diverse medical information, from imaging to clinical narratives. These medical modalities are often acquired in a many-to-many manner. However, current medical vision-language pretraining models (Med-VLPMs) fail to directly account for this many-to-many mapping in their model training and embeddings. To address this, we present Probabilistic Modality-Enhanced Diagnosis (ProbMED), a multi-modal Med-VLPM that employs probabilistic contrastive learning to model distributions over embeddings rather than fixed-point, deterministic estimates. ProbMED aligns four distinct modalities—chest X-rays, electrocardiograms, echocardiograms, and clinical text—into a unified probabilistic embedding space. Our framework uses an InfoNCE objective with a probabilistic distance metric (Hellinger distance) to integrate inter-modality distributions. To improve intra-modality binding, we introduce a synthetic sampling loss powered by probabilistic embeddings to capture modality-specific mean and variance. Extensive experiments across 13 medical datasets demonstrate that our model outperforms state-of-the-art Med-VLPMs in cross-modality retrieval, zero-shot and few-shot classification. We also show the robust integration of multiple modalities for prognostication, demonstrating the improved intra- and inter-modality binding of multimodal medical data embeddings. The anonymized code can be found in https://anonymous.4open.science/r/probMED-8564.
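A hedged sketch of probabilistic contrastive alignment, assuming embeddings are modeled as diagonal Gaussians: pairs are scored with the closed-form Hellinger distance and trained with an InfoNCE-style loss. This illustrates the shape of the objective, not ProbMED's implementation.

```python
# Sketch under stated assumptions (diagonal Gaussian embeddings).
import torch
import torch.nn.functional as F

def hellinger_sq(mu1, logvar1, mu2, logvar2):
    """Squared Hellinger distance between diagonal Gaussians, per pair."""
    v1, v2 = logvar1.exp(), logvar2.exp()
    log_bc = (0.5 * (torch.log(2 * (v1 * v2).sqrt()) - torch.log(v1 + v2))
              - 0.25 * (mu1 - mu2) ** 2 / (v1 + v2)).sum(dim=-1)
    return 1.0 - log_bc.exp()

def prob_infonce(mu_a, lv_a, mu_b, lv_b, temperature=0.1):
    """InfoNCE over a batch: matched modality pairs sit on the diagonal."""
    B = mu_a.shape[0]
    d = hellinger_sq(mu_a[:, None], lv_a[:, None], mu_b[None, :], lv_b[None, :])
    logits = -d / temperature                  # smaller distance -> higher score
    return F.cross_entropy(logits, torch.arange(B))

mu_x, lv_x = torch.randn(8, 32), torch.zeros(8, 32)   # e.g. imaging embeddings
mu_t, lv_t = torch.randn(8, 32), torch.zeros(8, 32)   # e.g. report embeddings
print(prob_infonce(mu_x, lv_x, mu_t, lv_t).item())
```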
Poster
Zeyuan Yang · Delin Chen · Xueyang Yu · Maohao Shen · Chuang Gan

[ Exhibit Hall I ]

Abstract
Long video understanding poses unique challenges due to its temporal complexity and low information density. Recent works address this task by sampling numerous frames or incorporating auxiliary tools using LLMs, both of which result in high computational costs. In this work, we introduce a curiosity-driven video agent with self-exploration capability, dubbed VCA. Built upon VLMs, VCA autonomously navigates video segments and efficiently builds a comprehensive understanding of complex video sequences. Instead of directly sampling frames, VCA employs a tree-search structure to explore video segments and collect frames. Rather than relying on external feedback or reward, VCA leverages VLM's self-generated intrinsic reward to guide its exploration, enabling it to capture the most crucial information for reasoning. Experimental results on multiple long video benchmarks demonstrate our approach’s superior effectiveness and efficiency.
Poster
Yiwu Zhong · Zhuoming Liu · Yin Li · Liwei Wang

[ Exhibit Hall I ]

Abstract
Large language models (LLMs) have enabled the creation of multi-modal LLMs that exhibit strong comprehension of visual data such as images and videos. However, these models usually rely on extensive visual tokens from visual encoders, leading to high computational demands, which limits their applicability in resource-constrained environments and for long-context tasks. In this work, we propose a training-free adaptive inference method for multi-modal LLMs that can accommodate a broad range of efficiency requirements with a minimum performance drop. Our method consists of a) iterative token merging based on embedding similarity before LLMs, and b) progressive token pruning within LLM layers based on multi-modal importance. With a minimalist design, our method can be applied to both video and image LLMs. Extensive experiments on diverse video and image benchmarks demonstrate that our method substantially reduces computation load (e.g., a **7-fold** reduction in FLOPs) while preserving the performance of video and image LLMs. Further, under a similar computational cost, our method outperforms the state-of-the-art methods in long video understanding (e.g., **+4.6** on MLVU). Additionally, our in-depth analysis provides insights into token redundancy and LLM layer behaviors, offering guidance for future research in designing efficient multi-modal LLMs.
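A minimal illustration of the merge-then-prune idea, assuming cosine similarity as the merging criterion and an externally supplied per-token importance score; the paper's multi-modal importance measure and progressive layer-wise schedule are not reproduced.

```python
# Illustrative sketch; the merging criterion and importance score are assumptions.
import torch
import torch.nn.functional as F

def merge_most_similar(tokens, n_merge):
    """Greedily average the most similar token pairs (tokens: (N, D))."""
    for _ in range(n_merge):
        x = F.normalize(tokens, dim=-1)
        sim = x @ x.t()
        sim.fill_diagonal_(-1.0)
        i, j = divmod(int(sim.argmax()), sim.shape[1])
        merged = (tokens[i] + tokens[j]) / 2
        keep = [k for k in range(tokens.shape[0]) if k not in (i, j)]
        tokens = torch.cat([tokens[keep], merged[None]], dim=0)
    return tokens

def prune_by_importance(tokens, importance, keep_ratio=0.5):
    k = max(1, int(tokens.shape[0] * keep_ratio))
    return tokens[importance.topk(k).indices]

visual_tokens = torch.randn(100, 64)
merged = merge_most_similar(visual_tokens, n_merge=20)          # 100 -> 80 tokens
pruned = prune_by_importance(merged, merged.norm(dim=-1), 0.5)  # 80 -> 40 tokens
print(pruned.shape)
```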
Poster
Cong Wei · Yujie Zhong · yingsen zeng · Haoxian Tan · Yong Liu · Hongfa Wang · Yujiu Yang

[ Exhibit Hall I ]

Abstract
Boosted by Multi-modal Large Language Models (MLLMs), text-guided universal segmentation models for the image and video domains have made rapid progress recently. However, these methods are often developed separately for specific domains, overlooking the similarities in task settings and solutions across these two areas. In this paper, we define the union of referring segmentation and reasoning segmentation at both the image and video levels as Instructed Visual Segmentation (IVS). Correspondingly, we propose InstructSeg, an end-to-end segmentation pipeline equipped with MLLMs for IVS. Specifically, we employ an object-aware video perceiver to extract temporal and object information from reference frames, facilitating comprehensive video understanding. Additionally, we introduce vision-guided multi-granularity text fusion to better integrate global and detailed text information with fine-grained visual guidance. By leveraging multi-task and end-to-end training, InstructSeg demonstrates superior performance across diverse image and video segmentation tasks, surpassing both segmentation specialists and MLLM-based methods with a single model.
Poster
Cihang Peng · Qiming HOU · Zhong Ren · Kun Zhou

[ Exhibit Hall I ]

Abstract
We present ROVI, a high-quality synthetic dataset for instance-grounded text-to-image generation, created by labeling 1M curated web images. Our key innovation is a strategy called re-captioning, focusing on the pre-detection stage, where a VLM (Vision-Language Model) generates comprehensive visual descriptions that are then processed by an LLM (Large Language Model) to extract a flat list of potential categories for OVDs (Open-Vocabulary Detectors) to detect. This approach yields a global prompt inherently linked to instance annotations while capturing secondary visual elements humans typically overlook. Evaluations show that ROVI exceeds existing detection datasets in image quality and resolution while containing two orders of magnitude more categories with an open-vocabulary nature. For demonstrative purposes, a GLIGEN model trained on ROVI significantly outperforms state-of-the-art alternatives in instance grounding accuracy, prompt fidelity, and aesthetic quality. We will release our dataset and reproducible pipeline to facilitate future research.
Poster
Yogesh Kumar · Uday Agarwal · Manish Gupta · Anand Mishra

[ Exhibit Hall I ]

Abstract
Video-to-video moment retrieval (Vid2VidMR) is the task of localizing unseen events or moments in a target video using a query video. This task poses several challenges, such as the need for semantic frame-level alignment and modeling complex dependencies between query and target videos. To tackle this challenging problem, we introduce MATR (Moment Alignment TRansformer), a transformer-based model designed to capture semantic context as well as the temporal details necessary for precise moment localization. MATR conditions target video representations on query video features using dual-stage sequence alignment that encodes the required correlations and dependencies. These representations are then used to guide foreground/background classification and boundary prediction heads, enabling the model to accurately identify moments in the target video that semantically match with the query video. Additionally, to provide a strong task-specific initialization for MATR, we propose a self-supervised pre-training technique that involves training the model to localize random clips within videos. Extensive experiments demonstrate that MATR achieves notable performance improvements of 13.1% in R@1 and 8.1% in mIoU on an absolute scale compared to state-of-the-art methods on the popular ActivityNet-VRL dataset. Additionally, on our newly proposed dataset, SportsMoments, MATR shows a 14.7% gain in R@1 and a 14.4% gain in mIoU …
Poster
Heeji Yoon · Heeseong Shin · Eunbeen Hong · Hyunwook Choi · Hansang Cho · Daun Jeong · Seungryong Kim

[ Exhibit Hall I ]

Abstract
Semi-supervised instance segmentation poses challenges due to limited labeled data, causing difficulties in accurately localizing distinct object instances. Current teacher-student frameworks still suffer from performance constraints due to unreliable pseudo-label quality stemming from limited labeled data. While the Segment Anything Model (SAM) offers robust segmentation capabilities at various granularities, directly applying SAM introduces challenges such as class-agnostic predictions and potential over-segmentation. To address these complexities, we carefully integrate SAM into the semi-supervised instance segmentation framework, developing a novel distillation method that effectively captures the precise localization capabilities of SAM without compromising semantic recognition. Furthermore, we incorporate pseudo-label refinement as well as a specialized data augmentation with the refined pseudo-labels, resulting in superior performance. We establish state-of-the-art performance, and provide comprehensive experiments and ablation studies to validate the effectiveness of our proposed approach.
Poster
Boyu Chen · Zhengrong Yue · Siran Chen · Zikang Wang · Yang Liu · Peng Li · Yali Wang

[ Exhibit Hall I ]

Abstract
Existing Multimodal Large Language Models (MLLMs) encounter significant challenges in modeling the temporal context within long videos. Currently, mainstream Agent-based methods use external tools (e.g., search engine, memory banks, OCR, retrieval models) to assist a single MLLM in answering long video questions. Despite such tool-based support, a solitary MLLM still offers only a partial understanding of long videos, resulting in limited performance. In order to better address long video tasks, we introduce LVAgent, the first framework enabling multi-round dynamic collaboration of MLLM agents in long video understanding. Our methodology consists of four key steps: 1) Selection: We pre-select appropriate agents from the model library to form optimal agent teams based on different tasks. 2) Perception: We design an effective retrieval scheme for long videos, improving the coverage of critical temporal segments while maintaining computational efficiency. 3) Action: Agents answer long video-related questions and exchange reasons. 4) Reflection: We evaluate each agent's performance in each round of discussion and optimize the agent team for dynamic collaboration. The agents iteratively refine their answers through multi-round dynamic collaboration. LVAgent is the first agent system method that outperforms all closed-source models (including GPT-4o) and open-source models (including InternVL-2.5 and Qwen2-VL) in …
Poster
Wei Suo · Ji Ma · Mengyang Sun · Lin Wu · PENG WANG · Yanning Zhang

[ Exhibit Hall I ]

Abstract
Although Large Vision-Language Models (LVLMs) have achieved impressive results, their high computational costs pose a significant barrier to wide application. To enhance inference efficiency, most existing approaches can be categorized as parameter-dependent or token-dependent strategies to reduce computational demands. However, parameter-dependent methods require retraining LVLMs to recover performance while token-dependent strategies struggle to consistently select the most relevant tokens. In this paper, we systematically analyze the above challenges and provide a series of valuable insights for inference acceleration. Based on these findings, we propose a novel framework, the Pruning All-Rounder (PAR). Different from previous works, PAR develops a meta-router to adaptively organize pruning flows across both tokens and layers. Trained in a self-supervised manner, our method achieves a superior balance between performance and efficiency. Notably, PAR is highly flexible, offering multiple pruning versions to address a range of pruning scenarios. The code for this work will be made publicly available.
Poster
Tan Pan · Zhaorui Tan · Kaiyu Guo · Dongli Xu · Weidi Xu · Chen Jiang · Xin Guo · Yuan Qi · Yuan Cheng

[ Exhibit Hall I ]

Abstract
3D medical image self-supervised learning (mSSL) holds great promise for medical analysis. Effectively supporting broader applications requires considering anatomical structure variations in location, scale, and morphology, which are crucial for capturing meaningful distinctions. However, previous mSSL methods partition images with fixed-size patches, often ignoring the structure variations. In this work, we introduce a novel perspective on 3D medical images with the goal of learning structure-aware representations. We assume that patches within the same structure share the same semantics (semantic consistency) while those from different structures exhibit distinct semantics (semantic discrepancy). Based on this assumption, we propose an mSSL framework named $S^2DC$, achieving Structure-aware Semantic Discrepancy and Consistency in two steps. First, $S^2DC$ enforces distinct representations for different patches to increase semantic discrepancy by leveraging an optimal transport strategy. Second, $S^2DC$ advances semantic consistency at the structural level based on neighborhood similarity distribution. By bridging patch-level and structure-level representations, $S^2DC$ achieves structure-aware representations. Thoroughly evaluated across 10 datasets, 4 tasks, and 3 modalities, our proposed method consistently outperforms the state-of-the-art methods in mSSL.
Poster
Ziling Wu · Armaghan Moemeni · Praminda Caleb-Solly

[ Exhibit Hall I ]

Abstract
Unsupervised object discovery (UOD) aims to detect and segment objects in 2D images without handcrafted annotations. Recent progress in self-supervised representation learning has led to some success in UOD algorithms. However, the absence of ground truth presents existing UOD methods with two challenges: 1) determining if a discovered region is foreground or background, and 2) knowing how many objects remain undiscovered. To address these two problems, previous solutions rely on foreground priors to distinguish if the discovered region is foreground, and conduct one or a fixed number of iterations of discovery. However, the existing foreground priors are heuristic and not always robust, and a fixed number of discoveries leads to under- or over-segmentation, since the number of objects in images varies. This paper introduces UnionCut, a robust foreground prior based on ensemble methods that detects the union of foreground areas of an image, allowing UOD algorithms to identify foreground objects and stop discovery once the majority of the foreground union in the image is segmented. On top of that, we propose UnionSeg, a vision transformer distilled from UnionCut that outputs the foreground union faster and more accurately. Our experiments show that by combining with UnionCut or UnionSeg, previous state-of-the-art UOD methods witness an …
Poster
Ruchit Rawal · Reza Shirkavand · Heng Huang · Gowthami Somepalli · Tom Goldstein

[ Exhibit Hall I ]

Abstract
Video large language models have not yet been widely deployed, largely due to their tendency to hallucinate. Typical benchmarks for Video-LLMs rely simply on multiple-choice questions. Unfortunately, it has been observed that Video-LLMs hallucinate far more aggressively on freeform text generation tasks like video captioning than they do on multiple-choice verification tasks. To address this weakness, we propose ARGUS, a Video-LLM benchmark that measures freeform video captioning performance. By comparing Video-LLM outputs to human ground truth captions, ARGUS quantifies dual metrics. First, we measure the rate of hallucinations in the form of incorrect statements about video content or temporal relationships. Second, we measure the rate at which the model omits important descriptive details. Together, these dual metrics form a comprehensive view of video captioning performance.
Poster
Shuo Jin · Siyue Yu · Bingfeng Zhang · Mingjie Sun · Yi Dong · Jimin XIAO

[ Exhibit Hall I ]

Abstract
Training-free open-vocabulary semantic segmentation has advanced with vision-language models like CLIP, which exhibit strong zero-shot abilities. However, CLIP's attention mechanism often wrongly emphasises specific image tokens, namely outliers, which results in irrelevant over-activation. Existing approaches struggle with these outliers that arise in intermediate layers and propagate through the model, ultimately degrading spatial perception. In this paper, we propose a Self-adaptive Feature Purifier framework (SFP) to suppress propagated outliers and enhance semantic representations for open-vocabulary semantic segmentation. Specifically, based on an in-depth analysis of attention responses between image and class tokens, we design a self-adaptive outlier mitigator to detect and mitigate outliers at each layer for propagated feature purification. In addition, we introduce a semantic-aware attention enhancer to augment attention intensity in semantically relevant regions, which strengthens the purified feature to focus on objects. Further, we introduce a hierarchical attention integrator to aggregate multi-layer attention maps to refine spatially coherent feature representations for final segmentation. Our proposed SFP enables robust outlier suppression and object-centric feature representation, leading to a more precise segmentation. Extensive experiments show that our method achieves state-of-the-art performance and surpasses existing methods by an average of 4.6% mIoU on eight segmentation benchmarks. The code will be released.
Poster
Giyeol Kim · Sooyoung Yang · Jihyong Oh · Myungjoo Kang · Chanho Eom

[ Exhibit Hall I ]

Abstract
Person search aims to jointly perform person detection and re-identification by localizing and identifying a query person within a gallery of uncropped scene images. Existing methods predominantly utilize ImageNet pre-trained backbones, which may be less effective at capturing the contextual and fine-grained features crucial for person search. Moreover, they rely on a shared backbone feature for both person detection and re-identification, leading to suboptimal features due to conflicting optimization objectives. Recently, diffusion models have emerged as powerful vision backbones, capturing rich visual priors from large-scale datasets. In this paper, we propose DiffPS (Diffusion Prior Knowledge for Person Search), a novel framework that leverages a frozen pre-trained diffusion model while eliminating the optimization conflict between two sub-tasks. We analyze key properties of diffusion priors and propose three specialized modules: (i) Diffusion-Guided Region Proposal Network (DGRPN) for enhanced person localization, (ii) Multi-Scale Frequency Refinement Network (MSFRN) to mitigate shape bias, and (iii) Semantic-Adaptive Feature Aggregation Network (SFAN) to leverage text-aligned diffusion features. DiffPS sets a new state-of-the-art on CUHK-SYSU and PRW. Our code will be available online at the time of publication.
Poster
Yuhan Liu · Jingwen Fu · Yang Wu · Kangyi Wu · Pengna Li · Jiayi Wu · Sanping Zhou · Jingmin Xin

[ Exhibit Hall I ]

Abstract
Leveraging vision foundation models has emerged as a mainstream paradigm that improves the performance of image feature matching. However, previous works have ignored the misalignment that arises when introducing foundation models into feature matching. The misalignment stems from the discrepancy between foundation models, which focus on single-image understanding, and the cross-image understanding required by feature matching. Specifically, 1) the embeddings derived from commonly used foundation models deviate from the optimal embeddings required for feature matching; and 2) there is no effective mechanism for extending single-image understanding to cross-image understanding. A significant consequence of this misalignment is that such methods struggle with multi-instance feature matching problems. To address this, we introduce a simple but effective framework, called IMD (Image feature Matching with a pre-trained Diffusion model), with two parts: 1) Unlike the dominant solutions employing contrastive-learning based foundation models that emphasize global semantics, we integrate generative-based diffusion models to effectively capture instance-level details. 2) We leverage the prompt mechanism in the generative model as a natural tunnel and propose a novel cross-image interaction prompting module to facilitate bidirectional information interaction between image pairs. To more accurately measure the misalignment, we propose a new benchmark called IMIM, which focuses on multi-instance scenarios. Our …
Poster
Sanghyun Jo · Seo Lee · Seungwoo Lee · Seohyung Hong · Hyungseok Seo · Kyungsu Kim

[ Exhibit Hall I ]

Abstract
Cell instance segmentation (CIS) is crucial for identifying individual cell morphologies in histopathological images, providing valuable insights for biological and medical research. While unsupervised CIS (UCIS) models aim to reduce the heavy reliance on labor-intensive image annotations, they fail to accurately capture cell boundaries, causing missed detections and poor performance. Recognizing the absence of error-free instances as a key limitation, we present COIN (COnfidence score-guided INstance distillation), a novel annotation-free framework with three key steps: (1) Increasing the sensitivity for the presence of error-free instances via unsupervised semantic segmentation with optimal transport, leveraging its ability to discriminate spatially minor instances, (2) Instance-level confidence scoring to measure the consistency between model prediction and refined mask and identify highly confident instances, offering an alternative to ground truth annotations, and (3) Progressive expansion of confidence with recursive self-distillation. Extensive experiments across six datasets show COIN outperforming existing UCIS methods, even surpassing semi- and weakly-supervised approaches across all metrics on the MoNuSeg and TNBC datasets. The code will be made available upon publication.
Poster
Walid Bousselham · Angie Boggust · Sofian Chaybouti · Hendrik Strobelt · Hilde Kuehne

[ Exhibit Hall I ]

Abstract
Vision Transformers (ViTs) have become a standard architecture in computer vision. However, because of their modeling of long-range dependencies through self-attention mechanisms, the explainability of these models remains a challenge. To address this, we propose LeGrad, an explainability method specifically designed for ViTs. LeGrad computes the gradient with respect to the attention maps of single ViT layers, considering the gradient itself as the explainability signal. We aggregate the signal over all layers, combining the activations of the last as well as intermediate tokens to produce the merged explainability map. This makes LeGrad a conceptually simple and easy-to-implement method to enhance the transparency of ViTs. We evaluate LeGrad in various setups, including segmentation, perturbation, and open-vocabulary settings, showcasing its improved spatial fidelity and its versatility compared to other SotA explainability methods. Code will be released.
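A toy, single-layer illustration of the layer-wise signal described above (not the full LeGrad pipeline): expose one attention map, backpropagate a target logit onto it, and keep the positive gradients of the CLS row as that layer's relevance.

```python
# Toy sketch with a hand-rolled single attention layer; all weights are random stand-ins.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
D, N = 32, 10                                  # embedding dim, tokens (token 0 = CLS)
x = torch.randn(1, N, D, requires_grad=True)
Wq, Wk, Wv = (torch.randn(D, D) / D ** 0.5 for _ in range(3))
Wcls = torch.randn(D, 5)                       # toy classification head

q, k, v = x @ Wq, x @ Wk, x @ Wv
attn = F.softmax(q @ k.transpose(1, 2) / D ** 0.5, dim=-1)   # (1, N, N) attention map
attn.retain_grad()                             # keep this layer's attention gradient
logits = (attn @ v)[:, 0] @ Wcls               # classify from the CLS token
logits[0, 3].backward()                        # pick an arbitrary target class

layer_signal = attn.grad.clamp_min(0)[0, 0, 1:]   # CLS row, positive gradients only
print(layer_signal)                            # per-patch relevance for this layer
```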
Poster
Zhengyin Liang · Hui Yin · Min Liang · Qianqian Du · Ying Yang · Hua Huang

[ Exhibit Hall I ]

Abstract
Modality or domain distribution shifts pose formidable challenges in 3D semantic segmentation. Existing methods predominantly address either cross-modal or cross-domain adaptation in isolation, leading to insufficient exploration of semantic associations and complementary features in heterogeneous data. To bridge this gap, we present UniDxMD, a unified representation method for cross-modal unsupervised domain adaptation (UDA) in 3D semantic segmentation that simultaneously tackles both cross-modal and cross-domain adaptation objectives. Our core insight is deriving a unified discrete representation from heterogeneous data to mitigate distribution shifts, inspired by vector quantization. Specifically, we propose a differentiable, cluster-based soft quantization mechanism (CSQM) that maps heterogeneous data (spanning modalities and domains) into a shared discrete latent space. Then, we introduce latent space regularization (LSR), leveraging joint prototypes that satisfy semantic relational consistency as learnable anchors to enhance the compactness and semantic discriminability of the discrete latent space. Our method paves the way for advancing cross-modal UDA in 3D semantic segmentation towards the unified representation. Extensive results across four challenging cross-modal UDA scenarios demonstrate the superiority of our method, achieving state-of-the-art performance on multiple benchmarks. Code will be available publicly.
Poster
Rui Sun · Huayu Mai · Wangkai Li · Yujia Chen · Yuan Wang

[ Exhibit Hall I ]

Abstract
Semi-supervised semantic segmentation has attracted considerable attention as it alleviates the need for extensive pixel-level annotations. However, existing methods often overlook the potential optimization conflict between supervised and unsupervised learning objectives, leading to suboptimal performance. In this paper, we identify this under-explored issue and propose a novel Pareto Optimization Strategy (POS) to tackle it. POS aims to find a descent gradient direction that benefits both learning objectives, thereby facilitating model training. By dynamically assigning weights to the gradients at each iteration based on the model's learning status, POS effectively reconciles the intrinsic tension between the two objectives. Furthermore, we analyze POS from the perspective of gradient descent in random batch sampling and propose the Magnitude Enhancement Operation (MEO) to further unleash its potential by considering both direction and magnitude during gradient integration. Extensive experiments on challenging benchmarks demonstrate that integrating POS into existing semi-supervised segmentation methods yields consistent improvements across different data splits and architectures (CNN, Transformer), showcasing its effectiveness.
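A hedged sketch of a common-descent-direction construction for two objectives, using the classic two-gradient min-norm solution as a stand-in for the dynamic weighting described above; the exact POS weighting and the MEO magnitude handling are not reproduced.

```python
# Illustrative sketch; the min-norm rule is a generic Pareto-style choice, not the paper's.
import torch

def pareto_descent_direction(g_sup, g_unsup):
    """Min-norm convex combination alpha*g_sup + (1-alpha)*g_unsup."""
    diff = g_sup - g_unsup
    alpha = torch.dot(g_unsup - g_sup, g_unsup) / diff.dot(diff).clamp_min(1e-12)
    alpha = alpha.clamp(0.0, 1.0)
    return alpha * g_sup + (1 - alpha) * g_unsup, alpha

g_labeled = torch.randn(10_000)    # flattened gradient of the supervised loss
g_pseudo = torch.randn(10_000)     # flattened gradient of the unsupervised loss
g, alpha = pareto_descent_direction(g_labeled, g_pseudo)
print(float(alpha), float(g @ g_labeled), float(g @ g_pseudo))  # alignment with each objective
```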
Poster
Xiaoling Hu · Xiangrui Zeng · Oula Puonti · Juan Iglesias · Bruce Fischl · Yaël Balbastre

[ Exhibit Hall I ]

Abstract
Domain randomization through synthesis is a powerful strategy to train networks that are unbiased with respect to the domain of the input images. Randomization allows networks to see a virtually infinite range of intensities and artifacts during training, thereby minimizing overfitting to appearance and maximizing generalization to unseen data. Although powerful, this approach relies on the accurate tuning of a large set of hyperparameters that govern the probabilistic distribution of the synthesized images. Instead of manually tuning these parameters, we introduce Learn2Synth, a novel procedure in which synthesis parameters are learned using a small set of real labeled data. Unlike methods that impose constraints to align synthetic data with real data (e.g., contrastive or adversarial techniques), which risk misaligning the image and its label map, we tune an augmentation engine such that a segmentation network trained on synthetic data has optimal accuracy when applied to real data. This approach allows the training procedure to benefit from real labeled examples, without ever using these real examples to train the segmentation network, which avoids biasing the network towards the properties of the training set. Specifically, we develop parametric and nonparametric strategies to enhance synthetic images in a way that improves the performance …
Poster
Chunxiao Li · Xiaoxiao Wang · Meiling Li · Boming Miao · Peng Sun · Yunjian Zhang · Xiangyang Ji · Yao Zhu

[ Exhibit Hall I ]

Abstract
With the rapid advancement of generative models, highly realistic image synthesis has posed new challenges to digital security and media credibility. Although AI-generated image detection methods have partially addressed these concerns, a substantial research gap remains in evaluating their performance under complex real-world conditions. This paper introduces the Real-World Robustness Dataset (RRDataset) for comprehensive evaluation of detection models across three dimensions: 1) Scenario Generalization – RRDataset encompasses high-quality images from seven major scenarios (War & Conflict, Disasters & Accidents, Political & Social Events, Medical & Public Health, Culture & Religion, Labor & Production, and everyday life), addressing existing dataset gaps from a content perspective. 2) Internet Transmission Robustness – examining detector performance on images that have undergone multiple rounds of sharing across various social media platforms. 3) Re-digitization Robustness – assessing model effectiveness on images altered through four distinct re-digitization methods. We benchmarked 17 detectors and 10 vision-language models (VLMs) on RRDataset and conducted a large-scale human study involving 192 participants to investigate human few-shot learning capabilities in detecting AI-generated images. The benchmarking results reveal the limitations of current AI detection methods under real-world conditions and underscore the importance of drawing on human adaptability to develop more robust detection algorithms. Our dataset …
Poster
Soonwoo Cha · Jiwoo Song · Juan Yeo · Hyunbin Jin · Taesup Kim

[ Exhibit Hall I ]

Abstract
Vision-language models such as CLIP have recently propelled open-vocabulary dense prediction tasks by enabling recognition of a broad range of visual concepts. However, CLIP still struggles with fine-grained, region-level understanding, hindering its effectiveness on these dense prediction tasks. We identify two pivotal factors required to address this limitation: semantic coherence and fine-grained vision-language alignment. Current adaptation methods often improve fine-grained alignment at the expense of semantic coherence, and often rely on extra modules or supervised fine-tuning. To overcome these issues, we propose Any-to-Any Self-Distillation (ATAS), a novel approach that simultaneously enhances semantic coherence and fine-grained alignment by leveraging a model’s own knowledge across all representation levels. Unlike prior methods, ATAS uses only unlabeled images and an internal self-distillation process to refine CLIP’s representations, preserving local semantic consistency while sharpening local detail recognition. On open-vocabulary object detection and semantic segmentation benchmarks, ATAS achieves substantial performance gains, outperforming baseline CLIP models. These results validate the effectiveness of our approach and underscore the importance of jointly maintaining semantic coherence and fine-grained alignment for advanced open-vocabulary dense prediction.
Poster
Yicheng Feng · Yijiang Li · Wanpeng Zhang · Sipeng Zheng · Hao Luo · Zihao Yue · Zongqing Lu

[ Exhibit Hall I ]

Abstract
We present VideoOrion, a Video Large Language Model (Video-LLM) that explicitly captures the key semantic information in videos—the spatial-temporal dynamics of objects throughout the videos. VideoOrion employs expert vision models to extract object dynamics through a detect-segment-track pipeline, encoding them into a set of object tokens by aggregating spatial-temporal object features. Our method addresses the persistent challenge in Video-LLMs of efficiently compressing high-dimensional video data into semantic tokens that are comprehensible to LLMs. Compared to prior methods which resort to downsampling the original video or aggregating visual tokens using resamplers, leading to information loss and entangled semantics, VideoOrion not only offers a more natural and efficient way to derive compact, disentangled semantic representations but also enables explicit object modeling of video content with minimal computational cost. Moreover, the introduced object tokens naturally allow VideoOrion to accomplish video-based referring tasks. Experimental results show that VideoOrion can learn to make good use of the object tokens, and achieves competitive results on both general video question answering and video-based referring benchmarks.
Poster
Juntao Chen · Wen Shen · Zhihua Wei · Lijun Sun · Hongyun Zhang

[ Exhibit Hall I ]

Abstract
Zero-shot Referring Expression Comprehension (REC) aims at locating an object described by a natural language query without training on task-specific datasets. Current approaches often utilize Vision-Language Models (VLMs) to perform region-text matching based on region proposals. However, this may downgrade their performance since VLMs often fail at relation understanding and isolated proposals inevitably lack global image context. To tackle these challenges, we first design a general formulation for code-based relation reasoning. It instructs Large Language Models (LLMs) to decompose complex relations and adaptively implement code for spatial and relation computation. Moreover, we directly extract region-text relevance from cross-modal attention maps in VLMs. Observing the inherent bias in VLMs, we further develop a simple yet effective bias deduction method, which enhances attention maps' capability to align text with the corresponding regions. Experimental results on four representative datasets demonstrate the SOTA performance of our method. On the RefCOCO dataset centered on spatial understanding, our method achieves an average improvement of 10\% over the previous zero-shot SOTA. Code will be released upon acceptance.
Poster
Hongchi Ma · Guanglei Yang · Debin Zhao · Yanli JI · Wangmeng Zuo

[ Exhibit Hall I ]

Abstract
Industrial visual inspection is crucial for detecting defects in manufactured products, but it traditionally relies on human operators, leading to inefficiencies. Industrial Visual Anomaly Detection (IVAD) has emerged as a promising solution, with methods such as zero-shot, few-shot, and reconstruction-based techniques. However, zero-shot methods struggle with subtle anomalies, and reconstruction-based methods fail to capture fine-grained details. Few-shot methods, which use limited samples and prompts, offer a more efficient approach. Despite their promise, challenges remain in managing intra-class variation among references and in effectively extracting more representative anomaly features. This paper presents **R**etrieval-**e**nhanced **M**ulti-modal **P**rompt Fusion **A**nomaly **D**etection (ReMP-AD), a framework that introduces Intra-Class Token Retrieval (ICTR) to reduce noise in the memory bank and Vision-Language Prior Fusion (VLPF) to guide the encoder in capturing more distinctive and relevant features of anomalies. Experiments on the VisA and MVTec-AD datasets demonstrate that ReMP-AD outperforms existing methods, achieving 97.8\%/94.1\% performance in 4-shot anomaly segmentation and classification. Our approach also shows strong results on the PCB-Bank dataset, highlighting its effectiveness in few-shot industrial anomaly detection.
Poster
Omkar Thawakar · Dmitry Demidov · Ritesh Thawkar · Rao Anwer · Mubarak Shah · Fahad Khan · Salman Khan

[ Exhibit Hall I ]

Abstract
Composed video retrieval is a challenging task that strives to retrieve a target video based on a query video and a textual description detailing specific modifications. Standard retrieval frameworks typically struggle to handle the complexity of fine-grained compositional queries and variations in temporal understanding limiting their retrieval ability in the fine-grained setting. To address this issue, we introduce a novel dataset that captures both fine-grained and composed actions across diverse video segments, enabling more detailed compositional changes in retrieved video content.The proposed dataset, named Dense-WebVid-CoVR, consists of 1.6 million samples with dense modification text that is around seven times more than its existing counterpart. We further develop a new model that integrates visual and textual information through Cross-Attention (CA) fusion using grounded text encoder, enabling precise alignment between dense query modifications and target videos. The proposed model achieves state-of-the-art results surpassing existing methods on all metrics. Notably, it achieves 71.3\% Recall@1 in visual+text setting and outperforms the state-of-the-art by 3.4\%, highlighting its efficacy in terms of leveraging detailed video descriptions and dense modification texts. Our proposed dataset, code, and model will be publicly released.
Poster
Yang Xiao · Wang Lu · Jie Ji · Ruimeng Ye · Li · Xiaolong Ma · Bo Hui

[ Exhibit Hall I ]

Abstract
The design of artificial neural networks (ANNs) is inspired by the structure of the human brain, and in turn, ANNs offer a potential means to interpret and understand brain signals. Existing methods primarily align brain signals with real-world signals using Mean Squared Error (MSE), which solely focuses on local point-wise alignment, and ignores global matching, leading to coarse interpretations and inaccuracies in brain signal decoding. In this paper, we address these issues through optimal transport (OT) and theoretically demonstrate why OT provides a more effective alignment strategy than MSE. Specifically, we construct a transport plan between brain voxel embeddings and image embeddings, enabling more precise matching. By controlling the amount of transport, we mitigate the influence of redundant information. We apply our alignment model directly to the Brain Captioning task by feeding brain signals into a large language model (LLM) instead of images. Our approach achieves state-of-the-art performance across ten evaluation metrics, surpassing the previous best method by an average of 6.11\% in single-subject training and 3.81\% in cross-subject training. Additionally, we have uncovered several insightful conclusions that align with existing brain research. We unveil the redundancy and synergy of brain information processing through region masking and data dimensionality reduction visualization experiments. …
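A minimal sketch of OT-based alignment, assuming entropic OT with uniform marginals solved by Sinkhorn iterations; the transport cost then serves as the alignment loss in place of point-wise MSE. This is an illustration of the idea, not the paper's model.

```python
# Illustrative sketch; embedding shapes and the entropic solver are assumptions.
import torch

def sinkhorn_plan(cost, eps=0.05, iters=100):
    """Entropic OT plan for uniform marginals; cost: (N, M)."""
    N, M = cost.shape
    log_K = -cost / eps
    u, v = torch.zeros(N), torch.zeros(M)
    log_a = -torch.log(torch.tensor(float(N)))
    log_b = -torch.log(torch.tensor(float(M)))
    for _ in range(iters):
        u = log_a - torch.logsumexp(log_K + v[None, :], dim=1)
        v = log_b - torch.logsumexp(log_K + u[:, None], dim=0)
    return (log_K + u[:, None] + v[None, :]).exp()

brain = torch.randn(64, 128)     # voxel-group embeddings (stand-in)
image = torch.randn(64, 128)     # image embeddings (stand-in)
cost = torch.cdist(brain, image) ** 2
plan = sinkhorn_plan(cost)
ot_loss = (plan * cost).sum()    # transport cost used as the alignment objective
print(float(ot_loss))
```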
Poster
Joonmyung Choi · Sanghyeok Lee · Byungoh Ko · Eunseo Kim · Jihyung Kil · Hyunwoo Kim

[ Exhibit Hall I ]

Abstract
Transformers have demonstrated remarkable success across various vision tasks, yet the quadratic complexity of self-attention remains a challenge for efficient inference. To address this, previous works such as FlashAttention optimize GPU memory access, and token compression techniques have been explored to reduce computational cost by reducing the number of tokens. However, conventional token importance measures rely on additional learnable modules or attention maps, making them impractical in training-free settings and incompatible with FlashAttention due to the inaccessibility of intermediate attention maps to minimize memory access. Here, we propose a novel training-free, model-agnostic token importance criterion, representation shift, which quantifies the information injected by each operation. Combined with the proposed representation shift, we can apply token compression on FlashAttention to further boost inference speed without requiring additional training or attention maps. This method also extends naturally beyond Transformers, e.g., to convolutional neural networks (CNNs). Extensive experiments demonstrate that our representation shift, allowing token compression with FlashAttention and CNNs, results in up to 5.5$\times$ speed-up in video understanding. Through quantitative and qualitative experiments, we have shown that representation shift is a more robust alternative to conventional attention-based scores.
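A minimal sketch of one plausible reading of representation shift, assuming it is measured as the per-token change a block induces on its input; tokens with the smallest shift are dropped, and no attention maps are required, so the scheme composes with fused attention kernels.

```python
# Sketch under stated assumptions; the block and keep ratio are stand-ins.
import torch
import torch.nn as nn

def representation_shift(block, tokens):
    """Per-token shift ||block(x) - x|| for a residual-style block."""
    with torch.no_grad():
        out = block(tokens)
    return (out - tokens).norm(dim=-1)           # (B, N)

def compress_tokens(block, tokens, keep_ratio=0.5):
    scores = representation_shift(block, tokens)
    k = max(1, int(tokens.shape[1] * keep_ratio))
    idx = scores.topk(k, dim=1).indices.sort(dim=1).values   # keep original order
    return torch.gather(tokens, 1, idx[..., None].expand(-1, -1, tokens.shape[-1]))

block = nn.Sequential(nn.LayerNorm(64), nn.Linear(64, 64), nn.GELU(), nn.Linear(64, 64))
x = torch.randn(2, 196, 64)                       # (batch, tokens, dim)
print(compress_tokens(block, x, keep_ratio=0.25).shape)   # -> (2, 49, 64)
```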
Poster
SungMin Jang · Wonjun Kim

[ Exhibit Hall I ]

Abstract
Open-vocabulary 3D semantic segmentation has been actively studied by incorporating language features into 3D scene representations. Even though many methods have shown notable improvement on this task, they still have difficulty keeping language embeddings consistent across different views. This inconsistency frequently results in mis-labeling, where different language embeddings are assigned to the same part of an object. To address this issue, we propose a simple yet powerful method that aligns language embeddings via the identity information. The key idea is to locate language embeddings for the same identity closely in the latent space while putting them apart otherwise. This approach allows the same object to have identical language embeddings in novel views with accurate semantic masks, which are well aligned with the input text. Furthermore, we propose a progressive mask expanding scheme that enables more accurate extraction of semantic mask boundaries. This scheme is very effective in preserving the boundary shape of the target region by allowing the model to consider the local relationship between segments. Experimental results on benchmark datasets demonstrate that our method delivers state-of-the-art performance in open-vocabulary 3D semantic segmentation.
Poster
Yefei He · Feng Chen · Jing Liu · Wenqi Shao · Hong Zhou · Kaipeng Zhang · Bohan Zhuang

[ Exhibit Hall I ]

Abstract
The efficiency of large vision-language models (LVLMs) is constrained by the computational bottleneck of the attention mechanism during the prefill phase and the memory bottleneck of fetching the key-value (KV) cache in the decoding phase, particularly in scenarios involving high-resolution images or videos. Visual content often exhibits substantial redundancy, resulting in highly sparse attention maps within LVLMs. This sparsity can be leveraged to accelerate attention computation or compress the KV cache through various approaches. However, most studies focus on addressing only one of these bottlenecks and do not adequately support dynamic adjustment of sparsity concerning distinct layers or tasks. In this paper, we present ZipVL, an efficient inference framework designed for LVLMs through a dynamic ratio allocation strategy of important tokens. This ratio is adaptively determined based on the layer-specific distribution of attention scores, rather than fixed hyper-parameters, thereby improving efficiency for less complex tasks while maintaining high performance for more challenging ones. Then we select important tokens based on their normalized attention scores and perform sparse attention mechanism solely on those important tokens, reducing the latency in the prefill phase. Tokens deemed less important will be discarded to reduce KV cache size, alleviating the memory bottleneck in the decoding …
Poster
Xiaoqi Wang · Clint Sebastian · Wenbin He · Liu Ren

[ Exhibit Hall I ]

Abstract
The recent advancements in large foundation models have driven the success of open-set image segmentation, a task focused on segmenting objects beyond predefined categories. Among various prompt types (such as points, boxes, texts, and visual references), visual reference segmentation stands out for its unique flexibility and strong zero-shot capabilities. Recently, several SAM-based methods have made notable progress in this task by automatically generating prompts to guide SAM. However, these methods often generate prompts at object boundaries due to a suboptimal prompt encoder, which results in instability and reduced robustness. In this work, we introduce ProSAM, a simple but effective method to address the stability challenges we identified in existing SAM-based visual reference segmentation approaches. By learning a variational prompt encoder to predict multivariate prompt distributions, ProSAM avoids generating prompts that lie in unstable regions, overcoming the instability caused by less robust prompts. Our approach consistently surpasses state-of-the-art methods on the Pascal-5$^i$ and COCO-20$^i$ datasets, providing a more robust solution for visual reference segmentation.
Poster
Victor Quétu · Zhu LIAO · Nour Hezbri · Fabio Pizzati · Enzo Tartaglione

[ Exhibit Hall I ]

Abstract
Although deep neural networks are well-known for their outstanding performance in tackling complex tasks, their hunger for computational resources remains a significant hurdle, posing energy-consumption issues and restricting their deployment on resource-constrained devices, preventing their widespread adoption. In this paper, we present an optimal transport-based method to reduce the depth of over-parametrized deep neural networks, alleviating their computational burden. More specifically, we propose a new regularization strategy based on the Max-Sliced Wasserstein distance to minimize the distance between the intermediate feature distributions in the neural network. We show that minimizing this distance enables the complete removal of intermediate layers in the network, achieving a better performance/depth trade-off compared to existing techniques. We assess the effectiveness of our method on traditional image classification setups and extend it to generative image models. Both source code and models will be released upon acceptance of the article.
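A hedged sketch of a Max-Sliced Wasserstein term between the feature distributions entering and leaving a block, approximating the maximization over projection directions by sampling many random directions and taking the maximum; the paper's exact regularizer and training setup are not reproduced.

```python
# Illustrative sketch; the random-direction approximation is an assumption.
import torch

def max_sliced_wasserstein(x, y, n_proj=128):
    """x, y: (N, D) feature batches; max 1-D W2 over sampled projections."""
    d = x.shape[1]
    dirs = torch.randn(n_proj, d)
    dirs = dirs / dirs.norm(dim=1, keepdim=True)
    px = torch.sort(x @ dirs.t(), dim=0).values       # (N, n_proj) projected, sorted
    py = torch.sort(y @ dirs.t(), dim=0).values
    w2 = ((px - py) ** 2).mean(dim=0).sqrt()          # per-direction 1-D Wasserstein-2
    return w2.max()

feat_in = torch.randn(256, 512)                       # features entering a layer
feat_out = feat_in + 0.1 * torch.randn(256, 512)      # features leaving it
reg = max_sliced_wasserstein(feat_in, feat_out)       # small value -> layer is removable
print(float(reg))
```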
Poster
Heeseok Jung · Jun-Hyeon Bak · Yujin Jeong · Gyugeun Lee · Jinwoo Ahn · Eun-Sol Kim

[ Exhibit Hall I ]

Abstract
In this paper, we propose a novel zero-shot compositional video understanding method inspired by how young children efficiently learn new concepts and flexibly expand their existing knowledge framework. While recent large-scale visual language models (VLMs) have achieved remarkable advancements and demonstrated impressive performance improvements across various tasks, they require massive amounts of data and computational resources. However, despite their high benchmark performance, they often fail to solve simple zero-shot composition tasks. Moreover, VLMs designed for video data demand even greater computational resources. We introduce a new video representation learning method inspired by human compositional learning to address these challenges. Specifically, we demonstrate that achieving zero-shot compositional learning requires effective representation learning that disentangles given data into meaningful semantic units. We propose a novel method that learns such disentangled representations based on an information-theoretic measure. By optimizing coding rate reduction, we successfully learn spatio-temporally disentangled features from videos, one of the most challenging data. Our approach significantly enhances compositional generalizability, demonstrating its effectiveness in zero-shot learning scenarios.
Poster
Zhen Qu · Xian Tao · Xinyi Gong · ShiChen Qu · Xiaopei Zhang · Xingang Wang · Fei Shen · Zhengtao Zhang · Mukesh Prasad · Guiguang Ding

[ Exhibit Hall I ]

Abstract
Recent vision-language models (e.g., CLIP) have demonstrated remarkable class-generalizable ability to unseen classes in few-shot anomaly segmentation (FSAS), leveraging supervised prompt learning or fine-tuning on seen classes. However, their ability to generalize across categories mainly relies on prior knowledge of real seen anomaly samples. In this paper, we propose a novel framework, namely DictAS, which enables a unified model to detect visual anomalies in unseen object categories without any retraining on the target data, only employing a few normal reference images as visual prompts. The insight behind DictAS is to transfer dictionary lookup capabilities to the FSAS task for unseen classes via self-supervised learning, instead of merely memorizing normal and abnormal feature patterns from the training set. Specifically, DictAS mainly consists of three components: (1) **Dictionary Construction** - to simulate the index and content of a real dictionary by building it with normal reference image features. (2) **Dictionary Lookup** - to retrieve queried region features from the dictionary using a sparse lookup strategy. When the queried feature cannot be successfully retrieved from the dictionary, it is classified as an anomaly. (3) **Query Discrimination Regularization** - to enhance anomaly discrimination by making abnormal features harder to retrieve from the dictionary. To …
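A minimal sketch of the dictionary-lookup idea under simple assumptions (cosine retrieval, softmax-weighted reconstruction of the query from its top-k entries); DictAS's sparse lookup strategy and query discrimination regularization are not reproduced.

```python
# Illustrative sketch; retrieval and scoring choices are assumptions.
import torch
import torch.nn.functional as F

def build_dictionary(normal_patch_features):            # (M, D) from reference images
    return F.normalize(normal_patch_features, dim=-1)

def anomaly_score(dictionary, queries, k=5):
    """Residual after reconstructing each query from its k nearest entries."""
    q = F.normalize(queries, dim=-1)                     # (N, D)
    sim = q @ dictionary.t()                             # (N, M)
    topk = sim.topk(k, dim=1)
    weights = topk.values.softmax(dim=1)                 # (N, k)
    recon = (weights[..., None] * dictionary[topk.indices]).sum(dim=1)
    return (q - recon).norm(dim=-1)                      # high residual -> anomaly

dict_feats = build_dictionary(torch.randn(2048, 256))    # few normal reference images
query_feats = torch.randn(196, 256)                      # patch features of a test image
print(anomaly_score(dict_feats, query_feats).shape)      # per-patch anomaly scores (196,)
```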
Poster
Chunhao Lu · Qiang Lu · Meichen Dong · Jake Luo

[ Exhibit Hall I ]

Abstract
Current end-to-end multi-modal models utilize different encoders and decoders to process input and output information. This separation hinders the joint representation learning of various modalities. To unify multi-modal processing, we propose a novel architecture called MDM (Multi-modal Diffusion Mamba). MDM utilizes a Mamba-based multi-step selection diffusion model to progressively generate and refine modality-specific information through a unified variational autoencoder for both encoding and decoding. This innovative approach allows MDM to achieve superior performance when processing high-dimensional data, particularly in generating high-resolution images and extended text sequences simultaneously. Our evaluations in areas such as image generation, image captioning, visual question answering, text comprehension, and reasoning tasks demonstrate that MDM significantly outperforms existing end-to-end models (e.g., MonoFormer, LlamaGen, and Chameleon) and competes effectively with SOTA models like GPT-4V, Gemini Pro, and Mistral. Our results validate MDM's effectiveness in unifying multi-modal processes while maintaining computational efficiency, establishing a new direction for end-to-end multi-modal architectures.
Poster
Peijun Bao · Chenqi Kong · SIYUAN YANG · Zihao Shao · Xinghao Jiang · Boon Ng · Meng Er · Alex Kot

[ Exhibit Hall I ]

Abstract
Temporal video grounding aims to localize the described temporal moment in an untrimmed video based on a natural language query. A major challenge of this task is its heavy reliance on labor-intensive annotations for training. Unlike existing works that directly train models on manually curated data, we propose a novel paradigm to reduce annotation costs: pretraining the model on unlabeled, real-world videos. Although the resulting annotations are not perfectly accurate, they are easily scalable without requiring extensive manual effort. To support this, we introduce Temporal Video Grounding Pretraining (Vid-Group), a large-scale dataset collected with minimal human intervention, consisting of over 50K videos captured in the wild and 200K pseudo annotations. Direct pretraining on these imperfect pseudo annotations, however, presents significant challenges, including mismatched sentence-video pairs and imprecise temporal boundaries. To address these issues, we propose the ReCorrect algorithm, which comprises two main phases: semantics-guided refinement and memory-consensus correction. The semantics-guided refinement enhances the pseudo labels by leveraging semantic similarity with video frames to clean out unpaired data and make initial adjustments to temporal boundaries. In the following memory-consensus correction phase, a memory bank tracks the model predictions, progressively correcting the temporal boundaries based on consensus within the memory. Comprehensive experiments demonstrate …
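A hedged sketch of the semantics-guided refinement step as described: pairs whose query embedding never matches any frame are dropped as unpaired, and the temporal boundaries are tightened to the frames exceeding a similarity threshold. The embeddings, the threshold, and the function name are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def refine_pseudo_label(sent_emb, frame_embs, sim_thresh=0.3):
    """sent_emb: (d,) query embedding; frame_embs: (n_frames, d) per-frame embeddings."""
    sims = frame_embs @ sent_emb / (
        np.linalg.norm(frame_embs, axis=1) * np.linalg.norm(sent_emb) + 1e-8)
    if sims.max() < sim_thresh:
        return None                               # treat as an unpaired sentence-video pair
    above = np.where(sims >= sim_thresh)[0]
    return int(above.min()), int(above.max())     # adjusted temporal boundary (frame indices)
```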
Poster
Chengyu Tao · Xuanming Cao · Juan Du

[ Exhibit Hall I ]

Abstract
Industrial quality inspection plays a critical role in modern manufacturing by identifying defective products during production. While single-modality approaches using either 3D point clouds or 2D RGB images suffer from information incompleteness, multimodal anomaly detection offers promise through the complementary fusion of crossmodal data. However, existing methods face challenges in effectively integrating unimodal results and improving discriminative power. To address these limitations, we first reinterpret memory bank-based anomaly scores in single modalities as isotropic Euclidean distances in local feature spaces. Building on these Euclidean metrics, we propose a novel $\underline{G}$eometry-$\underline{G}$uided $\underline{S}$core $\underline{F}$usion (G$^{2}$SF) framework that progressively learns an anisotropic local distance metric as a unified score for the fusion task. Through a geometric encoding operator, a novel Local Scale Prediction Network (LSPN) is proposed to predict direction-aware scaling factors that characterize first-order local feature distributions, thereby enhancing discrimination between normal and anomalous patterns. Additionally, we develop specialized loss functions and a score aggregation strategy from geometric priors to ensure both metric generalization and efficacy. Comprehensive evaluations on the MVTec-3D AD dataset demonstrate the state-of-the-art detection performance of our method, with a low false-positive rate and better recall, which is essential in industrial applications, and detailed ablation analysis validates each component's contribution. (\textit{Code …
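The contrast between the isotropic Euclidean score and a learned anisotropic metric can be shown with a toy computation: each feature direction is rescaled before measuring the distance. In the paper the per-direction scales would come from the proposed LSPN; here they are just a positive vector, and the whole snippet is a simplified stand-in rather than the G$^{2}$SF fusion rule.

```python
import numpy as np

def isotropic_score(query, memory_feat):
    # Standard memory-bank score: plain Euclidean distance in the local feature space.
    return float(np.linalg.norm(query - memory_feat))

def anisotropic_score(query, memory_feat, scales):
    # Direction-aware score: each coordinate is rescaled (scales > 0) before measuring the
    # distance, so some directions count more than others when separating normal/anomalous.
    diff = (query - memory_feat) / scales
    return float(np.linalg.norm(diff))
```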
Poster
Tianyu Zou · Shengwu Xiong · Ruilin Yao · Yi Rong

[ Exhibit Hall I ]

Abstract
This paper studies the few-shot segmentation (FSS) task, which aims to segment objects belonging to unseen categories in a query image by learning a model on a small number of well-annotated support samples. Our analysis of two mainstream FSS paradigms reveals that the predictions made by prototype learning methods are usually conservative, while those of affinity learning methods tend to be more aggressive. This observation motivates us to balance the conservative and aggressive information captured by these two types of FSS frameworks so as to improve the segmentation performance. To achieve this, we propose a **P**rototype-**A**ffinity **H**ybrid **Net**work (PAHNet), which introduces a Prototype-guided Feature Enhancement (PFE) module and an Attention Score Calibration (ASC) module in each attention block of an affinity learning model (called the affinity learner). These two modules utilize the predictions generated by a pre-trained prototype learning model (called the prototype predictor) to enhance the foreground information in support and query image representations and suppress the mismatched foreground-background (FG-BG) relationships between them, respectively. In this way, the aggressiveness of the affinity learner can be effectively mitigated, thereby increasing the segmentation accuracy of our PAHNet method. Experimental results show that PAHNet achieves new state-of-the-art performance across 1-shot and 5-shot settings …
Poster
Jieun Kim · Jinmyeong Kim · Yoonji Kim · Sung-Bae Cho

[ Exhibit Hall I ]

Abstract
Large vision-language models (LVLMs) often exhibit object hallucination, a phenomenon where models generate descriptions of non-existent objects within images. Prior methods have sought to mitigate this issue by adjusting model logits to reduce linguistic bias, but they often lack precise control over visual uncertainty, sometimes exacerbating hallucinations instead of mitigating them. To address this limitation, we propose a novel decoding strategy called fuzzy contrastive decoding (FuzzyCD) that uses Takagi-Sugeno fuzzy inference to refine hallucination control. FuzzyCD adaptively assigns weights to high-hallucination logits while mitigating unnecessary linguistic bias. Specifically, it transforms the log-probabilities of top-1 tokens from both standard and hallucination logits into a \textit{confidence} linguistic fuzzy set. Through Takagi-Sugeno fuzzy inference, it dynamically adjusts hallucination logits to prevent the model from over-relying on spurious linguistic patterns. Experimental results on object hallucination datasets demonstrate that hallucination is reduced by 11 percentage points compared to conventional LVLMs. In-depth analyses highlight the effectiveness of FuzzyCD in enhancing the reliability of vision-language models.
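For context, the generic contrastive-decoding combination that FuzzyCD builds on can be sketched as below. In FuzzyCD the weight would come from Takagi-Sugeno fuzzy inference over token confidences; here it is a fixed parameter purely for illustration, so this is not the paper's decoding rule.

```python
import numpy as np

def contrastive_decode(std_logits, hall_logits, alpha=0.5):
    """Combine standard logits with 'hallucination' logits (e.g. from a degraded view).

    alpha in [0, 1] controls how strongly the hallucination branch is subtracted;
    an adaptive alpha is where fuzzy inference would plug in.
    """
    return (1.0 + alpha) * np.asarray(std_logits) - alpha * np.asarray(hall_logits)
```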
Poster
Jianting Tang · Yubo Wang · Haoyu Cao · Linli Xu

[ Exhibit Hall I ]

Abstract
Mainstream Multimodal Large Language Models (MLLMs) achieve visual understanding by using a vision projector to bridge well-pretrained vision encoders and large language models (LLMs). The inherent gap between visual and textual modalities makes the embeddings from the vision projector critical for visual comprehension. However, current alignment approaches treat visual embeddings as contextual cues and merely apply auto-regressive supervision to textual outputs, neglecting the need for equivalent direct visual supervision, which hinders finer alignment of the visual embeddings. In this paper, based on our analysis of the refinement process of visual embeddings in the LLM’s shallow layers, we propose BASIC, a method that utilizes refined visual embeddings within the LLM as supervision to directly guide the projector in generating initial visual embeddings. Specifically, the guidance is conducted from two perspectives: (i) optimizing embedding directions by reducing angles between initial and supervisory embeddings in semantic space; (ii) improving semantic matching by minimizing disparities between the logit distributions of both visual embeddings. Without additional supervisory models or artificial annotations, BASIC significantly improves the performance of MLLMs across a wide range of benchmarks, demonstrating the effectiveness of our introduced direct visual supervision.
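A minimal sketch of the two supervision signals described above, assuming the refined embeddings from the LLM's shallow layers are available as arrays: a cosine-based direction loss and a KL divergence between the logit distributions of the two visual embeddings. The function names and exact loss forms are illustrative guesses, not the paper's definitions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def direction_loss(initial_emb, refined_emb):
    # (i) shrink the angle between the projector's initial embeddings and the refined ones
    cos = np.sum(initial_emb * refined_emb, axis=-1) / (
        np.linalg.norm(initial_emb, axis=-1) * np.linalg.norm(refined_emb, axis=-1) + 1e-8)
    return float(np.mean(1.0 - cos))

def distribution_loss(initial_logits, refined_logits):
    # (ii) match the logit distributions of the two visual embeddings via KL divergence
    p = softmax(refined_logits)
    q = softmax(initial_logits)
    return float(np.mean(np.sum(p * (np.log(p + 1e-8) - np.log(q + 1e-8)), axis=-1)))
```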
Poster
Corentin Dumery · Noa Ette · Aoxiang Fan · Ren Li · Jingyi Xu · Hieu Le · Pascal Fua

[ Exhibit Hall I ]

Abstract
Visual object counting is a fundamental computer vision task underpinning numerous real-world applications, from cell counting in biomedicine to traffic and wildlife monitoring. However, existing methods struggle to handle the challenge of stacked 3D objects in which most objects are hidden by those above them. To address this important yet underexplored problem, we propose a novel 3D counting approach that decomposes the task into two complementary subproblems - estimating the 3D geometry of the object stack and the occupancy ratio from multi-view images. By combining geometric reconstruction and deep learning-based depth analysis, our method can accurately count identical objects within containers, even when they are irregularly stacked. We validate our 3D Counting pipeline on diverse real-world and large-scale synthetic datasets, which we will release publicly to facilitate further research.
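Under the stated decomposition, the final count reduces to simple arithmetic once the stack geometry and the occupancy ratio have been estimated; the toy helper and numbers below are made up for illustration only.

```python
def estimate_count(stack_volume_cm3, occupancy_ratio, object_volume_cm3):
    # Count = occupied volume divided by the volume of a single (identical) object.
    return (stack_volume_cm3 * occupancy_ratio) / object_volume_cm3

# e.g. a 1000 cm^3 reconstructed stack, 60% occupied, 5 cm^3 per object -> ~120 objects
print(estimate_count(1000.0, 0.6, 5.0))
```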
Poster
Xiaohang Zhan · Dingming Liu

[ Exhibit Hall I ]

Abstract
We propose a novel training-free image generation algorithm that precisely controls the occlusion relationships between objects in an image. Existing image generation methods typically rely on prompts to influence occlusion, which often lack precision. While layout-to-image methods provide control over object locations, they fail to address occlusion relationships explicitly. Given a pre-trained image diffusion model, our method leverages volume rendering principles to ``render'' the scene in latent space, guided by occlusion relationships and the estimated transmittance of objects. This approach does not require retraining or fine-tuning the image diffusion model, yet it enables accurate occlusion control due to its physics-grounded foundation. In extensive experiments, our method significantly outperforms existing approaches in terms of occlusion accuracy. Furthermore, we demonstrate that by adjusting the opacities of objects or concepts during rendering, our method can achieve a variety of effects, such as altering the transparency of objects, the density of mass (e.g., forests), the concentration of particles (e.g., rain, fog), the intensity of light, and the strength of lens effects.
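The volume-rendering intuition can be sketched with standard front-to-back alpha compositing over per-object contributions, where lowering an object's opacity makes it more transparent in the composite. This is generic compositing applied to arbitrary vectors, not the paper's latent-space renderer.

```python
import numpy as np

def composite_front_to_back(object_latents, opacities):
    """object_latents: list of (d,) contributions ordered front to back;
    opacities: per-object alpha in [0, 1]."""
    out = np.zeros_like(np.asarray(object_latents[0], dtype=float))
    transmittance = 1.0
    for latent, alpha in zip(object_latents, opacities):
        out += transmittance * alpha * np.asarray(latent, dtype=float)
        transmittance *= (1.0 - alpha)   # light that still passes through to objects behind
    return out
```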
Poster
Chao Liu · Yangbo Jiang · Nenggan Zheng

[ Exhibit Hall I ]

Abstract
Extracting tubular structures from images is a widespread and challenging task in computer vision. To explore these continuous structures, iterative tracing methods offer a promising direction. However, in scenes with dense and blurred branches, existing tracing methods tend to jump to adjacent branches during the tracing process, leading to significant topological mistakes. This shortcoming arises because the tracing model focuses only on estimating discrete nodes and ignores how they are connected. To solve this problem, we introduce NETracer, a topology-aware iterative tracing method that improves continuity and topological accuracy. In our approach, a node-edge estimation network with a local connectivity loss is trained to produce the future nodes and their connecting edges. Then, a geodesic distance-based search strategy is employed with the help of the predicted edge cues to trace the future branches more accurately. Additionally, to comprehensively assess tracing models, a new tracing metric is proposed to evaluate the local accuracy, continuity, and topological correctness of the traced branches. We demonstrate that our proposed method outperforms existing segmentation and tracing methods on five 2D road, vessel, and 3D neuron datasets.
Poster
Yujia Tong · Yuze Wang · Jingling Yuan · Chuang Hu

[ Exhibit Hall I ]

Abstract
Model quantization enables efficient deployment of deep neural networks on edge devices through low-bit parameter representation, yet raises critical challenges for implementing machine unlearning (MU) under data privacy regulations. Existing MU methods designed for full-precision models fail to address two fundamental limitations in quantized networks: 1) Noise amplification from label mismatch during data processing, and 2) Gradient imbalance between forgotten and retained data during training. These issues are exacerbated by quantized models' constrained parameter space and discrete optimization. We propose Q-MUL, the first dedicated unlearning framework for quantized models. Our method introduces two key innovations: 1) Similar Labels assignment replaces random labels with semantically consistent alternatives to minimize noise injection, and 2) Adaptive Gradient Reweighting dynamically aligns parameter update contributions from forgotten and retained data. Through systematic analysis of quantized model vulnerabilities, we establish theoretical foundations for these mechanisms. Extensive evaluations on benchmark datasets demonstrate Q-MUL's superiority over existing approaches.
Poster
Yuan Tian · Shuo Wang · Rongzhao Zhang · Zijian Chen · Yankai Jiang · Chunyi Li · Xiangyang Zhu · Fang Yan · Qiang Hu · Xiaosong Wang · Guangtao Zhai

[ Exhibit Hall I ]

Abstract
Medical imaging has significantly advanced computer-aided diagnosis, yet its re-identification (ReID) risks raise critical privacy concerns, calling for de-identification (DeID) techniques. Unfortunately, existing DeID methods neither particularly preserve medical semantics, nor are flexibly adjustable towards different privacy levels. To address these issues, we propose a divide-and-conquer framework that comprises two steps: (1) \textbf{Identity-Blocking}, which blocks varying proportions of identity-related regions, to achieve different privacy levels; and (2) \textbf{Medical-Semantics-Compensation}, which leverages pre-trained Medical Foundation Models (MFMs) to extract medical semantic features to compensate the blocked regions. Moreover, recognizing that features from MFMs may still contain residual identity information, we introduce a \textbf{Minimum Description Length} principle-based feature decoupling strategy, to effectively decouple and discard such identity components. Extensive evaluations against existing approaches across seven datasets and three downstream tasks demonstrate our state-of-the-art performance.
Poster
Yuanhan Zhang · Yunice Chew · Yuhao Dong · Aria Leo · Bo Hu · Ziwei Liu

[ Exhibit Hall I ]

Abstract
Human intelligence requires both correctness and robustness, with the former being foundational for the latter. In video understanding, correctness ensures the accurate interpretation of visual content, and robustness maintains consistent performance in challenging conditions. Despite advances in video large language models (video LLMs), existing benchmarks inadequately reflect the gap between these models and human intelligence in maintaining correctness and robustness in video interpretation. We introduce the Video Turing Test (Video-TT), a benchmark designed to assess whether video LLMs can interpret real-world videos as effectively as humans. Video-TT 1) differentiates between errors due to inadequate frame sampling and genuine gaps in understanding complex visual narratives, and 2) evaluates robustness against natural adversarial questions. Video-TT comprises 1,000 YouTube Shorts videos, each with one open-ended question and four adversarial questions that probe visual and narrative complexity. Our evaluation shows a significant gap between video LLMs and human performance, underscoring the need for benchmarks like Video-TT to advance video understanding.
Poster
Langyu Wang · Yingying Chen · Yiyuan Zhang · Ming Tang · Jinqiao Wang

[ Exhibit Hall I ]

Abstract
The weakly-supervised audio-visual video parsing (AVVP) task aims to predict all modality-specific events and locate their temporal boundaries. Despite significant progress, due to the limitations of weak supervision and deficiencies of the model architecture, existing methods struggle to improve segment-level and event-level prediction simultaneously. In this work, we propose an audio-visual Mamba network with pseudo labeling aUGmentation (MUG) that emphasises the uniqueness of each segment and excludes noise interference from the alternate modality. Specifically, we annotate some of the pseudo-labels based on previous work. Using unimodal pseudo-labels, we perform cross-modal random combinations to generate new data, which enhances the model’s ability to parse various segment-level event combinations. For feature processing and interaction, we employ an audio-visual Mamba network (AV-Mamba), which enhances the ability to perceive different segments and excludes additional modal noise while sharing similar modal information. Our extensive experiments demonstrate that MUG improves state-of-the-art results on the LLP dataset, especially in visual metrics (e.g., gains of 2.8\% and 1.1\% in terms of Segment-level visual and Event-level visual metrics).
Poster
Xin Shen · Xinyu Wang · Lei Shen · Kaihao Zhang · Xin Yu

[ Exhibit Hall I ]

Abstract
Cross-view isolated sign language recognition (CV-ISLR) addresses the challenge of identifying isolated signs from viewpoints unseen during training, a problem aggravated by the scarcity of multi-view data in existing benchmarks. To bridge this gap, we introduce a novel two-stage framework comprising View Synthesis and Contrastive Multi-task View-Semantics Recognition. In the View Synthesis stage, we simulate unseen viewpoints by extracting 3D keypoints from the frontal-view training dataset and synthesizing common-view 2D skeleton sequences with virtual camera rotation, which enriches view diversity without the cost of multi-camera setups. However, direct training on these synthetic samples leads to limited improvement, as viewpoint-specific and semantics-specific features remain entangled. To overcome this drawback, the Contrastive Multi-task View-Semantics Recognition stage employs the cross-attention mechanism and contrastive learning objective, explicitly disentangling viewpoint-related information from sign semantics, thus obtaining robust view-invariant representations. We evaluate our approach on the MM-WLAuslan dataset, the first benchmark for CV-ISLR, and on our extended protocol (MTV-Test) that includes additional multi-view data captured in the wild. Experimental results demonstrate that our method not only improves the accuracy of frontal-view skeleton-based isolated sign language recognition, but also exhibits superior generalization to novel viewpoints. The MTV-Test set and code will be publicly released here.
Poster
Zeren Jiang · Chuanxia Zheng · Iro Laina · Diane Larlus · Andrea Vedaldi

[ Exhibit Hall I ]

Abstract
We introduce Geo4D, a method to repurpose video diffusion models for monocular 3D reconstruction of dynamic scenes. By leveraging the strong dynamic prior captured by such video models, Geo4D can be trained using only synthetic data while generalizing well to real data in a zero-shot manner. Geo4D predicts several complementary geometric modalities, namely point, depth, and ray maps. It uses a new multi-modal alignment algorithm to align and fuse these modalities, as well as multiple sliding windows, at inference time, thus obtaining robust and accurate 4D reconstruction of long videos. Extensive experiments across multiple benchmarks show that Geo4D significantly surpasses state-of-the-art video depth estimation methods, including recent methods such as MonST3R, which are also designed to handle dynamic scenes.
Poster
Xuehan Chen · Guangyu Ren · Tianhong Dai · Tania Stathaki · Hengyan Liu

[ Exhibit Hall I ]

Abstract
Foundation models, such as Segment Anything (SAM), have exhibited remarkable performance in conventional segmentation tasks, primarily due to their training on large-scale datasets. Nonetheless, challenges remain in specific downstream tasks, such as Camouflaged Object Detection (COD). Existing research primarily aims to enhance performance by integrating additional multimodal information derived from other foundation models. However, directly leveraging the information generated by these models may introduce additional biases due to domain shifts. To address this issue, we propose an Adaptive Refinement Module (ARM), which efficiently processes multimodal information and simultaneously enhances the refined mask prompt. Furthermore, we construct an auxiliary embedding that effectively exploits the intermediate information generated within ARM, providing SAM with richer feature representations. Experimental results indicate that our proposed architecture surpasses most state-of-the-art (SOTA) models on the COD task, particularly excelling in structured target segmentation.
Poster
Wenzheng Zeng · Difei Gao · Mike Zheng Shou · Hwee Tou Ng

[ Exhibit Hall I ]

Abstract
Video LLMs show great potential for video understanding, but still struggle with accurate temporal grounding for event-level perception. We observe that two main factors in video understanding (i.e., temporal grounding and textual response) form a logical hierarchy: accurate temporal evidence grounding lays the foundation for reliable textual response. However, existing works typically handle these two tasks in a coupled manner without a clear logical structure, leading to sub-optimal objectives. We address this from a factorized learning perspective. We first propose D$^2$VLM, a framework that \underline{d}ecouples the learning of these two tasks while also emphasizing their inherent \underline{d}ependency. We adopt a ``grounding then answering with evidence referencing'' paradigm and introduce evidence tokens for evidence grounding, which emphasize event-level visual semantic capture beyond the focus on timestamp representation in existing works. To further facilitate the learning of these two tasks, we introduce a novel factorized preference optimization (FPO) algorithm. Unlike standard preference optimization, FPO explicitly incorporates probabilistic temporal grounding modeling into the optimization objective, enabling preference learning for both temporal grounding and textual response. We also construct a synthetic dataset to address the lack of suitable datasets for factorized preference learning with explicit temporal grounding. Experiments on various tasks demonstrate the clear …
Poster
Bingchao Wang · Zhiwei Ning · Jianyu Ding · Xuanang Gao · Yin Li · Dongsheng Jiang · JIE YANG · Wei Liu

[ Exhibit Hall I ]

Abstract
CLIP has shown promising performance across many short-text tasks in a zero-shot manner. However, limited by the input length of the text encoder, CLIP struggles on downstream tasks with long-text inputs ($>77$ tokens). To improve long-text understanding while preserving short-text capabilities, we propose Fix-CLIP which includes three novel modules: (1) A dual-branch training pipeline that aligns short and long texts with masked and raw images respectively, which boosts the long-text representation while preserving the short-text ability. (2) Multiple learnable regional prompts with unidirectional masks in Transformer layers for regional information extraction. (3) A hierarchical feature alignment module in the intermediate encoder layers to promote the consistency of multi-scale features. Furthermore, we collect 30M images and utilize existing MLLMs to synthesize long-text captions for training. Extensive experiments show that Fix-CLIP achieves state-of-the-art performance on both long-text and short-text retrieval benchmarks. For downstream applications, we reveal that Fix-CLIP's text encoder delivers promising performance in a plug-and-play manner for diffusion models with long-text input.
Poster
Mengdi Liu · Zhangyang Gao · Hong Chang · Stan Li · Shiguang Shan · Xilin Chen

[ Exhibit Hall I ]

Abstract
Understanding how genes influence phenotype across species is a fundamental challenge in genetic engineering, which will facilitate advances in various fields such as crop breeding, conservation biology, and personalized medicine. However, current phenotype prediction models are limited to individual species and rely on an expensive phenotype labeling process, making genotype-to-phenotype prediction a highly domain-dependent and data-scarce problem. To this end, we suggest taking images as morphological proxies, facilitating cross-species generalization through large-scale multimodal pretraining. We propose the first genotype-to-phenotype diffusion model (G2PDiffusion) that generates morphological images from DNA considering two critical evolutionary signals, i.e., multiple sequence alignments (MSA) and environmental contexts. The model contains three novel components: 1) an MSA retrieval engine that identifies conserved and co-evolutionary patterns; 2) an environment-aware MSA conditional encoder that effectively models complex genotype-environment interactions; and 3) an adaptive phenomic alignment module to improve genotype-phenotype consistency. Extensive experiments show that integrating evolutionary signals with environmental context enriches the model's understanding of phenotype variability across species, thereby offering a valuable and promising exploration into advanced AI-assisted genomic analysis.
Poster
Hyolim Kang · Yunsu Park · Youngbeom Yoo · Yeeun Choi · Seon Joo Kim

[ Exhibit Hall I ]

Abstract
We introduce Hierarchical Streaming Video Understanding, a task that combines online temporal action localization with free-form description generation. Given the scarcity of datasets with hierarchical and fine-grained temporal annotations, we demonstrate that LLMs can effectively group atomic actions into higher-level events, enriching existing datasets. We then propose OpenHOUSE (Open-ended Hierarchical Online Understanding System for Events), which extends streaming action perception beyond action classification. OpenHOUSE features a specialized streaming module that accurately detects boundaries between closely adjacent actions, nearly doubling the performance of direct extensions of existing methods. We envision the future of streaming action perception in the integration of powerful generative models, with OpenHOUSE representing a key step in that direction.
Poster
Mattia Segu · Marta Tintore Gazulla · Yongqin Xian · Luc Gool · Federico Tombari

[ Exhibit Hall I ]

Abstract
Scaling up model size and training data has advanced foundation models for instance-level perception, achieving state-of-the-art in-domain and zero-shot performance across object detection and segmentation. However, their high computational cost limits adoption on resource-constrained platforms. We first examine the limitations of existing architectures in enabling efficient edge deployment without compromising performance. We then introduce MOBIUS, a family of foundation models for universal instance segmentation, designed for Pareto-optimal downscaling to support deployment across devices ranging from high-end accelerators to mobile hardware. To reduce training and inference demands, we propose: (i) a bottleneck pixel decoder for efficient multi-scale and multi-modal fusion, (ii) a language-guided uncertainty calibration loss for adaptive decoder pruning, and (iii) a streamlined, unified training strategy. Unlike efficient baselines that trade accuracy for reduced complexity, MOBIUS reduces pixel and transformer decoder FLOPs by up to 55% and 75%, respectively, while maintaining state-of-the-art performance in just a third of the training iterations. MOBIUS establishes a new benchmark for efficient segmentation on both high-performance computing platforms and mobile devices.
Poster
Guoyizhe Wei · Rama Chellappa

[ Exhibit Hall I ]

Abstract
Vision Transformers (ViTs) have delivered remarkable progress through global self-attention, yet their quadratic complexity can become prohibitive for high-resolution inputs. In this work, we present ViT-Linearizer, a cross-architecture distillation framework that transfers rich ViT representations into a linear-time, recurrent-style model. Our approach leverages 1) activation matching, an intermediate constraint that encourages the student to align its token-wise dependencies with those produced by the teacher, and 2) masked prediction, a contextual reconstruction objective that requires the student to predict the teacher’s representations for unseen (masked) tokens, to effectively distill the quadratic self-attention knowledge into the student while maintaining efficient complexity. Empirically, our method provides notable speedups particularly for high-resolution tasks, significantly addressing the hardware challenges in inference. Additionally, it also elevates Mamba-based architectures’ performance on standard vision benchmarks, achieving a competitive 84.3% top-1 accuracy on ImageNet with a base-sized model. Our results underscore the potential of RNN-based solutions for large-scale visual tasks, bridging the gap between theoretical efficiency and real-world practice.
Poster
Meng Tian · Shuo Yang · Xinxiao Wu

[ Exhibit Hall I ]

Abstract
Driven by large-scale contrastive vision-language pre-trained models such as CLIP, recent advancements in the image-text matching task have achieved remarkable success in representation learning. Due to image-level visual-language alignment, CLIP falls short in understanding fine-grained details such as object attributes and spatial relationships between objects. Recent efforts have attempted to compel CLIP to acquire structured visual representations by introducing prompt learning to achieve object-level alignment. While achieving promising results, they still lack the capability to perceive actions, which are crucial for describing the states or relationships between objects. Therefore, we propose to endow CLIP with fine-grained action-level understanding by introducing an LLM-enhanced action-aware multi-modal prompt-tuning method, incorporating the action-related external knowledge generated by large language models (LLMs). Specifically, we design an action triplet prompt and an action state prompt to exploit compositional semantic knowledge and state-related causal knowledge implicitly stored in LLMs. Subsequently, we propose an adaptive interaction module to aggregate attentive visual features conditioned on action-aware prompted knowledge for establishing discriminative and action-aware visual representations, which further improves the performance. Comprehensive experimental results on two benchmark datasets demonstrate the effectiveness of our method.
Poster
Weixian Lei · Jiacong Wang · Haochen Wang · Xiangtai Li · Jun Hao Liew · Jiashi Feng · Zilong Huang

[ Exhibit Hall I ]

Abstract
This paper introduces SAIL, a single transformer unified multimodal large language model (MLLM) that integrates raw pixel encoding and language decoding within a singular architecture. Unlike existing modular MLLMs, which rely on a pre-trained vision transformer (ViT), SAIL eliminates the need for a separate vision encoder, presenting a more minimalist architecture design. Instead of introducing novel architectural components, SAIL adapts mix-attention mechanisms and multimodal positional encodings to better align with the distinct characteristics of visual and textual modalities. We systematically compare SAIL's properties, including scalability, cross-modal information flow patterns, and visual representation capabilities, with those of modular MLLMs. By scaling both training data and model size, SAIL achieves performance comparable to modular MLLMs. Notably, the removal of pretrained ViT components enhances SAIL's scalability and results in significantly different cross-modal information flow patterns. Moreover, SAIL demonstrates strong visual representation capabilities, achieving results on par with ViT-22B in vision tasks such as semantic segmentation. Code and models will be released.
Poster
Pablo Garcia-Fernandez · Lorenzo Vaquero · Mingxuan Liu · Feng Xue · Daniel Cores · Nicu Sebe · Manuel Mucientes · Elisa Ricci

[ Exhibit Hall I ]

Abstract
Open-vocabulary object detection (OvOD) is set to revolutionize security screening by enabling systems to recognize any item in X-ray scans. However, developing effective OvOD models for X-ray imaging presents unique challenges due to data scarcity and the modality gap that prevents direct adoption of RGB-based solutions. To overcome these limitations, we propose RAXO, a training-free framework that repurposes off-the-shelf RGB OvOD detectors for robust X-ray detection. RAXO builds high-quality X-ray class descriptors using a dual-source retrieval strategy. It gathers relevant RGB images from the web and enriches them via a novel X-ray material transfer mechanism, eliminating the need for labeled databases. These visual descriptors replace text-based classification in OvOD, leveraging intra-modal feature distances for robust detection. Extensive experiments demonstrate that RAXO consistently improves OvOD performance, providing an average mAP increase of up to 17.0 points over base detectors. To further support research in this emerging field, we also introduce DET-COMPASS, a new benchmark featuring bounding box annotations for over 300 object categories, enabling large-scale evaluation of OvOD in X-ray. Code and dataset will be made available.
Poster
Hao LU · Yuting Zhang · Jiaqi Tang · Bowen Fu · Wenhang Ge · Wei Wei · Kaishun Wu · Ying-Cong Chen

[ Exhibit Hall I ]

Abstract
Remote Photoplethysmography (rPPG) enables non-contact extraction of physiological signals, providing significant advantages in medical monitoring, emotion recognition, and face anti-spoofing. However, the extraction of reliable rPPG signals is hindered by motion variations in real-world environments, leading to an entanglement issue. To address the challenge, we employ the Generalizable Gaussian Model (GGM) to disentangle geometry and chroma components with 4D Gaussian representations. Employing the GGM for robust rPPG estimation is non-trivial. Firstly, there are no camera parameters in the dataset, resulting in the inability to render video from the 4D Gaussian representation. The ``4D virtual camera'' is proposed to construct extra Gaussian parameters to describe view and motion changes, giving the ability to render video with fixed virtual camera parameters. Further, the chroma component is still not explicitly decoupled in the 4D Gaussian representation. Explicit motion modeling (EMM) is designed to decouple the motion variation in an unsupervised manner. Explicit chroma modeling (ECM) is tailored to decouple specular, physiological, and noise signals, respectively. To validate our approach, we expand existing rPPG datasets to include various motion and illumination interference scenarios, demonstrating the effectiveness of our method in real-world settings. The code will be available after acceptance.
Poster
Seogkyu Jeon · Kibeom Hong · Hyeran Byun

[ Exhibit Hall I ]

Abstract
Recent domain generalized semantic segmentation (DGSS) studies have achieved notable improvements by distilling semantic knowledge from Vision-Language Models (VLMs). However, they overlook the semantic misalignment between visual and textual contexts, which arises due to the rigidity of a fixed context prompt learned on a single source domain. To this end, we present a novel domain generalization framework for semantic segmentation, namely Domain-aware Prompt-driven Masked Transformer (DPMFormer). Firstly, we introduce domain-aware prompt learning to facilitate semantic alignment between visual and textual cues. To capture various domain-specific properties with a single source dataset, we propose domain-aware contrastive learning along with a texture perturbation that diversifies the observable domains. Lastly, to establish a framework resilient to diverse environmental changes, we propose domain-robust consistency learning, which guides the model to minimize discrepancies between predictions from the original and augmented images. Through experiments and analyses, we demonstrate the superiority of the proposed framework, which establishes a new state-of-the-art on various DGSS benchmarks.
Poster
Yudong Liu · Jingwei Sun · Yueqian Lin · Jingyang Zhang · Ming Yin · Qinsi Wang · Jianyi Zhang · Hai Li · Yiran Chen

[ Exhibit Hall I ]

Abstract
Vision language models (VLMs) demonstrate strong capabilities in jointly processing visual and textual data. However, they often incur substantial computational overhead due to redundant visual information, particularly in long-form video scenarios. Existing approaches predominantly focus on either vision token pruning, which may overlook spatio-temporal dependencies, or keyframe selection, which identifies informative frames but discards others, thus disrupting contextual continuity. In this work, we propose KVTP (Keyframe-oriented Vision Token Pruning), a novel framework that overcomes the drawbacks of token pruning and keyframe selection. By adaptively assigning pruning rates based on frame relevance to the query, KVTP effectively retains essential contextual information while significantly reducing redundant computation. To thoroughly evaluate the long-form video understanding capacities of VLMs, we curated and reorganized subsets from VideoMME, EgoSchema, and NextQA into a unified benchmark named SparseKV-QA that highlights real-world scenarios with sparse but crucial events. Our experiments with VLMs of various scales show that KVTP can reduce token usage by 80% without compromising spatiotemporal and contextual consistency, significantly cutting computation while maintaining the performance. These results demonstrate our approach's effectiveness in efficient long-video processing, facilitating more scalable VLM deployment.
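A minimal sketch of the adaptive-pruning-rate idea: each frame's visual-token budget is made proportional to its relevance to the query, so relevant frames keep more tokens while no frame is dropped entirely. The proportional rule, the minimum budget, and the names are assumptions for illustration, not the KVTP scoring network.

```python
import numpy as np

def per_frame_token_budget(relevance, total_budget, min_tokens=1):
    """relevance: (n_frames,) query-relevance scores; returns tokens to keep per frame."""
    weights = np.maximum(np.asarray(relevance, dtype=float), 0.0)
    weights = weights / (weights.sum() + 1e-8)
    budget = np.floor(weights * total_budget).astype(int)
    return np.maximum(budget, min_tokens)   # keep at least a few tokens per frame
```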
Poster
Han Wang · Yuxiang Nie · Yongjie Ye · Yanjie Wang · SHUAI LI · Haiyang Yu · Jinghui Lu · Can Huang

[ Exhibit Hall I ]

Abstract
The application of Large Vision-Language Models (LVLMs) for analyzing images and videos is an exciting and rapidly evolving field. In recent years, we've seen significant growth in high-quality image-text datasets for fine-tuning image understanding, but there is still a lack of comparable datasets for videos. Additionally, many VideoLLMs are extensions of single-image VLMs, which may not efficiently handle the complexities of longer videos. In this study, we introduce a large-scale synthetic dataset created from proprietary models, using carefully designed prompts to tackle a wide range of questions. We also explore a dynamic visual token compression architecture that strikes a balance between computational efficiency and performance. Our proposed Dynamic-VLM achieves state-of-the-art results across various video tasks and shows impressive generalization, setting new baselines in multi-image understanding. Notably, Dynamic-VLM delivers an absolute improvement of 2.7% over LLaVA-OneVision on VideoMME and 10.7% on MuirBench.
Poster
Min Yang · Zihan Jia · Zhilin Dai · Sheng Guo · Limin Wang

[ Exhibit Hall I ]

Abstract
Although large models have achieved good results on an increasing number of vision tasks, efficient lightweight neural networks have received growing attention due to their faster inference and easier deployment on mobile devices. However, existing video models still focus on larger ViT architectures, and few works attempt to build efficient architectures. Since many efficient contrastive language-image pre-training (CLIP) models have shown strong zero-shot classification and retrieval capability, we attempt to fill the gap in video-text understanding models and propose a fast and efficient video-text model \textbf{MobileViCLIP} with strong zero-shot reasoning capability that can be deployed on mobile devices. In particular, our MobileViCLIP-Small obtains similar zero-shot retrieval performance to InternVideo2-L14 on the text-to-video dataset MSR-VTT while being $46.7\times$ faster when deployed on a mobile device. Furthermore, MobileViCLIP-Small can generalize to the zero-shot action recognition task and obtains 1.0\% better Top-1 accuracy than InternVideo2-S14 while being $5.6\times$ faster on the mobile device.
Poster
Junpeng Jing · Weixun Luo · Ye Mao · Krystian Mikolajczyk

[ Exhibit Hall I ]

Abstract
This paper introduces Stereo Any Video, a powerful framework for video stereo matching. It can estimate spatially accurate and temporally consistent disparities without relying on auxiliary information such as camera poses or optical flow. The strong capability is driven by rich priors from monocular video depth models, which are integrated with convolutional features to produce stable representations. To further enhance performance, key architectural innovations are introduced: all-to-all-pairs correlation, which constructs smooth and robust matching cost volumes, and temporal convex upsampling, which improves temporal coherence. These components collectively ensure robustness, accuracy, and temporal consistency, setting a new standard in video stereo matching. Extensive experiments demonstrate that our method achieves state-of-the-art performance across multiple datasets both qualitatively and quantitatively in zero-shot settings, as well as strong generalization to real-world indoor and outdoor scenarios. Code and models will be publicly released.
Poster
Yehao Lu · Minghe Weng · Zekang Xiao · Rui Jiang · Wei Su · Guangcong Zheng · Luping Luping · Xi Li

[ Exhibit Hall I ]

Abstract
The Mixture of Experts (MoE) architecture has excelled in Large Vision-Language Models (LVLMs), yet its potential in real-time open-vocabulary object detectors, which also leverage large-scale vision-language datasets but smaller models, remains unexplored. This work investigates this domain, revealing intriguing insights. In the shallow layers, experts tend to cooperate with diverse peers to expand the search space. In the deeper layers, by contrast, fixed collaborative structures emerge, where each expert maintains 2-3 fixed partners and distinct expert combinations are specialized in processing specific patterns. Concretely, we propose Dynamic-DINO, which extends Grounding DINO 1.5 Edge from a dense model to a dynamic inference framework via an efficient MoE-Tuning strategy. Additionally, we design a granularity decomposition mechanism to decompose the Feed-Forward Network (FFN) of the base model into multiple smaller expert networks, expanding the subnet search space. To prevent performance degradation at the start of fine-tuning, we further propose a pre-trained weight allocation strategy for the experts, coupled with a specific router initialization. During inference, only the input-relevant experts are activated to form a compact subnet. Experiments show that, pretrained with merely 1.56M open-source data, Dynamic-DINO outperforms Grounding DINO 1.5 Edge, pretrained on the private Grounding20M dataset.
Poster
Qizhe Zhang · Aosong Cheng · Ming Lu · Renrui Zhang · Zhiyong Zhuo · Jiajun Cao · Shaobo Guo · Qi She · Shanghang Zhang

[ Exhibit Hall I ]

Abstract
Large vision-language models (VLMs) generally contain significantly more visual tokens than their textual counterparts, resulting in a considerable computational burden. Recent efforts have been made to tackle this issue by pruning visual tokens early within the language model. Most existing works use attention scores between text and visual tokens to assess the importance of visual tokens. However, in this study, we first analyze the text-visual attention in the language model and find that this score is not an ideal indicator for visual token pruning. Based on this analysis, we propose **VisPruner**, a plug-and-play method that utilizes visual cues for more effective token pruning in vision-language models (VLMs). Specifically, we first use visual attention to select a limited number of significant tokens. Then, we remove duplicate tokens from the remaining ones based on their similarity. By retaining diverse tokens alongside the initially selected important tokens, we maximally preserve the visual information of the input image. Experimental results demonstrate that our VisPruner sustains strong performance across various VLM architectures and reduction ratios, significantly outperforming existing methods based on text-visual attention. Notably, without any training, VisPruner can reduce the FLOPs of LLaVA-1.5-7B by 91% and inference latency by 75%, while maintaining comparable …
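The two-step recipe described above (keep the most visually attended tokens, then add diverse, non-duplicate tokens) can be sketched as follows; the budgets and similarity threshold are illustrative, and the real method operates inside a VLM rather than on plain arrays.

```python
import numpy as np

def prune_tokens(tokens, attn_scores, n_important=64, n_diverse=64, sim_thresh=0.9):
    """tokens: (n, d) visual tokens; attn_scores: (n,) visual-attention importance."""
    order = np.argsort(-attn_scores)
    keep = list(order[:n_important])              # step 1: most attended tokens
    normed = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
    for idx in order[n_important:]:               # step 2: add diverse, non-duplicate tokens
        if len(keep) >= n_important + n_diverse:
            break
        if np.max(normed[idx] @ normed[keep].T) < sim_thresh:
            keep.append(idx)
    return np.array(keep)                         # indices of retained visual tokens
```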
Poster
Qiao Zhang · Mingwen Shao · Xinyuan Chen · Xiang Lv · Kai Xu

[ Exhibit Hall I ]

Abstract
The Mamba model excels in anomaly detection through efficient long-range dependency modeling and linear complexity. However, Mamba-based anomaly detectors still face two critical challenges: (1) insufficient modeling of diverse local features leading to inaccurate detection of subtle anomalies; (2) spatial-wise scanning mechanism disrupting the spatial continuity of large-scale anomalies, resulting in incomplete localization. To address these challenges, we propose Wave-MambaAD, a wavelet-driven state space model for unified subtle and large-scale anomaly detection. Firstly, to capture subtle anomalies, we design a high-frequency state space model that employs horizontal, vertical, and diagonal scanning mechanisms for processing directionally aligned high-frequency components, enabling precise anomaly detection through multidimensional feature extraction. Secondly, for comprehensive localization of large-scale anomalies, we propose a low-frequency state space model implementing channel-adaptive dynamic scanning mechanisms to maintain structural coherence in global contexts, which facilitates large-scale anomaly detection via adaptive feature integration. Finally, we develop a dynamic spatial enhancement block to improve anomalous feature representation by enhancing feature diversity through coordinated inter-channel communication and adaptive gating mechanisms. Comprehensive experiments on benchmark anomaly detection datasets show that Wave-MambaAD achieves competitive performance at lower parameters and computational costs.
Poster
Yingyue Li · Bencheng Liao · Wenyu Liu · Xinggang Wang

[ Exhibit Hall I ]

Abstract
With the advancement of RNN models with linear complexity, the quadratic complexity challenge of transformers has the potential to be overcome. Notably, the emerging Mamba-2 has demonstrated competitive performance, bridging the gap between RNN models and transformers. However, due to sequential processing and vanishing gradients, RNN models struggle to capture long-range dependencies, leading to slow convergence, high resource demands, and suboptimal performance on downstream understanding and complex reasoning tasks. In this work, we introduce MaTVLM, a hybrid model that replaces a portion of the transformer decoder layers in a pre-trained VLM with Mamba-2 layers. By leveraging the inherent relationship between attention and Mamba-2, we initialize Mamba-2 with corresponding attention weights to accelerate convergence. We further enhance training efficiency through a single-stage distillation process, using the pre-trained VLM as a teacher model to transfer knowledge to MaTVLM. Additionally, we explore the impact of differential distillation losses within our training framework. Evaluations across multiple benchmarks demonstrate that MaTVLM achieves competitive performance against the teacher model and existing VLMs while outperforming both Mamba-based VLMs and models with similar parameter scales. Remarkably, MaTVLM attains up to 3.6× faster inference than the teacher model and reduces GPU memory consumption by 27.5%, all without compromising performance.
Poster
Tianyuan Qu · Longxiang Tang · Bohao PENG · Senqiao Yang · Bei Yu · Jiaya Jia

[ Exhibit Hall I ]

Abstract
The rise of Large Vision-Language Models (LVLMs) has significantly advanced video understanding. However, efficiently processing long videos remains a challenge due to the ``Sampling Dilemma'': low-density sampling risks missing critical information, while high-density sampling introduces redundancy. To address this issue, we introduce LSDBench, the first benchmark designed to evaluate LVLMs on long-video tasks by constructing high Necessary Sampling Density (NSD) questions, where NSD represents the minimum sampling density required to accurately answer a given question. LSDBench focuses on dense, short-duration actions to rigorously assess the sampling strategies employed by LVLMs. To tackle the challenges posed by high-NSD questions, we propose a novel Reasoning-Driven Hierarchical Sampling (RHS) framework, which combines global localization of question-relevant cues with local dense sampling for precise inference. Additionally, we develop a lightweight Semantic-Guided Frame Selector to prioritize informative frames, enabling RHS to achieve comparable or superior performance with significantly fewer sampled frames. Together, our LSDBench and RHS framework address the unique challenges of high-NSD long-video tasks, setting a new standard for evaluating and improving LVLMs in this domain.
Poster
Yongjian Wu · Yang Zhou · Jiya Saiyin · Bingzheng Wei · Yan Xu

[ Exhibit Hall I ]

Abstract
We propose VisTex-OVLM, a novel image-prompted object detection method that introduces visual textualization, a process that projects a few visual exemplars into the text feature space to enhance Object-level Vision-Language Models' (OVLMs) capability in detecting rare categories that are difficult to describe textually and nearly absent from their pre-training data, while preserving their pre-trained object-text alignment. Specifically, VisTex-OVLM leverages multi-scale textualizing blocks and a multi-stage fusion strategy to integrate visual information from visual exemplars, generating textualized visual tokens that effectively guide OVLMs alongside text prompts. Unlike previous methods, our method maintains the original architecture of the OVLM, preserving its generalization capabilities while enhancing performance in few-shot settings. VisTex-OVLM demonstrates superior performance across open-set datasets which have minimal overlap with OVLM's pre-training data and achieves state-of-the-art results on the few-shot benchmarks PASCAL VOC and MSCOCO. The code will be released at VisTex-OVLM.
Poster
Yuxuan Yuan · Luyao Tang · Chaoqi Chen · Yixin Chen · Yue Huang · Xinghao Ding

[ Exhibit Hall I ]

Abstract
Albeit existing Single-Domain Generalized Object Detection (Single-DGOD) methods enable models to generalize to unseen domains, most assume that the training and testing data share the same label space. In real-world scenarios, unseen domains often introduce previously unknown objects, a challenge that has been largely overlooked. In this paper, we tackle the practical problem of Single-domain Generalizable Open-Set Object Detection (SG-OSOD), which addresses both unseen domains and unknown classes. We identify two key challenges: (1) detecting unknown classes with only known-class data, and (2) learning robust features to mitigate domain shift. To address these challenges, we propose the framework termed $\texttt{ASGS}$, which leverages adaptive subgraph structures to enhance the understanding of unknown scenes and classes. $\texttt{ASGS}$ consists of Subgraph-wise Unknown-class Learning (SUL) and Class-wise Embedding Compaction (CEC). SUL employs non-parametric methods to detect unknown samples and performs Adaptive Subgraph Searching (ASS) for high-order structural feature extraction, enabling domain-robust unknown class learning. Moreover, the CEC module enhances class discrimination robustness through contrastive learning, which results in more compact class clusters in unknown scenarios. Experimental results demonstrate the effectiveness of the proposed $\texttt{ASGS}$.
Poster
Yuhao Wang · Wei Xi

[ Exhibit Hall I ]

Abstract
Convolutional neural networks (ConvNets) with a large effective receptive field (ERF), though still in their early stages, have demonstrated promising effectiveness, but they are constrained by high parameter and FLOPs costs and a disrupted asymptotically Gaussian distribution (AGD) of the ERF. This paper proposes an alternative paradigm: rather than merely employing an extremely large ERF, it is more effective and efficient to expand the ERF while maintaining the AGD of the ERF through a proper combination of smaller kernels, such as $7\times{7}$, $9\times{9}$, $11\times{11}$. This paper introduces a Three-layer Receptive Field Aggregator and designs a Layer Operator as the fundamental operator from the perspective of the receptive field. The ERF can be expanded to the level of existing large-kernel ConvNets through the stacking of the proposed modules while maintaining the AGD of the ERF. Using these designs, we propose a universal ConvNet, termed UniConvNet. Extensive experiments on ImageNet-1K, COCO2017, and ADE20K demonstrate that UniConvNet outperforms state-of-the-art CNNs and ViTs across various vision recognition tasks for both lightweight and large-scale models with comparable throughput. Surprisingly, UniConvNet-T achieves $84.2\%$ ImageNet top-1 accuracy with $30M$ parameters and $5.1G$ FLOPs. UniConvNet-XL also shows competitive scalability to big data and large models, acquiring $88.4\%$ top-1 accuracy on ImageNet and $56.9\%$ on COCO.
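The receptive-field arithmetic behind combining smaller kernels is standard: each stacked stride-1 convolution adds $(k-1)$ to the receptive field, so a $7\times{7}$, $9\times{9}$, $11\times{11}$ stack already reaches a $25\times{25}$ receptive field. The helper below only computes this bound and is not the proposed Three-layer Receptive Field Aggregator.

```python
def stacked_receptive_field(kernel_sizes):
    # Receptive field of sequentially stacked stride-1, dilation-1 convolutions:
    # start from a single pixel and add (k - 1) per layer.
    rf = 1
    for k in kernel_sizes:
        rf += k - 1
    return rf

print(stacked_receptive_field([7, 9, 11]))  # -> 25, comparable to one 25x25 kernel
```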
Poster
Rakshith Madhavan · Federica Arrigoni

[ Exhibit Hall I ]

Abstract
The viewing graph is a compact tool to encode the geometry of multiple views: nodes represent uncalibrated cameras and edges represent fundamental matrices (when available). Most research focuses on theoretical analyses, exploring for which viewing graphs it is possible (in principle) to retrieve cameras from fundamental matrices, in the sense that the problem admits a unique solution for noiseless data. However, the practical task of recovering cameras from noisy fundamental matrices is still open, as available methods are limited to special graphs (such as those covered by triplets). In this paper, we develop the first method that can deal with the recovery of cameras from noisy fundamental matrices in a general viewing graph. Experimental results demonstrate the promise of the proposed approach on a variety of synthetic and real scenarios.
Poster
Alex Costanzino · Pierluigi Zama Ramirez · Luigi Lella · Matteo Ragaglia · Alessandro Oliva · Giuseppe Lisanti · Luigi Stefano

[ Exhibit Hall I ]

Abstract
We propose SiM3D, the first benchmark considering the integration of multiview and multimodal information for comprehensive 3D anomaly detection and segmentation (ADS) where the task is to produce a voxel-based Anomaly Volume. Moreover, SiM3D focuses on a scenario of high interest in manufacturing: single-instance anomaly detection, where only one object, either real or synthetic, is available for training. In this respect, SiM3D stands out as the first ADS benchmark that addresses the challenge of generalizing from synthetic training data to real test data. SiM3D includes a novel multimodal multiview dataset acquired using top-tier industrial sensors and robots. The dataset features multiview high-resolution images (12 ${\tt Mpx}$) and point clouds ($\sim$7M points) for 333 instances of eight types of objects, alongside a CAD model for each type. We also provide manually annotated 3D segmentation GTs for anomalous test samples. To establish reference baselines for the proposed multiview 3D ADS task, we adapt prominent singleview methods and assess their performance using novel metrics that operate on Anomaly Volumes.
Poster
Xiwei Xuan · Ziquan Deng · Kwan-Liu Ma

[ Exhibit Hall I ]

Abstract
Training-free open-vocabulary semantic segmentation (OVS) aims to segment images given a set of arbitrary textual categories without costly model fine-tuning. Existing solutions often explore attention mechanisms of pre-trained models, such as CLIP, or generate synthetic data and design complex retrieval processes to perform OVS. However, their performance is limited by the capability of reliant models or the suboptimal quality of reference sets. In this work, we investigate the largely overlooked data quality problem for this challenging dense scene understanding task, and identify that a high-quality reference set can significantly benefit training-free OVS. With this observation, we introduce a data-quality-oriented framework, comprising a data pipeline to construct a reference set with well-paired segment-text embeddings and a simple similarity-based retrieval to unveil the essential effect of data. Remarkably, extensive evaluations on ten benchmark datasets demonstrate that our method outperforms all existing training-free OVS approaches, highlighting the importance of data-centric design for advancing OVS without training.
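A minimal sketch of similarity-based retrieval over a reference set of segment-text embeddings: each query segment embedding is matched against the reference set by cosine similarity and labeled by a majority vote over its nearest entries. The data layout and voting rule are illustrative assumptions, not the paper's pipeline.

```python
import numpy as np

def assign_labels(segment_embs, ref_embs, ref_labels, k=5):
    """segment_embs: (n_seg, d) query segments; ref_embs: (n_ref, d) reference embeddings;
    ref_labels: list of n_ref category names paired with the reference embeddings."""
    seg = segment_embs / np.linalg.norm(segment_embs, axis=1, keepdims=True)
    ref = ref_embs / np.linalg.norm(ref_embs, axis=1, keepdims=True)
    sim = seg @ ref.T
    labels = []
    for row in sim:
        topk = np.argsort(-row)[:k]
        votes = [ref_labels[i] for i in topk]
        labels.append(max(set(votes), key=votes.count))   # majority vote among neighbours
    return labels
```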
Poster
Hallee Wong · Jose Javier Gonzalez Ortiz · John Guttag · Adrian Dalca

[ Exhibit Hall I ]

Abstract
Medical researchers and clinicians often need to perform novel segmentation tasks on a set of related images. Existing methods for segmenting a new dataset are either interactive, requiring substantial human effort for each image, or require an existing set of previously labeled images. We introduce a system, MultiverSeg, that enables practitioners to rapidly segment an entire new dataset without requiring access to any existing labeled data from that task or domain. Along with the image to segment, the model takes user interactions such as clicks, bounding boxes or scribbles as input, and predicts a segmentation. As the user segments more images, those images and segmentations become additional inputs to the model, providing context. As the context set of labeled images grows, the number of interactions required to segment each new image decreases. We demonstrate that MultiverSeg enables users to interactively segment new datasets efficiently, by amortizing the number of interactions per image to achieve an accurate segmentation. Compared to using a state-of-the-art interactive segmentation method, MultiverSeg reduced the total number of clicks by 40% and scribble steps by 29% to achieve 90% Dice on sets of images from unseen tasks. We will release code and model weights.
Poster
Xinyang Zhou · Fanyue Wei · Lixin Duan · Angela Yao · Wen Li

[ Exhibit Hall I ]

Abstract
Given a textual query along with a corresponding video, moment retrieval aims to localize the moments relevant to the query within the video. While commendable results have been demonstrated by existing transformer-based approaches, predicting the accurate temporal span of the target moment is still a major challenge. This paper reveals that a crucial reason stems from the spurious correlation between the text query and the moment context. Namely, the model makes predictions by overly associating queries with background frames rather than distinguishing target moments. To address this issue, we propose a dynamic learning approach for moment retrieval, where two strategies are designed to mitigate the spurious correlation. First, we introduce a novel video synthesis approach to construct a dynamic context for the queried moment, enabling the model to attend to the target moment of the corresponding query across dynamic backgrounds. Second, to alleviate the over-association with backgrounds, we enhance representations temporally by incorporating text-dynamics interaction, which encourages the model to align text with target moments through complementary dynamic representations. With the proposed method, our model significantly alleviates the spurious correlation issue in moment retrieval and establishes new state-of-the-art performance on two popular benchmarks, i.e., QVHighlights and Charades-STA. …
Poster
Soorena Salari · Arash Harirpoush · Hassan Rivaz · Yiming Xiao

[ Exhibit Hall I ]

Abstract
Anatomical landmark detection in medical images is essential for various clinical and research applications, including disease diagnosis and surgical planning. However, manual landmark annotation is time-consuming and requires significant expertise. Existing deep learning (DL) methods often require large amounts of well-annotated data, which are costly to acquire. In this paper, we introduce CABLD, a novel self-supervised DL framework for 3D brain landmark detection in unlabeled scans with varying contrasts by using only a single reference example. To achieve this, we employed an inter-subject landmark consistency loss with an image registration loss while introducing a 3D convolution-based contrast augmentation strategy to promote model generalization to new contrasts. Additionally, we utilize an adaptive mixed loss function to schedule the contributions of different sub-tasks for optimal outcomes. We demonstrate the proposed method with the intricate task of MRI-based 3D brain landmark detection. With comprehensive experiments on four diverse clinical and public datasets, including both T1w and T2w MRI scans at different MRI field strengths, we demonstrate that CABLD outperforms the state-of-the-art methods in terms of mean radial errors (MREs) and success detection rates (SDRs). Our framework provides a robust and accurate solution for anatomical landmark detection, reducing the need for extensively annotated datasets …
Poster
Tinghan Yang · Md Ashiqur Rahman · Raymond A. Yeh

[ Exhibit Hall I ]

Abstract
Symmetry is one of the most fundamental geometric cues in computer vision, and detecting it has been an ongoing challenge. With the recent advances in vision-language models, i.e., CLIP, we investigate whether a pre-trained CLIP model can aid symmetry detection by leveraging the additional symmetry cues found in the natural image descriptions. We propose CLIPSym, which leverages CLIP's image and language encoders and a rotation-equivariant decoder based on a hybrid of Transformer and $G$-Convolution to detect rotation and reflection symmetries. To fully utilize CLIP's language encoder, we have developed a novel prompting technique called Semantic-Aware Prompt Grouping (SAPG), which aggregates a diverse set of frequent object-based prompts to better integrate the semantic cues for symmetry detection. Empirically, we show that CLIPSym outperforms the current state-of-the-art on three standard symmetry detection datasets (DENDI, SDRW, and LDRS). Finally, we conduct detailed ablations verifying the benefits of CLIP's pre-training, the proposed equivariant decoder, and the SAPG technique.
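The abstract describes aggregating frequent object-based prompts through CLIP's language encoder but does not give the grouping rule. As a rough, hedged illustration, the sketch below encodes a few hypothetical object-based prompts with the Hugging Face `openai/clip-vit-base-patch32` checkpoint and simply averages the normalized embeddings as a stand-in aggregation step; the prompts and the averaging are assumptions, not SAPG itself.

import torch
from transformers import CLIPModel, CLIPTokenizer

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical object-based prompts; the actual prompt set and grouping
# mechanism used by SAPG are not specified in the abstract.
prompts = [
    "a photo of a butterfly with reflection symmetry",
    "a photo of a symmetric building facade",
    "a photo of a flower with rotational symmetry",
]
tokens = tokenizer(prompts, padding=True, return_tensors="pt")
with torch.no_grad():
    text_emb = model.get_text_features(**tokens)            # (3, 512)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)   # unit-normalize
grouped = text_emb.mean(dim=0)                              # naive aggregation
print(grouped.shape)  # torch.Size([512])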
Poster
Haiwen Diao · Xiaotong Li · Yufeng Cui · Yueze Wang · Haoge Deng · Ting Pan · Wenxuan Wang · Huchuan Lu · Xinlong Wang

[ Exhibit Hall I ]

Abstract
Existing encoder-free vision-language models (VLMs) are rapidly narrowing the performance gap with their encoder-based counterparts, highlighting the promising potential for unified multimodal systems with structural simplicity and efficient deployment. We systematically clarify the performance gap between VLMs using pre-trained vision encoders, discrete tokenizers, and minimalist visual layers from scratch, deeply excavating the under-examined characteristics of encoder-free VLMs. We develop efficient strategies for encoder-free VLMs that rival mainstream encoder-based ones. After an in-depth investigation, we launch EVEv2.0, a new and improved family of encoder-free VLMs. We show that: (i) Properly decomposing and hierarchically associating vision and language within a unified model reduces interference between modalities. (ii) A well-designed training strategy enables effective optimization for encoder-free VLMs. Through extensive evaluation, our EVEv2.0 represents a thorough study for developing a decoder-only architecture across modalities, demonstrating superior data efficiency and strong vision-reasoning capability.
Poster
Zhichuan Wang · Yang Zhou · Zhe Liu · Rui Yu · Song Bai · Yulong Wang · Xinwei He · Xiang Bai

[ Exhibit Hall I ]

Abstract
Open-set 3D object retrieval (3DOR) is an emerging task aiming to retrieve 3D objects of unseen categories beyond the training set. Existing methods typically utilize all modalities (i.e., voxels, point clouds, multi-view images) and train specific backbones before fusion. However, they still struggle to produce generalized representations due to insufficient 3D training data. Being contrastively pre-trained on web-scale image-text pairs, CLIP inherently produces generalized representations for a wide range of downstream tasks. Building upon it, we present a simple yet effective framework named Describe, Adapt and Combine (DAC) by taking only multi-view images for open-set 3DOR. DAC innovatively synergizes a CLIP model with a multi-modal large language model (MLLM) to learn generalized 3D representations, where the MLLM is used for dual purposes. First, it describes the seen category information to align with CLIP's training objective for adaptation during training. Second, it provides external hints about unknown objects complementary to visual cues during inference. To improve the synergy, we introduce an Additive-Bias Low-Rank adaptation (AB-LoRA), which alleviates overfitting and further enhances the generalization to unseen categories. With only multi-view images, DAC significantly surpasses prior arts by an average of +10.01\% mAP on four open-set 3DOR datasets. Moreover, its generalization is also …
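The Additive-Bias Low-Rank adaptation (AB-LoRA) named above is not formulated in the abstract; the module below is only a hedged sketch that adds a standard low-rank update plus a learnable additive bias on top of a frozen linear layer. Everything beyond "low-rank plus additive bias" is an assumption.

import torch
import torch.nn as nn

class ABLoRALinear(nn.Module):
    """Sketch only: frozen base linear layer + low-rank update + additive bias.
    The real AB-LoRA formulation may differ."""
    def __init__(self, base: nn.Linear, rank: int = 4):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                  # keep pretrained weights frozen
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)           # start as an identity update
        self.add_bias = nn.Parameter(torch.zeros(base.out_features))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.lora_b(self.lora_a(x)) + self.add_bias

layer = ABLoRALinear(nn.Linear(768, 768), rank=4)
print(layer(torch.randn(2, 768)).shape)  # torch.Size([2, 768])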
Poster
Junhao Dong · Piotr Koniusz · Liaoyuan Feng · Yifei Zhang · Hao Zhu · Weiming Liu · Xinghua Qu · YEW-SOON ONG

[ Exhibit Hall I ]

Abstract
Vision-Language Models (VLMs) enjoy superb zero-shot performance but are vulnerable to adversarial attacks posing security risks. Adversarially robust fine-tuning enhances zero-shot robustness on new datasets while preserving the natural performance of pre-trained VLMs. However, prior methods use sample-wise adversarial fine-tuning, neglecting the underlying second-order statistics that represent entire groups of samples. This leads to a feature-level discrepancy between clean and adversarial samples of their augmented variants. Thus, we propose to represent groups of samples as subspaces to capture distributions and turn the traditional sample-wise adversarial fine-tuning into its distributional counterpart. For each image, we build distributions from (i) a clean sample with its augmentations and (ii) their adversarial counterparts. For text, we build distributions from (iii) a clean prompt and its synonymous prompts and (iv) their adversarial counterparts. We then perform alignment between image and text subspaces, and "adversarial" subspaces are also aligned toward "clean" subspaces. Thus, all samples underlying these distributions (think infinite number) also get aligned, leading to generalizable robustness. Evaluations on 15 datasets are provided.
Poster
Jungeun Kim · Hyeongwoo Jeon · Jongseong Bae · Ha Young Kim

[ Exhibit Hall I ]

Abstract
Sign language translation (SLT) is a challenging task that involves translating sign language images into spoken language. For SLT models to perform this task successfully, they must bridge the modality gap and identify subtle variations in sign language components to understand their meanings accurately. To address these challenges, we propose a novel gloss-free SLT framework called Multimodal Sign Language Translation (MMSLT), which leverages the representational capabilities of off-the-shelf multimodal large language models (MLLMs). Specifically, we use MLLMs to generate detailed textual descriptions of sign language components. Then, through our proposed multimodal-language pre-training module, we integrate these description features with sign video features to align them within the spoken sentence space. Our approach achieves state-of-the-art performance on benchmark datasets PHOENIX14T and CSL-Daily, highlighting the potential of MLLMs to be utilized effectively in SLT.
Poster
Haoji Zhang · Yiqin Wang · Yansong Tang · Yong Liu · Jiashi Feng · Xiaojie Jin

[ Exhibit Hall I ]

Abstract
Benefiting from the advances in large language models and cross-modal alignment, existing multimodal large language models have achieved prominent performance in image and short video understanding. However, the understanding of long videos is still challenging, as their long-context nature results in significant computational and memory overhead. Most existing work treats long videos in the same way as short videos, which is not efficient enough for real-world applications and is difficult to generalize to even longer videos. To address these issues, we propose Flash-VStream, an efficient video language model capable of processing extremely long videos and responding to user queries in real time. Particularly, we design a Flash Memory module, containing a low-capacity context synopsis memory to aggregate long-context temporal information and model the distribution of information density, and a high-capacity detail augmentation memory to retrieve detailed spatial information based on this distribution. Compared to existing models, Flash-VStream achieves significant reductions in inference latency. Extensive experiments on long video benchmarks and comprehensive video benchmarks, i.e., EgoSchema, MLVU, LVBench, MVBench and Video-MME, demonstrate the state-of-the-art performance and outstanding efficiency of our method. All code, models, and datasets will be made publicly available.
Poster
Junqi Ge · Ziyi Chen · Jintao Lin · Jinguo Zhu · Xihui Liu · Jifeng Dai · Xizhou Zhu

[ Exhibit Hall I ]

Abstract
Vision-Language Models (VLMs) have shown promising capabilities in handling various multimodal tasks, yet they struggle in long-context scenarios, particularly tasks involving videos, high-resolution images, or lengthy image-text documents. In our work, we first conduct an empirical analysis of VLMs' long-context capabilities using our augmented long-context multimodal datasets. Our findings reveal that directly applying the positional encoding mechanism used for textual tokens to visual tokens is suboptimal, and VLM performance degrades sharply when the position encoding exceeds the model's context window. To address this, we propose Variable Visual Position Encoding (V2PE), a novel positional encoding approach that employs variable and smaller increments for visual tokens, enabling more efficient management of long multimodal sequences. Our experiments demonstrate the effectiveness of V2PE in enhancing VLMs' ability to effectively understand and reason over long multimodal contexts. We further integrate V2PE with our augmented long-context multimodal datasets to fine-tune the open-source VLMs. The fine-tuned model achieves strong performance on both standard and long-context multimodal tasks. Notably, when the sequence length of the training dataset is increased to 256K tokens, the model is capable of processing multimodal sequences up to 1M tokens, highlighting its potential for real-world long-context applications. We shall release the code, model weights, …
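The core idea described above, assigning smaller positional increments to visual tokens, can be illustrated with a short sketch. The increment value (0.25) and the boolean token-type convention below are assumptions for illustration, not the paper's actual settings.

import torch

def v2pe_positions(is_visual: torch.Tensor, visual_step: float = 0.25) -> torch.Tensor:
    """Assign position indices: text tokens advance by 1, visual tokens by a
    smaller fractional step, so long visual sequences consume less of the
    model's context window. `is_visual` is a boolean mask over the sequence."""
    steps = torch.where(is_visual,
                        torch.full_like(is_visual, visual_step, dtype=torch.float),
                        torch.ones(is_visual.shape, dtype=torch.float))
    pos = torch.cumsum(steps, dim=-1) - steps  # first token sits at position 0
    return pos

mask = torch.tensor([False, False, True, True, True, True, False])  # text, text, 4 visual, text
print(v2pe_positions(mask))  # tensor([0.00, 1.00, 2.00, 2.25, 2.50, 2.75, 3.00])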
Poster
Runhao Zeng · Jiaqi Mao · Minghao Lai · Vu Phan · Yanjie Dong · Wei Wang · Qi Chen · Xiping Hu

[ Exhibit Hall I ]

Abstract
The video grounding (VG) task focuses on locating specific moments in a video based on a query, usually in text form. However, traditional VG struggles with some scenarios like streaming video or queries using visual cues. To fill this gap, we present a new task named Online Video Grounding with Hybrid-modal Queries (OVG-HQ), which enables online segment localization using text, images, video segments, and their combinations. This task poses two new challenges: limited context in online settings and modality imbalance during training, where dominant modalities overshadow weaker ones. To address these, we propose OVG-HQ-Unify, a unified framework featuring a Parametric Memory Block (PMB) that uses neural network parameters to dynamically retain past context and a cross-modal distillation strategy that guides the learning of non-dominant modalities. This design enables a single model to effectively handle hybrid-modal queries. Due to the lack of suitable datasets, we construct QVHighlights-Unify, an expanded dataset with multi-modal queries. Besides, since offline metrics overlook prediction timeliness, we adapt them to the online setting, introducing oR@$n$, IoU=$m$, and online mean Average Precision (omAP) to evaluate both accuracy and efficiency. Experiments show that our OVG-HQ-Unify outperforms existing models, offering a robust solution for online, hybrid-modal video grounding. We will release …
Poster
Shiwei Zhang · Qi Zhou · Wei Ke

[ Exhibit Hall I ]

Abstract
Text-guided zero-shot object counting leverages vision-language models (VLMs) to count objects of an arbitrary class given by a text prompt. Existing approaches for this challenging task only utilize local patch-level features to fuse with the text feature, ignoring the important influence of the global image-level feature. In this paper, we propose a universal strategy that can exploit both local patch-level features and the global image-level feature simultaneously. Specifically, to improve the localization ability of VLMs, we propose Text-guided Local Ranking. Based on the prior knowledge that foreground patches have higher similarity with the text prompt, a new local-text rank loss is designed to increase the differences between the similarity scores of foreground and background patches, pushing the two apart. To enhance the counting ability of VLMs, Number-evoked Global Attention is introduced to first align the global image-level feature with multiple number-conditioned text prompts. Then, the one with the highest similarity is selected to compute cross-attention with the global image-level feature. Through extensive experiments on widely used datasets and methods, the proposed approach demonstrates clear improvements in performance, generalization, and scalability. Furthermore, to better evaluate text-guided zero-shot object counting methods, we propose a dataset named ZSC-8K, which is larger and …
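The local-text rank loss is only described qualitatively above; one margin-based reading of "push foreground patch-text similarities above background ones" might look like the sketch below. The margin value and the foreground/background split are assumptions, not the paper's exact loss.

import torch
import torch.nn.functional as F

def local_text_rank_loss(patch_emb, text_emb, fg_mask, margin=0.2):
    """Hypothetical ranking loss: similarity of foreground patches to the text
    prompt should exceed that of background patches by at least `margin`.
    patch_emb: (N, D) patch features, text_emb: (D,), fg_mask: (N,) bool."""
    sim = F.cosine_similarity(patch_emb, text_emb.unsqueeze(0), dim=-1)  # (N,)
    fg, bg = sim[fg_mask], sim[~fg_mask]
    if fg.numel() == 0 or bg.numel() == 0:
        return sim.new_zeros(())
    # every (foreground, background) pair should satisfy fg > bg + margin
    gap = bg.unsqueeze(0) - fg.unsqueeze(1) + margin                     # (|fg|, |bg|)
    return F.relu(gap).mean()

loss = local_text_rank_loss(torch.randn(16, 512), torch.randn(512),
                            torch.rand(16) > 0.5)
print(loss.item())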
Poster
Yichi Zhang · Le Xue · Wenbo zhang · Lanlan Li · Yuchen Liu · Chen Jiang · Yuan Cheng · Yuan Qi

[ Exhibit Hall I ]

Abstract
Positron Emission Tomography (PET) is a powerful molecular imaging tool that plays a crucial role in modern medical diagnostics by visualizing radio-tracer distribution to reveal physiological processes. Accurate organ segmentation from PET images is essential for comprehensive multi-systemic analysis of interactions between different organs and pathologies. Existing segmentation methods are limited by insufficient annotation data and varying levels of annotation, resulting in weak generalization ability and difficulty in clinical application. Recent developments in segmentation foundation models have shown superior versatility across diverse segmentation tasks. Despite the efforts of medical adaptations, these works primarily focus on structural medical images with detailed physiological structural information and exhibit limited generalization performance on molecular PET imaging. In this paper, we collect and construct PETS-5k, the largest PET segmentation dataset to date, comprising 5,731 three-dimensional whole-body PET images and encompassing over 1.3M 2D images. Based on the established dataset, we develop SegAnyPET, a modality-specific 3D foundation model for universal promptable segmentation from PET images. To address the challenge of discrepant annotation quality, we adopt a cross prompting confident learning (CPCL) strategy with an uncertainty-guided self-rectification process to robustly learn segmentation from high-quality labeled data and low-quality noisy labeled data for promptable segmentation. Experimental results demonstrate …
Poster
Jeongmin Yu · Susang Kim · Kisu Lee · Taekyoung Kwon · Won-Yong Shin · Ha Young Kim

[ Exhibit Hall I ]

Abstract
Recent face anti-spoofing (FAS) methods have shown remarkable cross-domain performance by employing vision-language models like CLIP. However, existing CLIP-based FAS models do not fully exploit CLIP’s patch embedding tokens, failing to detect critical spoofing clues. Moreover, these models rely on a single text prompt per class (e.g., 'live' or 'fake'), which limits generalization. To address these issues, we propose MVP-FAS, a novel framework incorporating two key modules: Multi-View Slot attention (MVS) and Multi-Text Patch Alignment (MTPA). Both modules utilize multiple paraphrased texts to generate generalized features and reduce dependence on domain-specific text. MVS extracts local detailed spatial features and global context from patch embeddings by leveraging diverse texts with multiple perspectives. MTPA aligns patches with multiple text representations to improve semantic robustness. Extensive experiments demonstrate that MVP-FAS achieves superior generalization performance, outperforming previous state-of-the-art methods on cross-domain datasets.
Poster
Wenxuan Zhu · Bing Li · Cheng Zheng · Jinjie Mai · Jun Chen · Letian Jiang · Abdullah Hamdi · Sara Rojas Martinez · Chia-Wen Lin · Mohamed Elhoseiny · Bernard Ghanem

[ Exhibit Hall I ]

Abstract
Multimodal Large Language Models (MLLMs) have demonstrated impressive 2D image/video understanding capabilities. However, there are no publicly standardized benchmarks to assess the abilities of MLLMs in understanding 4D objects. In this paper, we introduce 4D-Bench, the first benchmark to evaluate the capabilities of MLLMs in 4D object understanding, featuring tasks in 4D object Question Answering (4D object QA) and 4D object captioning. 4D-Bench provides 4D objects with diverse categories, high-quality annotations, and tasks necessitating multi-view spatial-temporal understanding, different from existing 2D image/video-based benchmarks. With 4D-Bench, we evaluate a wide range of open-source and closed-source MLLMs. The results from the 4D object captioning experiment indicate that MLLMs generally exhibit weaker temporal understanding compared to their appearance understanding; notably, while open-source models approach closed-source performance in appearance understanding, they show larger performance gaps in temporal understanding. 4D object QA yields surprising findings: even with simple single-object videos, MLLMs perform poorly, with state-of-the-art GPT-4o achieving only 63\% accuracy compared to the human baseline of 91\%. These findings highlight a substantial gap in 4D object understanding and the need for further advancements in MLLMs.
Poster
Xiao Liang · Di Wang · Zhicheng Jiao · Ronghan Li · Pengfei Yang · Quan Wang · Tat-Seng Chua

[ Exhibit Hall I ]

Abstract
The rapid advancements in Vision Language Models (VLMs) have prompted the development of multi-modal medical assistant systems. Despite this progress, current models still have inherent probabilistic uncertainties, often producing erroneous or unverified responses—an issue with serious implications in medical applications. Existing methods aim to enhance the performance of Medical Vision Language Model (MedVLM) by adjusting model structure, fine-tuning with high-quality data, or through preference fine-tuning. However, these training-dependent strategies are costly and still lack sufficient alignment with clinical expertise. To address these issues, we propose an expert-in-the-loop framework named Expert-Controlled Classifier-Free Guidance (Expert-CFG) to align MedVLM with clinical expertise without additional training. This framework introduces an uncertainty estimation strategy to identify unreliable outputs. It then retrieves relevant references to assist experts in highlighting key terms and applies classifier-free guidance to refine the token embeddings of MedVLM, ensuring that the adjusted outputs are correct and align with expert highlights. Evaluations across three medical visual question answering benchmarks demonstrate that the proposed Expert-CFG, with 4.2B parameters and limited expert annotations, outperforms state-of-the-art models with 13B parameters. The results demonstrate the feasibility of deploying such a system in resource-limited settings for clinical use. The anonymous link to our project can be found in …
Poster
Jie Liu · Jiayi Shen · Pan Zhou · Jan-Jakob Sonke · Stratis Gavves

[ Exhibit Hall I ]

Abstract
Generalized Few-Shot Semantic Segmentation (GFSS) aims to extend a segmentation model to novel classes with only a few annotated examples while maintaining performance on base classes. Recently, pretrained vision-language models (VLMs) such as CLIP have been leveraged in GFSS to improve generalization on novel classes through multi-modal prototype learning. However, existing prototype-based methods are inherently deterministic, limiting the adaptability of learned prototypes to diverse samples, particularly for novel classes with scarce annotations. To address this, our work proposes the Probabilistic Prototype Calibration Network (PPCN) - a probabilistic modeling framework over multi-modal prototypes from the pretrained CLIP, thus providing more adaptive prototype learning for GFSS. Specifically, PPCN first introduces a prototype calibration mechanism, which refines frozen textual prototypes with learnable visual calibration prototypes, leading to a more discriminative and adaptive representation. Furthermore, unlike deterministic prototype learning techniques, PPCN introduces distribution regularization over these calibration prototypes. This probabilistic formulation ensures structured and uncertainty-aware prototype learning, effectively mitigating overfitting to limited novel class data while enhancing generalization. Extensive experimental results on PASCAL-5$^i$ and COCO-20$^i$ datasets demonstrate that our proposed PPCN significantly outperforms state-of-the-art approaches across both GFSS and class-incremental settings. The source code will be released publicly.
Poster
Xu Zheng · Yuanhuiyi Lyu · Lutao Jiang · Danda Pani Paudel · Luc Gool · Xuming Hu

[ Exhibit Hall I ]

Abstract
Fusing and balancing multi-modal inputs from novel sensors for dense prediction tasks, particularly semantic segmentation, is critically important yet remains a significant challenge. One major limitation is the tendency of multi-modal frameworks to over-rely on easily learnable modalities, a phenomenon referred to as unimodal dominance or bias. This issue becomes especially problematic in real-world scenarios where the dominant modality may be unavailable, resulting in severe performance degradation. To this end, we apply a simple but effective plug-and-play regularization term based on functional entropy, which introduces no additional parameters or modules. This term is designed to intuitively balance the contribution of each visual modality to the segmentation results. Specifically, we leverage the log-Sobolev inequality to bound functional entropy using functional-Fisher-information. By maximizing the information contributed by each visual modality, our approach mitigates unimodal dominance and establishes a more balanced and robust segmentation framework. A multi-scale regularization module is proposed to apply our plug-and-play term to high-level features and also to segmentation predictions for more balanced multi-modal learning. Extensive experiments on three datasets demonstrate that our proposed method achieves superior performance, i.e., +13.94%, +3.25% and +3.64%, without introducing any additional parameters.
Poster
Akshat Ramachandran · Mingyu Lee · Huan Xu · Souvik Kundu · Tushar Krishna

[ Exhibit Hall I ]

Abstract
We present OuroMamba, the first data-free post-training quantization (DFQ) method for vision Mamba-based models (VMMs). We identify two key challenges in enabling DFQ for VMMs: (1) VMMs' recurrent state transitions restrict the capture of long-range interactions and lead to semantically weak synthetic data; (2) VMM activations exhibit dynamic outlier variations across time-steps, rendering existing static PTQ techniques ineffective. To address these challenges, OuroMamba presents a two-stage framework: (1) OuroMamba-Gen to generate semantically rich and meaningful synthetic data. It applies contrastive learning on patch-level VMM features generated through neighborhood interactions in the latent state space. (2) OuroMamba-Quant to employ mixed-precision quantization with lightweight dynamic outlier detection during inference. Specifically, we present a thresholding-based outlier channel selection strategy for activations that is updated at every time-step. Extensive experiments across vision and generative tasks show that our data-free OuroMamba surpasses existing data-driven PTQ techniques, achieving state-of-the-art performance across diverse quantization settings. Additionally, we implement efficient GPU kernels to achieve a practical latency speedup of up to 2.36x. Code and the synthetic dataset will be released upon acceptance.
Poster
Sabbir Ahmed · Jingtao Li · Weiming Zhuang · Chen Chen · Lingjuan Lyu

[ Exhibit Hall I ]

Abstract
Vision transformers (ViTs) have become widely popular due to their strong performance across various computer vision tasks. However, deploying ViTs on edge devices remains a persistent challenge due to their high computational demands, primarily caused by the overuse of self-attention layers with quadratic complexity together with the resource-intensive softmax operation. To resolve this challenge, linear self-attention approaches have emerged as an efficient alternative. Nonetheless, current linear attention methods experience considerable performance degradation compared to softmax-based quadratic attention. Hence, we propose MixA, a novel mixed attention approach that enhances the efficiency of ViT models while maintaining comparable performance to softmax-based quadratic attention. MixA takes a pretrained ViT model, analyzes the significance of each attention layer, and selectively applies ReLU-based quadratic attention in the critical layers to ensure high model performance. To enhance efficiency, MixA selects the less critical layers and replaces them with our novel ReLU-based linear attention module called \emph{Stable Lightweight Linear Attention} (SteLLA). SteLLA utilizes theoretically motivated normalization terms that improve stability of prior ReLU-based linear attention, resulting in better performance (see Figure 1) while achieving significant speedup compared to softmax based quadratic attention (see Figure 2). Experiments conducted on three benchmark vision tasks show that MixA …
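SteLLA's specific normalization terms are not given in the abstract, so the sketch below shows only the generic ReLU feature-map linear attention that such modules build on, reducing the quadratic softmax attention to linear complexity in sequence length. The epsilon clamp is an assumption for numerical stability.

import torch

def relu_linear_attention(q, k, v, eps: float = 1e-6):
    """Generic ReLU linear attention: O(N·d^2) instead of softmax's O(N^2·d).
    q, k, v: (batch, heads, seq_len, head_dim)."""
    q, k = torch.relu(q), torch.relu(k)
    kv = torch.einsum("bhnd,bhne->bhde", k, v)      # sum_n phi(k_n) v_n^T
    z = k.sum(dim=2)                                # sum_n phi(k_n), shape (b, h, d)
    num = torch.einsum("bhnd,bhde->bhne", q, kv)    # phi(q_n)^T (K^T V)
    den = torch.einsum("bhnd,bhd->bhn", q, z).clamp_min(eps).unsqueeze(-1)
    return num / den

q = k = v = torch.randn(2, 8, 196, 64)
print(relu_linear_attention(q, k, v).shape)  # torch.Size([2, 8, 196, 64])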
Poster
Weiming Ren · Wentao Ma · Huan Yang · Cong Wei · Ge Zhang · Wenhu Chen

[ Exhibit Hall I ]

Abstract
State-of-the-art transformer-based large multimodal models (LMMs) struggle to handle hour-long video inputs due to the quadratic complexity of the causal self-attention operations, leading to high computational costs during training and inference. Existing token compression-based methods reduce the number of video tokens but often incur information loss and remain inefficient for extremely long sequences. In this paper, we explore an orthogonal direction to build a hybrid Mamba-Transformer model (VAMBA) that employs Mamba-2 blocks to encode video tokens with linear complexity. Without any token reduction, VAMBA can encode more than 1024 frames (640$\times$360) on a single GPU, while transformer-based models can only encode 256 frames. On long video input, VAMBA achieves at least 50% reduction in GPU memory usage during training and inference, and nearly doubles the speed per training step compared to transformer-based LMMs. Our experimental results demonstrate that VAMBA improves accuracy by 4.6% on the challenging hour-long video understanding benchmark LVBench over prior efficient video LMMs, and maintains strong performance on a broad spectrum of long and short video understanding tasks. Our code and model will be fully released to facilitate open research.
Poster
Shuhang Chen · Hangjie Yuan · Pengwei Liu · Hanxue Gu · Tao Feng · Dong Ni

[ Exhibit Hall I ]

Abstract
The Segment Anything Model (SAM) has demonstrated significant potential in medical image segmentation, yet its performance is limited when only a small amount of labeled data is available, while there is an abundance of valuable yet often overlooked hierarchical information inherent in medical data. To address this limitation, we draw inspiration from self-supervised learning and propose SAMora, an innovative framework that captures hierarchical medical knowledge by applying complementary self-supervised learning objectives at the image, patch, and pixel levels. To fully exploit the complementarity of hierarchical knowledge within LoRAs, we introduce HL-Attn, a hierarchical fusion module that integrates multi-scale features while maintaining their distinct characteristics. SAMora is compatible with various SAM variants, including SAM2, SAMed and H-SAM. Experimental results on the Synapse, LA, and PROMISE12 datasets demonstrate that SAMora outperforms existing SAM variants, achieving state-of-the-art performance in both few-shot and fully-supervised settings, while reducing fine-tuning epochs by 90\%.
Poster
Tao Gong · Qi Chu · Bin Liu · Zhou Wei · Nenghai Yu

[ Exhibit Hall I ]

Abstract
Zero-shot anomaly detection (ZSAD) requires detection models trained using auxiliary data to detect anomalies without any training sample in a target dataset. It is challenging since the models need to generalize to anomalies across different domains. Recently, CLIP-based anomaly detection methods, such as WinCLIP and AnomalyCLIP, have demonstrated superior performance in the ZSAD task, due to the strong zero-shot recognition of the CLIP model. However, they overlook the utilization of frequency information of images. In this paper, we find that frequency information could benefit the ZSAD task, since some properties of the anomaly area, such as appearance defects, can also be reflected in its frequency information. To this end, we propose Frequency Enhanced CLIP (FE-CLIP), taking advantage of two different but complementary frequency-aware clues, (1) a Frequency-aware Feature Extraction adapter, and (2) a Local Frequency Statistics adapter, in the visual encoder of CLIP, to deeply mine frequency information for the ZSAD task. We apply DCT as the frequency-domain transformation. Through comprehensive experiments, we show that the proposed FE-CLIP has good generalization across different domains and achieves superior zero-shot performance of detecting and segmenting anomalies in 10 datasets of highly diverse class semantics from various defect inspections and medical domains. Besides, the …
Poster
Kanoko Goto · Takumi Hirose · Mahiro Ukai · Shuhei Kurita · Nakamasa Inoue

[ Exhibit Hall I ]

Abstract
Referring expression comprehension (REC) aims to localize the target object described by a natural language expression. Recent advances in vision-language learning have led to significant performance improvements in REC tasks. However, localizing extremely small objects remains a considerable challenge despite its importance in real-world applications such as autonomous driving. To address this issue, we introduce a novel dataset and method for REC targeting small objects. First, we present the small object REC (SOREC) dataset, which consists of 100,000 pairs of referring expressions and corresponding bounding boxes for small objects in driving scenarios. Second, we propose the progressive-iterative zooming adapter (PIZA), an adapter module for parameter-efficient fine-tuning that enables models to progressively zoom in and localize small objects. In a series of experiments, we apply PIZA to GroundingDINO and demonstrate a significant improvement in accuracy on the SOREC dataset. Our dataset, codes and pre-trained models are provided in the supplementary material and will be publicly released.
Poster
Zhen Xing · Qi Dai · Zejia Weng · Zuxuan Wu · Yu-Gang Jiang

[ Exhibit Hall I ]

Abstract
Text-guided video prediction (TVP) involves predicting the motion of future frames from the initial frame according to an instruction, which has wide applications in virtual reality, robotics, and content creation. Previous TVP methods make significant breakthroughs by adapting Stable Diffusion for this task. However, they struggle with frame consistency and temporal stability primarily due to the limited scale of video datasets. We observe that pretrained Image2Video diffusion models possess good video dynamics priors but lack fine-grained textual control. Hence, transferring pretrained models to leverage their video dynamic priors while injecting fine-grained control to generate controllable videos is both a meaningful and challenging task. To achieve this, we introduce the Multi-Modal Large Language Model (MLLM) to predict future video states based on initial frames and text instructions. More specifically, we design a dual query transformer (DQFormer) architecture, which integrates the instructions and frames into the conditional embeddings for future frame prediction. Additionally, we develop Temporal and Spatial Adapters that can quickly transfer general video diffusion models to specific scenarios with minimal training costs. Experimental results show that our method significantly outperforms state-of-the-art techniques on four datasets: Something Something V2, Epic Kitchen-100, Bridge Data, and UCF-101. Notably, AID achieves 91.2\% and 55.5\% FVD improvements on …
Poster
Xinyue Hao · Li · Shreyank Gowda · Robert Fisher · Jonathan Huang · Anurag Arnab · Laura Sevilla-Lara

[ Exhibit Hall I ]

Abstract
Video understanding has made huge strides in recent years, relying largely on the power of transformers. As this architecture is notoriously expensive and video data is highly redundant, research into improving efficiency has become particularly relevant. Some creative solutions include token selection and merging. While most methods succeed in reducing the cost of the model and maintaining accuracy, an interesting pattern arises: most methods do not outperform the baseline of randomly discarding tokens. In this paper we take a closer look at this phenomenon and observe 5 principles of the nature of visual tokens. For example, we observe that the value of tokens follows a clear Pareto-distribution where most tokens have remarkably low value, and just a few carry most of the perceptual information. We build on these and further insights to propose a lightweight video model, LITE, that can select a small number of tokens effectively, outperforming state-of-the-art and existing baselines across datasets (Kinetics-400 and Something-Something-V2) in the challenging trade-off of computation (GFLOPs) vs accuracy. Experiments also show that LITE generalizes across datasets and even other tasks without the need for retraining.
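LITE's exact token-scoring rule is not stated in the abstract; the minimal sketch below illustrates only the general idea of keeping a small number of high-value tokens per frame or clip, here scored by attention from the CLS token (an assumed scoring rule).

import torch

def select_tokens(tokens: torch.Tensor, cls_attn: torch.Tensor, keep: int):
    """Keep the `keep` highest-scoring tokens per sample.
    tokens: (B, N, D) patch tokens; cls_attn: (B, N) importance scores,
    e.g. attention weights from the CLS token (assumption)."""
    idx = cls_attn.topk(keep, dim=1).indices                 # (B, keep)
    idx = idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1))  # (B, keep, D)
    return tokens.gather(1, idx)

tokens = torch.randn(4, 196, 768)
scores = torch.rand(4, 196)
print(select_tokens(tokens, scores, keep=32).shape)  # torch.Size([4, 32, 768])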
Poster
Leon Sick · Dominik Engel · Sebastian Hartwig · Pedro Hermosilla · Timo Ropinski

[ Exhibit Hall I ]

Abstract
Traditionally, algorithms that learn to segment object instances in 2D images have heavily relied on large amounts of human-annotated data. Only recently, novel approaches have emerged tackling this problem in an unsupervised fashion. Generally, these approaches first generate pseudo-masks and then train a class-agnostic detector. While such methods deliver the current state of the art, they often fail to correctly separate instances overlapping in 2D image space since only semantics are considered. To tackle this issue, we instead propose to cut the semantic masks in 3D to obtain the final 2D instances by utilizing a point cloud representation of the scene. Furthermore, we derive a Spatial Importance function, which we use to resharpen the semantics along the 3D borders of instances. Nevertheless, these pseudo-masks are still subject to mask ambiguity. To address this issue, we further propose to augment the training of a class-agnostic detector with three Spatial Confidence components aiming to isolate a clean learning signal. With these contributions, our approach outperforms competing methods across multiple standard benchmarks for unsupervised instance segmentation and object detection.
Poster
Yun Wang · Longguang Wang · Chenghao Zhang · Yongjian Zhang · Zhanjie Zhang · Ao Ma · Chenyou Fan · Tin Lun Lam · Junjie Hu

[ Exhibit Hall I ]

Abstract
Recently, learning-based stereo matching networks have advanced significantly. However, they often lack robustness and struggle to achieve impressive cross-domain performance due to domain shifts and imbalanced disparity distributions among diverse datasets. Leveraging Vision Foundation Models (VFMs) can intuitively enhance the model's robustness, but integrating such models into stereo matching cost-effectively to fully realize their robustness remains a key challenge. To address this, we propose SMoEStereo, a novel framework that adapts VFMs for stereo matching through a tailored, scene-specific fusion of Low-Rank Adaptation (LoRA) and Mixture-of-Experts (MoE) modules. SMoEStereo introduces MoE-LoRA with adaptive ranks and MoE-Adapter with adaptive kernel sizes. The former dynamically selects optimal experts within MoE to adapt to varying scenes across domains, while the latter injects inductive bias into frozen VFMs to improve geometric feature extraction. Importantly, to mitigate computational overhead, we further propose a lightweight decision network that selectively activates MoE modules based on input complexity, balancing efficiency with accuracy. Extensive experiments demonstrate that our method exhibits state-of-the-art cross-domain and joint generalization across multiple benchmarks without dataset-specific adaptation.
Poster
Yuchen Guan · Chong Sun · Canmiao Fu · Zhipeng Huang · Chun Yuan · Chen Li

[ Exhibit Hall I ]

Abstract
Recent advancements in multimodal vision models have highlighted limitations in late-stage feature fusion and suboptimal query selection for hybrid-prompt open-world segmentation, alongside constraints from caption-derived vocabularies. To address these challenges, we propose \modelName, a text-guided visual Prompt DINO framework featuring three key innovations. First, we introduce an early fusion mechanism that unifies text/visual prompts and backbone features at the initial encoding stage, enabling deeper cross-modal interactions to resolve semantic ambiguities. Second, we design order-aligned query selection for DETR-based architectures, explicitly optimizing the structural alignment between text and visual queries during decoding to enhance semantic-spatial consistency. Third, we develop a generative data engine powered by the \textit{\rapLongName (\rapName)} model, which synthesizes 0.5B diverse training instances through a dual-path cross-verification pipeline, reducing label noise by 80.5\% compared to conventional approaches. Extensive experiments demonstrate that \modelName achieves state-of-the-art performance on open-world detection benchmarks while significantly expanding semantic coverage beyond fixed-vocabulary constraints. Our work establishes a new paradigm for scalable multimodal detection and data generation in open-world scenarios. Data and code will be made available.
Poster
Kaisi Guan · Zhengfeng Lai · Yuchong Sun · Peng Zhang · Wei Liu · Xiaojiang Liu · Meng Cao · Ruihua Song

[ Exhibit Hall I ]

Abstract
Precisely evaluating semantic alignment between text prompts and generated videos remains a challenge in Text-to-Video (T2V) Generation. Existing text-to-video alignment metrics like CLIPScore only generate coarse-grained scores without fine-grained alignment details, failing to align with human preference. To address this limitation, we propose ETVA, a novel Evaluation method of Text-to-Video Alignment via fine-grained question generation and answering. First, a multi-agent system parses prompts into semantic scene graphs to generate atomic questions. Then we design a knowledge-augmented multi-stage reasoning framework for question answering, where an auxiliary LLM first retrieves relevant common-sense knowledge (e.g., physical laws), and then a video LLM answers the generated questions through a multi-stage reasoning mechanism. Extensive experiments demonstrate that ETVA achieves a Spearman's correlation coefficient of 58.47, showing a much higher correlation with human judgment than existing metrics, which attain only 31.0. We also construct a comprehensive benchmark specifically designed for text-to-video alignment evaluation, featuring 2k diverse prompts and 12k atomic questions spanning 10 categories. Through a systematic evaluation of 15 existing text-to-video models, we identify their key capabilities and limitations, paving the way for next-generation T2V generation. All codes and datasets will be publicly available soon.
Poster
Bo Liu · Ke Zou · Li-Ming Zhan · ZEXIN LU · Xiaoyu DONG · Chengqiang Xie · Yidi Chen · Jiannong Cao · Xiao-Ming Wu · Huazhu Fu

[ Exhibit Hall I ]

Abstract
Medical Visual Question Answering (Med-VQA) combines computer vision and natural language processing to automatically answer clinical inquiries about medical images. However, current Med-VQA datasets exhibit two significant limitations: (1) they often lack visual and textual explanations for answers, hindering comprehension for patients and junior doctors; (2) they typically offer a narrow range of question formats, inadequately reflecting the diverse requirements in practical scenarios. These limitations pose significant challenges to the development of a reliable and user-friendly Med-VQA system. To address these challenges, we introduce a large-scale, Groundable, and Explainable Medical VQA benchmark for chest X-ray diagnosis (GEMeX), featuring several innovative components: (1) a multi-modal explainability mechanism that offers detailed visual and textual explanations for each question-answer pair, thereby enhancing answer comprehensibility; (2) four question types—open-ended, closed-ended, single-choice, and multiple-choice—to better reflect practical needs. With 151,025 images and 1,605,575 questions, GEMeX is currently the largest chest X-ray VQA dataset. Evaluation of 12 representative large vision language models (LVLMs) on GEMeX reveals suboptimal performance, underscoring the dataset's complexity. Meanwhile, we propose a strong model by fine-tuning an existing LVLM on the GEMeX training set. The substantial performance improvement showcases the dataset's effectiveness. The benchmark is available at \url{https://anonymous.4open.science/r/GEMeX}.
Poster
Xianglin Qiu · Xiaoyang Wang · Zhen Zhang · Jimin XIAO

[ Exhibit Hall I ]

Abstract
Weakly supervised semantic segmentation (WSSS) aims to generate dense labels using sparse annotations, such as image-level labels. The existing class activation map (CAM) generation methods have been able to locate rough objects. However, due to the limited information provided by image level labels, the bias activation problem, including over-activation, becomes another key obstacle in WSSS. To rectify such bias activation, we attempt to mine pixel level class feature distribution information from the entire dataset. Specifically, we propose to use normalizing flow to model the class feature distribution of all pixels across the entire dataset and design a Bias-Resilient WSSS framework based on Normalizing Flow (BRNF). Normalizing flow has the ability to map complex distributions to normal distributions. Building upon it, we designed an additional Gaussian mixture classifier which classifies pixels from the perspective of feature distributions, providing supplementary information to the conventional MLP based classifier. In addition, we use this distribution to sample low bias features as positive anchors for contrastive learning, thereby encouraging feature optimization toward the correct low-bias direction. Experimental results demonstrate that our method significantly outperforms existing baselines, achieving state-of-the-art performance on WSSS benchmarks. Code will be released soon.
Poster
Tim Elsner · Paula Usinger · Julius Nehring-Wirxel · Gregor Kobsik · Victor Czech · Yanjiang He · Isaak Lim · Leif Kobbelt

[ Exhibit Hall I ]

Abstract
In language processing, transformers benefit greatly from characters being condensed into word fragments, building outputs from a larger vocabulary of bigger pieces. This is often done with Byte Pair Encoding. In the context of images, tokenisation of visual data is usually limited to regular grids obtained from quantisation methods, not using such further abstraction of regions. Our work improves tokenisation of visual data by bringing Byte Pair Encoding from 1D to multiple dimensions, as a complementary add-on to existing compression. We achieve this through counting constellations of token pairs and replacing the most frequent token pair with a newly introduced token. Our approach only increases computation time by a factor of 2 for images, making it applicable even to large datasets like ImageNet within minutes on consumer hardware. This is a lossless preprocessing step. We further propose how networks can digest the new tokens that are no longer in a regular grid. Our evaluation shows improved training and inference performance of transformers on visual data achieved by compressing frequent constellations of tokens: The resulting sequences have more uniformly distributed information content, e.g. by condensing empty regions in an image into single tokens. As our experiments show, these condensed sequences are easier to …
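A toy illustration of the counting-and-merging step described above: find the most frequent horizontally or vertically adjacent token pair in a 2D grid and replace its occurrences with a new token, marking the second cell as consumed. This is a simplified, assumption-laden sketch, not the paper's full multi-dimensional BPE pipeline.

from collections import Counter

def bpe2d_step(grid, next_token):
    """One merge step of a toy 2D Byte Pair Encoding.
    grid: list of lists of ints (None marks cells already consumed by a merge)."""
    pairs = Counter()
    h, w = len(grid), len(grid[0])
    for y in range(h):
        for x in range(w):
            if grid[y][x] is None:
                continue
            if x + 1 < w and grid[y][x + 1] is not None:       # horizontal pair
                pairs[("h", grid[y][x], grid[y][x + 1])] += 1
            if y + 1 < h and grid[y + 1][x] is not None:       # vertical pair
                pairs[("v", grid[y][x], grid[y + 1][x])] += 1
    if not pairs:
        return grid, None
    (orient, a, b), _ = pairs.most_common(1)[0]
    dy, dx = (0, 1) if orient == "h" else (1, 0)
    for y in range(h):                                          # greedy replacement
        for x in range(w):
            if (y + dy < h and x + dx < w and grid[y][x] == a
                    and grid[y + dy][x + dx] == b):
                grid[y][x], grid[y + dy][x + dx] = next_token, None
    return grid, (orient, a, b)

grid = [[1, 2, 1, 2],
        [3, 3, 1, 2]]
print(bpe2d_step(grid, next_token=4))  # merges the frequent horizontal (1, 2) pair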
Poster
Jaeseok Byun · Young Kyun Jang · Seokhyeon Jeong · Donghyun Kim · Taesup Moon

[ Exhibit Hall I ]

Abstract
Composed Image Retrieval (CIR) seeks to retrieve a target image by using a reference image and conditioning text specifying desired modifications. While recent approaches have shown steady performance improvements on existing CIR benchmarks, we argue that it remains unclear whether these gains genuinely reflect an enhanced compositional understanding of both visual and textual information. For example, current benchmarks do not explicitly consider negation cases and offer limited semantic diversity, with insufficient hard negatives to thoroughly evaluate the CIR task. To bridge this gap, we introduce Multimodal Arithmetic Benchmark for CIR (MA-CIR), a challenging CIR benchmark that integrates arithmetic types (negation, replacement, and addition) across seven complex semantic categories (e.g., spatial reasoning, object reasoning, etc). Moreover, carefully constructed hard negatives are incorporated to assess models in a controlled setting. In MA-CIR, we observe that current CIR models struggle with negation (or replacement) arithmetic types and semantic types that require complex reasoning, indicating a potential reliance on object or entity information. To address this challenge, we propose leveraging strong text encoders, particularly those based on large language models (LLMs), in conjunction with carefully constructed text triplets that incorporate hard negatives to enhance compositional understanding. As a result, MA-CIR achieves a 14\% gain while also improving R@1 on …
Poster
Xiwen Chen · Peijie Qiu · Wenhui Zhu · Hao Wang · Huayu Li · XUANZHAO DONG · Xiaotong Sun · Xiaobing Yu · Yalin Wang · Abolfazl Razi · Aristedis Sotiras

[ Exhibit Hall I ]

Abstract
While multiple instance learning (MIL) has shown to be a promising approach for histopathological whole slide image (WSI) analysis, its reliance on permutation invariance significantly limits its capacity to effectively uncover semantic correlations between instances within WSIs. Based on our empirical and theoretical investigations, we argue that approaches that are not permutation-invariant but better capture spatial correlations between instances can offer more effective solutions. In light of these findings, we propose a novel alternative to existing MIL for WSI analysis by learning to restore the order of instances from their randomly shuffled arrangement. We term this task as cracking an instance jigsaw puzzle problem, where semantic correlations between instances are uncovered. To tackle the instance jigsaw puzzles, we propose a novel Siamese network solution, which is theoretically justified by optimal transport theory. We validate the proposed method on WSI classification and survival prediction tasks, where the proposed method outperforms the recent state-of-the-art MIL competitors.
Poster
Guilian Chen · Huisi Wu · Jing Qin

[ Exhibit Hall I ]

Abstract
Automatic segmentation of polyps from colonoscopy videos is of great clinical significance as it can assist clinicians in making more accurate diagnoses and precise interventions. However, video polyp segmentation (VPS) poses significant challenges due to ambiguous boundaries between polyps and surrounding mucosal tissue, as well as variations in polyp scale, contrast, and position across consecutive frames. Moreover, to meet clinical requirements, the inference process must operate in real-time to enable intraoperative tracking and guidance. In this paper, we propose a novel and efficient segmentation network, STDDNet, which integrates a spatial-aligned temporal modeling strategy and a discriminative dynamic representation learning mechanism, to comprehensively address these challenges by harnessing the advantages of mamba. Specifically, a spatial-aligned temporal dependency propagation (STDP) module is developed to model temporal consistency from the consecutive frames based on a bidirectional scanning mamba block. Furthermore, we design a discriminative dynamic feature extraction (DDFE) module to explore frame-wise dynamic information from the structural feature generated by the mamba block. Such dynamic features can effectively deal with the variations across colonoscopy frames, providing more details for refined segmentation. We extensively evaluate STDDNet on two benchmark datasets, SUN-SEG and CVC-ClinicDB, demonstrating superior segmentation performance of our method over state-of-the-art methods while …
Poster
Seonghoon Yu · Junbeom Hong · Joonseok Lee · Jeany Son

[ Exhibit Hall I ]

Abstract
Visual grounding tasks, such as referring image segmentation (RIS) and referring expression comprehension (REC), aim to localize a target object based on a given textual description. The target object in an image can be described in multiple ways, reflecting diverse attributes such as color, position, and more. However, most existing methods rely on a single textual input, which captures only a fraction of the rich information available in the visual domain. This mismatch between rich visual details and sparse textual cues can lead to the misidentification of similar objects. To address this, we propose a novel visual grounding framework that $\textbf{leverages multiple latent expressions}$ generated from a single textual input by incorporating complementary visual details absent from the original description. Specifically, we introduce subject distributor and visual concept injector modules to embed both $\textbf{shared-subject and distinct-attributes}$ concepts into the latent representations, thereby capturing unique and target-specific visual cues. We also propose a positive-margin contrastive learning strategy to align all latent expressions with the original text while preserving subtle variations. Experimental results show that our method not only outperforms state-of-the-art RIS and REC approaches on multiple benchmarks but also achieves outstanding performance on the generalized referring expression segmentation (GRES) benchmark.
Poster
Weili Zeng · Ziyuan Huang · Kaixiang Ji · Yichao Yan

[ Exhibit Hall I ]

Abstract
Transformer-based models have driven significant advancements in Multimodal Large Language Models (MLLMs), yet their computational costs surge drastically when scaling resolution, training data, and model parameters. A key bottleneck stems from the proliferation of visual tokens required for fine-grained image understanding. We propose Skip-Vision, a unified framework addressing both training and inference inefficiencies in vision-language models. On top of conventional token compression approaches, our method introduces two complementary acceleration strategies. For training acceleration, we observe that Feed-Forward Network (FFN) computations on visual tokens induce marginal feature updates. This motivates our Skip-FFN strategy, which bypasses FFN layers for redundant visual tokens. For inference acceleration, we design a selective KV-cache removal mechanism that prunes the skipped key-value pairs during decoding while preserving model performance. Experimental results demonstrate that Skip-Vision reduces training time by up to 35\%, inference FLOPs by 75\%, and latency by 45\%, while achieving comparable or superior performance to existing methods. Our work provides a practical solution for scaling high-performance MLLMs with enhanced efficiency.
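The Skip-FFN idea, running the feed-forward network only on tokens that are not flagged as redundant visual tokens, can be sketched as below. How redundancy is determined is not specified in the abstract, so it is left as an externally supplied mask here, and the block layout is an assumption.

import torch
import torch.nn as nn

class SkipFFNBlock(nn.Module):
    """Sketch: apply the FFN only to tokens not marked as redundant visual
    tokens; skipped tokens pass through the residual path unchanged."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, x: torch.Tensor, skip_mask: torch.Tensor) -> torch.Tensor:
        # x: (B, N, D); skip_mask: (B, N), True for tokens whose FFN is bypassed
        out = x.clone()
        keep = ~skip_mask
        kept = x[keep]                       # (num_kept, D)
        out[keep] = kept + self.ffn(self.norm(kept))
        return out

x = torch.randn(2, 10, 64)
mask = torch.zeros(2, 10, dtype=torch.bool)
mask[:, 4:] = True                           # pretend the last 6 tokens are redundant visual tokens
print(SkipFFNBlock(64, 256)(x, mask).shape)  # torch.Size([2, 10, 64])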
Poster
Ilan Naiman · Emanuel Baruch Baruch · Oron Anschel · Alon Shoshan · Igor Kviatkovsky · Manoj Aggarwal · Gerard Medioni

[ Exhibit Hall I ]

Abstract
In this work, we introduce long-video masked-embedding autoencoders (LV-MAE), a self-supervised learning framework for long video representation. Our approach treats short- and long-span dependencies as two separate tasks. Such decoupling allows for a more intuitive video processing where short-span spatiotemporal primitives are first encoded and are then used to capture long-range dependencies across consecutive video segments. To achieve this, we leverage advanced off-the-shelf multimodal encoders to extract representations from short segments within the long video, followed by pre-training a masked-embedding autoencoder capturing high-level interactions across segments. LV-MAE is highly efficient to train and enables the processing of much longer videos by alleviating the constraint on the number of input frames. Furthermore, unlike existing methods that typically pre-train on short-video datasets, our approach offers self-supervised pre-training using long video samples (e.g., 20+ minutes video clips) at scale. Using LV-MAE representations, we achieve state-of-the-art results on three long-video benchmarks -- LVU, COIN, and Breakfast -- employing only a simple classification head for either attentive or linear probing. Finally, to assess LV-MAE pre-training and visualize its reconstruction quality, we leverage the video-language aligned space of short video representations to monitor LV-MAE through video-text retrieval. Our code will be made available upon publication.
Poster
Feixiang Wang · Shuang Yang · Shiguang Shan · Xilin Chen

[ Exhibit Hall I ]

Abstract
Audio-Visual Speech Enhancement (AVSE) leverages both audio and visual information to improve speech quality. Despite noisy real-world conditions, humans are generally able to perceive and interpret corrupted speech segments as clear. Research in cognitive science has shown how the brain merges auditory and visual inputs to achieve this. These studies uncover four key insights for AVSE, reflecting a hierarchical synergy of semantic and signal processes with visual cues enriching both levels: (1) Humans utilize high-level semantic context to reconstruct corrupted speech signals. (2) Visual cues strongly correlate with semantic information, enabling them to facilitate semantic context modeling. (3) Visual appearance and vocal information jointly benefit identification, implying that visual cues strengthen low-level signal context modeling. (4) High-level semantic knowledge and low-level auditory processing operate concurrently, allowing the semantics to guide signal-level context modeling. Motivated by these insights, we propose CogCM, a cognition-inspired hierarchical contextual modeling framework. The CogCM framework includes three core modules: (1) A semantic context modeling module (SeCM) to capture high-level semantic context from both audio and visual modalities; (2) A signal context modeling module (SiCM) to model fine-grained temporal-spectral structures under multi-modal semantic context guidance; (3) A semantic-to-signal guidance module (SSGM) to leverage semantic context in guiding signal context modeling …
Poster
Lei Fan · Junjie Huang · Donglin Di · Anyang Su · Tianyou Song · Maurice Pagnucco · Yang Song

[ Exhibit Hall I ]

Abstract
For anomaly detection (AD), early approaches often train separate models for individual classes, yielding high performance but posing challenges in scalability and resource management. Recent efforts have shifted toward training a single model capable of handling multiple classes. However, directly extending early AD methods to multi-class settings often results in degraded performance. In this paper, we investigate this performance degradation observed in reconstruction-based methods, identifying the key issue: inter-class confusion. This confusion emerges when a model trained in multi-class scenarios incorrectly reconstructs samples from one class as another, thereby exacerbating reconstruction errors. To this end, we propose a simple yet effective modification, called class-aware contrastive learning (CCL). By explicitly leveraging raw object category information (e.g., carpet or wood) as supervised signals, we introduce local CL to refine multiscale dense features, and global CL to obtain more compact feature representations of normal patterns, thereby effectively adapting the models to multi-class settings. Experiments across four datasets (over 60 categories) validate the effectiveness of our approach, demonstrating significant improvements and superior performance compared to state-of-the-art methods. Notably, ablation studies indicate that pseudo-class labels can achieve comparable performance.
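The class-aware contrastive idea above can be illustrated with a generic supervised-contrastive loss in which raw category labels (e.g., carpet vs. wood) define the positives; this is a sketch of the general technique, not the paper's exact local/global CCL formulation.

```python
# Generic supervised-contrastive sketch: same-category features are pulled together,
# other categories are pushed apart. Names and the temperature value are illustrative.
import torch
import torch.nn.functional as F

def class_aware_contrastive(feats: torch.Tensor, labels: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    # feats: (N, D) pooled features; labels: (N,) raw object-category ids.
    z = F.normalize(feats, dim=1)
    n = z.size(0)
    sim = (z @ z.t()) / tau
    eye = torch.eye(n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(eye, -1e9)                       # exclude self-similarity
    pos = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~eye
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    loss = -(log_prob * pos.float()).sum(1) / pos.sum(1).clamp(min=1)
    return loss.mean()

loss = class_aware_contrastive(torch.randn(8, 128), torch.tensor([0, 0, 1, 1, 2, 2, 3, 3]))
```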
Poster
Qi Fan · Kaiqi Liu · Nian Liu · Hisham Cholakkal · Rao Anwer · Wenbin Li · Yang Gao

[ Exhibit Hall I ]

Abstract
Cross-domain few-shot segmentation (CD-FSS) aims to segment objects of novel classes in new domains, which is often challenging due to the diverse characteristics of target domains and the limited availability of support data. Most CD-FSS methods redesign and retrain in-domain FSS models using various domain-generalization techniques, which are effective but costly to train. To address these issues, we propose adapting informative model structures of the well-trained FSS model for target domains by learning domain characteristics from few-shot labeled support samples during inference, thereby eliminating the need for retraining. Specifically, we first adaptively identify domain-specific model structures by measuring parameter importance using a novel structure Fisher score in a data-dependent manner. Then, we progressively train the selected informative model structures with hierarchically constructed training samples, progressing from fewer to more support shots. The resulting Informative Structure Adaptation (ISA) method effectively addresses domain shifts and equips existing well-trained in-domain FSS models with flexible adaptation capabilities for new domains, eliminating the need to redesign or retrain CD-FSS models on base data. Extensive experiments validate the effectiveness of our method, demonstrating superior performance across multiple CD-FSS benchmarks.
Poster
Yuan Liu · Saihui Hou · Saijie Hou · Jiabao Du · Shibei Meng · Yongzhen Huang

[ Exhibit Hall I ]

Abstract
Image Difference Captioning (IDC) aims to generate natural language descriptions of subtle differences between image pairs, requiring both precise visual change localization and coherent semantic expression. Despite recent advancements, existing datasets often lack breadth and depth, limiting their applicability in complex and dynamic environments: (1) from a breadth perspective, current datasets are constrained to limited variations of objects in specific scenes, and (2) from a depth perspective, prior benchmarks often provide overly simplistic descriptions. To address these challenges, we introduce $\textbf{OmniDiff}$, a comprehensive dataset comprising 324 diverse scenarios—spanning real-world complex environments and 3D synthetic settings—with fine-grained human annotations averaging 60 words in length and covering 12 distinct change types. Building on this foundation, we propose $\textbf{M$^3$Diff}$, a $\textbf{M}$ulti$\textbf{M}$odal large language model enhanced by a plug-and-play $\textbf{M}$ulti-scale $\textbf{Diff}$erential Perception (MDP) module. This module improves the model's ability to accurately identify and describe inter-image differences while maintaining the foundational model's generalization capabilities. With the addition of the OmniDiff dataset, M$^3$Diff achieves state-of-the-art performance across multiple benchmarks, including Spot-the-Diff, IEdit, CLEVR-Change, CLEVR-DC, and OmniDiff, demonstrating significant improvements in cross-scenario difference recognition accuracy compared to existing methods. The dataset, code, and models will be made publicly available to support further research.
Poster
Tao Lei · Ziyao Yang · Xingwu wang · Yi Wang · Xuan Wang · FeimanSun FeimanSun · Asoke Nandi

[ Exhibit Hall I ]

Abstract
Existing semi-supervised learning methods typically mitigate the impact of unreliable predictions by suppressing low-confidence regions. However, these methods fail to explore which regions hold higher learning value and how to design adaptive learning strategies for these regions, thereby limiting the model's performance in critical areas. To address this issue, we propose a novel adaptive learning of high-value regions (ALHVR) framework. By exploiting the diversity of predictions from multi-branch networks, the prediction regions are classified into three types: reliable stable regions, reliable unstable regions, and unreliable stable regions. For high-value regions (reliable unstable regions and unreliable stable regions), different training strategies are designed. Specifically, for reliable unstable regions, we propose a confidence-guided cross-prototype consistency learning (CG-CPCL) module, which enforces prototype consistency constraints in the feature space. By leveraging confidence information, the high-confidence predictions from one network selectively supervise the low-confidence predictions of the other, thus helping the model learn inter-class discrimination more stably. Additionally, for unreliable stable regions, we design a dynamic teacher competition teaching (DTCT) module, which dynamically selects the most reliable pixels as teachers by evaluating the unperturbed predictions from both networks in real time. These selected pixels are then used to supervise perturbed predictions, thereby enhancing the model's learning …
Poster
Lingyu Chen · Yawen Zeng · Yue Wang · Peng Wan · Guo-chen Ning · Hongen Liao · Daoqiang Zhang · Fang Chen

[ Exhibit Hall I ]

Abstract
Conventional single-dataset training often fails with new data distributions, especially in ultrasound (US) image analysis due to limited data, acoustic shadows, and speckle noise. Therefore, constructing a universal framework for multi-heterogeneous US datasets is imperative. However, a key challenge arises: how to effectively mitigate inter-dataset interference while preserving dataset-specific discriminative features for robust downstream tasks? Previous approaches utilize either a single source-specific decoder or a domain adaptation strategy, but these methods experience a decline in performance when applied to other domains. Considering this, we propose a Universal Collaborative Mixture of Heterogeneous Source-Specific Experts (COME). Specifically, COME establishes dual structure-semantic shared experts that create a universal representation space and then collaborate with source-specific experts to extract discriminative features by providing complementary features. This design enables robust generalization by leveraging cross-dataset experience distributions and providing universal US priors for small-batch or unseen data scenarios. Extensive experiments under three evaluation modes (single-dataset, intra-organ, and inter-organ integration datasets) demonstrate COME's superiority, achieving significant mean AP improvements over state-of-the-art methods.
Poster
George Ciubotariu · Zhuyun Zhou · Zongwei Wu · Radu Timofte

[ Exhibit Hall I ]

Abstract
We introduce MIORe and VAR-MIORe, novel multi-task datasets that address critical limitations in current benchmarks for motion restoration tasks. Our datasets capture a broad spectrum of motion scenarios—including complex ego-camera movements, dynamic multi-subject interactions, and depth-dependent blur effects—using high-frame-rate (1000 FPS) acquisition and professional-grade optics. By averaging variable numbers of frames based on computed optical flow metrics, MIORe generates consistent motion blur while preserving sharp inputs for video frame interpolation and optical flow estimation. VAR-MIORe further extends this framework by spanning a variable range of motion magnitudes, from minimal to extreme, establishing the first benchmark of its kind. Together, these datasets provide high-resolution, scalable ground truth that challenges existing algorithms under both controlled and adverse conditions, paving the way for next-generation research in non-uniform deblurring, video interpolation, and optical flow analysis.
Poster
Jiaxu Zhang · Xianfang Zeng · Xin Chen · Wei Zuo · Gang YU · Zhigang Tu

[ Exhibit Hall I ]

Abstract
We propose MikuDance, a diffusion-based pipeline incorporating mixed motion dynamics to animate stylized character art. MikuDance consists of two key techniques: Mixed Motion Modeling and Mixed-Control Diffusion, to address the challenges of high-dynamic motion and reference-guidance misalignment in character art animation. Specifically, a Scene Motion Tracking strategy is presented to explicitly model the dynamic camera in pixel-wise space, enabling unified character-scene motion modeling. Building on this, the Mixed-Control Diffusion implicitly aligns the scale and body shape of diverse characters with motion guidance, allowing flexible control of local character motion. Subsequently, a Motion-Adaptive Normalization module is incorporated to effectively inject global scene motion, paving the way for comprehensive character art animation. Through extensive experiments, we demonstrate the effectiveness and generalizability of MikuDance across various character art and motion guidance, consistently producing high-quality animations with remarkable motion dynamics.
Poster
Yasser Benigmim · Mohammad Fahes · Tuan-Hung Vu · Andrei Bursuc · Raoul de Charette

[ Exhibit Hall I ]

Abstract
Recent Open-Vocabulary Semantic Segmentation (OVSS) models extend the CLIP model to segmentation while maintaining the use of multiple templates (e.g., a photo of <class>, a sketch of a <class>, etc.) for constructing class-wise averaged text embeddings, acting as a classifier. In this paper, we challenge this status quo and investigate the impact of templates for OVSS. Empirically, we observe that for each class, there exist single-template classifiers significantly outperforming the conventional averaged classifier. We refer to them as class-experts. Given access to unlabeled images and without any training involved, we estimate these experts by leveraging the class-wise prediction entropy of single-template classifiers, selecting as class-wise experts those which yield the lowest entropy. All experts, each specializing in a specific class, collaborate in a newly proposed fusion method to generate more accurate OVSS predictions. Our plug-and-play method, coined FLOSS, is orthogonal and complementary to existing OVSS methods, offering a “free lunch” to systematically improve OVSS without labels and additional training. Extensive experiments demonstrate that FLOSS consistently boosts state-of-the-art methods on various OVSS benchmarks. Moreover, the selected expert templates can generalize well from one dataset to others sharing the same semantic categories, yet exhibiting distribution shifts. Additionally, we obtain satisfactory improvements under …
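One plausible reading of the label-free expert selection described above is sketched below: each prompt template is scored per class by the prediction entropy of its single-template classifier on unlabeled images, and the lowest-entropy template becomes that class's expert. Restricting the average to images the template itself predicts as that class is an assumption on our part, as are all names and shapes.

```python
# Hedged sketch of entropy-based class-expert selection over unlabeled images.
import torch

def select_class_experts(probs: torch.Tensor) -> torch.Tensor:
    # probs: (T, N, C) softmax outputs of T single-template classifiers on N unlabeled images.
    T, N, C = probs.shape
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum(-1)   # (T, N) per-image prediction entropy
    preds = probs.argmax(-1)                                   # (T, N) hard predictions per template
    scores = torch.full((T, C), float("inf"))
    for t in range(T):
        for c in range(C):
            sel = preds[t] == c                                # images this template assigns to class c
            if sel.any():
                scores[t, c] = entropy[t, sel].mean()          # class-wise average entropy
    return scores.argmin(dim=0)                                # (C,) lowest-entropy template per class

experts = select_class_experts(torch.softmax(torch.randn(8, 200, 19), dim=-1))
```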
Poster
Weihao Yu · Xiaoqing Guo · Xinyu Liu · Yifan Liu · Hao Zheng · Yawen Huang · Yixuan Yuan

[ Exhibit Hall I ]

Abstract
Intraoperative 2D/3D registration, which aligns preoperative CT scans with intraoperative X-ray images, is critical for surgical navigation. However, existing methods require extensive preoperative training (several hours), making them unsuitable for emergency surgeries where minutes significantly impact patient outcomes. We present GaussianReg, a novel registration framework that achieves clinically acceptable accuracy within minutes of preprocessing. Unlike prior approaches that learn primarily from 2D projections, we explicitly utilize 3D information by representing CT volumes as sparse Gaussian primitives and propose an innovative ray-based registration approach. These primitives emit rays toward potential camera positions, creating a hypothesis space of viewpoints. The registration problem then reduces to identifying rays that best match the target X-ray through our cross-modality attention mechanism. We further introduce canonical ellipsoid ray parameterization for stable optimization, bipartite matching-based patch aggregation for computational efficiency, and network pruning to accelerate training. Extensive experiments demonstrate that GaussianReg achieves 10mm-level accuracy with only 10 minutes of training, compared to hours required by existing methods. Our approach thus offers a promising solution for emergency surgical scenarios where rapid adaptation to patient-specific anatomy is critical.
Poster
Yuchen Liu · Yaoming Wang · Bowen Shi · XIAOPENG ZHANG · Wenrui Dai · Chenglin Li · Hongkai Xiong · Qi Tian

[ Exhibit Hall I ]

Abstract
Vision encoders serve as the cornerstone of multimodal understanding. Single-encoder architectures like CLIP exhibit inherent constraints in generalizing across diverse multimodal tasks, while recent multi-encoder fusion methods introduce prohibitive computational overhead to achieve superior performance using complementary visual representations from multiple vision encoders. To address this, we propose a progressive pruning framework, namely **M**ulti-**E**ncoder Collabora**T**iv**E** t**O**ken p**R**uning (**METEOR**), that eliminates redundant visual tokens across the encoding, fusion, and decoding stages for multi-encoder MLLMs. For multi-vision encoding, we discard redundant tokens within each encoder via a rank-guided collaborative token assignment strategy. Subsequently, for multi-vision fusion, we combine the visual features from different encoders while reducing cross-encoder redundancy with cooperative pruning. Finally, we propose an adaptive token pruning method in the LLM decoding stage to further discard irrelevant tokens based on the text prompts, with a dynamically adjusted pruning ratio for specific task demands. To the best of our knowledge, this is the first successful attempt to achieve an efficient multi-encoder-based vision-language model with multi-stage pruning strategies. Extensive experiments on 11 benchmarks demonstrate the effectiveness of our proposed approach. Compared with EAGLE, a typical multi-encoder MLLM, **METEOR** reduces visual tokens by 76\% with only a 0.3\% average performance drop.
Poster
Qihang Fan · Huaibo Huang · Yuang Ai · Ran He

[ Exhibit Hall I ]

Abstract
As the core operator of Transformers, Softmax Attention exhibits excellent global modeling capabilities. However, its quadratic complexity limits its applicability to vision tasks. In contrast, Linear Attention shares a similar formulation with Softmax Attention while achieving linear complexity, enabling efficient global information modeling. Nevertheless, Linear Attention suffers from a significant performance degradation compared to standard Softmax Attention. In this paper, we analyze the underlying causes of this issue based on the formulation of Linear Attention. We find that, unlike Softmax Attention, Linear Attention entirely disregards the magnitude information of the Query($Q$ or $\phi(Q)$). The absence of magnitude information prevents the attention score distribution from dynamically adapting as the Query scales. As a result, despite its structural similarity to Softmax Attention, Linear Attention exhibits a significantly different attention score distribution. Based on this observation, we propose **Magnitude-Aware Linear Attention** (MALA), which modifies the computation of Linear Attention to fully incorporate the Query’s magnitude. This adjustment allows MALA to generate an attention score distribution that closely resembles Softmax Attention while exhibiting a more well-balanced structure. As a result, MALA surpasses Softmax Attention in performance while maintaining only linear complexity. We build Magnitude-Aware Vision Transformer (MAViT) based on MALA, achieving **84.7%** accuracy on …
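The magnitude issue described above can be seen directly from the standard formulas (the abstract does not give MALA's exact modification, so only the diagnosis is illustrated here). With a kernel feature map $\phi$, and query $q_i$, keys $k_j$, values $v_j$:

$$
\mathrm{SoftmaxAttn}(q_i) \;=\; \sum_j \frac{\exp(q_i^\top k_j)}{\sum_{j'} \exp(q_i^\top k_{j'})}\, v_j,
\qquad
\mathrm{LinAttn}(q_i) \;=\; \frac{\phi(q_i)^\top \sum_j \phi(k_j)\, v_j^\top}{\phi(q_i)^\top \sum_j \phi(k_j)} .
$$

Rescaling $\phi(q_i) \to \lambda\,\phi(q_i)$ with $\lambda > 0$ multiplies the numerator and denominator of $\mathrm{LinAttn}$ by the same factor, so its output is unchanged, whereas scaling $q_i$ in softmax attention sharpens or flattens the score distribution; this is the magnitude information that MALA reintroduces.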
Poster
David Pujol-Perich · Sergio Escalera · Albert Clapés

[ Exhibit Hall I ]

Abstract
Video Temporal Grounding (VTG) involves Moment Retrieval (MR) and Highlight Detection (HD) based on textual queries. For this, most methods rely solely on final-layer features of frozen large pre-trained backbones, limiting their adaptability to new domains. While full fine-tuning is often impractical, parameter-efficient fine-tuning, and particularly side-tuning (ST), has emerged as an effective alternative. However, prior ST methods approach this problem from a frame-level refinement perspective, overlooking the inherently sparse nature of MR. To address this, we propose the Sparse-Dense Side-Tuner (SDST), the first anchor-free ST architecture for VTG. We also introduce the Reference-based Deformable Self-Attention, a novel mechanism that enhances the context modeling of deformable attention, a key limitation of existing anchor-free methods. Additionally, we present the first effective integration of the InternVideo2 backbone into an ST framework, showing its profound impact on performance. Overall, our method significantly improves upon existing ST methods, achieving highly competitive or SOTA results on QVHighlights, TACoS, and Charades-STA, while reducing the parameter count by up to 73% w.r.t. existing SOTA methods. The code will be made publicly available upon acceptance.
Poster
Jiawei Mao · Yuhan Wang · Yucheng Tang · Daguang Xu · Kang Wang · Yang Yang · Zongwei Zhou · Yuyin Zhou

[ Exhibit Hall I ]

Abstract
This paper presents **MedSegFactory**, a versatile medical synthesis framework trained across multiple modalities and tasks. The core of MedSegFactory is a dual-stream diffusion model, where one stream synthesizes medical images and the other generates corresponding segmentation masks. To ensure precise alignment between image-mask pairs, we introduce Joint Cross-Attention (JCA), enabling a collaborative denoising paradigm by dynamic cross-conditioning between streams. This bidirectional interaction allows both representations to guide each other's generation, enhancing consistency between generated pairs. MedSegFactory unlocks on-demand generation of paired medical images and segmentation masks through user-defined prompts that specify the target labels, imaging modalities, anatomical regions, and pathological conditions, facilitating scalable and high-quality data generation. This new paradigm of medical image synthesis enables seamless integration into diverse medical imaging workflows, enhancing both efficiency and accuracy. Extensive experiments show that MedSegFactory generates data of superior quality and usability, achieving competitive or state-of-the-art performance in 2D and 3D segmentation tasks while addressing data scarcity and regulatory constraints.
Poster
Shuai Liu · Peng Zhang · Shiwei Zhang · Wei Ke

[ Exhibit Hall I ]

Abstract
Open-set counting is garnering increasing attention due to its capability to enumerate objects of arbitrary category. It can be generally categorized into two methodologies: text-guided zero-shot counting methods and exemplar-guided few-shot counting methods. Previous text-guided zero-shot methods only provide limited object information through text, resulting in poor performance. Besides, though exemplar-guided few-shot approaches gain better results, they rely heavily on manually annotated visual exemplars, resulting in low efficiency and high labor intensity. Therefore, we propose CountSE, which simultaneously achieves high efficiency and high performance. CountSE is a new text-guided zero-shot object counting algorithm that generates multiple precise soft exemplars at different scales to enhance counting models driven solely by semantics. Specifically, to obtain richer object information and address the diversity in object scales, we introduce Semantic-guided Exemplar Selection, a module that generates candidate soft exemplars at various scales and selects those with high similarity scores. Then, to ensure accuracy and representativeness, Clustering-based Exemplar Filtering is introduced to refine the candidate exemplars by effectively eliminating inaccurate exemplars through clustering analysis. In the text-guided zero-shot setting, CountSE outperforms all state-of-the-art methods on the FSC-147 benchmark by at least 15\%. Additionally, experiments on two other widely used datasets demonstrate that CountSE significantly outperforms …
Poster
Xinyao Liu · Diping Song

[ Exhibit Hall I ]

Abstract
Multimodal large language models (MLLMs) demonstrate significant potential in the field of medical diagnosis. However, they face critical challenges in specialized domains such as ophthalmology, particularly the fragmentation of annotation granularity and inconsistencies in clinical reasoning logic, which hinder precise cross-modal understanding. This paper introduces **FundusExpert**, the first ophthalmology-specific MLLM with integrated positioning-diagnosis reasoning capabilities, along with **FundusGen**, a dataset constructed through the intelligent **Fundus-Engine** system. Fundus-Engine automates localization and leverages MLLM-based semantic expansion to integrate global disease classification, local object detection, and fine-grained feature analysis within a single fundus image. Additionally, by constructing a clinically aligned cognitive chain, it guides the model to generate interpretable reasoning paths. FundusExpert, fine-tuned with instruction data from FundusGen, achieves the best performance in ophthalmic question-answering tasks, surpassing the average accuracy of the 40B MedRegA by 26.6\%. It also excels in zero-shot report generation tasks, achieving a clinical consistency of 77.0\%, significantly outperforming GPT-4o's 47.6\%. Furthermore, we reveal a scaling law between data quality and model capability ($L \propto N^{0.33}$), demonstrating that the cognitive alignment annotations in FundusGen enhance data utilization efficiency. By integrating region-level localization with diagnostic reasoning chains, our work develops a scalable, clinically-aligned MLLM and explores a pathway toward bridging the visual-language gap in …
Poster
Yang Liu · Yufei Yin · Chenchen Jing · Muzhi Zhu · Hao Chen · Yuling Xi · Bo Feng · Hao Wang · Shiyu Li · Chunhua Shen

[ Exhibit Hall I ]

Abstract
In this work, we present COSINE, a unified open-world segmentation model that consolidates open-vocabulary segmentation and in-context segmentation with multi-modal prompts (e.g., text and image). COSINE exploits foundation models to extract representations for an input image and corresponding multi-modal prompts, and a SegDecoder to align these representations, model their interaction, and obtain masks specified by input prompts across different granularities. In this way, COSINE overcomes architectural discrepancies, divergent learning objectives, and distinct representation learning strategies of previous pipelines for open-vocabulary segmentation and in-context segmentation. Comprehensive experiments demonstrate that COSINE achieves significant performance improvements in both open-vocabulary and in-context segmentation tasks. Our exploratory analyses highlight that the synergistic collaboration between visual and textual prompts leads to significantly improved generalization over single-modality approaches.
Poster
Xiaolei Wang · Xiaoyang Wang · Huihui Bai · ENG Gee LIM · Jimin XIAO

[ Exhibit Hall I ]

Abstract
Recent unsupervised distillation-based and reconstruction-based methods rely on the feature inconsistency of a frozen encoder and the corresponding learnable decoder to achieve anomaly localization. However, these methods have a critical limitation: decoders trained exclusively on normal samples reconstruct abnormal features unexpectedly well, leading to degraded detection performance. We identify this phenomenon as 'anomaly leakage' (AL): the decoder optimized by reconstruction loss tends to directly copy the encoded input, regardless of whether the input is a normal or abnormal feature. To address this challenge, we propose a novel framework that explicitly decouples encoded features into normal and abnormal components through a bounded invertible mapping in a prior latent space. Compared to previous methods, the invertible structure can eliminate anomalous information point-to-point without damaging the information of neighboring patches, improving reconstruction. Moreover, the framework suppresses the abnormal component before reconstructing features through inverse mapping. In this process, effective synthetic abnormal features are essential for training the decoupling process. Therefore, we propose to apply adversarial training to find suitable perturbations to simulate feature-level anomalies. Extensive experimental evaluations on benchmark datasets, including MVTec AD, VisA, and Real-IAD, demonstrate that our method achieves competitive performance compared to state-of-the-art approaches. The code will be made publicly …
Poster
Eunchan Jo · Dahyun Kang · Sanghyun Kim · Yunseon Choi · Minsu Cho

[ Exhibit Hall I ]

Abstract
We address the problem of few-shot pattern detection, which aims to detect all instances of a given pattern, typically represented by a few exemplars, from an input image. Although similar problems have been studied in few-shot object counting and detection (FSCD), previous methods and their benchmarks have narrowed patterns of interest to object categories and often fail to localize non-object patterns. In this work, we propose a simple yet effective detector based on template matching and regression, dubbed \ours. While previous FSCD methods typically represent the given target exemplars as a spatially collapsed prototype, we revisit classic template matching and regression. It effectively preserves and leverages the spatial layout of exemplars in our minimalistic architecture, which consists of a few learnable layers of either convolutions or projections. We also introduce a new dataset, dubbed RPINE, which covers a wider range of patterns than existing object-centric datasets. Experiments on three benchmarks (RPINE, FSCD-147, and FSCD-LVIS) demonstrate that our method outperforms recent state-of-the-art methods, showing an outstanding generalization ability on cross-dataset evaluation.
Poster
Minghang Zheng · Yuxin Peng · Benyuan Sun · Yi Yang · Yang Liu

[ Exhibit Hall I ]

Abstract
In this paper, we tackle the task of online video temporal grounding (OnVTG), which requires the model to locate events related to a given natural language query within a video stream. Unlike regular video temporal grounding, OnVTG requires the model to make predictions without observing future frames. As online videos are streaming inputs and can go on indefinitely, it is impractical and inefficient to store all historical inputs. The existing OnVTG model employs memory to store recent historical video frame features and predict scores indicating whether the current frame corresponds to the start or end time of the target event. However, these methods lack effective event modeling and cannot retain long-term historical information, leading to low performance. To tackle these challenges, we propose hierarchical event memory for online video temporal grounding. We propose an event-based OnVTG framework that makes predictions based on event proposals that model event-level information with various durations. To efficiently preserve historically valuable event information, we introduce a hierarchical event memory that retains long-term low-redundant historical events, allowing the model to access both recent fine-grained information and long-term coarse-grained information. To enable the real-time prediction of the start time, we further propose a future prediction branch that …
Poster
Emmanuelle Bourigault · Amir Jamaludin · Abdullah Hamdi

[ Exhibit Hall I ]

Abstract
In the medical imaging domain, it is a fundamental challenge to collect large-scale labeled data due to privacy, involved logistics, and the high cost of labeling medical images. In this work, we present the UK Biobank Organs and Bones (UKBOB), the largest labeled dataset of body organs, comprising 51,761 3D MRI samples (17.9 M 2D images) and a total of more than 1.37 billion 2D segmentation masks of 72 organs based on the UK Biobank MRI dataset. We utilize automatic labeling, filter the labels with organ-specific filters, and manually annotate a subset of 300 MRIs with 11 abdominal classes to validate the quality (UKBOB-manual). This approach allows for scaling up the dataset collection while maintaining confidence in the labels. We further confirm the validity of the labels by the zero-shot generalization of trained models on the filtered UKBOB to other small labeled datasets from a similar domain (e.g., abdominal MRI). To further alleviate the effect of the noisy labels, we propose a novel Entropy Test-time Adaptation (ETTA) to refine the segmentation output. We use UKBOB to train a foundation model (_Swin-BOB_) for 3D medical image segmentation based on Swin-UNetr, achieving state-of-the-art results on several benchmarks in 3D medical imaging, …
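The abstract does not spell out ETTA's exact procedure; the sketch below shows the generic entropy-minimization style of test-time adaptation that the name suggests: for each incoming volume, the mean per-voxel prediction entropy is minimized for a few gradient steps. Function name and shapes are assumptions; in practice usually only normalization-layer parameters would be wrapped by the optimizer.

```python
# Generic entropy-minimization test-time adaptation step (illustrative, not the paper's ETTA).
import torch.nn.functional as F

def entropy_tta_step(model, volume, optimizer):
    # volume: one test batch; model returns (B, C, D, H, W) segmentation logits.
    logits = model(volume)
    p = F.softmax(logits, dim=1)
    entropy = -(p * p.clamp_min(1e-8).log()).sum(dim=1).mean()  # mean per-voxel entropy
    optimizer.zero_grad()
    entropy.backward()
    optimizer.step()              # typically only norm-layer affine parameters are updated
    return logits.detach()
```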
Poster
Chunwei Wang · Guansong Lu · Junwei Yang · Runhui Huang · Jianhua Han · Lu Hou · Wei Zhang · Hang Xu

[ Exhibit Hall I ]

Abstract
In this paper, we introduce ILLUME, a unified multimodal large language model (MLLM) that seamlessly integrates multimodal understanding and generation capabilities within a single large language model through a unified next-token prediction formulation. To address the large dataset size typically required for image-text alignment, we propose to enhance data efficiency through the design of a vision tokenizer that incorporates semantic information and a progressive multi-stage training procedure. This approach reduces the dataset size to just 15M for pretraining -- over four times fewer than what is typically needed -- while achieving competitive or even superior performance relative to existing unified MLLMs, such as Janus. Additionally, to promote synergistic enhancement between understanding and generation capabilities, which is under-explored in previous works, we introduce a novel self-enhancing multimodal alignment scheme. This scheme supervises the MLLM to self-assess the consistency between text descriptions and self-generated images, helping the model interpret images more accurately and avoid unrealistic and incorrect predictions caused by misalignment in image generation. Based on our extensive experiments, our proposed ILLUME stands out and competes with state-of-the-art unified MLLMs and specialized models across various benchmarks for multimodal understanding, generation, and editing.
Poster
Ruitao Wu · Yifan Zhao · Jia Li

[ Exhibit Hall I ]

Abstract
Class-Incremental Semantic Segmentation (CISS) requires continuous learning of newly introduced classes while retaining knowledge of past classes. By abstracting mainstream methods into two stages (visual feature extraction and prototype-feature matching), we identify a more fundamental challenge termed catastrophic semantic entanglement. This phenomenon involves Prototype-Feature Entanglement caused by semantic misalignment during the incremental process, and Background-Increment Entanglement due to dynamic data evolution. Existing techniques, which rely on visual feature learning without sufficient cues to distinguish targets, introduce significant noise and errors. To address these issues, we introduce a Language-inspired Bootstrapped Disentanglement framework (LBD). We leverage the prior class semantics of pre-trained visual-language models (e.g., CLIP) to guide the model in autonomously disentangling features through Language-guided Prototypical Disentanglement and Manifold Mutual Background Disentanglement. The former guides the disentangling of new prototypes by treating hand-crafted text features as topological templates, while the latter employs multiple learnable prototypes and mask-pooling-based supervision for background-incremental class disentanglement. By incorporating soft prompt tuning and encoder adaptation modifications, we further bridge the capability gap of CLIP between dense and sparse tasks, achieving state-of-the-art performance on both Pascal VOC and ADE20k, particularly in multi-step scenarios.
Poster
Liwei Che · Qingze T Liu · Jing Jia · Weiyi Qin · Ruixiang Tang · Vladimir Pavlovic

[ Exhibit Hall I ]

Abstract
Despite their remarkable potential, Large Vision-Language Models (LVLMs) still face challenges with object hallucination, a problem where their generated outputs mistakenly incorporate objects that do not actually exist. Although most works focus on addressing this issue within the language-model backbone, our work shifts the focus to the image input source, investigating how specific image tokens contribute to hallucinations. Our analysis reveals that a small subset of image tokens with high attention scores are the main drivers of object hallucination. By removing these hallucinatory image tokens (only 1.5% of all image tokens), the issue can be effectively mitigated. This finding holds consistently across different models. Building on this insight, we introduce EAZY, a novel, training-free method that automatically identifies and Eliminates hAllucinations by Zeroing out hallucinatorY image tokens. We utilize EAZY for unsupervised object hallucination detection, achieving a 15% improvement compared to previous methods. Additionally, EAZY demonstrates remarkable effectiveness in mitigating hallucinations while preserving model utility and seamlessly adapting to various LVLM architectures.
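A minimal sketch of the zero-out step described above: the small fraction of image tokens receiving the highest attention from the generated text is masked before decoding again. The 1.5% fraction comes from the abstract; how attention is aggregated and the function name are assumptions.

```python
# Hedged sketch: zero out the most-attended image tokens before re-generating.
import torch

def zero_out_high_attention_tokens(image_feats: torch.Tensor,
                                   attn_to_image: torch.Tensor,
                                   frac: float = 0.015) -> torch.Tensor:
    # image_feats: (num_image_tokens, dim) visual features fed to the LVLM;
    # attn_to_image: (num_image_tokens,) attention mass each image token receives
    # from the generated text (aggregation scheme assumed, not specified here).
    k = max(1, int(frac * image_feats.size(0)))
    top_idx = attn_to_image.topk(k).indices    # tokens suspected of driving hallucination
    out = image_feats.clone()
    out[top_idx] = 0.0                         # zero them out before decoding again
    return out
```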
Poster
Jian Wang · Tianhong Dai · Bingfeng Zhang · Siyue Yu · ENG Gee LIM · Jimin XIAO

[ Exhibit Hall I ]

Abstract
Weakly Supervised Semantic Segmentation (WSSS) utilizes Class Activation Maps (CAMs) to extract spatial cues from image-level labels. However, CAMs highlight only the most discriminative foreground regions, leading to incomplete results. Recent Vision Transformer-based methods leverage class-patch attention to enhance CAMs, yet they still suffer from partial activation due to the token gap: classification-focused class tokens prioritize discriminative features, while patch tokens capture both discriminative and non-discriminative characteristics. This mismatch prevents class tokens from activating all relevant features, especially when discriminative and non-discriminative regions exhibit significant differences. To address this issue, we propose Optimal Transport-assisted Proxy Learning (OTPL), a novel framework that bridges the token gap by learning adaptive proxies. OTPL introduces two key strategies: (1) optimal transport-assisted proxy learning, which combines class tokens with their most relevant patch tokens to produce comprehensive CAMs, and (2) optimal transport-enhanced contrastive learning, aligning proxies with confident patch tokens for bounded proxy exploration. Our framework overcomes the limitation of class tokens in activating patch tokens, providing more complete and accurate CAM results. Experiments on WSSS benchmarks (PASCAL VOC and MS COCO) demonstrate that our method significantly improves the CAM quality and achieves state-of-the-art performances. The source code will be released.
Poster
Jiashuo Yu · Yue Wu · Meng Chu · Zhifei Ren · Zizheng Huang · Pei Chu · Ruijie Zhang · Yinan He · Qirui Li · Songze Li · Zhenxiang Li · Zhongying Tu · Conghui He · Yu Qiao · Yali Wang · Yi Wang · Limin Wang

[ Exhibit Hall I ]

Abstract
We present VRBench, the first long narrative video benchmark crafted for evaluating large models' multi-step reasoning capabilities, addressing limitations in existing evaluations that overlook temporal reasoning and procedural validity. It comprises 1,010 long videos (average duration 1.6 hours) along with 9,468 human-labeled multi-step question-answering pairs and 30,292 reasoning steps. These videos are curated via a multi-stage filtering process including expert inter-rater reviewing to prioritize plot coherence. We develop a human-AI collaborative framework that generates coherent reasoning processes, each requiring multiple temporally grounded steps, spanning seven types (e.g., event attribution, implicit inference). VRBench designs a multi-phase evaluation pipeline that both evaluates models from the outcome and process level. Apart from the MCQs for the final results, we propose two metrics for progress-level evaluation: (1) LLM-guided scoring for logical coherence and factual accuracy, and (2) Stepwise multiple choice question decomposition to validate causal progression. Through extensive evaluations of 12 LLMs and 16 VLMs on VRBench, we undertake a thorough analysis and provide valuable insights that advance the field of multi-step reasoning. VRBench will be publicly available.
Poster
Qing Jiang · Lin Wu · Zhaoyang Zeng · Tianhe Ren · Yuda Xiong · Yihao Chen · Liu Qin · Lei Zhang

[ Exhibit Hall I ]

Abstract
Humans are undoubtedly the most important participants in computer vision, and the ability to detect any individual given a natural language description, a task we define as referring to any person, holds substantial practical value. However, we find that existing models generally fail to achieve real-world usability, and current benchmarks are limited by their focus on one-to-one referring, which hinders progress in this area. In this work, we revisit this task from three critical perspectives: task definition, dataset design, and model architecture. We first identify five aspects of referable entities and three distinctive characteristics of this task. Next, we introduce HumanRef, a novel dataset designed to tackle these challenges and better reflect real-world applications. From a model design perspective, we integrate a multimodal large language model with an object detection framework, constructing a robust referring model named RexSeek. Experimental results reveal that state-of-the-art models, which perform well on commonly used benchmarks like RefCOCO/+/g, struggle with HumanRef due to their inability to detect multiple individuals. In contrast, RexSeek not only excels in human referring but also generalizes effectively to common object referring, making it broadly applicable across various perception tasks.
Poster
Yang Liu · Wentao Feng · Zhuoyao Liu · Shudong Huang · Jiancheng Lv

[ Exhibit Hall I ]

Abstract
Enabling Visual Semantic Models to effectively handle multi-view description matching has been a longstanding challenge. Existing methods typically learn a set of embeddings to find the optimal match for each view's text and compute similarity. However, the visual and text embeddings learned through these approaches have limited information capacity and are prone to interference from locally similar negative samples. To address this issue, we argue that the information capacity of embeddings is crucial and propose Dense-to-Sparse Feature Distilled Visual Semantic Embedding (D2S-VSE), which enhances the information capacity of sparse text by leveraging dense text distillation. Specifically, D2S-VSE is a two-stage framework. In the pre-training stage, we align images with dense text to enhance the information capacity of visual semantic embeddings. In the fine-tuning stage, we optimize two tasks simultaneously, distilling dense text embeddings to sparse text embeddings while aligning images and sparse texts, enhancing the information capacity of sparse text embeddings. Our proposed D2S-VSE model is extensively evaluated on the large-scale MS-COCO and Flickr30K datasets, demonstrating its superiority over recent state-of-the-art methods.
Poster
Wanting ZHANG · Zhenhui Ding · Guilian Chen · Huisi Wu · Jing Qin

[ Exhibit Hall I ]

Abstract
Accurate breast ultrasound (BUS) image segmentation is crucial for precise diagnosis and surgical planning, but it remains challenging largely due to the scarcity of labeled BUS images. Semi-supervised methods show promise by leveraging pseudo-labels to mitigate reliance on large-scale annotations. However, their performance is highly dependent on the quality of pseudo-labels, which is difficult to guarantee in BUS images due to inherent complexities such as low contrast, speckle noise, and artifacts. Previous studies primarily focus on refining pseudo-labels in one way or another, or on introducing auxiliary supervision; yet they overlook the potential of harnessing intrinsic pixel relations to enhance the robustness of semi-supervised segmentation. In this paper, we present a novel relation-aware semi-supervised model for BUS image segmentation, which is composed of two innovative components, an adjacent relation propagation (ARP) module and a cross-layer relation alignment (CRA) module, that comprehensively explore pixel relations to improve segmentation performance. The ARP propagates relations among adjacent pixels to reinforce the collaborative prediction of correlated pixels and enhance the model's awareness of local semantic consistency. The CRA aligns cross-layer pixel relations, employing deep-layer guidance to rectify erroneous correlations in shallow layers for noise suppression, while integrating multi-scale contexts to enable robust …
Poster
Chandan Yeshwanth · David Rozenberszki · Angela Dai

[ Exhibit Hall I ]

Abstract
Generating text descriptions of objects in 3D indoor scenes is an important building block of embodied understanding. Existing methods do this by describing objects at a single level of detail, which often does not capture fine-grained details such as varying textures, materials, and shapes of the parts of objects. We propose the task of expressive 3D captioning: given an input 3D scene, describe objects at multiple levels of detail: a high-level object description, and a low-level description of the properties of its parts. To produce such captions, we present ExCap3D, an expressive 3D captioning model which takes as input a 3D scan, and for each detected object in the scan, generates a fine-grained collective description of the parts of the object, along with an object-level description conditioned on the part-level description. We design ExCap3D to encourage semantic consistency between the generated text descriptions, as well as textual similarity in the latent space, to further increase the quality of the generated captions. To enable this task, we generated the ExCap3D Dataset by leveraging a visual-language model (VLM) for multi-view captioning. ExCap3D Dataset contains captions on the ScanNet++ dataset with varying levels of detail, comprising 190k text descriptions of 34k 3D objects in 947 indoor scenes. Our experiments
Poster
JIAHE ZHAO · rongkun Zheng · Yi Wang · Helin WANG · Hengshuang Zhao

[ Exhibit Hall I ]

Abstract
In video Multimodal Large Language Models (video MLLMs), the visual encapsulation process plays a pivotal role in converting video contents into representative tokens for LLM input. While linear projectors are widely employed for encapsulation, they introduce semantic indistinctness and temporal incoherence when applied to videos. Conversely, the structure of resamplers shows promise in tackling these challenges, but an effective solution remains unexplored. Drawing inspiration from resampler structures, we introduce **DisCo**, a novel visual encapsulation method designed to yield semantically **dis**tinct and temporally **co**herent visual tokens for video MLLMs. DisCo integrates two key components: (1) A Visual Concept Discriminator (VCD) module, assigning unique semantics to visual tokens by pairing them with discriminative concepts in the video. (2) A Temporal Focus Calibrator (TFC) module, ensuring consistent temporal focus of visual tokens to video elements across every video frame. Through extensive experiments on multiple video MLLM frameworks, we demonstrate that DisCo remarkably outperforms previous state-of-the-art methods across a variety of video understanding benchmarks, while also achieving higher token efficiency thanks to the reduction of semantic indistinctness.
Poster
Jiesi Hu · Hanyang Peng · Yanwu Yang · Xutao Guo · Yang Shang · Pengcheng Shi · Chenfei Ye · Ting Ma

[ Exhibit Hall I ]

Abstract
In-context learning (ICL), a type of universal model, demonstrates exceptional generalization across a wide range of tasks without retraining by leveraging task-specific guidance from context, making it particularly effective for the intricate demands of neuroimaging. However, current ICL models, limited to 2D inputs and thus exhibiting suboptimal performance, struggle to extend to 3D inputs due to the high memory demands of ICL. In this regard, we introduce Neuroverse3D, an ICL model capable of performing multiple neuroimaging tasks in 3D (e.g., segmentation, denoising, inpainting). Neuroverse3D overcomes the large memory consumption associated with 3D inputs through adaptive parallel-sequential context processing and a U-shaped fusion strategy, allowing it to handle an unlimited number of context images. Additionally, we propose an optimized loss function to balance multi-task training and enhance focus on anatomical boundaries. Our study incorporates 43,674 3D multi-modal scans from 19 neuroimaging datasets and evaluates Neuroverse3D on 14 diverse tasks using held-out test sets. The results demonstrate that Neuroverse3D significantly outperforms existing ICL models and closely matches task-specific models, enabling flexible adaptation to medical center variations without retraining. The code and model weights will be made publicly available.
Poster
Trong-Thang Pham · AKASH AWASTHI · Saba Khan · Esteban Marti · Tien-Phat Nguyen · Khoa Vo · Minh Tran · Ngoc Son Nguyen · Cuong Van · Yuki Ikebe · Anh Nguyen · Anh Nguyen · Zhigang Deng · Carol Wu · Hien Nguyen · Ngan Le

[ Exhibit Hall I ]

Abstract
Understanding radiologists' eye movement during Computed Tomography (CT) reading is crucial for developing effective interpretable computer-aided diagnosis systems. However, CT research in this area has been limited by the lack of publicly available eye-tracking datasets and the three-dimensional complexity of CT volumes. To address these challenges, we present the first publicly available eye gaze dataset on CT, called CT-ScanGaze. Then, we introduce CT-Searcher, a novel 3D scanpath predictor designed specifically to process CT volumes and generate radiologist-like 3D fixation sequences, overcoming the limitations of current scanpath predictors that only handle 2D inputs. Since deep learning models benefit from a pretraining step, we develop a pipeline that converts existing 2D gaze datasets into 3D gaze data to pretrain CT-Searcher. Through both qualitative and quantitative evaluations on CT-ScanGaze, we demonstrate the effectiveness of our approach and provide a comprehensive assessment framework for 3D scanpath prediction in medical imaging. Code and data will be available for research purposes.
Poster
Zhibo Yang · Jun Tang · Zhaohai Li · Pengfei Wang · Jianqiang Wan · Humen Zhong · Xuejing Liu · Mingkun Yang · Peng Wang · Shuai Bai · Lianwen Jin · Junyang Lin

[ Exhibit Hall I ]

Abstract
Large Multimodal Models (LMMs) have demonstrated impressive performance in recognizing document images with natural language instructions. However, it remains unclear to what extent these models possess literacy capabilities when confronted with richly structured content and fine-grained visual challenges. The current landscape lacks a comprehensive benchmark to effectively measure the literate capabilities of LMMs. Existing benchmarks are often limited by narrow scenarios and specified tasks. To this end, we introduce CC-OCR, a comprehensive benchmark that possesses a diverse range of scenarios, tasks, and challenges. CC-OCR comprises four OCR-centric tracks: multi-scene text reading, multilingual text reading, document parsing, and key information extraction. It includes 39 subsets with 7,058 fully annotated images, of which 41% are sourced from real applications and are released for the first time. We evaluate ten prominent LMMs and reveal both the strengths and weaknesses of these models, particularly in text grounding, multi-orientation, and hallucination of repetition. CC-OCR aims to comprehensively evaluate the capabilities of LMMs on OCR-centered tasks, facilitating continued progress in this crucial area.
Poster
I-Hsiang Chen · Hua-En Chang · Wei-Ting Chen · Jenq-Newng Hwang · Sy-Yen Kuo

[ Exhibit Hall I ]

Abstract
Domain Generalized Semantic Segmentation (DGSS) is a critical yet challenging task, as domain shifts in unseen environments can severely compromise model performance. While recent studies enhance feature alignment by projecting features into the source domain, they often neglect intrinsic latent domain priors, leading to suboptimal results. In this paper, we introduce PDAF, a Probabilistic Diffusion Alignment Framework that enhances the generalization of existing segmentation networks through probabilistic diffusion modeling. PDAF introduces a Latent Domain Prior (LDP) to capture domain shifts and uses this prior as a conditioning factor to align both source and unseen target domains. To achieve this, PDAF integrates into a pre-trained segmentation model and utilizes paired source and pseudo-target images to simulate latent domain shifts, enabling LDP modeling. The framework comprises three modules: the Latent Prior Extractor (LPE) predicts the LDP by supervising domain shifts; the Domain Compensation Module (DCM) adjusts feature representations to mitigate domain shifts; and the Diffusion Prior Estimator (DPE) leverages a diffusion process to estimate the LDP without requiring paired samples. This design enables PDAF to iteratively model domain shifts, progressively refining feature representations to enhance generalization under complex target conditions. Extensive experiments validate the effectiveness of PDAF across diverse and challenging urban …
Poster
Long Lian · Yifan Ding · Yunhao Ge · Sifei Liu · Hanzi Mao · Boyi Li · Marco Pavone · Ming-Yu Liu · Trevor Darrell · Adam Yala · Yin Cui

[ Exhibit Hall I ]

Abstract
Generating detailed and accurate descriptions for specific regions in images and videos remains a fundamental challenge for vision-language models. We introduce the Describe Anything Model (DAM), a model designed for detailed localized captioning (DLC). DAM preserves both local details and global context through two key innovations: a focal prompt, which ensures high-resolution encoding of targeted regions, and a localized vision backbone, which integrates precise localization with its broader context. To tackle the scarcity of high-quality DLC data, we propose a Semi-supervised learning (SSL)-based Data Pipeline (DLC-SDP). DLC-SDP starts with existing segmentation datasets and expands to unlabeled web images using SSL. We introduce DLC-Bench, a benchmark designed to evaluate DLC without relying on reference captions. DAM sets new state-of-the-art on 10 benchmarks spanning keyword-level, phrase-level, and detailed multi-sentence localized image and video captioning.
Poster
Shenghao Fu · Qize Yang · Yuan-Ming Li · Yi-Xing Peng · Kun-Yu Lin · Xihan Wei · Jian-Fang Hu · Xiaohua Xie · Wei-Shi Zheng

[ Exhibit Hall I ]

Abstract
Recent advances in Large Multi-modal Models (LMMs) are primarily focused on offline video understanding. Instead, streaming video understanding poses great challenges to recent models due to its time-sensitive, omni-modal and interactive characteristics. In this work, we aim to extend the streaming video understanding from a new perspective and propose a novel task named Visual Instruction Feedback in which models should be aware of visual contents and learn to extract instructions from them. For example, when users wave their hands to agents, agents should recognize the gesture and start conversations with welcome information. Thus, following instructions in visual modality greatly enhances user-agent interactions. To facilitate research, we define seven key subtasks highly relevant to visual modality and collect the ViSpeak-Instruct dataset for training and the ViSpeak-Bench for evaluation. Further, we propose the ViSpeak model, which is a SOTA streaming video understanding LMM with GPT-4o-level performance on various streaming video understanding benchmarks. After finetuning on our ViSpeak-Instruct dataset, ViSpeak is equipped with basic visual instruction feedback ability, serving as a solid baseline for future research. The model, code, and datasets will be made publicly available.
Poster
WonJun Moon · Cheol-Ho Cho · Woojin Jun · Minho Shim · Taeoh Kim · Inwoong Lee · Dongyoon Wee · Jae-Pil Heo

[ Exhibit Hall I ]

Abstract
In a retrieval system, simultaneously achieving search accuracy and efficiency is inherently challenging. This challenge is particularly pronounced in partially relevant video retrieval (PRVR), where incorporating more diverse context representations at varying temporal scales for each video enhances accuracy but increases computational and memory costs. To address this dichotomy, we propose a prototypical PRVR framework that encodes diverse contexts within a video into a fixed number of prototypes. We then introduce several strategies to enhance text association and video understanding within the prototypes, along with an orthogonal objective to ensure that the prototypes capture a diverse range of content. To keep the prototypes searchable via text queries while accurately encoding video contexts, we implement cross- and uni-modal reconstruction tasks. The cross-modal reconstruction task aligns the prototypes with textual features within a shared space, while the uni-modal reconstruction task preserves all video contexts during encoding. Additionally, we employ a video mixing technique to provide weak guidance to further align prototypes and associated textual representations. Extensive evaluations on TVR, ActivityNet-Captions, and QVHighlights validate the effectiveness of our approach without sacrificing efficiency.
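The "orthogonal objective" mentioned above is commonly realized by penalizing off-diagonal similarity between a video's prototypes so that they cover diverse content; the sketch below shows that generic form, which may differ from the paper's exact loss.

```python
# Generic prototype-diversity regularizer: push pairwise prototype similarities toward zero.
import torch
import torch.nn.functional as F

def prototype_orthogonality_loss(prototypes: torch.Tensor) -> torch.Tensor:
    # prototypes: (num_prototypes, dim) learned for one video.
    p = F.normalize(prototypes, dim=1)
    gram = p @ p.t()                                   # pairwise cosine similarities
    eye = torch.eye(p.size(0), device=p.device)
    return ((gram - eye) ** 2).mean()                  # off-diagonal terms are driven to zero

loss = prototype_orthogonality_loss(torch.randn(20, 256))
```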
Poster
Yana Hasson · Pauline Luc · Liliane Momeni · Maks Ovsjanikov · Guillaume Le Moing · Alina Kuznetsova · Ira Ktena · Jennifer J. Sun · Skanda Koppula · Dilara Gokay · Joseph Heyward · Etienne Pot · Andrew Zisserman

[ Exhibit Hall I ]

Abstract
In recent years, there has been a proliferation of spatiotemporal foundation models for different scientific domains. While promising, these models are often domain-specific, limiting their applicability. Given that many spatiotemporal tasks can be represented as video modeling problems, video foundation models (ViFMs) hold considerable promise. However, it remains an open question to what extent the knowledge acquired on large-scale but potentially out-of-domain data can be effectively transferred across diverse scientific domains, and whether a single, pretrained ViFM can be competitive with domain-specific baselines. To address this, we introduce SciVid, a comprehensive benchmark comprising five **Sci**entific **Vid**eo tasks across medical computer vision, animal behavior, and weather forecasting. We adapt six leading video models to SciVid using simple trainable readout modules, establishing strong baselines and demonstrating the potential for effective transfer learning. Specifically, we show that state-of-the-art results can be obtained in several applications by effectively transferring general-purpose representations from ViFM backbones. Furthermore, our results shed light on limitations of existing ViFMs, and highlight opportunities for the development of generalizable models for high-impact scientific applications. We will release our code to facilitate further research in cross-domain development of ViFMs.
Poster
Zheyuan Zhang · Wanying Dou · Linkai Peng · Hongyi Pan · Ulas Bagci · Boqing Gong

[ Exhibit Hall I ]

Abstract
Advertisement videos serve as a rich and valuable source of purpose-driven information, encompassing high-quality visual, textual, and contextual cues designed to engage viewers. They are often more complex than general videos of similar duration due to their structured narratives and rapid scene transitions, posing significant challenges to multi-modal large language models (MLLMs). In this work, we introduce VideoAds, the first dataset tailored for benchmarking the performance of MLLMs on advertisement videos. VideoAds comprises well-curated advertisement videos with complex temporal structures, accompanied by manually annotated diverse questions across three core tasks: visual finding, video summary, and visual reasoning. We propose a quantitative measure to compare VideoAds against existing benchmarks in terms of video complexity. Through extensive experiments, we find that Qwen2.5-VL-72B, an open-source MLLM, achieves 73.35\% accuracy on VideoAds, outperforming GPT-4o (66.82\%) and Gemini-1.5 Pro (69.66\%); the two proprietary models fall behind the open-source model especially in video summarization and reasoning, but perform best in visual finding. Notably, human experts easily achieve a remarkable accuracy of 94.27\%. These results underscore the necessity of advancing MLLMs' temporal modeling capabilities and highlight VideoAds as a potentially pivotal benchmark for future research in video-language understanding. The dataset and evaluation code will be publicly …
Poster
Runpeng Yu · Xinyin Ma · Xinchao Wang

[ Exhibit Hall I ]

Abstract
In MLLMs, visual perception refers to the process by which MLLMs encode visual inputs, such as images, and align them with the text embedding space. Currently, MLLMs still lack the capability to autonomously control their own visual perception processes. For example, they cannot selectively re-encode specific regions of an image or focus on information related to specific object categories. In this work, we propose the concept of the Visual Perception Token, aiming to empower MLLMs with a mechanism to control their visual perception processes. We design two types of Visual Perception Tokens, termed the Region Selection Token and the Vision Re-Encoding Token. MLLMs autonomously generate these tokens, just as they generate natural language tokens, and use them to trigger additional visual perception processes. The Region Selection Token explicitly identifies regions of interest that require further processing, while the Vision Re-Encoding Token utilizes its hidden states to guide an additional vision encoding process. Extensive experiments highlight the effectiveness of these tokens in enhancing spatial reasoning, fine-grained understanding, Text/OCR-related VQA, and a wide range of other visual tasks. On average, the introduction of Visual Perception Tokens improves the performance of a 2B model by 30.9%, increasing its score from 0.572 to 0.749, and even …
Poster
Haochen Zhao · Jianwei Niu · Xuefeng Liu · Xiaozheng Xie · Li Kuang · Haotian Yang · Bin Dai · Hui Meng · Yong Wang

[ Exhibit Hall I ]

Abstract
Based on pseudo-labels, voxel-wise contrastive learning (VCL) is a prominent approach designed to learn effective feature representations for semi-supervised medical image segmentation. However, in multi-organ segmentation (MoS), the complex anatomical structures of certain organs often lead to many unreliable pseudo-labels. Directly applying VCL can introduce confirmation bias, resulting in poor segmentation performance. A common practice is to first transform these unreliable pseudo-labels into complementary ones, which represent classes that voxels are least likely to belong to, and then push voxels away from the generated complementary labels. However, we find that this approach may fail to allow voxels with unreliable pseudo-labels (unreliable voxels) to fully benefit from the advantages of VCL. In this paper, we propose DVCL, a novel distance-aware VCL method for semi-supervised MoS. DVCL is based on the observation that unreliable voxels, which may not form discriminative feature boundaries, still form clear clusters. Hence, voxels close to each other in the feature space ('neighbors') likely belong to the same semantic class, while distant ones ('outsiders') likely belong to different classes. In DVCL, we first identify neighbors and outsiders for all unreliable voxels, and then pull their neighbors into the same clusters while pushing outsiders away. In this way, unreliable …
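The neighbor/outsider idea above lends itself to a simple distance-based contrastive objective. The sketch below is an assumption-laden toy version (the paper's exact neighbor selection and loss form are not reproduced): pull the k nearest features of each unreliable voxel closer and push the farthest ones beyond a margin.

```python
# Minimal sketch of a distance-aware pull/push loss over unreliable-voxel features.
import torch
import torch.nn.functional as F

def distance_aware_contrast(feats, k_near=4, k_far=4, margin=0.5):
    # feats: (N, D) features of voxels with unreliable pseudo-labels
    f = F.normalize(feats, dim=-1)
    dist = torch.cdist(f, f)                               # (N, N) pairwise distances
    eye = torch.eye(f.size(0), dtype=torch.bool, device=f.device)
    near = dist.masked_fill(eye, float("inf")).topk(k_near, largest=False).values
    far = dist.masked_fill(eye, float("-inf")).topk(k_far, largest=True).values
    pull = near.mean()                                     # draw neighbors together
    push = F.relu(margin - far).mean()                     # keep outsiders apart
    return pull + push

unreliable_feats = torch.randn(128, 32)    # 128 unreliable voxels, 32-d features
loss = distance_aware_contrast(unreliable_feats)
```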
Poster
Matic Fučka · Vitjan Zavrtanik · Danijel Skocaj

[ Exhibit Hall I ]

Abstract
Recent surface anomaly detection methods excel at identifying structural anomalies, such as dents and scratches, but struggle with logical anomalies, such as irregular or missing object components. The best-performing logical anomaly detection approaches rely on aggregated pretrained features or handcrafted descriptors (most often derived from composition maps), which discard spatial and semantic information, leading to suboptimal performance. We propose SALAD, a semantics-aware discriminative logical anomaly detection method that incorporates a newly proposed composition branch to explicitly model the distribution of object composition maps, consequently learning important semantic relationships. Additionally, we introduce a novel procedure for extracting composition maps that requires no hand-made labels or category-specific information, in contrast to previous methods. By effectively modelling the composition map distribution, SALAD significantly improves upon state-of-the-art methods on the standard benchmark for logical anomaly detection, MVTec LOCO, achieving an impressive image-level AUROC of 96.1\%. Code: \textcolor{magenta}{Upon acceptance}
Poster
Shengcao Cao · Zijun Wei · Jason Kuen · Kangning Liu · Lingzhi Zhang · Jiuxiang Gu · HyunJoon Jung · Liangyan Gui · Yu-Xiong Wang

[ Exhibit Hall I ]

Abstract
Recent image segmentation models have advanced to segment images into high-quality masks for visual entities, and yet they cannot provide comprehensive semantic understanding for complex queries based on both language and vision. This limitation reduces their effectiveness in applications that require user-friendly interactions driven by vision-language prompts. To bridge this gap, we introduce a novel task of flexible referring expression segmentation (FRES). In this task, a model produces a group of masks based on arbitrary prompts specified by text only or text plus reference visual entities. To address this new challenge, we propose a novel framework to "Refer to Any Segmentation Mask Group" (RAS), which augments segmentation models with complex multimodal interactions and comprehension via a mask-centric large multimodal model. For training and benchmarking FRES models, we create datasets MaskGroups-2M and MaskGroups-HQ to include diverse mask groups specified by text and reference entities. Through extensive evaluation, we demonstrate superior performance of RAS on our new FRES task, as well as classic referring expression segmentation (RES) and generalized referring expression segmentation (GRES) tasks.
Poster
Ragav Sachdeva · Andrew Zisserman

[ Exhibit Hall I ]

Abstract
Comics have long been a popular form of storytelling, offering visually engaging narratives that captivate audiences worldwide. However, the visual nature of comics presents a significant barrier for visually impaired readers, limiting their access to these engaging stories. In this work, we provide a pragmatic solution to this accessibility challenge by developing an automated system that generates text-based literary narratives from manga comics. Our approach aims to create evocative and immersive prose that not only conveys the original narrative but also captures the depth and complexity of characters, their interactions, and the vivid settings in which they reside. To this end, we make the following contributions: (1) We present a unified model, Magiv3, that excels at various functional tasks pertaining to comic understanding, such as localising panels, characters, texts, and speech-bubble tails, performing OCR, grounding characters, etc. (2) We release human-annotated captions for over 3300 Japanese comic panels, along with character grounding annotations, and benchmark large vision-language models in their ability to understand comic images. (3) Finally, we demonstrate how integrating large vision-language models with Magiv3 can generate seamless literary narratives that allow visually impaired audiences to engage with the depth and richness of comic storytelling. Our code, trained model …
Poster
Pan Liu · Jinshi Liu

[ Exhibit Hall I ]

Abstract
While significant advances exist in pseudo-label generation for semi-supervised semantic segmentation, pseudo-label selection remains understudied. Existing methods typically use fixed confidence thresholds to retain high-confidence predictions as pseudo-labels. However, these methods cannot cope with the network's tendency toward overconfidence, where correct and incorrect predictions overlap significantly in high-confidence regions, making separation challenging and amplifying the model's cognitive bias. Meanwhile, directly discarding low-confidence predictions disrupts spatial-semantic continuity, causing critical context loss. We propose Confidence Separable Learning (CSL) to address these limitations. CSL formulates pseudo-label selection as a convex optimization problem within the confidence distribution feature space, establishing sample-specific decision boundaries to distinguish reliable from unreliable predictions. Additionally, CSL introduces random masking of reliable pixels to guide the network in learning contextual relationships from low-reliability regions, thereby mitigating the adverse effects of discarding uncertain predictions. Extensive experimental results on the Pascal VOC 2012 and Cityscapes benchmarks show that CSL performs favorably against state-of-the-art methods.
Poster
Seokho Han · Seoyeon Yoon · Jinhee Kim · Dongwei Wang · Kang Jeon · Huanrui Yang · Jong Hwan Ko

[ Exhibit Hall I ]

Abstract
As deep neural networks (DNNs) see increased deployment on mobile and edge devices, optimizing model efficiency has become crucial. Mixed-precision quantization is widely favored, as it offers a superior balance between efficiency and accuracy compared to uniform quantization. However, finding the optimal precision for each layer is challenging. Recent studies using bit-level training have shown promise, yet they often introduce substantial training complexity and high GPU memory requirements. In this paper, we propose Memory-Efficient Bit Sparsification Quantization (MSQ), a novel approach that addresses these limitations. MSQ applies a round-clamp quantizer and leverages least significant bit (LSB) regularization to induce sparsity in LSBs, enabling effective precision reduction without splitting parameters at the bit level, thereby minimizing memory use and training time. Additionally, MSQ incorporates Hessian information, allowing the simultaneous pruning of multiple LSBs to further enhance training efficiency. Experimental results show that MSQ effectively reduces resource demands while maintaining competitive accuracy and compression rates, making it a practical solution for training efficient DNNs on resource-constrained devices.
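To make the LSB-sparsification idea above concrete, here is a hedged toy example (the quantizer and regularizer form are assumptions, not the paper's exact formulation): weights are fake-quantized to integers and a penalty is placed on odd quantization levels, i.e. on non-zero least significant bits.

```python
# Sketch: penalize least-significant bits of quantized weights so one bit can be dropped.
import torch

def fake_quantize(w, scale, bits=8):
    qmax = 2 ** (bits - 1) - 1
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return q, q * scale                      # integer codes and dequantized weights

def lsb_regularizer(q_int):
    # Distance of each integer code to the nearest even value: zero exactly when the
    # LSB is already 0, i.e. the weight survives a one-bit precision reduction unchanged.
    return torch.abs(q_int - 2 * torch.round(q_int / 2)).mean()

w = torch.randn(256, 256)
scale = w.abs().max() / 127
q_int, w_dq = fake_quantize(w, scale)
reg = lsb_regularizer(q_int)                 # add to the task loss with some weight
```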
Poster
Zhaorui Tan · Xi Yang · Tan Pan · TIANYI LIU · Chen Jiang · Xin Guo · Qiufeng Wang · Anh Nguyen · Yuan Qi · Kaizhu Huang · Yuan Cheng

[ Exhibit Hall I ]

Abstract
Variations in medical imaging modalities and individual anatomical differences pose challenges to cross-modality generalization in multi-modal tasks. Existing methods often concentrate exclusively on common anatomical patterns, thereby neglecting individual differences and consequently limiting their generalization performance. This paper emphasizes the critical role of learning individual-level invariance, i.e., personalized representation $\mathbb{X}_h$, to enhance multi-modality generalization under both homogeneous and heterogeneous settings. It reveals that mappings from individual anatomy to different medical modalities remain static across the population, which is implied in the personalization process. We propose a two-stage approach: pre-training with invariant representation $\mathbb{X}_h$ for personalization, then fine-tuning for diverse downstream tasks. We provide both theoretical and empirical evidence demonstrating the feasibility and advantages of personalization, showing that our approach yields greater generalizability and transferability across diverse multi-modal medical tasks compared to methods lacking personalization. Extensive experiments further validate that our approach significantly enhances performance in various generalization scenarios.
Poster
Jiaming Liu · Linghe Kong · Guihai Chen

[ Exhibit Hall I ]

Abstract
Segment anything model (SAM) has shown impressive general-purpose segmentation performance on natural images, but its performance on camouflaged object detection (COD) is unsatisfactory. In this paper, we propose SAM-COD that performs camouflaged object detection for RGB-D inputs. While keeping the SAM architecture intact, dual stream adapters are expanded on the image encoder to learn potential complementary information from RGB images and depth images, and fine-tune the mask decoder and its depth replica to perform dual-stream mask prediction. In practice, the dual stream adapters are embedded into the attention block of the image encoder in a parallel manner to facilitate the refinement and correction of the two types of image embeddings. To mitigate channel discrepancies arising from dual stream embeddings that do not directly interact with each other, we augment the association of dual stream embeddings using bidirectional knowledge distillation including a model distiller and a modal distiller. In addition, to predict the masks for RGB and depth attention maps, we hybridize the two types of image embeddings which are jointly learned with the prompt embeddings to update the initial prompt, and then feed them into the mask decoders to synchronize the consistency of image embeddings and prompt embeddings. Experimental results …
Poster
Yuanze Li · Shihao Yuan · Haolin Wang · Qizhang Li · Ming Liu · Chen Xu · Guangming Shi · Wangmeng Zuo

[ Exhibit Hall I ]

Abstract
Although recent methods have tried to introduce large multimodal models (LMMs) into industrial anomaly detection (IAD), their generalization in the IAD field is far inferior to that for general purposes. We summarize the main reasons for this gap into two aspects. On one hand, general-purpose LMMs lack cognition of defects in the visual modality, thereby failing to sufficiently focus on defect areas. Therefore, we propose to modify the AnyRes structure of the LLaVA model, providing the potential anomalous areas identified by existing IAD models to the LMMs. On the other hand, existing methods mainly focus on identifying defects by learning defect patterns or comparing with normal samples, yet they fall short of understanding the causes of these defects. Considering that the generation of defects is closely related to the manufacturing process, we propose a manufacturing-driven IAD paradigm. An instruction-tuning dataset for IAD (InstructIAD) and a data organization approach for Chain-of-Thought with manufacturing (CoT-M) are designed to leverage the manufacturing process for IAD. Based on the above two modifications, we present Triad, a novel LMM-based method incorporating an expert-guided region-of-interest tokenizer and manufacturing process for industrial anomaly detection. Extensive experiments show that our Triad not only demonstrates competitive performance against current …
Poster
Zhixuan Li · Hyunse Yoon · Sanghoon Lee · Weisi Lin

[ Exhibit Hall I ]

Abstract
Amodal segmentation aims to infer the complete shape of occluded objects, even when the occluded region's appearance is unavailable. However, current amodal segmentation methods lack the capability to interact with users through text input and struggle to understand or reason about implicit and complex purposes. While methods like LISA integrate multi-modal large language models (LLMs) with segmentation for reasoning tasks, they are limited to predicting only visible object regions and face challenges in handling complex occlusion scenarios. To address these limitations, we propose a novel task named amodal reasoning segmentation, aiming to predict the complete amodal shape of occluded objects while providing answers with elaborations based on user text input. We develop a generalizable dataset generation pipeline and introduce a new dataset focusing on daily life scenarios, encompassing diverse real-world occlusions. Furthermore, we present AURA (Amodal Understanding and Reasoning Assistant), a novel model with advanced global and spatial-level designs specifically tailored to handle complex occlusions. Extensive experiments validate AURA's effectiveness on the proposed dataset. The code, model, and dataset will be publicly released.
Poster
Jiaxuan Chen · Yu Qi · Yueming Wang · Gang Pan

[ Exhibit Hall I ]

Abstract
Neural decoding has recently made significant progress in reconstructing images and text from brain activity, yet seeking biologically valid semantic alignment between artificial models and the brain remains challenging. Large pre-trained foundation models such as CLIP excel at capturing rich semantic details in complex visual scenes. In contrast, due to selective attention, only part of the visual semantics in the stimulus may be preferentially represented in the neural patterns when subjects view images. Past studies have generally assumed that stimulus images and their evoked brain recordings are strictly semantically equivalent, potentially leading to semantic misalignment between supervision signals and neural recordings. In order to address this, we propose a novel self-adaptive semantic decoding method (Mind-SA), designed to dynamically detect the regions within stimulus images that the brain actually focuses on and use them as supervision to guide brain-to-text reconstruction. We find that the proposed Mind-SA can be used to reduce the semantic gap between supervision signals (i.e., stimulus images) and neural representations, thus enabling the reconstruction model to focus on the parts that the brain actually perceives. Experiments demonstrate that Mind-SA improves the quality of neural representations and achieves the state-of-the-art brain-to-text performance.
Poster
HAILONG YAN · Ao Li · Xiangtao Zhang · Zhe Liu · Zenglin Shi · Ce Zhu · Le Zhang

[ Exhibit Hall I ]

Abstract
Recent advancements in deep neural networks have driven significant progress in image enhancement (IE). However, deploying deep learning models on resource-constrained platforms, such as mobile devices, remains challenging due to high computation and memory demands. To address these challenges and facilitate real-time IE on mobile, we introduce an extremely lightweight Convolutional Neural Network (CNN) framework with around 4K parameters. Our approach integrates re-parameterization with an Incremental Weight Optimization strategy to ensure efficiency. Additionally, we enhance performance with a Feature Self-Transform module and a Hierarchical Dual-Path Attention mechanism, optimized with a Local Variance-Weighted loss. With this efficient framework, we are the first to achieve real-time IE inference at up to 1,100 frames per second (FPS) while delivering competitive image quality, achieving the best trade-off between speed and performance across multiple IE tasks. The code will be released soon.
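Re-parameterization, which the abstract lists as one ingredient, is a generic technique worth a small worked example. The sketch below (an illustration of the general trick, not this paper's specific blocks) trains a 3x3 conv plus a parallel 1x1 conv and folds them into a single 3x3 conv for deployment.

```python
# Sketch: fold a two-branch conv (3x3 + 1x1) into one 3x3 conv after training.
import torch
import torch.nn as nn
import torch.nn.functional as F

conv3 = nn.Conv2d(8, 8, 3, padding=1, bias=True)
conv1 = nn.Conv2d(8, 8, 1, bias=True)

def merge(conv3, conv1):
    fused = nn.Conv2d(8, 8, 3, padding=1, bias=True)
    with torch.no_grad():
        # Zero-pad the 1x1 kernel so its value sits at the center of a 3x3 kernel.
        w = conv3.weight + F.pad(conv1.weight, [1, 1, 1, 1])
        fused.weight.copy_(w)
        fused.bias.copy_(conv3.bias + conv1.bias)
    return fused

x = torch.randn(1, 8, 32, 32)
y_train = conv3(x) + conv1(x)                 # two-branch output during training
y_deploy = merge(conv3, conv1)(x)             # single-conv output after folding
assert torch.allclose(y_train, y_deploy, atol=1e-4)
```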
Poster
yingsen zeng · Zepeng Huang · Yujie Zhong · Chengjian Feng · Jie Hu · Lin Ma · Yang Liu

[ Exhibit Hall I ]

Abstract
Despite advances in general video understanding, Video Large Language Models (Video-LLMs) face challenges in precise temporal localization due to discrete time representations and limited temporally aware datasets. Existing methods for temporal expression either conflate time with text-based numerical values, add a series of dedicated temporal tokens, or regress time using specialized temporal grounding heads. To address these issues, we introduce DisTime, a lightweight framework designed to enhance temporal comprehension in Video-LLMs. This approach uses a learnable token to create a continuous embedding space for all time points and incorporates a Distribution-based Time Tokenizer that decodes timestamps into probability distributions. These distributions effectively resolve boundary ambiguities and translate into continuous time values. Additionally, we propose an automated annotation paradigm that combines the captioning capabilities of Video-LLMs with the localization expertise of dedicated temporal models to overcome temporal granularity limitations in existing datasets. This leads to the creation of InternVid-TG, a substantial dataset with 1.25M temporally grounded events across 179k videos, surpassing ActivityNet-Caption by 55 times. Extensive experiments demonstrate that DisTime achieves state-of-the-art performance across benchmarks in three time-sensitive tasks while maintaining competitive performance in Video QA tasks.
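The "decode timestamps as probability distributions" idea above can be illustrated with a small expectation-based head. Everything below (bin count, hidden size, relative-time convention) is an assumption for illustration, not the paper's tokenizer.

```python
# Sketch: decode a continuous timestamp as the expectation over discretized time bins.
import torch
import torch.nn as nn

class DistributionTimeHead(nn.Module):
    def __init__(self, hidden_dim=1024, num_bins=100):
        super().__init__()
        self.to_logits = nn.Linear(hidden_dim, num_bins)
        # Centers of num_bins uniform bins covering relative time in [0, 1].
        self.register_buffer("bin_centers", (torch.arange(num_bins) + 0.5) / num_bins)

    def forward(self, time_token_state):                   # (B, hidden_dim)
        probs = self.to_logits(time_token_state).softmax(dim=-1)
        return (probs * self.bin_centers).sum(dim=-1)      # expectation in [0, 1]

head = DistributionTimeHead()
rel = head(torch.randn(2, 1024))      # relative timestamps for two time tokens
seconds = rel * 120.0                 # scale by the video duration (e.g. 120 s)
```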
Poster
Ahmed Nassar · Matteo Omenetti · Maksym Lysak · Nikolaos Livathinos · Christoph Auer · Lucas Morin · Rafael Teixeira de Lima · Yusik Kim · A. Said Gurbuz · Michele Dolfi · Peter Staar

[ Exhibit Hall I ]

Abstract
We introduce SmolDocling, an ultra-compact vision-language model targeting end-to-end document conversion. Our model comprehensively processes entire pages by generating DocTags, a new universal markup format that captures all page elements in their full context with location. Unlike existing approaches that rely on large foundational models, or ensemble solutions that rely on handcrafted pipelines of multiple specialized models, SmolDocling offers an end-to-end conversion for accurately capturing content, structure and spatial location of document elements in a 256M parameters vision-language model. SmolDocling exhibits robust performance in correctly reproducing document features such as code listings, tables, equations, charts, lists, and more across a diverse range of document types including business documents, academic papers, technical reports, patents, and forms — significantly extending beyond the commonly observed focus on scientific papers. Additionally, we contribute novel publicly sourced datasets for charts, tables, equations, and code recognition.Experimental results demonstrate that SmolDocling competes with other Vision Language Models that are up to 27 times larger in size, while reducing computational requirements substantially. The model weights and supplementary datasets will be publicly available upon acceptance.
Poster
Jiajia Li · Huisi Wu · Jing Qin

[ Exhibit Hall I ]

Abstract
Nuclei instance segmentation in histopathology images is a fundamental task in computational pathology. It is also a very challenging task due to complex nuclei morphologies, ambiguous boundaries, and staining variations. Existing methods often struggle to precisely delineate overlapping nuclei and handle class imbalance. We introduce WeaveSeg, a novel deep learning model for nuclei instance segmentation that significantly improves segmentation performance via synergistic integration of adaptive spectral feature refinement and iterative contrast-weaving. WeaveSeg features an adaptive spectral detail refinement (SAR) module for multi-scale feature enhancement via adaptive frequency component fusion, and an iterative contrast-weaving (ICW) module that progressively refines features through integrating contrastive attention, decoupled semantic context, and adaptive gating. Furthermore, we introduce a specialized uncertainty loss to explicitly model ambiguous regions, and a novel local contrast-based self-adaptive adjustment mechanism to accommodate dynamic feature distributions. Extensive experiments on MoNuSeg and CoNSeP demonstrate WeaveSeg's SOTA performance over existing models. Code will be publicly available.
Poster
Zitian Tang · Shijie Wang · Junho Cho · Jaewook Yoo · Chen Sun

[ Exhibit Hall I ]

Abstract
How multimodal large language models (MLLMs) perceive the visual world remains a mystery. To one extreme, object and relation modeling may be implicitly implemented with inductive biases, for example by treating objects as tokens. To the other extreme, empirical results reveal the surprising finding that simply performing visual captioning, which tends to ignore spatial configuration of the objects, serves as a strong baseline for video understanding. We aim to answer the question: how can objects help video-language understanding in MLLMs? We tackle the question from the object representation and adaptation perspectives. Specifically, we investigate the trade-off between representation expressiveness (e.g. distributed versus symbolic) and integration difficulty (e.g. data-efficiency when learning the adapters). Through extensive evaluations on five video question answering datasets, we confirm that explicit integration of object-centric representation remains necessary, and the symbolic objects can be most easily integrated while being performant for question answering. We hope our findings can encourage the community to explore the explicit integration of perception modules into MLLM design. Our code and models will be publicly released.
Poster
G Thomas Hudson · Dean Slack · Thomas Winterbottom · Jamie Stirling · Chenghao Xiao · Junjie Shentu · Noura Al Moubayed

[ Exhibit Hall I ]

Abstract
Multimodal learning, which involves integrating information from various modalities such as text, images, audio, and video, is pivotal for numerous complex tasks like visual question answering, cross-modal retrieval, and caption generation. Traditional approaches rely on modality-specific encoders and late fusion techniques, which can hinder flexibility when adapting to new tasks or modalities. To address these limitations, we introduce a novel framework that extends the concept of task reformulation beyond natural language processing (NLP) to multimodal learning. We propose to reformulate diverse multimodal tasks into a unified next-frame prediction problem, allowing a single model to handle different modalities without modality-specific components. This method treats all inputs and outputs as sequential frames in a video, enabling seamless integration of modalities and effective knowledge transfer across tasks. Our approach is evaluated on a range of tasks, including text-to-text, image-to-text, video-to-text, and audio-to-text, demonstrating the model's ability to generalize across modalities with minimal adaptation. We show that task reformulation can significantly simplify multimodal model design across various tasks, laying the groundwork for more generalized multimodal foundation models.
Poster
Haoran Lou · Chunxiao Fan · Ziyan Liu · Yuexin Wu · Xinliang Wang

[ Exhibit Hall I ]

Abstract
The architecture of multimodal large language models (MLLMs) commonly connects a vision encoder, often based on CLIP-ViT, to a large language model. While CLIP-ViT works well for capturing global image features, it struggles to model local relationships between adjacent patches, leading to weaker visual representation, which in turn affects the detailed understanding ability of MLLMs. To solve this, we propose LLaVA-SP, which only adds six spatial visual tokens to the original visual tokens to enhance the visual representation. Our approach offers three key advantages: 1) We propose a novel Projector, which uses convolutional kernels to derive visual spatial tokens from ViT patch features, simulating two visual spatial ordering approaches: "from central region to global" and "from abstract to specific". Then, a cross-attention mechanism is applied to fuse fine-grained visual information, enriching the overall visual representation. 2) We present two model variants: LLaVA-SP-Cropping, which focuses on detail features through progressive cropping, and LLaVA-SP-Pooling, which captures global semantics through adaptive pooling, enabling the model to handle diverse visual understanding tasks. 3) Extensive experiments show that LLaVA-SP, fine-tuned with LoRA, achieves significant performance improvements across various multimodal benchmarks, outperforming the state-of-the-art LLaVA-1.5 model in multiple tasks with nearly identical inference latency. The code and …
Poster
Luca Barsellotti · Lorenzo Bianchi · Nicola Messina · Fabio Carrara · Marcella Cornia · Lorenzo Baraldi · Fabrizio Falchi · Rita Cucchiara

[ Exhibit Hall I ]

Abstract
Open-Vocabulary Segmentation (OVS) aims at segmenting images from free-form textual concepts without predefined training classes. While existing vision-language models such as CLIP can generate segmentation masks by leveraging coarse spatial information from Vision Transformers, they face challenges in spatial localization due to their global alignment of image and text features. Conversely, self-supervised visual models like DINO excel in fine-grained visual encoding but lack integration with language. To bridge this gap, we present Talk2DINO, a novel hybrid approach that combines the spatial accuracy of DINOv2 with the language understanding of CLIP. Our approach aligns the textual embeddings of CLIP to the patch-level features of DINOv2 through a learned mapping function without the need to fine-tune the underlying backbones. At training time, we exploit the attention maps of DINOv2 to selectively align local visual patches with textual embeddings. We show that the powerful semantic and localization abilities of Talk2DINO can enhance the segmentation process, resulting in more natural and less noisy segmentations, and that our approach can also effectively distinguish foreground objects from the background. Experimental results demonstrate that Talk2DINO achieves state-of-the-art performance across several unsupervised OVS benchmarks.
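The core mechanism described above, mapping text embeddings into a patch-feature space and reading off a similarity map, can be sketched generically. The mapper architecture and dimensions below are assumptions; both backbones are treated as frozen feature providers.

```python
# Sketch: project CLIP-style text embeddings into a DINOv2-style patch space and
# score each patch against each textual concept.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextToPatchMapper(nn.Module):
    def __init__(self, clip_dim=512, dino_dim=768):
        super().__init__()
        self.map = nn.Sequential(nn.Linear(clip_dim, dino_dim), nn.GELU(),
                                 nn.Linear(dino_dim, dino_dim))

    def forward(self, text_emb, patch_feats):     # (C, clip_dim), (H*W, dino_dim)
        t = F.normalize(self.map(text_emb), dim=-1)
        p = F.normalize(patch_feats, dim=-1)
        return p @ t.T                            # (H*W, C) patch-to-concept scores

mapper = TextToPatchMapper()
scores = mapper(torch.randn(3, 512), torch.randn(37 * 37, 768))  # 3 text concepts
seg = scores.argmax(dim=-1).view(37, 37)          # crude patch-level label map
```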
Poster
Minjoo Ki · Dae Jung Kim · Kisung Kim · Seon Joo Kim · Jinhan Lee

[ Exhibit Hall I ]

Abstract
Text-to-video retrieval serves as a powerful tool for navigating vast video databases. This is particularly useful in autonomous driving to retrieve scenes from a text query to simulate and evaluate the driving system in desired scenarios. However, traditional ranking-based retrieval methods often return partial matches that do not satisfy all query conditions. To address this, we introduce Inclusive Text-to-Video Retrieval, which retrieves only videos that meet all specified conditions, regardless of additional irrelevant elements. We propose CARIM, a framework for driving scene retrieval that employs inclusive text matching. By utilizing Vision-Language Model (VLM) and Large Language Model (LLM) to generate compressed captions for driving scenes, we transform text-to-video retrieval into a more efficient text-to-text retrieval problem, eliminating modality mismatches and heavy annotation costs. We introduce a novel positive and negative data curation strategy and an attention-based scoring mechanism tailored for driving scene retrieval. Experimental results on the DRAMA dataset demonstrate that CARIM outperforms state-of-the-art retrieval methods, excelling in edge cases where traditional models fail.
Poster
Yiming Zhang · Zhuokai Zhao · Zhaorun Chen · Zenghui Ding · Xianjun Yang · Yining Sun

[ Exhibit Hall I ]

Abstract
Recent advancements in multimodal large language models (MLLMs) have opened new avenues for video understanding. However, achieving high fidelity in zero-shot video tasks remains challenging. Traditional video processing methods rely heavily on fine-tuning to capture nuanced spatial-temporal details, which incurs significant data and computation costs. In contrast, training-free approaches, though efficient, often lack robustness in preserving context-rich features across complex video content. To this end, we propose DyTo, a novel dynamic token merging framework for zero-shot video understanding that adaptively optimizes token efficiency while preserving crucial scene details. DyTo integrates hierarchical frame selection and a bipartite token merging strategy to dynamically cluster key frames and selectively compress token sequences, striking a balance between computational efficiency and semantic richness. Extensive experiments across multiple benchmarks demonstrate the effectiveness of DyTo, achieving superior performance compared to both fine-tuned and training-free methods and setting a new state-of-the-art for zero-shot video understanding.
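Bipartite token merging, mentioned above, is a generic compression step; the sketch below shows one common variant (the exact scoring, schedule, and hierarchy used by the method are not reproduced): even-indexed tokens are matched to their most similar odd-indexed token and the r best matches are merged by averaging.

```python
# Sketch of a generic bipartite token-merging step for visual tokens.
import torch
import torch.nn.functional as F

def bipartite_merge(tokens, r):
    # tokens: (N, D); returns roughly N - r tokens
    a, b = tokens[0::2], tokens[1::2]
    sim = F.normalize(a, dim=-1) @ F.normalize(b, dim=-1).T   # (Na, Nb)
    best_sim, best_idx = sim.max(dim=-1)                      # best partner per a-token
    merge_order = best_sim.argsort(descending=True)
    merged_a, kept_a = merge_order[:r], merge_order[r:]
    b = b.clone()
    for i in merged_a:                                        # fold a-tokens into b
        j = best_idx[i]
        b[j] = (b[j] + a[i]) / 2
    return torch.cat([a[kept_a], b], dim=0)

tokens = torch.randn(196, 768)             # e.g. one frame's patch tokens
reduced = bipartite_merge(tokens, r=64)    # 196 -> 132 tokens
```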
Poster
Shaojie Zhang · Jiahui Yang · Jianqin Yin · Zhenbo Luo · Jian Luan

[ Exhibit Hall I ]

Abstract
Multimodal Large Language Models (MLLMs) have demonstrated significant success in visual understanding tasks. However, challenges persist in adapting these models for video comprehension due to the large volume of data and temporal complexity. Existing Video-LLMs using uniform frame sampling often struggle to capture the query-related crucial spatiotemporal clues of videos effectively. In this paper, we introduce Q-Frame, a novel approach for adaptive frame selection and multi-resolution scaling tailored to the video's content and the specific query. Q-Frame employs a training-free, plug-and-play strategy generated by a text-image matching network like CLIP, utilizing the Gumbel-Max trick for efficient frame selection. Q-Frame allows Video-LLMs to process more frames without exceeding computational limits, thereby preserving critical temporal and spatial information. We demonstrate Q-Frame's effectiveness through extensive experiments on benchmark datasets, including MLVU, LongVideoBench, and Video-MME, illustrating its superiority over existing methods and its applicability across various video understanding tasks.
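The Gumbel-Max trick named above is a standard way to sample high-scoring items stochastically; here is a hedged toy version driven by query-to-frame similarity scores (how the method computes scores and handles multi-resolution scaling is not shown, and the function names are illustrative).

```python
# Sketch: Gumbel-perturbed top-k selection over frame-matching scores.
import torch

def gumbel_max_select(frame_scores, num_frames):
    # frame_scores: (T,) text-to-frame similarities; adding Gumbel noise before top-k
    # favors high-scoring frames while keeping some stochastic diversity.
    u = torch.rand_like(frame_scores).clamp_min(1e-9)
    gumbel = -torch.log(-torch.log(u))
    return (frame_scores + gumbel).topk(num_frames).indices.sort().values

scores = torch.randn(256)                  # similarities for 256 candidate frames
selected = gumbel_max_select(scores, 16)   # indices of 16 frames to feed the model
```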
Poster
Jinsol Song · Jiamu Wang · Anh Nguyen · Keunho Byeon · Sangjeong Ahn · Sung Hak Lee · Jin Tae Kwak

[ Exhibit Hall I ]

Abstract
Anomaly detection aims to identify rare and scarce anomalies, which is particularly challenging in computational pathology, where disease-related data are often limited or nonexistent. Existing anomaly detection methods, primarily designed for industrial settings, face limitations in pathology due to computational constraints, diverse tissue structures, and lack of interpretability. To address these challenges, we propose Ano-NAViLa, a normal and abnormal pathology knowledge-augmented vision-language model for anomaly detection in pathology images. Ano-NAViLa utilizes a pre-trained vision-language model with a lightweight trainable MLP, facilitating computational efficiency. By incorporating both normal and abnormal pathology knowledge, Ano-NAViLa enhances accuracy and robustness to variability in pathology images and provides interpretability through image-text associations. Evaluated on two lymph node datasets from different organs, Ano-NAViLa achieves state-of-the-art performance in anomaly detection and localization, outperforming competing models.
Poster
Matthias Kümmerer · Harneet Singh Khanuja · Matthias Bethge

[ Exhibit Hall I ]

Abstract
Recent advances in image-based saliency prediction are approaching gold standard performance levels on existing benchmarks. Despite this success, we show that predicting fixations across multiple saliency datasets remains challenging due to dataset bias. We find a significant performance drop (around 40%) when models trained on one dataset are applied to another. Surprisingly, increasing dataset diversity does not resolve this *inter-dataset gap*, with close to 60% attributed to dataset-specific biases. To address this remaining *generalization gap*, we propose a novel architecture extending a mostly dataset-agnostic encoder-decoder structure with fewer than 20 dataset-specific parameters that govern interpretable mechanisms such as multi-scale structure, center bias, and fixation spread. Adapting only these parameters to new data accounts for more than 75% of the generalization gap, with a large fraction of the improvement achieved with as few as 50 samples. Our model sets a new state-of-the-art on all three datasets of the MIT/Tuebingen Saliency Benchmark (MIT300, CAT2000, and COCO-Freeview), even when purely generalizing from unrelated datasets, but with a substantial boost when adapting to the respective training datasets. The model also provides valuable insights into spatial saliency properties, revealing complex multi-scale effects that combine both absolute and relative sizes.
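The idea of a handful of interpretable, dataset-specific parameters on top of a shared decoder can be illustrated with a tiny head; the sketch below adapts only a Gaussian center-bias strength and width (just two illustrative parameters, standing in for the paper's fewer-than-20).

```python
# Sketch: a dataset-specific center-bias prior added in log-space to a shared
# saliency decoder's output, then renormalized.
import torch
import torch.nn as nn

class DatasetSpecificHead(nn.Module):
    def __init__(self):
        super().__init__()
        self.center_sigma = nn.Parameter(torch.tensor(0.35))   # width of center bias
        self.center_weight = nn.Parameter(torch.tensor(1.0))   # strength of center bias

    def forward(self, log_density):                             # (B, 1, H, W)
        B, _, H, W = log_density.shape
        ys = torch.linspace(-1, 1, H, device=log_density.device)
        xs = torch.linspace(-1, 1, W, device=log_density.device)
        yy, xx = torch.meshgrid(ys, xs, indexing="ij")
        center = -(xx ** 2 + yy ** 2) / (2 * self.center_sigma ** 2)
        out = log_density + self.center_weight * center         # log-space prior
        return out - out.flatten(1).logsumexp(dim=1).view(B, 1, 1, 1)  # renormalize

head = DatasetSpecificHead()                 # only these parameters adapt per dataset
saliency = head(torch.randn(2, 1, 48, 64))
```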
Poster
Bingchen Gong · Diego Gomez · Abdullah Hamdi · Abdelrahman Eldesokey · Ahmed Abdelreheem · Peter Wonka · Maks Ovsjanikov

[ Exhibit Hall I ]

Abstract
We propose a novel zero-shot approach for keypoint detection on 3D shapes. Point-level reasoning on visual data is challenging as it requires precise localization capability, posing problems even for powerful models like DINO or CLIP. Traditional methods for 3D keypoint detection rely heavily on annotated 3D datasets and extensive supervised training, limiting their scalability and applicability to new categories or domains. In contrast, our method utilizes the rich knowledge embedded within Multi-Modal Large Language Models (MLLMs). Specifically, we demonstrate, for the first time, that pixel-level annotations used to train recent MLLMs can be exploited for both extracting and naming salient keypoints on 3D models without any ground truth labels or supervision. Experimental evaluations demonstrate that our approach achieves competitive performance on standard benchmarks compared to supervised methods, despite not requiring any 3D keypoint annotations during training. Our results highlight the potential of integrating language models for localized 3D shape understanding. This work opens new avenues for cross-modal learning and underscores the effectiveness of MLLMs in contributing to 3D computer vision challenges.
Poster
Fengzhe Zhou · Humphrey Shi

[ Exhibit Hall I ]

Abstract
Recently, Mask2Former has achieved significant success as a universal image segmentation framework, with its Multi-Scale Deformable Attention (MSDeformAttn) Pixel Decoder becoming a widely adopted component in current segmentation models. However, the inefficiency of MSDeformAttn has become a performance bottleneck for segmenters. To address this, we propose the Hyper Pixel Decoder (HyPiDecoder), an improved Pixel Decoder design that replaces parts of the MSDeformAttn layers with convolution-based FPN layers, introducing explicit locality information and significantly boosting inference speed. Experimental results show that HyPiDecoder can be applied to both universal segmentation models and unified segmentation and detection models, achieving improvements in both speed and accuracy across object detection, semantic, instance, and panoptic segmentation tasks. The Mask DINO model integrated with HyPiDecoder achieves a new SOTA of 58.8 PQ on COCO panoptic segmentation with SwinL-scale backbone and no extra training data, with a 127\% increase in inference speed compared to the original model. Code will be released in the future.
Poster
Yanqi Li · Jianwei Niu · Tao Ren

[ Exhibit Hall I ]

Abstract
Open-Vocabulary Object Detection (OVOD) aims to localize and recognize objects from both known and novel categories. However, existing methods rely heavily on internal knowledge from Vision-Language Models (VLMs), restricting their generalization to unseen categories due to limited contextual understanding. To address this, we propose CODet, a plug-and-play framework that enhances OVOD by integrating object co-occurrence, a form of external contextual knowledge pervasive in real-world scenes. Specifically, CODet extracts visual co-occurrence patterns from images, aligns them with textual dependencies validated by Large Language Models (LLMs), and injects contextual co-occurrence pseudo-labels as external knowledge to guide detection. Without architectural changes, CODet consistently improves five state-of-the-art VLM-based detectors across two benchmarks, achieving notable gains (up to +2.3 AP on novel categories). Analyses further confirm its ability to encode meaningful contextual guidance, advancing open-world perception by bridging visual and textual co-occurrence knowledge.
Poster
Bingqing Zhang · Zhuo Cao · Heming Du · Yang Li · Xue Li · Jiajun Liu · Sen Wang

[ Exhibit Hall I ]

Abstract
Despite recent advances, Text-to-video retrieval (TVR) is still hindered by multiple inherent uncertainties, such as ambiguous textual queries, indistinct text-video mappings, and low-quality video frames. Although interactive systems have emerged to address these challenges by refining user intent through clarifying questions, current methods typically rely on heuristic or ad-hoc strategies without explicitly quantifying these uncertainties, limiting their effectiveness. Motivated by this gap, we propose UMIVR, an Uncertainty-Minimizing Interactive Text-to-Video Retrieval framework that explicitly quantifies three critical uncertainties—text ambiguity, mapping uncertainty, and frame uncertainty—via principled, training-free metrics: semantic entropy-based Text Ambiguity Score (TAS), Jensen–Shannon divergence-based Mapping Uncertainty Score (MUS), and a Temporal Quality-based Frame Sampler (TQFS). By adaptively generating targeted clarifying questions guided by these uncertainty measures, UMIVR iteratively refines user queries, significantly reducing retrieval ambiguity. Extensive experiments on multiple benchmarks validate UMIVR's effectiveness, achieving notable gains in Recall@1 (69.2\% after 10 interactive rounds) on the MSR-VTT-1k dataset, thereby establishing an uncertainty-minimizing foundation for interactive TVR.
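Two of the quantities named above, semantic entropy and Jensen-Shannon divergence, have standard formulas; a minimal sketch follows. How UMIVR derives the input distributions is not shown here, and the example inputs are placeholders.

```python
# Sketch: entropy of a distribution and JS divergence between two distributions.
import torch

def entropy(p, eps=1e-12):
    p = p / p.sum(dim=-1, keepdim=True)
    return -(p * (p + eps).log()).sum(dim=-1)

def js_divergence(p, q, eps=1e-12):
    p = p / p.sum(dim=-1, keepdim=True)
    q = q / q.sum(dim=-1, keepdim=True)
    m = 0.5 * (p + q)
    kl = lambda a, b: (a * ((a + eps) / (b + eps)).log()).sum(dim=-1)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# e.g. ambiguity of a query as the entropy over its candidate interpretations, and
# mapping uncertainty as the JS divergence between two retrieval score profiles
query_ambiguity = entropy(torch.tensor([0.4, 0.3, 0.2, 0.1]))
mapping_uncertainty = js_divergence(torch.softmax(torch.randn(100), -1),
                                    torch.softmax(torch.randn(100), -1))
```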
Poster
Ji Du · Xin WANG · Fangwei Hao · Mingyang Yu · Chunyuan Chen · Jiesheng Wu · Bin Wang · Jing Xu · Ping Li

[ Exhibit Hall I ]

Abstract
At the core of Camouflaged Object Detection (COD) lies segmenting objects from their highly similar surroundings. Previous efforts navigate this challenge primarily through image-level modeling or annotation-based optimization. Despite advancing considerably, this commonplace practice hardly taps valuable dataset-level contextual information or relies on laborious annotations. In this paper, we propose RISE, a RetrIeval SElf-augmented paradigm that exploits the entire training dataset to generate pseudo-labels for single images, which could be used to train COD models. RISE begins by constructing prototype libraries for environments and camouflaged objects using training images (without ground truth), followed by K-Nearest Neighbor (KNN) retrieval to generate pseudo-masks for each image based on these libraries. It is important to recognize that using only training images without annotations poses a pronounced challenge in crafting high-quality prototype libraries. In this light, we introduce a Clustering-then-Retrieval (CR) strategy, where coarse masks are first generated through clustering, facilitating subsequent histogram-based image filtering and cross-category retrieval to produce high-confidence prototypes. In the KNN retrieval stage, to alleviate the effect of artifacts in feature maps, we propose Multi-View KNN Retrieval (MVKR), which integrates retrieval results from diverse views to produce more robust and precise pseudo-masks. Extensive experiments demonstrate that RISE significantly outperforms state-of-the-art …
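The KNN retrieval step described above can be sketched in a few lines; the version below is a simplified stand-in (prototype construction, filtering, and the multi-view aggregation are omitted, and all shapes and names are assumptions).

```python
# Sketch: KNN-based pseudo-mask from object and environment prototype libraries.
import torch
import torch.nn.functional as F

def knn_pseudo_mask(patch_feats, obj_protos, env_protos, k=5):
    # patch_feats: (H*W, D); *_protos: (M, D) prototype library entries
    f = F.normalize(patch_feats, dim=-1)
    obj_sim = (f @ F.normalize(obj_protos, dim=-1).T).topk(k, dim=-1).values.mean(-1)
    env_sim = (f @ F.normalize(env_protos, dim=-1).T).topk(k, dim=-1).values.mean(-1)
    return (obj_sim > env_sim).float()       # 1 = camouflaged object, 0 = environment

feats = torch.randn(64 * 64, 256)
mask = knn_pseudo_mask(feats, torch.randn(500, 256), torch.randn(500, 256))
mask = mask.view(64, 64)                     # coarse pseudo-mask on the patch grid
```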
Poster
Weitian Wang · Shubham rai · Cecilia De la Parra · Akash Kumar

[ Exhibit Hall I ]

Abstract
In this paper, we propose MixA-Q, a mixed-precision activation quantization framework that leverages intra-layer activation sparsity (a concept widely explored in activation pruning methods) for efficient inference of quantized window-based vision transformers. For a given uniform-bit quantization configuration, MixA-Q separates the batched window computations within Swin blocks and assigns a lower bit width to the activations of less important windows,improving the trade-off between model performance and efficiency. We introduce a Two-Branch Swin Block that processes activations separately in high- and low-bit precision, enabling seamless integration of our method with most quantization-aware training (QAT) and post-training quantization (PTQ) methods, or with simple modifications. Our experimental evaluations over the COCO dataset demonstrate that MixA-Q achieves a training-free 1.35× computational speedup without accuracy loss in PTQ configuration. With QAT, MixA-Q achieves a lossless 1.25× speedup and a 1.53× speedup with only a 1\% mAP drop by incorporating activation pruning. Notably, by reducing the quantization error in important regions, our sparsity-aware quantization adaptation improves the mAP of the quantized W4A4 model (with both weights and activations in 4-bit precision) by 0.7\%, reducing quantization degradation by 24\%.
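The window-wise mixed-precision idea can be illustrated with a toy routine; the importance measure, keep ratio, and shared-scale quantizer below are assumptions made for brevity, not the paper's design.

```python
# Sketch: quantize less important windows' activations at a lower bit width.
import torch

def fake_quant(x, bits):
    qmax = 2 ** bits - 1
    scale = x.abs().amax().clamp_min(1e-8) / qmax   # one shared scale per call, for simplicity
    return torch.round(x / scale).clamp(0, qmax) * scale   # assumes non-negative activations

def mixed_precision_windows(win_acts, keep_ratio=0.5, high_bits=8, low_bits=4):
    # win_acts: (num_windows, tokens, dim); importance = mean absolute activation
    importance = win_acts.abs().mean(dim=(1, 2))
    k = int(keep_ratio * win_acts.size(0))
    high = importance.topk(k).indices
    out = fake_quant(win_acts, low_bits)            # default: low precision
    out[high] = fake_quant(win_acts[high], high_bits)  # important windows: high precision
    return out

acts = torch.relu(torch.randn(64, 49, 96))          # post-ReLU window activations
quantized = mixed_precision_windows(acts)
```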
Poster
Wentao Xiang · Haoxian Tan · Cong Wei · Yujie Zhong · Dengjie Li · Yujiu Yang

[ Exhibit Hall I ]

Abstract
Perception is a fundamental task in the field of computer vision, encompassing a diverse set of subtasks that can be systematically categorized into four distinct groups based on two critical dimensions: prediction type and instruction type. Notably, existing research often focuses solely on a limited subset of these potential combinations, which constrains its applicability and versatility across various contexts. In response to this challenge, we present MVP, a novel and unified Visual Large Language Model (VLLM) framework designed to integrate both word-based and sentence-based perception tasks alongside box and mask predictions, all within a single framework. MVP employs an innovative multi-granularity decoder coupled with a unified prompt template, which together enable the seamless joint training of a wide array of tasks, including but not limited to panoptic segmentation, detection, grounding, and referring expression segmentation. Furthermore, we introduce a query enhancement strategy aimed at harnessing the decoding and generative capabilities inherent in large language models. Extensive experiments conducted across a range of benchmarks in both word-based and sentence-based perception tasks substantiate the efficacy of our framework.
Poster
Sofiène Boutaj · Marin Scalbert · Pierre Marza · Florent Couzinie-Devy · Maria Vakalopoulou · Stergios Christodoulidis

[ Exhibit Hall I ]

Abstract
Whole slide image (WSI) analysis in digital pathology presents unique challenges due to the gigapixel resolution of WSIs and the scarcity of dense supervision signals. While Multiple Instance Learning (MIL) is a natural fit for slide-level tasks, training robust models requires large and diverse datasets. Even though image augmentation techniques could be utilized to increase data variability and reduce overfitting, implementing them effectively is not a trivial task. Traditional patch-level augmentation is prohibitively expensive due to the large number of patches extracted from each WSI, and existing feature-level augmentation methods lack control over transformation semantics. We introduce HistAug, a fast and efficient generative model for controllable augmentations in the latent space for digital pathology. By conditioning on explicit patch-level transformations (e.g., hue, erosion), HistAug generates realistic augmented embeddings while preserving initial semantic information. Our method allows the processing of a large number of patches in a single forward pass efficiently, while at the same time consistently improving MIL model performance. Experiments across multiple slide-level tasks and diverse organs show that HistAug outperforms existing methods, particularly in low-data regimes. Ablation studies confirm the benefits of learned transformations over noise-based perturbations and highlight the importance of uniform WSI-wise augmentation.
Poster
Manahil Raza · Ayesha Azam · Talha Qaiser · Nasir Rajpoot

[ Exhibit Hall I ]

Abstract
Current multimodal fusion approaches in computational oncology primarily focus on integrating multi-gigapixel histology whole-slide images (WSIs) with genomic or transcriptomic data, demonstrating improved survival prediction. We hypothesize that incorporating pathology reports can further enhance prognostic performance. Pathology reports, as essential components of clinical workflows, offer readily available complementary information by summarizing histopathological findings and integrating expert interpretations and clinical context. However, fusing these modalities poses challenges due to their heterogeneous nature. WSIs are high-dimensional, each containing several billion pixels, whereas pathology reports consist of concise text summaries of varying lengths, leading to potential modality imbalance. To address this, we propose a prototype-based approach to generate balanced representations, which are then integrated using a Transformer-based fusion model for survival prediction that we term PS3 (Predicting Survival from Three Modalities). Specifically, we present: (1) Diagnostic prototypes from pathology reports, leveraging self-attention to extract diagnostically relevant sections and standardize text representation; (2) Histological prototypes to compactly represent key morphological patterns in WSIs; and (3) Biological pathway prototypes to encode transcriptomic expressions, accurately capturing cellular functions. PS3, the three-modal transformer model, processes the resulting prototype-based multimodal tokens and models intra-modal and cross-modal interactions across pathology reports, WSIs and transcriptomic data. The proposed model outperforms …
Poster
Chenghao Xiao · Isaac Chung · Imene Kerboua · Jamie Stirling · Xin Zhang · Márton Kardos · Roman Solomatin · Noura Al Moubayed · Kenneth Enevoldsen · Niklas Muennighoff

[ Exhibit Hall I ]

Abstract
Image representation learning and image-text alignment have advanced rapidly, becoming key components in multi-modal research. However, these advancements are often evaluated through distinct, task-specific protocols, leading to a fragmented understanding of model capabilities. For instance, it is unclear how capabilities measured by linear probing translate to retrieval and vice-versa. We introduce the Massive Image Embedding Benchmark (MIEB), a comprehensive benchmark designed to evaluate the capabilities of image embeddings across the broadest spectrum of tasks to date. MIEB spans 8 task categories, covering 130 tasks and a total of 39 languages. By benchmarking the performance of 50 models, MIEB uncovers hidden capabilities of advanced vision models beyond semantic alignment, such as their accurate visual representation of text; but also reveals their yet limited capabilities in robust compositionality and interleaved encoding. The benchmark aims to provide insights for guiding the design of universal image embeddings that encode multi-modal information. Additionally, we show that vision encoders' performance on MIEB tasks highly correlates with MLLMs' performance on downstream tasks, such as Visual STS tasks' over $99\%$ correlation with MLLMs' performance on OCRBench and TextVQA. Our findings underscore the importance of assessing vision embeddings beyond classification and retrieval tasks, highlighting their role in building multi-modal …
Poster
Takumi Kobayashi

[ Exhibit Hall I ]

Abstract
While deep models are effectively trained based on a softmax cross-entropy loss, a cosine-based softmax loss also works for producing favorable feature embeddings. In the cosine-based softmax, temperature plays a crucial role in properly scaling the logits of cosine similarities, though it is typically tuned manually in ad-hoc ways, as there is little prior knowledge about the temperature. In this paper, we address the challenging problem of adaptively estimating the temperature of the cosine-based softmax in the framework of supervised image classification. By analyzing the cosine-based softmax representation from a geometrical viewpoint regarding features and classifiers, we construct a criterion in a least-squares fashion which enables us to optimize the temperature at each sample via a simple greedy search. Besides, our thorough analysis of the temperature clarifies that feature embedding by the cosine-based softmax loss is endowed with diverse characteristics which are controllable by the temperature in an explainable way. The experimental results demonstrate that our optimized temperature helps determine a feasible range of temperatures to control the feature characteristics and produces favorable performance on various image classification tasks.
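For reference, the cosine-based softmax with a temperature has a standard form; the snippet below shows it with a fixed temperature (the paper's per-sample greedy search for the temperature is not reproduced, and the value 0.05 is only a placeholder).

```python
# Cosine-based softmax cross-entropy: cosine similarities scaled by 1 / temperature.
import torch
import torch.nn.functional as F

def cosine_softmax_loss(features, class_weights, labels, temperature=0.05):
    f = F.normalize(features, dim=-1)        # L2-normalized features
    w = F.normalize(class_weights, dim=-1)   # L2-normalized class vectors
    logits = f @ w.T / temperature           # scaled cosine logits
    return F.cross_entropy(logits, labels)

loss = cosine_softmax_loss(torch.randn(8, 512), torch.randn(100, 512),
                           torch.randint(0, 100, (8,)))
```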
Poster
Matt De Vries · Reed Naidoo · Olga Fourkioti · Lucas Dent · Nathan Curry · Chris Dunsby · Chris Bakal

[ Exhibit Hall I ]

Abstract
Understanding 3D cell shape is crucial in biomedical research, where morphology serves as a key indicator of disease, cellular state, and drug response. However, existing 3D point cloud classification models often lack interpretability, making it difficult to extract biologically meaningful insights. To address this, we propose PointMIL, an inherently interpretable point cloud classifier using Multiple Instance Learning (MIL). Unlike other methods that rely on global interpretations, PointMIL simultaneously improves accuracy of point cloud-based classifier backbones and provides fine-grained, point-specific explanations, pinpointing the most informative regions of 3D shapes, without requiring $\textit{post-hoc}$ analysis. We demonstrate PointMIL on two publicly available datasets of biological cells showing state-of-the-art mACC (97.3\%) and F1 (97.5\%) on the IntrA biomedical dataset. Additionally, we introduce a novel dataset of drug-treated cancer cells (Morph3DCell), to show PointMIL's ability to reveal the morphological effects of drug treatments at a fine-grained level, with implications for drug discovery and mechanism-of-action prediction. Beyond biomedical applications, we show that PointMIL also offers quality interpretations and improves the classification accuracy on standard shape benchmarks such as ModelNet40 and ScanObjectNN, demonstrating its generalisation to broader 3D object recognition tasks.
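Attention-based MIL pooling is one standard way to obtain both a bag-level prediction and per-instance importance scores of the kind described above; the sketch below is generic (the point-cloud backbone and the paper's exact head are abstracted away, and all sizes are assumptions).

```python
# Sketch: attention-based MIL pooling over per-point embeddings.
import torch
import torch.nn as nn

class AttentionMILHead(nn.Module):
    def __init__(self, dim=256, num_classes=2):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, 128), nn.Tanh(), nn.Linear(128, 1))
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, point_feats):                    # (B, N_points, dim)
        attn = self.score(point_feats).softmax(dim=1)  # per-point weights (B, N, 1)
        bag = (attn * point_feats).sum(dim=1)          # weighted bag embedding
        return self.classifier(bag), attn.squeeze(-1)  # class logits + point importances

head = AttentionMILHead()
logits, point_importance = head(torch.randn(4, 1024, 256))
```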
Poster
Wenliang Zhong · Rob Barton · Weizhi An · Feng Jiang · Hehuan Ma · Yuzhi Guo · Abhishek Dan · Shioulin Sam · Karim Bouyarmane · Junzhou Huang

[ Exhibit Hall I ]

Abstract
Composed Image Retrieval (CIR) targets the retrieval of images conditioned on a reference image and a textual modification, but constructing labeled triplets (reference image, textual modification, target image) is inherently challenging. Existing Zero-Shot CIR (ZS-CIR) approaches often rely on well-aligned vision-language models (VLMs) to combine visual and textual inputs, or use large language models (LLMs) for richer modification understanding. While LLM-based methods excel in capturing textual details, they are computationally costly, slow to infer, and often restricted by proprietary constraints. In this paper, we argue that the superior performance of LLM-based ZS-CIR methods primarily stems from their capacity to follow instructions, an aspect largely missing in more efficient projection-based models built upon VLMs. To bridge this gap, we introduce DistillCIR, a dual-stream distillation framework that transfers LLMs’ instruction-following capability into compact, projection-based architectures. By synthesizing triplet data with an LLM and incorporating a novel reasoning process, DistillCIR learns both composed retrieval and instruction awareness. In addition, we train an open-source multimodal LLM on the generated data, and further distill its instruction-aware embeddings into the projection-based model. Without any reliance on LLMs at inference, DistillCIR significantly surpasses state-of-the-art ZS-CIR methods in both performance and efficiency, offering a promising direction for instruction-aware, …
Poster
Zishu Qin · Junhao Xu · Weifeng Ge

[ Exhibit Hall I ]

Abstract
Deep learning algorithms are highly data-intensive, particularly for tasks requiring pixel-level annotations, such as semantic segmentation, which makes achieving pixel-level image understanding costly. Few-shot segmentation seeks to address this challenge by enabling models to segment novel objects using only a limited number of labeled support images as references. In this paper, we argue that the traditional image-to-mask decoding framework places excessive reliance on the quality of the support sample, which is prone to errors when encountering class bias. Thus, we propose a novel image-to-mask denoising learning paradigm for few-shot segmentation, transforming mask decoding into a denoising process to reduce the support reliance problem with the help of denoising diffusion models. We formulate our image-to-mask denoising learning process in two stages: an image corruption stage and a mask denoising stage. In the first stage, we introduce an adaptive image corruption method that perturbs the image based on regional semantics, motivated by the insight of perturbing data to populate low data density regions. In the second stage, we employ an in-model denoising paradigm, designing a network to facilitate support-to-query semantic propagation and mask denoising in a single forward pass. To enhance categorical discrimination for the denoising network, we incorporate discriminative attribute learning, …
Poster
Jiawen Zhu · YEW-SOON ONG · Chunhua Shen · Guansong Pang

[ Exhibit Hall I ]

Abstract
Current zero-shot anomaly detection (ZSAD) methods show remarkable success in prompting large pre-trained vision-language models to detect anomalies in a target dataset without using any dataset-specific training or demonstration. However, these methods often focus on crafting/learning prompts that capture only coarse-grained semantics of abnormality, e.g., high-level semantics like "damaged", "imperfect", or "defective" objects. They therefore have limited capability in recognizing diverse abnormality details that deviate from these general abnormal patterns in various ways. To address this limitation, we propose FAPrompt, a novel framework designed to learn Fine-grained Abnormality Prompts for accurate ZSAD. To this end, a novel Compound Abnormality Prompt learning (CAP) module is introduced in FAPrompt to learn a set of complementary, decomposed abnormality prompts, where abnormality prompts are enforced to model diverse abnormal patterns derived from the same normality semantic. On the other hand, the fine-grained abnormality patterns can be different from one dataset to another. To enhance the cross-dataset generalization, another novel module, namely Data-dependent Abnormality Prior learning (DAP), is introduced in FAPrompt to learn a sample-wise abnormality prior from abnormal features of each test image to dynamically adapt the abnormality prompts to individual test images. Comprehensive experiments on 19 real-world datasets, covering both industrial defects and …
Poster
Yi Chen · Yuying Ge · Weiliang Tang · Yizhuo Li · Yixiao Ge · Mingyu Ding · Ying Shan · Xihui Liu

[ Exhibit Hall I ]

Abstract
Recent developments in Large Language Models (LLMs) pre-trained on extensive corpora have shown significant success in various natural language processing (NLP) tasks with minimal fine-tuning. This success offers new promise for robotics, which has long been constrained by the high cost of action-labeled data. We ask: given the abundant video data containing interaction-related knowledge available as a rich "corpus", **can a similar generative pre-training approach be effectively applied to enhance robot learning?** The key challenge is to identify an effective representation for autoregressive pre-training that benefits robot manipulation tasks. Inspired by the way humans learn new skills through observing dynamic environments, we propose that effective robotic learning should emphasize motion-related knowledge, which is closely tied to low-level actions and is hardware-agnostic, facilitating the transfer of learned motions to actual robot actions. To this end, we introduce **Moto**, which converts video content into latent **Mo**tion **To**ken sequences by a Latent Motion Tokenizer, learning a bridging "language" of motion from videos in an unsupervised manner. We pre-train Moto-GPT through motion token autoregression, enabling it to capture diverse visual motion knowledge. After pre-training, Moto-GPT demonstrates the promising ability to produce semantically interpretable motion tokens, predict plausible motion trajectories, and assess trajectory rationality through output …
Poster
Hanyi Wang · Han Fang · Shi-Lin Wang · Ee-Chien Chang

[ Exhibit Hall I ]

Abstract
Generative image watermarking enables the proactive detection and traceability of generated images. Among existing methods, inversion-based frameworks achieve highly concealed watermark embedding by injecting watermarks into the latent representation before the diffusion process. The robustness of this approach hinges on both the embedding mechanism and inversion accuracy. However, prior works have predominantly focused on optimizing the embedding process while overlooking inversion errors, which significantly affect extraction fidelity. In this paper, we address the challenge of inversion errors and propose ROAR, a dual-domain optimization-based framework designed to mitigate errors arising from two key sources: 1) Latent-domain errors, which accumulate across inversion steps due to inherent approximation assumptions. 2) Pixel-domain errors, which result from channel distortions such as JPEG compression. To tackle these issues, we introduce two novel components: A \textbf{Regeneration-based Optimization (RO)} mechanism, which incorporates an optimizable starting latent to minimize latent-domain errors; A Mixture of Experts (MoE)-based \textbf{distortion-adaptive restoration (AR)} network, which effectively recovers watermarked distributions from pixel-level distortions. Extensive experiments demonstrate that ROAR significantly reduces inversion errors and enhances watermark extraction robustness, thereby improving the reliability of generative image watermarking.
Poster
Lujun Li · Cheng Lin · Dezhi Li · You-Liang Huang · Wei Li · Tianyu Wu · Jie Zou · Wei Xue · Sirui Han · Yike Guo

[ Exhibit Hall I ]

Abstract
Low-Rank Adaptation (LoRA) has become a popular paradigm for fine-tuning large models, but it still necessitates a substantial number of training parameters. To address this issue, we first conduct comprehensive empirical studies on parameter-efficient LoRA structures. Then, we establish design guidelines that emphasize the use of serial structures, optimal placements, and nested LoRA. Based on these insights, we present NoRA, a nested parameter-efficient LoRA structure that revolutionizes the initialization and fine-tuning of projection matrices. NoRA's approach involves freezing outer layer LoRA weights and employing a serial inner layer design, enabling precise task-specific adaptations while maintaining compact training parameters. In addition, we propose an activation-aware Singular Value Decomposition (AwSVD) that adjusts the weight matrices based on activation distributions for initialization of outer layer LoRA weights. This scheme enhances decomposition accuracy and mitigates computational errors. Extensive evaluations across multiple large models demonstrate that NoRA outperforms state-of-the-art LoRA variants, achieving significant improvements in the performance-efficiency trade-off on visual few-shot tasks, visual instruction tuning, and subject-driven generation. Codes are available in the supplementary materials.
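To make the nesting idea concrete, the sketch below shows one plausible way to wrap a linear layer with a frozen outer low-rank pair and a trainable serial inner pair. The ranks, the initialization (the paper initializes the outer weights with AwSVD; random values are used here for brevity), and the class name are illustrative assumptions, not NoRA's exact implementation.

```python
import torch
import torch.nn as nn

class NestedLoRALinear(nn.Module):
    """Rough sketch of a nested LoRA layer: a frozen outer down/up projection pair
    with a trainable inner low-rank pair inserted serially between them."""

    def __init__(self, base: nn.Linear, r_outer=16, r_inner=4):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False
        d_out, d_in = base.weight.shape
        # frozen outer LoRA (in the paper initialized from an activation-aware SVD)
        self.A_out = nn.Parameter(torch.randn(r_outer, d_in) * 0.01, requires_grad=False)
        self.B_out = nn.Parameter(torch.randn(d_out, r_outer) * 0.01, requires_grad=False)
        # trainable inner LoRA; B_in starts at zero so the adaptation starts as identity
        self.A_in = nn.Parameter(torch.randn(r_inner, r_outer) * 0.01)
        self.B_in = nn.Parameter(torch.zeros(r_outer, r_inner))

    def forward(self, x):
        h = x @ self.A_out.T                   # frozen outer down-projection
        h = h @ self.A_in.T @ self.B_in.T      # trainable inner (serial) adaptation
        return self.base(x) + h @ self.B_out.T # frozen outer up-projection

layer = NestedLoRALinear(nn.Linear(768, 768))
y = layer(torch.randn(2, 768))  # only A_in and B_in receive gradients
```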
Poster
Dohwan Ko · Ji Soo Lee · Minhyuk Choi · Zihang Meng · Hyunwoo Kim

[ Exhibit Hall I ]

Abstract
Text-Video Retrieval has been extensively studied to accurately retrieve the most relevant text (or video) candidate given a video (or text) query from large-scale online databases. With the advancement of multi-modal large language models (MLLMs), recent studies have proposed MLLM-based retrieval systems to enhance retrieval performance, particularly for long and complex query-candidate pairs. However, we observe that the naive application of MLLMs, $\textit{i.e.}$, retrieval based on candidate likelihood, introduces $\textit{candidate prior bias}$, wherein candidates with inherently higher prior probabilities are favored over those that are more relevant to the query. To this end, we propose a novel retrieval framework, Bidirectional Likelihood Estimation with MLLM ($\textbf{BLiM}$), which leverages query likelihood as well as candidate likelihood by training the model to generate text from a given video as well as video features from a given text. Furthermore, we introduce Candidate Prior Normalization ($\textbf{CPN}$), a simple yet effective training-free score calibration module designed to mitigate candidate prior bias in candidate likelihood. On four Text-Video Retrieval benchmarks, our BLiM equipped with CPN outperforms previous state-of-the-art models by an average margin of 6.4 in R@1, effectively alleviating candidate prior bias and emphasizing the relevance between the query and candidate. Our in-depth analysis across various multi-modal …
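The scoring rule can be summarized in a few lines. The sketch below assumes precomputed log-likelihoods and a simple subtractive prior correction weighted by `alpha`; the function name and the exact form of CPN are assumptions for illustration rather than the paper's implementation.

```python
import torch

def blim_score(log_p_cand_given_query, log_p_query_given_cand, log_p_cand_prior, alpha=1.0):
    """Hypothetical scoring rule in the spirit of BLiM + CPN.

    log_p_cand_given_query : (N,) candidate likelihood given the query (e.g., text given video)
    log_p_query_given_cand : (N,) query likelihood given each candidate (e.g., video given text)
    log_p_cand_prior       : (N,) unconditional candidate likelihood, used to cancel prior bias
    """
    # Combine both likelihood directions, then subtract the candidate prior so that
    # inherently "likely" candidates are not favored regardless of the query.
    return log_p_cand_given_query + log_p_query_given_cand - alpha * log_p_cand_prior

# Toy usage: scores for 3 candidates; the retrieved candidate is the argmax.
scores = blim_score(torch.tensor([-3.1, -2.0, -2.5]),
                    torch.tensor([-4.0, -3.9, -2.8]),
                    torch.tensor([-1.0, -0.2, -0.9]))
print(scores.argmax().item())
```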
Poster
Haochen Wang · Qirui Chen · Cilin Yan · Jiayin Cai · Xiaolong Jiang · Yao Hu · Weidi Xie · Stratis Gavves

[ Exhibit Hall I ]

Abstract
Video Large Language Models (VideoLLMs) have recently demonstrated remarkable progress in general video understanding. However, existing models primarily focus on high-level comprehension and are limited to text-only responses, restricting the flexibility for object-centric, multi-round interactions. In this paper, we make three contributions: (i) we address these limitations by introducing a VideoLLM, termed **RGA3**, capable of performing both object referring and grounding for video reasoning tasks in a multi-round conversational manner, i.e., allowing users to iteratively interact with videos using both textual and visual queries; (ii) we propose **STOM** (Spatial-Temporal Overlay Module), a novel approach that allows arbitrary visual prompts to be processed at any timestamp within a video; (iii) we present **VideoInfer**, a manually curated object-centric video instruction dataset featuring question-answering pairs that require reasoning. We conduct comprehensive experiments on VideoInfer and other existing benchmarks across video question answering and referring video object segmentation. The results on 12 benchmarks spanning 6 tasks show that RGA3 consistently outperforms baseline models in both video question answering and segmentation, underscoring its robustness in multimodal, object-centric video and image understanding. The code, dataset, and web demo will be publicly released.
Poster
Zijie Xin · Minquan Wang · Jingyu Liu · Quan Chen · Ye Ma · Peng Jiang · Xirong Li

[ Exhibit Hall I ]

Abstract
Adding proper background music helps complete a short video for sharing. Previous research tackles the task by video-to-music retrieval (V2MR), which aims to find the most suitable music track from a collection to match the content of a given query video. In practice, however, music tracks are typically much longer than the query video, necessitating (manual) trimming of the retrieved music to a shorter segment that matches the video duration. In order to bridge the gap between the practical need for music moment localization and V2MR, we propose a new task termed Music Grounding by Short Video (MGSV). To tackle the new task, we introduce a new benchmark, MGSV-EC, which comprises a diverse set of 53K short videos associated with 35K different music moments from 4K unique music tracks. Furthermore, we develop a new baseline method, MaDe, which performs both video-to-music matching and music moment detection within a unified end-to-end deep network. Extensive experiments on MGSV-EC not only highlight the challenging nature of MGSV but also establish MaDe as a strong baseline. Data and code will be released.
Poster
Hanyu Zhou · Gim Hee Lee

[ Exhibit Hall I ]

Abstract
Large multimodal models (LMMs) excel in scene understanding but struggle with fine-grained spatio-temporal reasoning due to weak alignment between linguistic and visual representations. Existing methods map textual positions and durations onto frame-based videos, but suffer from temporal sparsity that limits language-vision temporal coordination. To address this issue, we introduce LLaFEA (Large Language and Frame-Event Assistant) to leverage event cameras for temporally dense perception and frame-event fusion. Our approach employs a cross-attention mechanism to integrate complementary spatial and temporal features, followed by self-attention matching for global spatio-temporal associations. We further embed textual position and duration tokens into the fused visual space to enhance fine-grained alignment. This unified framework ensures robust spatio-temporal coordinate alignment, enabling LMMs to interpret scenes at any position and time. In addition, we construct a dataset of real-world frame-event data with coordinate instructions and conduct extensive experiments to validate the effectiveness of our method. Our code will be made publicly available.
Poster
Mattia Soldan · Fabian Caba Heilbron · Bernard Ghanem · Josef Sivic · Bryan Russell

[ Exhibit Hall I ]

Abstract
Several video understanding tasks, such as natural language temporal video grounding, temporal activity localization, and audio description generation, require "temporally dense" reasoning over frames sampled at high temporal resolution. However, computing frame-level features for these tasks is computationally expensive given the temporal resolution requirements. In this paper, we make three contributions to reduce the cost of computing features for temporally dense tasks. First, we introduce a vision transformer (ViT) architecture, dubbed ResidualViT, that leverages the large temporal redundancy in videos to efficiently compute temporally dense frame-level features. Our architecture incorporates (i) learnable residual connections that ensure temporal consistency across consecutive frames and (ii) a token reduction module that enhances processing speed by selectively discarding temporally redundant information while reusing weights of a pretrained foundation model. Second, we propose a lightweight distillation strategy to approximate the frame-level features of the original foundation model. Finally, we evaluate our approach across four tasks and five datasets, in both zero-shot and fully supervised settings, demonstrating significant reductions in computational cost (up to 60\%) and improvements in inference speed (up to 2.5$\times$ faster), all while closely approximating the accuracy of the original foundation model.
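A toy sketch of the two ideas, reusing the previous frame's features through a learnable residual gate and dropping temporally redundant tokens, is shown below. The gating form, the threshold, and the module name are illustrative assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class ResidualFrameBlend(nn.Module):
    """Illustrative sketch (not ResidualViT's exact design): reuse features from the
    previous frame via a learnable residual gate, and drop tokens that are nearly
    identical to their previous-frame counterparts before running the heavy encoder."""

    def __init__(self, dim, keep_threshold=0.95):
        super().__init__()
        self.gate = nn.Parameter(torch.tensor(0.5))  # learnable residual weight
        self.keep_threshold = keep_threshold

    def forward(self, curr_tokens, prev_feats):
        # curr_tokens, prev_feats: (num_tokens, dim)
        sim = torch.cosine_similarity(curr_tokens, prev_feats, dim=-1)
        keep = sim < self.keep_threshold                 # keep only temporally novel tokens
        blended = self.gate * prev_feats + (1 - self.gate) * curr_tokens
        return blended[keep], keep

blend = ResidualFrameBlend(dim=768)
tokens, mask = blend(torch.randn(196, 768), torch.randn(196, 768))
```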
Poster
Shuchang Ye · Usman Naseem · Mingyuan Meng · jinman kim

[ Exhibit Hall I ]

Abstract
Medical language-guided segmentation, integrating textual clinical reports to enhance image segmentation, has demonstrated significant improvements over unimodal approaches. However, its inherent reliance on paired image-text input, which we refer to as textual reliance, presents two fundamental limitations: 1) many medical segmentation datasets lack paired reports, leaving a substantial portion of image-only data underutilized for training; and 2) inference is limited to retrospective analysis of cases with paired reports, limiting its applicability in most clinical scenarios where segmentation typically precedes reporting. To address these limitations, we propose ProLearn, the first Prototype-driven Learning framework for language-guided segmentation that fundamentally alleviates textual reliance. At its core, in ProLearn, we introduce a novel Prototype-driven Semantic Approximation (PSA) module to enable approximation of semantic guidance from textual input. PSA initializes a discrete and compact prototype space by distilling segmentation-relevant semantics from textual reports. Once initialized, it supports a query-and-respond mechanism which approximates semantic guidance for images without textual input, thereby alleviating textual reliance. Extensive experiments on QaTa-COV19 and MosMedData+ demonstrate that ProLearn outperforms state-of-the-art language-guided methods when limited text is available.
Poster
Yanguang Sun · Jiawei Lian · jian Yang · lei luo

[ Exhibit Hall I ]

Abstract
Large-scale pre-trained models provide powerful feature representations for downstream object segmentation tasks. However, when adapted to specific tasks through full-parameter fine-tuning, the enormous number of updated parameters often results in significant computational overhead, creating a bottleneck in training efficiency. Although existing methods attempt to fine-tune frozen models by directly embedding trainable prompts, these prompts lack inherent semantic priors, limiting the adaptability of large-scale models. In this paper, we propose a novel dynamic priors-based fine-tuning paradigm with fewer trainable parameters, dubbed Controllable-LPMoE, which adaptively modulates frozen foundation models by dynamically controlling local priors to enhance fine-grained perception for specific segmentation tasks. More specifically, we construct a lightweight dynamic mixed local priors extractor that captures diverse local priors from input images through heterogeneous convolutions while employing a gating network to dynamically output expert priors required for the subsequent fine-tuning. Furthermore, we design a bi-directional interaction adapter that employs cosine-aligned deformable attention and channel-oriented adaptive scale enhancement to exchange and restructure information between frozen and trainable features, achieving efficient fine-tuning. Extensive experiments validate the superiority of our Controllable-LPMoE approach, demonstrating excellent segmentation performance compared to 31 state-of-the-art methods and adaptability to multiple binary object segmentation tasks.
Poster
Xiaoran Zhang · Byung-Woo Hong · Hyoungseob Park · Daniel Pak · Anne-Marie Rickmann · Lawrence Staib · James Duncan · Alex Wong

[ Exhibit Hall I ]

Abstract
We propose a model-agnostic, progressive test-time energy adaptation approach for medical image segmentation. Maintaining model performance across diverse medical datasets is challenging, as distribution shifts arise from inconsistent imaging protocols and patient variations. Unlike domain adaptation methods that require multiple passes through target data—impractical in clinical settings—our approach adapts pretrained models progressively as they process test data. Our method leverages a shape energy model trained on source data, which assigns an energy score at the patch level to segmentation maps: low energy represents in-distribution (accurate) shapes, while high energy signals out-of-distribution (erroneous) predictions. By minimizing this energy score at test time, we refine the segmentation model to align with the target distribution. To validate the effectiveness and adaptability, we evaluated our framework on eight public MRI (bSSFP, T1- and T2-weighted) and X-ray datasets spanning cardiac, spinal cord, and lung segmentation. We consistently outperform baselines both quantitatively and qualitatively.
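A minimal sketch of this kind of test-time loop is shown below, assuming a pretrained segmentation `model` and a patch-level shape `energy_net` trained on source data; the optimizer choice, learning rate, and function signature are assumptions rather than the authors' exact recipe.

```python
import torch

def test_time_energy_adapt(model, energy_net, image, steps=1, lr=1e-4):
    """Minimal sketch of progressive test-time adaptation: nudge the pretrained
    segmentation model so its predictions have low shape energy (in-distribution
    shapes) on the current test sample."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        probs = model(image).softmax(dim=1)   # (B, C, H, W) soft segmentation
        energy = energy_net(probs).mean()     # high energy = implausible shape patches
        opt.zero_grad()
        energy.backward()                     # minimize energy w.r.t. model parameters
        opt.step()
    with torch.no_grad():
        return model(image).argmax(dim=1)     # adapted hard prediction
```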
Poster
Mingfeng Zha · Tianyu Li · Guoqing Wang · Peng Wang · Yangyang Wu · Yang Yang · Heng Tao Shen

[ Exhibit Hall I ]

Abstract
Audio-visual segmentation (AVS) aims to segment objects in videos based on audio cues. Existing AVS methods are primarily designed to enhance interaction efficiency but pay limited attention to modality representation discrepancies and imbalances. To overcome this, we propose the implicit counterfactual framework (ICF) to achieve unbiased cross-modal understanding. Due to the lack of semantics, heterogeneous representations may lead to erroneous matches, especially in complex scenes with ambiguous visual content or interference from multiple audio sources. We introduce the multi-granularity implicit text (MIT) involving video-, segment- and frame-level as the bridge to establish the modality-shared space, reducing modality gaps and providing prior guidance. Visual content carries more information and typically dominates, thereby marginalizing audio features in the decision-making. To mitigate knowledge preference, we propose the semantic counterfactual (SC) to learn orthogonal representations in the latent space, generating diverse counterfactual samples, thus avoiding biases introduced by complex functional designs and explicit modifications of text structures or attributes. We further formulate the collaborative distribution-aware contrastive learning (CDCL), incorporating factual-counterfactual and inter-modality contrasts to align representations, promoting cohesion and decoupling. Extensive experiments on three public datasets validate that the proposed method achieves state-of-the-art performance.
Poster
Shi-Chen Zhang · Yunheng Li · Yu-Huan Wu · Qibin Hou · Ming-Ming Cheng

[ Exhibit Hall I ]

Abstract
Semantic segmentation is fundamental to vision systems requiring pixel-level scene understanding, yet deploying it on resource-constrained devices demands efficient architectures. Although existing methods achieve real-time inference through lightweight designs, we reveal their inherent limitation: misalignment between class representations and image features caused by a per-pixel classification paradigm. Through experimental analysis, we find that this paradigm rests on an assumption that is highly challenging for efficient scenarios: image pixel features should not vary for the same category across different images. To address this dilemma, we propose a coupled dual-branch offset learning paradigm that explicitly learns feature and class offsets to dynamically refine both class representations and spatial image features. Based on the proposed paradigm, we construct an efficient semantic segmentation network OffSeg. Notably, the offset learning paradigm can be applied to existing methods with no additional architectural changes. Extensive experiments on four datasets, including ADE20K, Cityscapes, COCO-Stuff-164K, and Pascal Context, demonstrate consistent improvements with negligible parameter overhead. For instance, on the ADE20K dataset, our proposed offset learning paradigm improves SegFormer-B0, SegNeXt-T, and Mask2Former-Tiny by 1.9%, 2.4%, and 2.6% mIoU, respectively, with only 0.1-0.2M additional parameters required.
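The sketch below illustrates one possible form of a coupled dual-branch offset head: an image-conditioned offset refines the class embeddings while a lightweight convolution refines the spatial features before per-pixel dot-product classification. Module names, shapes, and the way the class offset is conditioned on the image are assumptions, not the OffSeg implementation.

```python
import torch
import torch.nn as nn

class OffsetSegHead(nn.Module):
    """Sketch of a dual-branch offset head: one branch predicts an offset for the
    class embeddings (conditioned on a global image descriptor), the other an
    offset for the pixel features, before standard dot-product classification."""

    def __init__(self, dim, num_classes):
        super().__init__()
        self.class_embed = nn.Parameter(torch.randn(num_classes, dim))
        self.class_offset = nn.Linear(dim, dim)    # refines class representations per image
        self.feat_offset = nn.Conv2d(dim, dim, 1)  # refines spatial features

    def forward(self, feats):                      # feats: (B, dim, H, W)
        ctx = feats.mean(dim=(2, 3))               # (B, dim) global image context
        classes = self.class_embed.unsqueeze(0) + self.class_offset(ctx).unsqueeze(1)
        feats = feats + self.feat_offset(feats)
        # per-pixel logits: (B, num_classes, H, W)
        return torch.einsum("bdhw,bcd->bchw", feats, classes)

head = OffsetSegHead(dim=256, num_classes=150)
logits = head(torch.randn(2, 256, 64, 64))
```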
Poster
Zhangjun Zhou · Yiping Li · Chunlin Zhong · Jianuo Huang · Jialun Pei · Hua Li · He Tang

[ Exhibit Hall I ]

Abstract
While the human visual system employs distinct mechanisms to perceive salient and camouflaged objects, existing models struggle to disentangle these tasks. Specifically, salient object detection (SOD) models frequently misclassify camouflaged objects as salient, while camouflaged object detection (COD) models conversely misinterpret salient objects as camouflaged. We hypothesize that this can be attributed to two factors: (i) the specific annotation paradigm of current SOD and COD datasets, and (ii) the lack of explicit attribute relationship modeling in current models. Prevalent SOD/COD datasets enforce a mutual exclusivity constraint, assuming scenes contain either salient or camouflaged objects, which poorly aligns with the real world. Furthermore, current SOD/COD methods are primarily designed for these highly constrained datasets and lack explicit modeling of the relationship between salient and camouflaged objects. In this paper, to promote the development of unconstrained salient and camouflaged object detection, we construct a large-scale dataset, USC12K, which features comprehensive labels and four different scenes that cover all possible logical existence scenarios of both salient and camouflaged objects. To explicitly model the relationship between salient and camouflaged objects, we propose a model called USCNet, which introduces two distinct prompt query mechanisms for modeling inter-sample and intra-sample attribute relationships. Additionally, to assess the …
Poster
Wenlun Zhang · Yunshan Zhong · Shimpei Ando · Kentaro Yoshioka

[ Exhibit Hall I ]

Abstract
The Segment Anything Model (SAM) has demonstrated strong versatility across various visual tasks. However, its large storage requirements and high computational cost pose challenges for practical deployment. Post-training quantization (PTQ) has emerged as an effective strategy for efficient deployment, but we identify two key challenges in SAM that hinder the effectiveness of existing PTQ methods: the heavy-tailed and skewed distribution of post-GELU activations, and significant inter-channel variation in linear projection activations. To address these challenges, we propose AHCPTQ, an accurate and hardware-efficient PTQ method for SAM. AHCPTQ introduces hardware-compatible Hybrid Log-Uniform Quantization (HLUQ) to manage post-GELU activations, employing log2 quantization for dense small values and uniform quantization for sparse large values to enhance quantization resolution. Additionally, AHCPTQ incorporates Channel-Aware Grouping (CAG) to mitigate inter-channel variation by progressively clustering activation channels with similar distributions, enabling them to share quantization parameters and improving hardware efficiency. The combination of HLUQ and CAG not only enhances quantization effectiveness but also ensures compatibility with efficient hardware execution. For instance, under the W4A4 configuration on the SAM-L model, AHCPTQ achieves 36.6\% mAP on instance segmentation with the DINO detector, while achieving a $7.89\times$ speedup and $8.64\times$ energy efficiency over its floating-point counterpart in FPGA implementation.
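The hybrid idea can be illustrated with a toy quantizer: magnitudes below a split point are snapped to a power-of-two (log2) grid, while larger magnitudes use a uniform grid. The bit widths, split point, and clipping ranges below are illustrative assumptions and do not reproduce AHCPTQ's calibration or hardware mapping.

```python
import numpy as np

def hybrid_log_uniform_quant(x, split, log_bits=3, uni_bits=3, x_max=None):
    """Toy sketch of hybrid log2 + uniform quantization: dense small magnitudes get a
    log2 grid, sparse large magnitudes get a uniform grid above `split`."""
    x = np.asarray(x, dtype=np.float64)
    sign, mag = np.sign(x), np.abs(x)
    x_max = mag.max() if x_max is None else x_max

    # log2 grid for the dense, small-magnitude region (0, split]
    exps = np.clip(np.round(np.log2(np.clip(mag, 1e-12, split))),
                   np.log2(split) - (2 ** log_bits - 1), np.log2(split))
    log_q = 2.0 ** exps

    # uniform grid for the sparse, large-magnitude region (split, x_max]
    step = max(x_max - split, 1e-12) / (2 ** uni_bits - 1)
    uni_q = split + np.round((mag - split) / step) * step

    return sign * np.where(mag <= split, log_q, uni_q)

w = np.array([0.01, 0.05, -0.2, 0.9, -1.7])
print(hybrid_log_uniform_quant(w, split=0.25, x_max=2.0))
```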
Poster
Dong Zhao · Qi Zang · Shuang Wang · Nicu Sebe · Zhun Zhong

[ Exhibit Hall I ]

Abstract
Pseudo-labeling is a key technique for semi-supervised and cross-domain semantic segmentation, yet its efficacy is often hampered by the intrinsic noise of pseudo-labels. This study introduces Pseudo-SD, a novel framework that redefines the utilization of pseudo-label knowledge through Stable Diffusion (SD). Our Pseudo-SD innovatively combines pseudo-labels and their text prompts to fine-tune SD models, facilitating the generation of high-quality, diverse synthetic images that closely mimic target data characteristics. Within this framework, two novel mechanisms, \textit{i.e.}, partial attention manipulation and structured pseudo-labeling, are proposed to effectively propagate text-to-image correspondence during the SD fine-tuning process and to ensure controllable, high-quality image synthesis, respectively. Extensive results demonstrate that Pseudo-SD significantly improves performance in semi-supervised and cross-domain segmentation scenarios. Moreover, our method is versatile and model-agnostic, and can complement existing methods. By injecting our Pseudo-SD into current methods, we establish new state-of-the-art results on different datasets, offering a new way to explore effective pseudo-label utilization.
Poster
Quanfeng Lu · Wenqi Shao · Zitao Liu · Lingxiao Du · Fanqing Meng · Boxuan Li · Botong Chen · Siyuan Huang · Kaipeng Zhang · Ping Luo

[ Exhibit Hall I ]

Abstract
Autonomous Graphical User Interface (GUI) navigation agents can enhance user experience in communication, entertainment, and productivity by streamlining workflows and reducing manual intervention. However, prior GUI agents were often trained on datasets comprising tasks that can be completed within a single app, leading to poor performance in cross-app navigation. To address this problem, we present GUIOdyssey, a comprehensive dataset for cross-app mobile GUI navigation. GUIOdyssey comprises 8,334 episodes with an average of 15.3 steps per episode, covering 6 mobile devices, 212 distinct apps, and 1,357 app combinations. Each step is enriched with detailed semantic reasoning annotations, which aid the model in building cognitive processes and enhancing its reasoning abilities for complex cross-app tasks. Building on GUIOdyssey, we develop OdysseyAgent, an exploratory multimodal agent for long-step cross-app navigation equipped with a history resampler module that efficiently attends to historical screenshot tokens, balancing performance and inference speed. Extensive experiments conducted in both in-domain and out-of-domain scenarios validate the effectiveness of our approach. Moreover, we demonstrate that historical information involving actions, screenshots, and context in our dataset significantly enhances OdysseyAgent's performance on complex cross-app tasks.
Poster
Mahesh Bhosale · Abdul Wasi · Yuanhao Zhai · Yunjie Tian · Samuel Border · Nan Xi · Pinaki Sarder · Junsong Yuan · David Doermann · Xuan Gong

[ Exhibit Hall I ]

Abstract
Diffusion-based generative models have shown promise in synthesizing histopathology images to address data scarcity caused by privacy constraints. Diagnostic text reports provide high-level semantic descriptions, and masks offer fine-grained spatial structures essential for representing distinct morphological regions. However, public datasets lack paired text and mask data for the same histopathological images, limiting their joint use in image generation. This constraint restricts the ability to fully exploit the benefits of combining both modalities for enhanced control over semantics and spatial details. To overcome this, we propose PathDiff, a diffusion framework that effectively learns from unpaired mask-text data by integrating both modalities into a unified conditioning space. PathDiff allows precise control over structural and contextual features, generating high-quality, semantically accurate images. PathDiff also improves image fidelity, text-image alignment, and faithfulness, enhancing data augmentation for downstream tasks like nuclei segmentation and classification. Extensive experiments demonstrate its superiority over existing methods. Our code and models will be open-sourced.
Poster
Yiyuan Zhang · Handong Li · Jing Liu · Xiangyu Yue

[ Exhibit Hall I ]

Abstract
High-quality image-text data is critical in enhancing Vision-Language Models (VLMs), but traditional image-based pretraining approaches face limitations. These methods are resource-intensive, relying on curated, high-quality interleaved data that is costly and challenging to collect at scale. Additionally, while such datasets improve static image-text understanding, they fail to develop the temporal and motion comprehension needed for video understanding. To address these gaps, we propose incorporating video pretraining into VLMs to improve the model’s ability to capture temporal dynamics and general visual perception, which requires reconciling spatial redundancy with strict temporal causality. Therefore, we propose Causal Hierarchical Aggregation to separate computation-heavy spatial encoding from lightweight temporal propagation and construct hierarchical receptive fields at varying granularities. As we scale video context to more than 100B tokens, our method excels in high throughput and state-of-the-art performances on both Image and Video understanding, as shown in Figure 1, providing a scalable solution to enhance multimodal learning in dynamic contexts.
Poster
Raphaela Kang · Yue Song · Georgia Gkioxari · Pietro Perona

[ Exhibit Hall I ]

Abstract
Contrastive Language-Image Pre-Training (CLIP) is a popular method for learning multimodal latent spaces with well-organized semantics. Despite its wide range of applications, CLIP's latent space is known to fail at handling complex visual-textual interactions. Recent works attempt to address its shortcomings with data-centric or algorithmic approaches. But what if the problem is more fundamental, and lies in the geometry of CLIP? Toward this end, we rigorously analyze CLIP's latent space properties, and prove that no CLIP-like joint embedding space exists which can correctly do any two of the following at the same time: 1. represent basic descriptions and image content, 2. represent attribute binding, 3. represent spatial location and relationships, 4. represent negation. Informed by this analysis, we propose Dense Cosine Similarity Maps (DCSMs) as a principled and interpretable scoring method for CLIP-like models, which solves the fundamental limitations of CLIP by retaining the semantic topology of the image patches and text tokens. This method improves upon the performance of classical CLIP-like joint encoder models on a wide array of benchmarks. Code will be released upon acceptance.
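A dense cosine similarity map itself is straightforward to compute; the sketch below shows the core operation for one image-text pair, assuming patch and token embeddings are already available. The aggregation into a final score at the end is only one possible choice and is not taken from the paper.

```python
import torch

def dense_cosine_similarity_map(patch_embeds, token_embeds):
    """Minimal sketch of a dense cosine similarity map between image patches and
    text tokens (the scoring head on top of the map is omitted).

    patch_embeds: (num_patches, d), token_embeds: (num_tokens, d)
    returns:      (num_patches, num_tokens) cosine similarities
    """
    p = torch.nn.functional.normalize(patch_embeds, dim=-1)
    t = torch.nn.functional.normalize(token_embeds, dim=-1)
    return p @ t.T   # keeps the patch/token topology instead of pooling to one vector

dcsm = dense_cosine_similarity_map(torch.randn(196, 512), torch.randn(12, 512))
score = dcsm.max(dim=1).values.mean()   # one possible way to aggregate into a score
```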
Poster
ZHIXIANG WEI · Guangting Wang · Xiaoxiao Ma · Ke Mei · Fengyun Rao · Huaian Chen · Yi Jin

[ Exhibit Hall I ]

Abstract
Large-scale but noisy image-text pair data have paved the way for the success of Contrastive Language-Image Pretraining (CLIP). As the foundation vision encoder, CLIP in turn serves as the cornerstone for most large vision-language models (LVLMs). This interdependence naturally raises an interesting question: Can we reciprocally leverage LVLMs to enhance the quality of image-text pair data, thereby opening the possibility of a self-reinforcing cycle for continuous improvement? In this work, we take a significant step toward this vision by introducing an LVLM-driven data refinement pipeline. Our framework leverages LVLMs to process images and their raw alt-text, generating four complementary textual formulas: long positive descriptions, long negative descriptions, short positive tags, and short negative tags. Applying this pipeline to the curated DFN-Large dataset yields VLM-150M, a refined dataset enriched with multi-grained annotations. Based on this dataset, we further propose a training paradigm that extends conventional contrastive learning by incorporating negative descriptions and short tags as additional supervised signals. The resulting model, namely HQ-CLIP, demonstrates remarkable improvements across diverse benchmarks. Within a comparable training data scale, our approach achieves state-of-the-art performance in zero-shot classification, cross-modal retrieval, and fine-grained visual understanding tasks. In retrieval benchmarks, HQ-CLIP even surpasses standard CLIP models trained on …
Poster
Xuechao Zou · Yue Li · Shun Zhang · Kai Li · Shiying Wang · Pin Tao · Junliang Xing · congyan lang

[ Exhibit Hall I ]

Abstract
Remote sensing image segmentation faces persistent challenges in distinguishing morphologically similar categories and adapting to diverse scene variations. While existing methods rely on implicit representation learning paradigms, they often fail to dynamically adjust semantic embeddings according to contextual cues, leading to suboptimal performance in fine-grained scenarios such as cloud thickness differentiation. This work introduces a dynamic dictionary learning framework that explicitly models class ID embeddings through iterative refinement. The core contribution lies in a novel dictionary construction mechanism, where class-aware semantic embeddings are progressively updated via multi-stage alternating cross-attention querying between image features and dictionary embeddings. This process enables adaptive representation learning tailored to input-specific characteristics, effectively resolving ambiguities in intra-class heterogeneity and inter-class homogeneity. To further enhance discriminability, a contrastive constraint is applied to the dictionary space, ensuring compact intra-class distributions while maximizing inter-class separability. Extensive experiments across both coarse- and fine-grained datasets demonstrate consistent improvements over state-of-the-art methods, particularly in two online test benchmarks (LoveDA and UAVid). Code is available at https://anonymous.4open.science/r/D2LS-8267/.
Poster
Zesen Cheng · Kehan Li · Yian Zhao · Hang Zhang · Chang Liu · Jie Chen

[ Exhibit Hall I ]

Abstract
With the rise of applications such as embodied intelligence, developing high real-time online video instance segmentation (VIS) has become increasingly important. However, through time profiling of the components in advanced online VIS architecture (i.e., transformer-based architecture), we find that the transformer decoder significantly hampers the inference speed. Further analysis of the similarities between the outputs from adjacent frames at each transformer decoder layer reveals significant redundant computations within the transformer decoder. To address this issue, we introduce Temporal-Aware query Routing (TAR) mechanism. We embed it before each transformer decoder layer. By fusing the optimal queries from the previous frame, the queries output by the preceding decoder layer, and their differential information, TAR predicts a binary classification score and then uses an argmax operation to determine whether the current layer should be skipped. Experimental results demonstrate that integrating TAR into the baselines achieves significant efficiency gains (24.7 → 34.6 FPS for MinVIS, 22.4 → 32.8 FPS for DVIS++) while also improving performance (e.g., on YoutubeVIS 2019, 47.4 → 48.4 AP for MinVIS, 55.5 → 55.7 AP for DVIS++). Furthermore, our analysis of the TAR mechanism shows that the number of skipped layers increases as the differences between adjacent video frames decrease, …
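A simplified sketch of such a routing module is given below: it fuses the previous frame's optimal queries, the current queries, and their difference, then predicts a binary skip decision for the upcoming decoder layer. The pooling, classifier shape, and interface are assumptions for illustration, not the paper's exact module.

```python
import torch
import torch.nn as nn

class TemporalAwareRouter(nn.Module):
    """Sketch of a skip router in the spirit of TAR: concatenate previous-frame
    queries, current queries, and their difference, then classify skip / no-skip."""

    def __init__(self, dim):
        super().__init__()
        self.classifier = nn.Sequential(nn.Linear(3 * dim, dim), nn.ReLU(), nn.Linear(dim, 2))

    def forward(self, prev_frame_queries, curr_queries):   # both: (num_queries, dim)
        fused = torch.cat([prev_frame_queries, curr_queries,
                           curr_queries - prev_frame_queries], dim=-1)
        logits = self.classifier(fused.mean(dim=0))         # pool over queries -> (2,)
        return logits.argmax().item() == 1                  # True -> skip this decoder layer

router = TemporalAwareRouter(dim=256)
skip = router(torch.randn(100, 256), torch.randn(100, 256))
```

Similar frames produce small query differences, so the router tends to skip more layers exactly when redundant computation is highest.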
Poster
Cheonjun Park · Hyunjae Oh · Mincheol Park · Hyunchan Moon · Minsik Kim · Suhyun Kim · Myung Kuk Yoon · Won Woo Ro

[ Exhibit Hall I ]

Abstract
Recent GPUs leverage Winograd convolution and structured pruning to significantly accelerate inference.First, Winograd convolution is theoretically 2.25× faster than standard convolution.Second, structured pruning reduces inference time without additional overhead as the pruning ratio increases.However, applying conventional structured pruning alongside Winograd convolution is inefficient. Existing structured pruning methods, which do not account for how GPUs process Winograd convolution, require large pruning unit sizes, leading to significant information loss.In this paper, we propose Winograd Structured Pruning (WINS), \textbf{the first approach} to employ optimized structured pruning for Winograd convolution. WINS is designed based on an in-depth analysis of Winograd convolution's computational characteristics on GPUs.Additionally, we introduce two variants, WINS-B and WINS-AB, which further enhance performance. Experimental results show that WINS-AB achieves up to 2.8× practical speedup in Winograd convolution inference on GPUs while preserving the accuracy of ResNet-18 on ImageNet.
Poster
Hai Huang · Yan Xia · Sashuai Zhou · Hanting Wang · Shulei Wang · Zhou Zhao

[ Exhibit Hall I ]

Abstract
Domain Generalization (DG) aims to enhance model robustness in unseen or distributionally shifted target domains through training exclusively on source domains. Although existing DG techniques—such as data manipulation, learning strategies, and representation learning—have demonstrated significant progress, they predominantly address single-modal data. With the emergence of numerous multi-modal datasets and increasing demand for multi-modal tasks, a key challenge in Multi-modal Domain Generalization (MMDG) has emerged: enabling models trained on multi-modal sources to generalize to unseen target distributions within the same modality set. Due to the inherent differences between modalities, directly transferring methods from single-modal DG to MMDG typically yields disappointing results. These methods often exhibit randomness during generalization due to the invisibility of target domains and fail to consider inter-modal consistency. Applying these methods independently to each modality in the MMDG setting before combining them can lead to divergent generalization directions across different modalities, resulting in degraded generalization capabilities. To address these challenges, we propose a novel approach that leverages Unified Representations to map different paired modalities together, effectively adapting DG methods to MMDG by enabling synchronized multi-modal improvements within the unified space. Additionally, we introduce a supervised disentanglement framework that separates modal-general and modal-specific information, further enhancing the alignment of unified …
Poster
KUO WANG · Quanlong Zheng · Junlin Xie · Yanhao Zhang · Jinguo Luo · Haonan Lu · Liang Lin · Fan Zhou · Guanbin Li

[ Exhibit Hall I ]

Abstract
Video Multimodal Large Language Models (Video-MLLM) have achieved remarkable advancements in video understanding tasks. However, constrained by the context length limitation in the underlying LLMs, existing Video-MLLMs typically exhibit suboptimal performance on long video scenarios. To understand extended input frames, common solutions span token compression and streaming inference techniques, which sacrifice feature granularity or inference efficiency. Differently, to efficiently achieve comprehensive understanding of longer frame inputs, we draw ideas from MoE and propose a training-free approach, Free-MoRef, which instantly multiplexes the context perception capabilities of Video-MLLMs within one inference pass. Specifically, Free-MoRef reconstructs the vision tokens into several short sequences as multi-references. Subsequently, we introduce MoRef-attention, which gathers clues from the multi-reference chunks in parallel to summarize unified query activations. After the shadow layers in LLMs, a reference fusion step is derived to compose a final mixed reasoning sequence with key tokens from parallel chunks, which compensates for the cross-reference vision interactions that are neglected in MoRef-attention. By splitting and fusing the long vision token sequences, Free-MoRef achieves improved performance at much lower computing cost when reasoning over multiplexed context lengths, demonstrating strong efficiency and effectiveness. Experiments on VideoMME, MLVU, LongVideoBench show that Free-MoRef achieves full perception of 2$\times$ to 8$\times$ longer input …
Poster
Yuan Yao · Qiushi Yang · Miaomiao Cui · Liefeng Bo

[ Exhibit Hall I ]

Abstract
The recent Segment Anything Models (SAMs) have emerged as foundational visual models for general interactive segmentation. Despite demonstrating robust generalization abilities, they still suffer from performance degradations in scenarios that demand accurate masks. Existing methods for high-precision interactive segmentation face a trade-off between perceiving intricate local details and maintaining stable prompting capability, which hinders the applicability and effectiveness of foundational segmentation models. In this paper, we present a SAM2Refiner framework built upon the SAM2 backbone. This architecture allows SAM2 to generate fine-grained segmentation masks for both images and videos while preserving its inherent strengths. Specifically, we design a localization augment module, which incorporates local contextual cues to enhance global features via a cross-attention mechanism, thereby exploiting potential detailed patterns while maintaining semantic information. Moreover, to strengthen the prompting ability toward the enhanced object embeddings, we introduce a prompt retargeting module that renews the embedding with spatially aligned prompt features. In addition, to obtain accurate high resolution segmentation masks, a mask refinement module is devised by employing a multi-scale cascaded structure to fuse mask features with hierarchical representations from the encoder. Extensive experiments demonstrate the effectiveness of our approach, revealing that the proposed method can produce highly precise masks for both …
Poster
Wei Liao · Chunyan Xu · Chenxu Wang · Zhen Cui

[ Exhibit Hall I ]

Abstract
Sparse annotation in remote sensing object detection poses significant challenges due to dense object distributions and category imbalances. Although existing Dense Pseudo-Label methods have demonstrated substantial potential in pseudo-labeling tasks, they remain constrained by selection ambiguities and inconsistencies in confidence estimation. In this paper, we introduce an LLM-assisted semantic guidance framework tailored for sparsely annotated remote sensing object detection, exploiting the advanced semantic reasoning capabilities of large language models (LLMs) to distill high-confidence pseudo-labels. By integrating LLM-generated semantic priors, we propose a Class-Aware Dense Pseudo-Label Assignment mechanism that adaptively assigns pseudo-labels for both unlabeled and sparsely labeled data, ensuring robust supervision across varying data distributions. Additionally, we develop an Adaptive Hard-Negative Reweighting Module to stabilize the supervised learning branch by mitigating the influence of confounding background information. Extensive experiments on DOTA and HRSC2016 demonstrate that the proposed method outperforms existing single-stage detector-based frameworks, significantly improving detection performance under sparse annotations.
Poster
Qin Zhou · Guoyan Liang · Xindi Li · Jingyuan CHEN · Zhe Wang · Chang Yao · Sai Wu

[ Exhibit Hall I ]

Abstract
Automated radiology report generation is essential for improving diagnostic efficiency and reducing the workload of medical professionals. However, existing methods face significant challenges, such as disease class imbalance and insufficient cross-modal fusion. To address these issues, we propose the learnable Retrieval Enhanced Visual-Text Alignment and Fusion (REVTAF) framework, which effectively tackles both class imbalance and visual-text fusion in report generation. REVTAF incorporates two core components: (1) a Learnable Retrieval Enhancer (LRE) that utilizes semantic hierarchies from hyperbolic space and intra-batch context through a ranking-based metric. LRE adaptively retrieves the most relevant reference reports, enhancing image representations, particularly for underrepresented (tail) class inputs; and (2) a fine-grained visual-text alignment and fusion strategy that ensures consistency across multi-source cross-attention maps for precise alignment. This component further employs an optimal transport-based cross-attention mechanism to dynamically integrate task-relevant textual knowledge for improved report generation. By combining adaptive retrieval with multi-source alignment and fusion, REVTAF achieves fine-grained visual-text integration under weak image-report level supervision while effectively mitigating data imbalance issues. Comprehensive experiments demonstrate that REVTAF outperforms state-of-the-art methods, achieving an average improvement of 7.4% on the MIMIC-CXR dataset and 2.9% on the IU X-Ray dataset. Comparisons with mainstream multimodal LLMs (e.g., GPT-series models), further highlight …
Poster
Zhizhong Huang · Xiaoming Liu

[ Exhibit Hall I ]

Abstract
Current object re-identification (ReID) methods train domain-specific models (e.g., for persons or vehicles), which lack generalization and demand costly labeled data for new categories. While self-supervised learning reduces annotation needs by learning instance-wise invariance, it struggles to capture \textit{identity-sensitive} features critical for ReID. This paper proposes Visual In-Context Prompting (VICP), a novel framework where models trained on seen categories can directly generalize to unseen novel categories using only \textit{in-context examples} as prompts, without requiring parameter adaptation. VICP synergizes LLMs and vision foundation models (VFM): LLMs infer semantic identity rules from few-shot positive/negative pairs through task-specific prompting, which then guides a VFM (\eg, DINO) to extract ID-discriminative features via \textit{dynamic visual prompts}. By aligning LLM-derived semantic concepts with the VFM's pre-trained prior, VICP enables generalization to novel categories, eliminating the need for dataset-specific retraining. To support evaluation, we introduce ShopID10K, a dataset of 10K object instances from e-commerce platforms, featuring multi-view images and cross-domain testing. Experiments on ShopID10K and diverse ReID benchmarks demonstrate that VICP outperforms baselines by a clear margin on unseen categories. Code will be released upon publication.
Poster
Pooyan Rahmanzadehgervi · Hung Nguyen · Rosanne Liu · Long Mai · Anh Nguyen

[ Exhibit Hall I ]

Abstract
Multi-head self-attention (MHSA) is a key component of Transformers, a widely popular architecture in both language and vision. Multiple heads intuitively enable different parallel processes over the same input. Yet, they also obscure the attribution of each input patch to the output of a model. We propose a novel 1-head Transformer Attention Bottleneck (TAB) layer, inserted after the traditional MHSA architecture, to serve as an attention bottleneck for interpretability and intervention. Unlike standard self-attention, TAB constrains the total attention over all patches to lie in $[0, 1]$. That is, when the total attention is 0, no visual information is propagated further into the network, and the vision-language model (VLM) would default to a generic, image-independent response. To demonstrate the advantages of TAB, we train VLMs with TAB to perform image-difference captioning. Over three datasets, our models perform similarly to baseline VLMs in captioning but the bottleneck is superior in localizing changes and in identifying when no changes occur. TAB is the first architecture to enable users to debug by editing attention, which often yields the outputs users expect from VLMs.
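One way to realize a total attention budget in $[0, 1]$ is to let a learnable null logit compete with the patch logits inside the softmax, so the patch weights can collectively shrink toward zero. The sketch below uses that formulation purely as an illustration; it is an assumed mechanism, not necessarily TAB's exact design.

```python
import torch
import torch.nn as nn

class AttentionBottleneck(nn.Module):
    """Sketch of a 1-head attention layer whose *total* attention over patches lies
    in [0, 1]: a learnable "null" logit absorbs whatever mass the patches do not take."""

    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.null_logit = nn.Parameter(torch.zeros(1))  # competes with the patch logits

    def forward(self, query, patches):                  # query: (d,), patches: (N, d)
        logits = self.k(patches) @ self.q(query) / patches.shape[-1] ** 0.5  # (N,)
        logits = torch.cat([logits, self.null_logit])
        attn = logits.softmax(dim=0)[:-1]               # patch weights sum to <= 1
        return attn @ patches, attn                     # near-zero sum -> no visual info flows

tab = AttentionBottleneck(dim=768)
out, attn = tab(torch.randn(768), torch.randn(196, 768))
print(attn.sum().item())  # total attention in [0, 1]
```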
Poster
Ruiyun Yu · Bingyang Guo · Haoyuan Li

[ Exhibit Hall I ]

Abstract
Anomaly detection plays a crucial role in the industrial sector, especially in ensuring the quality of integrated circuits (IC), which are critical for product reliability and performance. With increasing demands for higher quality standards, anomaly detection during the IC manufacturing process has become a significant research focus. However, the progress of IC anomaly detection is hampered by the scarcity of defective samples and the shortage of well-defined annotations. To address this challenge, this paper focuses on the research in the field of IC, especially on ceramic package substrates (CPS). We construct a systematic automated optical inspection (AOI) equipment, and based on this, collected large-scale CPS 2D images to build a novel anomaly detection dataset (CPS2D-AD), which offers copious samples with precise annotations, including category, mask, and bounding box. To the best of our knowledge, CPS2D-AD is the largest dataset in the field of IC. Meanwhile, we conduct an extensive benchmark of CPS2D-AD, intending to supplement existing research by providing a baseline for the detection and localization of anomalies in high-resolution data of ceramic package substrates. In addition, we have developed a novel large vision model, \textbf{S}egment \textbf{A}ny \textbf{I}ntegrated \textbf{C}ircuits (SAIC), by embedding-based distillation mechanism based on CPS2D-AD datasets. Our CPS2D-AD …
Poster
Kaining Ying · Henghui Ding · Guangquan Jie · Yu-Gang Jiang

[ Exhibit Hall I ]

Abstract
Referring audio-visual segmentation (RAVS) has recently seen significant advancements, yet challenges remain in integrating multimodal information as well as deeply understanding and reasoning about audiovisual content. To extend the boundaries of RAVS and facilitate future research in this field, we propose **Omni**modal Referring **A**udio-**V**isual **S**egmentation (**OmniAVS**), a new dataset containing 2,098 videos and 59,458 multimodal referring expressions. OmniAVS stands out with three key innovations: (1) 8 types of multimodal expressions that flexibly combine text, speech, sound, and visual cues; (2) an emphasis on understanding audio content beyond just detecting their presence; and (3) the inclusion of complex reasoning and world knowledge in expressions. Furthermore, we introduce **O**mnimodal **I**nstructed **S**egmentation **A**ssistant (**OISA**), to address the challenges of multimodal reasoning and fine-grained understanding of audiovisual content in OmniAVS. OISA uses MLLM to comprehend complex cues and perform reasoning-based segmentation. Extensive experiments on 10 datasets show that OISA outperforms existing methods on OmniAVS and achieves competitive results on other related tasks.
Poster
Dibyadip Chatterjee · Edoardo Remelli · Yale Song · Bugra Tekin · Abhay Mittal · Bharat Bhatnagar · Necati Cihan Camgoz · Shreyas Hampali · Eric Sauser · Shugao Ma · Angela Yao · Fadime Sener

[ Exhibit Hall I ]

Abstract
We introduce ProVideLLM, an end-to-end framework for real-time streaming procedural assistance. ProVideLLM integrates a multimodal cache configured to store two types of tokens -- verbalized text tokens, which provide compressed textual summaries of long-term observations, and visual tokens, encoded with DETR-QFormer to capture fine-grained details from short-term observations. This design reduces token count by $22\times$ over existing methods in representing one hour of long-term observations while effectively encoding fine-grained representations. By interleaving these tokens in the multimodal cache, ProVideLLM ensures sub-linear scaling of memory and compute with video length, enabling per-frame streaming inference at 10 FPS and 25 FPS for streaming dialogue, with a minimal 2GB GPU memory footprint. ProVideLLM also sets new state-of-the-art results on six procedural tasks across four datasets.
Poster
Yuan Bian · Min Liu · Yunqi Yi · Xueping Wang · Shuai Jiang · Yaonan Wang

[ Exhibit Hall I ]

Abstract
Person re-identification (re-id) models are vital in security surveillance systems, requiring transferable adversarial attacks to explore their vulnerabilities. Recently, vision-language model (VLM)-based attacks have shown superior transferability by attacking generalized image and textual features of VLM, but they lack comprehensive feature disruption due to the overemphasis on discriminative semantics in integral representation. In this paper, we introduce the Attribute-aware Prompt Attack (AP-Attack), a novel method that leverages VLM's image-text alignment capability to explicitly disrupt fine-grained semantic features of pedestrian images by destroying attribute-specific textual embeddings. To obtain personalized textual descriptions for individual attributes, textual inversion networks are designed to map pedestrian images to pseudo tokens that represent semantic embeddings, trained in a contrastive learning manner with images and a predefined prompt template that explicitly describes the pedestrian attributes. Inverted benign and adversarial fine-grained textual semantics facilitate the attacker in conducting thorough disruptions, enhancing the transferability of adversarial examples. Extensive experiments show that AP-Attack achieves state-of-the-art transferability, significantly outperforming previous methods by 22.9% on mean Drop Rate in cross-model&dataset attack scenarios.
Poster
Mingyang Liu · Xinyang Chen · Yang Shu · Xiucheng Li · Weili Guan · Liqiang Nie

[ Exhibit Hall I ]

Abstract
Chest X-ray classification is extensively utilized within the field of medical image analysis. However, manually labeling chest X-ray images is time-consuming and costly. Domain adaptation, which is designed to transfer knowledge from related domains, could offer a promising solution. Existing methods employ feature adaptation or self-training for knowledge transfer. Nonetheless, negative transfer is observed due to the entanglement of class imbalance and distribution shift in chest X-ray classification. In this paper, we propose the Debiased Curriculum Adaptation framework to mitigate negative transfer in two aspects: (1) Curriculum Adaptation, which is designed to transfer knowledge in an easy-to-hard way, is proposed to alleviate confirmation bias in self-training. (2) Spectral Debiasing is introduced to harmonize the feature space between the source and target domains, as well as balance the feature space of positive and negative samples. Extensive experiments on 72 transfer tasks (including 6 diseases and 4 domains) demonstrate our superiority over state-of-the-art methods. In comparison to advanced methods, our approach effectively mitigates negative transfer, ensuring safe knowledge transfer.
Poster
Linwei Chen · Lin Gu · Ying Fu

[ Exhibit Hall I ]

Abstract
Vision Transformers (ViTs) have significantly advanced computer vision, demonstrating strong performance across various tasks. However, the attention mechanism in ViTs makes each layer function as a low-pass filter, and the stacked-layer architecture in existing transformers suffers from frequency vanishing. This leads to the loss of critical details and textures. We propose a novel, circuit-theory-inspired strategy called Frequency-Dynamic Attention Modulation (FDAM), which can be easily plugged into ViTs. FDAM directly modulates the overall frequency response of ViTs and consists of two techniques: Attention Inversion (AttInv) and Frequency Dynamic Scaling (FreqScale). Since circuit theory uses low-pass filters as fundamental elements, we introduce AttInv, a method that generates complementary high-pass filtering by inverting the low-pass filter in the attention matrix, and dynamically combining the two. We further design FreqScale to weight different frequency components for fine-grained adjustments to the target response function. Through feature similarity analysis and effective rank evaluation, we demonstrate that our approach avoids representation collapse, leading to consistent performance improvements across various models, including SegFormer, DeiT, and MaskDINO. These improvements are evident in tasks such as semantic segmentation, object detection, and instance segmentation. Additionally, we apply our method to remote sensing detection, achieving state-of-the-art results in single-scale settings. The code will be publicly available …
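The inversion step has a compact linear-algebra reading: if a row-stochastic attention matrix $A$ acts as a low-pass filter on token features, then $I - A$ is a complementary high-pass filter. The sketch below blends the two with a scalar `alpha`; in FDAM the combination is learned and finer-grained, so treat this as an assumed simplification.

```python
import torch

def attention_inversion(attn, alpha):
    """Sketch of AttInv-style filtering: a row-stochastic attention matrix acts as a
    low-pass graph filter, so I - A is its complementary high-pass filter; `alpha`
    in [0, 1] blends the two."""
    eye = torch.eye(attn.shape[-1], device=attn.device)
    high_pass = eye - attn                      # complements the low-pass attention
    return (1 - alpha) * attn + alpha * high_pass

attn = torch.softmax(torch.randn(196, 196), dim=-1)   # toy attention matrix
x = torch.randn(196, 64)                               # token features
y = attention_inversion(attn, alpha=0.3) @ x           # frequency-modulated token mixing
```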
Poster
Jingyang Li · Kuangyu Ding · Kim-chuan Toh · Pan Zhou

[ Exhibit Hall I ]

Abstract
Preconditioned stochastic optimization algorithms, exemplified by Shampoo, outperform first-order optimizers by offering theoretical convergence benefits and practical gains in large-scale neural network training. However, they incur substantial memory overhead due to the storage demands of non-diagonal preconditioning matrices. To address this, we introduce 4-bit quantization for Shampoo's preconditioners. We introduce two key methods: First, we apply Cholesky decomposition followed by quantization of the Cholesky factors, reducing memory usage by leveraging their lower triangular structure while better preserving spectral properties to minimize information loss. To our knowledge, this is the first quantization approach applied to Cholesky factors of preconditioners. Second, we incorporate error feedback in the quantization process, efficiently storing Cholesky factor and error state in the lower and upper triangular parts of the same matrix. Through extensive experiments, we demonstrate that combining Cholesky quantization with error feedback enhances memory efficiency and algorithm performance in large-scale deep-learning tasks. Theoretically, we also provide convergence proofs for quantized Shampoo under both smooth and non-smooth stochastic optimization settings. The source code is included in the supplementary and will be publicly released.
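The core memory-saving step can be sketched as follows: factor the preconditioner, quantize the lower-triangular Cholesky factor to a symmetric 4-bit grid, and carry the quantization error forward as feedback. The scaling scheme and the separate error tensor here are simplifications; the paper additionally packs factor and error into the lower and upper triangles of one matrix.

```python
import torch

def quantize_cholesky_with_error_feedback(precond, err, bits=4):
    """Toy sketch (not the paper's full algorithm): quantize the lower-triangular
    Cholesky factor of an SPD preconditioner to a `bits`-bit symmetric grid and
    keep the quantization error as feedback for the next step."""
    L = torch.linalg.cholesky(precond)          # lower-triangular factor
    target = L + err                            # fold previous quantization error back in
    scale = target.abs().max() / (2 ** (bits - 1) - 1)
    q = torch.clamp(torch.round(target / scale),
                    -(2 ** (bits - 1) - 1), 2 ** (bits - 1) - 1)
    L_q = q * scale                             # dequantized low-precision factor
    new_err = target - L_q                      # error feedback for the next step
    return L_q, new_err

P = torch.randn(8, 8)
P = P @ P.T + 8 * torch.eye(8)                  # SPD preconditioner
L_q, err = quantize_cholesky_with_error_feedback(P, torch.zeros(8, 8))
```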
Poster
Zechao Hu · Zhengwei Yang · Hao Li · Zheng Wang · Yixiong Zou

[ Exhibit Hall I ]

Abstract
Sketch-based person re-identification (re-ID) enables pedestrian retrieval using sketches. While recent methods have improved modality alignment between sketches and RGB images, the challenge of subjective style variation, where sketches exhibit diverse and unpredictable appearances, remains largely unresolved. A natural solution is to train on a diverse range of pedestrian sketches, but the high cost of large-scale pedestrian sketch collection makes this impractical. In contrast, sketches of general categories (e.g., animals, objects) exhibit diverse style variations and are accessible at a low cost, making them an intuitive and scalable alternative for enhancing style generalization in sketch re-ID. To this end, we propose Adaptive Incremental Prompt-tuning (AIP), the first approach that explores cross-category subjective style generalization for sketch re-ID. Specifically, AIP incorporates a multi-stage prompt-tuning strategy that learns a broad but shareable spectrum of sketch styles from non-pedestrian data. In addition, an input-sensitive prompt generator enables the model to adapt dynamically to unseen sketch styles. Extensive experimental results demonstrate that the performance gain is not merely attributed to the inclusion of additional data but rather to the effectiveness of AIP in leveraging non-pedestrian data for subjective style generalization. Our method outperforms existing works by a significant margin, establishing new state-of-the-art results.
Poster
Tianyu Fu · Tengxuan Liu · Qinghao Han · Guohao Dai · Shengen Yan · Huazhong Yang · Xuefei Ning · Yu Wang

[ Exhibit Hall I ]

Abstract
The increasing demand to process long and high-resolution videos significantly burdens Large Vision-Language Models (LVLMs) due to the enormous number of visual tokens. Existing token reduction methods primarily prune tokens based on importance metrics, such as accumulative attention scores. However, even important tokens may exhibit high redundancy caused by similarity among adjacent video frames and repetitive visual elements. To address this limitation, we propose FrameFusion, a novel token reduction approach integrating similarity-based merging with importance-based pruning. We conduct a thorough study on token similarity characteristics, revealing three key insights: (1) spatially corresponding vision tokens between adjacent frames have higher cosine similarities compared to other token pairs; (2) high token similarities prominently decrease in deeper model layers; and (3) token similarity rankings are highly consistent across different layers. Guided by these observations, FrameFusion computes token similarities exclusively between corresponding vision tokens from adjacent frames, applies token merging at initial successive layers followed by pruning in deeper layers, and adopts a cascaded merging strategy to further enhance efficiency. We evaluate FrameFusion comprehensively across six diverse LVLMs, ranging from 2B to 72B parameters, using five video benchmarks encompassing video retrieval, question-answering, and spatial-temporal understanding tasks. Experiments show that FrameFusion reduces vision tokens by 70\%, achieving 1.6 – 3.6$\times$ end-to-end …
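The first observation suggests a very simple merging rule, sketched below: compare each token only with the spatially corresponding token of the previous frame and fold highly similar pairs into a running representative (a cascaded merge). The threshold and function name are assumptions; importance-based pruning in deeper layers is not shown.

```python
import torch
import torch.nn.functional as F

def merge_corresponding_tokens(tokens: torch.Tensor, threshold: float = 0.9):
    """Similarity-based merging of spatially corresponding tokens across
    adjacent frames (illustrative sketch).

    tokens: (T, N, D) vision tokens for T frames, N tokens per frame.
    Returns the merged representative tokens plus the per-frame leftovers
    that were too dissimilar to merge.
    """
    T, N, _ = tokens.shape
    rep = tokens[0].clone()                          # running representatives
    counts = torch.ones(N, 1, device=tokens.device)  # frames covered per position
    leftovers = []
    for t in range(1, T):
        cur = tokens[t]
        sim = F.cosine_similarity(rep, cur, dim=-1)  # only corresponding pairs
        m = sim > threshold
        counts[m] += 1
        rep[m] += (cur[m] - rep[m]) / counts[m]      # cascaded running average
        leftovers.append(cur[~m])                    # kept as separate tokens
    return rep, leftovers
```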
Poster
Yong Liu · Song-Li Wu · Sule Bai · Jiahao Wang · Yitong Wang · Yansong Tang

[ Exhibit Hall I ]

Abstract
Open-vocabulary segmentation aims to achieve segmentation of arbitrary categories given unlimited text inputs as guidance. To achieve this, recent works have focused on developing various technical routes to exploit the potential of large-scale pre-trained vision-language models and have made significant progress on existing benchmarks. However, we find that existing test sets are limited in measuring the models' comprehension of ``open-vocabulary'' concepts, as their semantic space closely resembles the training space, even with many overlapping categories. To this end, we present a new benchmark named OpenBench that differs significantly from the training semantics. It is designed to better assess the model's ability to understand and segment a wide range of real-world concepts. When testing existing methods on OpenBench, we find that their performance diverges from the conclusions drawn on existing test sets. In addition, we propose a method named OVSNet to improve segmentation performance in diverse and open scenarios. Through elaborate fusion of heterogeneous features and cost-free expansion of the training space, OVSNet achieves state-of-the-art results on both existing datasets and our proposed OpenBench. Corresponding analyses demonstrate the soundness and effectiveness of our proposed benchmark and method.
Poster
Zelong Sun · Dong Jing · Zhiwu Lu

[ Exhibit Hall I ]

Abstract
Zero-Shot Composed Image Retrieval (ZS-CIR) aims to retrieve target images by integrating information from a composed query (reference image and modification text) without training samples. Existing methods primarily combine caption models and large language models (LLMs) to generate target captions based on composed queries but face various issues such as incompatibility, visual information loss, and insufficient reasoning. In this work, we propose CoTMR, a training-free framework crafted for ZS-CIR with novel Chain-of-thought (CoT) and Multi-scale Reasoning. Instead of relying on caption models for modality transformation, CoTMR employs the Large Vision-Language Model (LVLM) to achieve unified understanding and reasoning for composed queries. To enhance the reasoning reliability, we devise CIRCoT, which guides the LVLM through a step-by-step inference process using predefined subtasks. Considering that existing approaches focus solely on global-level reasoning, our CoTMR incorporates multi-scale reasoning to achieve more comprehensive inference via fine-grained predictions about the presence or absence of key elements at the object scale. Further, we design a Multi-Grained Scoring (MGS) mechanism, which integrates CLIP similarity scores of the above reasoning outputs with candidate images to realize precise retrieval. Extensive experiments demonstrate that our CoTMR not only drastically outperforms previous methods across four prominent benchmarks but also offers appealing …
Poster
Zhuoyan Luo · Yinghao Wu · Tianheng Cheng · Yong Liu · Yicheng Xiao · Hongfa Wang · Xiao-Ping Zhang · Yujiu Yang

[ Exhibit Hall I ]

Abstract
The newly proposed Generalized Referring Expression Segmentation (GRES) amplifies the formulation of classic RES by involving complex multiple/non-target scenarios. Recent approaches address GRES by directly extending the well-adopted RES frameworks with object-existence identification. However, these approaches tend to encode multi-granularity object information into a single representation, which makes it difficult to precisely represent comprehensive objects of different granularity. Moreover, the simple binary object-existence identification across all referent scenarios fails to specify their inherent differences, incurring ambiguity in object understanding. To tackle the above issues, we propose a **Co**unting-Aware **H**ierarchical **D**ecoding framework (CoHD) for GRES. By decoupling the intricate referring semantics into different granularities with a visual-linguistic hierarchy, and dynamically aggregating them with intra- and inter-selection, CoHD boosts multi-granularity comprehension with the reciprocal benefit of the hierarchical nature. Furthermore, we incorporate the counting ability by embodying multiple/single/non-target scenarios into count- and category-level supervision, facilitating comprehensive object perception. Experimental results on gRefCOCO, Ref-ZOM, R-RefCOCO, and RefCOCO benchmarks demonstrate the effectiveness and rationality of CoHD, which outperforms state-of-the-art GRES methods by a remarkable margin. Code will be available.
Poster
YINGXIAN Chen · Jiahui Liu · Ruidi Fan · Yanwei Li · Chirui CHANG · Shizhen Zhao · Wilton.W.T. Fok · Xiaojuan Qi · Yik WU

[ Exhibit Hall I ]

Abstract
Understanding abnormal events in videos is a vital and challenging task that has garnered significant attention in a wide range of applications. Although current video understanding Multi-modal Large Language Models (MLLMs) are capable of analyzing general videos, they often struggle to handle anomalies due to the spatial and temporal sparsity of abnormal events, where the redundant information always leads to suboptimal outcomes. To address these challenges, exploiting the representation and generalization capabilities of Vision Language Models (VLMs) and Large Language Models (LLMs), we propose VA-GPT, a novel MLLM designed for summarizing and localizing abnormal events in various videos. Our approach efficiently aligns effective tokens between visual encoders and LLMs through two key proposed modules: Spatial Effective Token Selection (SETS) and Temporal Effective Token Generation (TETG). These modules enable our model to effectively capture and analyze both spatial and temporal information associated with abnormal events, resulting in more accurate responses and interactions. Furthermore, we construct an instruction-following dataset specifically for fine-tuning video-anomaly-aware MLLMs, and introduce a cross-domain evaluation benchmark based on the XD-Violence dataset. Our proposed method outperforms existing state-of-the-art methods on various benchmarks. The code and data will be released.
Poster
Bin Yang · Yulin Zhang · Hong-Yu Zhou · Sibei Yang

[ Exhibit Hall I ]

Abstract
Detection transformers have been applied to human-object interaction (HOI) detection, enhancing the localization and recognition of human-action-object triplets in images. Despite remarkable progress, this study identifies a critical issue—"Toxic Siblings" bias—which hinders the interaction decoder's learning, as numerous similar yet distinct HOI triplets interfere with and even compete against each other on both the input and output sides of the interaction decoder. This bias arises from high confusion among sibling triplets/categories, where increased similarity paradoxically reduces precision, as one’s gain comes at the expense of its toxic sibling’s decline. To address this, we propose two novel debiasing learning objectives—"contrastive-then-calibration" and "merge-then-split"—targeting the input and output perspectives, respectively. The former samples sibling-like incorrect HOI triplets and reconstructs them into correct ones, guided by strong positional priors. The latter first learns shared features among sibling categories to distinguish them from other groups, then explicitly refines intra-group differentiation to preserve uniqueness. Experiments show that we significantly outperform both the baseline (+9.18\% mAP on HICO-Det) and the state-of-the-art (+3.59\% mAP) across various settings. The source code will be made public.
Poster
Yuci Liang · Xinheng Lyu · Meidan Ding · Wenting Chen · Xiaohan Xing · Jipeng Zhang · Sen Yang · Xiangjian He · Song Wu · Xiyue Wang · Linlin Shen

[ Exhibit Hall I ]

Abstract
Recent advances in computational pathology have introduced whole slide image (WSI)-level multimodal large language models (MLLMs) for automated pathological analysis. However, current WSI-level MLLMs face two critical challenges: limited explainability in their decision-making process and insufficient attention to morphological features crucial for accurate diagnosis. To address these challenges, we first introduce \textbf{WSI-Bench}, a large-scale morphology-aware benchmark containing 180k VQA pairs from 9,850 WSIs across 30 cancer types, specifically designed to evaluate MLLMs' understanding of morphological characteristics crucial for accurate diagnosis. To the best of our knowledge, WSI-Bench is the first benchmark to systematically evaluate morphological understanding capabilities in WSI analysis. To enhance model explainability, we present \textbf{WSI-LLaVA}, an MLLM framework for gigapixel WSI understanding with a three-stage training strategy, which can provide detailed morphological findings to explain its final answer. For more precise model assessment in pathological contexts, we develop two specialized WSI metrics: \textbf{WSI-Precision} and \textbf{WSI-Relevance}. Extensive evaluation on WSI-Bench reveals both the capabilities and limitations of current WSI MLLMs in morphological analysis and various pathology tasks, while demonstrating WSI-LLaVA's superior performance across all capabilities.
Poster
Rongpei Hong · Jian Lang · Ting Zhong · Fan Zhou

[ Exhibit Hall I ]

Abstract
The rapid proliferation of online video-sharing platforms has accelerated the spread of malicious videos, creating an urgent need for robust detection methods. However, the performance and generalizability of existing detection approaches are severely limited by the scarcity of annotated video data, as manually curating large-scale malicious detection datasets is both labor-intensive and impractical. To address this challenge, we propose CRAVE, a novel CRoss-domAin retrieVal augmEntation framework that transfers knowledge from the resource-rich image-text domain to enhance malicious video detection. Specifically, CRAVE introduces a Pseudo-Pair Retriever to identify semantically relevant image-text data for high-quality cross-domain augmentation. Additionally, a Contrastive Cross-Domain Augmenter is designed to disentangle domain-shared and -unique representations, effectively bridging the domain gaps during knowledge transfer. These shared image-text representations are then leveraged to refine video representations, yielding more discriminative features for accurate malicious content detection. Experiments on four video datasets demonstrate that CRAVE largely outperforms competitive baselines in both performance and generalization, providing an innovative and strong solution to the issue of video data scarcity.
Poster
Zhongwei Qiu · Hanqing Chao · Tiancheng Lin · Wanxing Chang · Zijiang Yang · Wenpei Jiao · Yixuan Shen · Yunshuo Zhang · Yelin Yang · Wenbin Liu · Hui Jiang · Yun Bian · Ke Yan · Dakai Jin · Le Lu

[ Exhibit Hall I ]

Abstract
Histopathology plays a critical role in medical diagnostics, with whole slide images (WSIs) offering valuable insights that directly influence clinical decision-making. However, the large size and complexity of WSIs may pose significant challenges for deep learning models, in both computational efficiency and effective representation learning. In this work, we introduce Pixel-Mamba, a novel deep learning architecture designed to efficiently handle gigapixel WSIs. Pixel-Mamba leverages the Mamba module, a state-space model (SSM) with linear memory complexity, and incorporates local inductive biases through progressively expanding tokens, akin to convolutional neural networks. This enables Pixel-Mamba to hierarchically combine both local and global information while efficiently addressing computational challenges. Remarkably, Pixel-Mamba achieves or even surpasses the quantitative performance of state-of-the-art (SOTA) foundation models that were pretrained on millions of WSIs or WSI-text pairs, in a range of tumor staging and survival analysis tasks, even without requiring any pathology-specific pretraining. Extensive experiments demonstrate the efficacy of Pixel-Mamba as a powerful and efficient framework for end-to-end WSI analysis.
Poster
Maximilian Augustin · Yannic Neuhaus · Matthias Hein

[ Exhibit Hall I ]

Abstract
Vision-language models (VLMs) are prone to object hallucinations, where they erroneously indicate the presence of certain objects in an image. Existing benchmarks quantify hallucinations using relatively small, labeled datasets. However, this approach is i) insufficient to assess hallucinations that arise in open-world settings, where VLMs are widely used, and ii) inadequate for detecting systematic errors in VLMs. We propose DASH (Detection and Assessment of Systematic Hallucinations), an automatic, large-scale pipeline designed to identify systematic hallucinations of VLMs on real-world images in an open-world setting. A key component is DASH-OPT for image-based retrieval, where we optimize over the “natural image manifold” to generate images that mislead the VLM. The output of DASH consists of clusters of real and semantically similar images for which the VLM hallucinates an object. We apply DASH to PaliGemma and two LLaVA-NeXT models across 380 object classes and, in total, find more than 15k clusters with 650k images. We study the transfer of the identified systematic hallucinations to other VLMs and show that fine-tuning PaliGemma with the model-specific images obtained with DASH mitigates object hallucinations.
Poster
Jiajin Tang · Zhengxuan Wei · Yuchen Zhu · Cheng Shi · Guanbin Li · Liang Lin · Sibei Yang

[ Exhibit Hall I ]

Abstract
Temporal sentence grounding aims to identify exact moments in a video that correspond to a given textual query, typically addressed with detection transformer (DETR) solutions. However, we find that typical strategies designed to enhance DETR do not improve, and may even degrade, its performance in this task. We systematically analyze and identify the root causes of this abnormal behavior: (1) conflicts between queries from similar target moments and (2) internal query conflicts due to the tension between global semantics and local localization. Building on these insights, we propose a simple yet powerful baseline, Sim-DETR, which extends the standard DETR with two minor modifications in the decoder layers: (1) constraining self-attention between queries based on their semantic and positional overlap and (2) adding query-to-frame alignment to bridge the global and local contexts. Experiments demonstrate that Sim-DETR unlocks the full potential of DETR for temporal sentence grounding, offering a strong baseline for future research. Code will be made publicly available.
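A rough sketch of the first modification, under the common assumption that each decoder query carries a semantic feature and a predicted (center, width) temporal span: build a boolean mask that blocks self-attention between queries that are semantically similar yet target barely overlapping moments. The thresholds and exact rule here are guesses, not the paper's.

```python
import torch
import torch.nn.functional as F

def query_self_attention_mask(query_feats, query_spans, sem_thr=0.5, iou_thr=0.5):
    """Boolean mask that blocks decoder self-attention between queries that
    are semantically similar but target barely overlapping moments
    (illustrative sketch; thresholds and the exact rule are assumptions).

    query_feats: (Q, D) decoder query features
    query_spans: (Q, 2) predicted (center, width) spans normalized to [0, 1]
    """
    q = F.normalize(query_feats, dim=-1)
    sem = q @ q.T                                             # semantic similarity
    starts = query_spans[:, 0] - query_spans[:, 1] / 2
    ends = query_spans[:, 0] + query_spans[:, 1] / 2
    inter = (torch.min(ends[:, None], ends[None, :])
             - torch.max(starts[:, None], starts[None, :])).clamp(min=0)
    union = (ends - starts)[:, None] + (ends - starts)[None, :] - inter
    iou = inter / union.clamp(min=1e-6)                       # positional overlap
    return (sem > sem_thr) & (iou < iou_thr)                  # True = attention blocked
```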
Poster
Onkar Susladkar · Gayatri Deshmukh · Yalcin Tur · Gorkem Durak · Ulas Bagci

[ Exhibit Hall I ]

Abstract
We introduce ViCTr (Vital Consistency Transfer), a framework for advancing medical image synthesis through a principled integration with Rectified Flow trajectories. Unlike traditional approaches, we modify the Tweedie formulation to accommodate linear trajectories within the Rectified Flow framework, enabling more accurate initial state approximation and consistent trajectory paths. ViCTr’s design allows for precise control over anatomical accuracy and pathological attributes across CT and MRI modalities via a two-stage architecture. In Stage 1, it performs anatomical learning on the ATLAS-8k dataset using Elastic Weight Consolidation (EWC) to selectively train model weights tailored for medical data. In Stage 2, an adversarial fine-tuning strategy is applied: the base model from Stage 1 remains frozen while a LoRA adapter is exclusively applied to the weights tuned in Stage 1, allowing targeted adaptation for downstream tasks while preserving the core medical data properties learned during pretraining. ViCTr achieves notable improvements by utilizing segmentation maps and textual prompts to enable refined control over CT and MRI synthesis. Extensive experiments on benchmark datasets, including BTCV, AMOS, and CirrMRI600+, demonstrate ViCTr’s superiority, showing significant enhancements in quantitative metrics and clinical detail, such as liver surface nodularity in cirrhosis synthesis. These results establish ViCTr as a major advancement in …
Poster
Rohan Sharma · Changyou Chen · Feng-Ju Chang · Seongjun Yun · Xiaohu Xie · Rui Meng · Dehong Xu · Alejandro Mottini · qingjun cui

[ Exhibit Hall I ]

Abstract
We present Multi-Modal Multi-Task Unified Embedding Model (M3T-UEM), a framework that advances vision-language matching and retrieval by leveraging a large language model (LLM) backbone. While concurrent LLM-based approaches like VLM2VEC, MM-Embed, NV-Embed, and MM-GEM have demonstrated impressive capabilities in multi-modal and multi-task scenarios, our work introduces novel mechanisms for task-adaptive learning and embedding extraction that further enhance the potential of LLM-based retrieval systems. Our key technical contribution lies in the development of a task-aware contrastive learning framework with an automated Bayesian weighting mechanism. This approach provides a principled way to balance multiple tasks during training, departing from conventional contrastive learning strategies. We further enhance the framework through a multiple-token summarization strategy and an auxiliary language modeling objective, which together significantly improve retrieval performance. Comprehensive experiments on M-BEIR and ICinW benchmarks demonstrate M3T-UEM's effectiveness, showing competitive or superior performance compared to both traditional encoder-based methods and recent LLM-based approaches. Furthermore, we demonstrate particular strengths in handling compositional conceptual changes and multilingual scenarios owing to the incorporation of an LLM backbone, where the method drastically outperforms CLIP in zero-shot settings, often by orders of magnitude.
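One standard way to realize automated, uncertainty-based weighting of multiple contrastive objectives is the learned homoscedastic-uncertainty scheme sketched below; the paper's Bayesian mechanism is likely more elaborate, so treat this only as a generic illustration.

```python
import torch
import torch.nn as nn

class UncertaintyTaskWeighting(nn.Module):
    """Learned homoscedastic-uncertainty weighting of multiple task losses
    (generic sketch; the paper's Bayesian mechanism may differ in detail)."""

    def __init__(self, num_tasks: int):
        super().__init__()
        self.log_vars = nn.Parameter(torch.zeros(num_tasks))

    def forward(self, task_losses):
        # task_losses: iterable of scalar losses, one per retrieval task
        total = torch.zeros((), device=self.log_vars.device)
        for i, loss in enumerate(task_losses):
            precision = torch.exp(-self.log_vars[i])      # learned 1 / sigma_i^2
            total = total + precision * loss + 0.5 * self.log_vars[i]
        return total
```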
Poster
Songsong Duan · Xi Yang · Nannan Wang

[ Exhibit Hall I ]

Abstract
Recent Training-Free Open-Vocabulary Semantic Segmentation (TF-OVSS) leverages a pre-trained vision-language model to segment images into open-set visual concepts without training or fine-tuning. The key to TF-OVSS is improving the local spatial representation of CLIP by leveraging self-correlation maps, thus preserving its zero-shot capability and achieving open understanding. However, most TF-OVSS methods utilize the Multi-Head Self-Attention (MHSA) mechanism to generate self-correlation maps, neglecting the diversity among multiple heads. In this paper, we explore the diversity of MHSA, revealing that the contributions of individual attention heads to the final results vary and are partly redundant. To address this issue, we introduce DIH-CLIP, a training-free CLIP model for open-vocabulary semantic segmentation. Specifically, we propose Selective Head Attention (SHA) to replace the traditional MHSA in CLIP, which contains two key designs: (1) evaluating the diversity of multi-head attention by computing information entropy scores of per-head attention maps and removing redundant attention heads via a threshold; (2) transferring the local representation of single-head attention to the global CLIP feature to enhance the local spatial representation capability of CLIP. Furthermore, we embed SHA into the middle layers of CLIP to extract plentiful details. Experiments on six benchmark datasets demonstrate the effectiveness of DIH-CLIP.
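The head-selection step can be pictured with the toy function below: score each attention head by the mean entropy of its attention rows and keep only heads above a threshold. The direction of the rule and the threshold value are assumptions for illustration.

```python
import torch

def select_heads_by_entropy(attn: torch.Tensor, tau: float = 4.0):
    """Score heads by mean attention-map entropy and keep those above tau
    (illustrative; the selection rule used in the paper may differ).

    attn: (H, N, N) per-head attention maps whose rows sum to 1.
    """
    eps = 1e-12
    row_entropy = -(attn * (attn + eps).log()).sum(dim=-1)   # (H, N)
    head_score = row_entropy.mean(dim=-1)                    # (H,)
    return (head_score > tau).nonzero(as_tuple=True)[0]      # indices of kept heads

# toy usage with random attention maps
attn = torch.softmax(torch.randn(12, 196, 196), dim=-1)
kept = select_heads_by_entropy(attn)
```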
Poster
Ryan Wong · Necati Cihan Camgoz · Richard Bowden

[ Exhibit Hall I ]

Abstract
Sign language representation learning presents unique challenges due to the complex spatio-temporal nature of signs and the scarcity of labeled datasets. Existing methods often rely either on models pre-trained on general visual tasks, which lack sign-specific features, or use complex multimodal and multi-branch architectures. To bridge this gap, we introduce a scalable, self-supervised framework for sign representation learning. We leverage important inductive (sign) priors during the training of our RGB model. To do this, we exploit simple but important skeleton-based cues while pretraining a masked autoencoder. These sign-specific priors, alongside feature regularization and an adversarial style-agnostic loss, provide a powerful backbone. Notably, our model does not require skeletal keypoints during inference, avoiding the limitations of keypoint-based models during downstream tasks. When finetuned, we achieve state-of-the-art performance for sign recognition on the WLASL, ASL-Citizen and NMFs-CSL datasets, using a simpler architecture and only a single modality. Beyond recognition, our frozen model excels in sign dictionary retrieval and sign translation, surpassing standard MAE pretraining and skeletal-based representations in retrieval. It also reduces computational costs for training existing sign translation models while maintaining strong performance on Phoenix2014T, CSL-Daily and How2Sign.
Poster
Zhixiang Chi · Yanan Wu · Li Gu · Huan Liu · Ziqiang Wang · Yang Zhang · Yang Wang · Konstantinos Plataniotis

[ Exhibit Hall I ]

Abstract
CLIP exhibits strong visual-textual alignment but struggles with open-vocabulary segmentation due to poor localization. Prior methods enhance spatial coherence by modifying intermediate attention, but this coherence is not consistently propagated to the final output due to subsequent operations such as projections. Additionally, intermediate attention lacks direct interaction with text representations; this semantic discrepancy limits the full potential of CLIP. In this work, we propose a training-free, feedback-driven self-adaptive framework that adapts output-based patch-level correspondences back to the intermediate attention. The output predictions, being the culmination of the model's processing, encapsulate the most comprehensive visual and textual semantics about each patch. Our approach enhances semantic consistency between internal representations and final predictions by leveraging the model's outputs as a stronger spatial coherence prior. We design key modules, including attention isolation, confidence-based pruning for sparse adaptation, and adaptation ensemble, to effectively feed the output coherence cues back into the model. Our method functions as a plug-in module, seamlessly integrating into four state-of-the-art approaches with three backbones (ViT-B, ViT-L, ViT-H). We further validate our framework across multiple attention types (Q-K, self-self, and Proxy augmented with MAE, SAM, and DINO). Our approach consistently improves their performance across eight benchmarks.
Poster
Mark Endo · Xiaohan Wang · Serena Yeung-Levy

[ Exhibit Hall I ]

Abstract
Recent works on accelerating Vision-Language Models achieve strong performance across a variety of vision-language tasks despite highly compressing visual information. In this work, we examine the popular acceleration approach of early pruning of visual tokens inside the language model. Surprisingly, we find that while strong performance is maintained across many tasks, it exhibits drastically different behavior for a subset of vision-centric tasks such as localization. Upon further investigation, we uncover a core issue with the acceleration approach where most tokens towards the top of the image are pruned away. Yet, on many benchmarks aiming to evaluate vision-centric capabilities, strong performance persists with the flawed pruning strategy, highlighting these benchmarks' limited ability to assess fine-grained visual capabilities. Based on these findings, we propose FEATHER (Fast and Effective Acceleration wiTH Ensemble cRiteria), a straightforward approach that resolves the discovered early-layer pruning issue and further enhances the preservation of relevant tokens via multistage pruning with early uniform sampling to ensure broad image coverage. With comparable computational savings, we find that FEATHER achieves more than 5x performance improvement on the vision-centric localization benchmarks compared to the original acceleration approach.
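The fix described above can be approximated by the helper below: reserve part of the token budget for evenly spaced (uniformly sampled) positions so every image region survives, and fill the remainder by the importance criterion. The fractions and the helper name are assumptions, not the paper's code.

```python
import torch

def prune_with_uniform_coverage(scores: torch.Tensor, num_keep: int,
                                uniform_frac: float = 0.5):
    """Combine criterion-based token selection with uniform sampling so the
    kept tokens cover the whole image (illustrative sketch).

    scores: (N,) per-token importance scores over the flattened image grid.
    """
    n = scores.numel()
    n_uniform = max(1, int(num_keep * uniform_frac))
    stride = max(1, n // n_uniform)
    uniform_idx = torch.arange(0, n, stride)              # evenly spaced tokens
    remaining = max(0, num_keep - uniform_idx.numel())
    avail = scores.clone()
    avail[uniform_idx] = float('-inf')                    # avoid double-counting
    top_idx = avail.topk(remaining).indices if remaining else uniform_idx[:0]
    return torch.unique(torch.cat([uniform_idx, top_idx]))  # sorted kept indices
```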
Poster
Yicong Li · Yiyang Chen · Zhenyuan Ma · Junbin Xiao · Xiang Wang · Angela Yao

[ Exhibit Hall I ]

Abstract
Language-guided Affordance Segmentation (LASO) aims to identify actionable object regions based on text instructions. At the core of its practicality is learning generalizable affordance knowledge that captures functional regions across diverse objects. However, current LASO solutions struggle to extend learned affordances to object categories that are not encountered during training. Scrutinizing these designs, we identify limited generalizability on unseen categories, stemming from (1) underutilized generalizable patterns in the intermediate layers of both 3D and text backbones, which impedes the formation of robust affordance knowledge, and (2) the inability to handle substantial variability in affordance regions across object categories due to a lack of structural knowledge of the target region. Towards this, we introduce a \textbf{G}enera\textbf{L}ized fr\textbf{A}mework on u\textbf{N}seen \textbf{C}ategori\textbf{E}s (GLANCE), incorporating two key components: a cross-modal connector that links intermediate stages of the text and 3D backbones to enrich pointwise embeddings with affordance concepts, and a VLM-guided query generator that provides affordance priors by extracting a few 3D key points based on the intra-view reliability and cross-view consistency of their multi-view segmentation masks. Extensive experiments on two benchmark datasets demonstrate that GLANCE outperforms state-of-the-art methods (SoTAs), with notable improvements in generalization to unseen categories. Our code is available at \url{https://anonymous.4open.science/r/GLANCE}.
Poster
Jiayuan Chen · Thai-Hoang Pham · Yuanlong Wang · Ping Zhang

[ Exhibit Hall I ]

Abstract
High-throughput screening techniques, such as microscopy imaging of cellular responses to genetic and chemical perturbations, play a crucial role in drug discovery and biomedical research. However, robust perturbation screening for \textit{de novo} cell lines remains challenging due to the significant morphological and biological heterogeneity across cell lines. To address this, we propose a novel framework that integrates external biological knowledge into existing pretraining strategies to enhance microscopy image profiling models. Our approach explicitly disentangles perturbation-specific and cell line-specific representations using external biological information. Specifically, we construct a knowledge graph leveraging protein interaction data from STRING and Hetionet databases to guide models toward perturbation-specific features during pretraining. Additionally, we incorporate transcriptomic features from single-cell foundation models to capture cell line-specific representations. By learning these disentangled features, our method improves the generalization of imaging models to \textit{de novo} cell lines. We evaluate our framework on the RxRx database through one-shot fine-tuning on an RxRx1 cell line and few-shot fine-tuning on cell lines from the RxRx19a dataset. Experimental results demonstrate that our method enhances microscopy image profiling for \textit{de novo} cell lines, highlighting its effectiveness in real-world phenotype-based drug discovery applications.
Poster
Yuzhang Shang · Mu Cai · Bingxin Xu · Yong Jae Lee · Yan Yan

[ Exhibit Hall I ]

Abstract
Large Multimodal Models (LMMs) have shown significant visual reasoning capabilities by connecting a visual encoder and a large language model. LMMs typically take in a fixed and large amount of visual tokens, such as the penultimate layer features in the CLIP visual encoder, as the prefix content. Recent LMMs incorporate more complex visual inputs, such as high-resolution images and videos, which further increases the number of visual tokens significantly. However, due to the inherent design of the Transformer architecture, the computational costs of these models tend to increase quadratically with the number of input tokens. To tackle this problem, we explore a token reduction mechanism that identifies significant spatial redundancy among visual tokens. In response, we propose PruMerge, a novel adaptive visual token reduction strategy that significantly reduces the number of visual tokens without compromising the performance of LMMs. Specifically, to measure the importance of each token, we exploit the sparsity observed in the visual encoder, characterized by the sparse distribution of attention scores between the class token and visual tokens. This sparsity enables us to dynamically select the most crucial visual tokens to retain. Subsequently, we cluster the selected (unpruned) tokens based on their key similarity and merge them …
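The selection-then-merge recipe can be sketched as follows: keep the visual tokens that receive the most class-token attention, assign each pruned token to its most similar kept token by key similarity, and average each cluster into its kept token. The fixed keep ratio below replaces the adaptive, sparsity-driven budget described above, so this is only an approximation.

```python
import torch
import torch.nn.functional as F

def prune_and_merge(tokens, cls_attn, keys, keep_ratio=0.25):
    """Reduce visual tokens by class-attention-based selection followed by
    key-similarity clustering and merging (illustrative sketch).

    tokens:   (N, D)  visual tokens
    cls_attn: (N,)    attention from the class token to each visual token
    keys:     (N, Dk) key vectors used to measure token similarity
    """
    n_keep = max(1, int(tokens.shape[0] * keep_ratio))
    keep_idx = cls_attn.topk(n_keep).indices
    drop_mask = torch.ones(tokens.shape[0], dtype=torch.bool, device=tokens.device)
    drop_mask[keep_idx] = False

    kept = tokens[keep_idx].clone()
    sim = F.normalize(keys[drop_mask], dim=-1) @ F.normalize(keys[keep_idx], dim=-1).T
    assign = sim.argmax(dim=-1)                        # nearest kept token per pruned token
    dropped = tokens[drop_mask]
    for j in range(n_keep):
        members = dropped[assign == j]
        if len(members):
            kept[j] = (kept[j] + members.mean(0)) / 2  # merge cluster into its kept token
    return kept
```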
Poster
Changhao Li · Xinrui Chen · Ji Wang · Kang Zhao · Jianfei Chen

[ Exhibit Hall I ]

Abstract
Quantization is a key technique to reduce network size and computational complexity by representing the network parameters with a lower precision. Traditional quantization methods rely on access to original training data, which is often restricted due to privacy concerns or security challenges. Zero-shot Quantization (ZSQ) addresses this by using synthetic data generated from pre-trained models, eliminating the need for real training data. Recently, ZSQ has been extended to object detection. However, existing methods use unlabeled task-agnostic synthetic images that lack the specific information required for object detection, leading to suboptimal performance. In this paper, we propose a novel task-specific ZSQ framework for object detection networks, which consists of two main stages. First, we introduce a bounding box and category sampling strategy to synthesize a task-specific calibration set from the pre-trained network, reconstructing object locations, sizes, and category distributions without any prior knowledge. Second, we integrate task-specific training into the knowledge distillation process to restore the performance of quantized detection networks. Extensive experiments conducted on the MS-COCO and Pascal VOC datasets demonstrate the efficiency and state-of-the-art performance of our method.
Poster
Shicai Wei · Chunbo Luo · Yang Luo

[ Exhibit Hall I ]

Abstract
Multimodal learning often suffers from under-optimization and may perform worse than unimodal learning. Existing methods attribute this problem to the imbalanced learning between modalities and rebalance them through gradient modulation. However, they fail to explain why the dominant modality in multimodal models also underperforms that in unimodal learning. In this work, we reveal the optimization conflict between the modality encoder and modality fusion module in multimodal models. Specifically, we prove that the cross-modal fusion in multimodal models decreases the gradient passed back to each modality encoder compared with unimodal models. Consequently, the performance of each modality in the multimodal model is inferior to that in the unimodal model. To this end, we propose a disentangled gradient learning (DGL) framework to decouple the optimization of the modality encoder and modality fusion module in the multimodal model. DGL truncates the gradient back-propagated from the multimodal loss to the modality encoder and replaces it with the gradient from the unimodal loss. In addition, DGL removes the gradient back-propagated from the unimodal loss to the modality fusion module. This helps eliminate the gradient interference between the modality encoder and modality fusion module while ensuring their respective optimization processes. Finally, extensive experiments on multiple types …
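In PyTorch, the gradient truncation described above boils down to detaching the encoder features before fusion, as in the minimal two-modality sketch below. Module names, heads, and losses are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DisentangledGradients(nn.Module):
    """Two-modality model where the encoders are optimized only by unimodal
    losses and the fusion module only by the multimodal loss (sketch)."""

    def __init__(self, enc_a: nn.Module, enc_b: nn.Module, dim: int, n_cls: int):
        super().__init__()
        self.enc_a, self.enc_b = enc_a, enc_b
        self.head_a = nn.Linear(dim, n_cls)       # unimodal head, modality A
        self.head_b = nn.Linear(dim, n_cls)       # unimodal head, modality B
        self.fusion = nn.Linear(2 * dim, n_cls)   # multimodal fusion module

    def forward(self, xa, xb, y):
        fa, fb = self.enc_a(xa), self.enc_b(xb)
        # unimodal losses: the only gradients that reach the encoders
        loss_uni = F.cross_entropy(self.head_a(fa), y) + F.cross_entropy(self.head_b(fb), y)
        # detach so the multimodal loss never back-propagates into the encoders
        fused = self.fusion(torch.cat([fa.detach(), fb.detach()], dim=-1))
        loss_multi = F.cross_entropy(fused, y)
        return loss_uni + loss_multi

# toy usage with linear "encoders"
model = DisentangledGradients(nn.Linear(32, 16), nn.Linear(48, 16), dim=16, n_cls=5)
loss = model(torch.randn(8, 32), torch.randn(8, 48), torch.randint(0, 5, (8,)))
loss.backward()
```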
Poster
Junho Kim · Hyungjin Chung · Byung-Hoon Kim

[ Exhibit Hall I ]

Abstract
Category-agnostic pose estimation (CAPE) has traditionally relied on support images with annotated keypoints, a process that is often cumbersome and may fail to fully capture the necessary correspondences across diverse object categories. Recent efforts have explored the use of text queries, leveraging their enhanced stability and generalization capabilities. However, existing approaches often remain constrained by their reliance on support queries, their failure to fully utilize the rich priors embedded in pre-trained large language models, and the limitations imposed by their parametric distribution assumptions. To address these challenges, we introduce CapeLLM, the first multimodal large language model (MLLM) designed for CAPE. Our method employs only the query image and detailed text descriptions as input to estimate category-agnostic keypoints. It encompasses effective training strategies and carefully designed instructions for applying the MLLM to CAPE. Moreover, we propose an inference mechanism that further enhances the reasoning process for unseen keypoints, while flexibly modeling their underlying spatial distribution and uncertainty, allowing for adaptive refinement based on contextual cues. We conducted extensive experiments to apply the MLLM to CAPE effectively, focusing not only on the model architecture and prompt design but also on ensuring robustness across input variations. Our approach sets a new state-of-the-art …
Poster
Sanjoy Chowdhury · Hanan Gani · Nishit Anand · Sayan Nag · Ruohan Gao · Mohamed Elhoseiny · Salman Khan · Dinesh Manocha

[ Exhibit Hall I ]

Abstract
Recent advancements in reasoning optimization have greatly enhanced the performance of large language models (LLMs). However, existing work fails to address the complexities of audio-visual scenarios, underscoring the need for further research. In this paper, we introduce AURELIA, a novel actor-critic based audio-visual (AV) reasoning framework that distills structured, step-by-step reasoning into AVLLMs at test time, improving their ability to process complex multi-modal inputs without additional training or fine-tuning. To further advance AVLLM reasoning skills, we present AVReasonBench, a challenging benchmark comprising 4500 audio-visual questions, each paired with detailed step-by-step reasoning. Our benchmark spans six distinct tasks, including AV-GeoIQ, which evaluates AV reasoning combined with geographical and cultural knowledge. Evaluating 18 AVLLMs on AVReasonBench reveals significant limitations in their multi-modal reasoning capabilities. Using AURELIA, we achieve up to a 100% relative improvement, demonstrating its effectiveness. This performance gain highlights the potential of reasoning-enhanced data generation for advancing AVLLMs in real-world applications. Our code and data will be publicly released.
Poster
Gueter Josmy Faure · Jia-Fong Yeh · Min-Hung Chen · Hung-Ting Su · Shang-Hong Lai · Winston Hsu

[ Exhibit Hall I ]

Abstract
Long-form video understanding presents unique challenges that extend beyond traditional short-video analysis approaches, particularly in capturing long-range dependencies, processing redundant information efficiently, and extracting high-level semantic concepts. To address these challenges, we propose a novel approach that more accurately reflects human cognition. This paper introduces HERMES: temporal-coHERent long-forM understanding with Episodes and Semantics, featuring two versatile modules that can enhance existing video-language models or operate as a standalone system. Our Episodic COmpressor (ECO) efficiently aggregates representations from micro to semi-macro levels, reducing computational overhead while preserving temporal dependencies. Our Semantics reTRiever (SeTR) enriches these representations with semantic information by focusing on broader context, dramatically reducing feature dimensionality while preserving relevant macro-level information. We demonstrate that these modules can be seamlessly integrated into existing SOTA models, consistently improving their performance while reducing inference latency by up to 43\%. As a standalone system, HERMES achieves state-of-the-art performance across multiple long-video understanding benchmarks in both zero-shot and fully-supervised settings. Our code will be made public.
Poster
Yusen Zhang · Wenliang Zheng · Aashrith Madasu · Peng Shi · Ryo Kamoi · Hao Zhou · Zhuoyang Zou · Shu Zhao · Sarkar Snigdha Sarathi Das · Vipul Gupta · Xiaoxin Lu · Nan Zhang · Ranran Zhang · Avitej Iyer · Renze Lou · Wenpeng Yin · Rui Zhang

[ Exhibit Hall I ]

Abstract
High-resolution image (HRI) understanding aims to process images with a large number of pixels such as pathological images and agricultural aerial images, both of which can exceed 1 million pixels. Vision Large Language Models (VLMs) typically handle higher-resolution images through dynamic patching. However, there is a lack of a comprehensive benchmark for VLMs to evaluate HRI understanding, leaving this domain underexplored. To address this gap, we introduce HRScene, a novel unified benchmark for HRI understanding with rich scenes. HRScene incorporates 25 real-world datasets and 2 synthetic diagnostic datasets with resolutions ranging from 1,024 $\times$ 1,024 to 35,503 $\times$ 26,627. HRScene is collected and re-annotated by 10 graduate-level annotators, covering 25 scenarios, ranging from microscopic and radiology images to street views, long-range pictures, and telescope images. It includes high-resolution images of real-world objects, scanned documents, and composite multi-image. The two diagnostic evaluation datasets are synthesized by combining the target image with the gold answer and similar distracting images in different orders. These datasets assess how well models utilize HRI by comparing performance across different image regions. We conduct extensive experiments involving 27 VLMs, including Gemini 2.0 Pro and GPT-4o. Experiments on HRScene show that current VLMs achieve an average accuracy of …
Poster
Hyojin Bahng · Caroline Chan · Fredo Durand · Phillip Isola

[ Exhibit Hall I ]

Abstract
Current metrics for image-text alignment rely on human preferences or task-oriented VQA datasets for supervision. We propose an alternative approach that leverages cycle consistency as a supervisory signal. Given an image, we generate diverse captions using image-to-text models, then map these captions back to image space with a text-to-image model. We compute a cycle consistency score by measuring perceptual similarity between the original and reconstructed image. The score is used to determine preferences over captions, i.e., more descriptive and accurate captions yield faithful reconstructions and are thus preferred over lower quality captions. Analogously, we can measure cycle consistency in the text-to-image-to-text direction by measuring textual similarity between an input caption and its reconstruction through the cycle. We explore both mapping directions, resulting in 398K image-to-text pairs and 468K text-to-image comparison pairs. Our reward model, trained on this dataset, outperforms state-of-the-art methods on detailed captioning tasks, with superior inference-time scalability when used as a verifier for Best-of-N evaluation. We will release our dataset, model, and code upon acceptance.
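In code, the image-to-text direction of the cycle reduces to the small routine below, where `captioner`, `generator`, and `perceptual_sim` stand in for an image-to-text model, a text-to-image model, and a perceptual metric such as LPIPS; all three are placeholder callables rather than a fixed API from the paper.

```python
import torch

def rank_captions_by_cycle_consistency(image, captioner, generator, perceptual_sim):
    """Score candidate captions by how faithfully a text-to-image model can
    reconstruct the original image from them (illustrative sketch).

    captioner(image)      -> list[str] of candidate captions
    generator(caption)    -> reconstructed image tensor
    perceptual_sim(a, b)  -> scalar similarity (higher = more faithful)
    """
    captions = captioner(image)
    scores = torch.tensor([perceptual_sim(image, generator(c)) for c in captions])
    order = scores.argsort(descending=True)        # best-to-worst caption preference
    return [captions[i] for i in order], scores[order]
```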
Poster
Yufei Zhan · Shurong Zheng · Yousong Zhu · Hongyin Zhao · Fan Yang · Ming Tang · Jinqiao Wang

[ Exhibit Hall I ]

Abstract
Large Vision Language Models have achieved fine-grained object perception, but the limitation of image resolution remains a significant obstacle to surpassing the performance of task-specific experts in complex and dense scenarios. Such limitation further restricts the model's potential to achieve nuanced visual and language referring in domains such as GUI Agents, counting, \textit{etc}. To address this issue, we introduce a unified high-resolution generalist model, Griffon v2, enabling flexible object referring with visual and textual prompts. To efficiently scale up image resolution, we design a simple and lightweight down-sampling projector to overcome the input tokens constraint in Large Language Models. This design inherently preserves the complete contexts and fine details and significantly improves multimodal perception ability, especially for small objects. Building upon this, we further equip the model with visual-language co-referring capabilities through a plug-and-play visual tokenizer. It enables user-friendly interaction with flexible target images, free-form texts, and even coordinates. Experiments demonstrate that Griffon v2 can localize objects of interest with visual and textual referring, achieve state-of-the-art performance on REC and phrase grounding, and outperform expert models in object detection, object counting, and REG. Data, codes, and models will be released.
Poster
Weihan Wang · zehai he · Wenyi Hong · Yean Cheng · Xiaohan Zhang · Ji Qi · Ming Ding · Xiaotao Gu · Shiyu Huang · Bin Xu · Yuxiao Dong · Jie Tang

[ Exhibit Hall I ]

Abstract
Recent progress in multimodal large language models has markedly enhanced the understanding of short videos (typically under one minute), and several evaluation datasets have emerged accordingly. However, these advancements fall short of meeting the demands of real-world applications such as embodied intelligence for long-term decision-making, in-depth movie reviews and discussions, and live sports commentary, all of which require comprehension of long videos spanning several hours. To address this gap, we introduce LVBench, a benchmark specifically designed for long video understanding. Our dataset comprises publicly sourced videos and encompasses a diverse set of tasks aimed at long video comprehension and information extraction. LVBench is designed to challenge multimodal models to demonstrate long-term memory and extended comprehension capabilities. Our extensive evaluations reveal that current multimodal models still underperform on these demanding long video understanding tasks. Through LVBench, we aim to spur the development of more advanced models capable of tackling the complexities of long video comprehension.
Poster
Yongxin Zhu · Bocheng Li · Yifei Xin · Zhihua Xia · Linli Xu

[ Exhibit Hall I ]

Abstract
Vector Quantization (VQ) is a widely used method for converting continuous representations into discrete codes, which has become fundamental in unsupervised representation learning. However, VQ models are often hindered by the problem of representation collapse in the latent space, which leads to low codebook utilization and limits the scalability of the codebook for large-scale training. Existing methods designed to mitigate representation collapse typically rely on complex optimization strategies or reduce the dimensionality of the latent space at the expense of model capacity, which does not fully resolve the core issue. In this study, we analyze the representation collapse in VQ models and identify its primary cause as the disjoint optimization of the codebook, where only a small subset of code vectors are updated through gradient descent. To address this issue, we propose \textbf{Sim}ple\textbf{VQ}, a novel method that reparameterizes the code vectors through a linear transformation layer based on a learnable latent basis. This transformation optimizes the \textit{entire linear space} spanned by the codebook, rather than merely updating \textit{single code vectors} selected by the nearest-neighbor search in vanilla VQ models. Although it is commonly understood that the multiplication of two linear matrices is equivalent to applying a single linear layer, our approach works …
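A minimal sketch of the reparameterization, assuming a standard VQ training recipe (straight-through estimator plus commitment term): the quantizer searches over `transform(basis)`, so gradients reach the whole linear span rather than only the selected code vectors. Loss weights and initialization are assumptions.

```python
import torch
import torch.nn as nn

class LinearReparamVQ(nn.Module):
    """Vector quantizer whose code vectors are produced by a linear layer
    applied to a learnable latent basis (illustrative sketch)."""

    def __init__(self, num_codes: int, dim: int, beta: float = 0.25):
        super().__init__()
        self.basis = nn.Parameter(torch.randn(num_codes, dim))
        self.transform = nn.Linear(dim, dim, bias=False)  # optimizes the whole span
        self.beta = beta

    def forward(self, z: torch.Tensor):
        # z: (B, D) continuous latents from the encoder
        codebook = self.transform(self.basis)              # (K, D)
        idx = torch.cdist(z, codebook).argmin(dim=-1)      # nearest-neighbor search
        z_q = codebook[idx]
        # codebook/commitment losses (gradients flow into transform and basis)
        loss = ((z.detach() - z_q) ** 2).mean() + self.beta * ((z - z_q.detach()) ** 2).mean()
        # straight-through estimator so the encoder still receives gradients
        z_q = z + (z_q - z).detach()
        return z_q, idx, loss
```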
Poster
Ylli Sadikaj · Hongkuan Zhou · Lavdim Halilaj · Stefan Schmid · Steffen Staab · Claudia Plant

[ Exhibit Hall I ]

Abstract
Precise optical inspection in industrial applications is crucial for minimizing scrap rates and reducing the associated costs. Besides merely detecting if a product is anomalous or not, it is crucial to know the distinct type of defect, such as a bent, cut, or scratch. The ability to recognize the ``exact'' defect type enables automated treatments of the anomalies in modern production lines. Current methods are limited to detecting whether a product is defective or not, without providing any insight into the defect type, let alone detecting and identifying multiple defects. We propose MultiADS, a zero-shot learning approach able to perform Multi-Anomaly Detection and Segmentation. The architecture of MultiADS comprises CLIP and extra linear layers to align the visual and textual representations in a joint feature space. To the best of our knowledge, our proposal is the first approach to perform a multi-type anomaly segmentation task in zero-shot learning. In contrast to other baselines, our approach i) generates specific anomaly masks for each distinct defect type, ii) learns to distinguish defect types, and iii) simultaneously identifies multiple defect types present in an anomalous product. Additionally, our approach outperforms zero/few-shot learning SoTA methods on image-level and pixel-level anomaly detection and segmentation tasks …
Poster
Nandish Chattopadhyay · Amira Guesmi · Muhammad Abdullah Hanif · Bassem ouni · Muhammad Shafique

[ Exhibit Hall I ]

Abstract
Adversarial attacks present a significant challenge to the dependable deployment of machine learning models, with patch-based attacks being particularly potent. These attacks introduce adversarial perturbations in localized regions of an image, deceiving even well-trained models. In this paper, we propose Outlier Detection and Dimension Reduction (ODDR), a comprehensive defense strategy engineered to counteract patch-based adversarial attacks through advanced statistical methodologies. Our approach is based on the observation that input features corresponding to adversarial patches—whether naturalistic or synthetic—deviate from the intrinsic distribution of the remaining image data and can thus be identified as outliers. ODDR operates through a robust three-stage pipeline: Fragmentation, Segregation, and Neutralization. This model-agnostic framework is versatile, offering protection across various tasks, including image classification, object detection, and depth estimation, and has proven effective in both CNN-based and Transformer-based architectures. In the Fragmentation stage, image samples are divided into smaller segments, preparing them for the Segregation stage, where advanced outlier detection techniques isolate anomalous features linked to adversarial perturbations. The Neutralization stage then applies dimension reduction techniques to these outliers, effectively neutralizing the adversarial impact while preserving critical information for the machine learning task. Extensive evaluation on benchmark datasets against state-of-the-art adversarial patches underscores the efficacy of ODDR. For example, our …
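The three-stage pipeline can be mocked up with off-the-shelf tools as below, with IsolationForest standing in for the outlier-detection stage and PCA for the dimension-reduction stage; the patch size, contamination rate, and variance kept are all assumptions rather than the paper's settings.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.decomposition import PCA

def oddr_style_defense(image: np.ndarray, patch: int = 16, var_keep: float = 0.95):
    """Rough fragmentation / segregation / neutralization sketch.

    image: (H, W, C) float array with H and W divisible by `patch`.
    """
    h, w, c = image.shape
    # Fragmentation: split the image into non-overlapping patches
    frags = image.reshape(h // patch, patch, w // patch, patch, c)
    frags = frags.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * c)
    # Segregation: flag fragments whose statistics deviate from the rest
    labels = IsolationForest(contamination=0.05, random_state=0).fit_predict(frags)
    # Neutralization: project outlier fragments onto the inliers' dominant subspace
    pca = PCA(n_components=var_keep).fit(frags[labels == 1])
    bad = frags[labels == -1]
    if len(bad):
        frags[labels == -1] = pca.inverse_transform(pca.transform(bad))
    out = frags.reshape(h // patch, w // patch, patch, patch, c)
    return out.transpose(0, 2, 1, 3, 4).reshape(h, w, c)
```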
Poster
Hao Tang · Zhiqing Guo · Liejun Wang · Chao Liu

[ Exhibit Hall I ]

Abstract
In recent years, it has been found that “grandmother cells” in the primary visual cortex (V1) of macaques can directly recognize visual input with complex shapes. This inspires us to examine the value of these cells in promoting research on medical image segmentation. In this paper, we design a Similarity Memory Prior Network (Sim-MPNet) for medical image segmentation. Specifically, we propose a Dynamic Memory Weights–Loss Attention (DMW-LA), which matches and remembers the category features of specific lesions or organs in medical images through the similarity memory prior in the prototype memory bank, thus helping the network to learn subtle texture changes between categories. DMW-LA also dynamically updates the similarity memory prior in reverse through the Weight-Loss Dynamic (W-LD) update strategy, effectively assisting the network to directly extract category features. In addition, we propose the Double-Similarity Global Internal Enhancement Module (DS-GIM) to deeply explore the internal differences in the feature distribution of input data through cosine similarity and Euclidean distance. Extensive experiments on four public datasets show that Sim-MPNet has better segmentation performance than other state-of-the-art methods. Our code is available on https://anonymous.4open.science/r/Sim-MPNet.
Poster
Han Ji · Yuqi Feng · Jiahao Fan · Yanan Sun

[ Exhibit Hall I ]

Abstract
Performance predictors have emerged as a promising method to accelerate the evaluation stage of neural architecture search (NAS). These predictors estimate the performance of unseen architectures by learning from the correlation between a small set of trained architectures and their performance. However, most existing predictors ignore the inherent distribution shift between limited training samples and diverse test samples. Hence, they tend to learn spurious correlations as shortcuts to predictions, leading to poor generalization. To address this, we propose a Causality-guided Architecture Representation Learning (CARL) method aiming to separate critical (causal) and redundant (non-causal) features of architectures for generalizable architecture performance prediction. Specifically, we employ a substructure extractor to split the input architecture into critical and redundant substructures in the latent space. Then, we generate multiple interventional samples by pairing critical representations with diverse redundant representations to prioritize critical features. Extensive experiments on five NAS search spaces demonstrate the state-of-the-art accuracy and superior interpretability of CARL. For instance, CARL achieves 97.67\% top-1 accuracy on CIFAR-10 using DARTS.
Poster
Peng Du · Hui Li · Han Xu · Paul Jeon · Dongwook Lee · Daehyun Ji · Ran Yang · Feng Zhu

[ Exhibit Hall I ]

Abstract
Discrete Wavelet Transform (DWT) has been widely explored to enhance the performance of image super-resolution (SR). Although some DWT-based methods improve SR by capturing fine-grained frequency signals, most existing approaches neglect the interrelations among multi-scale frequency sub-bands, resulting in inconsistencies and unnatural artifacts in the reconstructed images. To address this challenge, we propose a Diffusion Transformer model based on image Wavelet spectra for SR (DTWSR). DTWSR combines the strengths of diffusion models and transformers to capture the interrelations among multi-scale frequency sub-bands, leading to more consistent and realistic SR images. Specifically, we use a Multi-level Discrete Wavelet Transform (MDWT) to decompose images into wavelet spectra. A pyramid tokenization method is proposed that embeds the spectra into a sequence of tokens for the transformer model, facilitating the capture of features from both the spatial and frequency domains. A dual decoder is elaborately designed to handle the distinct variances in the low-frequency (LF) and high-frequency (HF) sub-bands, without omitting their alignment in image generation. Extensive experiments on multiple benchmark datasets demonstrate the effectiveness of our method, with high performance in both perceptual quality and fidelity.
Poster
Seungju Yoo · Hyuk Kwon · Joong-Won Hwang · Kibok Lee

[ Exhibit Hall I ]

Abstract
Object detection is a fundamental task in computer vision that has received significant attention in recent years. Despite advances in training object detection models, evaluating their performance in real-world applications remains challenging due to the substantial costs associated with image annotation. To address this issue, we propose Prediction Consistency and Reliability (PCR) as an automated model evaluation (AutoEval) method for object detection. Our method is motivated by the observation that most existing object detection models generate many candidate predictions, which are subsequently filtered through non-maximum suppression (NMS). Specifically, we analyze 1) the consistency between the final and redundant predictions and 2) the reliability of these predictions determined by their confidence scores, and propose PCR by examining their relationships with object detection performance. Furthermore, to facilitate a more realistic assessment of AutoEval methods for object detection, we construct meta-datasets incorporating various corruptions. Experimental results demonstrate the superior performance of PCR compared to the existing AutoEval methods.
Poster
Yuxuan Luo · Jiaqi Tang · Chenyi Huang · Feiyang Hao · Zhouhui Lian

[ Exhibit Hall I ]

Abstract
Chinese calligraphy, a UNESCO Heritage, remains computationally challenging due to visual ambiguity and cultural complexity. Existing AI systems fail to contextualize their intricate scripts, because of limited annotated data and poor visual-semantic alignment. We propose CalliReader, a vision-language model (VLM) that solves the Chinese Calligraphy Contextualization (CC$^2$) problem through three innovations: (1) character-wise slicing for precise character extraction and sorting, (2) CalliAlign for visual-text token compression and alignment, (3) embedding instruction tuning (e-IT) for improving alignment and addressing data scarcity. We also build CalliBench, the first benchmark for full-page calligraphic contextualization, addressing three critical issues in previous OCR and VQA approaches: fragmented context, shallow reasoning, and hallucination. Extensive experiments including user studies have been conducted to verify our CalliReader's \textbf{superiority to other state-of-the-art methods and even human professionals in page-level calligraphy recognition and interpretation}, achieving higher accuracy while reducing hallucination. Comparisons with reasoning models highlight the importance of accurate recognition as a prerequisite for reliable comprehension. Quantitative analyses validate CalliReader's efficiency; evaluations on document and real-world benchmarks confirm its robust generalization ability.
Poster
Weiwei Cao · Jianpeng Zhang · Zhongyi Shui · Sinuo Wang · Zeli Chen · Xi Li · Le Lu · Xianghua Ye · Qi Zhang · Tingbo Liang · Ling Zhang

[ Exhibit Hall I ]

Abstract
Vision-language pre-training (VLP) has great potential for developing multifunctional and general medical diagnostic capabilities. However, aligning medical images with a low signal-to-noise ratio (SNR) to reports with a high SNR presents a semantic density gap, leading to visual alignment bias. In this paper, we propose boosting vision semantic density to improve alignment effectiveness. On the one hand, we enhance visual semantics through disease-level vision contrastive learning, which strengthens the model's ability to differentiate between normal and abnormal samples for each anatomical structure. On the other hand, we introduce an anatomical normality modeling method to model the distribution of normal samples for each anatomy, leveraging VQ-VAE for reconstructing normal vision embeddings in latent space. This process amplifies abnormal signals by leveraging distribution shifts in abnormal samples, enhancing the model's perception and discrimination of abnormal attributes. The enhanced visual representation effectively captures the diagnostic-relevant semantics, facilitating more efficient and accurate alignment with the diagnostic report. We conduct extensive experiments on two chest CT datasets, CT-RATE and Rad-ChestCT, and an abdominal CT dataset, MedVL-69K, and comprehensively evaluate the diagnosis performance across multiple tasks in the chest and abdominal CT scenarios, achieving state-of-the-art zero-shot performance. Notably, our method achieved an average AUC of 84.9\% …
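The normality-modeling idea above lends itself to a small sketch: a VQ-VAE trained only on normal vision embeddings reconstructs abnormal inputs poorly, so the residual between an embedding and its reconstruction amplifies the abnormal signal. The module sizes and the plain nearest-codebook quantizer below are illustrative assumptions, not the paper's exact design.

```python
# Hedged sketch (PyTorch): reconstruct "normal" vision embeddings with a tiny
# VQ-VAE; the reconstruction residual serves as an amplified abnormality signal.
import torch
import torch.nn as nn

class NormalityVQVAE(nn.Module):
    def __init__(self, dim=256, codebook_size=512):
        super().__init__()
        self.encoder = nn.Linear(dim, dim)     # toy encoder/decoder
        self.decoder = nn.Linear(dim, dim)
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, z):                      # z: (B, dim) vision embeddings
        h = self.encoder(z)
        dist = torch.cdist(h, self.codebook.weight)   # (B, K) distances to codes
        h_q = self.codebook(dist.argmin(dim=-1))      # nearest codebook entry
        return self.decoder(h_q)               # reconstruction of a "normal" embedding

def abnormality_signal(model, z):
    with torch.no_grad():
        recon = model(z)
    return z - recon                           # large residual => likely abnormal
```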
Poster
Yihang Liu · Ying Wen · Longzhen Yang · Lianghua He · Heng Tao Shen

[ Exhibit Hall I ]

Abstract
Medical foundation models, pre-trained on diverse data sources, have shown significant potential for multi-domain medical imaging tasks. However, the domain shifts across different anatomical types significantly hinder their performance compared to domain-specific models. To address this challenge, we propose CoSMIC, a Continual Self-supervised learning framework for Multi-domain medIcal image analysis, with the core idea of Conditional mutual information maximization. Specifically, CoSMIC (i) acquires domain-specific knowledge sequentially, bypassing domain shifts caused by joint pre-training; (ii) enhances generalized representations by proposing a novel conditional contrastive loss to prevent catastrophic forgetting. This loss hierarchically aligns multi-view features within the current domain, maximizing their mutual information conditioned on domain-invariant representations extracted from prior domains through Anatomy-Guided Calibration. We pre-train CoSMIC across four medical domains and evaluate it on fifteen downstream datasets from five domains: Retinoscopy, Radiography, Ophthalmoscopy, Dermoscopy, and Histopathology (unseen). Experimental results show that CoSMIC (i) achieves robust feature extraction ability comparable to domain-specific models, (ii) exhibits exceptional generalization capability, significantly surpassing SOTA medical foundation models, and (iii) demonstrates superior transferability to new domains, overcoming current continual pre-training methods.
Poster
Zhaoyang Li · Yuan Wang · Guoxin Xiong · Wangkai Li · Yuwen Pan · Tianzhu Zhang

[ Exhibit Hall I ]

Abstract
Generalized few-shot point cloud segmentation (GFS-3DSeg) aims to segment objects of both base and novel classes using abundant base class samples and limited novel class samples. Existing GFS-3DSeg methods encounter bottlenecks due to the scarcity of novel class data and inter-class confusion. In this paper, we propose the LLM-Assisted Hyper-Relation Matching (LARM) framework, which leverages the wealth of prior knowledge in LLM to enrich novel category prototypes and introduces a hyper-relation matching strategy to mitigate false matches between point features and category prototypes caused by inter-class confusion. The proposed LARM enjoys several merits. First, the vast knowledge embedded in LLM can be an effective complement to vanilla category prototypes, enabling them to exhibit greater robustness. Second, the hyper-relation matching strategy harnesses the structure information implicit in the inter-class relationships, making it more robust than comparing individually. Extensive experiments on two benchmarks demonstrate that LARM outperforms previous state-of-the-art methods by large margins. The code will be open-sourced for further research.
Poster
Jun Li · Jinpeng Wang · Chaolei Tan · Niu Lian · Long Chen · Yaowei Wang · Min zhang · Shu-Tao Xia · Bin Chen

[ Exhibit Hall I ]

Abstract
Partially Relevant Video Retrieval (PRVR) addresses the critical challenge of matching untrimmed videos with text queries describing only partial content. Existing methods suffer from geometric distortion in Euclidean space that sometimes misrepresents the intrinsic hierarchical structure of videos and overlooks certain hierarchical semantics, ultimately leading to suboptimal temporal modeling. To address this issue, we propose the first hyperbolic modeling framework for PRVR, namely HLFormer, which leverages hyperbolic space learning to compensate for the suboptimal hierarchical modeling capabilities of Euclidean space. Specifically, HLFormer integrates the Lorentz Attention Block and Euclidean Attention Block to encode video embeddings in hybrid spaces, using the Mean-Guided Adaptive Interaction Module to dynamically fuse features. Additionally, we introduce a Partial Order Preservation Loss to enforce ``$\text{text} \prec \text{video}$'' hierarchy through Lorentzian cone constraints. This approach further enhances cross-modal matching by reinforcing partial relevance between video content and text queries. Extensive experiments show that HLFormer outperforms state-of-the-art methods. Code will be released at https://anonymous.4open.science/r/HLFormer-F8E6.
Poster
Lizhen Xu · Xiuxiu Bai · Xiaojun Jia · Jianwu Fang · Shanmin Pang

[ Exhibit Hall I ]

Abstract
Query-based methods with dense features have demonstrated remarkable success in 3D object detection tasks. However, the computational demands of these models, particularly with large image sizes and multiple transformer layers, pose significant challenges for efficient execution on edge devices. Existing pruning and distillation methods either require retraining or are designed for ViT models, which makes them hard to migrate to 3D detectors. To address this issue, we propose a zero-shot runtime pruning method for transformer decoders in 3D object detection models. The method, termed tgGBC (trim keys gradually Guided By Classification scores), systematically trims keys in transformer modules based on their importance. We expand the classification scores and multiply them with the attention map to obtain an importance score for each key, and then prune certain keys after each transformer layer according to their importance scores. Our method achieves a 1.99x speedup in the transformer decoder of the latest ToC3D model, with only a minimal performance loss of less than 1%. Interestingly, for certain models, our method even enhances their performance. Moreover, we deploy 3D detectors with tgGBC on an edge device, further validating the effectiveness of our method. The code will be made publicly available on GitHub.
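A minimal sketch of the pruning rule described above: each key's importance is the attention mass it receives, weighted by the classification score of the attending query, and the lowest-scoring keys are dropped before the next decoder layer. The tensor shapes and the keep ratio are assumptions for illustration.

```python
# Hedged sketch (PyTorch) of classification-score-guided key pruning.
import torch

def prune_keys(keys, attn, cls_scores, keep_ratio=0.5):
    """
    keys:       (B, K, C)    key/value tokens fed to the next decoder layer
    attn:       (B, H, Q, K) attention map of the current layer
    cls_scores: (B, Q)       max classification score of each query
    """
    # Weight each query's attention row by its classification confidence,
    # then accumulate over heads and queries to score every key.
    importance = (attn * cls_scores[:, None, :, None]).sum(dim=(1, 2))   # (B, K)
    k_keep = max(1, int(keys.shape[1] * keep_ratio))
    top_idx = importance.topk(k_keep, dim=1).indices                     # (B, k_keep)
    batch_idx = torch.arange(keys.shape[0], device=keys.device)[:, None]
    return keys[batch_idx, top_idx]                                      # (B, k_keep, C)
```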
Poster
Yangfu Li · Hongjian Zhan · Qi Liu · Li Sun · Yu-Jie Xiong · Yue Lu

[ Exhibit Hall I ]

Abstract
Most existing methods regard open-set Chinese text recognition (CTR) as a single-task problem, primarily focusing on prototype learning of linguistic components or glyphs to identify unseen characters. In contrast, humans identify characters by integrating multiple perspectives, including linguistic and visual cues. Inspired by this, we propose a multi-task framework termed MSA$^2$, which considers multi-view character representations for open-set CTR. Within MSA$^2$, we introduce two novel strategies for character representation: structure-aware component encoding (SACE) and style-adaptive glyph embedding (SAGE). SACE utilizes a binary tree with dynamic representation space to emphasize the primary linguistic components, thereby generating structure-aware and discriminative linguistic representations for each character. Meanwhile, SAGE employs a glyph-centric contrastive learning to aggregate features from diverse forms, yielding robust glyph representations for the CTR model to adapt to the style variations among various fonts. Extensive experiments demonstrate that our proposed MSA$^2$ outperforms state-of-the-art CTR methods, achieving an average improvement of 1.3% and 6.0% in accuracy under closed-set and open-set settings on the BCTR dataset, respectively. The code will be available soon.
Poster
Rui Hu · Yuxuan Zhang · Lianghui Zhu · Tianheng Cheng · Lei Liu · Heng Liu · Longjin Ran · Xiaoxin Chen · Wenyu Liu · Xinggang Wang

[ Exhibit Hall I ]

Abstract
Pixel grounding, encompassing tasks such as Referring Expression Segmentation (RES), has garnered considerable attention due to its immense potential for bridging the gap between vision and language modalities. However, advancements in this domain are currently constrained by limitations inherent in existing datasets, including limited object categories, insufficient textual diversity, and a scarcity of high-quality annotations. To mitigate these limitations, we introduce **GroundingSuite**, which comprises: (1) an automated data annotation framework leveraging multiple Vision-Language Model (VLM) agents; (2) a large-scale training dataset encompassing 9.56 million diverse referring expressions and their corresponding segmentations; and (3) a meticulously curated evaluation benchmark consisting of 3,800 images. The GroundingSuite training dataset facilitates substantial performance improvements, enabling models trained on it to achieve state-of-the-art results, specifically a cIoU of 68.9 on gRefCOCO and a gIoU of 55.3 on RefCOCOm. Moreover, the GroundingSuite annotation framework demonstrates superior efficiency compared to the current leading data annotation method, i.e., it is $4.5 \times$ faster than GLaMM.
Poster
Yiping Ji · Hemanth Saratchandran · Peyman Moghadam · Simon Lucey

[ Exhibit Hall I ]

Abstract
We highlight a curious empirical result within modern Vision Transformers (ViTs). Specifically, self-attention catastrophically fails to train unless it is used in conjunction with a skip connection. This is in contrast to other elements of a ViT that continue to exhibit good performance (albeit suboptimal) when skip connections are removed. Further, we show that this critical dependence on skip connections is a relatively new phenomenon, with previous deep architectures (e.g., CNNs) exhibiting good performance in their absence. In this paper, we theoretically characterize the self-attention mechanism as fundamentally ill-conditioned and therefore uniquely dependent on skip connections for regularization. Additionally, we propose $T$oken $G$raying ($TG$) -- a simple yet effective complement (to skip connections) that further improves the conditioning of input tokens. We validate our approach in both supervised and self-supervised training methods.
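The ill-conditioning claim can be probed empirically with a few lines; the toy dimensions below are assumptions, and this is only an illustration of the phenomenon, not the paper's theoretical analysis.

```python
# Quick empirical check: a softmax attention matrix tends to have a much
# larger condition number than a random linear map of the same size.
import torch

torch.manual_seed(0)
n, d = 197, 64                                    # ViT-like token count, head dim
q, k = torch.randn(n, d), torch.randn(n, d)
attn = torch.softmax(q @ k.t() / d**0.5, dim=-1)  # (n, n) attention matrix
rand = torch.randn(n, n) / n**0.5                 # random map for comparison

print("cond(attention):", torch.linalg.cond(attn).item())
print("cond(random):   ", torch.linalg.cond(rand).item())
```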
Poster
Qi Chen · Lingxiao Yang · Yun Chen · Nailong Zhao · Jianhuang Lai · Jie Shao · Xiaohua Xie

[ Exhibit Hall I ]

Abstract
Fine-tuning pre-trained vision-language models has proven effective in enhancing open-vocabulary semantic segmentation (OVSS). However, given the significant resource consumption required for training on large datasets, there is growing interest in exploring training-free methods for OVSS. Current training-free methods primarily focus on modifying model architectures and generating prototypes to improve segmentation performance, often overlooking issues of category redundancy and ambiguity. In this paper, we identify two key phenomena in OVSS: class redundancy and vision-language ambiguity in class activation maps and the affinity-refined activation maps. Inspired by our observations, we propose a training-free class purification framework -- FreeCP to purify semantic categories and address errors caused by these two issues. Specifically, we first generate class activation maps along with their refined activation maps using CLIP. These activations and their refined counterparts are then organized by their associated categories to adaptively construct category relations, i.e., per-category relations and cross-category relations. We then effectively perform redundancy purification to eliminate classes that are not present in the current image. Furthermore, we propose ambiguity purification to distinguish the correct class from semantically similar ones. The purified classes are subsequently used to produce the final segmentation prediction. Extensive experiments across eight benchmarks demonstrate that FreeCP, …
Poster
Zhewei Dai · Shilei Zeng · Haotian Liu · Xurui Li · Feng Xue · Yu Zhou

[ Exhibit Hall I ]

Abstract
We introduce SeaS, a unified industrial generative model for automatically creating diverse anomalies, authentic normal products, and precise anomaly masks. While extensive research exists, most efforts either focus on specific tasks, i.e., anomalies or normal products only, or require separate models for each anomaly type. Consequently, prior methods either offer limited generative capability or depend on a vast array of anomaly-specific models. We demonstrate that U-Net's differentiated learning ability captures the distinct visual traits of slightly-varied normal products and diverse anomalies, enabling us to construct a unified model for all tasks. Specifically, we first introduce an Unbalanced Abnormal (UA) Text Prompt, comprising one normal token and multiple anomaly tokens. More importantly, our Decoupled Anomaly Alignment (DA) loss decouples anomaly attributes and binds them to distinct anomaly tokens of UA, enabling SeaS to create unseen anomalies by recombining these attributes. Furthermore, our Normal-image Alignment (NA) loss aligns the normal token to normal patterns, making generated normal products globally consistent and locally varied. Finally, SeaS produces accurate anomaly masks by fusing discriminative U-Net features with high-resolution VAE features. SeaS sets a new benchmark for industrial generation, significantly enhancing downstream applications, with average improvements of +8.66% pixel-level AP for synthesis-based AD approaches, +1.10% …
Poster
Lin Zhang · Xianfang Zeng · Kangcong Li · Gang YU · Tao Chen

[ Exhibit Hall I ]

Abstract
We propose SC-Captioner, a reinforcement learning framework that enables the self-correcting capability of image caption models. Our crucial technique lies in the design of the reward function to incentivize accurate caption corrections. Specifically, the predicted and reference captions are decomposed into object, attribute, and relation sets using scene-graph parsing algorithms. We calculate the set difference between the sets of original and self-corrected captions to identify added and removed elements. These elements are matched against the reference sets to calculate recall bonuses for accurate corrections and hallucination punishments for wrong additions and removals, thereby forming the final reward. For image caption quality assessment, we propose a set of metrics refined from CAPTURE that alleviate its incomplete precision evaluation and inefficient relation matching problems. Experiments show that applying SC-Captioner to large visual-language models can generate better image captions across various scenarios, significantly outperforming the direct preference optimization training strategy.
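The reward construction described above boils down to set arithmetic over parsed caption elements; a hedged sketch follows, where the bonus/penalty weights and the (object, attribute) element format are assumptions and the scene-graph parser is taken as given.

```python
# Hedged sketch of the self-correction reward via set differences.
def correction_reward(orig_set, corr_set, ref_set, w_bonus=1.0, w_penalty=1.0):
    added = corr_set - orig_set
    removed = orig_set - corr_set

    good_additions = added & ref_set      # recovered missing facts
    bad_additions = added - ref_set       # newly hallucinated facts
    good_removals = removed - ref_set     # hallucinations removed
    bad_removals = removed & ref_set      # correct facts wrongly dropped

    bonus = len(good_additions) + len(good_removals)
    penalty = len(bad_additions) + len(bad_removals)
    return w_bonus * bonus - w_penalty * penalty

# Example with (object, attribute) elements from a hypothetical parser:
orig = {("dog", "black"), ("ball", "red")}
corr = {("dog", "brown"), ("ball", "red"), ("grass", "green")}
ref  = {("dog", "brown"), ("ball", "red"), ("grass", "green")}
print(correction_reward(orig, corr, ref))   # 3.0: all edits were accurate
```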
Poster
Chenhao Zheng · Jieyu Zhang · Mohammadreza Salehi · Ziqi Gao · Vishnu Iyengar · Norimasa Kobori · Quan Kong · Ranjay Krishna

[ Exhibit Hall I ]

Abstract
Effective video tokenization is critical for scaling transformer models for long videos. Current approaches tokenize videos using space-time patches, leading to excessive tokens and computational inefficiencies. The best token reduction strategies degrade performance and barely reduce the number of tokens when the camera moves. We introduce grounded video tokenization, a paradigm that organizes tokens based on panoptic sub-object trajectories rather than fixed patches. Our method aligns with fundamental perceptual principles, ensuring that tokenization reflects scene complexity rather than video duration. We propose TrajViT, a video encoder that extracts object trajectories and converts them into semantically meaningful tokens, significantly reducing redundancy while maintaining temporal coherence. Trained with contrastive learning, TrajViT significantly outperforms space-time ViT (ViT3D) across multiple video understanding benchmarks, e.g., TrajViT outperforms ViT3D by a large margin of 6% average top-5 recall on the video-text retrieval task with 10x token reduction. We also show that TrajViT is a stronger model than ViT3D as the video encoder for modern VideoLLMs, obtaining an average of 5.2% performance improvement across 6 VideoQA benchmarks while having 4x faster training time and 20x less inference FLOPs. TrajViT is the first efficient encoder to consistently outperform ViT3D across diverse video analysis tasks, making it a robust …
Poster
Handong Li · Yiyuan Zhang · Longteng Guo · Xiangyu Yue · Jing Liu

[ Exhibit Hall I ]

Abstract
Most current video-language models rely on an encoder-decoder architecture, where a vision encoder extracts visual features from video and passes them to a language model. However, this approach suffers from inefficiencies, resolution biases, and challenges in capturing fine-grained multimodal correlations, particularly when dealing with long-duration videos. To address these limitations, we propose NOVA, an encoder-free video-language model that directly integrates raw video input into a language model, eliminating the need for a separate vision encoder. NOVA leverages input-adaptive video tokenization, efficient distillation from a video-pretrained teacher, multimodal alignment using synthetic video recaption data, and hybrid-resolution inference to overcome the limitations of traditional models. Our experiments demonstrate that NOVA, with only about 10M publicly available training data, achieves competitive performance with strong encoder-based models across various benchmarks, and offers clear advantages in efficiency and scalability. This work provides a promising solution for real-time, large-scale video applications and paves the way for more flexible and resource-efficient video-language models.
Poster
Jiahui Wang · Zuyan Liu · Yongming Rao · Jiwen Lu

[ Exhibit Hall I ]

Abstract
Multimodal Large Language Models (MLLMs) are commonly derived by extending pre-trained Large Language Models (LLMs) with visual capabilities. In this work, we investigate how MLLMs process visual inputs by analyzing their attention mechanisms. We reveal a surprising sparsity phenomenon: only a small subset (approximately less than 5\%) of attention heads in LLMs actively contribute to visual understanding, termed visual heads. To identify these heads efficiently, we design a training-free framework that quantifies head-level visual relevance through targeted response analysis. Building on this discovery, we introduce SparseMM, a KV-Cache optimization strategy that allocates asymmetric computation budgets to heads in LLMs based on their visual scores, leveraging the sparsity of visual heads to accelerate the inference of MLLMs. Compared with prior KV-Cache acceleration methods that ignore the particularity of visual processing, SparseMM prioritizes preserving and retaining visual semantics during decoding. Extensive evaluations across mainstream multimodal benchmarks demonstrate that SparseMM achieves superior accuracy-efficiency trade-offs. Notably, SparseMM delivers 1.38× real-time acceleration and 52% memory reduction during generation while maintaining performance parity on the efficiency test.
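One way to read the asymmetric budgeting idea is as a simple proportional allocation over heads; the sketch below is an assumption-laden illustration (the scoring procedure, floor size, and proportional split are not taken from the paper).

```python
# Hedged sketch: split a total KV-cache budget across heads in proportion to
# their visual scores, with a small floor so no head is starved.
import numpy as np

def allocate_kv_budget(visual_scores, total_budget, min_per_head=8):
    scores = np.asarray(visual_scores, dtype=np.float64)
    n = len(scores)
    remaining = max(total_budget - min_per_head * n, 0)
    weights = scores / scores.sum() if scores.sum() > 0 else np.full(n, 1.0 / n)
    return min_per_head + np.floor(remaining * weights).astype(int)

print(allocate_kv_budget([0.9, 0.05, 0.03, 0.02], total_budget=1024))
# The highly "visual" head keeps most of the cache entries.
```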
Poster
Jessica Bader · Leander Girrbach · Stephan Alaniz · Zeynep Akata

[ Exhibit Hall I ]

Abstract
Concept Bottleneck Models (CBMs) and other interpretable models show great promise for making AI applications more transparent, which is essential in fields like medicine. Despite their success, we demonstrate that CBMs struggle to reliably identify the correct concept values under distribution shifts. To assess the robustness of CBMs to concept variations, we introduce SUB --- a fine-grained image and concept dataset containing 38,400 synthetic images based on the CUB bird dataset. To create SUB, we select a subset of 33 bird classes and 32 concepts from CUB to generate counterfactual bird images where a specific concept, such as wing color or belly pattern, is substituted. To achieve precise control for generated images, we introduce a novel Tied Diffusion Guidance (TDG) method, where noise sharing for two parallel denoising processes ensures that both the correct bird class and the correct bird concept are generated. This novel dataset enables rigorous evaluation of CBMs and similar interpretable models, contributing to the development of more robust methods. Furthermore, we show that the common practice of training CBMs using class-level concept annotations does not lead to generalized recognition of the concepts. Our code and data will be released upon acceptance.
Poster
Lin Sun · Jiale Cao · Jin Xie · Xiaoheng Jiang · Yanwei Pang

[ Exhibit Hall I ]

Abstract
Contrastive Language-Image Pre-training (CLIP) exhibits strong zero-shot classification ability on various image-level tasks, motivating research to adapt CLIP for pixel-level open-vocabulary semantic segmentation without additional training. The key is to improve the spatial representation of image-level CLIP, for example by replacing the self-attention map at the last layer with a self-self attention map or an attention map derived from a vision foundation model. In this paper, we present a novel hierarchical framework, named CLIPer, that hierarchically improves the spatial representation of CLIP. The proposed CLIPer includes an early-layer fusion module and a fine-grained compensation module. We observe that the embeddings and attention maps at early layers can preserve spatial structural information. Inspired by this, we design the early-layer fusion module to generate a segmentation map with better spatial coherence. Afterwards, we employ a fine-grained compensation module to compensate for the local details using the self-attention maps of a diffusion model. We conduct experiments on eight segmentation datasets. Our CLIPer achieves state-of-the-art performance on these datasets. With ViT-L and sliding-window inference, CLIPer achieves mIoU of 72.2% and 44.7% on VOC and Object, outperforming ProxyCLIP by 11.6% and 5.5%. We will release the source code and models.
Poster
Tongkun Guan · Zining Wang · Pei Fu · Zhentao Guo · Wei Shen · Kai zhou · Tiezhu Yue · Chen Duan · Hao Sun · Qianyi Jiang · Junfeng Luo · Xiaokang Yang

[ Exhibit Hall I ]

Abstract
In recent years, general visual foundation models (VFMs) have witnessed increasing adoption, particularly as image encoders for popular multi-modal large language models (MLLMs). However, without semantically fine-grained supervision, these models still encounter fundamental prediction errors in the context of downstream text-image-related tasks, i.e., perception, understanding and reasoning with images containing small and dense texts. To bridge this gap, we develop TokenFD, the first token-level visual foundation model specifically tailored for text-image-related tasks, designed to support a variety of traditional downstream applications. To facilitate the pretraining of TokenFD, we also devise a high-quality data production pipeline that constructs the first token-level image-text dataset, TokenIT, comprising 20 million images and 1.8 billion token-mask pairs. Furthermore, leveraging this foundation with exceptional image-as-text capability, we seamlessly replace previous VFMs with TokenFD to construct a token-level visual-language MLLM, TokenVL, for VQA-based document understanding tasks. Finally, extensive experiments demonstrate the effectiveness of TokenFD and TokenVL. Code, demo, datasets, and weights will be available soon.
Poster
Anurag Bagchi · Zhipeng Bao · Yu-Xiong Wang · Pavel Tokmakov · Martial Hebert

[ Exhibit Hall I ]

Abstract
We present REM, a framework for segmenting a wide range of concepts in video that can be described through natural language. Our method unlocks the universal visual-language mapping learned by video diffusion models on Internet-scale data by fine-tuning them on small-scale Referring Object Segmentation datasets. Our key insight is preserving the entirety of the generative model's architecture by shifting its objective from predicting noise to predicting mask latents. The resulting model can accurately segment and track rare and unseen objects, despite only being trained on object masks from a limited set of categories. Additionally, it can effortlessly generalize to non-object dynamic concepts, such as smoke or raindrops, as demonstrated in our newly introduced benchmark for Referring Video Process Segmentation (Ref-VPS). Our experiments show that REM performs on par with state-of-the-art approaches on in-domain datasets, like Ref-DAVIS, while outperforming them by up to 11 points in terms of region similarity out-of-domain, leveraging the power of Internet-scale pre-training.
Poster
Byung Hyun Lee · Wongi Jeong · Woojae Han · KYOUNGBUN LEE · Se Young Chun

[ Exhibit Hall I ]

Abstract
Multiple instance learning (MIL) significantly reduced annotation costs via bag-level weak labels for large-scale images, such as histopathological whole slide images (WSIs). However, its adaptability to continual tasks with minimal forgetting has been rarely explored, especially on instance classification for localization. Weakly incremental learning for semantic segmentation has been studied for continual localization, but it focused on natural images, leveraging global relationships among hundreds of small patches (e.g., $16 \times 16$) using pre-trained models. This approach seems infeasible for MIL localization due to enormous amounts ($\sim 10^5$) of large patches (e.g., $256 \times 256$) and no available global relationships such as cancer cells. To address these challenges, we propose Continual Multiple Instance Learning with Enhanced Localization (CoMEL), an MIL framework designed to improve both localization and adaptability with minimal forgetting. CoMEL consists of (1) Grouped Double Attention Transformer (GDAT) for efficient instance encoding, (2) Bag Prototypes-based Pseudo-Labeling (BPPL) for reliable instance pseudo-labeling, and (3) Orthogonal Weighted Low-Rank Adaptation (OWLoRA) to mitigate forgetting in both bag and instance classification. Extensive experiments on three public WSI datasets, CAMELYON-16, PAIP, and TCGA, demonstrate superior performance of CoMEL, outperforming the prior arts by up to $11.00\%$ in bag-level accuracy and up to $23.4\%$ in …
Poster
Yucheng Suo · Fan Ma · Linchao Zhu · Tianyi Wang · Fengyun Rao · Yi Yang

[ Exhibit Hall I ]

Abstract
Multi-modal Large language models (MLLMs) show remarkable ability in video understanding. Nevertheless, understanding long videos remains challenging as the models can only process a finite number of frames in a single inference, potentially omitting crucial visual information. To address the challenge, we propose generating multiple predictions through visual context sampling, followed by a scoring mechanism to select the final prediction. Specifically, we devise a bin-wise sampling strategy that enables MLLMs to generate diverse answers based on various combinations of keyframes, thereby enriching the visual context. To determine the final prediction from the sampled answers, we employ a self-reward by linearly combining three scores: (1) a frequency score indicating the prevalence of each option, (2) a marginal confidence score reflecting the inter-intra sample certainty of MLLM predictions, and (3) a reasoning score for different question types, including clue-guided answering for global questions and temporal self-refocusing for local questions. The frequency score ensures robustness through majority correctness, the confidence-aligned score reflects prediction certainty, and the typed-reasoning score addresses cases with sparse key visual information using tailored strategies. Experiments show that this approach covers the correct answer for a high percentage of long video questions, and results on seven datasets show that our method improves …
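The linear self-reward can be sketched in a few lines; the weights, the scorer interfaces, and the toy inputs below are assumptions rather than the paper's exact implementation.

```python
# Hedged sketch: pick the final answer by linearly combining frequency,
# confidence, and question-type reasoning scores over sampled predictions.
from collections import Counter

def select_answer(samples, confidences, reasoning_scores, w=(0.4, 0.3, 0.3)):
    freq = Counter(samples)
    total = sum(freq.values())

    def reward(opt):
        return (w[0] * freq[opt] / total
                + w[1] * confidences.get(opt, 0.0)
                + w[2] * reasoning_scores.get(opt, 0.0))

    return max(freq, key=reward)

print(select_answer(["B", "B", "C", "B", "A"],
                    {"A": 0.5, "B": 0.7, "C": 0.9},
                    {"A": 0.2, "B": 0.8, "C": 0.4}))   # -> "B"
```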
Poster
Weijia Zhang · Yuehao Liu · Wu Ran · Chao Ma

[ Exhibit Hall I ]

Abstract
We describe a simple method for cross-architecture knowledge distillation, where the knowledge transfer is cast into a redundant information suppression formulation. Existing methods introduce sophisticated modules, architecture-tailored designs, and excessive parameters, which impair their efficiency and applicability. We propose to extract the architecture-agnostic knowledge in heterogeneous representations by reducing the redundant architecture-exclusive information. To this end, we present a simple redundancy suppression distillation (RSD) loss, which comprises cross-architecture invariance maximization and feature decorrelation objectives. To prevent the student from entirely losing its architecture-specific capabilities, we further design a lightweight module that decouples the RSD objective from the student's internal representations. Our method is devoid of the architecture-specific designs and complex operations in the pioneering method of OFA. It substantially outperforms OFA on CIFAR-100 and ImageNet-1k benchmarks with only a fraction of their parameter overhead, which highlights its potential as a simple and strong baseline to the cross-architecture distillation community. Our code and models will be made publicly available.
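The invariance-plus-decorrelation objective admits a compact sketch; the exact RSD formulation is not reproduced here, and the cross-correlation recipe below (in the spirit of standard feature-decorrelation losses) plus the weighting factor are assumptions.

```python
# Hedged sketch (PyTorch) of a redundancy-suppression style distillation loss:
# align student/teacher feature dimensions while decorrelating the rest.
import torch

def redundancy_suppression_loss(f_student, f_teacher, lam=5e-3):
    """f_student, f_teacher: (B, D) projected features from the two architectures."""
    zs = (f_student - f_student.mean(0)) / (f_student.std(0) + 1e-6)
    zt = (f_teacher - f_teacher.mean(0)) / (f_teacher.std(0) + 1e-6)
    c = (zs.T @ zt) / zs.shape[0]                        # (D, D) cross-correlation
    invariance = ((1.0 - torch.diagonal(c)) ** 2).sum()  # matching dims -> correlated
    off_diag = c - torch.diag_embed(torch.diagonal(c))
    decorrelation = (off_diag ** 2).sum()                # suppress redundant correlations
    return invariance + lam * decorrelation
```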
Poster
Tao Wang · Changxu Cheng · Lingfeng Wang · Senda Chen · Wuyue Zhao

[ Exhibit Hall I ]

Abstract
The remarkable performance of large multimodal models (LMMs) has attracted significant interest from the image segmentation community. To align with the next-token-prediction paradigm, current LMM-driven segmentation methods either use object boundary points to represent masks or introduce special segmentation tokens, whose hidden states are decoded by a segmentation model requiring the original image as input. However, these approaches often suffer from inadequate mask representation and complex architectures, limiting the potential of LMMs. In this work, we propose the \textbf{Hi}erarchical \textbf{M}ask \textbf{Tok}enizer (HiMTok), which represents segmentation masks with up to 32 tokens and eliminates the need for the original image during mask de-tokenization. HiMTok allows for compact and coarse-to-fine mask representations, aligning well with the LLM next-token-prediction paradigm and facilitating the direct acquisition of segmentation capabilities. We develop a 3-stage training recipe for progressive learning of segmentation and visual capabilities, featuring a hierarchical mask loss for effective coarse-to-fine learning. Additionally, we enable bidirectional information flow, allowing conversion between bounding boxes and mask tokens to fully leverage multi-task training potential. Extensive experiments demonstrate that our method achieves state-of-the-art performance across various segmentation tasks, while also enhancing visual grounding and maintaining overall visual understanding. The codes will be made publicly available.
Poster
Jihun Kim · Hoyong Kwon · Hyeokjun Kweon · Wooseong Jeong · Kuk-Jin Yoon

[ Exhibit Hall I ]

Abstract
Interactive segmentation (IS) allows users to iteratively refine object boundaries with minimal cues, such as positive and negative clicks. While the Segment Anything Model (SAM) has garnered attention in the IS community for its promptable segmentation capabilities, it often struggles in specialized domains or when handling complex scenarios (e.g., camouflaged or multi-part objects). To overcome these challenges, we propose DC-TTA, a novel test-time adaptation (TTA) framework that adapts SAM on a per-sample basis by leveraging user interactions as supervision. Instead of forcing a single model to incorporate all user clicks at once, DC-TTA partitions the clicks into more coherent subsets, each processed independently via TTA with a separated model. This Divide-and-Conquer strategy reduces conflicts among diverse cues and enables more localized updates. Finally, we merge the adapted models to form a unified predictor that integrates the specialized knowledge from each subset. Experimental results across various benchmarks demonstrate that DC-TTA significantly outperforms SAM’s zero-shot results and conventional TTA methods, effectively handling complex tasks such as camouflaged object segmentation with fewer interactions and improved accuracy. The code will be available soon.
Poster
YITING LI · Fayao Liu · Jingyi Liao · Sichao Tian · Chuan-Sheng Foo · Xulei Yang

[ Exhibit Hall I ]

Abstract
Multimodal anomaly detection (MAD) enhances industrial inspection by leveraging complementary 2D and 3D data. However, existing methods struggle in few-shot scenarios due to limited data and modality gaps. While current approaches either fuse multimodal features or align cross-modal representations, they often suffer from high false positive rates and fail to detect subtle defects, especially when training data is scarce. To address these challenges, we propose the first few-shot MAD method FIND, a novel dual-student framework that synergistically integrates intra-modal reverse distillation and cross-modal distillation. FIND employs modality-specific teachers and two collaborative students: an intra-modal student for fine-grained anomaly localization via reverse distillation, and a cross-modal student that captures inter-modal correspondences to detect inconsistencies. Extensive experiments on MVTec-3D-AD and Eyecandies show that FIND outperforms state-of-the-art methods in both full-shot and few-shot settings. Ablation studies validate the complementary roles of intra- and cross-modal distillation. Our work significantly advances MAD robustness in data-scarce industrial applications.
Poster
Meiqi Wang · Han Qiu

[ Exhibit Hall I ]

Abstract
In-orbit object detection is essential for Earth observation missions on satellites equipped with GPUs. A promising approach is to use pre-trained vision-language modeling (VLM) to enhance its open-vocabulary capability. However, adopting it on satellites poses two challenges: (1) satellite imagery differs substantially from natural images, and (2) satellites' embedded GPUs are insufficient for complex models' inference. We reveal their lack of a crucial prior: in-orbit detection involves identifying a set of known objects within a cluttered yet monotonous background. Motivated by this observation, we propose VISO, a Vision-language Instructed Satellite Object detection model that focuses on object-specific features while suppressing irrelevant regions through language-guided mask learning. After pre-training on a large-scale satellite dataset with 3.4M region-text pairs, VISO enhances object-text alignment and object-centric features to improve detection accuracy. Also, VISO suppresses irrelevant regions, enabling highly sparse inference to accelerate speed on satellites. Extensive experiments show that VISO without sparsity outperforms state-of-the-art (SOTA) VLMs in zero-shot detection, increasing AP by 34.1\% and reducing FLOPs by 27$\times$, and surpasses specialist models in supervised object detection and object referring, improving AP by 2.3\%. When sparsifying VISO to a comparable AP, FLOPs can be greatly reduced by up to 8.5$\times$. Real-world tests reveal that VISO achieves a 2.8–4.8$\times$ FPS speed-up on satellites’ embedded GPUs.
Poster
Ran Ran · Jiwei Wei · Shiyuan He · Zeyu Ma · Chaoning Zhang · Ning Xie · Yang Yang

[ Exhibit Hall I ]

Abstract
Video Temporal Grounding (VTG) confronts the challenge of bridging the semantic gap between concise textual queries and the rich complexity of video content, compounded by the difficulty of capturing discriminative features without external priors. To address these challenges, we propose Knowledge Diffusion Alignment (KDA), a framework that leverages the generative prowess of diffusion models. KDA introduces a multi-layer video knowledge extraction module alongside a background residual diffusion model that progressively prunes irrelevant background information from global video features, thereby distilling query-relevant moment knowledge enriched with visual context. Through a three-stage training approach that harnesses external priors, KDA guarantees that the extracted moment knowledge incorporates the discriminative features necessary for accurate localization. A knowledge prompt reasoning module facilitates the comprehensive interaction and utilization of moment knowledge and multimodal features. Moreover, we introduce a spans-enhanced decoder that selectively integrates spans from multi-modal features, capitalizing on intrinsic alignment cues. Comprehensive experiments on three datasets demonstrate performance that surpasses state-of-the-art methods, attesting to the effectiveness of the proposed framework.
Poster
YUFEI SHI · Weilong Yan · Gang Xu · Yumeng Li · Yucheng Chen · ZhenXi Li · Fei Yu · Ming Li · Si Yong Yeo

[ Exhibit Hall I ]

Abstract
Video large language models (ViLLMs) excel in general video understanding, e.g., recognizing activities like talking and eating, but struggle with identity-aware comprehension, such as ''Wilson is receiving chemotherapy" or ''Tom is discussing with Sarah", limiting their applicability in smart healthcare and smart home environments. To address this limitation, we propose a one-shot learning framework PVChat, the first personalized ViLLM that enables subject-aware question answering (QA) from a single video for each subject. Our approach optimizes a Mixture-of-Heads (MoH) enhanced ViLLM on a synthetically augmented video-QA dataset, leveraging a progressive image-to-video learning strategy. Specifically, we introduce an automated augmentation pipeline that synthesizes identity-preserving positive samples and retrieves hard negatives from existing video corpora, generating a diverse training dataset with four QA types: existence, appearance, action, and location inquiries. To enhance subject-specific learning, we propose a ReLU Routing MoH attention mechanism, alongside two novel objectives: (1) Smooth Proximity Regularization for progressive learning through exponential distance scaling and (2) Head Activation Enhancement for balanced attention routing. Finally, we adopt a two-stage training strategy, transitioning from image pre-training to video fine-tuning, enabling a gradual learning process from static attributes to dynamic representations. We evaluate PVChat on diverse datasets covering medical scenarios, TV series, anime, …
Poster
Wentian Cai · Weizhao Weng · Zihao Huang · Yandan Chen · Siquan Huang · Ping Gao · Victor Leung · Ying Gao

[ Exhibit Hall I ]

Abstract
The massive requirement for pixel-wise annotations in histopathological image segmentation poses a significant challenge, leading to increasing interest in Unsupervised Semantic Segmentation (USS) as a viable alternative. Pre-trained model-based methods have been widely used in USS, achieving promising segmentation performance. However, these methods are less capable on medical image USS tasks due to their limited ability to encode task-specific contextual information. In this paper, we propose a context-based Overlapping Patches Consistency Constraint (OPCC), which employs a consistency constraint between the local overlapping region's similarity and the global context similarity, achieving consistent class representation in similar environments. Additionally, we introduce an Inter-Layer Self-Attention Fusion (ILSAF) module that employs a multi-head self-attention mechanism along with Inter-Layer Importance-Weighting to generate context-aware and semantically discriminative pixel representations, improving pixel clustering accuracy. Extensive experiments on two public histopathological image segmentation datasets demonstrate that our approach significantly outperforms state-of-the-art methods by a large margin, with mIoU surpassing previous leading work by 5.74 and 8.38 percentage points on the two datasets, respectively.
Poster
Yujian Lee · Peng Gao · Yongqi Xu · Wentao Fan

[ Exhibit Hall I ]

Abstract
Audio-visual semantic segmentation (AVSS) represents an extension of the audio-visual segmentation (AVS) task, necessitating a semantic understanding of audio-visual scenes beyond merely identifying sound-emitting objects at the visual pixel level. A previous methodology decomposes the AVSS task into two discrete subtasks, initially providing a prompted segmentation mask to facilitate subsequent semantic analysis; our approach innovates on this foundational strategy. We introduce a novel collaborative framework, Stepping Stone Plus (SSP), which integrates optical flow and textual prompts to assist the segmentation process. In scenarios where sound sources frequently coexist with moving objects, our pre-mask technique leverages optical flow to capture motion dynamics, providing essential temporal context for precise segmentation. To address the challenge posed by stationary sound-emitting objects, such as alarm clocks, SSP incorporates two specific textual prompts: one identifies the category of the sound-emitting object, and the other provides a broader description of the scene. Additionally, we implement a visual-textual alignment module (VTA) to facilitate cross-modal integration, delivering more coherent and contextually relevant semantic interpretations. Our training regimen involves a post-mask technique aimed at compelling the model to learn the diagram of the optical flow. Experimental results demonstrate that SSP outperforms existing AVS methods, delivering efficient …
Poster
Harsh Agrawal · Eldon Schoop · Xinlei Pan · Ari Seff · Anuj Mahajan · Di Feng · Ruijia Cheng · Andres Romero Mier y Teran · Esteban Gomez · Abhishek Sundararajan · Forrest Huang · Amanda Swearngin · Mohana Moorthy · Jeffrey Nichols · Alexander Toshev

[ Exhibit Hall I ]

Abstract
We build a comprehensive online evaluation benchmark for language-conditioned multi-step task execution on mobile interfaces. Our benchmark strives to evaluate the multi-step planning, reasoning, and visual grounding capabilities of agents, using mobile user interfaces as a concrete testbed. To build diverse, challenging tasks that reflect real-world use cases, we propose an exhaustive taxonomy that allows us to measure progress along multiple decision-making abilities including multi-step planning, visual perception, action grounding, and using memory or external knowledge. We also highlight important factors such as statefulness, safety, and evaluation complexity that are key to design tasks that can be reliably evaluated. Using this taxonomy, we design 116 tasks across 36 unique apps. Through an automatic framework, we stage and evaluate several natural baselines with different input representations and planning strategies. We show that the best-performing agent achieves 40% success on our benchmark. We further measure agents' abilities to plan, ground, and utilize world knowledge highlighting areas of improvement.
Poster
Li Yi · Jie Hu · Songan Zhang · GUANNAN JIANG

[ Exhibit Hall I ]

Abstract
Foundation Segmentation Models (FSMs) show suboptimal performance on unconventional image domains like camouflage objects. Fine-tuning is often impractical due to data preparation challenges, time limits, and optimization issues. To boost segmentation performance while keeping zero-shot features, one approach is pre-augmenting images for the segmentation model. However, existing image augmentations mainly depend on rule-based methods, restricting augmentation effectiveness. Though learning-based methods can diversify augmentation, rule-based ones are degree-describable (e.g., slight/intense brightening), while learning-based methods usually predict non-degree-describable ground truths (e.g., depth estimation), creating a heterogeneous search space when combined. To this end, we propose an ``Augmenting-to-Adapt'' paradigm, replacing traditional rule-based augmentation with an optimal heterogeneous augmentation policy to enhance segmentation. Our method uses 32 augmentation techniques (22 rule-based, 10 learning-based) to ease parameter misalignment, forming a robust, multi-discrete heterogeneous search space. To apply the optimal policy in real-world scenarios, we distill the augmentation process to speed up preprocessing. Extensive evaluations across diverse datasets and domains show our method significantly improves model adaptation with a domain-specific augmentation strategy. We will release our code to support further research.
Poster
Xiao-Wen Zhang · Delong Zhang · Yi-Xing Peng · Zhi Ouyang · Jingke Meng · Wei-Shi Zheng

[ Exhibit Hall I ]

Abstract
Person re-identification (ReID) aims to match person images under different camera views. Training ReID models necessitates a substantial amount of labeled real-world data, leading to high labeling costs and privacy issues. Although several ReID data synthetic methods are proposed to address these issues, they fail to generate images with real-world camera style or new identities. In this paper, we propose a novel pedestrian generation pipeline, VIPerson, to generate camera-realistic pedestrian images with flexible Virtual Identities for the Person ReID task. VIPerson focuses on three key factors in data synthesis: (I) Virtual identity diversity: Enhancing the latent diffusion model with our proposed dropout text embedding, we flexibly generate random and hard identities. (II) Scalable cross-camera variations: VIPerson introduces scalable variations of scenes and poses within each identity. (III) Camera-realistic style: Adopting an identity-agnostic approach to transfer realistic styles, we avoid privacy exposure of real identities. Extensive experimental results across a broad range of downstream ReID tasks demonstrate the superiority of our generated dataset over existing methods. In addition, VIPerson can be adapted to the privacy-constrained ReID scenario, which widens the application of our pipeline. We will release our code and datasets.
Poster
Guobin Shen · Jindong Li · Tenglong Li · Dongcheng Zhao · Yi Zeng

[ Exhibit Hall I ]

Abstract
Spiking Neural Networks (SNNs) hold promise for energy-efficient, biologically inspired computing. We identify substantial information loss during spike transmission, linked to temporal dependencies in traditional Leaky Integrate-and-Fire (LIF) neurons—a key factor potentially limiting SNN performance. Existing SNN architectures also underutilize modern GPUs, constrained by single-bit spike storage and isolated weight-spike operations that restrict computational efficiency. We introduce SpikePack, a neuron model designed to reduce transmission loss while preserving essential features like membrane potential reset and leaky integration. SpikePack achieves constant $\mathcal{O}(1)$ time and space complexity, enabling efficient parallel processing on GPUs and also supporting serial inference on existing SNN hardware accelerators. Compatible with standard Artificial Neural Network (ANN) architectures, SpikePack facilitates near-lossless ANN-to-SNN conversion across various networks. Experimental results on tasks such as image classification, detection, and segmentation show SpikePack achieves significant gains in accuracy and efficiency for both directly trained and converted SNNs over state-of-the-art models. Tests on FPGA-based platforms further confirm cross-platform flexibility, delivering high performance and enhanced sparsity. By enhancing information flow and rethinking SNN-ANN integration, SpikePack advances efficient SNN deployment across diverse hardware platforms.
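For reference, the leaky integration and membrane reset that SpikePack is said to preserve follow the standard LIF update sketched below; SpikePack's own O(1) parallel formulation is not reproduced here, and the time constant and threshold are placeholder values.

```python
# Reference sketch (PyTorch) of one discrete LIF step: leaky integration,
# thresholded spiking, and hard reset.
import torch

def lif_step(v, x, tau=2.0, v_th=1.0):
    """v: membrane potential, x: input current at this timestep."""
    v = v + (x - v) / tau            # leaky integration toward the input
    spike = (v >= v_th).float()      # emit a spike when the threshold is crossed
    v = v * (1.0 - spike)            # hard reset of spiking neurons
    return spike, v
```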
Poster
Xinwei Long · Kai Tian · Peng Xu · Guoli Jia · Jingxuan Li · Sa Yang · Yihua Shao · Kaiyan Zhang · Che Jiang · Hao Xu · Yang Liu · Jiaheng Ma · Bowen Zhou

[ Exhibit Hall I ]

Abstract
Large language models (LLMs) have taken a great step towards AGI. Meanwhile, an increasing number of domain-specific problems such as math and programming boost these general-purpose models to continuously evolve via learning deeper expertise. Now is thus the time to further extend the diversity of specialized applications for knowledgeable LLMs, though collecting high-quality data with unexpected and informative tasks is challenging. In this paper, we propose to use advertisement (ad) videos as a challenging test-bed to probe the ability of LLMs in perceiving beyond the objective physical content of the common visual domain. Our motivation is to take full advantage of the clue-rich and information-dense ad videos' traits, e.g., marketing logic, persuasive strategies, and audience engagement. Our contribution is three-fold: (1) To our knowledge, this is the first attempt to use ad videos with well-designed tasks to evaluate LLMs. We contribute AdsQA, a challenging ad Video QA benchmark derived from 1,544 ad videos with 10,962 clips, totaling 21.1 hours, providing 5 challenging tasks. (2) We propose ReAd-R, a DeepSeek-R1-styled RL model that reflects on questions, and generates answers via reward-driven optimization. (3) We benchmark 14 top-tier LLMs on AdsQA, and our ReAd-R achieves the state-of-the-art outperforming strong competitors …
Poster
Woojung Son · Yoonki Cho · Guoyuan An · Chanmi Lee · Sung-eui Yoon

[ Exhibit Hall I ]

Abstract
Person search aims to simultaneously detect and re-identify a query person within an entire scene. While existing studies have made significant progress in achieving superior performance on clean datasets, the challenge of robustness under various corruptions remains largely unexplored. However, the lack of environments for analyzing corruption robustness presents a challenge, as extensive collection of new person images attempting to cover numerous corruption scenarios inevitably introduces privacy concerns. In this context, we construct the environments for analyzing corruption robustness using existing publicly available data, and introduce two benchmarks: CUHK-SYSU-C and PRW-C. Previous studies on corruption have been conducted independently for single tasks such as re-identification and detection. However, recent advancements in person search adopt an end-to-end multi-task learning framework that processes the entire scene as input, unlike the combination of single tasks. This raises the question of whether independent achievements can ensure corruption robustness for person search. Our findings reveal that merely combining independent, robust detection and re-identification models is not sufficient for achieving robust person search. We further investigate the vulnerability of the detection and representation stages to corruption and explore its impact on both foreground and background areas. Based on these insights, we propose a foreground-aware augmentation and regularization method to enhance the robustness of …
Poster
Chiao-An Yang · Kuan-Chuan Peng · Raymond A. Yeh

[ Exhibit Hall I ]

Abstract
Anomaly detection (AD) identifies the defect regions of a given image. Recent works have studied AD, focusing on learning AD without abnormal images, with long-tailed distributed training data, and using a unified model for all classes. In addition, online AD learning has also been explored. In this work, we expand in both directions to a realistic setting by considering the novel task of long-tailed online AD (LTOAD). We first identify that the offline state-of-the-art LTAD methods cannot be directly applied to the online setting. Specifically, LTAD is class-aware, requiring class labels that are not available in the online setting. To address this challenge, we propose a class-agnostic framework for LTAD and then adapt it to our online learning setting. Our method outperforms the SOTA baselines in most offline LTAD settings, including both the industrial manufacturing and the medical domain. In particular, we observe $+$4.63\% image-AUROC on MVTec even compared to methods that have access to class labels and the number of classes. In the most challenging long-tailed online setting, we achieve +0.53\% image-AUROC compared to baselines.
Poster
Fatemeh Ghezloo · Saygin Seyfioglu · Rustin Soraki · Wisdom Ikezogwo · Beibin Li · Tejoram Vivekanandan · Joann Elmore · Ranjay Krishna · Linda Shapiro

[ Exhibit Hall I ]

Abstract
Diagnosing diseases through histopathology whole slide images (WSIs) is fundamental in modern pathology but is challenged by the gigapixel scale and complexity of WSIs. Trained histopathologists overcome this challenge by navigating the WSI, looking for relevant patches, taking notes, and compiling them to produce a final holistic diagnosis. Traditional AI approaches, such as multiple instance learning and transformer-based models, fall short of such a holistic, iterative, multi-scale diagnostic procedure, limiting their adoption in the real-world. We introduce PathFinder, a multi-modal, multi-agent framework that emulates the decision-making process of expert pathologists. PathFinder integrates four AI agents, the Triage Agent, Navigation Agent, Description Agent, and Diagnosis Agent, that collaboratively navigate WSIs, gather evidence, and provide comprehensive diagnoses with natural language explanations. The Triage Agent classifies the WSI as benign or risky; if risky, the Navigation and Description Agents iteratively focus on significant regions, generating importance maps and descriptive insights of sampled patches. Finally, the Diagnosis Agent synthesizes the findings to determine the patient's diagnostic classification. Our experiments show that PathFinder outperforms state-of-the-art methods in skin melanoma diagnosis by 8% while offering inherent explainability through natural language descriptions of diagnostically relevant patches. Qualitative analysis by pathologists shows that the Description Agent's outputs are …
Poster
Marcin Przewięźlikowski · Randall Balestriero · Wojciech Jasiński · Marek Śmieja · Bartosz Zieliński

[ Exhibit Hall I ]

Abstract
Masked Image Modeling (MIM) has emerged as a promising approach for Self-Supervised Learning (SSL) of visual representations. However, the out-of-the-box performance of MIMs is typically inferior to competing approaches. Most users cannot afford fine-tuning due to the need for large amounts of data, high GPU consumption, and specialized user knowledge. Therefore, the practical use of MIM representations is limited. In this paper, we ask what causes the poor out-of-the-box performance of MIMs. Is it due to weaker features produced by MIM models, or is it due to suboptimal usage? Through detailed analysis, we show that attention in MIMs is spread almost uniformly over many patches, leading to ineffective aggregation by the [cls] token. Based on this insight, we propose Selective aggregation to better capture the rich semantic information retained in patch tokens, which significantly improves the out-of-the-box performance of MIM.
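A minimal sketch of aggregating patch tokens instead of relying on the [cls] token is given below; the attention-based top-k weighting is an assumed stand-in for the paper's Selective aggregation, not its exact rule.

```python
# Hedged sketch (PyTorch): pool a representation from the most-attended
# patch tokens rather than taking the [cls] token alone.
import torch

def selective_aggregate(patch_tokens, cls_attn, k=32):
    """
    patch_tokens: (B, N, D) patch embeddings from the MIM encoder
    cls_attn:     (B, N)    attention of the [cls] token over patches
    """
    idx = cls_attn.topk(k, dim=1).indices                      # (B, k) top patches
    batch = torch.arange(patch_tokens.shape[0], device=patch_tokens.device)[:, None]
    selected = patch_tokens[batch, idx]                        # (B, k, D)
    weights = torch.softmax(cls_attn.gather(1, idx), dim=1)    # renormalize over k
    return (weights.unsqueeze(-1) * selected).sum(dim=1)       # (B, D)
```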
Poster
Weili Xu · Enxin Song · Wenhao Chai · Xuexiang Wen · Tian Ye · Gaoang Wang

[ Exhibit Hall I ]

Abstract
The challenge of long video understanding lies in its high computational complexity and prohibitive memory cost, since the memory and computation required by transformer-based LLMs scale quadratically with input sequence length. We propose AuroraLong to address this challenge by replacing the LLM component in MLLMs with RWKV, an RNN-like language model that handles input sequences of arbitrary length with constant-size hidden states. To further increase throughput and efficiency, as well as to reduce the gap between RWKV’s 4k context length and the extended token sequences typical of long videos, we combine visual token merge with linear RNN models by reordering the visual tokens by their sizes in ascending order. Despite having only 2B parameters and being trained exclusively on public data, AuroraLong achieves performance comparable to Transformer-based models of similar size trained on private datasets across multiple video benchmarks. This demonstrates the potential of efficient, linear RNNs to democratize long video understanding by lowering its computational entry barrier. To the best of our knowledge, we are the first to use a RWKV LLM backbone in a LLaVA-like model for open-ended video QA.
Poster
Junghyup Lee · Jeimin Jeon · Dohyung Kim · Bumsub Ham

[ Exhibit Hall I ]

Abstract
Quantization-aware training (QAT) simulates a quantization process during training to lower the bit-precision of weights/activations. It learns quantized weights indirectly by updating latent weights, i.e., full-precision inputs to a quantizer, using gradient-based optimizers. We claim that coupling a user-defined learning rate (LR) with these optimizers is sub-optimal for QAT. Quantized weights transition between discrete levels of a quantizer only if the corresponding latent weights pass transition points, where the quantizer changes discrete states. This suggests that the changes of quantized weights are affected by both the LR for latent weights and their distributions. It is thus difficult to control the degree of changes for quantized weights by scheduling the LR manually. We conjecture that the degree of parameter changes in QAT is related to the number of quantized weights transitioning between discrete levels. Based on this, we introduce a transition rate (TR) scheduling technique that controls the number of transitions of quantized weights explicitly. Instead of scheduling an LR for latent weights, we schedule a target TR of quantized weights, and update the latent weights with a novel transition-adaptive LR (TALR), which accounts for the degree of change of the quantized weights during QAT. Experimental results demonstrate the effectiveness of our approach on standard benchmarks.
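A minimal sketch of the scheduling idea under stated assumptions: a toy uniform quantizer, a measured transition rate per step, and a multiplicative learning-rate adjustment that tracks a target TR. The paper's actual TALR rule may differ.

```python
# Minimal sketch (assumptions, not the paper's implementation) of transition-rate tracking.
import torch

def quantize(w: torch.Tensor, step: float = 0.25) -> torch.Tensor:
    return torch.round(w / step).clamp(-2, 1) * step  # toy 2-bit uniform quantizer

latent = torch.randn(10_000)          # latent (full-precision) weights
lr, target_tr = 1e-2, 0.05            # target fraction of weights changing level per step

for it in range(100):
    grad = torch.randn_like(latent)   # stand-in for task gradients
    q_before = quantize(latent)
    latent = latent - lr * grad
    q_after = quantize(latent)

    observed_tr = (q_before != q_after).float().mean().item()
    # Transition-adaptive LR: grow LR if too few transitions, shrink if too many.
    lr *= (target_tr / max(observed_tr, 1e-6)) ** 0.5
    lr = float(min(max(lr, 1e-5), 1.0))

print(f"final lr={lr:.4f}, last observed TR={observed_tr:.4f}")
```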
Poster
Junhao Zheng · Jiahao Sun · Chenhao Lin · Zhengyu Zhao · Chen Ma · Chong Zhang · Cong Wang · Qian Wang · Chao Shen

[ Exhibit Hall I ]

Abstract
Developing reliable defenses against patch attacks for object detectors has attracted increasing interest. However, we identify that existing defense evaluations lack a unified and comprehensive framework, causing inconsistent and incomplete assessment of current methods. To address this issue, we revisit 10 representative defenses and present the first large-scale benchmark, involving 2 attack goals, 13 patch attacks, 11 object detectors, and 4 diverse metrics. This leads to the first large-scale adversarial patch dataset with 94 types of patches and 94,000 images, which can also be used to improve existing defenses. We conduct comprehensive analyses to reveal new insights: (1) The difficulty in defending against naturalistic patches lies in the data distribution, rather than the commonly believed high frequencies. In light of this, we construct a large-scale dataset with diverse patch distributions to obtain stronger defenses, with 15.09\% AP@0.5 improvement. (2) A higher patch detection accuracy does not necessarily imply better defense performance. Instead, the average precision of the attacked object shows higher consistency. (3) Existing defenses can be substantially bypassed by adaptive attacks, and defenses that integrate complex/stochastic models or patch-level features are less vulnerable. We will open-source our dataset and code as well as keep integrating new attacks/defenses.
Poster
Yuheng Shi · Minjing Dong · Chang Xu

[ Exhibit Hall I ]

Abstract
While Contrastive Language-Image Pre-training (CLIP) has advanced open-vocabulary predictions, its performance on semantic segmentation remains suboptimal. This shortfall primarily stems from its spatial-invariant semantic features and constrained resolution. While previous adaptations addressed the spatial-invariant semantics by modifying the self-attention in CLIP's image encoder, the issue of limited resolution remains unexplored. Different from previous segment-then-splice methods that segment sub-images via a sliding window and splice the results, we introduce a splice-then-segment paradigm that incorporates the Segment-Anything Model (SAM) to tackle the resolution issue since SAM excels at extracting fine-grained semantic correlations from high-resolution images. Specifically, we introduce Trident, a training-free framework that first splices features extracted by CLIP and DINO from sub-images, then leverages SAM's encoder to create a correlation matrix for global aggregation, enabling a broadened receptive field for effective segmentation. Besides, we propose a refinement strategy for CLIP's coarse segmentation outputs by transforming them into prompts for SAM, further enhancing the segmentation performance. Trident achieves a significant improvement in the mIoU across eight popular benchmarks compared with the current SOTA. Furthermore, it can also be utilized to generate visual prompts that enhance the performance of Large Vision-Language Models (LVLMs).
Poster
Zexi Jia · Chuanwei Huang · Yeshuang Zhu · Hongyan Fei · Ying Deng · Zhiqiang Yuan · Jiapei Zhang · Jinchao Zhang · Jie Zhou

[ Exhibit Hall I ]

Abstract
Vision-language models (VLMs) often struggle with compositional reasoning due to insufficient high-quality image-text data. To tackle this challenge, we propose a novel block-based diffusion approach that automatically generates counterfactual datasets without manual annotation. Our method utilizes large language models to identify entities and their spatial relationships. It then independently generates image blocks as "puzzle pieces" coherently arranged according to specified compositional rules. This process creates diverse, high-fidelity counterfactual image-text pairs with precisely controlled variations. In addition, we introduce a specialized loss function that differentiates inter-set from intra-set samples, enhancing training efficiency and reducing the need for negative samples. Experiments demonstrate that fine-tuning VLMs with our counterfactual datasets significantly improves visual reasoning performance. Our approach achieves state-of-the-art results across multiple benchmarks while using substantially less training data than existing methods.
Poster
Jiao Tang · Junjie Zhou · Bo Qian · Peng Wan · Yingli Zuo · WEI SHAO · Daoqiang Zhang

[ Exhibit Hall I ]

Abstract
Tissue segmentation in pathology images is crucial for computer-aided diagnostics of human cancers. Traditional tissue segmentation models rely heavily on large-scale labeled datasets, where every tissue type must be annotated by experts. However, due to the complexity of the tumor micro-environment, collecting annotations for all possible tissue types is challenging, which makes the traditional methods ineffective in segmenting unseen tissue types with zero training samples. With the rapid development of vision-language models (VLMs), recent studies extend their powerful zero-shot capabilities to pixel-level segmentation tasks, where the model is trained only on seen classes but can perform tissue segmentation on both seen and unseen categories in the testing phase. However, these VLM-based zero-shot segmentation models still require substantial annotation efforts on seen classes. To attain desirable segmentation performance on both seen and unseen categories with limited labeled data, we propose AcZeroTS, a novel active learning framework for zero-shot tissue segmentation in pathology images. Specifically, AcZeroTS is built on a VLM-based prototype-guided zero-shot segmentation model called ProZS. We introduce a novel active selection criterion to select the most valuable samples for annotation on seen classes, which not only considers both uncertainty and diversity of unlabeled samples, but also ensures that the generated prototypes …
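A hedged sketch of one way such an uncertainty-plus-diversity selection criterion could look, using entropy and greedy farthest-point selection; the function, weighting, and names are illustrative, not AcZeroTS's actual criterion.

```python
# Minimal sketch (hypothetical): pick unlabeled samples that are both uncertain and diverse.
import numpy as np

def select_for_annotation(probs: np.ndarray, feats: np.ndarray, budget: int, alpha: float = 0.5):
    """probs: (N, C) class probabilities; feats: (N, D) features; returns chosen indices."""
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)
    entropy = entropy / entropy.max()                       # normalize to [0, 1]
    chosen = [int(entropy.argmax())]                        # start with the most uncertain sample
    for _ in range(budget - 1):
        d = np.linalg.norm(feats[:, None, :] - feats[chosen][None, :, :], axis=-1).min(axis=1)
        d = d / (d.max() + 1e-12)                           # distance to nearest already-chosen sample
        score = alpha * entropy + (1 - alpha) * d           # trade off uncertainty vs. diversity
        score[chosen] = -np.inf
        chosen.append(int(score.argmax()))
    return chosen

rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(5), size=200)     # fake softmax outputs
f = rng.normal(size=(200, 64))              # fake patch features
print(select_for_annotation(p, f, budget=8))
```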
Poster
Xiaowen Ma · Zhen-Liang Ni · Xinghao Chen

[ Exhibit Hall I ]

Abstract
Mamba has shown great potential for computer vision due to its linear complexity in modeling the global context with respect to the input length. However, existing lightweight Mamba-based backbones cannot demonstrate performance that matches Convolution- or Transformer-based methods. Through observation, we find that simply modifying the scanning path in the image domain is not conducive to fully exploiting the potential of vision Mamba. In this paper, we first perform comprehensive spectral and quantitative analyses, and verify that the Mamba block mainly models low-frequency information under a Convolution-Mamba hybrid architecture. Based on the analyses, we introduce a novel Laplace mixer to decouple the features in terms of frequency and input only the low-frequency components into the Mamba block. In addition, considering the redundancy of the features and the different requirements for high-frequency details and low-frequency global information at different stages, we introduce a frequency ramp inception, i.e., gradually reducing the input dimensions of the high-frequency branches, so as to efficiently trade off the high-frequency and low-frequency components at different layers. By integrating mobile-friendly convolutions and the efficient Laplace mixer, we build a series of tiny hybrid vision Mamba backbones called TinyViM. The proposed TinyViM achieves impressive performance on several downstream tasks including image classification, semantic …
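A minimal sketch of a Laplace-style frequency split under simple assumptions (pooling-based low-pass, residual high-pass); in the paper the low-frequency part would feed the Mamba block and the high-frequency residual a lightweight local branch, which are not reproduced here.

```python
# Minimal sketch (illustrative only): split features into low- and high-frequency parts.
import torch
import torch.nn.functional as F

def laplace_split(x: torch.Tensor, k: int = 3):
    """x: (B, C, H, W). Returns (low, high) with low = blurred x, high = x - low."""
    low = F.avg_pool2d(x, kernel_size=k, stride=1, padding=k // 2)
    return low, x - low

x = torch.randn(1, 64, 32, 32)
low, high = laplace_split(x)
# low  -> global branch (the Mamba block in the paper); high -> cheap local convs.
print(low.shape, high.shape, torch.allclose(low + high, x))
```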
Poster
Renshan Zhang · Rui Shao · Gongwei Chen · Miao Zhang · Kaiwen Zhou · Weili Guan · Liqiang Nie

[ Exhibit Hall I ]

Abstract
The incorporation of high-resolution visual input equips multimodal large language models (MLLMs) with enhanced visual perception capabilities for real-world tasks. However, most existing high-resolution MLLMs rely on a cropping-based approach to process images, which leads to fragmented visual encoding and a sharp increase in redundant tokens. To tackle these issues, we propose the FALCON model. FALCON introduces a novel visual register technique to simultaneously: 1) Eliminate redundant tokens at the stage of visual encoding. To directly address the visual redundancy present in the output of the vision encoder, we propose a Register-based Representation Compacting (ReCompact) mechanism. This mechanism introduces a set of learnable visual registers designed to adaptively aggregate essential information while discarding redundancy. It enables the encoder to produce a more compact visual representation with a minimal number of output tokens, thus eliminating the need for an additional compression module. 2) Ensure continuity in visual encoding. To address the potential encoding errors caused by fragmented visual inputs, we develop a Register Interactive Attention (ReAtten) module. This module facilitates effective and efficient information exchange across sub-images by enabling interactions between visual registers. It ensures the continuity of visual semantics throughout the encoding. We conduct comprehensive experiments with FALCON on high-resolution benchmarks …
Poster
Zhentao Tan · Ben Xue · Jian Jia · Junhao Wang · Wencai Ye · Shaoyun Shi · Sun Mingjie · Wenjin Wu · Quan Chen · Peng Jiang

[ Exhibit Hall I ]

Abstract
This paper presents the $\textbf{S}$emantic-a$\textbf{W}$ar$\textbf{E}$ spatial-t$\textbf{E}$mporal $\textbf{T}$okenizer (SweetTok), a novel video tokenizer to overcome the limitations in current video tokenization methods for compacted yet effective discretization. Unlike previous approaches that process flattened local visual patches via direct discretization or adaptive query tokenization, SweetTok proposes a decoupling framework, compressing visual inputs through distinct spatial and temporal queries via $\textbf{D}$ecoupled $\textbf{Q}$uery $\textbf{A}$uto$\textbf{E}$ncoder (DQAE). This design allows SweetTok to efficiently compress video token count while achieving better fidelity by capturing essential information across spatial and temporal dimensions. Furthermore, we design a $\textbf{M}$otion-enhanced $\textbf{L}$anguage $\textbf{C}$odebook (MLC) tailored for spatial and temporal compression to address the differences in semantic representation between appearance and motion information. SweetTok significantly improves video reconstruction results by $\textbf{42.8}$\% w.r.t. rFVD on the UCF-101 dataset. With a better token compression strategy, it also boosts downstream video generation results by $\textbf{15.1}$\% w.r.t. gFVD. Additionally, the compressed decoupled tokens are imbued with semantic information, enabling few-shot recognition capabilities powered by LLMs in downstream applications.
Poster
Ping Cao · Yepeng Tang · Chunjie Zhang · Xiaolong Zheng · Chao Liang · Yunchao Wei · Yao Zhao

[ Exhibit Hall I ]

Abstract
Human-object interaction (HOI) detection fundamentally relies on capturing fine-grained visual information to distinguish complex relationships between humans and objects. While recent generative diffusion models have demonstrated remarkable capability in learning detailed visual concepts through pixel-level generation, their potential for interaction-level relationship modeling remains largely unexplored. We aim to bridge this gap by leveraging generative models’ fine-grained visual perception to enhance HOI detection through improved visual relation representation learning. In this work, we propose a Visual Relation Diffusion model (VRDiff) for HOI detection, which introduces dense visual relation conditions. Considering that diffusion models primarily focus on instance-level objects, we design an interaction-aware condition representation that learns relation features with spatial responsiveness and contextual interaction cues. Instead of relying on text conditions, VRDiff leverages learned visual relation representations as conditions for the diffusion model. Furthermore, we refine the visual relation representations through generative feedback from the text-to-image diffusion model, enhancing HOI detection performance without requiring image generation. Extensive experiments on the HICO-DET benchmark demonstrate that VRDiff achieves state-of-the-art performance under both standard and zero-shot HOI detection settings.
Poster
Yingfan MA · Bohan An · Ao Shen · Mingzhi Yuan · Minghong Duan · Manning Wang

[ Exhibit Hall I ]

Abstract
Whole Slide Image (WSI) classification has been widely used in pathological diagnosis and prognosis prediction, and it is commonly formulated as a weakly-supervised Multiple Instance Learning (MIL) problem because of the large size of WSIs and the difficulty of obtaining fine-grained annotations. In the MIL formulation, a WSI is treated as a bag and the patches cut from it are treated as its instances, and most existing methods first extract instance features and then aggregate them into a bag feature using an attention-based mechanism for bag-level prediction. These models are trained using only bag-level labels, so they often lack instance-level insights and lose detailed semantic information, which limits their bag-level classification performance and hampers their ability to exploit highly expressive information. In this paper, we propose Flow-MIL, which leverages a normalizing flow-based Latent Semantic Embedding Space (LSES) to enhance feature representation. By mapping patches into the simple and highly-expressive latent space LSES, Flow-MIL achieves effective slide-level aggregation while preserving critical semantic information. We also introduce Gaussian Mixture Model-based Latent Semantic Prototypes (LSP) within the LSES to capture the class-specific pathological distribution for each class and refine pseudo instance labels. Extensive experiments on three public WSI datasets show that Flow-MIL outperforms recent SOTA methods in both …
Poster
Ke Zhang · Yi Huang · Wei Liu · Yuanyuan Wang · Vishal Patel · Le Lu · Xu Han · Dakai Jin · Ke Yan

[ Exhibit Hall I ]

Abstract
Accurate segmentation of tubular structures in medical images, such as vessels and airway trees, is crucial for computer-aided diagnosis, radiotherapy, and surgical planning. However, significant challenges exist in algorithm design when faced with diverse sizes, complex topologies, and (often) incomplete data annotation of these structures. We address these difficulties by proposing a new tubular structure segmentation framework named HarmonySeg. First, we design a deep-to-shallow decoder network featuring flexible convolution blocks with varying receptive fields, which enables the model to effectively adapt to tubular structures of different scales. Second, to highlight potential anatomical regions and improve the recall of small tubular structures, we incorporate vesselness maps as auxiliary information. These maps are aligned with image features through a shallow-and-deep fusion module, which simultaneously eliminates unreasonable candidates to maintain high precision. Finally, we introduce a topology-preserving loss function that leverages contextual and shape priors to balance the growth and suppression of tubular structures, which also allows the model to handle low-quality and incomplete annotations. Extensive quantitative experiments are conducted on four public datasets. The results show that our model can accurately segment 2D and 3D tubular structures and outperform existing state-of-the-art methods. External validation on a private dataset also demonstrates good generalizability. …
Poster
Xiaohui Chen · Satya Narayan Shukla · Mahmoud Azab · Aashu Singh · Qifan Wang · David Yang · ShengYun Peng · Hanchao Yu · Shen Yan · Xuewen Zhang · Baosheng He

[ Exhibit Hall I ]

Abstract
How well can Multimodal Large Language Models (MLLMs) understand composite images? Composite images (CIs) are synthetic visuals created by merging multiple visual elements, such as charts, posters, or screenshots, rather than being captured directly by a camera. While CIs are prevalent in real-world applications, recent MLLM developments have primarily focused on interpreting natural images (NIs). Our research reveals that current MLLMs face significant challenges in accurately understanding CIs, often struggling to extract information or perform complex reasoning based on these images. We find that existing training data for CIs are mostly formatted for question-answer tasks (e.g., in datasets like ChartQA and ScienceQA), while high-quality image-caption datasets, critical for robust vision-language alignment, are only available for NIs. To bridge this gap, we introduce Composite Captions (CompCap), a flexible framework that leverages Large Language Models (LLMs) and automation tools to synthesize CIs with accurate and detailed captions. Using CompCap, we curate CompCap-118K, a dataset containing 118K image-caption pairs across six CI types. We validate the effectiveness of CompCap-118K by supervised fine-tuning MLLMs of three sizes: xGen-MM-inst.-4B and LLaVA-NeXT-Vicuna-7B/13B. Empirical results show that CompCap-118K significantly enhances MLLMs’ understanding of CIs, yielding average gains of 1.7%, 2.0%, and 2.9% across eleven benchmarks, respectively.
Poster
Ayush Gupta · Anirban Roy · Rama Chellappa · Nathaniel D. Bastian · Alvaro Velasquez · Susmit Jha

[ Exhibit Hall I ]

Abstract
We address the problem of video question answering (video QA) with temporal grounding in a weakly supervised setup, without any temporal annotations. Given a video and a question, we generate an open-ended answer grounded with the start and end time. For this task, we propose TOGA: a vision-language model for Temporally Grounded Open-Ended Video QA with Weak Supervision. We instruct-tune TOGA to jointly generate the answer and the temporal grounding. We operate in a weakly supervised setup where the temporal grounding annotations are not available. We generate pseudo labels for temporal grounding and ensure the validity of these labels by imposing a consistency constraint between the question of a grounding response and the response generated by a question referring to the same temporal segment. We notice that jointly generating the answers with the grounding improves performance on question answering as well as grounding. We evaluate TOGA on grounded QA and open-ended QA tasks. For grounded QA, we consider the NExT-GQA benchmark which is designed to evaluate weakly supervised grounded open-ended question answering. For open-ended QA, we consider the MSVD-QA and ActivityNet-QA benchmarks. We achieve state-of-the-art performance for both tasks on these benchmarks.
Poster
Trevine Oorloff · Vishwanath Sindagi · Wele Gedara Chaminda Bandara · Ali Shafahi · Amin Ghiasi · Charan Prakash · Reza Ardekani

[ Exhibit Hall I ]

Abstract
Large language models (LLMs) in natural language processing (NLP) have demonstrated great potential for in-context learning (ICL) -- the ability to leverage a small set of example prompts to adapt to various tasks without having to explicitly update model weights. ICL has recently been explored for the visual domain with promising early outcomes. These approaches involve specialized training and/or additional data, which complicates the process and limits its generalizability. In this work, we show that off-the-shelf Stable Diffusion models can be re-purposed for visual in-context learning (V-ICL). Specifically, we formulate an in-place attention re-computation within the self-attention layers of the Stable Diffusion architecture that explicitly incorporates context between the query and example prompts. Without any additional fine-tuning, we show that this re-purposed Stable Diffusion model is able to adapt to six different tasks: foreground segmentation, single object detection, semantic segmentation, keypoint detection, edge detection, and colorization. For example, the proposed approach improves the mean intersection over union (mIoU) for the foreground segmentation task on the Pascal-5i dataset by 8.9% and 3.2% over recent methods such as Visual Prompting and IMProv, respectively. Additionally, we show that the proposed method is able to effectively leverage multiple prompts through ensembling to infer the task …
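A small sketch of the general mechanism, under the assumption that query-image tokens attend over keys and values drawn from both themselves and the example prompt; this is an interpretation for illustration, not the paper's exact re-computation.

```python
# Minimal sketch: self-attention where the example-prompt tokens contribute extra keys/values.
import torch
import torch.nn.functional as F

def in_context_attention(q_tokens, ex_tokens, Wq, Wk, Wv):
    """q_tokens: (Nq, D) query-image tokens; ex_tokens: (Ne, D) example-prompt tokens."""
    q = q_tokens @ Wq
    k = torch.cat([q_tokens, ex_tokens]) @ Wk      # keys include the example prompt
    v = torch.cat([q_tokens, ex_tokens]) @ Wv
    attn = F.softmax(q @ k.T / (q.shape[-1] ** 0.5), dim=-1)
    return attn @ v

D = 64
out = in_context_attention(torch.randn(10, D), torch.randn(20, D),
                           torch.randn(D, D), torch.randn(D, D), torch.randn(D, D))
print(out.shape)   # (10, 64): query tokens updated with context from the example prompt
```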
Poster
Haicheng Wang · Zhemeng Yu · Gabriele Spadaro · Chen Ju · Victor Quétu · Shuai Xiao · Enzo Tartaglione

[ Exhibit Hall I ]

Abstract
Recently, Multi-modal Large Language Models (MLLMs) have shown remarkable effectiveness for multi-modal tasks due to their abilities to generate and understand cross-modal data. However, processing long sequences of visual tokens extracted from visual backbones poses a challenge for deployment in real-time applications. To address this issue, we introduce FOLDER, a simple yet effective plug-and-play module designed to reduce the length of the visual token sequence, mitigating computational and memory demands during both training and inference. Through a comprehensive analysis of the token reduction process in the vision encoder, we analyze the information loss introduced by different reduction strategies and develop FOLDER to preserve key information while removing visual redundancy. We show the effectiveness of FOLDER by integrating it into the visual backbone of various MLLMs, significantly accelerating the inference phase. Furthermore, we evaluate its utility as a training accelerator or even performance booster for MLLMs. In both contexts, FOLDER achieves comparable or even better performance than the original models, while dramatically reducing complexity by removing up to 70% of visual tokens. The source code will be open-sourced upon acceptance of the article.
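For illustration, a generic similarity-based token-reduction step in the spirit of such plug-and-play modules (this is not FOLDER's actual algorithm): tokens are greedily merged with their most similar neighbour until the target length is reached.

```python
# Minimal sketch of similarity-based visual-token reduction.
import torch

def reduce_tokens(tokens: torch.Tensor, keep: int) -> torch.Tensor:
    """tokens: (N, D). Repeatedly average the most similar pair until N == keep."""
    t = tokens.clone()
    while t.shape[0] > keep:
        n = torch.nn.functional.normalize(t, dim=-1)
        sim = n @ n.T
        sim.fill_diagonal_(-float("inf"))
        i, j = divmod(int(sim.argmax()), sim.shape[1])   # most similar pair
        merged = (t[i] + t[j]) / 2
        keep_mask = torch.ones(t.shape[0], dtype=torch.bool)
        keep_mask[[i, j]] = False
        t = torch.cat([t[keep_mask], merged.unsqueeze(0)], dim=0)
    return t

visual_tokens = torch.randn(576, 1024)                 # e.g. ViT patch tokens for one image
print(reduce_tokens(visual_tokens, keep=173).shape)    # roughly a 70% reduction
```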
Poster
guangyao Li · Siping Zhuang · Yajun Jian · Yan Yan · Hanzi Wang

[ Exhibit Hall I ]

Abstract
Referring Multi-Object Tracking (RMOT) aims to detect and track specific objects based on natural language expressions. Previous methods typically rely on sentence-level vision-language alignment, often failing to exploit fine-grained linguistic cues that are crucial for distinguishing objects with similar characteristics. Notably, these cues play distinct roles at different tracking stages and should be leveraged accordingly to provide more explicit guidance. In this work, we propose DKGTrack, a novel RMOT method that enhances language comprehension for precise object tracking by decoupling language expressions into localized descriptions and motion states. To improve the accuracy of language-guided object identification, we introduce a Static Semantic Enhancement (SSE) module, which enhances region-level vision-language alignment through hierarchical cross-modal feature interaction, providing more discriminative object representations for tracking. Furthermore, we propose a Motion Perception Alignment (MPA) module that explicitly aligns object queries with motion descriptions, enabling accurate object trajectory prediction across frames. Experimental results on multiple RMOT benchmarks demonstrate the effectiveness of our method, which achieves competitive performance in challenging tracking scenarios.
Poster
Yuxiao Wang · Yu Lei · Zhenao WEI · WeiYing Xue · Xinyu Jiang · Nan Zhuang · Qi Liu

[ Exhibit Hall I ]

Abstract
The task of Human-Object conTact (HOT) detection involves identifying the specific areas of the human body that are touching objects. Nevertheless, current models are restricted to just one type of image, often leading to over-segmentation in areas with little interaction and struggling to maintain category consistency within specific regions. To tackle this issue, a HOT framework, termed **P3HOT**, is proposed, which blends **P**rompt guidance and human **P**roximal **P**erception. To begin with, we utilize a semantic-driven prompt mechanism to direct the network's attention towards the relevant regions based on the correlation between image and text. Then a human proximal perception mechanism is employed to dynamically perceive the key depth range around the human, using learnable parameters to effectively eliminate regions where interactions are not expected. Calculating depth resolves the uncertainty of the overlap between humans and objects in a 2D perspective, providing a quasi-3D viewpoint. Moreover, a Regional Joint Loss (RJLoss) is introduced as a new loss to inhibit abnormal categories in the same area. A new evaluation metric called ``AD-Acc.'' is introduced to address the shortcomings of existing methods in handling negative samples. Comprehensive experimental results demonstrate that our approach achieves state-of-the-art performance in four metrics across two benchmark datasets. Specifically, our model achieves …
Poster
Sebastian Schmidt · Julius Koerner · Dominik Fuchsgruber · Stefano Gasperini · Federico Tombari · Stephan Günnemann

[ Exhibit Hall I ]

Abstract
In panoptic segmentation, individual instances must be separated within semantic classes. As state-of-the-art methods rely on a pre-defined set of classes, they struggle with novel categories and out-of-distribution (OOD) data. This is particularly problematic in safety-critical applications, such as autonomous driving, where reliability in unseen scenarios is essential. We address the gap between outstanding benchmark performance and reliability by proposing Prior2Former (P2F), the first approach for segmentation vision transformers rooted in evidential learning. P2F extends the mask vision transformer architecture by incorporating a Beta prior for computing model uncertainty in pixel-wise binary mask assignments. This design enables high-quality uncertainty estimation that effectively detects novel and OOD objects, enabling state-of-the-art anomaly instance segmentation and open-world panoptic segmentation. Unlike most segmentation models addressing unknown classes, P2F operates without access to OOD data samples or contrastive training on void (i.e., unlabeled) classes, making it highly applicable in real-world scenarios where such prior information is unavailable. Additionally, P2F can be flexibly applied to anomaly instance and panoptic segmentation. Through comprehensive experiments on the Cityscapes, COCO, SegmentMeIfYouCan, and OoDIS datasets, we demonstrate the state-of-the-art performance of P2F. It achieves the highest ranking in the OoDIS anomaly instance benchmark among methods not using OOD data in any way.
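A minimal, assumption-laden sketch of a Beta-prior mask head: two non-negative evidence channels parameterize a per-pixel Beta(alpha, beta), giving a mask probability together with a closed-form uncertainty; layer names and shapes are illustrative, not Prior2Former's architecture.

```python
# Minimal sketch: per-pixel Beta(alpha, beta) posterior for binary mask assignments.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BetaMaskHead(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.evidence = nn.Conv2d(channels, 2, kernel_size=1)  # -> (alpha, beta) evidence

    def forward(self, feats: torch.Tensor):
        # softplus keeps evidence positive; +1 corresponds to a uniform Beta(1, 1) prior.
        ab = F.softplus(self.evidence(feats)) + 1.0
        alpha, beta = ab[:, :1], ab[:, 1:]
        mean = alpha / (alpha + beta)                                      # mask probability
        var = (alpha * beta) / ((alpha + beta) ** 2 * (alpha + beta + 1))  # predictive variance
        return mean, var

feats = torch.randn(1, 256, 64, 64)
prob, uncertainty = BetaMaskHead(256)(feats)
print(prob.shape, uncertainty.shape)   # both (1, 1, 64, 64)
```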
Poster
Peng Ren · Tian Bai · Jing Sun · Fuming Sun

[ Exhibit Hall I ]

Abstract
Open-Vocabulary Camouflaged Object Segmentation (OVCOS) aims to segment camouflaged objects of any category based on text descriptions. Although existing open-vocabulary methods exhibit strong segmentation capabilities, they still have a major limitation in camouflaged scenarios: semantic confusion, which leads to incomplete segmentation and class shift in the model. To mitigate the above limitation, we propose a framework for OVCOS, named SuCLIP. Specifically, we design a context-aware prompt scheme that leverages the internal knowledge of the CLIP visual encoder to enrich the text prompt and align it with local visual features, thereby enhancing the text prompt. To better align the visual semantic space and the text semantic space, we design a class-aware feature selection module to dynamically adjust text and visual embeddings, making them better matched with camouflaged objects. Meanwhile, we introduce a semantic consistency loss to mitigate the semantic deviation between the text prompt and visual features, ensuring semantic consistency between the segmentation results and the text prompt. Finally, we design a text query decoder that precisely maps textual semantics to pixel-level segmentation results, thereby achieving semantically and spatially consistent decoding. Experimental results show that SuCLIP significantly outperforms the advanced method OVCoser on the OVCamo dataset.
Poster
rongkun Zheng · Lu Qi · Xi Chen · Yi Wang · Kun Wang · Hengshuang Zhao

[ Exhibit Hall I ]

Abstract
Recent efforts in video reasoning segmentation (VRS) integrate large language models (LLMs) with perception models to localize and track objects via textual instructions, achieving barely satisfactory results in simple scenarios. However, they struggle to discriminate and deduce the objects from user queries in more realistic scenes characterized by long durations, multiple objects, rapid motion, and heavy occlusions. In this work, we analyze the underlying causes of these limitations, and present **ViLLa**: **Vi**deo reasoning segmentation with **L**arge **La**nguage Model. Remarkably, our ViLLa manages to tackle these challenges through multiple core innovations: (1) a context synthesizer that dynamically encodes the user intent with video contexts for accurate reasoning, resolving ambiguities in complex queries, and (2) a hierarchical temporal synchronizer that disentangles multi-object interactions across complex temporal scenarios by modelling them at local and global temporal scales. To enable efficient processing of long videos, ViLLa incorporates (3) a key segment sampler that adaptively partitions long videos into shorter but semantically dense segments for less redundancy. What's more, to promote research in this unexplored area, we construct a VRS benchmark, **VideoReasonSeg**, featuring different complex scenarios. Our model also exhibits impressive state-of-the-art results on VideoReasonSeg, Ref-YouTube-VOS, Ref-DAVIS17, MeViS, and ReVOS. Both quantitative and qualitative …
Poster
Xiaoyi Bao · Chen-Wei Xie · Hao Tang · Tingyu Weng · Xiaofeng Wang · Yun Zheng · Xingang Wang

[ Exhibit Hall I ]

Abstract
In recent years, the introduction of Multi-modal Large Language Models (MLLMs) into video understanding tasks has become increasingly prevalent. However, how to effectively integrate temporal information remains a critical research focus. Traditional approaches treat spatial and temporal information separately. Due to issues like motion blur, it is challenging to accurately represent the spatial information of rapidly moving objects. This can lead to temporally important regions being underemphasized during spatial feature extraction, which in turn hinders accurate spatio-temporal interaction and video understanding. To address this limitation, we propose an innovative video representation method called Dynamic-Image (DynImg). Specifically, we introduce a set of non-key frames as temporal prompts to highlight the spatial areas containing fast-moving objects. During the process of visual feature extraction, these prompts guide the model to pay additional attention to the fine-grained spatial features corresponding to these regions. Moreover, to maintain the correct sequence for DynImg, we employ a corresponding 4D video Rotary Position Embedding. This retains both the temporal and spatial adjacency of DynImg, helping MLLM understand the spatio-temporal order within this combined format. Experimental evaluations reveal that DynImg surpasses the state-of-the-art methods by approximately 2% across multiple video understanding benchmarks, proving the effectiveness of our temporal prompts …
Poster
chunlin wen · Yu Zhang · Jie Fan · Hongyuan Zhu · Xiu-Shen Wei · Yijun Wang · Zhiqiang Kou · Shuzhou Sun

[ Exhibit Hall I ]

Abstract
Few-shot semantic segmentation (FSS) aims to segment objects of novel categories in the query images given only a few annotated support samples. Existing methods primarily build the image-level correlation between the support target object and the entire query image. However, this correlation contains the hard pixel noise, i.e., irrelevant background objects, that is intractable to trace and suppress, leading to the overfitting of the background. To address the limitation of this correlation, we imitate the biological vision process to identify novel objects in the object-level information. Target identification in the general objects is more valid than in the entire image, especially in the low-data regime. Inspired by this, we design an Object-level Correlation Network (OCNet) by establishing the object-level correlation between the support target object and query general objects, which is mainly composed of the General Object Mining Module (GOMM) and Correlation Construction Module (CCM). Specifically, GOMM constructs the query general object feature by learning saliency and high-level similarity cues, where the general objects include the irrelevant background objects and the target foreground object. Then, CCM establishes the object-level correlation by allocating the target prototypes to match the general object feature. The generated object-level correlation can mine the query target …
Poster
Taimur Hassan · Anabia Sohail · Muzammal Naseer · Naoufel Werghi

[ Exhibit Hall I ]

Abstract
Retinopathy comprises a group of retinal disorders that can lead to severe visual impairment or blindness. The heterogeneous morphology of lesions poses a significant challenge in developing robust diagnostic systems. Supervised approaches rely on large labeled datasets and often struggle with generalization. To address these limitations, we propose an unsupervised vision-language neural graph featurization method. This method first segments fundus images into a set of super-pixels via Simple Linear Iterative Clustering (SLIC). The super-pixel regions are then decomposed into an undirected graph where each super-pixel serves as a node, and spatially adjacent nodes are connected by edges. A Hamiltonian path systematically traverses the graph and iteratively updates and propagates node and edge latent-space embeddings throughout the graph until convergence is achieved. Then, a normalized cut separates the converged embeddings into two clusters within a latent space that represent the lesion and healthy super-pixel regions of the input scans. The lesion super-pixels are further classified into lesion categories using a prompt-based zero-shot vision-language model. The proposed method is rigorously tested on three public datasets, dubbed ODIR, BIOMISA, and IDRiD, achieving F1-scores of 0.89, 0.93, and 0.92, respectively, with significant performance gains over state-of-the-art methods.
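A minimal sketch of the graph-construction and normalized-cut stages only, using a bundled sample image as a stand-in for a fundus scan; the Hamiltonian-path embedding propagation and the zero-shot lesion classification are omitted, and all parameters are illustrative.

```python
# Minimal sketch: SLIC super-pixels -> adjacency graph -> spectral normalized cut (two clusters).
import numpy as np
from skimage import data
from skimage.segmentation import slic

img = data.astronaut()                                    # stand-in image (not a fundus scan)
labels = slic(img, n_segments=200, compactness=10, start_label=0)
n = labels.max() + 1

# Node features: mean colour of each super-pixel.
feats = np.stack([img[labels == i].mean(axis=0) for i in range(n)])

# Edges: 4-neighbourhood adjacency between different super-pixels, weighted by colour similarity.
W = np.zeros((n, n))
for a, b in [(labels[:, :-1], labels[:, 1:]), (labels[:-1, :], labels[1:, :])]:
    for i, j in zip(a[a != b], b[a != b]):
        w = np.exp(-np.linalg.norm(feats[i] - feats[j]) ** 2 / 1e4)
        W[i, j] = W[j, i] = max(W[i, j], w)

# Normalized cut via the Fiedler vector of the symmetric normalized Laplacian.
d = W.sum(axis=1) + 1e-12
L = np.eye(n) - (W / np.sqrt(d)[:, None]) / np.sqrt(d)[None, :]
vals, vecs = np.linalg.eigh(L)
cluster = (vecs[:, 1] > np.median(vecs[:, 1])).astype(int)   # two clusters of super-pixels
print(np.bincount(cluster))
```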
Poster
Shuaiting Li · Juncan Deng · Chengxuan Wang · Kedong Xu · Rongtao Deng · Hong Gu · Haibin Shen · Kejie Huang

[ Exhibit Hall I ]

Abstract
Vector Quantization (VQ) has emerged as a prominent weight compression technique, showcasing substantially lower quantization errors than uniform quantization across diverse models, particularly in extreme compression scenarios. However, its efficacy during fine-tuning is limited by the constraint of the compression format, where weight vectors assigned to the same codeword are restricted to updates in the same direction. Consequently, many quantized weights are compelled to move in directions contrary to their local gradient information. To mitigate this issue, we introduce a novel VQ paradigm, Sign-Splitting VQ (SSVQ), which decouples the sign bit of weights from the codebook. Our approach involves extracting the sign bits of uncompressed weights and performing clustering and compression on all-positive weights. We then introduce latent variables for the sign bit and jointly optimize both the signs and the codebook. Additionally, we implement a progressive freezing strategy for the learnable sign to ensure training stability. Extensive experiments on various modern models and tasks demonstrate that SSVQ achieves a significantly superior compression-accuracy trade-off compared to conventional VQ. Furthermore, we validate our algorithm on a hardware accelerator, showing that SSVQ achieves a 3× speedup over the 8-bit compressed model by reducing memory access.
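A minimal sketch of the sign-splitting idea under simple assumptions: keep one sign bit per weight, run k-means vector quantization on the magnitudes, and reconstruct as sign times codeword; the paper's joint sign/codebook optimization and progressive sign freezing are not reproduced here.

```python
# Minimal sketch: sign bits stored separately, all-positive sub-vectors clustered into a codebook.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
W = rng.normal(size=(4096, 8))           # weight matrix viewed as 8-dim sub-vectors

signs = np.sign(W)                       # 1 bit per weight, kept outside the codebook
mags = np.abs(W)                         # all-positive vectors to be clustered

kmeans = KMeans(n_clusters=256, n_init=4, random_state=0).fit(mags)
codebook = kmeans.cluster_centers_       # (256, 8) all-positive codewords
codes = kmeans.labels_                   # one 8-bit index per sub-vector

W_hat = signs * codebook[codes]          # decompress: sign bit recombined with codeword
print("MSE:", float(np.mean((W - W_hat) ** 2)))
```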
Poster
Pedro Bassi · Mehmet Yavuz · Ibrahim Ethem Hamamci · Sezgin Er · Xiaoxi Chen · Wenxuan Li · Bjoern Menze · Sergio Decherchi · Andrea Cavalli · Kang Wang · Yang Yang · Alan Yuille · Zongwei Zhou

[ Exhibit Hall I ]

Abstract
With over 85 million CT scans performed annually in the United States, creating tumor-related reports is a challenging and time-consuming task for radiologists. To address this need, we present Rad-GPT, an Anatomy-Aware Vision-Language AI Agent for generating detailed reports from CT scans. Rad-GPT first segments tumors, including benign cysts and malignant tumors, and their surrounding anatomical structures, then transforms this information into both structured reports and narrative reports. These reports provide tumor size, shape, location, attenuation, volume, and interactions with surrounding blood vessels and organs. Extensive evaluation on unseen hospitals shows that Rad-GPT can produce accurate reports, with high sensitivity/specificity for small tumor (<2 cm) detection: 80/73% for liver tumors, 92/78% for kidney tumors, and 77/77% for pancreatic tumors. For large tumors, sensitivity ranges from 89% to 97%. The results significantly surpass the state-of-the-art in abdominal CT report generation. Rad-GPT generated reports for 17 public datasets. Through radiologist review and refinement, we have ensured the reports' accuracy, and created the first publicly available image-text 3D medical dataset, comprising over 1.8 million text tokens and 2.7 million images from 9,262 CT scans, including 2,947 tumor scans/reports of 2,562 tumor instances. Our reports can: (1) localize tumors in eight liver sub-segments and …
Poster
Jiayuan Zhu · Junde Wu · Cheng Ouyang · Konstantinos Kamnitsas · Alison Noble

[ Exhibit Hall I ]

Abstract
Medical image segmentation data inherently contain uncertainty. This can stem from both imperfect image quality and variability in labeling preferences on ambiguous pixels, which depend on annotator expertise and the clinical context of the annotations. For instance, a boundary pixel might be labeled as tumor in diagnosis to avoid under-estimation of severity, but as normal tissue in radiotherapy to prevent damage to sensitive structures. As segmentation preferences vary across downstream applications, it is often desirable for an image segmentation model to offer user-adaptable predictions rather than a fixed output. While prior uncertainty-aware and interactive methods offer adaptability, they are inefficient at test time: uncertainty-aware models require users to choose from numerous similar outputs, while interactive models demand significant user input through click or box prompts to refine segmentation. To address these challenges, we propose **SPA**, a new **S**egmentation **P**reference **A**lignment framework that efficiently adapts to diverse test-time preferences with minimal human interaction. By presenting users with a select few, distinct segmentation candidates that best capture uncertainties, it reduces the user workload to reach the preferred segmentation. To accommodate user preference, we introduce a probabilistic mechanism that leverages user feedback to adapt a model's segmentation preference. The proposed framework is evaluated …
Poster
Kaixiang Yang · Xin Li · Qiang Li · Zhiwei Wang

[ Exhibit Hall I ]

Abstract
Anticipating and recognizing surgical workflows are critical for intelligent surgical assistance systems. However, existing methods rely on deterministic decision-making, struggling to generalize across the large anatomical and procedural variations inherent in real-world surgeries. In this paper, we introduce an innovative framework that incorporates stochastic modeling through a denoising diffusion probabilistic model (DDPM) into conventional deterministic learning for surgical workflow analysis. At the heart of our approach is a collaborative co-training paradigm: the DDPM branch captures procedural uncertainties to enrich feature representations, while the task branch focuses on predicting surgical phases and instrument usage. Theoretically, we demonstrate that this mutual refinement mechanism benefits both branches: the DDPM reduces prediction errors in uncertain scenarios, and the task branch directs the DDPM toward clinically meaningful representations. Notably, the DDPM branch is discarded during inference, enabling real-time predictions without sacrificing accuracy. Experiments on the Cholec80 dataset show that for the anticipation task, our method achieves a 16% reduction in eMAE compared to state-of-the-art approaches, and for phase recognition, it improves the Jaccard score by 1.0%. Additionally, on the AutoLaparo dataset, our method achieves a 1.5% improvement in the Jaccard score for phase recognition, while also exhibiting robust generalization to patient-specific variations. Our code and …
Poster
Jiaqi Liao · Yuwei Niu · Fanqing Meng · Hao Li · Changyao Tian · Yinuo Du · Yuwen Xiong · Dianqi Li · Xizhou Zhu · Li Yuan · Jifeng Dai · Yu Cheng

[ Exhibit Hall I ]

Abstract
Recent years have witnessed remarkable advances in Large Vision-Language Models (LVLMs), which have achieved human-level performance across various complex vision-language tasks. Following LLaVA's paradigm, mainstream LVLMs typically employ a shallow MLP for visual-language alignment through a two-stage training process: pretraining for cross-modal alignment followed by instruction tuning. While this approach has proven effective, the underlying mechanisms of how MLPs bridge the modality gap remain poorly understood. Although some research has explored how LLMs process transformed visual tokens, few studies have investigated the fundamental alignment mechanism. Furthermore, the MLP adapter requires retraining whenever switching LLM backbones. To address these limitations, we first investigate the working principles of MLP adapters and discover that they learn to project visual embeddings into subspaces spanned by corresponding text embeddings progressively. Based on this insight, we propose LangBridge, a novel adapter that explicitly maps visual tokens to linear combinations of LLM vocabulary embeddings. This innovative design enables pretraining-free adapter transfer across different LLMs while maintaining performance. Our experimental results demonstrate that a LangBridge adapter pre-trained on Qwen2-0.5B can be directly applied to larger models such as LLaMA3-8B or Qwen2.5-14B with minimal performance degradation. Overall, LangBridge enables interpretable vision-language alignment by grounding visual semantics in LLM language …
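A small sketch of the stated mapping, with hypothetical shapes: a visual token is projected to logits over the LLM vocabulary, and the adapter output is the softmax-weighted combination of that LLM's vocabulary embeddings, so switching the LLM only swaps the embedding table. This is an illustration, not LangBridge's released code.

```python
# Minimal sketch: visual tokens expressed as linear combinations of LLM vocabulary embeddings.
import torch
import torch.nn as nn

class VocabMixAdapter(nn.Module):
    def __init__(self, vision_dim: int, vocab_size: int):
        super().__init__()
        self.to_vocab = nn.Linear(vision_dim, vocab_size)    # visual token -> vocab logits

    def forward(self, vis_tokens: torch.Tensor, vocab_embed: torch.Tensor) -> torch.Tensor:
        # vis_tokens: (B, N, vision_dim); vocab_embed: (vocab_size, llm_dim) from the target LLM
        mix = torch.softmax(self.to_vocab(vis_tokens), dim=-1)
        return mix @ vocab_embed                             # (B, N, llm_dim)

adapter = VocabMixAdapter(vision_dim=1024, vocab_size=32000)
small_llm_embed = torch.randn(32000, 896)    # e.g. a small LLM's embedding table
large_llm_embed = torch.randn(32000, 4096)   # a larger LLM sharing the same vocabulary
vis = torch.randn(1, 576, 1024)
print(adapter(vis, small_llm_embed).shape, adapter(vis, large_llm_embed).shape)
```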
Poster
Joëlle Hanna · Damian Borth

[ Exhibit Hall I ]

Abstract
Weakly Supervised Semantic Segmentation (WSSS) is a challenging problem that has been extensively studied in recent years. Traditional approaches often rely on external modules like Class Activation Maps to highlight regions of interest and generate pseudo segmentation masks. In this work, we propose an end-to-end method that directly utilizes the attention maps learned by a Vision Transformer (ViT) for WSSS. We propose training a sparse ViT with multiple [CLS] tokens (one for each class), using a random masking strategy to promote [CLS] token - class assignment. At inference time, we aggregate the different self-attention maps of each [CLS] token corresponding to the predicted labels to generate pseudo segmentation masks. Our proposed approach enhances the interpretability of self-attention maps and ensures accurate class assignments. Extensive experiments on two standard benchmarks and three specialized datasets demonstrate that our method generates accurate pseudo-masks, outperforming related works. Those pseudo-masks can be used to train a segmentation model which achieves results comparable to fully-supervised models, significantly reducing the need for fine-grained labeled data.
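A minimal sketch of the inference-time aggregation described above, with random tensors standing in for the per-class [CLS] attention maps; the background threshold and grid size are assumptions, not values from the paper.

```python
# Minimal sketch: aggregate per-class [CLS] attention maps of predicted labels into a pseudo-mask.
import torch

def pseudo_masks(attn: torch.Tensor, predicted: torch.Tensor, grid: int = 14) -> torch.Tensor:
    """attn: (C, N) attention of each class's [CLS] token over N patch tokens.
    predicted: (C,) boolean vector of predicted image-level labels."""
    maps = attn.reshape(-1, grid, grid)
    maps = (maps - maps.amin(dim=(1, 2), keepdim=True)) / (
        maps.amax(dim=(1, 2), keepdim=True) - maps.amin(dim=(1, 2), keepdim=True) + 1e-6)
    maps[~predicted] = 0.0                                  # keep only classes predicted for this image
    background = 0.4 * torch.ones(1, grid, grid)            # assumed fixed background threshold
    return torch.cat([background, maps]).argmax(dim=0)      # 0 = background, c+1 = class c

attn = torch.rand(20, 14 * 14)                              # 20 classes, 14x14 patch grid
predicted = torch.zeros(20, dtype=torch.bool)
predicted[[3, 7]] = True
print(pseudo_masks(attn, predicted).shape)                  # (14, 14) pseudo-mask
```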
Poster
Xinye Cao · Hongcan Guo · Jiawen Qian · Guoshun Nan · Chao Wang · Yuqi Pan · Tianhao Hou · Xiaojuan Wang · Yutong Gao

[ Exhibit Hall I ]

Abstract
Understanding hour-long videos with multi-modal large language models (MM-LLMs) enriches the landscape of human-centered AI applications. However, for end-to-end video understanding with LLMs, uniformly sampling video frames results in LLMs being overwhelmed by a vast amount of irrelevant information as video length increases. Existing hierarchical key frame extraction methods improve the accuracy of video understanding but still face two critical challenges. 1) How can the interference of extensive redundant information in long videos be mitigated? 2) How can a model dynamically adapt to complex hierarchical structures while accurately identifying key frames? To address these issues, we propose VideoMiner, which iteratively segments, captions, and clusters long videos, forming a hierarchical tree structure. The proposed VideoMiner progresses from long videos to events to frames while preserving temporal coherence, effectively addressing the first challenge. To precisely locate key frames, we introduce T-GRPO, a tree-based group relative policy optimization in reinforcement learning method that guides the exploration of the VideoMiner. The proposed T-GRPO is specifically designed for tree structures, integrating spatiotemporal information at the event level while being guided by the question, thus solving the second challenge. We achieve superior performance in all long-video understanding tasks and uncover several interesting insights. Our proposed T-GRPO …
Poster
Samir Khaki · Junxian Guo · Jiaming Tang · Shang Yang · Yukang Chen · Konstantinos Plataniotis · Yao Lu · Song Han · Zhijian Liu

[ Exhibit Hall I ]

Abstract
Vision language models (VLMs) have garnered increasing attention for their ability to integrate visual and textual understanding, with some capable of processing native-resolution images and long videos. While the capacity to process large visual data unlocks numerous downstream applications, it often introduces significant latency challenges, as the visual tokens dominate the resource consumption. In this work, we introduce SparseVILA, a novel method of query-aware token retrieval to dynamically accelerate the underlying LLM, by pruning tokens in the context stage, while attending to a sparse subset of visual tokens during the generation phase. By decoupling the context and generation compression, we can migrate the majority of sparsity into the generation stage, enabling query-aware support for multi-turn conversation while achieving a 1.5$\times$ speedup on image benchmarks. Further, this approach leads to significant accuracy improvements on image-centric benchmarks over previous query-aware/agnostic pruning works. Finally, SparseVILA enables efficient long-context/long-generation tasks by achieving a 6.3$\times$ and 1.7$\times$ speedup in context processing and generation respectively.
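A minimal sketch of query-aware visual-token retrieval (not SparseVILA's implementation): each visual token is scored by its best similarity to the question's tokens and only the top-k are kept for the generation phase.

```python
# Minimal sketch: keep only the visual tokens most relevant to the query.
import torch

def retrieve_visual_tokens(visual: torch.Tensor, query: torch.Tensor, keep: int):
    """visual: (Nv, D); query: (Nq, D); returns (keep, D) visual tokens plus their indices."""
    v = torch.nn.functional.normalize(visual, dim=-1)
    q = torch.nn.functional.normalize(query, dim=-1)
    scores = (v @ q.T).amax(dim=-1)                 # best match of each visual token to any query token
    idx = scores.topk(keep).indices.sort().values   # keep original order for positional coherence
    return visual[idx], idx

visual_tokens = torch.randn(2048, 4096)             # long-context visual tokens
query_tokens = torch.randn(32, 4096)                # tokens of the user's question
kept, idx = retrieve_visual_tokens(visual_tokens, query_tokens, keep=256)
print(kept.shape, idx[:5])
```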
Poster
Federico Girella · Davide Talon · Ziyue Liu · Zanxi Ruan · Yiming Wang · Marco Cristani

[ Exhibit Hall I ]

Abstract
Fashion design is a complex creative process that blends visual and textual expressions. Designers convey ideas through sketches, which define spatial structure and design elements, and textual descriptions, capturing material, texture, and stylistic details. In this paper, we present LOcalized Text and Sketch for fashion image generation (LOTS), an approach for compositional sketch-text based generation of complete fashion outlooks. LOTS leverages a global description with paired localized sketch + text information for conditioning and introduces a novel step-based merging strategy for diffusion adaptation. First, a Modularized Pair-Centric representation encodes sketches and text into a shared latent space while preserving independent localized features; then, a Diffusion Pair Guidance phase integrates both local and global conditioning via attention-based guidance within the diffusion model’s multi-step denoising process. To validate our method, we build on Fashionpedia to release Sketchy, the first fashion dataset where multiple text-sketch pairs are provided per image. Quantitative results show LOTS achieves state-of-the-art image generation performance on both global and localized metrics, while qualitative examples and a human evaluation study highlight its unprecedented level of design customization.
Poster
Ziv Weiss Haddad · Oren Barkan · Yehonatan Elisha · Noam Koenigstein

[ Exhibit Hall I ]

Abstract
Completeness is a widely discussed property in explainability research, requiring that the attributions sum to the model’s response to the input. While completeness intuitively suggests that the model’s prediction is "completely explained" by the attributions, its global formulation alone is insufficient to ensure meaningful explanations. We contend that promoting completeness locally within attribution subregions, in a soft manner, can serve as a standalone guiding principle for producing high quality attributions. To this end, we introduce the concept of the completeness gap as a flexible measure of completeness and propose an optimization procedure that minimizes this gap across subregions within the attribution map. Extensive evaluations across various model architectures demonstrate that our method outperforms state-of-the-art explanation methods on multiple benchmarks.
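A hedged sketch of how such a measure could be written down (an illustrative formalization, not necessarily the paper's exact definition), where $f$ is the model, $x'$ a baseline input, $\phi_i$ the attribution of feature $i$, and $S$ a subregion of the attribution map:

```latex
% Illustrative only: global completeness and a per-subregion completeness gap.
\[
\Big|\sum_{i} \phi_i(x) - \big(f(x) - f(x')\big)\Big| = 0
\quad\text{(global completeness)},
\qquad
\Delta(S) = \Big|\sum_{i \in S} \phi_i(x) - \big(f(x) - f(x_{\setminus S})\big)\Big|,
\]
```

where $x_{\setminus S}$ denotes $x$ with the features in $S$ replaced by the baseline; minimizing $\Delta(S)$ over subregions $S$ would softly promote completeness locally.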
Poster
Yunheng Li · Yuxuan Li · Quan-Sheng Zeng · Wenhai Wang · Qibin Hou · Ming-Ming Cheng

[ Exhibit Hall I ]

Abstract
Pre-trained vision-language models (VLMs), such as CLIP, have demonstrated impressive zero-shot recognition capability, but still underperform in dense prediction tasks. Self-distillation has recently emerged as a promising approach for fine-tuning VLMs to better adapt to local regions without requiring extensive annotations. However, previous state-of-the-art approaches often suffer from significant `foreground bias', where models tend to wrongly identify background regions as foreground objects. To alleviate this issue, we propose DenseVLM, a framework designed to learn unbiased region-language alignment from powerful pre-trained VLM representations. DenseVLM leverages the pre-trained VLM to retrieve categories for unlabeled regions and then decouples the interference between foreground and background features. We show that DenseVLM can directly replace the original VLM in open-vocabulary object detection and image segmentation methods, leading to notable performance improvements. Furthermore, it exhibits promising zero-shot scalability when training on more extensive and diverse datasets. Our code is available in the supplementary materials and will be publicly released.
Poster
Ronglai Zuo · Rolandos Alexandros Potamias · Evangelos Ververas · Jiankang Deng · Stefanos Zafeiriou

[ Exhibit Hall I ]

Abstract
Sign language is a visual language that encompasses all linguistic features of natural languages and serves as the primary communication method for the deaf and hard-of-hearing communities. Although many studies have successfully adapted pretrained language models (LMs) for sign language translation (sign-to-text), the reverse task—sign language generation (text-to-sign)—remains largely unexplored. In this work, we introduce a multilingual sign language model, Signs as Tokens (SOKE), which can generate 3D sign avatars autoregressively from text inputs using a pretrained LM. To align sign language with the LM, we leverage a decoupled tokenizer that discretizes continuous signs into token sequences representing various body parts. During decoding, unlike existing approaches that flatten all part-wise tokens into a single sequence and predict one token at a time, we propose a multi-head decoding method capable of predicting multiple tokens simultaneously. This approach improves inference efficiency while maintaining effective information fusion across different body parts. To further ease the generation process, we propose a retrieval-enhanced SLG approach, which incorporates external sign dictionaries to provide accurate word-level signs as auxiliary conditions, significantly improving the precision of generated signs. Extensive qualitative and quantitative evaluations demonstrate the effectiveness of SOKE. Code, models, and data will be made publicly available.
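A minimal sketch of multi-head decoding with hypothetical part vocabularies: one linear head per body part predicts a token from a shared hidden state at each step, instead of flattening all part-wise tokens into one sequence. Names and sizes are assumptions, not SOKE's configuration.

```python
# Minimal sketch: predict one token per body part simultaneously from a shared hidden state.
import torch
import torch.nn as nn

class MultiHeadDecoder(nn.Module):
    def __init__(self, hidden: int, vocab_sizes: dict[str, int]):
        super().__init__()
        # one output head per body part, e.g. {"body": 512, "left_hand": 512, ...}
        self.heads = nn.ModuleDict({part: nn.Linear(hidden, v) for part, v in vocab_sizes.items()})

    def forward(self, h: torch.Tensor) -> dict[str, torch.Tensor]:
        # h: (B, hidden) hidden state of the language model at the current step
        return {part: head(h).argmax(dim=-1) for part, head in self.heads.items()}

decoder = MultiHeadDecoder(1024, {"body": 512, "left_hand": 512, "right_hand": 512, "face": 256})
step_tokens = decoder(torch.randn(2, 1024))
print({k: v.shape for k, v in step_tokens.items()})   # one token per part, per batch element
```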
Poster
Ruyang Liu · Shangkun Sun · Haoran Tang · Wei Gao · Ge Li

[ Exhibit Hall I ]

Abstract
Long-form video understanding has always been a challenging problem due to the significant redundancy in both temporal and spatial contents. This challenge is further exacerbated by the limited context length of Multimodal Large Language Models (MLLMs). To address this issue, many previous works have attempted to extract key video information, where the "key" is typically semantic-aware and heavily dependent on the CLIP model as prior. In this paper, we propose Flow4Agent, a novel framework that pioneeringly incorporates motion priors from optical flow to facilitate LLM-based long video understanding. Flow4Agent mitigates the redundancy in long videos at both temporal and spatial levels through two core modules: Temporal Granularity Optimization (TGO) adaptively refines frame-level hierarchies, which first leverages coarse flow priors to group similar visual contents and then applies semantic priors to filter out highly irrelevant scene information. Motion Token Pruning (MTP) further refines the intra-frame visual representations, pruning high-redundancy video tokens using fine-grained optical flow information. Extensive experiments demonstrate that our Flow4Agent outperforms existing methods across a wide range of video MLLM benchmarks, especially for hour-level video understanding tasks, achieving 64.7\% on Video-MME, 71.4\% on MLVU and 60.4\% on LongVideoBench.
Poster
Beomyoung Kim · Chanyong Shin · Joonhyun Jeong · Hyungsik Jung · Seyun Lee · Sewhan Chun · Dong-Hyun HWANG · Joonsang Yu

[ Exhibit Hall I ]

Abstract
The recent segmentation foundation model, Segment Anything Model (SAM), exhibits strong zero-shot segmentation capabilities, but it falls short in generating fine-grained precise masks. To address this limitation, we propose a novel zero-shot image matting model, called ZIM, with two key contributions: First, we develop a label converter that transforms segmentation labels into detailed matte labels, constructing the new SA1B-Matte dataset without costly manual annotations. Training SAM with this dataset enables it to generate precise matte masks while maintaining its zero-shot capability. Second, we design the zero-shot matting model equipped with a hierarchical pixel decoder to enhance mask representation, along with a prompt-aware masked attention mechanism to improve performance by enabling the model to focus on regions specified by visual prompts. We evaluate ZIM using the newly introduced MicroMat-3K test set, which contains high-quality micro-level matte labels. Experimental results show that ZIM outperforms existing methods in fine-grained mask generation and zero-shot generalization. Furthermore, we demonstrate the versatility of ZIM in various downstream tasks requiring precise masks, such as image inpainting and 3D segmentation. Our contributions provide a robust foundation for advancing zero-shot matting and its downstream applications across a wide range of computer vision tasks. The code will be available soon.
Poster
Tassilo Wald · Constantin Ulrich · Jonathan Suprijadi · Sebastian Ziegler · Michal Nohel · Robin Peretzke · Gregor Koehler · Klaus Maier-Hein

[ Exhibit Hall I ]

Abstract
The field of self-supervised learning (SSL) for 3D medical images lacks consistency and standardization. While many methods have been developed, it is impossible to identify the current state-of-the-art, due to i) varying and small pre-training datasets, ii) varying architectures, and iii) being evaluated on differing downstream datasets. In this paper we bring clarity to this field and lay the foundation for further method advancements through three key contributions: We a) publish the largest publicly available pre-training dataset comprising 114k brain MRI volumes, enabling all practitioners to pre-train on a large-scale dataset. We b) benchmark existing 3D self-supervised learning methods on this dataset for a state-of-the-art CNN and Transformer architecture, clarifying the state of 3D SSL pre-training. Among many findings, we show that pre-trained methods can exceed a strong from-scratch nnU-Net ResEnc-L baseline. Lastly, we c) publish the code of our pre-training and fine-tuning frameworks and provide the pre-trained models created during the benchmarking process to facilitate rapid adoption and reproduction.
Poster
Chenting Wang · Kunchang Li · Tianxiang Jiang · Xiangyu Zeng · Yi Wang · Limin Wang

[ Exhibit Hall I ]

Abstract
Popular video training methods mainly operate on a fixed number of tokens sampled from a predetermined spatiotemporal grid, resulting in sub-optimal accuracy-computation trade-offs due to inherent video redundancy. They also lack adaptability to varying computational budgets for downstream tasks, hindering the application of the most competitive models in real-world scenarios. We thus propose a new test setting, Token Optimization, which maximizes input information across budgets by optimizing the size-limited set of input tokens through token selection from more suitably sampled videos. To this end, we propose a novel augmentation tool termed Flux. By making the sampling grid flexible and leveraging token selection, it is easily adopted in most popular video training frameworks, boosting model robustness at nearly no additional cost. We integrate Flux into large-scale video pre-training, and the resulting FluxViT establishes new state-of-the-art results across extensive tasks at standard costs. Notably, with only 1/4 of the tokens, it can still match the performance of previous state-of-the-art models under Token Optimization, yielding nearly 90\% savings. The code and models will be publicly released to facilitate future video tasks.
Poster
Ding Zhong · Xu Zheng · Chenfei Liao · Yuanhuiyi Lyu · Jialei Chen · Shengyang Wu · Linfeng Zhang · Xuming Hu

[ Exhibit Hall I ]

Abstract
Segment Anything Model 2 (SAM2) has emerged as a strong base model for various pinhole imaging segmentation tasks. However, when applying it to the $360^\circ$ domain, the significant field-of-view (FoV) gap between pinhole ($70^\circ \times 70^\circ$) and panoramic images ($180^\circ \times 360^\circ$) poses unique challenges. Two major concerns for this application include: 1) inevitable distortion and object deformation brought by the large FoV disparity between domains; 2) the lack of pixel-level semantic understanding, which the original SAM2 cannot provide. To address these issues, we propose a novel $\textbf{OmniSAM}$ framework, which makes the $\textbf{first}$ attempt to apply SAM2 to panoramic semantic segmentation. Specifically, to bridge the first gap, OmniSAM first divides the panorama into sequences of patches. These patches are then treated as image sequences in a similar manner to video segmentation tasks. We then leverage SAM2's memory mechanism to extract cross-patch correspondences that embed cross-FoV dependencies, improving feature continuity and prediction consistency along mask boundaries. For the second gap, OmniSAM fine-tunes the pretrained image encoder and reutilizes the mask decoder for semantic prediction. An FoV-based prototypical adaptation module with a dynamic pseudo-label update mechanism is also introduced to facilitate the alignment of memory and backbone features, thereby improving model generalization …
Poster
Xinyu Yan · Meijun Sun · Ge-Peng Ji · Fahad Khan · Salman Khan · Deng-Ping Fan

[ Exhibit Hall I ]

Abstract
We present LawDIS, a language-window-based controllable dichotomous image segmentation (DIS) framework that produces high-quality object masks. Our framework recasts DIS as an image-conditioned mask generation task within a latent diffusion model, enabling seamless integration of user controls. LawDIS is enhanced with macro-to-micro control modes. Specifically, in macro mode, we introduce a language-controlled segmentation strategy (LS) to generate an initial mask based on user-provided language prompts. In micro mode, a window-controlled refinement strategy (WR) allows flexible refinement of user-defined regions (i.e., size-adjustable windows) within the initial mask. Coordinated by a mode switcher, these modes can operate independently or jointly, making the framework well-suited for high-accuracy, personalised applications. Extensive experiments on the DIS5K benchmark reveal that our LawDIS significantly outperforms 11 cutting-edge methods across all metrics. Notably, compared to the second-best model MVANet, we achieve $F_\beta^\omega$ gains of 4.6% with both the LS and WR strategies and 3.6% gains with only the LS strategy on DIS-TE. Our code will be available.
Poster
Vishwesh Ramanathan · Tony Xu · Pushpak Pati · Faruk Ahmed · Maged Goubran · Anne Martel

[ Exhibit Hall I ]

Abstract
Prediction tasks in digital pathology are challenging due to the massive size of whole-slide images (WSIs) and the weak nature of training signals. Advances in computing, data availability, and self-supervised learning (SSL) have paved the way for slide-level foundation models (SLFMs) that can improve prediction tasks in low-data regimes. However, working with these models is challenging, with issues such as catastrophic forgetting during fine-tuning and under-utilization of shared information between tasks and modalities. To overcome these two challenges, we propose ModalTune, a novel fine-tuning framework which introduces the Modal Adapter to integrate new modalities without modifying SLFM weights. Additionally, we use large-language models (LLMs) to encode labels as text, capturing semantic relationships and enhancing generalization across multiple tasks and cancer types in a single training recipe. ModalTune achieves state-of-the-art (SOTA) results against both uni-modal and multi-modal models across four cancer types, jointly improving survival and cancer subtype prediction while remaining competitive in pan-cancer settings. Additionally, we show ModalTune is highly generalizable to two out-of-distribution (OOD) datasets. To our knowledge, this is the first unified fine-tuning framework for multi-modal, multi-task, and pan-cancer modeling in digital pathology. Code will be shared after blind-review.
Poster
Sihan Yang · Runsen Xu · Chenhang Cui · Tai Wang · Dahua Lin · Jiangmiao Pang

[ Exhibit Hall I ]

Abstract
Large Multimodal Models (LMMs) excel in visual-language tasks by leveraging numerous visual tokens for fine-grained visual information, but this token redundancy results in significant computational costs. Previous research aimed at reducing visual tokens during inference typically leverages importance maps derived from attention scores among vision-only tokens or vision-language tokens to prune tokens across one or multiple pruning stages. Despite this progress, pruning frameworks and strategies remain simplistic and insufficiently explored, often resulting in substantial performance degradation. In this paper, we propose VFlowOpt, a token pruning framework that introduces an importance map derivation process and a progressive pruning module with a recycling mechanism. The hyperparameters of its pruning strategy are further optimized by a visual information flow-guided method. Specifically, we compute an importance map for image tokens based on their attention-derived context relevance and patch-level information entropy. We then decide which tokens to retain or prune and aggregate the pruned ones as recycled tokens to avoid potential information loss. Finally, we apply a visual information flow-guided method that regards the last token in the LMM as the most representative signal of text-visual interactions. This method minimizes the discrepancy between token representations in LMMs with and without pruning, thereby enabling superior pruning …
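As a rough illustration of the importance-map-plus-recycling idea described in this abstract, the sketch below scores image tokens with an attention-derived relevance signal combined with patch-level intensity entropy, keeps the top fraction, and aggregates the pruned tokens into a single recycled token. All names, the scoring formula, and the mean-pooled recycling are assumptions for illustration, not the authors' implementation.

```python
# Hedged sketch, not the paper's code: score tokens by attention-derived
# relevance times patch entropy, keep the top fraction, recycle the rest.
import torch

def patch_entropy(patch_pixels: torch.Tensor, bins: int = 32) -> torch.Tensor:
    """Shannon entropy of pixel intensities per patch; expects values in [0, 1].
    patch_pixels: (N, P) -> (N,)."""
    ent = []
    for p in patch_pixels:
        hist = torch.histc(p, bins=bins, min=0.0, max=1.0)
        prob = hist / hist.sum().clamp_min(1e-8)
        ent.append(-(prob * (prob + 1e-8).log()).sum())
    return torch.stack(ent)

def prune_with_recycling(tokens, attn_to_text, patch_pixels, keep_ratio=0.5):
    """tokens: (N, D) image tokens; attn_to_text: (N,) relevance scores."""
    importance = attn_to_text * patch_entropy(patch_pixels)   # combined score
    k = max(1, int(keep_ratio * tokens.size(0)))
    keep_idx = importance.topk(k).indices
    mask = torch.zeros(tokens.size(0), dtype=torch.bool)
    mask[keep_idx] = True
    kept = tokens[mask]
    # Aggregate pruned tokens into one recycled token to limit information loss.
    recycled = tokens[~mask].mean(dim=0, keepdim=True) if (~mask).any() else tokens[:0]
    return torch.cat([kept, recycled], dim=0)
```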
Poster
Chia-Wen Kuo · Sijie Zhu · Fan Chen · Xiaohui Shen · Longyin Wen

[ Exhibit Hall I ]

Abstract
Large vision-and-language models (LVLMs) have traditionally integrated visual and textual tokens by concatenating them into a single homogeneous input for large language models (LLMs), thereby maximally preserving the pre-trained language capabilities. However, this constrained architecture for visual and textual tokens restricts the design space for processing visual tokens, potentially leading to suboptimal performance and efficiency. In this paper, we propose Decomposed Attention (\method{}), a more flexible attention architecture for LVLMs, which enables modification of visual token operations without affecting textual-to-textual attention. \method{} decomposes the 1-D causal self-attention of LVLMs into visual-to-visual, textual-to-visual, and textual-to-textual attentions, and the visual and textual output tokens from the decomposed attentions are merged with a carefully derived weighting strategy, namely $\alpha$-weighting. Taking advantage of the flexibility, we are able to introduce two critical improvements in visual token processing while maintaining the capacity of pre-trained LLMs: 1) We rectify the biased positional encoding in textual-to-visual attention to boost visual understanding performance. 2) We diagonalize visual-to-visual attention to reduce computation complexity from $\mathcal{O}(|V|^2)$ to $\mathcal{O}(|V|)$ for $|V|$ visual tokens without compromising performance. Extensive experiments and analysis validate the effectiveness of \method{}, demonstrating significant improvements on multiple image benchmarks while significantly reducing computational costs (e.g., $5\times$ faster). Code, data, and models will …
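The sketch below is a minimal, single-head illustration of the decomposition described in this abstract: causal self-attention is split into visual-to-visual, textual-to-visual, and textual-to-textual parts, the visual-to-visual part is diagonalized, and the two textual streams are merged with a scalar $\alpha$ weight. It is an assumption-laden simplification (no multi-head attention, no positional-encoding rectification), not the paper's implementation.

```python
# Hedged single-head sketch of decomposed attention with a diagonalized
# visual-to-visual branch and alpha-weighted textual merge.
import torch
import torch.nn.functional as F

def decomposed_attention(q, k, v, n_visual, alpha=0.5):
    """q, k, v: (T, D); the first n_visual positions are visual tokens."""
    d = q.size(-1)
    qt = q[n_visual:]
    kv, kt = k[:n_visual], k[n_visual:]
    vv, vt = v[:n_visual], v[n_visual:]

    # Diagonalized visual-to-visual attention: each visual token keeps its own
    # value, reducing O(|V|^2) interactions to O(|V|).
    out_visual = vv

    # Textual-to-visual attention (text queries over visual keys/values).
    t2v = F.softmax(qt @ kv.T / d**0.5, dim=-1) @ vv

    # Causal textual-to-textual attention.
    n_text = qt.size(0)
    causal = torch.ones(n_text, n_text).tril().bool()
    logits = (qt @ kt.T / d**0.5).masked_fill(~causal, float("-inf"))
    t2t = F.softmax(logits, dim=-1) @ vt

    # Alpha-weighted merge of the two textual streams.
    out_text = alpha * t2v + (1 - alpha) * t2t
    return torch.cat([out_visual, out_text], dim=0)
```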
Poster
Ting Lei · Shaofeng Yin · Qingchao Chen · Yuxin Peng · Yang Liu

[ Exhibit Hall I ]

Abstract
Open Vocabulary Human-Object Interaction (HOI) detection aims to detect interactions between humans and objects while generalizing to novel interaction classes beyond the training set. Current methods often rely on Vision and Language Models (VLMs) but face challenges due to suboptimal image encoders, as image-level pre-training does not align well with the fine-grained region-level interaction detection required for HOI. Additionally, effectively encoding textual descriptions of visual appearances remains difficult, limiting the model’s ability to capture detailed HOI relationships. To address these issues, we propose Interaction-aware Prompting with Concept Calibration (INP-CC), an end-to-end open-vocabulary HOI detector that integrates interaction-aware prompts and concept calibration. Specifically, we propose an interaction-aware prompt generator that dynamically generates a compact set of prompts based on the input scene, enabling selective sharing among similar interactions. This approach directs the model’s attention to key interaction patterns rather than generic image-level semantics, enhancing HOI detection. Furthermore, we refine HOI concept representations through language model-guided calibration, which helps distinguish diverse HOI concepts by leveraging structured semantic knowledge. A negative sampling strategy is also employed to improve inter-modal similarity modeling, enabling the model to better differentiate visually similar but semantically distinct actions. Extensive experimental results demonstrate that INP-CC significantly outperforms state-of-the-art models on the SWIG-HOI and …
Poster
Kai Huang · hao zou · Bochen Wang · Xi Ye · Zhen Xie · Hao Wang

[ Exhibit Hall I ]

Abstract
Recent advancements in Large Visual Language Models (LVLMs) have gained significant attention due to their remarkable reasoning capabilities and proficiency in generalization. However, processing a large number of visual tokens and generating long-context outputs impose substantial computational overhead, leading to excessive demands for key-value (KV) cache. To address this critical bottleneck, we propose AirCache, a novel KV cache compression method aimed at accelerating LVLMs inference. This work systematically investigates the correlations between visual and textual tokens within the attention mechanisms of LVLMs. Our empirical analysis reveals considerable redundancy in cached visual tokens, wherein strategically eliminating these tokens preserves model performance while significantly accelerating context generation. Inspired by these findings, we introduce an elite observation window for assessing the importance of visual components in the KV cache, focusing on stable inter-modal relevancy modeling with enhanced multi-perspective consistency. Additionally, we develop an adaptive layer-wise budget allocation strategy that capitalizes on the strength and skewness of token importance distribution, showcasing superior efficiency compared to uniform allocation. Comprehensive evaluations across multiple LVLMs and benchmarks demonstrate that our method achieves comparable performance to the full cache while retaining only 10% of visual KV cache, thereby reducing decoding latency by 29% to 66% across various batch …
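A hedged sketch of the cache-compression recipe outlined above: visual KV entries are scored by the attention they receive from a small observation window of recent queries, a per-layer budget is allocated in proportion to the skewness of each layer's score distribution, and only the top-scoring entries are kept. The tensor layouts and the skewness-based allocation rule are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch: observation-window importance scoring, skewness-based
# layer budgets, and per-layer KV-cache truncation for visual entries.
import torch

def visual_importance(attn: torch.Tensor, n_visual: int, window: int = 8):
    """attn: (heads, T_query, T_key) attention of one layer. Average the last
    `window` queries' attention over heads, restricted to visual keys."""
    obs = attn[:, -window:, :n_visual]            # observation window
    return obs.mean(dim=(0, 1))                   # (n_visual,) importance scores

def layerwise_budgets(importances, total_keep: int):
    """importances: list of (n_visual,) tensors, one per layer. Allocate the
    total kept-token budget proportionally to each layer's score skewness."""
    def skew(x):
        x = (x - x.mean()) / (x.std() + 1e-8)
        return (x**3).mean().abs() + 1e-8
    weights = torch.stack([skew(imp) for imp in importances])
    weights = weights / weights.sum()
    return [max(1, int(round(w.item() * total_keep))) for w in weights]

def compress_layer(k_cache, v_cache, scores, budget):
    """Keep the `budget` most important visual entries of one layer's cache.
    k_cache, v_cache assumed shaped (heads, T, head_dim)."""
    budget = min(budget, scores.numel())
    keep = scores.topk(budget).indices.sort().values
    return k_cache[:, keep], v_cache[:, keep]
```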
Poster
Arsha Nagrani · Sachit Menon · Ahmet Iscen · Shyamal Buch · Nilpa Jha · Ramin Mehran · Anja Hauth · Mikhail Sirotenko · Yukun Zhu · Carl Vondrick · Cordelia Schmid · Tobias Weyand

[ Exhibit Hall I ]

Abstract
Multimodal LLMs are turning their focus to video benchmarks; however, most video benchmarks provide only outcome supervision, with no intermediate or interpretable reasoning steps. This makes it challenging to assess whether models are truly able to combine perceptual and temporal information to reason about videos, or simply get the correct answer by chance or by exploiting linguistic biases. To remedy this, we provide a new video reasoning dataset called MINERVA for modern multimodal models. Each question in the dataset comes with 5 answer choices, as well as detailed, hand-crafted reasoning traces. Our dataset is multimodal, diverse in terms of video domain and length, and consists of complex multi-step questions. Extensive benchmarking shows that our dataset provides a challenge for frontier open-source and proprietary models. We perform fine-grained error analysis to identify common failure modes across various models, and create a taxonomy of reasoning errors. We use this to explore both human and LLM-as-a-judge methods for scoring video reasoning traces, and find that failure modes are primarily related to temporal localization, followed by visual perception errors, as opposed to logical or completeness errors. The dataset, along with questions, answer candidates and reasoning traces will be released publicly.
Poster
Fu Rong · Meng Lan · Qian Zhang · Lefei Zhang

[ Exhibit Hall I ]

Abstract
Referring video object segmentation (RVOS) aims to segment objects in a video according to textual descriptions, which requires the integration of multimodal information and temporal dynamics perception. The Segment Anything Model 2 (SAM 2) has shown great effectiveness across various video segmentation tasks. However, its application to offline RVOS is challenged by the translation of the text into effective prompts and a lack of global context awareness. In this paper, we propose a novel RVOS framework, termed MPG-SAM 2, to address these challenges. Specifically, MPG-SAM 2 employs a multimodal encoder to jointly encode video and textual features, generating semantically aligned video and text embeddings along with multimodal class tokens. A mask prior generator is devised to utilize the video embeddings and class tokens to create pseudo masks of target objects and global context. These masks are fed into the prompt encoder as dense prompts, along with multimodal class tokens as sparse prompts to generate accurate prompts for SAM 2. To provide the online SAM 2 with a global view, we propose a hierarchical global-historical aggregator, which allows SAM 2 to aggregate global and historical information of target objects at both pixel and object levels, enhancing the target representation and temporal …
Poster
Jeongseok Hyun · Sukjun Hwang · Su Ho Han · Taeoh Kim · Inwoong Lee · Dongyoon Wee · Joon-Young Lee · Seon Joo Kim · Minho Shim

[ Exhibit Hall I ]

Abstract
Video large language models (LLMs) have achieved good video understanding performance by utilizing a large number of tokens in spatio-temporal space. However, the quadratic growth of the computational complexity associated with the number of tokens remains a critical challenge. To address this, we propose a novel spatio-temporal token merging (STTM) method designed to enhance token efficiency in video LLMs. Our key insight is to leverage the inherent spatial and temporal local redundancy in video data, which has been overlooked in previous research. Specifically, we transform individual frames into multi-granular spatial tokens via a coarse-to-fine search algorithm based on the quadtree data structure. Subsequently, we perform multi-granular directed pairwise merging in the temporal dimension. This decomposed merging approach significantly reduces redundant visual tokens across the spatio-temporal dimensions. Experiments on multiple video QA benchmarks show that our approach outperforms existing token reduction methods in accuracy. Surprisingly, our approach maintains above 99\% relative accuracy compared to models using full tokens with only 50\% of the token budget. This token reduction also translates to lower inference latency.
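The following sketch illustrates the quadtree-style coarse-to-fine spatial step described above under simplifying assumptions: if all patch tokens in a square block are mutually similar, the block collapses to its mean token; otherwise the block is split into four sub-blocks and the test recurses. The similarity threshold and mean-pooling choice are illustrative, not taken from the paper.

```python
# Hedged sketch of quadtree-based coarse-to-fine spatial token merging.
import torch
import torch.nn.functional as F

def quadtree_merge(tokens: torch.Tensor, sim_thresh: float = 0.9):
    """tokens: (H, W, D) grid of patch tokens for one frame; H == W, power of 2.
    Returns a list of merged token vectors (multi-granular spatial tokens)."""
    H, W, _ = tokens.shape
    flat = tokens.reshape(H * W, -1)
    sims = F.cosine_similarity(flat.unsqueeze(0), flat.unsqueeze(1), dim=-1)
    if sims.min() >= sim_thresh or H == 1:
        return [flat.mean(dim=0)]               # homogeneous block: one coarse token
    h = H // 2
    merged = []
    for bi in (slice(0, h), slice(h, H)):       # recurse into the four quadrants
        for bj in (slice(0, h), slice(h, W)):
            merged += quadtree_merge(tokens[bi, bj], sim_thresh)
    return merged
```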
Poster
Qi Chen · Xinze Zhou · Chen Liu · Hao Chen · Wenxuan Li · Zekun Jiang · Ziyan Huang · Yuxuan Zhao · Dexin Yu · Junjun He · Yefeng Zheng · Ling Shao · Alan Yuille · Zongwei Zhou

[ Exhibit Hall I ]

Abstract
AI development for tumor segmentation is challenged by the scarcity of large, annotated datasets, due to the intensive annotation effort and required medical expertise. Analyzing a proprietary dataset of 3,000 per-voxel annotated pancreatic tumor scans, we discovered that beyond 1,500 scans, AI performance plateaus despite more data. We further incorporated synthetic data, showing that AI could reach the plateaus with only 500 real scans. This indicates that synthetic augmentation steepens the scaling laws, enhancing AI performance more efficiently than real data alone.Motivated by these lessons, we created CancerVerse---a dataset of 10,136 CT scans with a total of 10,260 tumor instances per-voxel manually annotated in six organs (pancreas, liver, kidney, colon, esophagus, uterus) and 5,279 control scans. This monumental effort by eight expert radiologists offers a dataset scale that surpasses existing public tumor datasets by several orders of magnitude. While we continue to expand the scale of data and annotations, we believe that the current CancerVerse can already provide a solid foundation---based on our lessons from the proprietary dataset---to enable AI to segment tumors in these six organs, offering significant improvements in both in-distribution (+7% DSC) and out-of-distribution (+16% DSC) evaluations over those trained on current public datasets. More importantly, AI …
Poster
Ziyang Luo · Nian Liu · Xuguang Yang · Salman Khan · Rao Anwer · Hisham Cholakkal · Fahad Khan · Junwei Han

[ Exhibit Hall I ]

Abstract
Audio-Visual Segmentation (AVS) faces a fundamental challenge of effectively aligning audio and visual modalities. While recent approaches leverage foundation models to address data scarcity, they often rely on single-modality knowledge or combine foundation models in an off-the-shelf manner, failing to address the cross-modal alignment challenge. In this paper, we present TAViS, a novel framework that couples the knowledge of multimodal foundation models (ImageBind) for cross-modal alignment and a segmentation foundation model (SAM2) for precise segmentation. However, effectively combining these models poses two key challenges: the difficulty in transferring the knowledge between SAM2 and ImageBind due to their different feature spaces, and the insufficiency of using only segmentation loss for supervision. To address these challenges, we introduce a text-bridged design with two key components: (1) a text-bridged hybrid prompting mechanism where pseudo text provides class prototype information while retaining modality-specific details from both audio and visual inputs, and (2) an alignment supervision strategy that leverages text as a bridge to align shared semantic concepts within audio-visual modalities. Our approach achieves superior performance on single-source, multi-source, semantic datasets, and excels in zero-shot settings.
Poster
Young Seok Jeon · Hongfei Yang · Huazhu Fu

[ Exhibit Hall I ]

Abstract
Imposing key anatomical features, such as the number of organs, their shapes and relative positions, is crucial for building a robust multi-organ segmentation model. Current attempts to incorporate anatomical features include broadening the effective receptive field (ERF) size with data-intensive modules, or introducing anatomical constraints that scale poorly to multi-organ segmentation. We introduce a novel architecture called the Anatomy-Informed Cascaded Segmentation Network (AIC-Net). AIC-Net incorporates a learnable input termed "Anatomical Prior", which can be adapted to patient-specific anatomy using a differentiable spatial deformation. The deformed prior later guides decoder layers towards more anatomy-informed predictions. We repeat this process at a local patch level to enhance the representation of intricate objects, resulting in a cascaded network structure. AIC-Net is a general method that enhances any existing segmentation model to be more anatomy-aware. We have validated the performance of AIC-Net, with various backbones, on three multi-organ segmentation tasks: abdominal organs, vertebrae, and ribs. For each respective task, our benchmarks demonstrate improved Dice scores and Hausdorff distances.
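As a minimal 2-D sketch of the "learnable anatomical prior with differentiable spatial deformation" idea above, the module below warps a learnable prior map with a predicted displacement field via grid sampling so it can be concatenated to decoder features. Module and variable names are hypothetical; the paper's cascaded, patch-level variant is not reproduced.

```python
# Hedged sketch: learnable prior map warped by a predicted displacement field.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformablePrior(nn.Module):
    def __init__(self, n_classes: int, size: int = 64):
        super().__init__()
        # Learnable anatomical prior, one channel per organ/class.
        self.prior = nn.Parameter(torch.zeros(1, n_classes, size, size))

    def forward(self, displacement: torch.Tensor) -> torch.Tensor:
        """displacement: (B, 2, H, W) patient-specific field predicted upstream."""
        B, _, H, W = displacement.shape
        # Identity sampling grid in [-1, 1], then add the predicted displacement.
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, H, device=displacement.device),
            torch.linspace(-1, 1, W, device=displacement.device),
            indexing="ij",
        )
        grid = torch.stack([xs, ys], dim=-1).unsqueeze(0).expand(B, -1, -1, -1)
        grid = grid + displacement.permute(0, 2, 3, 1)
        prior = F.interpolate(self.prior, size=(H, W), mode="bilinear", align_corners=False)
        warped = F.grid_sample(prior.expand(B, -1, -1, -1), grid, align_corners=False)
        return warped  # concatenate to decoder features to guide predictions
```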
Poster
Jiannan Ge · Lingxi Xie · Hongtao Xie · Pandeng Li · Sun-Ao Liu · XIAOPENG ZHANG · Qi Tian · Yongdong Zhang

[ Exhibit Hall I ]

Abstract
In recent years, Open-Vocabulary Semantic Segmentation (OVSS) has been largely advanced. However, existing methods mostly rely on a pre-trained vision-language model (e.g., CLIP) and require a predefined set of classes to guide the semantic segmentation process during the inference. This not only narrows the application scenario but also constrains comprehension within a finite vocabulary. To overcome this, we reformulate OVSS as a text generation task and propose the CLIP-adapted Region-to-Text Network (CRTNet) that achieves vocabulary-free OVSS by generating category names and descriptions upon segmentation masks. The training process consists of two steps to ensure an accurate and detailed interpretation of the masked regions: (i) the initial step adapts CLIP visual features to mask-level proposal features using binarized masks extracted by a trained mask extractor, and (ii) the subsequent step involves aggregating these features to become text-aware by integrating CLIP text embeddings, effectively aligning visual data with corresponding linguistic data to facilitate region-to-text learning. Furthermore, we introduce a series of parsing and filtering techniques to integrate multiple sources of training data to improve the generalization ability of our model. Experiments demonstrate that our model not only excels in OVSS but also exhibits scalability and can be adapted to various foundation models …
Poster
Danhui Chen · Ziquan Liu · Chuxi Yang · Dan Wang · Yan Yan · Yi Xu · Xiangyang Ji

[ Exhibit Hall I ]

Abstract
Pixel-level vision tasks, such as semantic segmentation, require extensive and high-quality annotated data, which is costly to obtain. Semi-supervised semantic segmentation (SSSS) has emerged as a solution to alleviate the labeling burden by leveraging both labeled and unlabeled data through self-training techniques. Meanwhile, the advent of foundational segmentation models pre-trained on massive data has shown the potential to generalize across domains effectively. This work explores whether a foundational segmentation model can address label scarcity in pixel-level vision tasks as an annotator for unlabeled images. Specifically, we investigate the efficacy of using SEEM, a Segment Anything Model (SAM) variant fine-tuned for textual input, to generate predictive masks for unlabeled data. To address the shortcomings of using SEEM-generated masks as supervision, we propose ConformalSAM, a novel SSSS framework which first calibrates the foundation model using the target domain's labeled data and then filters out unreliable pixel labels of unlabeled data so that only high-confidence labels are used as supervision. By leveraging conformal prediction (CP) to adapt foundation models to target data through uncertainty calibration, ConformalSAM exploits the strong capability of the foundational segmentation model reliably, which benefits the early-stage learning, while a subsequent self-reliance training strategy mitigates overfitting to SEEM-generated masks …
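A minimal sketch of the conformal-prediction filtering step described above, under the assumption that reliability is measured by softmax scores: a threshold is calibrated on the target domain's labeled pixels and then used to mask out unreliable pseudo-labels with an ignore index. This is illustrative, not the released ConformalSAM code.

```python
# Hedged sketch: conformal calibration of a score threshold, then pseudo-label
# filtering for unlabeled pixels (requires NumPy >= 1.22 for method="higher").
import numpy as np

def calibrate_threshold(cal_probs: np.ndarray, cal_labels: np.ndarray, alpha: float = 0.1):
    """cal_probs: (N, C) softmax outputs on labeled calibration pixels,
    cal_labels: (N,) ground-truth classes. Returns a conformal score threshold."""
    # Nonconformity score: 1 - probability assigned to the true class.
    scores = 1.0 - cal_probs[np.arange(len(cal_labels)), cal_labels]
    n = len(scores)
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    return np.quantile(scores, level, method="higher")

def filter_pseudo_labels(probs: np.ndarray, threshold: float):
    """probs: (M, C) softmax outputs on unlabeled pixels. Returns pseudo-labels
    with -1 marking pixels too unreliable to use as supervision."""
    scores = 1.0 - probs.max(axis=1)
    labels = probs.argmax(axis=1)
    labels[scores > threshold] = -1   # ignore index for the segmentation loss
    return labels
```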
Poster
Zhang Li · Biao Yang · Qiang Liu · Shuo Zhang · Zhiyin Ma · Liang Yin · Deng Linger · Yabo Sun · Yuliang Liu · Xiang Bai

[ Exhibit Hall I ]

Abstract
While large multi-modal models (LMMs) demonstrate promising capabilities in segmentation and comprehension, they still struggle with two limitations: inaccurate segmentation and hallucinated comprehension. These challenges stem primarily from constraints in weak visual comprehension and a lack of fine-grained perception. To alleviate these limitations, we propose LIRA, a framework that capitalizes on the complementary relationship between visual comprehension and segmentation via two key components: (1) Semantic-Enhanced Feature Extractor (SEFE) improves object attribute inference by fusing semantic and pixel-level features, leading to more accurate segmentation; (2) Interleaved Local Visual Coupling (ILVC) autoregressively generates local descriptions after extracting local features based on segmentation masks, offering fine-grained supervision to mitigate hallucinations. Furthermore, we find that the precision of object segmentation is positively correlated with the latent related semantics of the <seg> token. To quantify this relationship and the model's potential semantic inferring ability, we introduce the Attributes Evaluation (AttrEval) dataset. Our experiments show that LIRA achieves state-of-the-art performance in both segmentation and comprehension tasks.
Poster
Sijie Li · Chen Chen · Jungong Han

[ Exhibit Hall I ]

Abstract
In this paper, we propose SimMLM, a simple yet powerful framework for multimodal learning with missing modalities. Unlike existing approaches that rely on sophisticated network architectures or complex data imputation techniques, SimMLM provides a generic and effective solution that can adapt to various missing modality scenarios with improved accuracy and robustness. Specifically, SimMLM consists of a generic Dynamic Mixture of Modality Experts (DMoME) architecture, featuring a dynamic, learnable gating mechanism that automatically adjusts each modality’s contribution in both full and partial modality settings. A key innovation of SimMLM is the proposed More vs. Fewer (MoFe) ranking loss, which ensures that task accuracy improves or remains stable as more modalities are made available. This aligns the model with an intuitive principle: removing one or more modalities should not increase accuracy. We validate SimMLM on multimodal medical image segmentation (BraTS 2018) and multimodal classification (UPMC Food-101, avMNIST) tasks, where it consistently surpasses competitive methods, demonstrating superior accuracy, interpretability, robustness, and reliability across both complete and missing modality scenarios at test time.
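The snippet below sketches one way to realize the "More vs. Fewer" principle stated above as a hinge-style ranking penalty: the per-sample task loss computed with more modalities should not exceed the loss computed with fewer. This is an illustrative formulation under assumed names, not the paper's exact loss.

```python
# Hedged sketch of a "more vs. fewer modalities" ranking penalty.
import torch

def mofe_ranking_loss(loss_more: torch.Tensor, loss_fewer: torch.Tensor, margin: float = 0.0):
    """loss_more / loss_fewer: per-sample task losses from the same network fed
    a superset vs. a subset of modalities. Penalize loss_more > loss_fewer."""
    return torch.clamp(loss_more - loss_fewer + margin, min=0.0).mean()

# Illustrative use inside a training step (model/criterion names assumed):
#   loss_full    = criterion(model(x_img, x_text), y)   # all modalities
#   loss_partial = criterion(model(x_img, None), y)     # one modality dropped
#   loss = loss_full + loss_partial + mofe_ranking_loss(loss_full, loss_partial)
```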
Poster
Zhixi Cai · Fucai Ke · Simindokht Jahangard · Maria Garcia de la Banda · Gholamreza Haffari · Peter Stuckey · Hamid Rezatofighi

[ Exhibit Hall I ]

Abstract
Visual Grounding (VG) tasks, such as referring expression detection and segmentation, are important for linking visual entities to context, especially in complex reasoning tasks that require detailed query interpretation. This paper explores VG beyond basic perception, highlighting challenges for methods that require reasoning akin to human cognition. Recent advances in large language models (LLMs) and vision-language models (VLMs) have improved abilities for visual comprehension, contextual understanding, and reasoning. These methods are mainly split into end-to-end and compositional methods, with the latter offering more flexibility. Compositional approaches that integrate LLMs and foundation models show promising performance but still struggle with complex reasoning over language-based logical representations. To address these limitations, we propose NAVER, a compositional visual grounding method that integrates explicit probabilistic logic reasoning within a finite-state automaton, equipped with a self-correcting mechanism. This design improves robustness and interpretability in inference through explicit logic reasoning. Our results show that NAVER achieves SoTA performance compared to recent end-to-end and compositional baselines.
Poster
Hui Sun · Shiyin Lu · Huanyu Wang · Qing-Guo Chen · Zhao Xu · Weihua Luo · Kaifu Zhang · Ming Li

[ Exhibit Hall I ]

Abstract
Video large language models (Video-LLMs) have made significant progress in understanding videos. However, processing multiple frames leads to lengthy visual token sequences, presenting challenges: the limited context length cannot accommodate the entire video, and the inclusion of irrelevant frames hinders visual perception. Hence, effective frame selection is crucial. This paper emphasizes that frame selection should follow three key principles: query relevance, list-wise diversity, and sequentiality. Existing methods, such as uniform frame sampling and query-frame matching, do not capture all of these principles. Thus, we propose Markov decision determinantal point process with dynamic programming (MDP$^3$) for frame selection, a training-free and model-agnostic method that can be seamlessly integrated into existing Video-LLMs. Our method first estimates frame similarities conditioned on the query using a conditional Gaussian kernel within the reproducing kernel Hilbert space (RKHS). We then apply the determinantal point process (DPP) to the similarity matrix to capture both query relevance and list-wise diversity. To incorporate sequentiality, we segment the video and apply DPP within each segment, conditioned on the preceding segment selection, modeled as a Markov decision process (MDP) for allocating selection sizes across segments. Theoretically, MDP$^3$ provides a $(1-1/e)$-approximate solution to the NP-hard list-wise frame selection problem with …
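As a rough illustration of the selection recipe above, the sketch below builds a query-conditioned Gaussian kernel over (assumed L2-normalized) frame features and greedily selects a diverse, query-relevant subset with a standard greedy DPP MAP approximation; the segment-wise MDP allocation is omitted. Function names and the exact kernel form are assumptions for illustration.

```python
# Hedged sketch: query-conditioned Gaussian kernel + greedy DPP MAP selection.
import numpy as np

def conditional_kernel(frame_feats, query_feat, sigma=1.0):
    """frame_feats: (N, D), query_feat: (D,), both L2-normalized.
    Kernel L_ij = q_i * k(f_i, f_j) * q_j couples relevance and similarity."""
    rel = np.exp(frame_feats @ query_feat)                     # query relevance
    d2 = ((frame_feats[:, None] - frame_feats[None, :]) ** 2).sum(-1)
    k = np.exp(-d2 / (2 * sigma ** 2))                         # Gaussian similarity
    return rel[:, None] * k * rel[None, :]

def greedy_dpp(L, k):
    """Greedily pick k items maximizing log det of the kernel submatrix."""
    selected = []
    for _ in range(k):
        best, best_gain = None, -np.inf
        for i in range(len(L)):
            if i in selected:
                continue
            idx = selected + [i]
            sign, logdet = np.linalg.slogdet(L[np.ix_(idx, idx)])
            if sign > 0 and logdet > best_gain:       # skip non-PSD submatrices
                best, best_gain = i, logdet
        if best is None:
            break
        selected.append(best)
    return sorted(selected)   # keep temporal order of the chosen frames
```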
Poster
Hongqiu Wang · Wu Chen · Xiangde Luo · Zhaohu Xing · Lihao Liu · Jing Qin · Shaozhi Wu · Lei Zhu

[ Exhibit Hall I ]

Abstract
Fairness in AI-assisted medical image analysis is crucial for equitable healthcare, but is often neglected, especially in cross-domain scenarios (diverse patient demographics and imaging protocols) that are prevalent in medical applications. Effective and equitable deployment of AI models in these scenarios is critical, yet traditional Unsupervised Domain Adaptation (UDA) methods exhibit limited improvements. Emerging Active Domain Adaptation (ADA) approaches offer more effective enhancements, but all ignore fairness issues, exacerbating biased outcomes. Therefore, in this work, we propose the first fairness-aware ADA paradigm that simultaneously achieves both enhanced fairness and superior overall performance. Our method leverages the multimodal alignment capability of Vision-Language Models (VLMs): by jointly learning from medical images (vision) and sensitive attributes (language), the VLM inherently captures semantic correlations between visual features and protected attributes, enabling explicit attribute representation. Building on this foundation, we further devise an attribute-aware strategy (FairAP), which dynamically adapts to diverse patient demographics to promote equitable and high-quality outcomes by considering both Attribute and Polysemy. Extensive experiments on the FairDomain benchmark demonstrate that our method significantly reduces bias and maintains state-of-the-art performance in segmentation tasks, outperforming existing UDA and ADA methods. This work pioneers a VLM-driven ADA paradigm for fair cross-domain medical segmentation, offering a blueprint for …
Poster
Maoxian Wan · Kaige Li · Qichuan Geng · Weimin Shi · Zhong Zhou

[ Exhibit Hall I ]

Abstract
Existing incremental few-shot semantic segmentation (IFSS) methods often learn novel classes by fine-tuning parameters from previous stages. This inevitably reduces the distinguishability of old class features, leading to catastrophic forgetting and overfitting to limited new samples. In this paper, we propose a novel prompt-based IFSS method with a visual prompt pool to store and switch multi-granular knowledge across stages, enhancing the model's ability to learn new classes. Specifically, we introduce three levels of prompts: 1) Task-persistent prompts: capturing generalizable knowledge shared across stages, such as foreground-background distributions, to ensure consistent recognition guidance; 2) Stage-specific prompts: adapting to the unique requirements of each stage by integrating its discriminative knowledge (e.g., shape differences) with common knowledge from previous stages; and 3) Region-unique prompts: encoding category-specific structures (e.g., edges) to more accurately guide the model to retain local details. In particular, we introduce a prompt switching mechanism that adaptively allocates the knowledge required for base and new classes, avoiding interference between prompts, preventing catastrophic forgetting, and reducing the growing computational cost. Our method achieves a new state-of-the-art performance, outperforming previous SoTA methods by 30.28\% mIoU-N on VOC and 13.90\% mIoU-N on COCO under the 1-shot setting.
Poster
Jiale Zhou · Wenhan Wang · Shikun Li · Xiaolei Qu · Xin Guo · Yizhong Liu · Wenzhong Tang · Xun Lin · Yefeng Zheng

[ Exhibit Hall I ]

Abstract
Tubular structure segmentation (TSS) is important for various applications, such as hemodynamic analysis and route navigation. Despite significant progress in TSS, domain shifts remain a major challenge, leading to performance degradation in unseen target domains. Unlike other segmentation tasks, TSS is more sensitive to domain shifts, as changes in topological structures can compromise segmentation integrity, and variations in local features distinguishing foreground from background (e.g., texture and contrast) may further disrupt topological continuity. To address these challenges, we propose Topology-enhanced Test-Time Adaptation (TopoTTA), the first test-time adaptation framework designed specifically for TSS. TopoTTA consists of two stages: Stage 1 adapts models to cross-domain topological discrepancies using the proposed Topological Meta Difference Convolutions (TopoMDCs), which enhance topological representation without altering pre-trained parameters; Stage 2 improves topological continuity by a novel Topology Hard sample Generation (TopoHG) strategy and prediction alignment on hard samples with pseudo-labels in the generated pseudo-break regions. Extensive experiments across four scenarios and ten datasets demonstrate TopoTTA's effectiveness in handling topological distribution shifts, achieving an average improvement of 31.81% in clDice. TopoTTA also serves as a plug-and-play TTA solution for CNN-based TSS models.
Poster
Rongchang Xie · Chen Du · Ping Song · Chang Liu

[ Exhibit Hall I ]

Abstract
We introduce MUSE-VL, a Unified Vision-Language Model through Semantic discrete Encoding for multimodal understanding and generation. Recently, the research community has begun exploring unified models for visual generation and understanding. However, existing vision tokenizers (e.g., VQGAN) only consider low-level information, which makes it difficult to align with language tokens. This results in high training complexity and necessitates a large amount of training data to achieve optimal performance. Additionally, their performance is still far from dedicated understanding models. This paper proposes Semantic Discrete Encoding (SDE), which effectively aligns the information of visual tokens and language tokens by adding semantic constraints to the visual tokenizer. This greatly reduces the amount of training data and improves the performance of the unified model. With the same LLM size, our method improves understanding performance by 4.8\% compared to the previous SOTA Emu3 and surpasses the dedicated understanding model LLaVA-NeXT 34B by 3.7\%. For visual generation, our model achieves a FID score of 7.73 on MJHQ-30k, surpassing the existing unified models.
Poster
Yitong Jiang · Jinwei Gu · Tianfan Xue · Ka Chun Cheung · Pavlo Molchanov · Hongxu Yin · Sifei Liu

[ Exhibit Hall I ]

Abstract
Vision-Language Models (VLMs) excel at visual understanding by leveraging pretrained image encoders to generate visual tokens. However, they struggle with high-resolution images and zoomed-in regions due to the computational burden and token redundancy of uniform patch-based processing, often leading to the loss of critical details. To address these challenges, we propose Token-Efficient Vision Language Model (TEVA), a novel framework that detects key regions and applies dynamic patch sampling to efficiently capture fine-grained details while preserving global context. Our approach first identifies subject-oriented regions using an adaptive detection strategy. Then, a dynamic patch sampling mechanism selects and arranges patches at varying scales, ensuring efficient processing without increasing token count. Extensive experiments demonstrate that Token-Efficient Vision Language Model (TEVA) significantly enhances VLM performance in handling visual details, seamlessly integrating with various decoders and LLMs. Code and dataset will be released upon publication.
Poster
YASSER ABDELAZIZ DAHOU DJILALI · Ngoc Huynh · Phúc Lê Khắc · Wamiq Para · Ankit Singh · Sanath Narayan

[ Exhibit Hall I ]

Abstract
We present Saliency Benchmark (SalBench), a novel benchmark designed to assess the capability of Large Vision-Language Models (LVLMs) in detecting visually salient features that are readily apparent to humans, such as a large circle amidst a grid of smaller ones. This benchmark focuses on low-level features including color, intensity, and orientation, which are fundamental to human visual processing. Our SalBench consists of images that highlight rare, unusual, or unexpected elements within scenes, and naturally draw human attention. It comprises three novel tasks for evaluating the perceptual capabilities of LVLMs: Odd-One-Out Detection, Referring Odd-One-Out, and Visual Referring Odd-One-Out. We perform a comprehensive evaluation of state-of-the-art LVLMs using SalBench, and our findings reveal a surprising limitation: LVLMs struggle to identify seemingly obvious visual anomalies, with even the advanced GPT-4o achieving only 47.6\% accuracy on such a simple task. SalBench will be an important step in measuring the capabilities of LVLMs that align with the subtle definition of human attention. The project is available: https://github.com/salbench/salbench.
Poster
Yuxuan Wang · Yiqi Song · Cihang Xie · Yang Liu · Zilong Zheng

[ Exhibit Hall I ]

Abstract
Recent advancements in large-scale video-language models have shown significant potential for real-time planning and detailed interactions. However, their high computational demands and the scarcity of annotated datasets limit their practicality for academic researchers. In this work, we introduce VideoLLaMB, a novel framework that utilizes temporal memory tokens within bridge layers to allow for the encoding of entire video sequences alongside historical visual data, effectively preserving semantic continuity and enhancing model performance across various tasks. This approach includes recurrent memory tokens and a SceneTilling algorithm, which segments videos into independent semantic units to preserve semantic integrity. Empirically, VideoLLaMB significantly outstrips existing video-language models, demonstrating a $4.2$-point improvement over its competitors across four VideoQA benchmarks, and a $2.06$-point improvement on egocentric planning. Remarkably, it maintains robust performance, on par with PLLaVA, even as video length increases up to $8\times$. Besides, the frame retrieval results on our specialized \textbf{Needle in a Video Haystack (NIAVH)} benchmark further validate VideoLLaMB's prowess in accurately identifying specific frames within lengthy videos. Our SceneTilling algorithm also enables the generation of streaming video captions directly, without necessitating additional training. In terms of efficiency, VideoLLaMB, trained on 16 frames, supports up to 320 frames on a single Nvidia A100 GPU with linear GPU memory …
Poster
Xinyu Mao · Xiaohan Xing · Fei MENG · Jianbang LIU · Fan BAI · Qiang Nie · Max Meng

[ Exhibit Hall I ]

Abstract
Polyp segmentation is vital for early colorectal cancer detection, yet traditional fully supervised methods struggle with morphological variability and domain shifts, requiring frequent retraining. Additionally, reliance on large-scale annotations is a major bottleneck due to the time-consuming and error-prone nature of polyp boundary labeling. Recently, vision foundation models like Segment Anything Model (SAM) have demonstrated strong generalizability and fine-grained boundary detection with sparse prompts, effectively addressing key polyp segmentation challenges. However, SAM’s prompt-dependent nature limits automation in medical applications, since manually inputting prompts for each image is labor-intensive and time-consuming. We propose OP-SAM, a One-shot Polyp segmentation framework based on SAM that automatically generates prompts from a single annotated image, ensuring accurate and generalizable segmentation without additional annotation burdens. Our method introduces Correlation-based Prior Generation (CPG) for semantic label transfer and Scale-cascaded Prior Fusion (SPF) to adapt to polyp size variations as well as filter out noisy transfers. Instead of dumping all prompts at once, we devise Euclidean Prompt Evolution (EPE) for iterative prompt refinement, progressively enhancing segmentation quality. Extensive evaluations across five datasets validate OP-SAM’s effectiveness. Notably, on Kvasir, it achieves 76.93% IoU, surpassing the state-of-the-art by 11.44%. Source codes will be released upon acceptance.
Poster
Shuyi Ouyang · Ziwei Niu · Hongyi Wang · Yen-wei Chen · Lanfen Lin

[ Exhibit Hall I ]

Abstract
Referring Visual Grounding (RVG) tasks revolve around utilizing vision-language interactions to incorporate object information from language expressions, thereby enabling targeted object detection or segmentation within images. Transformer-based methods have enabled effective interaction through attention mechanisms, achieving notable performance in RVG tasks. However, existing strategies for RVG, which involve direct interaction between visual and linguistic features, face three key challenges: (i) tendency to focus on a single target, (ii) insufficient control over linguistic noise, and (iii) high computational cost. To address these challenges, we propose a Region-aware Anchoring Mechanism (RaAM) that mediates vision-language interactions. In RaAM, region-aware anchors engage in alternating interactions with vision and language modalities, acting as indicators for object presence across different regions within the image. RaAM (i) directs attention to multiple target regions for better localization, (ii) reduces cross-modal redundancy by using anchors as buffers, and (iii) lowers time complexity. In addition, we design region and pixel level loss functions to enhance object presence assessment and edge precision. We evaluate our RaAM-RVG on four benchmark datasets and integrate RaAM into various models by replacing their interaction design. Results show that RaAM outperforms state-of-the-art methods with lower computational cost. Code will be released publicly.
Poster
Jinglei Zhang · Yuanfan Guo · Rolandos Alexandros Potamias · Jiankang Deng · Hang Xu · Chao Ma

[ Exhibit Hall I ]

Abstract
In recent years, video question answering based on multimodal large language models (MLLM) has garnered considerable attention, due to the benefits from the substantial advancements in LLMs. However, these models have a notable deficiency in the domains of video temporal grounding and reasoning, posing challenges to the development of effective real-world video understanding systems. Inspired by how humans use video players to interact with the progress bar for video comprehension, we introduce VTimeCoT, a simple yet effective training-free framework, designed for high-performance video grounding and reasoning. The proposed framework incorporates two novel visual tools of the progress bar: a plug-and-play progress bar integration tool and a high-efficiency highlighting tool. In addition, to address the limitations of conventional text-based chain-of-thought (CoT) approaches, we introduce a visuotemporal CoT process that integrates cross-modality reasoning across both video and text. Our approach demonstrates significant performance improvements on both Qwen2VL-7B and GPT4o baselines in tasks of video temporal grounding and reasoning-based question answering. Finally, we showcase that the proposed framework achieves a compositional and interpretable reasoning process. The code will be made publicly available.
Poster
Ruyi Xu · Yen-Tzu Chiu · Tai-I Chen · Oscar Chew · Yung-Yu Chuang · Wen-Huang Cheng

[ Exhibit Hall I ]

Abstract
Anomaly generation has become essential in addressing the scarcity of defective samples in industrial anomaly inspection. However, existing training-based methods fail to handle complex anomalies and multiple defects simultaneously, especially when only a single anomaly sample is available per defect type. To address this issue, we propose TF-IDG, a novel training-free defect generation framework capable of generating diverse anomaly samples in a one-shot setting. We propose a Feature Alignment strategy that provides fine-grained appearance guidance by minimizing the distributional gap between generated and real defects with high complexity. Additionally, we introduce an Adaptive Anomaly Mask mechanism to mitigate the issue of defects with small regions being ignored during the generation process, enhancing consistency between synthetic defects and their corresponding masks. Finally, we incorporate a Texture Preservation module that extracts background information from anomaly-free images, ensuring that the visual properties of synthetic defects are seamlessly integrated into the image. Extensive experiments demonstrate the effectiveness of our method in generating accurate and diverse anomalies, further leading to superior performance in downstream anomaly inspection tasks.
Poster
Sebastian Höfer · Dorian Henning · Artemij Amiranashvili · Douglas Morrison · Mariliza Tzes · Ingmar Posner · Marc Matvienko · Alessandro Rennola · Anton Milan

[ Exhibit Hall I ]

Abstract
We present a novel large-scale dataset for defect detection in a logistics setting. Recent work on industrial anomaly detection has primarily focused on manufacturing scenarios with highly controlled poses and a limited number of object categories. Existing benchmarks like MVTec-AD (Bergmann et al., 2021) and VisA (Zou et al., 2022) have reached saturation, with state-of-the-art methods achieving up to 99.9% AUROC scores. In contrast to manufacturing, anomaly detection in retail logistics faces new challenges, particularly in the diversity and variability of viewpoints and object appearances. Leading anomaly detection methods fall short when applied to this new setting.To bridge this gap, we introduce a new benchmark that overcomes the current limitations of existing datasets. With over 230,000 images (29,000 defective instances), it is 40 times larger than MVTec and contains more than 46,000 distinct objects. To validate the difficulty of the problem, we conduct an extensive evaluation of multiple state-of-the-art anomaly detection methods, demonstrating that they achieve only 56.9% AUC on our dataset. Further qualitative analysis confirms that existing methods struggle to leverage normal samples under heavy pose and appearance variation. With our large-scale dataset, we set a new benchmark and encourage future research towards solving this challenging problem in retail …
Poster
Hui Lu · Albert Ali Salah · Ronald Poppe

[ Exhibit Hall I ]

Abstract
Video understanding requires the extraction of rich spatio-temporal representations, achieved by transformer models through self-attention. Unfortunately, self-attention poses a computational burden. In NLP, Mamba has surfaced as an efficient alternative for transformers. However, Mamba's successes do not trivially extend to vision tasks, including those in video analysis. In this paper, we theoretically analyze the differences between self-attention and Mamba. We identify two limitations in Mamba's token processing: historical decay and element contradiction. We propose VideoMambaPro (VMP) that addresses these limitations by adding masked backward computation and elemental residual connections to a VideoMamba backbone. VideoMambaPro models surpass VideoMamba by 1.6-3.0% and 1.1-1.9% top-1 on Kinetics-400 and Something-Something V2, respectively. Even without extensive pre-training, our models present an attractive and efficient alternative to current transformer models. Moreover, our two solutions are orthogonal to recent advances in Vision Mamba models, and are likely to provide further improvements in future models.
Poster
Ke Niu · Haiyang Yu · Mengyang Zhao · Teng Fu · Siyang Yi · Wei Lu · Bin Li · Xuelin Qian · Xiangyang Xue

[ Exhibit Hall I ]

Abstract
Person re-identification (Re-ID) is a crucial task in computer vision, aiming to recognize individuals across non-overlapping camera views. While recent advanced vision-language models (VLMs) excel in logical reasoning and multi-task generalization, their applications in Re-ID tasks remain limited. They either struggle to perform accurate matching based on identity-relevant features or assist image-dominated branches as auxiliary semantics. In this paper, we propose ChatReID, a novel framework that shifts the focus towards a text-side-dominated retrieval paradigm, enabling flexible and interactive re-identification. To integrate the reasoning abilities of language models into Re-ID pipelines, we first present a large-scale instruction dataset, which contains more than 8 million prompts to facilitate model fine-tuning. Next, we introduce a hierarchical progressive tuning strategy, which endows the model with Re-ID ability through three stages of tuning, i.e., from person attribute understanding to fine-grained image retrieval to multi-modal task reasoning. Extensive experiments across ten popular benchmarks demonstrate that ChatReID outperforms existing methods, achieving state-of-the-art performance in all Re-ID tasks. More experiments demonstrate that ChatReID not only has the ability to recognize fine-grained details but also to integrate them into a coherent reasoning process.
Poster
Fan Li · Xuanbin Wang · Xuan Wang · Zhaoxiang Zhang · yuelei xu

[ Exhibit Hall I ]

Abstract
Recently, open-vocabulary semantic segmentation has garnered growing attention. Most current methods leverage vision-language models like CLIP to recognize unseen categories through their zero-shot capabilities. However, CLIP struggles to establish potential spatial dependencies among scene objects due to its holistic pre-training objective, causing sub-optimal results. In this paper, we propose a DEnoising learning framework based on the Diffusion model for Open-vocabulary semantic Segmentation, called DEDOS, which is aimed at constructing the scene skeleton. Motivation stems from the fact that diffusion models incorporate not only the visual appearance of objects but also embed rich scene spatial priors. Our core idea is to view images as labels embedded with "noise"—non-essential details for perceptual tasks—and to disentangle the intrinsic scene prior from the diffusion feature during the denoising process of the images. Specifically, to fully harness the scene prior knowledge of the diffusion model, we introduce learnable proxy queries during the denoising process. Meanwhile, we leverage the robustness of CLIP features to texture shifts as supervision, guiding proxy queries to focus on constructing the scene skeleton and avoiding interference from texture information in the diffusion feature space. Finally, we enhance spatial understanding within CLIP features using proxy queries, which also serve as an interface …
Poster
Osman Ülger · Maksymilian Kulicki · Yuki Asano · Martin Oswald

[ Exhibit Hall I ]

Abstract
Open-Vocabulary Segmentation (OVS) methods are capable of performing semantic segmentation without relying on a fixed vocabulary, and in some cases, without training or fine-tuning. However, OVS methods typically require a human in the loop to specify the vocabulary based on the task or dataset at hand. In this paper, we introduce Auto-Vocabulary Semantic Segmentation (AVS), advancing open-ended image understanding by eliminating the necessity to predefine object categories for segmentation. Our approach, AutoSeg, presents a framework that autonomously identifies relevant class names using semantically enhanced BLIP embeddings and segments them afterwards. Given that open-ended object category predictions cannot be directly compared with a fixed ground truth, we develop a Large Language Model-based Auto-Vocabulary Evaluator (LAVE) to efficiently evaluate the automatically generated class names and their corresponding segments. With AVS, our method sets new benchmarks on datasets PASCAL VOC, Context, ADE20K, and Cityscapes, while showing competitive performance to OVS methods that require specified class names. All code will be publicly released.
Poster
Tom Nuno Wolf · Emre Kavak · Fabian Bongratz · Christian Wachinger

[ Exhibit Hall I ]

Abstract
The deployment of deep learning models in critical domains necessitates a balance between high accuracy and interpretability. We introduce SIC, an inherently interpretable neural network that provides local and global explanations of its decision-making process. Leveraging the concept of case-based reasoning, SIC extracts class-representative support vectors from training images, ensuring they capture relevant features while suppressing irrelevant ones. Classification decisions are made by calculating and aggregating similarity scores between these support vectors and the input's latent feature vector. We employ B-Cos transformations, which align model weights with inputs, to yield coherent pixel-level explanations in addition to global explanations of case-based reasoning. We evaluate SIC on three tasks: fine-grained classification on Stanford Dogs and FunnyBirds, multi-label classification on Pascal VOC, and pathology detection on the RSNA dataset. Results indicate that SIC not only achieves competitive accuracy compared to state-of-the-art black-box and inherently interpretable models but also offers insightful explanations verified through practical evaluation on the FunnyBirds benchmark. Our theoretical analysis proves that these explanations fulfill established axioms for explanations. Our findings underscore SIC's potential for applications where understanding model decisions is as critical as the decisions themselves.
Poster
Zuhao Yang · Yingchen Yu · Yunqing Zhao · Shijian Lu · Song Bai

[ Exhibit Hall I ]

Abstract
Video Temporal Grounding (VTG) aims to precisely identify video event segments in response to textual queries. The outputs of VTG tasks manifest as sequences of events, each defined by precise timestamps, saliency scores, and textual descriptions. Despite recent advances, a fundamental limitation persists in existing Video Large Language Models (Video-LLMs): they process all task tokens through identical and static pathways, failing to recognize that temporal localization, saliency assessment, and textual generation represent fundamentally distinct tasks requiring specialized processing. To address this, we introduce TimeExpert, a Mixture-of-Experts (MoE)-based Video-LLM that effectively decomposes VTG tasks by dynamically routing task-specific tokens (e.g., timestamps, saliency scores) to specialized experts, with increased computational efficiency. Our design choices enable precise handling of each subtask, leading to improved event modeling across diverse VTG applications. Extensive experiments demonstrate that TimeExpert consistently achieves state-of-the-art performance on various VTG tasks such as Dense Video Captioning, Moment Retrieval, and Video Highlight Detection.
Poster
Shraman Pramanick · Effrosyni Mavroudi · Yale Song · Rama Chellappa · Lorenzo Torresani · Triantafyllos Afouras

[ Exhibit Hall I ]

Abstract
We introduce ED-VTG, a method for fine-grained video temporal grounding utilizing multi-modal large language models. Our approach harnesses the capabilities of multimodal LLMs to jointly process text and video, in order to effectively localize natural language queries in videos through a two-stage process. Rather than being directly grounded, language queries are initially transformed into enriched sentences that incorporate missing details and cues to aid in grounding. In the second stage, these enriched queries are grounded using a lightweight decoder, which specializes in predicting accurate boundaries conditioned on contextualized representations of the enriched queries. To mitigate noise and reduce the impact of hallucinations, our model is trained with a multiple-instance-learning objective that dynamically selects the optimal version of the query for each training sample. We demonstrate state-of-the-art results across various benchmarks in temporal video grounding and paragraph grounding settings. Experiments reveal that our method significantly outperforms all previously proposed LLM-based temporal grounding approaches and is either superior or comparable to specialized models, while maintaining a clear advantage against them in zero-shot evaluation scenarios.
Poster
Hongliang Zhou · Yongxiang Liu · Canyu Mo · Weijie Li · Bowen Peng · Li Liu

[ Exhibit Hall I ]

Abstract
Few-shot object detection aims to detect novel classes with limited samples. Due to boundary and scale discrepancies with base classes, novel classes exhibit suboptimal performance under limited samples. Although recent methods leverage the rich semantic representations of pretrained ViTs to overcome the limitations of model fine-tuning, thereby enhancing novel class performance, designing a ViT architecture that addresses boundary and scale issues to balance base and novel class performance remains challenging: (1) modeling feature distinctions at object boundaries at the pixel level while preserving global information; and (2) applying scale-specific extraction for images containing multiscale objects, adaptively capturing local details and global contours. To this end, the Pixel Difference Vision Transformer (PiDiViT) is proposed. Innovations include: (1) a difference convolution fusion module (DCFM), which achieves precise object boundary localization and effective preservation of global object information by integrating direction-sensitive differential feature maps of pixel neighborhoods with the original feature maps; and (2) a multiscale feature fusion module (MFFM), which adaptively fuses features extracted by five different scale convolutional kernels using a scale attention mechanism to generate attention weights, achieving an optimal balance between local detail and global semantic information extraction. PiDiViT achieves SOTA on the COCO benchmark: surpassing the few-shot detection SOTA by 2.7 nAP50 (10-shot) and 4.0 nAP50 (30-shot) for …
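The following is a hedged sketch of what a difference-convolution fusion step of this kind might look like: direction-sensitive differences between each pixel and its four neighbours are concatenated with the original feature map and fused by a 1x1 convolution. The neighbour offsets, wrap-around borders, and fusion layer are our simplifications, not the paper's exact DCFM.

import torch
import torch.nn as nn

class DiffConvFusionSketch(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # fuse the original map plus four directional difference maps
        self.fuse = nn.Conv2d(channels * 5, channels, kernel_size=1)

    def forward(self, x):                                   # x: (B, C, H, W)
        diffs = [x - torch.roll(x, shifts=s, dims=(2, 3))   # up/down/left/right
                 for s in [(1, 0), (-1, 0), (0, 1), (0, -1)]]
        # note: torch.roll wraps around borders; a real implementation would pad
        return self.fuse(torch.cat([x] + diffs, dim=1))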
Poster
Zhe Li · Lei Zhang · Zheren Fu · Kun Zhang · Zhendong Mao

[ Exhibit Hall I ]

Abstract
Zero-Shot Composed Image Retrieval (ZS-CIR) aims to retrieve the target image based on a reference image and a text describing the user's intention, without training on triplet datasets. The key to this task is to make specified changes to specific objects in the reference image based on the text. Previous works generate single or multiple pseudo words by projecting the reference image into the word embedding space. However, these methods ignore the fact that the editing objects of CIR are naturally hierarchical, and they lack the ability to adapt to the text, thus failing to meet multi-level editing needs. In this paper, we argue that hierarchical object decomposition is the key to learning pseudo words, and propose a hierarchy-aware dynamic pseudo word learning (HIT) framework equipped with HIerarchy semantic parsing and Text-adaptive filtering. The proposed HIT enjoys several merits. First, HIT is empowered to dynamically decompose the image into different granularities of editing objects, guided by a set of learnable group tokens, thus naturally forming hierarchical semantic concepts. Second, a text-adaptive filtering strategy is proposed to screen out specific objects from different levels based on the text, so as to learn hierarchical pseudo words that meet diverse …
Poster
Zeyu Xi · Haoying Sun · Yaofei Wu · Junchi Yan · Haoran Zhang · Lifang Wu · Liang Wang · Chang Wen Chen

[ Exhibit Hall I ]

Abstract
Existing sports video captioning methods often focus on the content yet overlook player identities, limiting their applicability. Although existing methods integrate extra information to generate identity-aware descriptions, player identities are sometimes incorrect because the extra information is independent of the video content. This paper introduces a player-centric multimodal prompt generation network for identity-aware sports video captioning (LLM-VC), which focuses on recognizing player identities from a visual perspective. Specifically, an identity-related information extraction module (IRIEM) is designed to extract player-related multimodal embeddings. IRIEM includes a player identification network (PIN) for extracting visual features and player names, and a bidirectional semantic interaction module (BSIM) to link player features with video content for mutual enhancement. Additionally, a visual context learning module (VCLM) is designed to capture key video context information. Finally, the outputs of the above modules are integrated as a multimodal prompt for the large language model (LLM), facilitating the generation of descriptions with player identities. To support this work, we construct NBA-Identity, a large identity-aware basketball video captioning dataset with 9,726 videos covering 9 event types. Experimental results on NBA-Identity and VC-NBA-2022 demonstrate that our proposed model achieves advanced performance.
Poster
Wooseong Jeong · Jegyeong Cho · Youngho Yoon · Kuk-Jin Yoon

[ Exhibit Hall I ]

Abstract
Generalizing neural networks to unseen target domains is a significant challenge in real-world deployments. Test-time training (TTT) addresses this by using an auxiliary self-supervised task to reduce the domain gap caused by distribution shifts between the source and target. However, we find that when models are required to perform multiple tasks under domain shifts, conventional TTT methods suffer from unsynchronized task behavior, where the adaptation steps needed for optimal performance in one task may not align with the requirements of other tasks. To address this, we propose a novel TTT approach called Synchronizing Tasks for Test-time Training (S4T), which enables the concurrent handling of multiple tasks. The core idea behind S4T is that predicting task relations across domain shifts is key to synchronizing tasks during test time. To validate our approach, we apply S4T to conventional multi-task benchmarks, integrating it with traditional TTT protocols. Our empirical results show that S4T outperforms state-of-the-art TTT methods across various benchmarks.
Poster
Ru Zeng · Yan Song · Yang ZHANG · yanlinghu yanlinghu · Hui Yu

[ Exhibit Hall I ]

Abstract
GLOM, an innovative departure from standard deep learning architectures, has attracted particular attention recently due to its good interpretability in representing part-whole relationships in computer vision. However, GLOM faces challenges in achieving agreement and is usually computationally demanding. First, current implementations struggle to produce identical vectors that reliably converge to represent nodes in a parse tree. Second, GLOM is computationally intensive due to the need to maintain equal resolution across all levels. To address these issues, inspired by contrastive learning, we propose a contrastive agreement enhancer (CAE), which effectively promotes agreement between positive embedding pairs while pushing apart negative pairs, thereby facilitating the formation of distinct ``islands''. Furthermore, we introduce a dissimilarity-focused head ($ H_d $) to reduce redundancy in the top-level embeddings, where embedding weights for downsampling are negatively correlated with similarity within a sliding window. The results of comparison experiments indicate that the proposed approach delicately retains informative content and significantly reduces the number of parameters. Additionally, ablation experiments and visualization results demonstrate that CAE successfully promotes islands of agreement.
Poster
Maximilian Ulmer · Wout Boerdijk · Rudolph Triebel · Maximilian Durner

[ Exhibit Hall I ]

Abstract
This paper presents OC-DiT, a novel class of diffusion models designed for object-centric prediction, and applies it to zero-shot instance segmentation. We propose a conditional latent diffusion framework that generates instance masks by conditioning the generative process on object templates and image features within the diffusion model's latent space. This allows our model to effectively disentangle object instances through the diffusion process, which is guided by visual object descriptors and localized image cues. Specifically, we introduce two model variants: a coarse model for generating initial object instance proposals, and a refinement model that refines all proposals in parallel. We train these models on a newly created, large-scale synthetic dataset comprising thousands of high-quality object meshes. Remarkably, our model achieves state-of-the-art performance on multiple challenging real-world benchmarks, without requiring any retraining on target data. Through comprehensive ablation studies, we demonstrate the potential of diffusion models for instance segmentation tasks. Code and the synthetic dataset will be publicly released.
Poster
Tommaso Galliena · Tommaso Apicella · Stefano Rosa · Pietro Morerio · ALESSIO DEL BUE · Lorenzo Natale

[ Exhibit Hall I ]

Abstract
We present a self-supervised method to improve an agent's abilities in describing arbitrary objects while actively exploring a generic environment. This is a challenging problem, as current models struggle to obtain coherent image captions due to different camera viewpoints and clutter. We propose a three-phase framework to fine-tune existing captioning models that enhances caption accuracy and consistency across views via a consensus mechanism. First, an agent explores the environment, collecting noisy image-caption pairs. Then, a consistent pseudo-caption for each object instance is distilled via consensus using a large language model. Finally, these pseudo-captions are used to fine-tune an off-the-shelf captioning model, with the addition of contrastive learning. We analyse the performance of the combination of captioning models, exploration policies, pseudo-labeling methods, and fine-tuning strategies on our manually labeled test set. Results show that a policy can be trained to mine samples with higher disagreement compared to classical baselines. Our pseudo-captioning method, in combination with all policies, has a higher semantic similarity compared to other existing methods, and fine-tuning improves caption accuracy and consistency by a significant margin. Code and test set annotations will be released upon paper acceptance.
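As a minimal sketch of the consensus phase (assuming a generic llm() callable rather than any specific model), the noisy captions collected for one object instance can be distilled into a single pseudo-caption as follows; the prompt text is illustrative only.

from collections import defaultdict

def distill_pseudo_captions(noisy_pairs, llm):
    # noisy_pairs: list of (instance_id, caption) gathered during exploration
    # llm: hypothetical callable that returns a text completion for a prompt
    by_instance = defaultdict(list)
    for inst_id, caption in noisy_pairs:
        by_instance[inst_id].append(caption)
    pseudo = {}
    for inst_id, captions in by_instance.items():
        prompt = ("These captions describe the same object seen from different "
                  "viewpoints:\n- " + "\n- ".join(captions) +
                  "\nWrite one short caption consistent with all of them.")
        pseudo[inst_id] = llm(prompt)          # consensus pseudo-caption
    return pseudo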
Poster
Yuhui Zeng · Haoxiang Wu · Wenjie Nie · Xiawu Zheng · Guangyao Chen · Yunhang Shen · Jun Peng · Yonghong Tian · Rongrong Ji

[ Exhibit Hall I ]

Abstract
Current object detectors excel at entity localization and classification, yet exhibit inherent limitations in event recognition capabilities. This deficiency arises from their architecture's emphasis on discrete object identification rather than modeling the compositional reasoning, inter-object correlations, and contextual semantics essential for comprehensive event understanding. To address this challenge, we present a novel framework that expands the capability of standard object detectors beyond mere object recognition to complex event understanding through LLM-guided symbolic reasoning. Our key innovation lies in bridging the semantic gap between object detection and event understanding without requiring expensive task-specific training. The proposed plug-and-play framework interfaces with any open-vocabulary detector while extending their inherent capabilities across architectures. At its core, our approach combines (i) a symbolic regression mechanism that explores relationship patterns among detected entities and (ii) an LLM-guided strategy that steers the search toward meaningful expressions. These discovered symbolic rules transform low-level visual perception into interpretable event understanding, providing a transparent reasoning path from objects to events with strong transferability across domains. We compared our training-free framework against specialized event recognition systems across diverse application domains. Experiments demonstrate that our framework enhances multiple object detector architectures to recognize complex events such as illegal fishing activities ($\textbf{75}$% AUROC, $\textbf{+8.36}$% improvement), construction …
Poster
Min Cen · Zhenfeng Zhuang · Yuzhe Zhang · Min Zeng · Baptiste Magnier · Lequan Yu · Hong Zhang · Liansheng Wang

[ Exhibit Hall I ]

Abstract
Graph-based Multiple Instance Learning (MIL) is commonly applied in survival analysis using Hematoxylin and Eosin (H\&E)-stained whole slide images (WSIs) because it effectively captures topological information. However, variations in WSI preparation—such as differences in staining and scanning—can introduce semantic bias. Additionally, topological subgraphs that are not relevant to the causal relationships can create noise, resulting in biased slide-level representations. These issues can hinder both the interpretability and generalization of the analysis. To address these issues, we introduce a dual structural causal model as the theoretical foundation and further propose a novel and interpretable dual causal graph-based MIL model, named C$^2$MIL, for robust survival analysis. C$^2$MIL adopts a novel cross-scale adaptive feature disentangling module for semantic causal intervention and a new Bernoulli differentiable causal subgraph sampling method for topological causal discovery. A joint optimization strategy integrating disentangling supervision and contrastive learning is proposed to ensure simultaneous refinement of semantic and topological causalities. Experimental results reveal that C$^2$MIL outperforms existing methods in both generalization and interpretability and can serve as a causal enhancement for various MIL baselines. The code will be available later.
Poster
Shijie Ma · Yuying Ge · Teng Wang · Yuxin Guo · Yixiao Ge · Ying Shan

[ Exhibit Hall I ]

Abstract
The synergy between generative and discriminative models receives growing attention. While discriminative Contrastive Language-Image Pre-Training (CLIP) excels in high-level semantics, it struggles with perceiving fine-grained visual details. Generally, to enhance representations, generative models take CLIP's visual features as conditions for reconstruction. However, the underlying principle remains underexplored. In this work, we empirically found that **visually** perfect generations are not always optimal for representation enhancement. The essence lies in effectively extracting fine-grained knowledge from generative models while mitigating irrelevant information. To explore critical factors, we delve into three aspects: (1) Conditioning mechanisms: We found that even a small number of local tokens can drastically reduce the difficulty of reconstruction, leading to collapsed training. We thus conclude that utilizing **only** global visual tokens as conditions is the most effective strategy. (2) Denoising configurations: We observed that end-to-end training introduces extraneous information. To address this, we propose a two-stage training strategy to prioritize learning useful visual knowledge. Additionally, we demonstrate that lightweight denoisers can yield remarkable improvements. (3) Generation paradigms: We explore both continuous and discrete denoisers with desirable outcomes, validating the versatility of our method. Through our in-depth exploration, we have finally arrived at an effective method that consistently outperforms prior arts …
Poster
Junhao Xiao · Yang Wei · Jingyu Wang · Yongchao Wang · Xiuli Bi · Bin Xiao

[ Exhibit Hall I ]

Abstract
Morphological differences and dense spatial relations of organs make multi-organ segmentation challenging. Current segmentation networks, primarily based on CNNs and Transformers, represent organs by aggregating information within fixed regions. However, aggregated representations often fail to accurately describe the shape differences and spatial relationships of multiple organs, which leads to imprecise morphological modeling and ambiguous boundary representation. In response, we propose a novel multi-organ segmentation network via dynamic graph reconstruction, called DGRNet. Unlike existing approaches, DGRNet employs a graph-based paradigm to reconstruct multiple organs and leverages the topological flexibility of graphs to represent irregular organ morphology. Based on graph connectivity, precise information interaction enables more efficient multi-organ modeling. Furthermore, DGRNet introduces a category-aware guidance mechanism that utilizes organ category-specific priors to explicitly define inter-organ boundaries, addressing the issue of ambiguous margin delineation in multi-organ regions. We conducted extensive experiments on five datasets (including CT and MRI), showing that DGRNet outperforms state-of-the-art methods and models complex multi-organ areas better, highlighting its effectiveness and robustness.
Poster
Bin Xie · Hao Tang · Bin Duan · Dawen Cai · Yan Yan · Gady Agam

[ Exhibit Hall I ]

Abstract
The Segment Anything Model (SAM), a prompt-driven foundation model for natural image segmentation, has demonstrated impressive zero-shot performance. However, SAM is not directly applicable to medical image segmentation due to its inability to predict semantic labels, reliance on additional prompts, and suboptimal performance in this domain. To address these limitations, we propose MaskSAM, a novel prompt-free SAM adaptation framework for medical image segmentation based on mask classification. MaskSAM introduces a prompt generator integrated with SAM’s image encoder to produce auxiliary classifier tokens, binary masks, and bounding boxes. Each pair of auxiliary mask and box prompts eliminates the need for user-provided prompts. Semantic label prediction is enabled by summing the auxiliary classifier tokens with learnable global classifier tokens in SAM’s mask decoder. Additionally, we design a 3D depth-convolution adapter for image embeddings and a 3D depth-MLP adapter for prompt embeddings, which are injected into each transformer block in the image encoder and mask decoder to efficiently fine-tune SAM for volumetric medical imaging. Our method achieves state-of-the-art performance, with a Dice score of 90.52% on AMOS2022, outperforming nnUNet by 2.7%. MaskSAM also surpasses nnUNet by 1.7% on ACDC and 1.0% on the Synapse dataset, demonstrating its effectiveness in medical image segmentation.
Poster
Evangelos Kazakos · Cordelia Schmid · Josef Sivic

[ Exhibit Hall I ]

Abstract
We propose a novel approach for captioning and object grounding in video, where the objects in the caption are grounded in the video via temporally dense bounding boxes. We introduce the following contributions. First, we present a large-scale automatic annotation method that aggregates captions grounded with bounding boxes across individual frames into temporally dense and consistent bounding box annotations. We apply this approach on the HowTo100M dataset to construct a large-scale pre-training dataset, named HowToGround1M. We also introduce a Grounded Video Caption Generation model, dubbed GROVE, and pre-train the model on HowToGround1M. Second, we introduce a new dataset, called iGround, of 3500 videos with manually annotated captions and dense spatio-temporally grounded bounding boxes. This allows us to measure progress on this challenging problem, as well as to fine-tune our model on this small-scale but high-quality data. Third, we demonstrate that our approach achieves state-of-the-art results on the proposed iGround dataset compared to a number of baselines, as well as on the VidSTG and ActivityNet-Entities datasets. We perform extensive ablations that demonstrate the importance of pre-training using our automatically annotated HowToGround1M dataset followed by fine-tuning on the manually annotated iGround dataset and validate the key technical contributions of our model. Data, …
Poster
Huy Ta · Duy Anh Huynh · Yutong Xie · Yuankai Qi · Qi Chen · Phi Le Nguyen · Sen Tran · Son Lam Phung · Anton Hengel · Zhibin Liao · Minh-Son To · Johan Verjans · Vu Phan

[ Exhibit Hall I ]

Abstract
Visual grounding (VG) is the capability to identify the specific regions in an image associated with a particular text description. In medical imaging, VG enhances interpretability by highlighting relevant pathological features corresponding to textual descriptions, improving model transparency and trustworthiness for wider adoption of deep learning models in clinical practice. Current models struggle to associate textual descriptions with disease regions due to inefficient attention mechanisms and a lack of fine-grained token representations. In this paper, we empirically demonstrate two key observations. First, current VLMs assign high norms to background tokens, diverting the model's attention from regions of disease. Second, the global tokens used for cross-modal learning are not representative of local disease tokens. This hampers the identification of correlations between text and disease tokens. To address this, we introduce a simple yet effective Disease-Aware Prompting (DAP) process, which uses the explainability map of a VLM to identify the appropriate image features. This simple strategy amplifies disease-relevant regions while suppressing background interference. Without any additional pixel-level annotations, DAP improves visual grounding accuracy by 20.74\% compared to state-of-the-art methods across three major chest X-ray datasets.
Poster
Xiao Li · Yiming Zhu · Yifan Huang · Wei Zhang · Yingzhe He · Jie Shi · Xiaolin Hu

[ Exhibit Hall I ]

Abstract
Object detection plays a crucial role in many security-sensitive applications, such as autonomous driving and video surveillance. However, several recent studies have shown that object detectors can be easily fooled by physically realizable attacks, e.g., adversarial patches and recent adversarial textures, which pose realistic and urgent threats. Adversarial Training (AT) has been recognized as the most effective defense against adversarial attacks. While AT has been extensively studied in the $l_\infty$-bounded attack settings on classification models, AT against physically realizable attacks on object detectors has received limited exploration. Early attempts are only performed to defend against adversarial patches, leaving AT against a wider range of physically realizable attacks under-explored. In this work, we consider defending against various physically realizable attacks with a unified AT method. We propose PBCAT, a novel Patch-Based Composite Adversarial Training strategy. PBCAT optimizes the model by incorporating the combination of small-area gradient-guided adversarial patches and imperceptible global adversarial perturbations covering the entire image. With these designs, PBCAT has the potential to defend against not only adversarial patches but also unseen physically realizable attacks such as adversarial textures. Extensive experiments in multiple settings demonstrated that PBCAT significantly improved robustness against various physically realizable attacks over state-of-the-art defense methods. …
Poster
Dongheon Lee · Seokju Yun · Youngmin Ro

[ Exhibit Hall I ]

Abstract
In this paper, we tackle the high computational cost of transformers for lightweight image super-resolution (SR). Motivated by the observation that self-attention is repeated across layers, we introduce a convolutionized self-attention module named Convolutional Attention (ConvAttn) that emulates self-attention's long-range modeling capability and instance-dependent weighting with a single shared large kernel and dynamic kernels. By utilizing the ConvAttn module, we significantly reduce the reliance on self-attention and its involved memory-bound operations while maintaining the representational capability of transformers. Furthermore, we overcome the challenge of integrating flash attention into the lightweight SR regime, effectively mitigating self-attention's inherent memory bottleneck. We scale up the window size to 32$\times$32 with flash attention rather than proposing an intricate self-attention module, significantly improving PSNR by 0.31 dB on Urban100$\times$2 while reducing latency and memory usage by 16$\times$ and 12.2$\times$. Building on these approaches, our proposed network, termed Emulating Self-attention with Convolution (ESC), notably improves PSNR by 0.27 dB on Urban100$\times$4 compared to HiT-SRF, reducing the latency and memory usage by 3.7$\times$ and 6.2$\times$, respectively. Extensive experiments demonstrate that our ESC maintains the ability for long-range modeling, data scalability, and the representational power of transformers despite most self-attentions being replaced by the ConvAttn module.
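A rough PyTorch sketch of the idea as we read it is given below: a single large depthwise kernel shared across positions provides long-range mixing, while a small kernel predicted from pooled features adds instance-dependent weighting. The kernel sizes and the additive fusion are assumptions, not the released ConvAttn design.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvAttnSketch(nn.Module):
    def __init__(self, channels, large_k=13, dyn_k=3):
        super().__init__()
        # large shared depthwise kernel: long-range, input-independent mixing
        self.shared = nn.Conv2d(channels, channels, large_k,
                                padding=large_k // 2, groups=channels)
        # predicts a small per-channel kernel from globally pooled features
        self.to_dyn = nn.Linear(channels, channels * dyn_k * dyn_k)
        self.dyn_k = dyn_k

    def forward(self, x):                        # x: (B, C, H, W)
        b, c, h, w = x.shape
        out = self.shared(x)
        k = self.to_dyn(x.mean(dim=(2, 3)))      # instance-dependent kernels
        k = k.view(b * c, 1, self.dyn_k, self.dyn_k)
        dyn = F.conv2d(x.reshape(1, b * c, h, w), k,
                       padding=self.dyn_k // 2, groups=b * c)
        return out + dyn.view(b, c, h, w)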
Poster
Jianfang He · Min Cao · Silong Peng · Qiong Xie

[ Exhibit Hall I ]

Abstract
Large vision-language models such as CLIP have made significant strides in zero-shot anomaly detection through prompt engineering. However, most existing methods typically process each test image individually, ignoring the practical rarity of abnormal patches in real-world scenarios. Although some batch-based approaches exploit the rarity by processing multiple samples concurrently, they generally introduce unacceptable latency for real-time applications. To mitigate these limitations, we propose RareCLIP, a novel online zero-shot anomaly detection framework that enables sequential image processing in real-time without requiring prior knowledge of the target domain. RareCLIP capitalizes on the zero-shot capabilities of CLIP and integrates a dynamic test-time rarity estimation mechanism. A key innovation of our framework is the introduction of a prototype patch feature memory bank, which aggregates representative features from historical observations and continuously updates their corresponding rarity measures. For each incoming image patch, RareCLIP computes a rarity score by aggregating the rarity measures of its nearest neighbors within the memory bank. Moreover, we introduce a prototype sampling strategy based on dissimilarity to enhance computational efficiency, as well as a similarity calibration strategy to enhance the robustness of rarity estimation. Extensive experiments demonstrate that RareCLIP attains state-of-the-art performance with 98.2\% image-level AUROC on MVTec AD and 94.5\% on VisA, while achieving a latency of 59.4 …
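For concreteness, here is a minimal sketch of the rarity-scoring step (our reconstruction from the abstract, not the authors' code): each incoming patch is scored by the average rarity of its nearest prototypes in the memory bank. The use of cosine similarity, k nearest neighbours, and mean aggregation are assumptions.

import torch
import torch.nn.functional as F

def rarity_score(patch_feats, bank_feats, bank_rarity, k=5):
    # patch_feats: (N, D) features of the current image's patches
    # bank_feats:  (M, D) prototype features in the memory bank (M >= k)
    # bank_rarity: (M,) rarity value maintained for each prototype
    sims = F.normalize(patch_feats, dim=-1) @ F.normalize(bank_feats, dim=-1).T
    topk = sims.topk(k, dim=-1).indices        # nearest prototypes per patch
    return bank_rarity[topk].mean(dim=-1)      # per-patch anomaly/rarity score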
Poster
Maksim Golyadkin · Rubanova Alexandrovna · Aleksandr Utkov · Dmitry Nikolotov · Ilya Makarov

[ Exhibit Hall I ]

Abstract
The recognition of ancient Egyptian hieroglyphs presents significant challenges due to the vast stylistic variations and the scarcity of labeled data. While deep learning has shown promising results, existing approaches often rely on single-source or synthetic datasets, limiting their generalization ability. To advance research in hieroglyph recognition, we introduce the Multisource Egyptian Hieroglyphs (MEH) dataset, the first multi-style dataset for hieroglyph classification. MEH comprises 10 distinct groups, each representing a unique writing style, with labels derived from professionally verified text digitizations. Using this dataset, we explore three key aspects of hieroglyph recognition: (1) analyzing how different writing styles affect model generalization, (2) evaluating synthetic data generation for expanding hieroglyph class coverage, and (3) assessing classification performance of existing models. To support future large-scale dataset creation, we propose a style-aware synthetic data generation method and introduce a hieroglyph labeling tool to simplify annotation and accelerate text digitization.
Poster
Bangxiang Lan · Ruobing Xie · Ruixiang Zhao · Xingwu Sun · Zhanhui Kang · Gang Yang · Xirong Li

[ Exhibit Hall I ]

Abstract
The Text-to-Video Retrieval (T2VR) task aims to retrieve unlabeled videos by textual queries with the same semantic meanings. Recent CLIP-based approaches have explored two frameworks, Two-Tower versus Single-Tower, yet the former suffers from low effectiveness, while the latter suffers from low efficiency. In this study, we explore a new Hybrid-Tower framework that can hybridize the advantages of the Two-Tower and Single-Tower frameworks, achieving high effectiveness and efficiency simultaneously. We propose a novel hybrid method, Fine-grained Pseudo-query Interaction and Generation for T2VR, i.e., F-Pig, which includes a new pseudo-query generator designed to generate a pseudo-query for each video. This enables the video feature and the textual features of the pseudo-query to interact in a fine-grained manner, similar to Single-Tower approaches, thereby maintaining high effectiveness even before the real textual query is received. Simultaneously, our method introduces no additional storage or computational overhead compared to the Two-Tower framework during the inference stage, thus maintaining high efficiency. Extensive experiments on five commonly used text-video retrieval benchmarks, including MSRVTT-1k, MSRVTT-3k, MSVD, VATEX and DiDeMo, demonstrate that our method achieves a significant improvement over the baseline, with an increase of $1.6\% \sim 3.9\%$ in R@1.
Poster
Ruiting Dai · Chenxi Li · Yandong Yan · Lisi Mo · Ke Qin · Tao He

[ Exhibit Hall I ]

Abstract
Previous multimodal learning models for missing modalities predominantly employ diffusion models to recover absent data conditioned on the available modalities. However, these approaches often overlook a critical issue: modality generation bias. In other words, while some modalities may be generated with high quality, others, such as video, may prove challenging to synthesize effectively. We argue that this limitation is primarily due to the inherent modality gap, ultimately resulting in imbalanced training. To overcome this challenge, we introduce a novel Multi-stage Duplex Diffusion Network (MD$^2$N) designed to achieve unbiased missing-modality recovery. The key idea of our approach is the development of a modality transfer module within the recovery process, which facilitates smooth cross-modality generation. This module is trained using duplex diffusion models, enabling the available and missing modalities to generate each other in an intersecting manner through three distinct stages: global structure generation, modality transfer, and local cross-modal refinement. During training, the generation of the available and missing data mutually influence each other, eventually reaching a balanced generation state. Experimental results demonstrate that our proposed method significantly outperforms current state-of-the-art techniques, achieving up to a 4% improvement over IMDer on the CMU-MOSEI dataset.
Poster
Juncan Deng · Shuaiting Li · Zeyu Wang · Kedong Xu · Hong Gu · Kejie Huang

[ Exhibit Hall I ]

Abstract
Visual Mamba networks (ViMs) extend the selective space state model (Mamba) to various vision tasks and demonstrate significant potential. Vector quantization (VQ), on the other hand, decomposes network weights into codebooks and assignments, significantly reducing memory usage and computational latency to enable ViMs deployment on edge devices. Although existing VQ methods have achieved extremely low-bit quantization (e.g., 3-bit, 2-bit, and 1-bit) in convolutional neural networks and Transformer-based networks, directly applying these methods to ViMs results in unsatisfactory accuracy. We identify several key challenges: 1) The weights of Mamba-based blocks in ViMs contain numerous outliers, significantly amplifying quantization errors. 2) When applied to ViMs, the latest VQ methods suffer from excessive memory consumption, lengthy calibration procedures, and suboptimal performance in the search for optimal codewords. In this paper, we propose ViM-VQ, an efficient post-training vector quantization method tailored for ViMs. ViM-VQ consists of two innovative components: 1) a fast convex combination optimization algorithm that efficiently updates both the convex combinations and the convex hulls to search for optimal codewords, and 2) an incremental vector quantization strategy that incrementally confirms optimal codewords to mitigate truncation errors. Experimental results demonstrate that ViM-VQ achieves state-of-the-art performance in low-bit quantization across various visual tasks.
Poster
Xi Fang · Jiankun Wang · Xiaochen Cai · Shang Chien · Shuwen Yang · Haoyi Tao · Nan wang · Lin Yao · Linfeng Zhang · Guolin Ke

[ Exhibit Hall I ]

Abstract
In recent decades, chemistry publications and patents have increased rapidly. A significant portion of key information is embedded in molecular structure figures, complicating large-scale literature searches and limiting the application of large language models in fields such as biology, chemistry, and pharmaceuticals. The automatic extraction of precise chemical structures is therefore of critical importance. However, the presence of numerous Markush structures in real-world documents, along with variations in molecular image quality, drawing styles, and noise, significantly limits the performance of existing optical chemical structure recognition (OCSR) methods. We present MolParser, a novel end-to-end OCSR method that efficiently and accurately recognizes chemical structures from real-world documents, including difficult Markush structures. We use an extended SMILES encoding rule to annotate our training dataset. Under this rule, we build MolParser-7M, the largest annotated molecular image dataset to our knowledge. While utilizing a large amount of synthetic data, we employed active learning methods to incorporate substantial in-the-wild data, specifically samples cropped from real patents and scientific literature, into the training process. We trained an end-to-end molecular image captioning model, MolParser, using a curriculum learning approach. MolParser significantly outperforms classical and learning-based methods across most scenarios, with potential for broader downstream applications. The dataset is publicly …
Poster
Zeqi Zheng · Yanchen Huang · Yingchao Yu · Zizheng Zhu · Junfeng Tang · Zhaofei Yu · Yaochu Jin

[ Exhibit Hall I ]

Abstract
Spiking Neural Networks (SNNs) based on Transformers have garnered significant attention due to their superior performance and high energy efficiency. However, the spiking attention modules of most existing Transformer-based SNNs are adapted from those of analog Transformers, failing to fully address the issue of over-allocating attention to irrelevant contexts. To fix this fundamental yet overlooked issue, we propose a Lateral Inhibition-inspired Spiking Transformer (SpiLiFormer). It emulates the brain's lateral inhibition mechanism, guiding the model to enhance attention to relevant tokens while suppressing attention to irrelevant ones. Our model achieves state-of-the-art (SOTA) performance across multiple datasets, including CIFAR-10 (+0.45\%), CIFAR-100 (+0.48\%), CIFAR10-DVS (+2.70\%), N-Caltech101 (+1.94\%), and ImageNet-1K (+1.6\%). Notably, on the ImageNet-1K dataset, SpiLiFormer (69.9M parameters, 4 time steps, 384 resolution) outperforms E-SpikeFormer (173.0M parameters, 8 time steps, 384 resolution), a SOTA spiking Transformer, by 0.46\% using only 39\% of the parameters and half the time steps. Our code and training checkpoints will be released upon acceptance.
Poster
Zhuqiang Lu · Zhenfei Yin · Mengwei He · Zhihui Wang · Zicheng Liu · Zhiyong Wang · Kun Hu

[ Exhibit Hall I ]

Abstract
Recently, Vision Large Language Models (VLLMs) with integrated vision encoders have shown promising performance in vision understanding. They encode visual content into sequences of visual tokens, enabling joint processing of visual and textual data. However, understanding videos, especially long videos, remains a challenge, as the rapid growth of visual tokens during video encoding risks exceeding VLLMs' context window length and significantly escalates computational cost. To restrict the number of visual tokens, existing VLLMs either (1) uniformly downsample videos into a fixed number of frames or (2) reduce the number of visual tokens encoded from each frame. We argue that the former neglects temporal dynamics in videos, while the latter fails to preserve spatial details within individual frames. In this work, we propose Balanced-VLLM (B-VLLM), a novel VLLM framework designed to model task-relevant spatio-temporal cues while restricting the number of visual tokens within the VLLM's context window length. Central to our framework is a text-conditioned adaptive frame selection module that dynamically identifies task-relevant frames, which are further de-duplicated with a temporal frame token merging strategy. The visual tokens of these frames then undergo spatial token sampling and an optional spatial token merging strategy for granular control against the token budget. Experiments …
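A hedged sketch of text-conditioned frame selection with de-duplication, assuming precomputed frame and query embeddings: frames are ranked by similarity to the query, the top-k are kept in temporal order, and near-duplicates of the previously kept frame are dropped. The threshold and ranking rule are illustrative, not B-VLLM's exact modules.

import torch
import torch.nn.functional as F

def select_frames(frame_embs, query_emb, k=32, dedup_thresh=0.95):
    # frame_embs: (T, D) per-frame embeddings; query_emb: (D,) text embedding
    scores = F.normalize(frame_embs, dim=-1) @ F.normalize(query_emb, dim=0)
    keep = scores.topk(min(k, scores.numel())).indices.sort().values
    selected, last = [], None
    for idx in keep.tolist():
        emb = F.normalize(frame_embs[idx], dim=0)
        if last is None or float(torch.dot(emb, last)) < dedup_thresh:
            selected.append(idx)       # keep frame if not a near-duplicate
            last = emb
    return selected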
Poster
Zhiqi Ge · Juncheng Li · Xinglei Pang · Minghe Gao · Kaihang Pan · Wang Lin · Hao Fei · Wenqiao Zhang · Siliang Tang · Yueting Zhuang

[ Exhibit Hall I ]

Abstract
Digital agents are increasingly employed to automate tasks in interactive digital environments such as web pages, software applications, and operating systems. While text-based agents built on Large Language Models (LLMs) often require frequent updates due to platform-specific APIs, visual agents leveraging Multimodal Large Language Models (MLLMs) offer enhanced adaptability by interacting directly with Graphical User Interfaces (GUIs). However, these agents face significant challenges in visual perception, particularly when handling high-resolution, visually complex digital environments. This paper introduces Iris, a foundational visual agent that addresses these challenges through two key innovations: Information-Sensitive Cropping (ISC) and Self-Refining Dual Learning (SRDL). ISC dynamically identifies and prioritizes visually dense regions using an edge detection algorithm, enabling efficient processing by allocating more computational resources to areas with higher information density. SRDL enhances the agent's ability to handle complex tasks by leveraging a dual-learning loop, where improvements in referring (describing UI elements) reinforce grounding (locating elements) and vice versa, all without requiring additional annotated data. Empirical evaluations demonstrate that Iris achieves state-of-the-art performance across multiple benchmarks with only 850K GUI annotations, outperforming methods using 10x more training data. These improvements further translate to significant gains in both web and OS agent downstream tasks. The project is …
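To make the cropping idea concrete, the sketch below ranks fixed-size crops of a screenshot by edge density so that visually dense regions are processed first; the grid layout, gradient-based edge measure, and crop size are our assumptions rather than the actual ISC algorithm.

import numpy as np

def rank_crops_by_edge_density(gray, crop=256):
    # gray: (H, W) grayscale screenshot as floats in [0, 1]
    gy, gx = np.gradient(gray)
    edges = np.hypot(gx, gy)                      # simple edge-strength map
    h, w = gray.shape
    boxes = [(y, x, float(edges[y:y + crop, x:x + crop].mean()))
             for y in range(0, h - crop + 1, crop)
             for x in range(0, w - crop + 1, crop)]
    # densest crops first; an agent would spend more compute on these regions
    return sorted(boxes, key=lambda b: b[2], reverse=True)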
Poster
Aashish Sharma

[ Exhibit Hall I ]

Abstract
In this paper, we address the problem of small object detection (SOD) by introducing our novel approach: the Dynamically Multiplexed Expanded Features Set (DM-EFS) form. Detecting small objects is challenging as they usually suffer from inadequate feature representation. To address this, we propose the Expanded Features Set (EFS) form, a simple yet effective idea that improves the feature representation of small objects by utilizing the untapped higher-resolution features from the shallower layers of the backbone module. We observe that the EFS form improves SOD performance. However, due to the processing of additional features, it has a higher computational cost, which reduces inference efficiency. To address this, we propose Dynamic Feature Multiplexing (DFM), a novel design that optimizes the usage of the EFS form during inference by dynamically multiplexing it to create our aforementioned DM-EFS form. Since our DM-EFS form is a multiplexed (or subsampled) optimal version of the EFS form, it improves SOD performance like the EFS form but with a lower computational cost. Extensive experiments confirm the efficacy of our DM-EFS approach. Integrated with the YOLOv7 base model, our DM-EFS achieves state-of-the-art results on diverse SOD datasets, outperforming the base model and SOD …
Poster
Zonglin Di · Jing Shi · Yifei Fan · Hao Tan · Alexander Black · John Collomosse · Yang Liu

[ Exhibit Hall I ]

Abstract
The image difference captioning (IDC) task is to describe the distinctions between two images. However, existing datasets do not offer comprehensive coverage across all image-difference categories. In this work, we introduce DiffTell, a high-quality dataset covering various types of image manipulations, including global image alterations, object-level changes, and text manipulations. Data quality is controlled by careful human filtering. Additionally, to scale up data collection without prohibitive human labor costs, we explore the possibility of automatic filtering for quality control. We demonstrate that both traditional methods and recent multimodal large language models (MLLMs) exhibit performance improvements on the IDC task after training on the DiffTell dataset. Through extensive ablation studies, we provide a detailed analysis of the performance gains attributed to DiffTell. Experiments show DiffTell significantly enhances the availability of resources for IDC research, offering a more comprehensive foundation and benchmark for future investigations.
Poster
Ao Wang · Lihao Liu · Hui Chen · Zijia Lin · Jungong Han · Guiguang Ding

[ Exhibit Hall I ]

Abstract
Object detection and segmentation are widely employed in computer vision applications, yet conventional models like YOLO series, while efficient and accurate, are limited by predefined categories, hindering adaptability in open scenarios. Recent open-set methods leverage text prompts, visual cues, or prompt-free paradigm to overcome this, but often compromise between performance and efficiency due to high computational demands or deployment complexity. In this work, we introduce YOLOE, which integrates detection and segmentation across diverse open prompt mechanisms within a single highly efficient model, achieving real-time seeing anything. For text prompts, we propose Re-parameterizable Region-Text Alignment (RepRTA) strategy. It refines pretrained textual embeddings via a re-parameterizable lightweight auxiliary network and enhances visual-textual alignment with zero inference and transferring overhead. For visual prompts, we present Semantic-Activated Visual Prompt Encoder (SAVPE). It employs decoupled semantic and activation branches to bring improved visual embedding and accuracy with minimal complexity. For prompt-free scenario, we introduce Lazy Region-Prompt Contrast (LRPC) strategy. It utilizes a built-in large vocabulary and specialized embedding to identify all objects, avoiding costly language model dependency. Extensive experiments show YOLOE's exceptional zero-shot performance and transferability with high inference efficiency and low training cost. Notably, on LVIS, with 3$\times$ less training cost and 1.4$\times$ inference …
Poster
WU Sitong · Haoru Tan · Yukang Chen · Shaofeng Zhang · Jingyao Li · Bei Yu · Xiaojuan Qi · Jiaya Jia

[ Exhibit Hall I ]

Abstract
Evaluating the quality of image-text pair data plays a crucial role in various data processing strategies for vision-language pre-training. Currently, most popular metrics rely on off-the-shelf vision-language models to generate quality scores for paired images and text based on their feature similarity, such as CLIP-Score. However, we observe a prevalent phenomenon: different scoring models yield varying quality scores for the same data. This quality score disparity directly affects the result of data processing, leading to discrepancies between datasets processed using different quality scores. This dataset disparity further results in performance differences between models individually trained on datasets processed by distinct quality scores. Notably, no single quality score performs optimally across all evaluation tasks. Each score exhibits an inherent bias towards certain concepts or tasks, and different scores have complementary effects on model performance. This creates considerable confusion when choosing a scoring model. In this paper, we first investigate these disparity phenomena and analyze their cause. Then, we propose a simple yet effective method, named Mixture-of-Scores (MoS), to extract the essence of existing quality scores while eliminating their biases by integrating them into a more robust score based on a data-adaptive ensemble strategy. Particularly, …
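As a toy illustration of score ensembling (not the paper's data-adaptive strategy), individual quality scores can be z-normalized per scorer so their scales are comparable and then averaged into a single, more robust score:

import numpy as np

def mixture_of_scores(score_matrix):
    # score_matrix: (num_samples, num_scorers), e.g. CLIP-Score and other metrics
    z = (score_matrix - score_matrix.mean(axis=0)) / (score_matrix.std(axis=0) + 1e-8)
    return z.mean(axis=1)     # one combined score per image-text pair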
Poster
Weiying Xie · Zihan Meng · Jitao Ma · Wenjin Guo · Haowei Li · Haonan Qin · Leyuan Fang · Yunsong Li

[ Exhibit Hall I ]

Abstract
Quantization-aware Training (QAT) helps deep models adapt to precision loss by simulating quantization operations. However, existing methods fail to reach the optimal solution due to inadequate exploration of the quantization solution space. To address this issue, we propose a novel QAT method, Allowing Oscillation Quantization (AOQ), which expands the reachable solution space through weight oscillation. Notably, unlike previous methods that suppress oscillation throughout training, AOQ promotes oscillation in the early and middle training stages to explore a broader range of quantized configurations, and suppresses oscillation in the later stage to ensure stable convergence. Furthermore, by decoupling the quantization thresholds and levels, we encourage meaningful oscillation and improve the stability of the learnable quantization parameters. Extensive experiments on various models, including ResNet, MobileNet, DeiT and Swin Transformer, demonstrate the effectiveness of our method. Specifically, with 2-bit quantization, AOQ achieves a performance improvement of $0.4\%\sim 2.2\%$ on ImageNet compared to state-of-the-art methods.
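A simplified sketch under our own assumptions: fake quantization with a straight-through estimator, plus a damping term that is disabled in the early and middle stages (so weights may oscillate across quantization thresholds) and enabled late in training to pull weights toward their nearest level. The schedule and damping form are illustrative, not AOQ's exact formulation.

import torch

def fake_quant(w, scale, bits=2):
    # straight-through estimator: quantized values in the forward pass,
    # identity gradient in the backward pass
    qmax = 2 ** (bits - 1) - 1
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return w + (q * scale - w).detach()

def oscillation_damping(w, scale, epoch, total_epochs, late_frac=0.8):
    # early/middle stages: no penalty, weights are free to oscillate
    if epoch < late_frac * total_epochs:
        return w.new_zeros(())
    # late stage: penalize distance to the nearest quantization level,
    # which discourages further threshold crossings and stabilizes convergence
    dist = torch.abs(w / scale - torch.round(w / scale))
    return dist.mean()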
Poster
Vladimir Kulikov · Matan Kleiner · Inbar Huberman-Spiegelglas · Tomer Michaeli

[ Exhibit Hall I ]

Abstract
Editing real images using a pre-trained text-to-image (T2I) diffusion/flow model often involves inverting the image into its corresponding noise map. However, inversion by itself is typically insufficient for obtaining satisfactory results, and therefore many methods additionally intervene in the sampling process. Such methods achieve improved results but are not seamlessly transferable between model architectures. Here, we introduce FlowEdit, a text-based editing method for pre-trained T2I flow models, which is inversion-free, optimization-free and model agnostic. Our method constructs an ODE that directly maps between the source and target distributions (corresponding to the source and target text prompts) and achieves a lower transport cost than the inversion approach. This leads to state-of-the-art results, as we illustrate with Stable Diffusion 3 and FLUX.
Poster
Yiren Song · Danze Chen · Mike Zheng Shou

[ Exhibit Hall I ]

Abstract
Generating cognitively aligned layered SVGs remains challenging due to existing methods' tendencies toward either oversimplified single-layer outputs or optimization-induced shape redundancies. We propose LayerTracer, a DiT-based framework that bridges this gap by learning designers' layered SVG creation processes from a novel dataset of sequential design operations. Our approach operates in two phases: first, a text-conditioned DiT generates multi-phase rasterized construction blueprints that simulate human design workflows; second, layer-wise vectorization with path deduplication produces clean, editable SVGs. For image vectorization, we introduce a conditional diffusion mechanism that encodes reference images into latent tokens, guiding hierarchical reconstruction while preserving structural integrity. Extensive experiments show that LayerTracer surpasses optimization-based and neural baselines in generation quality and editability.

Demonstration: Demos 5 Thu 23 Oct 10:45 a.m.  

  • CF3: Compact and Fast 3D Feature Fields, Hyunjoon Lee, Joonkyu Min, Jaesik Park
  • Stable Signer, Sen Fang
  • Hi3DGen: High-fidelity 3D Geometry Generation from Images via Normal Bridging, Chongjie Ye

Oral 6B: Segmentation and grouping Thu 23 Oct 01:00 p.m.  

Oral
Dengke Zhang · Fagui Liu · Quan Tang

[ Kalakaua Ballroom ]

Abstract
Open-vocabulary semantic segmentation aims to assign semantic labels to each pixel without being constrained by a predefined set of categories. While Contrastive Language-Image Pre-training (CLIP) excels in zero-shot classification, it struggles to align image patches with category embeddings because of its incoherent patch correlations. This study reveals that inter-class correlations are the main reason for impairing CLIP's segmentation performance. Accordingly, we propose CorrCLIP, which reconstructs the scope and value of patch correlations. Specifically, CorrCLIP leverages the Segment Anything Model (SAM) to define the scope of patch interactions, reducing inter-class correlations. To mitigate the problem that SAM-generated masks may contain patches belonging to different classes, CorrCLIP incorporates self-supervised models to compute coherent similarity values, suppressing the weight of inter-class correlations. Additionally, we introduce two additional branches to strengthen patch features’ spatial details and semantic representation. Finally, we update segmentation maps with SAM-generated masks to improve spatial consistency. Based on the improvement across patch correlations, feature representations, and segmentation maps, CorrCLIP achieves superior performance across eight benchmarks.
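A minimal sketch of the correlation-restriction idea as we understand it (hypothetical names, not CorrCLIP's code): patch-to-patch similarity is computed from self-supervised features, and entries are kept only when two patches fall inside the same SAM-generated mask, so inter-class correlations are suppressed before the weights are used to refine CLIP's patch features.

import torch
import torch.nn.functional as F

def masked_patch_correlation(ssl_feats, mask_ids):
    # ssl_feats: (N, D) self-supervised patch features
    # mask_ids:  (N,) index of the SAM mask each patch belongs to
    feats = F.normalize(ssl_feats, dim=-1)
    sim = feats @ feats.T
    same_mask = mask_ids[:, None] == mask_ids[None, :]
    sim = sim.masked_fill(~same_mask, float("-inf"))  # restrict correlation scope
    return sim.softmax(dim=-1)   # weights for aggregating CLIP patch features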
Oral
WEIMING ZHANG · Dingwen Xiao · Lei Chen · Lin Wang

[ Kalakaua Ballroom ]

Abstract
Entity Segmentation (ES) aims at identifying and segmenting distinct entities within an image without the need for predefined class labels. This characteristic makes ES well-suited to open-world applications with adaptation to diverse and dynamically changing environments, where new and previously unseen entities may appear frequently. Existing ES methods either require large annotated datasets or high training costs, limiting their scalability and adaptability. Recently, the Segment Anything Model (SAM), especially in its Automatic Mask Generation (AMG) mode, has shown potential for holistic image segmentation. However, it struggles with over-segmentation and under-segmentation, making it less effective for ES. In this paper, we introduce E-SAM, a novel training-free framework that exhibits exceptional ES capability. Specifically, we first propose Multi-level Mask Generation (MMG) that hierarchically processes SAM's AMG outputs to generate reliable object-level masks while preserving fine details at other levels. Entity-level Mask Refinement (EMR) then refines these object-level masks into accurate entity-level masks. That is, it separates overlapping masks to address the redundancy issues inherent in SAM's outputs and merges similar masks by evaluating entity-level consistency. Lastly, Under-Segmentation Refinement (USR) addresses under-segmentation by generating additional high-confidence masks fused with EMR outputs to produce the final ES map. These three modules are seamlessly optimized …
Oral
Yiqing Shen · Bohan Liu · Chenjia Li · Lalithkumar Seenivasan · Mathias Unberath

[ Kalakaua Ballroom ]

Abstract
Reasoning segmentation (RS) aims to identify and segment objects of interest based on implicit text queries. As such, RS is a catalyst for embodied AI agents, enabling them to interpret high-level commands without requiring explicit step-by-step guidance. However, current RS approaches rely heavily on the visual perception capabilities of multimodal large language models (LLMs), leading to several major limitations. First, they struggle with queries that require multiple steps of reasoning or those that involve complex spatial/temporal relationships. Second, they necessitate LLM fine-tuning, which may require frequent updates to maintain compatibility with contemporary LLMs and may increase risks of catastrophic forgetting during fine-tuning. Finally, being primarily designed for static images or offline video processing, they scale poorly to online video data. To address these limitations, we propose an agent framework that disentangles perception and reasoning for online video RS without LLM fine-tuning. Our innovation is the introduction of a just-in-time digital twin concept, where -- given an implicit query -- an LLM plans the construction of a low-level scene representation from high-level video using specialist vision models. We refer to this approach to creating a digital twin as "just-in-time" because the LLM planner will anticipate the need for specific information and …
Oral
Andrea Simonelli · Norman Müller · Peter Kontschieder

[ Kalakaua Ballroom ]

Abstract
The increasing availability of digital 3D environments, whether through image reconstruction, generation, or scans obtained via lasers or robots, is driving innovation across various fields. Among the numerous applications, there is a significant demand for those that enable 3D interaction, such as 3D Interactive Segmentation, which is useful for tasks like object selection and manipulation. Additionally, there is a persistent need for solutions that are efficient, precise, and consistently perform well across diverse settings, particularly in unseen environments and with unfamiliar objects. In this work, we introduce a method that consistently surpasses previous state-of-the-art techniques on both in-domain and out-of-domain datasets. Our simple approach integrates a voxel-based sparse encoder with a lightweight transformer-based decoder that implements implicit click fusion, achieving superior performance and maximizing efficiency. Our method demonstrates substantial improvements on benchmark datasets, including ScanNet, ScanNet++, S3DIS, and KITTI-360, and also on unseen geometric distributions such as Gaussian Splatting.
Oral
Binbin Xiang · Maciej Wielgosz · Stefano Puliti · Kamil Král · Martin Krůček · Azim Missarov · Rasmus Astrup

[ Kalakaua Ballroom ]

Abstract
The segmentation of forest LiDAR 3D point clouds, including both individual tree and semantic segmentation, is fundamental for advancing forest management and ecological research. However, current approaches often struggle with the complexity and variability of natural forest environments. We present ForestFormer3D, a new unified and end-to-end framework designed for precise individual tree and semantic segmentation. ForestFormer3D incorporates ISA-guided query point selection, a score-based block merging strategy during inference, and a one-to-many association mechanism for effective training. By combining these new components, our model achieves state-of-the-art performance for individual tree segmentation on the newly introduced FOR-instanceV2 dataset, which spans diverse forest types and regions. Additionally, ForestFormer3D generalizes well to unseen test sets (Wytham woods and LAUTx), showcasing its robustness across different forest conditions and sensor modalities. The FOR-instanceV2 dataset and the ForestFormer3D code will be released post-acceptance.

Oral 6A: Physical Scene Perception Thu 23 Oct 01:00 p.m.  

Oral
Elisabetta Fedele · Boyang Sun · Francis Engelmann · Marc Pollefeys · Leonidas Guibas

[ Exhibit Hall III ]

Abstract
We present SuperDec, an approach for compact 3D scene representations based on geometric primitives, namely superquadrics. While most recent works leverage geometric primitives to obtain photorealistic 3D scene representations, we propose to leverage them to obtain a compact yet expressive representation. We propose to solve the problem locally on individual objects and leverage the capabilities of instance segmentation methods to scale our solution to full 3D scenes. In doing so, we design a new architecture which efficiently decomposes point clouds of arbitrary objects into a compact set of superquadrics. We train our architecture on ShapeNet and demonstrate its generalization capabilities on object instances extracted from the ScanNet++ dataset as well as on full Replica scenes. Finally, we show how a compact representation based on superquadrics can be useful for a diverse range of downstream applications, including robotic tasks and controllable visual content generation and editing.
Oral
Hamadi Chihaoui · Paolo Favaro

[ Exhibit Hall III ]

Abstract
Zero-shot image restoration (IR) methods based on pretrained diffusion models have recently achieved significant success. These methods typically require at least a parametric form of the degradation model. However, in real-world scenarios, the degradation may be too complex to define explicitly. To handle this general case, we introduce the Diffusion Image Prior (DIIP). We take inspiration from the Deep Image Prior (DIP), since it can be used to remove artifacts without the need for an explicit degradation model. However, in contrast to DIP, we find that pretrained diffusion models offer a much stronger prior, despite being trained without knowledge of corrupted data. We show that the optimization process in DIIP first reconstructs a clean version of the image before eventually overfitting to the degraded input, but it does so for a broader range of degradations than DIP. In light of this result, we propose a blind image restoration (IR) method based on early stopping, which does not require prior knowledge of the degradation model. We validate DIIP on various degradation-blind IR tasks, including JPEG artifact removal, deblurring, denoising, and super-resolution, with state-of-the-art results.
Oral

[ Exhibit Hall III ]

Abstract
A lens brings a $\textit{single}$ plane into focus on a planar sensor; hence, parts of the scene that are outside this focal plane are resolved on the sensor under defocus. Can we break this precept by enabling a lens that can change its depth-of-field arbitrarily? This work investigates the design and implementation of such a computational lens with spatially-selective focusing. Our design uses an optical arrangement of Lohmann lenses and phase spatial light modulators to allow each pixel to focus onto a different depth. We extend classical techniques used in autofocusing to the spatially-varying scenario where the depth map is iteratively estimated using contrast and disparity cues, enabling the camera to progressively shape its depth-of-field to the scene's depth. By obtaining an optical all-in-focus image, our technique advances upon a broad swathe of prior work ranging from depth-from-focus/defocus to coded aperture techniques in two key aspects: the ability to bring an entire scene into focus simultaneously, and the ability to maintain the highest possible spatial resolution.
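For readers unfamiliar with the contrast cue mentioned above, the snippet below is a much-simplified, classical depth-from-focus baseline (per-pixel Laplacian energy over a focal stack); it is not the paper's iterative, spatially-selective procedure, and the filter size is an arbitrary choice.

```python
import numpy as np
from scipy.ndimage import laplace, uniform_filter

def contrast_depth_index(focal_stack):
    """Per-pixel contrast cue: pick the slice of a focal stack with the highest
    local Laplacian energy (a simplified stand-in for autofocus contrast measures).

    focal_stack : (S, H, W) grayscale images focused at S different depths.
    Returns an (H, W) map of the sharpest slice index per pixel.
    """
    focus_measure = np.stack(
        [uniform_filter(laplace(img.astype(np.float64)) ** 2, size=9) for img in focal_stack]
    )
    return focus_measure.argmax(axis=0)

# Placeholder usage with a random stack (a real stack would come from the camera).
depth_index = contrast_depth_index(np.random.rand(8, 128, 128))
```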
Oral

[ Exhibit Hall III ]

Abstract
mmWave radars are compact, inexpensive, and durable sensors that are robust to occlusions and work regardless of environmental conditions, such as weather and darkness. However, this comes at the cost of poor angular resolution, especially for inexpensive single-chip radars, which are typically used in automotive and indoor sensing applications. Although many have proposed learning-based methods to mitigate this weakness, no standardized foundational models or large datasets for the mmWave radar have emerged, and practitioners have largely trained task-specific models from scratch using relatively small datasets. In this paper, we collect (to our knowledge) the largest available raw radar dataset with 1M samples (29 hours) and train a foundational model for 4D single-chip radar, which can predict 3D occupancy and semantic segmentation with quality that is typically only possible with much higher resolution sensors. We demonstrate that our Generalizable Radar Transformer (GRT) generalizes across diverse settings, can be fine-tuned for different tasks, and exhibits logarithmic data scaling of roughly 20\% per $10\times$ increase in data. We also run extensive ablations on common design decisions, and find that using raw radar data significantly outperforms widely-used lossy representations, equivalent to a $10\times$ increase in training data. Finally, we estimate a total data requirement of $\approx$100M samples (3000 …
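To make the quoted scaling behaviour concrete, the toy snippet below reads the "20% per 10x data" figure as an additive logarithmic scaling law and extrapolates it; the anchor numbers are invented for illustration and are not from the paper.

```python
import numpy as np

# Hypothetical anchor point: metric m0 observed at n0 samples (NOT the paper's numbers).
n0, m0 = 1e6, 0.50            # 1M samples, illustrative metric value
gain_per_decade = 0.20        # the abstract's "20% per 10x data", read as an additive gain

def projected_metric(n):
    """Logarithmic scaling law: +gain_per_decade for every 10x increase in data."""
    return m0 + gain_per_decade * np.log10(n / n0)

for n in [1e6, 1e7, 1e8]:     # extrapolating toward the ~100M-sample regime
    print(f"{n:.0e} samples -> projected metric {projected_metric(n):.2f}")
```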
Oral
Xinyu Zhou · Peiqi Duan · Yeliduosi Xiaokaiti · Chao Xu · Boxin Shi

[ Exhibit Hall III ]

Abstract
Visual vibrometry has emerged as a powerful technique for remote acquisition of audio signals and the physical properties of materials. To capture high-frequency vibrations, frame-based visual vibrometry approaches often require a high-speed video camera and bright lighting to compensate for the short exposure time. In this paper, we introduce event-based visual vibrometry, a new high-speed visual vibration sensing method using an event camera. Exploiting the high temporal resolution, dynamic range, and low bandwidth characteristics of event cameras, event-based visual vibrometry achieves high-speed vibration sensing under common lighting conditions with enhanced data efficiency. Specifically, we leverage a hybrid camera system and propose an event-based subtle motion estimation framework that integrates an optimization-based approach for estimating coarse motion within short time intervals and a neural network to mitigate the inaccuracies in the coarse motion estimation. We demonstrate our method by capturing vibration caused by audio sources and estimating material properties for various objects.

Poster Session 6 & Exhibit Hall with Coffee Break Thu 23 Oct 02:30 p.m.  

Poster
Qingqian Yang · Peishen Yan · Xiaoyu Wu · Jiaru Zhang · Tao Song · Yang Hua · Hao Wang · Liangliang Wang · Haibing Guan

[ Exhibit Hall I ]

Abstract
The distributed nature of federated learning exposes it to significant security threats, among which backdoor attacks are one of the most prevalent. However, existing backdoor attacks face a trade-off between attack strength and stealthiness: attacks maximizing the attack strength are often detectable, while stealthier approaches significantly reduce the effectiveness of the attack itself. Both of them result in ineffective backdoor injection. In this paper, we propose an adaptive layer-wise gradient alignment strategy to effectively evade various robust defense mechanisms while preserving attack strength. Without requiring additional knowledge, we leverage the previous global update as a reference for alignment to ensure stealthiness during dynamic FL training. This fine-grained alignment strategy applies appropriate constraints to each layer, which helps to significantly maintain attack impact. To demonstrate the effectiveness of our method, we conduct exhaustive evaluations across a wide range of datasets and networks. Our experimental results show that the proposed attack effectively bypasses eight state-of-the-art (SOTA) defenses and achieves high backdoor accuracy, outperforming existing attacks by up to 54.76%. Additionally, it significantly preserves attack strength and maintains robust performance across diverse scenarios, highlighting its adaptability and generalizability.
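The abstract does not spell out the alignment rule, so the following is only a guess at what layer-wise alignment to the previous global update could look like: each layer of the malicious update is interpolated toward a magnitude-matched copy of the reference update with its own coefficient. All names and coefficients are hypothetical.

```python
import torch

def align_update_layerwise(malicious_update, global_update, alphas):
    """Hypothetical layer-wise alignment (illustrative only, not the paper's exact rule).

    malicious_update / global_update : dicts mapping layer name -> tensor
    alphas : dict mapping layer name -> alignment strength in [0, 1]
             (0 keeps the raw attack update, 1 fully mimics the reference update)
    """
    aligned = {}
    for name, delta in malicious_update.items():
        ref = global_update[name]
        # Rescale the reference so the aligned update keeps the attack's magnitude.
        ref_scaled = ref * (delta.norm() / (ref.norm() + 1e-12))
        a = alphas[name]
        aligned[name] = (1 - a) * delta + a * ref_scaled
    return aligned

# Toy usage with two "layers" and hand-picked per-layer strengths.
mal = {"fc1": torch.randn(4, 4), "fc2": torch.randn(4)}
glob = {"fc1": torch.randn(4, 4), "fc2": torch.randn(4)}
aligned = align_update_layerwise(mal, glob, alphas={"fc1": 0.8, "fc2": 0.3})
```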
Poster
Weihao Yu · Yuanhao Cai · Ruyi Zha · Zhiwen Fan · Chenxin Li · Yixuan Yuan

[ Exhibit Hall I ]

Abstract
Four-dimensional computed tomography (4D CT) reconstruction is crucial for capturing dynamic anatomical changes but faces inherent limitations from conventional phase-binning workflows. Current methods discretize temporal resolution into fixed phases with respiratory gating devices, introducing motion misalignment and restricting clinical practicality. In this paper, we propose X$^2$-Gaussian, a novel framework that enables continuous-time 4D-CT reconstruction by integrating dynamic radiative Gaussian splatting with self-supervised respiratory motion learning. Our approach models anatomical dynamics through a spatiotemporal encoder-decoder architecture that predicts time-varying Gaussian deformations, eliminating phase discretization. To remove dependency on external gating devices, we introduce a physiology-driven periodic consistency loss that learns patient-specific breathing cycles directly from projections via differentiable optimization. Extensive experiments demonstrate state-of-the-art performance, achieving a 9.93 dB PSNR gain over traditional methods and 2.25 dB improvement against prior Gaussian splatting techniques. By unifying continuous motion modeling with hardware-free period learning, X$^2$-Gaussian advances high-fidelity 4D CT reconstruction for dynamic clinical imaging.
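A minimal sketch of the kind of periodicity prior described here, assuming an additive loss with a learnable breathing period; the actual loss and deformation parameterization in X$^2$-Gaussian may differ.

```python
import torch

class PeriodicConsistencyLoss(torch.nn.Module):
    """Illustrative periodicity prior: deformations should repeat after one breathing
    cycle of (learnable) period T. A guess at the idea, not the paper's exact loss."""

    def __init__(self, init_period_s=4.0):
        super().__init__()
        # Learnable breathing period, optimized jointly with the reconstruction.
        self.period = torch.nn.Parameter(torch.tensor(init_period_s))

    def forward(self, deform_fn, times):
        # deform_fn maps a (B,) tensor of times to (B, N, 3) Gaussian deformations.
        return torch.mean((deform_fn(times) - deform_fn(times + self.period)) ** 2)

# Toy usage with a synthetic sinusoidal deformation field of true period 4 s.
def toy_deform(t):
    return torch.sin(2 * torch.pi * t / 4.0).view(-1, 1, 1).expand(-1, 16, 3)

loss_fn = PeriodicConsistencyLoss(init_period_s=3.5)
loss = loss_fn(toy_deform, torch.linspace(0.0, 8.0, 32))
loss.backward()     # gradients also flow into the learnable period
```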
Poster
Hyunjun Jung · Hae-Gon Jeon

[ Exhibit Hall I ]

Abstract
The concept of light fields computed from multi-view images on regular grids has proven its benefit for scene representation, supporting realistic rendering of novel views and photographic effects such as refocusing and shallow depth of field. Despite its effectiveness for light flow computation, obtaining light fields requires either high computational cost or specialized devices such as a bulky camera setup or a microlens array. In an effort to broaden its benefit and applicability, in this paper, we propose a novel view synthesis method for light field generation from only single images, named $\textit{inverse image-based rendering}$. Unlike previous attempts to implicitly rebuild 3D geometry or to explicitly represent objective scenes, our method reconstructs light flows in a space from image pixels, which behaves in the opposite way to image-based rendering. To accomplish this, we design a neural rendering pipeline to render a target ray at an arbitrary viewpoint. Our neural renderer first stores the light flow of source rays from the input image, then computes the relationships among them through cross-attention, and finally predicts the color of the target ray based on these relationships. After the rendering pipeline generates the first novel view from a single input image, the …
Poster
Xiyu Zhang · Jiayi Ma · Jianwei Guo · Wei Hu · Zhaoshuai Qi · Fei HUI · Jiaqi Yang · Yanning Zhang

[ Exhibit Hall I ]

Abstract
Geometric constraints between feature matches are critical in 3D point cloud registration problems. Existing approaches typically model unordered matches as a consistency graph and sample consistent matches to generate hypotheses. However, explicit graph construction introduces noise, posing great challenges for handcrafted geometric constraints to render consistency among matches. To overcome this, we propose HyperGCT, a flexible dynamic $\bf{Hyper}$-$\bf{G}$NN-learned geometric $\bf{C}$onstrain$\bf{T}$ that leverages high-order consistency among 3D correspondences. To our knowledge, HyperGCT is the first method that mines robust geometric constraints from dynamic hypergraphs for 3D registration. By dynamically optimizing the hypergraph through vertex and edge feature aggregation, HyperGCT effectively captures the correlations among correspondences, leading to accurate hypothesis generation. Extensive experiments on 3DMatch, 3DLoMatch, KITTI-LC, and ETH show that HyperGCT achieves state-of-the-art performance. Furthermore, our method is robust to graph noise, demonstrating a significant advantage in terms of generalization. The code will be released.
Poster
Feng Yang · Yichao Cao · Xiu Su · Dan Niu · Xuanpeng Li

[ Exhibit Hall I ]

Abstract
Understanding real-world 3D point clouds is challenging due to domain shifts, causing geometric variations like density changes, noise, and occlusions. The key challenge is disentangling domain-invariant semantics from domain-specific geometric variations, as point clouds exhibit local inconsistency and global redundancy, making direct alignment ineffective. To address this, we propose CounterPC, a counterfactual intervention-based domain adaptation framework, which formulates domain adaptation within a causal latent space, identifying category-discriminative features entangled with intra-class geometric variation confounders. Through counterfactual interventions, we generate counterfactual target samples that retain domain-specific characteristics while improving class separation, mitigating domain bias for optimal feature transfer. To achieve this, we introduce two key modules: i) Joint Distribution Alignment, which leverages 3D foundation models (3D-FMs) and a self-supervised autoregressive generative prediction task to unify feature alignment, and ii) Counterfactual Feature Realignment, which employs Optimal Transport (OT) to align category-relevant and category-irrelevant feature distributions, ensuring robust sample-level adaptation while preserving domain and category properties. CounterPC outperforms state-of-the-art methods on PointDA and GraspNetPC-10, achieving accuracy improvements of 4.7 and 3.6, respectively. Code and pre-trained weights will be publicly released.
Poster
Jiawei Xu · Kai Deng · Zexin Fan · Shenlong Wang · Jin Xie · jian Yang

[ Exhibit Hall I ]

Abstract
Modeling and rendering dynamic urban driving scenes is crucial for self-driving simulation. Current high-quality methods typically rely on costly manual object tracklet annotations, while self-supervised approaches fail to capture dynamic object motions accurately and decompose scenes properly, resulting in rendering artifacts. We introduce AD-GS, a novel self-supervised framework for high-quality free-viewpoint rendering of driving scenes from a single log. At its core is a novel learnable motion model that integrates locality-aware B-spline curves with global-aware trigonometric functions, enabling flexible yet precise dynamic object modeling. Rather than requiring comprehensive semantic labeling, AD-GS automatically segments scenes into objects and background with the simplified pseudo 2D segmentation, representing objects using dynamic Gaussians and bidirectional temporal visibility masks. Further, our model incorporates visibility reasoning and physically rigid regularization to enhance robustness. Extensive evaluations demonstrate that our annotation-free model significantly outperforms current state-of-the-art annotation-free methods and is competitive with annotation-dependent approaches.
Poster
Wangbo Yu · Chaoran Feng · Jianing Li · Jiye Tang · Jiashu Yang · Zhenyu Tang · Meng Cao · Xu Jia · Yuchao Yang · Li Yuan · Yonghong Tian

[ Exhibit Hall I ]

Abstract
3D Gaussian Splatting (3D-GS) has demonstrated exceptional capabilities in synthesizing novel views of 3D scenes. However, its training is heavily reliant on high-quality images and precise camera poses. Meeting these criteria can be challenging in non-ideal real-world conditions, where motion-blurred images frequently occur due to high-speed camera movements or low-light environments. To address these challenges, we introduce Event Stream Assisted Gaussian Splatting (**EvaGaussians**), a novel approach that harnesses event streams captured by event cameras to facilitate the learning of high-quality 3D-GS from blurred images. Capitalizing on the high temporal resolution and dynamic range offered by event streams, we seamlessly integrate them into the initialization and optimization of 3D-GS, thereby enhancing the acquisition of high-fidelity novel views with intricate texture details. We also contribute two novel datasets comprising RGB frames, event streams, and corresponding camera parameters, featuring a wide variety of scenes and various camera motions. The comparison results reveal that our approach not only excels in generating high-fidelity novel views, but also offers faster training and inference speeds. Video results are available at the supplementary **project page**.
Poster
Wongyun Yu · Ahyun Seo · Minsu Cho

[ Exhibit Hall I ]

Abstract
Symmetry is a fundamental concept that has been studied extensively; however, its detection in complex scenes remains challenging in computer vision. Recent heatmap-based methods identify potential regions of symmetry axes but lack precision for individual axes. In this work, we introduce a novel framework for axis-level detection of the most common symmetry types—reflection and rotation—representing them as explicit geometric primitives, i.e., lines and points. We formulate a dihedral group-equivariant dual-branch architecture, where each branch exploits the properties of dihedral group-equivariant features in a novel, specialized manner for each symmetry type. Specifically, for reflection symmetry, we propose orientational anchors aligned with group components to enable orientation-specific detection, and reflectional matching that computes similarity between patterns and their mirrored counterparts across potential reflection axes. For rotational symmetry, we propose rotational matching that computes the similarity between patterns at fixed angular intervals. Extensive experiments demonstrate that our method significantly outperforms state-of-the-art methods.
Poster
Hanyu Zhou · Haonan Wang · Haoyue Liu · Yuxing Duan · Luxin Yan · Gim Hee Lee

[ Exhibit Hall I ]

Abstract
High-dynamic scene reconstruction aims to represent the static background with rigid spatial features and dynamic objects with deformed, continuous spatiotemporal features. Typically, existing methods adopt a unified representation model (e.g., Gaussians) to directly match the spatiotemporal features of a dynamic scene captured by a frame camera. However, this unified paradigm fails to handle the potentially discontinuous temporal features of objects caused by frame imaging, as well as the heterogeneous spatial features between background and objects. To address this issue, we disentangle the spatiotemporal features into various latent representations to alleviate the spatiotemporal mismatch between background and objects. In this work, we introduce an event camera to compensate for the frame camera, and propose a spatiotemporal-disentangled Gaussian splatting framework for high-dynamic scene reconstruction. For the dynamic scene, we observe that background and objects exhibit an appearance discrepancy in frame-based spatial features and a motion discrepancy in event-based temporal features, which motivates us to distinguish the spatiotemporal features between background and objects via clustering. For dynamic objects, we discover that Gaussian representations and event data share a consistent spatiotemporal characteristic, which can serve as a prior to guide the spatiotemporal disentanglement of object Gaussians. Within the Gaussian splatting framework, the cumulative scene-object disentanglement can improve the spatiotemporal discrimination between background and objects to …
Poster
Xuzhi Wang · Xinran Wu · Song Wang · Lingdong Kong · Ziping Zhao

[ Exhibit Hall I ]

Abstract
Monocular Semantic Scene Completion (MSSC) aims to predict the voxel-wise occupancy and semantic category from a single-view RGB image. Existing methods adopt a single-stage framework that aims to simultaneously achieve visible region segmentation and occluded region hallucination, while also being affected by inaccurate depth estimation. Such methods often achieve suboptimal performance, especially in complex scenes. We propose a novel two-stage framework that decomposes MSSC into coarse MSSC followed by the Masked Recurrent Network. Specifically, we propose the Masked Sparse Gated Recurrent Unit (MS-GRU) which concentrates on the occupied regions by the proposed mask updating mechanism, and a sparse GRU design is proposed to reduce the computation cost. Additionally, we propose the distance attention projection to reduce projection errors by assigning different attention scores according to the distance to the observed surface. Experimental results demonstrate that our proposed unified framework, MonoMRN, effectively supports both indoor and outdoor scenes and achieves state-of-the-art performance on the NYUv2 and SemanticKITTI datasets. Furthermore, we conduct robustness analysis under various disturbances, highlighting the role of the Masked Recurrent Network in enhancing the model's resilience to such challenges. The code will be publicly available.
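A minimal sketch of a mask-gated recurrent update in the spirit of the MS-GRU described above; the real design presumably skips computation for masked-out voxels rather than merely gating the result, and all shapes here are placeholders.

```python
import torch
import torch.nn as nn

class MaskedGRUCell(nn.Module):
    """Mask-gated recurrent update: only 'occupied' voxels take the new hidden state."""

    def __init__(self, feat_dim):
        super().__init__()
        self.cell = nn.GRUCell(feat_dim, feat_dim)

    def forward(self, x, h, mask):
        # x, h: (N_voxels, feat_dim); mask: (N_voxels, 1) with 1 = occupied.
        h_new = self.cell(x, h)
        # A truly sparse implementation would skip the masked-out voxels entirely
        # instead of computing and then discarding their update, as done here.
        return mask * h_new + (1.0 - mask) * h

voxels, dim = 1024, 32
gru = MaskedGRUCell(dim)
x, h = torch.randn(voxels, dim), torch.zeros(voxels, dim)
mask = (torch.rand(voxels, 1) > 0.7).float()      # sparse occupancy mask
h = gru(x, h, mask)
```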
Poster
Haoyu Fu · Diankun Zhang · Zongchuang Zhao · Jianfeng Cui · DINGKANG LIANG · Chong Zhang · Dingyuan Zhang · Hongwei Xie · BING WANG · Xiang Bai

[ Exhibit Hall I ]

Abstract
End-to-end (E2E) autonomous driving methods still struggle to make correct decisions in interactive closed-loop evaluation due to limited causal reasoning capability. Current methods attempt to leverage the powerful understanding and reasoning abilities of Vision-Language Models (VLMs) to resolve this dilemma. However, it remains an open problem that few VLM-based E2E methods perform well in closed-loop evaluation, due to the gap between the semantic reasoning space and the purely numerical trajectory output in the action space. To tackle this issue, we propose ORION, a holistic E2E autonomous driving framework via vision-language instructed action generation. ORION uniquely combines a QT-Former to aggregate long-term history context, a Large Language Model (LLM) for driving scenario reasoning, and a generative planner for precise trajectory prediction. ORION further aligns the reasoning space and the action space to implement a unified E2E optimization for both visual question-answering (VQA) and planning tasks. Our method achieves an impressive closed-loop performance of 77.47 Driving Score (DS) and 54.62\% Success Rate (SR) on the challenging Bench2Drive benchmark, which outperforms state-of-the-art (SOTA) methods by a large margin of 14.28 DS and 28.08\% SR.
Poster
Zongyan Han · Mohamed El Amine Boudjoghra · Jiahua Dong · Jinhong Wang · Rao Anwer

[ Exhibit Hall I ]

Abstract
Unified segmentation of 3D point clouds is crucial for scene understanding, but is hindered by its sparse structure, limited annotations, and the challenge of distinguishing fine-grained object classes in complex environments. Existing methods often struggle to capture rich semantic and contextual information due to limited supervision and a lack of diverse multimodal cues, leading to suboptimal differentiation of classes and instances. To address these challenges, we propose VDG-Uni3DSeg, a novel framework that integrates pre-trained vision-language models (e.g., CLIP) and large language models (LLMs) to enhance 3D segmentation. By leveraging LLM-generated textual descriptions and reference images from the internet, our method incorporates rich multimodal cues, facilitating fine-grained class and instance separation. We further design a Semantic-Visual Contrastive Loss to align point features with multimodal queries and a Spatial Enhanced Module to model scene-wide relationships efficiently. Operating within a closed-set paradigm that utilizes multimodal knowledge generated offline, VDG-Uni3DSeg achieves state-of-the-art results in semantic, instance, and panoptic segmentation, offering a scalable and practical solution for 3D understanding. We will release our code and models.
Poster
Stanislaw Szymanowicz · Jason Y. Zhang · Pratul Srinivasan · Ruiqi Gao · Arthur Brussee · Aleksander Holynski · Ricardo Martin Brualla · Jonathan Barron · Philipp Henzler

[ Exhibit Hall I ]

Abstract
We present a latent diffusion model for fast feed-forward 3D scene generation. Given one or more images, our model Bolt3D directly samples a 3D scene representation in less than seven seconds on a single GPU. We achieve this by leveraging powerful and scalable existing 2D diffusion network architectures to produce consistent high-fidelity 3D scene representations. To train this model, we create a large-scale multiview-consistent dataset of 3D geometry and appearance by applying state-of-the-art dense 3D reconstruction techniques to existing multiview image datasets. Compared to prior multiview generative models that require per-scene optimization for 3D reconstruction, Bolt3D reduces the inference cost by a factor of up to 300.
Poster
Qikui Zhu

[ Exhibit Hall I ]

Abstract
A reliable, hard-landmark-sensitive loss is urgently needed in heatmap-based facial landmark detection, as existing standard regression losses are ineffective at capturing small errors caused by peak mismatches and struggle to adaptively focus on hard-to-detect landmarks. These limitations potentially result in misguided model training, impacting both the coverage and accuracy of the model. To this end, we propose a novel POsition-aware and Sample-Sensitive Loss, named PossLoss, for reliable, hard-landmark-sensitive landmark detection. Specifically, our PossLoss is position-aware, incorporating relative positional information to accurately differentiate and locate the peak of the heatmap, while adaptively balancing the influence of landmark and background pixels through self-weighting, addressing the extreme imbalance between landmarks and non-landmarks. Moreover, our PossLoss is sample-sensitive: it distinguishes easy from hard landmarks and adaptively makes the model focus more on the hard ones. It also addresses the difficulty of accurately evaluating the heatmap distribution, especially for small errors caused by peak mismatches. We analyze and evaluate our PossLoss on three challenging facial landmark detection tasks. The experimental results show that our PossLoss significantly improves the performance of landmark detection and outperforms state-of-the-art methods. The source code will be made available on GitHub.
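Since the abstract gives no formula, the sketch below only illustrates the general idea of combining a position-aware pixel weight with a sample-sensitive (hard-landmark) weight in a heatmap loss; it is not the PossLoss formulation, and all constants are placeholders.

```python
import torch

def position_and_sample_weighted_loss(pred, target, gamma=2.0):
    """Generic position-aware, sample-sensitive heatmap loss (NOT the PossLoss formula).

    pred, target : (B, K, H, W) predicted and ground-truth heatmaps for K landmarks.
    """
    err = (pred - target) ** 2
    # Position-aware term: pixels near the ground-truth peak get larger weights.
    pos_weight = target + 1e-3                                    # tiny weight on background
    # Sample-sensitive term: landmarks with larger current error are up-weighted.
    per_landmark_err = err.mean(dim=(2, 3), keepdim=True)         # (B, K, 1, 1)
    hard_weight = (per_landmark_err / (per_landmark_err.mean() + 1e-8)) ** gamma
    return (pos_weight * hard_weight.detach() * err).mean()

# Toy usage with random heatmaps for 68 landmarks.
loss = position_and_sample_weighted_loss(torch.rand(2, 68, 64, 64), torch.rand(2, 68, 64, 64))
```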
Poster
yejun Shou · Haocheng Wang · Lingfeng Shen · Qian Zheng · Gang Pan · Yanlong Cao

[ Exhibit Hall I ]

Abstract
Point cloud registration is a fundamental task in 3D vision, playing a crucial role in various fields. With the rapid advancement of RGB-D sensors, unsupervised point cloud registration methods based on RGB-D sequences have demonstrated excellent performance. However, existing methods struggle in scenes with low overlap and photometric inconsistency. Low overlap results in numerous correspondence outliers, while photometric inconsistency hinders the model's ability to extract discriminative features. To address these challenges, we first propose the Overlapping Constraint for Inliers Detection (OCID) module, which filters and optimizes the initial correspondence set using an overlapping constraint. This module robustly selects reliable correspondences within the overlapping region while maintaining a balance between accuracy and efficiency. Additionally, we introduce a novel scene representation, 3DGS, which integrates both geometric and texture information, making it particularly well-suited for RGB-D registration tasks. Building on this, we propose the Gaussian Rendering for Photometric Adaptation (GRPA) module, which refines the geometric transformation and enhances the model's adaptability to scenes with inconsistent photometric information. Extensive experiments on ScanNet and ScanNet1500 demonstrate that our method achieves state-of-the-art performance.
Poster
Dubing Chen · Huan Zheng · Yucheng Zhou · Xianfei Li · Wenlong Liao · Tao He · Pai Peng · Jianbing Shen

[ Exhibit Hall I ]

Abstract
Vision-based 3D semantic occupancy prediction is essential for autonomous systems, converting 2D camera data into 3D semantic grids. Current methods struggle to align 2D evidence with 3D predictions, undermining reliability and interpretability. This limitation drives a new exploration of the task’s causal foundations. We propose a novel approach that leverages causal principles to enhance semantic consistency in 2D-to-3D geometric transformation. Our framework introduces a causal loss that backpropagates 3D class features to 2D space for semantic alignment, ensuring 3D locations accurately reflect corresponding 2D regions. Building on this, we develop a Semantic Causality-Aware Lifting (SCA Lifting) method with three components, all guided by our causal loss: Channel-Grouped Lifting to adaptively map distinct semantics to appropriate positions, Learnable Camera Parameters to enhance camera perturbation robustness, and Normalized Convolution to propagate features to sparse regions. The evaluations demonstrate substantial gains in accuracy and robustness, positioning our method as a versatile solution for advancing 3D vision. Experimental results demonstrate that our approach significantly improves robustness to camera perturbations, enhances the semantic causal consistency in 2D-to-3D transformations, and yields substantial accuracy gains on the Occ3D dataset.
Poster
Xiaofan Li · Zhihao Xu · Chenming Wu · Zhao Yang · Yumeng Zhang · Jiang-Jiang Liu · Haibao Yu · Xiaoqing Ye · YuAn Wang · Shirui Li · Xun Sun · Ji Wan · Jun Wang

[ Exhibit Hall I ]

Abstract
Accurate localization using visual information is a critical yet challenging task, especially in urban environments where nearby buildings and construction sites significantly degrade GNSS (Global Navigation Satellite System) signal quality. This issue underscores the importance of visual localization techniques in scenarios where GNSS signals are unreliable. This paper proposes U-ViLAR, a novel uncertainty-aware visual localization framework designed to address these challenges while enabling adaptive localization using high-definition (HD) maps or navigation maps. Specifically, our method first extracts features from the input visual data and maps them into Bird’s-Eye-View (BEV) space to enhance spatial consistency with the map input. Subsequently, we introduce: a) Perceptual Uncertainty-guided Association, which mitigates errors caused by perception uncertainty, and b) Localization Uncertainty-guided Registration, which reduces errors introduced by localization uncertainty. By effectively balancing the coarse-grained large-scale localization capability of association with the fine-grained precise localization capability of registration, our approach achieves robust and accurate localization. Experimental results demonstrate that our method achieves state-of-the-art performance across multiple localization tasks. Furthermore, our model has undergone rigorous testing on large-scale autonomous driving fleets and has demonstrated stable performance in various challenging urban scenarios. The source code will be released.
Poster
Inseung Hwang · Kiseok Choi · Hyunho Ha · Min H. Kim

[ Exhibit Hall I ]

Abstract
Snapshot polarization imaging calculates polarization states from linearly polarized subimages. To achieve this, a polarization camera employs a double Bayer-patterned sensor to capture both color and polarization. However, this design suffers from low light efficiency and low spatial resolution, resulting in increased noise and compromised polarization measurements. Although burst super-resolution effectively reduces noise and enhances spatial resolution, applying it to polarization imaging poses challenges due to the lack of tailored datasets and reliable ground truth noise statistics. To address these issues, we introduce PolarNS and PolarBurstSR, two innovative datasets developed specifically for polarization imaging. PolarNS provides characterization of polarization noise statistics, facilitating thorough analysis, while PolarBurstSR functions as a benchmark for burst super-resolution in polarization images. These datasets, collected under various real-world conditions, enable comprehensive evaluation. Additionally, we present a model for analyzing polarization noise to quantify noise propagation, tested on a large dataset captured in a darkroom environment. As part of our application, we compare the latest burst super-resolution models, highlighting the advantages of training tailored to polarization compared to RGB-based methods. This work establishes a benchmark for polarization burst super-resolution and offers critical insights into noise propagation, thereby enhancing polarization image reconstruction.
Poster
Ying Xue · Jiaxi Jiang · Rayan Armani · Dominik Hollidt · Yi-Chi Liao · Christian Holz

[ Exhibit Hall I ]

Abstract
Tracking human motion using wearable inertial measurement units (IMUs) overcomes occlusion and environmental limitations inherent in vision-based approaches. However, such sparse IMU tracking also compromises translation estimates and accurate relative positioning between multiple individuals, as inertial cues are inherently self-referential and provide no direct spatial reference or relational information about others. In this paper, we present a novel approach that leverages the distances between the IMU sensors worn by one person as well as between those across multiple people. Our method, Inter Inertial Poser, derives these absolute inter-sensor distances from ultra-wideband ranging (UWB) and inputs them into structured state-space models to integrate temporal motion patterns for precise 3D pose estimation. Our novel coarse-to-fine optimization process further leverages these inter-sensor distances for accurately estimating the trajectories between individuals. To evaluate our method, we introduce Inter-UWB, the first IMU+UWB dataset for two-person tracking, which comprises 200 minutes of motion recordings from 14 participants. Our results show that Inter Inertial Poser outperforms the state-of-the-art methods in both accuracy and robustness across synthetic and real-world captures, demonstrating the promise of IMU+UWB-based multi-human motion capture in the wild.
Poster
Yukun Huang · Yanning Zhou · Jianan Wang · Kaiyi Huang · Xihui Liu

[ Exhibit Hall I ]

Abstract
3D panorama synthesis is a promising yet challenging task that demands high-quality and diverse visual appearance and geometry of the generated omnidirectional content. Existing methods leverage rich image priors from pre-trained 2D foundation models to circumvent the scarcity of 3D panoramic data, but the incompatibility between 3D panoramas and 2D single views limits their effectiveness. In this work, we demonstrate that by applying multi-plane synchronization to the operators from 2D foundation models, their capabilities can be seamlessly extended to the omnidirectional domain. Based on this design, we further introduce DreamCube, a multi-plane RGB-D diffusion model for 3D panorama generation, which maximizes the reuse of 2D foundation model priors to achieve diverse appearances and accurate geometry while maintaining multi-view consistency. Extensive experiments demonstrate the effectiveness of our approach in panoramic image generation, panoramic depth estimation, and 3D scene generation.
Poster
Mingqian Ji · Jian Yang · Shanshan Zhang

[ Exhibit Hall I ]

Abstract
Current multi-view 3D object detection methods typically transfer 2D features into 3D space using depth estimation or a 3D position encoder, but in a fully data-driven and implicit manner, which limits detection performance. Inspired by the success of radiance fields in 3D reconstruction, we assume they can be used to enhance the detector's ability to estimate 3D geometry. However, we observe a decline in detection performance when we directly use them for 3D rendering as an auxiliary task. From our analysis, we find the performance drop is caused by strong responses on the background when rendering the whole scene. To address this problem, we propose object-centric radiance fields, focusing on modeling foreground objects while discarding background noise. Specifically, we employ object-centric radiance fields (OcRF) to enhance 3D voxel features via an auxiliary task of rendering foreground objects. We further use opacity, a side product of rendering, to enhance the 2D foreground BEV features via height-aware opacity-based attention (HOA), where attention maps at different height levels are generated separately via multiple networks in parallel. Extensive experiments on the nuScenes validation and test datasets demonstrate that our OcRFDet achieves superior performance, outperforming previous state-of-the-art methods with 57.2\% mAP and 64.8\% NDS …
Poster
Simon Boeder · Fabian Gigengack · Benjamin Risse

[ Exhibit Hall I ]

Abstract
Occupancy estimation has become a prominent task in 3D computer vision, particularly within the autonomous driving community. In this paper, we present a novel approach to occupancy estimation, termed GaussianFlowOcc, which is inspired by Gaussian Splatting and replaces traditional dense voxel grids with a sparse 3D Gaussian representation. Our efficient model architecture based on a Gaussian Transformer significantly reduces computational and memory requirements by eliminating the need for expensive 3D convolutions used with inefficient voxel-based representations that predominantly represent empty 3D spaces. GaussianFlowOcc effectively captures scene dynamics by estimating temporal flow for each Gaussian during the overall network training process, offering a straightforward solution to a complex problem that is often neglected by existing methods. Moreover, GaussianFlowOcc is designed for scalability, as it employs weak supervision and does not require costly dense 3D voxel annotations based on additional data (e.g., LiDAR). Through extensive experimentation, we demonstrate that GaussianFlowOcc significantly outperforms all previous methods for weakly supervised occupancy estimation on the nuScenes dataset while featuring an inference speed that is 50 times faster than current SotA.
Poster
Xiaolin Liu · Tianyi zhou · Hongbo Kang · Jian Ma · Ziwen Wang · Jing Huang · Wenguo Weng · Yu-Kun Lai · Kun Li

[ Exhibit Hall I ]

Abstract
Evacuation simulations are vital for improving safety, pinpointing risks, and refining emergency protocols. However, no existing methods can simulate realistic, personalized, and online 3D evacuation motions. In this paper, aligned with the sensory-decision-motor (SDM) flow of the human brain, we propose an online SDM-united 3D evacuation simulation framework with a 3D-adaptive Social Force Model and a proxemics-aware personalization method. Additionally, we introduce Part-level Force Visualization to assist in evacuation analysis. We experimentally validate that our framework supports online personalized dynamic path planning and behaviors throughout the evacuation process, and is compatible with uneven terrain. Visually, our method generates evacuation results that are more realistic and plausible, providing enhanced insights for evacuation strategy development. The code will be released for research purposes.
Poster
Zhengkang Xiang · Zizhao Li · Amir Khodabandeh · Kourosh Khoshelham

[ Exhibit Hall I ]

Abstract
Lidar point cloud synthesis based on generative models offers a promising solution to augment deep learning pipelines, particularly when real-world data is scarce or lacks diversity. By enabling flexible object manipulation, this synthesis approach can significantly enrich training datasets and enhance discriminative models. However, existing methods focus on unconditional lidar point cloud generation, overlooking their potential for real-world applications. In this paper, we propose SG-LDM, a Semantic-Guided Lidar Diffusion Model that employs latent alignment to enable robust semantic-to-lidar synthesis. By directly operating in the native lidar space and leveraging explicit semantic conditioning, SG-LDM achieves state-of-the-art performance in generating high-fidelity lidar point clouds guided by semantic labels. Moreover, we propose the first diffusion-based lidar translation framework based on SG-LDM, which enables cross-domain translation as a domain adaptation strategy to enhance downstream perception performance. Systematic experiments demonstrate that SG-LDM significantly outperforms existing lidar diffusion models and the proposed lidar translation framework further improves data augmentation performance in the lidar segmentation task by addressing the domain gap between the synthetic and real data.
Poster
Boxiao Pan · Adam Harley · Francis Engelmann · Karen Liu · Leonidas Guibas

[ Exhibit Hall I ]

Abstract
The ability to predict collision-free future trajectories from egocentric observations is crucial in applications such as humanoid robotics, VR / AR, and assistive navigation. In this work, we introduce the challenging problem of predicting a sequence of future 6D head poses from an egocentric video. In particular, we predict both head translations and rotations to learn the active information-gathering behavior expressed through head-turning events. To solve this task, we propose a framework that reasons over a temporally aggregated 3D latent space, which implicitly models the geometric constraints for both the static and dynamic parts of the environment. Motivated by the lack of training data in this space, we further contribute a data collection pipeline using the Project Aria glasses, and provide a dataset collected through this approach. Our dataset, dubbed Aria Navigation Dataset (AND), consists of 4 hours of recording of users navigating in real-world scenarios. It includes diverse situations and navigation behaviors, providing a valuable resource for learning real-world egocentric navigation policies. Extensive experiments show that our model learns human-like navigation behaviors such as waiting / slowing down, rerouting, and looking around for traffic while generalizing to unseen environments.
Poster
Abiao Li · Chenlei Lv · Guofeng Mei · Yifan Zuo · Jian Zhang · Yuming Fang

[ Exhibit Hall I ]

Abstract
Most masked point cloud modeling (MPM) methods follow a regression paradigm to reconstruct the coordinates or features of masked regions. However, they tend to over-constrain the model to learn the details of the masked region, resulting in failure to capture generalized features. To address this limitation, we propose \textbf{\textit{PointGAC}}, a novel clustering-based MPM method that aims to align the feature distribution of masked regions. Specifically, it features an online codebook-guided teacher-student framework. Firstly, it presents a geometry-aware partitioning strategy to extract initial patches. Then, the teacher model updates a codebook via online $K$-means based on features extracted from the complete patches. This procedure encourages the codebook vectors to become cluster centers. Afterward, we assign the unmasked features to their corresponding cluster centers, and the student model aligns the assignment for the reconstructed masked features. This strategy focuses on identifying the cluster centers to which the masked features belong, enabling the model to learn more generalized feature representations. Benefiting from the proposed codebook maintenance mechanism, codebook vectors are actively updated, which further increases the efficiency of semantic feature learning. Experiments validate the effectiveness of the proposed method on various downstream tasks.
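A rough sketch of an online K-means codebook update and the assignment-alignment objective described above, assuming an EMA-style maintenance rule; the paper's exact mechanism may differ.

```python
import torch
import torch.nn.functional as F

def online_kmeans_update(codebook, feats, momentum=0.99):
    """Mini-batch (online) K-means codebook update with an EMA-style rule.

    codebook : (K, D) cluster centers (a plain tensor / buffer, no gradients)
    feats    : (N, D) teacher features extracted from complete patches
    """
    assign = torch.cdist(feats, codebook).argmin(dim=1)                   # nearest center per feature
    for k in assign.unique():
        center = feats[assign == k].mean(dim=0)
        codebook[k] = momentum * codebook[k] + (1 - momentum) * center    # EMA update
    return codebook, assign

# The student is then trained so its reconstructed masked features predict the same
# cluster assignment that the teacher's unmasked features received, e.g.:
def assignment_loss(student_logits, assign):
    return F.cross_entropy(student_logits, assign)

codebook = torch.randn(64, 256)
codebook, assign = online_kmeans_update(codebook, torch.randn(512, 256))
```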
Poster
Qi Xun Yeo · Yanyan Li · Gim Hee Lee

[ Exhibit Hall I ]

Abstract
Modern 3D semantic scene graph estimation methods utilise ground truth 3D annotations to accurately predict target objects, predicates, and relationships. In the absence of given 3D ground truth representations, we explore leveraging only multi-view RGB images to tackle this task. To attain robust features for accurate scene graph estimation, we must overcome the noisy reconstructed pseudo point-based geometry from predicted depth maps and reduce the amount of background noise present in multi-view image features. The key is to enrich node and edge features with accurate semantic and spatial information as well as through neighbouring relations. We obtain semantic masks to guide feature aggregation to filter background features and design a novel method to incorporate neighbouring node information to aid the robustness of our scene graph estimates. Furthermore, we leverage explicit statistical priors calculated from the training summary statistics to refine node and edge predictions based on their one-hop neighbourhood. Our experiments show that our method outperforms current methods that purely use multi-view images as the initial input. Our code will be open-sourced upon paper acceptance.
Poster
Wenhang Ge · Jiantao Lin · Guibao SHEN · Jiawei Feng · Tao Hu · Xinli Xu · Ying-Cong Chen

[ Exhibit Hall I ]

Abstract
We propose PRM, a novel photometric stereo based large reconstruction model to reconstruct high-quality meshes with fine-grained details. Previous large reconstruction models typically prepare training images under fixed and simple lighting, offering minimal photometric cues for precise reconstruction. Furthermore, images containing specular surfaces are treated as out-of-distribution samples, resulting in degraded reconstruction quality. To handle these challenges, PRM renders photometric stereo images by varying materials and lighting, which not only improves the local details by providing rich photometric cues but also increases the model’s robustness to variations in the appearance of input images. To offer enhanced flexibility, we incorporate a real-time physically-based rendering (PBR) method and mesh rasterization for ground-truth rendering. By using an explicit mesh as 3D representation, PRM ensures the application of differentiable PBR for predicted rendering. This approach models specular color more accurately for photometric stereo images than previous neural rendering methods and supports multiple supervisions for geometry optimization. Extensive experiments demonstrate that PRM significantly outperforms other models.
Poster
Yanyan Li · Youxu Fang · Zunjie Zhu · Kunyi Li · Yong Ding · Federico Tombari

[ Exhibit Hall I ]

Abstract
Simultaneously localizing camera poses and constructing Gaussian radiance fields in dynamic scenes establish a crucial bridge between 2D images and the 4D real world. Instead of removing dynamic objects as distractors and reconstructing only static environments, this paper proposes an efficient architecture that incrementally tracks camera poses and establishes 4D Gaussian radiance fields in unknown scenarios using a sequence of RGB-D images. First, by generating motion masks, we obtain static and dynamic priors for each pixel. To eliminate the influence of static scenes and improve the efficiency of learning the motion of dynamic objects, we classify the Gaussian primitives into static and dynamic Gaussian sets, while sparse control points along with an MLP are utilized to model the transformation fields of the dynamic Gaussians. To learn the motion of dynamic Gaussians more accurately, a novel 2D optical flow map reconstruction algorithm is designed to render optical flows of dynamic objects between neighboring images, which are further used to supervise the 4D Gaussian radiance fields along with traditional photometric and geometric constraints. In experiments, qualitative and quantitative evaluation results show that the proposed method achieves robust tracking and high-quality view synthesis performance in real-world environments.
Poster
David Svitov · Pietro Morerio · Lourdes Agapito · ALESSIO DEL BUE

[ Exhibit Hall I ]

Abstract
We present Billboard Splatting (BBSplat), a novel approach for novel view synthesis based on textured geometric primitives. BBSplat represents the scene as a set of optimizable textured planar primitives with learnable RGB textures and alpha-maps to control their shape. BBSplat primitives can be used in any Gaussian Splatting pipeline as drop-in replacements for Gaussians. The proposed primitives close the rendering quality gap between 2D and 3D Gaussian Splatting (GS), enabling the accurate extraction of 3D meshes as in the 2DGS framework. Additionally, the explicit nature of planar primitives enables the use of ray-tracing effects in rasterization. Our novel regularization term encourages textures to have a sparser structure, enabling an efficient compression that leads to a reduction in the storage space of the model of up to $17\times$ compared to 3DGS. Our experiments show the efficiency of BBSplat on standard datasets of real indoor and outdoor scenes such as Tanks\&Temples, DTU, and Mip-NeRF-360. Namely, we achieve a state-of-the-art PSNR of 29.72 for DTU at Full HD resolution.
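The sparsity regularization mentioned above could, in the simplest reading, be an L1 penalty on the learnable textures; the sketch below is that simplest reading, with made-up weights, not the paper's exact term.

```python
import torch

def texture_sparsity_reg(alpha_maps, rgb_textures, w_alpha=1e-3, w_rgb=1e-4):
    """L1 sparsity penalty on per-primitive textures; weights are arbitrary placeholders.

    alpha_maps   : (P, H, W)    learnable opacity textures of the planar primitives
    rgb_textures : (P, 3, H, W) learnable RGB textures
    """
    # Pushing most texels toward zero yields sparse textures that compress well.
    return w_alpha * alpha_maps.abs().mean() + w_rgb * rgb_textures.abs().mean()
```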
Poster
Shida Sun · Yue Li · Yueyi Zhang · Zhiwei Xiong

[ Exhibit Hall I ]

Abstract
Non-line-of-sight (NLOS) imaging, recovering the hidden volume from indirect reflections, has attracted increasing attention due to its potential applications. Despite promising results, existing NLOS reconstruction approaches are constrained by the reliance on empirical physical priors, e.g., single fixed path compensation. Moreover, these approaches still possess limited generalization ability, particularly when dealing with scenes at a low signal-to-noise ratio (SNR). To overcome the above problems, we introduce a novel learning-based approach, comprising two key designs: Learnable Path Compensation (LPC) and Adaptive Phasor Field (APF). The LPC applies tailored path compensation coefficients to adapt to different objects in the scene, effectively reducing light wave attenuation, especially in distant regions. Meanwhile, the APF learns the precise Gaussian window of the illumination function for the phasor field, dynamically selecting the relevant spectrum band of the transient measurement. Experimental validations demonstrate that our proposed approach, only trained on synthetic data, exhibits the capability to seamlessly generalize across various real-world datasets captured by different imaging systems and characterized by low SNRs.
Poster
Chongjie Ye · Yushuang Wu · Ziteng Lu · Jiahao Chang · Xiaoyang Guo · Jiaqing Zhou · Hao Zhao · Xiaoguang Han

[ Exhibit Hall I ]

Abstract
With the growing demand for high-fidelity 3D models from 2D images, existing methods still face significant challenges in accurately reproducing fine-grained geometric details due to limitations in domain gaps and inherent ambiguities in RGB images. To address these issues, we propose Hi3DGen, a novel framework for generating high-fidelity 3D geometry from images via normal bridging. Hi3DGen consists of three key components: (1) an image-to-normal estimator that decouples the low-high frequency image pattern with noise injection and dual-stream training to achieve generalizable, stable, and sharp estimation; (2) a normal-to-geometry learning approach that uses normal-regularized latent diffusion learning to enhance 3D geometry generation fidelity; and (3) a 3D data synthesis pipeline that constructs a high-quality dataset to support training. Extensive experiments demonstrate the effectiveness and superiority of our framework in generating rich geometric details, outperforming state-of-the-art methods in terms of fidelity. Our work provides a new direction for high-fidelity 3D geometry generation from images by leveraging normal maps as an intermediate representation.
Poster
Yuanhao Cai · He Zhang · Kai Zhang · Yixun Liang · Mengwei Ren · Fujun Luan · Qing Liu · Soo Ye Kim · Jianming Zhang · Zhifei Zhang · Yuqian Zhou · YULUN ZHANG · Xiaokang Yang · Zhe Lin · Alan Yuille

[ Exhibit Hall I ]

Abstract
Existing feedforward image-to-3D methods mainly rely on 2D multi-view diffusion models that cannot guarantee 3D consistency. These methods easily collapse when changing the prompt view direction and mainly handle object-centric cases. In this paper, we propose a novel single-stage 3D diffusion model, DiffusionGS, for object generation and scene reconstruction from a single view. DiffusionGS directly outputs 3D Gaussian point clouds at each timestep to enforce view consistency and allow the model to generate robustly given prompt views of any directions, beyond object-centric inputs. Plus, to improve the capability and generality of DiffusionGS, we scale up 3D training data by developing a scene-object mixed training strategy. Experiments show that DiffusionGS yields improvements of 2.20 dB/23.25 and 1.34 dB/19.16 in PSNR/FID for objects and scenes than the state-of-the-art methods, without using 2D diffusion prior and depth estimator. Plus, our method enjoys over 5$\times$ faster speed ($\sim$6s on an A100 GPU). Code will be released.
Poster
Shaokai Wu · Yuxiang Lu · Yapan Guo · Wei Ji · Suizhi Huang · Fengyu Yang · Shalayiding Sirejiding · Qichen He · Jing Tong · Yanbiao Ji · Yue Ding · Hongtao Lu

[ Exhibit Hall I ]

Abstract
Computed Tomography (CT) is a widely used imaging technique that provides detailed cross-sectional views of objects. Over the past decade, Deep Learning-based Reconstruction (DLR) methods have led efforts to enhance image quality and reduce noise, yet they often require large amounts of data and are computationally intensive. Inspired by recent advancements in scene reconstruction, some approaches have adapted NeRF and 3D Gaussian Splatting (3DGS) techniques for CT reconstruction. However, these methods are not ideal for direct 3D volume reconstruction. In this paper, we propose a novel Discretized Gaussian Representation (DGR) for CT reconstruction, which directly reconstructs the 3D volume using a set of discretized Gaussian functions in an end-to-end manner. To further enhance computational efficiency, we introduce a Fast Volume Reconstruction technique that aggregates the contributions of these Gaussians into a discretized volume in a highly parallelized fashion. Our extensive experiments on both real-world and synthetic datasets demonstrate that DGR achieves superior reconstruction quality and significantly improved computational efficiency compared to existing DLR and instance reconstruction methods. Our code is available in the supplementary material.
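As a naive, unoptimized illustration of aggregating Gaussian contributions into a discretized volume (the core idea behind such a representation, without the paper's parallelized fast reconstruction), assuming isotropic Gaussians on a unit cube:

```python
import torch

def splat_gaussians_to_volume(centers, sigmas, weights, grid_size=64):
    """Naive aggregation of isotropic 3D Gaussians into a discretized volume.

    centers : (G, 3) positions in the unit cube [0, 1]^3
    sigmas  : (G,)   isotropic standard deviations
    weights : (G,)   per-Gaussian densities
    """
    axis = (torch.arange(grid_size) + 0.5) / grid_size
    zz, yy, xx = torch.meshgrid(axis, axis, axis, indexing="ij")
    voxels = torch.stack([xx, yy, zz], dim=-1).reshape(-1, 3)             # (V, 3) voxel centers
    vol = torch.zeros(voxels.shape[0])
    for c, s, w in zip(centers, sigmas, weights):                         # loop = slow but clear
        d2 = ((voxels - c) ** 2).sum(dim=-1)
        vol += w * torch.exp(-0.5 * d2 / (s ** 2))                        # Gaussian falloff
    return vol.reshape(grid_size, grid_size, grid_size)

volume = splat_gaussians_to_volume(torch.rand(8, 3), 0.05 + 0.05 * torch.rand(8), torch.rand(8))
```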
Poster
Yijia Hong · Yuan-Chen Guo · Ran Yi · Yulong Chen · Yanpei Cao · Lizhuang Ma

[ Exhibit Hall I ]

Abstract
Decomposing physically-based materials from images into their constituent properties remains challenging, particularly when maintaining both computational efficiency and physical consistency. While recent diffusion-based approaches have shown promise, they face substantial computational overhead due to multiple denoising steps and separate models for different material properties. We present SuperMat, a single-step framework that achieves high-quality material decomposition with one-step inference. This enables end-to-end training with perceptual and re-render losses while decomposing albedo, metallic, and roughness maps at millisecond-scale speeds. We further extend our framework to 3D objects through a UV refinement network, enabling consistent material estimation across viewpoints while maintaining efficiency. Experiments demonstrate that SuperMat achieves state-of-the-art PBR material decomposition quality while reducing inference time from seconds to milliseconds per image, and completes PBR material estimation for 3D objects in approximately 3 seconds.
Poster
Avinash Paliwal · xilong zhou · Wei Ye · Jinhui Xiong · Rakesh Ranjan · Nima Kalantari

[ Exhibit Hall I ]

Abstract
In this paper, we propose RI3D, a novel 3DGS-based approach that harnesses the power of diffusion models to reconstruct high-quality novel views given a sparse set of input images. Our key contribution is separating the view synthesis process into two tasks of reconstructing visible regions and hallucinating missing regions, and introducing two personalized diffusion models, each tailored to one of these tasks. Specifically, one model ('repair') takes a rendered image as input and predicts the corresponding high-quality image, which in turn is used as a pseudo ground truth image to constrain the optimization. The other model ('inpainting') primarily focuses on hallucinating details in unobserved areas. To integrate these models effectively, we introduce a two-stage optimization strategy: the first stage reconstructs visible areas using the repair model, and the second stage reconstructs missing regions with the inpainting model while ensuring coherence through further optimization. Moreover, we augment the optimization with a novel Gaussian initialization method that obtains per-image depth by combining 3D-consistent and smooth depth with highly detailed relative depth. We demonstrate that by separating the process into two tasks and addressing them with the repair and inpainting models, we produce results with detailed textures in both visible and missing regions …
Poster
Luoxi Zhang · Pragyan Shrestha · Yu Zhou · Chun Xie · Itaru Kitahara

[ Exhibit Hall I ]

Abstract
Single-view 3D reconstruction aims to recover the complete 3D geometry and appearance of objects from a single RGB image and its corresponding camera parameters. Yet, the task remains challenging due to incomplete image information and inherent ambiguity. Existing methods primarily encounter two issues: balancing the extraction of local details against the construction of global topology, and the interference caused by the early fusion of RGB and depth features in high-texture regions, which destabilizes SDF optimization. We propose Dual-S3D, a novel single-view 3D reconstruction framework to address these challenges. Our method employs a hierarchical dual-path feature extraction strategy in which early stages utilize CNNs to anchor local geometric details, while subsequent stages leverage a Transformer integrated with a selective SSM to capture global topology, enhancing scene understanding and feature representation. Additionally, we design an auxiliary branch that progressively fuses precomputed depth features with pixel-level features to decouple visual and geometric cues effectively. Extensive experiments on the 3D-FRONT and Pix3D datasets demonstrate that our approach significantly outperforms existing methods—reducing Chamfer distance by 51%, increasing F-score by 33.6%, and improving normal consistency by 10.3%—thus achieving state-of-the-art reconstruction quality.
Poster
Donghyun Lee · Dawoon Jeong · Jae W. Lee · Hongil Yoon

[ Exhibit Hall I ]

Abstract
Deep neural networks have revolutionized 3D point cloud processing, yet efficiently handling large and irregular point clouds remains challenging. To tackle this problem, we introduce FastPoint, a novel software-based acceleration technique that leverages the predictable distance trend between sampled points during farthest point sampling. By predicting the distance curve, we can efficiently identify subsequent sample points without exhaustively computing all pairwise distances. Our proposal substantially accelerates farthest point sampling and neighbor search operations while preserving sampling quality and model performance. By integrating FastPoint into state-of-the-art 3D point cloud models, we achieve 2.55x end-to-end speedup on NVIDIA RTX 3090 GPU without sacrificing accuracy.
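For readers unfamiliar with the baseline being accelerated, the sketch below shows vanilla farthest point sampling in NumPy; the commented hook marks where, as the abstract describes, a predicted distance-trend curve could replace the exhaustive distance update. The function name and the pruning hook are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of farthest point sampling (FPS), assuming a NumPy point cloud.
# FastPoint's actual acceleration (predicting the distance trend to skip the
# exhaustive pairwise update) is only indicated by the commented hook below.
import numpy as np

def farthest_point_sampling(points: np.ndarray, num_samples: int) -> np.ndarray:
    """points: (N, 3) array; returns indices of the sampled points."""
    n = points.shape[0]
    sampled = np.zeros(num_samples, dtype=np.int64)
    # Squared distance from every point to its nearest already-sampled point.
    min_dist = np.full(n, np.inf)
    sampled[0] = 0  # start from an arbitrary point
    for i in range(1, num_samples):
        # Update nearest-sample distances with the most recently added point.
        diff = points - points[sampled[i - 1]]
        min_dist = np.minimum(min_dist, np.einsum("ij,ij->i", diff, diff))
        # FastPoint (as described) would predict the decreasing trend of
        # max(min_dist) here and prune candidates instead of scanning all N.
        sampled[i] = int(np.argmax(min_dist))
    return sampled

pts = np.random.rand(4096, 3)
idx = farthest_point_sampling(pts, 512)
```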
Poster
Youming Deng · Wenqi Xian · Guandao Yang · Leonidas Guibas · Gordon Wetzstein · Steve Marschner · Paul Debevec

[ Exhibit Hall I ]

Abstract
Large field-of-view (FOV) cameras can simplify and accelerate scene capture because they provide complete coverage with fewer views. However, existing reconstruction pipelines fail to take full advantage of large-FOV input data because they convert input views to perspective images, resulting in stretching that prevents the use of the full image. Additionally, they calibrate lenses using models that do not accurately fit real fisheye lenses in the periphery. We present a new reconstruction pipeline based on Gaussian Splatting that uses a flexible lens model and supports fields of view approaching 180 degrees. We represent lens distortion with a hybrid neural field based on an Invertible ResNet and use a cubemap to render wide-FOV images while retaining the efficiency of the Gaussian Splatting pipeline. Our system jointly optimizes lens distortion, camera intrinsics, camera poses, and scene representations using a loss measured directly against the original input pixels. We present extensive experiments on both synthetic and real-world scenes, demonstrating that our model accurately fits real-world fisheye lenses and that our end-to-end self-calibration approach provides higher-quality reconstructions than existing methods.
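As context for the cubemap rendering step, here is a minimal sketch of the standard direction-to-cubemap-face lookup that wide-FOV rasterization typically relies on; the face naming and sign conventions are assumptions, and the paper's hybrid Invertible-ResNet lens model is not reproduced here.

```python
# Hedged sketch: mapping a viewing direction to a cubemap face and (u, v)
# coordinate, the common construction for rendering near-180-degree views with
# a perspective rasterizer. Conventions follow the usual OpenGL-style layout.
import numpy as np

def direction_to_cubemap(d: np.ndarray):
    """d: (3,) unit direction. Returns (face, u, v) with u, v in [0, 1]."""
    x, y, z = d
    ax, ay, az = abs(x), abs(y), abs(z)
    if ax >= ay and ax >= az:           # +X or -X face
        face, sc, tc, ma = ("+x" if x > 0 else "-x"), (-z if x > 0 else z), -y, ax
    elif ay >= az:                      # +Y or -Y face
        face, sc, tc, ma = ("+y" if y > 0 else "-y"), x, (z if y > 0 else -z), ay
    else:                               # +Z or -Z face
        face, sc, tc, ma = ("+z" if z > 0 else "-z"), (x if z > 0 else -x), -y, az
    u = 0.5 * (sc / ma + 1.0)
    v = 0.5 * (tc / ma + 1.0)
    return face, u, v

d = np.array([0.3, -0.2, 0.93])
face, u, v = direction_to_cubemap(d / np.linalg.norm(d))
```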
Poster
Yuran Wang · Yingping Liang · Yutao Hu · Ying Fu

[ Exhibit Hall I ]

Abstract
Learning-based stereo matching models struggle in adverse weather conditions due to the scarcity of corresponding training data and the challenges in extracting discriminative features from degraded images. These limitations significantly hinder zero-shot generalization to out-of-distribution weather conditions. In this paper, we propose **RobuSTereo**, a novel framework that enhances the zero-shot generalization of stereo matching models under adverse weather by addressing both data scarcity and feature extraction challenges. First, we introduce a diffusion-based simulation pipeline with a stereo consistency module, which generates high-quality stereo data tailored for adverse conditions. By training stereo matching models on our synthetic datasets, we reduce the domain gap between clean and degraded images, significantly improving the models’ robustness to unseen weather conditions. The stereo consistency module ensures structural alignment across synthesized image pairs, preserving geometric integrity and enhancing depth estimation accuracy. Second, we design a robust feature encoder that combines a specialized ConvNet with a denoising transformer to extract stable and reliable features from degraded images. The ConvNet captures fine-grained local structures, while the denoising transformer refines global representations, effectively mitigating the impact of noise, low visibility, and weather-induced distortions. This enables more accurate disparity estimation even under challenging visual conditions. Extensive experiments demonstrate that **RobuSTereo** …
Poster
DongZhenXing DongZhenXing · Jiazhou Chen

[ Exhibit Hall I ]

Abstract
The planning of digital orthodontic treatment requires providing a tooth alignment, which relies heavily on clinical experience and is time- and labor-intensive to determine manually. In this work, we propose an automatic tooth alignment neural network based on the Swin Transformer. We first re-organize 3D point clouds based on dental arch lines and convert them into order-sorted multi-channel textures, improving both accuracy and efficiency. We then design two new orthodontic loss functions that quantitatively evaluate the occlusal relationship between the upper and lower jaws; these clinically important constraints are introduced here for the first time and lead to cutting-edge prediction accuracy. To train our network, we collected a large digital orthodontic dataset over more than two years, including various complex clinical cases. We will release this dataset after the paper's publication and believe it will benefit the community. Furthermore, we propose two new orthodontic data augmentation methods that account for tooth spatial distribution and occlusion. We compare our method with state-of-the-art methods on this dataset, and extensive ablation studies and experiments demonstrate the high accuracy and efficiency of our method.
Poster
Zuo-Liang Zhu · jian Yang · Beibei Wang

[ Exhibit Hall I ]

Abstract
3D Gaussian splatting (3DGS) has shown its detailed expressive ability and highly efficient rendering speed in the novel view synthesis (NVS) task. Its application to inverse rendering, however, still faces several challenges, as the discrete nature of Gaussian primitives makes it difficult to apply geometry constraints. Recent works introduce the signed distance field (SDF) as an extra continuous representation to regularize the geometry defined by Gaussian primitives. This improves decomposition quality, at the cost of increased memory usage and more complicated training. Unlike these works, we introduce a **discretized SDF** to represent the continuous SDF in a discrete manner by encoding it within each Gaussian using a sampled value. This approach allows us to link the SDF with the Gaussian opacity through an SDF-to-opacity transformation, enabling the SDF to be rendered via splatting and avoiding the computational cost of ray marching. The key challenge is to regularize the discrete samples to be consistent with the underlying SDF, as the discrete representation makes it difficult to apply gradient-based constraints (e.g., the Eikonal loss). For this, we project Gaussians onto the zero-level set of the SDF and enforce alignment with the surface obtained from splatting via a projection-based consistency loss. Thanks to the discretized SDF, our method achieves higher …
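A minimal sketch of what an SDF-to-opacity transformation could look like, assuming each Gaussian stores a single sampled SDF value and that opacity should peak on the zero-level set and decay away from it; the kernel shape and the width `beta` are hypothetical choices for illustration, not the paper's.

```python
# Hedged sketch of an SDF-to-opacity transform for per-Gaussian SDF samples.
import torch

def sdf_to_opacity(sdf: torch.Tensor, beta: float = 0.05) -> torch.Tensor:
    """sdf: per-Gaussian sampled signed distance; returns opacity in (0, 1]."""
    # Laplace-style bump: opacity is 1 on the surface and tends to 0 away from it.
    return torch.exp(-sdf.abs() / beta)

sdf_samples = torch.randn(1000) * 0.1    # one sampled SDF value per Gaussian
opacity = sdf_to_opacity(sdf_samples)    # used in place of a free opacity parameter
```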
Poster
Yuxiang Ji · Boyong He · Zhuoyue Tan · Liaoni Wu

[ Exhibit Hall I ]

Abstract
Multimodal geo-localization methods can inherently overcome the limitations of unimodal sensor systems by leveraging complementary information from different modalities. However, existing retrieval-based methods rely on a comprehensive multimodal database, which is often challenging to fulfill in practice. In this paper, we introduce a more practical problem: localizing drone-view images by composing multimodal data against a satellite-view reference map, which integrates multimodal information while avoiding the need for an extensive multimodal database. We present \textsc{MMGeo}, which learns to push the composition of multimodal representations toward the target reference map through a unified framework. By utilizing a comprehensive multimodal query (image, point cloud/depth/text), we can achieve more robust and accurate geo-localization, especially in unknown and complex environments. Additionally, we extend two visual geo-localization datasets, GTA-UAV and UAV-VisLoc, to multi-modality, establishing the first UAV geo-localization datasets that combine image, point cloud, depth, and text data. Experiments demonstrate the effectiveness of \textsc{MMGeo} for UAV multimodal compositional geo-localization, as well as its generalization capabilities to real-world scenarios.
Poster
Tianyi Xu · Fan Zhang · Boxin Shi · Tianfan Xue · Yujin Wang

[ Exhibit Hall I ]

Abstract
Mainstream high dynamic range (HDR) imaging techniques typically rely on fusing multiple images captured with different exposure setups (shutter speed and ISO). A good balance between shutter speed and ISO is critical for high-quality HDR, as a high ISO introduces significant noise, whereas a long shutter speed may lead to noticeable motion blur, both of which degrade image quality. However, existing methods often overlook the complex interaction between shutter speed and ISO and fail to account for motion blur effects in dynamic scenes. In this work, we propose AdaptiveAE, a reinforcement learning-based method that optimizes the selection of shutter speed and ISO combinations to maximize HDR reconstruction quality in dynamic environments. AdaptiveAE integrates an image synthesis pipeline that incorporates motion blur and noise simulation into our training procedure, leveraging semantic information and the exposure histogram. It can adaptively select optimal ISO and shutter speed sequences based on a user-defined exposure time budget, finding a better exposure schedule than traditional fixed-exposure solutions. Experimental results across multiple datasets demonstrate that AdaptiveAE achieves state-of-the-art performance.
Poster
Qianjiang Hu · Wei Hu

[ Exhibit Hall I ]

Abstract
Generating realistic 3D outdoor scenes is essential for applications in autonomous driving, virtual reality, environmental science, and urban development. Traditional 3D generation approaches using single-layer diffusion methods can produce detailed scenes for individual objects but struggle with high-resolution, large-scale outdoor environments due to scalability limitations. Recent hierarchical diffusion models tackle this by progressively scaling up low-resolution scenes. However, they often sample fine details from total noise rather than from the coarse scene, which limits the efficiency. We propose a novel cube-absorb discrete diffusion (CADD) model, which deploys low-resolution scenes as the base state in the diffusion process to generate fine details, eliminating the need to sample entirely from noise. Moreover, we introduce the Sparse Cube Diffusion Transformer (SCDT), a transformer-based model with a sparse cube attention operator, optimized for generating large-scale sparse voxel scenes. Our method demonstrates state-of-the-art performance on the CarlaSC and KITTI360 datasets, supported by qualitative visualizations and extensive ablation studies that highlight the impact of the CADD process and sparse block attention operator on high-resolution 3D scene generation.
Poster
Jongsuk Kim · Jae Young Lee · Gyojin Han · Dong-Jae Lee · Minki Jeong · Junmo Kim

[ Exhibit Hall I ]

Abstract
Recent advancements in deep learning and the availability of high-quality real-world driving datasets have propelled end-to-end autonomous driving (E2E AD). Despite this progress, relying solely on real-world data limits the variety of driving scenarios for training. Synthetic scenario generation has emerged as a promising solution to enrich the diversity of training data; however, its application within E2E AD models remains largely unexplored. This is primarily due to the absence of a designated ego vehicle and the associated sensor inputs, such as camera or LiDAR, typically provided in real-world scenarios. To address this gap, we introduce SynAD, the first framework designed to enhance real-world E2E AD models using synthetic data. Our method designates the agent with the most comprehensive driving information as the ego vehicle in a multi-agent synthetic scenario. We further project path-level scenarios onto maps and employ a newly developed Map-to-BEV Network to derive bird’s-eye-view features without relying on sensor inputs. Finally, we devise a training strategy that effectively integrates these map-based synthetic data with real driving data. Experimental results demonstrate that SynAD effectively integrates all components and notably enhances safety performance. By bridging synthetic scenario generation and E2E AD, SynAD paves the way for more comprehensive and robust autonomous driving …
Poster
Anusha Krishnan · Shaohui Liu · Paul-Edouard Sarlin · Oscar Gentilhomme · David Caruso · Maurizio Monge · Richard Newcombe · Jakob Engel · Marc Pollefeys

[ Exhibit Hall I ]

Abstract
Precise 6-DoF simultaneous localization and mapping (SLAM) from onboard sensors is critical for wearable devices capturing egocentric data, which exhibits specific challenges, such as a wider diversity of motions and viewpoints, prevalent dynamic visual content, or long sessions affected by time-varying sensor calibration. While recent progress on SLAM has been swift, academic research is still driven by benchmarks that do not reflect these challenges or do not offer sufficiently accurate ground truth poses. In this paper, we introduce a new dataset and benchmark for visual-inertial SLAM with egocentric, multi-modal data. We record hours and kilometers of trajectories through a city center with glasses-like devices equipped with various sensors. We leverage surveying tools to obtain control points as indirect pose annotations that are metric, centimeter-accurate, and available at city scale. This makes it possible to evaluate extreme trajectories that involve walking at night or traveling in a vehicle. We show that state-of-the-art systems developed by academia are not robust to these challenges and we identify components that are responsible for this. In addition, we design tracks with different levels of difficulty to ease in-depth analysis and evaluation of less mature approaches. The dataset and benchmark will be made publicly available.
Poster
Zhile Chen · Hui Ji

[ Exhibit Hall I ]

Abstract
High Dynamic Range (HDR) imaging with modulo cameras involves solving a challenging inverse problem, where degradation occurs due to the modulo operation applied to the target HDR image. Existing methods operate directly in the image domain, overlooking the underlying properties of the modulo operation. Motivated by Itoh's continuity condition in optics, we reformulate modulo HDR reconstruction in image gradient domain, leveraging the inherent properties of modulo-wrapped gradients to simplify the problem. Furthermore, to address possible ambiguities on large image gradients, we introduce an auxiliary variable with a learnable sparsity prior in an optimization formulation to absorb the related residuals. This is implemented within an unfolding network, where sparsity is enforced through a spiking neuron-based module. Experiments show that our method outperforms existing approaches while being among the lightest models of existing works.
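A one-dimensional sketch of the gradient-domain reasoning invoked here: under Itoh's condition, the wrapped difference of the modulo observation equals the true gradient, so integrating it undoes the wrap (up to a global offset). The modulo range `m` and the toy signal are illustrative; the paper's learnable sparsity prior and unfolding network are not reproduced.

```python
# 1-D sketch of modulo unwrapping in the gradient domain, assuming Itoh's
# condition holds (true neighbouring differences stay within half the modulo range).
import numpy as np

def wrap(x, m):
    """Wrap values into [-m/2, m/2)."""
    return (x + m / 2.0) % m - m / 2.0

def unwrap_modulo_1d(wrapped: np.ndarray, m: float) -> np.ndarray:
    """Recover a signal (up to a global offset) from its modulo-m observation."""
    # Under Itoh's condition, the true gradient equals the wrapped difference
    # of the wrapped signal, so cumulative summation undoes the modulo operation.
    grad = wrap(np.diff(wrapped), m)
    return np.concatenate([[wrapped[0]], wrapped[0] + np.cumsum(grad)])

m = 1.0
signal = np.cumsum(np.random.uniform(-0.3, 0.3, size=256))   # smooth HDR-like signal
observed = signal % m                                         # modulo camera measurement
recovered = unwrap_modulo_1d(observed, m)
assert np.allclose(recovered - recovered[0], signal - signal[0], atol=1e-6)
```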
Poster
Xin Zhang · Anpei Chen · Jincheng Xiong · Pinxuan Dai · Yujun Shen · Weiwei Xu

[ Exhibit Hall I ]

Abstract
Gaussian splatting techniques have shown promising results in novel view synthesis, achieving high fidelity and efficiency. However, their high reconstruction quality comes at the cost of requiring a large number of primitives. We identify this issue as stemming from the entanglement of geometry and appearance in Gaussian Splatting. To address this, we introduce a neural shell texture, a global representation that encodes texture information around the surface. We use Gaussian primitives as both a geometric representation and texture field samplers, efficiently splatting texture features into image space. Our evaluation demonstrates that this disentanglement enables high parameter efficiency, fine texture detail reconstruction, and easy textured mesh extraction, all while using significantly fewer primitives.
Poster
Tuo Feng · Wenguan Wang · Yi Yang

[ Exhibit Hall I ]

Abstract
In autonomous driving, accurately predicting occupancy and motion is crucial for safe navigation within dynamic environments. However, existing methods often suffer from difficulties in handling complex scenes and uncertainty arising from sensor data. To address these issues, we propose a new Gaussian-based World Model (GWM) that seamlessly integrates raw multi-modal sensor inputs. In the first stage, a Gaussian representation learner utilizes self-supervised pretraining to learn a robust Gaussian representation, which integrates semantic and geometric information and establishes a robust probabilistic understanding of the environment. In the second stage, GWM seamlessly integrates learning, simulation, and planning into a unified framework, empowering an uncertainty-aware simulator and planner to jointly forecast future scene evolutions and vehicle trajectories. The simulator generates future scene predictions by modeling both static and dynamic elements, while the planner computes optimal paths to minimize collision risks, thus enhancing navigation safety. Overall, GWM employs a sensor-to-planning world model that directly processes raw sensor data, setting it apart from previous methods. Experiments show that GWM outperforms state-of-the-art approaches by 16.8% in semantic comprehension and 5.8% in motion prediction. Moreover, we provide an in-depth analysis of Gaussian representations under complex scenarios. Our code will be released.
Poster
JIXUAN FAN · Wanhua Li · Yifei Han · Tianru Dai · Yansong Tang

[ Exhibit Hall I ]

Abstract
3D Gaussian Splatting has demonstrated notable success in large-scale scene reconstruction, but challenges persist due to high training memory consumption and storage overhead. Hybrid representations that integrate implicit and explicit features offer a way to mitigate these limitations. However, when applied in parallelized block-wise training, two critical issues arise: reconstruction accuracy deteriorates due to reduced data diversity when each block is trained independently, and parallel training restricts the number of divided blocks to the number of available GPUs. To address these issues, we propose Momentum-GS, a novel approach that leverages momentum-based self-distillation to promote consistency and accuracy across the blocks while decoupling the number of blocks from the physical GPU count. Our method maintains a teacher Gaussian decoder updated with momentum, ensuring a stable reference during training. This teacher provides each block with global guidance in a self-distillation manner, promoting spatial consistency in reconstruction. To further ensure consistency across the blocks, we incorporate block weighting, dynamically adjusting each block’s weight according to its reconstruction accuracy. Extensive experiments on large-scale scenes show that our method consistently outperforms existing techniques, achieving an 18.7% improvement in LPIPS over CityGaussian with far fewer divided blocks and establishing a new state of the art.
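A minimal sketch of a momentum (EMA) teacher update of the kind described, assuming teacher and student Gaussian decoders share an architecture; the momentum coefficient and the schematic training loop are assumptions, not the paper's settings.

```python
# Hedged sketch of a momentum-updated teacher for self-distillation.
import torch

@torch.no_grad()
def momentum_update(teacher: torch.nn.Module, student: torch.nn.Module, m: float = 0.999):
    """teacher <- m * teacher + (1 - m) * student, parameter by parameter."""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(m).add_(p_s.detach(), alpha=1.0 - m)

# Tiny runnable demonstration with stand-in modules.
teacher = torch.nn.Linear(8, 8)
student = torch.nn.Linear(8, 8)
momentum_update(teacher, student)

# Schematic placement inside a block-parallel training loop (names hypothetical):
# for batch in loader:
#     loss = reconstruction_loss(student, batch) + distill_loss(student, teacher, batch)
#     loss.backward(); optimizer.step(); optimizer.zero_grad()
#     momentum_update(teacher, student)
```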
Poster
Kwon Byung-Ki · Qi Dai · Lee Hyoseok · Chong Luo · Tae-Hyun Oh

[ Exhibit Hall I ]

Abstract
We present JointDiT, a diffusion transformer that models the joint distribution of RGB and depth. By leveraging the architectural benefit and outstanding image prior of the state-of-the-art diffusion transformer, JointDiT not only generates high-fidelity images but also produces geometrically plausible and accurate depth maps. This solid joint distribution modeling is achieved through two simple yet effective techniques that we propose, i.e., adaptive scheduling weights, which depend on the noise levels of each modality, and the unbalanced timestep sampling strategy. With these techniques, we train our model across all noise levels for each modality, enabling JointDiT to naturally handle various combinatorial generation tasks, including joint generation, depth estimation, and depth-conditioned image generation by simply controlling the timestep of each branch. JointDiT demonstrates outstanding joint generation performance. Furthermore, it achieves comparable results in depth estimation and depth-conditioned image generation, suggesting that joint distribution modeling can serve as a replaceable alternative to conditional generation.
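A hedged sketch of what per-branch ("unbalanced") timestep sampling could look like: with some probability one modality is kept clean, so a single training scheme covers joint generation, depth estimation, and depth-conditioned generation. The mixture weights and function name are illustrative assumptions, not the paper's values.

```python
# Hedged sketch of per-modality timestep sampling for a two-branch diffusion model.
import random

def sample_branch_timesteps(T: int = 1000, p_clean_rgb: float = 0.2, p_clean_depth: float = 0.2):
    u = random.random()
    if u < p_clean_rgb:                      # clean RGB, noisy depth: depth-estimation-style sample
        return 0, random.randrange(1, T)
    if u < p_clean_rgb + p_clean_depth:      # noisy RGB, clean depth: depth-conditioned generation
        return random.randrange(1, T), 0
    return random.randrange(1, T), random.randrange(1, T)   # both noisy: joint generation

t_rgb, t_depth = sample_branch_timesteps()
```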
Poster
Seokjun Choi · Hoon-Gyu Chung · Yujin Jeon · Giljoo Nam · Seung-Hwan Baek

[ Exhibit Hall I ]

Abstract
Inverse rendering aims to reconstruct geometry and reflectance from captured images. Display-camera imaging systems offer unique advantages for this task: each pixel can easily function as a programmable point light source, and the polarized light emitted by LCD displays facilitates diffuse-specular separation. Despite these benefits, there is currently no public real-world dataset captured using display-camera systems, unlike other setups such as light stages. This absence hinders the development and evaluation of display-based inverse rendering methods. In this paper, we introduce the first real-world dataset for display-based inverse rendering. To achieve this, we construct and calibrate an imaging system comprising an LCD display and stereo polarization cameras. We then capture a set of objects with diverse geometry and reflectance under one-light-at-a-time (OLAT) display patterns. We also provide high-quality ground-truth geometry. Our dataset enables the synthesis of captured images under arbitrary display patterns and different noise levels. Using this dataset, we evaluate the performance of existing photometric stereo and inverse rendering methods, and provide a simple yet effective baseline for display inverse rendering that outperforms state-of-the-art inverse rendering methods. The dataset and code will be publicly available.
Poster
Zhiwei Xu

[ Exhibit Hall I ]

Abstract
Path smoothness is often overlooked in path imitation learning from expert demonstrations. In this paper, we introduce a novel learning method, termed deep angular A$^\ast$ (DAA$^\ast$), by incorporating the proposed path angular freedom (PAF) into A$^\ast$ to improve path similarity through adaptive path smoothness. The PAF aims to explore the effect of move angles on path node expansion by finding the trade-off between their minimum and maximum values, allowing for high adaptiveness for imitation learning. DAA$^\ast$ improves path optimality by closely aligning with the reference path through joint optimization of path shortening and smoothing, which correspond to heuristic distance and PAF, respectively. Throughout comprehensive evaluations on 7 datasets, including 4 maze datasets, 2 video-game datasets, and a real-world drone-view dataset containing 2 scenarios, we demonstrate remarkable improvements of our DAA$^\ast$ over neural A$^\ast$ in path similarity between the predicted and reference paths with a shorter path length when the shortest path is plausible, improving by **9.0\% SPR**, **6.9\% ASIM**, and **3.9\% PSIM**. Furthermore, when jointly learning pathfinding with both path loss and path probability map loss, DAA$^\ast$ significantly outperforms the state-of-the-art TransPath by **6.7\% SPR**, **6.5\% PSIM**, and **3.7\% ASIM**. We also discuss the minor trade-off between path optimality and …
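To make the idea of trading path length against smoothness concrete, here is a small grid A* whose step cost adds a turn-angle penalty; the weight `angle_w`, the 8-connected move set, and the Euclidean heuristic are assumptions for illustration and do not reproduce DAA*'s learned components or its PAF formulation.

```python
# Hedged sketch: A* on a grid with an angle penalty so straighter paths cost less.
import heapq, itertools, math

MOVES = [(dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1) if (dx, dy) != (0, 0)]

def astar_smooth(grid, start, goal, angle_w=0.5):
    """grid: 2-D list, 0 = free, 1 = obstacle. Search states carry the incoming move."""
    h = lambda p: math.hypot(p[0] - goal[0], p[1] - goal[1])
    tie = itertools.count()
    openq = [(h(start), next(tie), 0.0, start, None, None)]  # (f, tie, g, node, prev_move, parent_key)
    best, parent = {}, {}
    while openq:
        _, _, g, node, prev_move, pkey = heapq.heappop(openq)
        key = (node, prev_move)
        if key in best and best[key] <= g:
            continue
        best[key], parent[key] = g, pkey
        if node == goal:                                     # reconstruct the path
            path = []
            while key is not None:
                path.append(key[0])
                key = parent[key]
            return path[::-1]
        for dx, dy in MOVES:
            nx, ny = node[0] + dx, node[1] + dy
            if not (0 <= nx < len(grid) and 0 <= ny < len(grid[0])) or grid[nx][ny]:
                continue
            turn = 0.0
            if prev_move is not None:                        # penalise heading changes
                a0 = math.atan2(prev_move[1], prev_move[0])
                a1 = math.atan2(dy, dx)
                turn = abs(math.atan2(math.sin(a1 - a0), math.cos(a1 - a0)))
            ng = g + math.hypot(dx, dy) + angle_w * turn
            heapq.heappush(openq, (ng + h((nx, ny)), next(tie), ng, (nx, ny), (dx, dy), key))
    return None

grid = [[0] * 8 for _ in range(8)]
grid[3][2:6] = [1, 1, 1, 1]
path = astar_smooth(grid, (0, 0), (7, 7))
```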
Poster
Siyu Ren · Junhui Hou · Weiyao Lin · Wenping Wang

[ Exhibit Hall I ]

Abstract
We present NeCGS, the first neural compression paradigm, which can compress a geometry set encompassing thousands of detailed and diverse 3D mesh models by up to 900 times with high accuracy and preservation of detailed geometric structures. Specifically, we first propose TSDF-Def, a new implicit representation that is capable of accurately representing irregular 3D mesh models with various structures into regular 4D tensors of uniform and compact size, where 3D surfaces can be extracted through the deformable marching cubes. Then we construct a quantization-aware auto-decoder network architecture to regress these 4D tensors to explore the local geometric similarity within each shape and across different shapes for redundancy removal, resulting in more compact representations, including an embedded feature of a smaller size associated with each 3D model and a network parameter shared by all models. We finally encode the resulting features and network parameters into bitstreams through entropy coding. Besides, our NeCGS can handle the dynamic scenario well, where new 3D models are constantly added to a compressed set. Extensive experiments and ablation studies demonstrate the significant advantages of our NeCGS over state-of-the-art methods both quantitatively and qualitatively. We have included the source code in the Supplemental Material.
Poster
Xiuyu Yang · Shuhan Tan · Philipp Kraehenbuehl

[ Exhibit Hall I ]

Abstract
An ideal traffic simulator replicates the realistic long-term point-to-point trip that a self-driving system experiences during deployment. Prior models and benchmarks focus on closed-loop motion simulation for the initial agents in a scene. This is problematic for long-term simulation: agents enter and exit the scene as the ego vehicle enters new regions. We propose InfGen, a unified next-token prediction model that performs interleaved closed-loop motion simulation and scene generation. InfGen automatically switches between closed-loop motion simulation and scene generation modes, enabling stable long-term rollout simulation. InfGen performs at the state of the art in short-term (9s) traffic simulation and significantly outperforms all other methods in long-term (30s) simulation. The code and model of InfGen will be released upon acceptance.
Poster
Geonho Bang · Minjae Seong · Jisong Kim · Geunju Baek · Daye Oh · Junhyung Kim · Junho Koh · Jun Won Choi

[ Exhibit Hall I ]

Abstract
Radar-camera fusion methods have emerged as a cost-effective approach for 3D object detection but still lag behind LiDAR-based methods in performance. Recent works have focused on employing temporal fusion and Knowledge Distillation (KD) strategies to overcome these limitations. However, existing approaches have not sufficiently accounted for uncertainties arising from object motion or sensor-specific errors inherent in radar and camera modalities. In this work, we propose RCTDistill, a novel cross-modal KD method based on temporal fusion, comprising three key modules: Range-Azimuth Knowledge Distillation (RAKD), Temporal Knowledge Distillation (TKD), and Region-Decoupled Knowledge Distillation (RDKD). RAKD is designed to consider the inherent errors in the range and azimuth directions, enabling effective knowledge transfer from LiDAR features to refine inaccurate BEV representations. TKD mitigates temporal misalignment caused by dynamic objects by aligning historical radar-camera BEV features with LiDAR representations. RDKD enhances feature discrimination by distilling relational knowledge from the teacher model, allowing the student to understand better and differentiate foreground and background features. RCTDistill achieves state-of-the-art radar–camera fusion performance on both the nuScenes and view-of-delft (VoD) datasets, with the fastest inference speed of 26.2 FPS.
Poster
Xingbo YAO · xuanmin Wang · Hao WU · Chengliang PING · ZHANG Doudou · Hui Xiong

[ Exhibit Hall I ]

Abstract
Directly generating 3D cities from satellite imagery opens up new possibilities for gaming and mapping services. However, this task remains challenging due to the limited information in satellite views, making it difficult for existing methods to achieve both photorealistic textures and geometric accuracy. To address these challenges, we propose MagicCity, a novel large-scale generative model for photorealistic 3D city generation with geometric consistency. Given a satellite image, our framework first extracts 3D geometric information and encodes it alongside textural features using a dual encoder. These features then guide a multi-branch diffusion model to generate city-scale, geometrically consistent multi-view images. To further enhance texture consistency across different viewpoints, we propose an Inter-Frame Cross Attention mechanism that enables feature sharing across different frames. Additionally, we incorporate a Hierarchical Geometric-Aware Module and a Consistency Evaluator to improve overall scene consistency. Finally, the generated images are fed into our robust 3D reconstruction pipeline to produce visually high-quality and geometrically consistent 3D cities. Moreover, we contribute CityVista, a high-quality dataset comprising 500 3D city scenes along with corresponding multi-view images and satellite imagery to advance research in 3D city generation. Experimental results demonstrate that MagicCity surpasses state-of-the-art methods in both geometric consistency and visual quality.
Poster
Li-Heng Chen · Zi-Xin Zou · Chang Liu · Tianjiao Jing · Yanpei Cao · Shi-Sheng Huang · Hongbo Fu · Hua Huang

[ Exhibit Hall I ]

Abstract
Accurate surface reconstruction from unposed images is crucial for efficient 3D object or scene creation. However, it remains challenging, particularly for joint camera pose estimation. Previous approaches have achieved impressive pose-free surface reconstruction results in dense-view settings, but could easily fail for sparse-view scenarios without sufficient visual overlap. In this paper, we propose a new technique for pose-free surface reconstruction, which follows triplane-based signed distance field (SDF) learning but regularizes the learning by explicit points sampled from ray-based diffusion of camera pose estimation. Our key contribution is a novel Geometric Consistent Ray Diffusion model (GCRayDiffusion), where we represent camera poses as neural bundle rays and regress the distribution of noisy rays via a diffusion model. More importantly, we further condition the denoising process of GCRayDiffusion using the triplane-based SDF of the entire scene, which provides effective 3D consistent regularization to achieve multi-view consistent camera pose estimation. Finally, we incorporate GCRayDiffusion into the triplane-based SDF learning by introducing on-surface geometric regularization from the sampling points of the neural bundle rays, which leads to highly accurate pose-free surface reconstruction results even for sparse-view inputs. Extensive evaluations on public datasets show that our GCRayDiffusion achieves more accurate camera pose estimation than previous …
Poster
Hang Yang · Le Hui · Jianjun Qian · Jin Xie · Jian Yang

[ Exhibit Hall I ]

Abstract
Generalizable surface reconstruction aims to recover the scene surface from a sparse set of images in a feed-forward manner. Existing neural implicit representation-based methods evaluate numerous points along camera rays to infer the geometry, resulting in inefficient reconstruction. Recently, 3D Gaussian Splatting has offered an alternative, efficient scene representation and has inspired a series of surface reconstruction methods. However, these methods require dense views and cannot generalize to new scenes. In this paper, we propose a novel surface reconstruction method with Gaussian splatting, named GSRecon, which leverages the advantages of rasterization-based rendering to achieve efficient reconstruction. To obtain an accurate geometry representation, we propose a geometry-aware cross-view enhancement module that improves the unreliable geometry estimation in the current view by incorporating accurate geometric information from other views. To generate fine-grained Gaussian primitives, we propose a hybrid cross-view feature aggregation module that integrates an efficient voxel branch and a fine-grained point branch to jointly capture cross-view geometric information. Extensive experiments on the DTU, BlendedMVS, and Tanks and Temples datasets validate that GSRecon achieves state-of-the-art performance and efficient reconstruction speed.
Poster
Shaowen Tong · Zimin Xia · Alexandre Alahi · Xuming He · Yujiao Shi

[ Exhibit Hall I ]

Abstract
Cross-view localization, the task of estimating a camera's 3-degrees-of-freedom (3-DoF) pose by aligning ground-level images with satellite images, is crucial for large-scale outdoor applications like autonomous navigation and augmented reality. Existing methods often rely on fully supervised learning, which requires costly ground-truth pose annotations. In this work, we propose GeoDistill, a weakly supervised self-distillation framework that uses teacher-student learning with Field-of-View (FoV)-based masking to enhance local feature learning for robust cross-view localization. In GeoDistill, the teacher model localizes a panoramic image, while the student model predicts locations from a limited FoV counterpart created by FoV-based masking. By aligning the student's predictions with those of the teacher, the student focuses on key features like lane lines and ignores textureless regions, such as roads. This results in more accurate predictions and reduced uncertainty, regardless of whether the query images are panoramas or limited FoV images. Our experiments show that GeoDistill significantly improves localization performance across different frameworks. Additionally, we introduce a novel orientation estimation network that predicts relative orientation without requiring precise planar position ground truth. GeoDistill provides a scalable and efficient solution for real-world cross-view localization challenges.
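A toy sketch of FoV-based masking with a teacher-student consistency objective, using a stand-in localizer and a KL loss between predicted location distributions; every component here (the toy network, the mask width, the loss choice) is an assumption for illustration rather than the paper's architecture.

```python
# Hedged sketch of FoV masking plus teacher-student self-distillation.
import torch
import torch.nn.functional as F

class ToyLocalizer(torch.nn.Module):
    """Maps a panorama to a log-probability map over satellite-map cells."""
    def __init__(self, grid=32):
        super().__init__()
        self.net = torch.nn.Conv2d(3, 1, kernel_size=3, padding=1)
        self.grid = grid

    def forward(self, pano):
        feat = self.net(pano)
        logits = F.adaptive_avg_pool2d(feat, self.grid).flatten(1)
        return F.log_softmax(logits, dim=1)

def fov_mask(pano, fov_frac=0.25):
    """Keep a random horizontal slice covering `fov_frac` of the panorama width."""
    w = pano.shape[-1]
    keep = int(w * fov_frac)
    start = torch.randint(0, w - keep + 1, (1,)).item()
    masked = torch.zeros_like(pano)
    masked[..., start:start + keep] = pano[..., start:start + keep]
    return masked

teacher, student = ToyLocalizer(), ToyLocalizer()
pano = torch.rand(2, 3, 64, 256)
with torch.no_grad():
    t_logp = teacher(pano)                       # teacher sees the full panorama
s_logp = student(fov_mask(pano))                 # student sees a limited FoV
loss = F.kl_div(s_logp, t_logp, log_target=True, reduction="batchmean")
loss.backward()
```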
Poster
Haonan Han · Rui Yang · Huan Liao · Haonan Han · Zunnan Xu · Xiaoming Yu · Junwei Zha · Xiu Li · Wanhua Li

[ Exhibit Hall I ]

Abstract
Traditional image-to-3D models often struggle with scenes containing multiple objects due to biases and occlusion complexities. To address this challenge, we present REPARO, a novel approach for compositional 3D asset generation from single images. REPARO employs a two-step process: first, it extracts individual objects from the scene and reconstructs their 3D meshes using off-the-shelf image-to-3D models; then, it optimizes the layout of these meshes through differentiable rendering techniques, ensuring coherent scene composition. By integrating an optimal-transport-based long-range appearance loss term and a high-level semantic loss term in the differentiable rendering, REPARO can effectively recover the layout of 3D assets. The proposed method can significantly enhance object independence, detail accuracy, and overall scene coherence. Extensive evaluation of multi-object scenes demonstrates that our REPARO offers a comprehensive approach to address the complexities of multi-object 3D scene generation from single images.
Poster
Mukilan Karuppasamy · Shankar Gangisetty · Shyam Nandan Rai · Carlo Masone · C.V. Jawahar

[ Exhibit Hall I ]

Abstract
Autonomous driving (AD) systems are becoming increasingly capable of handling complex tasks, largely due to recent advances in deep learning and AI. As the interactions between autonomous systems and humans grow, the interpretability of driving system decision-making processes becomes crucial for safe driving. Successful human-machine interaction requires understanding the underlying representations of the environment and the driving task, which remains a significant challenge in deep learning-based systems. To address this, we introduce the task of interpretability in maneuver prediction before the maneuver occurs, for driver safety, i.e., driver intent prediction (DIP), which plays a critical role in AD systems. To foster research in interpretable DIP, we curate the eXplainable Driving Action Anticipation Dataset (DAAD-X), a new multimodal, ego-centric video dataset that provides hierarchical, high-level textual explanations as causal reasoning for the driver’s decisions. These explanations are derived from both the driver’s eye-gaze and the ego-vehicle’s perspective. Next, we propose the Video Concept Bottleneck Model (VCBM), a framework that generates spatio-temporally coherent explanations inherently, without relying on post-hoc techniques. Finally, through extensive evaluations of the proposed VCBM on the DAAD-X dataset, we demonstrate that transformer-based models exhibit greater interpretability compared to conventional CNN-based models. Additionally, we introduce a multi-label t-SNE visualization technique to illustrate the disentanglement and causal correlation among multiple explanations. The dataset and code will be released on acceptance.
Poster
Stuti Pathak · Prashant Kumar · Dheeraj Baiju · Nicholus Mboga · Gunther Steenackers · Rudi Penne

[ Exhibit Hall I ]

Abstract
Point clouds acquired in constrained, challenging, uncontrolled, and multi-sensor real-world settings are noisy, incomplete, and non-uniformly sparse. This presents acute challenges for the vital task of point cloud completion. Using tools from Algebraic Topology and Persistent Homology ($\mathcal{PH}$), we demonstrate that current benchmark object point clouds lack the rich topological features that are an integral part of point clouds captured in realistic environments. To facilitate research in this direction, we contribute the first real-world industrial dataset for point cloud completion, RealPC - a diverse, rich, and varied set of point clouds. It consists of $\sim$ 40,000 pairs across 21 categories of industrial structures in railway establishments. Benchmark results on several strong baselines reveal that existing methods fail in real-world scenarios. We make a striking observation: unlike current datasets, RealPC exhibits multiple 0- and 1-dimensional $\mathcal{PH}$-based topological features. We prove that integrating these topological priors into existing works helps improve completion. We present how 0-dimensional $\mathcal{PH}$ priors extract the global topology of a complete shape in the form of a 3D skeleton and assist a model in generating topologically consistent complete shapes. Since computing Homology is expensive, we present a simple yet effective Homology-Sampler-guided network, BOSHNet, that bypasses the …
Poster
Zewei Zhou · Hao Xiang · Zhaoliang Zheng · Zhihao Zhao · Mingyue Lei · Yun Zhang · Tianhui Cai · Xinyi Liu · Johnson Liu · Maheswari Bajji · Xin Xia · Zhiyu Huang · Bolei Zhou · Jiaqi Ma

[ Exhibit Hall I ]

Abstract
Vehicle-to-everything (V2X) technologies offer a promising paradigm to mitigate the limitations of constrained observability in single-vehicle systems. Prior work primarily focuses on single-frame cooperative perception, which fuses agents' information across different spatial locations but ignores temporal cues and temporal tasks (e.g., temporal perception and prediction). In this paper, we focus on spatio-temporal fusion in V2X scenarios and design one-step and multi-step communication strategies (when to transmit) as well as examine their integration with three fusion strategies - early, late, and intermediate (what to transmit), providing comprehensive benchmarks with 11 fusion models (how to fuse). Furthermore, we propose V2XPnP, a novel intermediate fusion framework within one-step communication for end-to-end perception and prediction. Our framework employs a unified Transformer-based architecture to effectively model complex spatio-temporal relationships across multiple agents, frames, and high-definition maps. Moreover, we introduce the V2XPnP Sequential Dataset that supports all V2X collaboration modes and addresses the limitations of existing real-world datasets, which are restricted to single-frame or single-mode cooperation. Extensive experiments demonstrate that our framework outperforms state-of-the-art methods in both perception and prediction tasks. The codebase and dataset will be released to facilitate future V2X research.
Poster
Zhuoran Yang · Xi Guo · Chenjing Ding · Chiyu Wang · Wei Wu · Yanyong Zhang

[ Exhibit Hall I ]

Abstract
Autonomous driving relies on robust models trained on high-quality, large-scale multi-view driving videos for tasks like perception and planning. While world models offer a cost-effective solution for generating realistic driving videos, they struggle to maintain instance-level temporal consistency and spatial geometric fidelity. To address these challenges, we propose **InstaDrive**, a novel framework that enhances driving video realism through two key advancements: (1) Instance Flow Guider module, which extracts and propagates instance features across frames to enforce temporal consistency, preserving instance identity over time. (2) Spatial Geometric Aligner module, which improves spatial reasoning, ensures precise instance positioning, and explicitly models occlusion hierarchies. By incorporating these instance-aware mechanisms, InstaDrive achieves state-of-the-art video generation quality and enhances downstream autonomous driving tasks on the nuScenes dataset. Additionally, we utilize CARLA's autopilot to procedurally and stochastically simulate rare yet safety-critical driving scenarios across diverse maps and regions, enabling rigorous safety evaluation for autonomous systems.
Poster
Jiro Abe · Gaku Nakano · Kazumine Ogura

[ Exhibit Hall I ]

Abstract
We propose NormalLoc, a novel visual localization method for estimating the 6-DoF pose of a camera using textureless 3D models. Existing methods often rely on color or texture information, limiting their applicability in scenarios where such information is unavailable. NormalLoc addresses this limitation by using rendered normal images generated from surface normals of 3D models to establish a training scheme for both global descriptor computation and matching. This approach enables robust visual localization even when geometric details are limited. Experimental results demonstrate that NormalLoc achieves state-of-the-art performance for visual localization on textureless 3D models, especially in scenarios with limited geometric detail.
Poster
Gunjan Chhablani · Xiaomeng Ye · Muhammad Zubair Irshad · Zsolt Kira

[ Exhibit Hall I ]

Abstract
The field of Embodied AI predominantly relies on simulation for training and evaluation, often using either fully synthetic environments that lack photorealism or high-fidelity real-world reconstructions captured with expensive hardware. As a result, sim-to-real transfer remains a major challenge. In this paper, we introduce EmbodiedSplat, a novel approach that personalizes policy training by efficiently capturing the deployment environment and fine-tuning policies within the reconstructed scenes. Our method leverages 3D Gaussian Splatting (GS) and the Habitat-Sim simulator to bridge the gap between realistic scene capture and effective training environments. Using iPhone-captured deployment scenes, we reconstruct meshes via GS, enabling training in settings that closely approximate real-world conditions. We conduct a comprehensive analysis of training strategies, pre-training datasets, and mesh reconstruction techniques, evaluating their impact on sim-to-real predictivity in real-world scenarios. Experimental results demonstrate that agents fine-tuned with EmbodiedSplat outperform both zero-shot baselines pre-trained on large-scale real-world datasets (HM3D) and synthetically generated datasets (HSSD), achieving absolute success rate improvements of 20% and 40% on real-world Image Navigation tasks. Moreover, our approach yields a high sim-vs-real correlation (0.87–0.97) for the reconstructed meshes, underscoring its effectiveness in adapting policies to diverse environments with minimal effort. Code and data will be released to facilitate further …
Poster
Jiale Xu · Shenghua Gao · Ying Shan

[ Exhibit Hall I ]

Abstract
Sparse-view reconstruction models typically require precise camera poses, yet obtaining these parameters from sparse-view images remains challenging. We introduce \textbf{FreeSplatter}, a scalable feed-forward framework that generates high-quality 3D Gaussians from \textbf{uncalibrated} sparse-view images while estimating camera parameters within seconds. Our approach employs a streamlined transformer architecture where self-attention blocks facilitate information exchange among multi-view image tokens, decoding them into pixel-aligned 3D Gaussian primitives within a unified reference frame. This representation enables both high-fidelity 3D modeling and efficient camera parameter estimation using off-the-shelf solvers. We develop two specialized variants--for \textbf{object-centric} and \textbf{scene-level} reconstruction--trained on comprehensive datasets. Remarkably, FreeSplatter outperforms existing pose-dependent Large Reconstruction Models (LRMs) by a notable margin while achieving comparable or even better pose estimation accuracy compared to the state-of-the-art pose-free reconstruction approach MASt3R on challenging benchmarks. Beyond technical benchmarks, FreeSplatter streamlines text/image-to-3D content creation pipelines, eliminating the complexity of camera pose management while delivering exceptional visual fidelity.
Poster
Rui Chen · Zehuan Wu · Yichen Liu · Yuxin Guo · Jingcheng Ni · Haifeng Xia · Siyu Xia

[ Exhibit Hall I ]

Abstract
The creation of diverse and realistic driving scenarios has become essential to enhance perception and planning capabilities of the autonomous driving system. However, generating long-duration, surround-view consistent driving videos remains a significant challenge. To address this, we present UniMLVG, a unified framework designed to generate extended street multi-perspective videos under precise control. By integrating single- and multi-view driving videos into the training data, our approach updates a DiT-based diffusion model equipped with cross-frame and cross-view modules across three stages with multiple training objectives, substantially boosting the diversity and quality of generated visual content. Importantly, we propose an innovative explicit viewpoint modeling approach for multi-view video generation to effectively improve motion transition consistency. Capable of handling various input reference formats (e.g., text, images, or video), our UniMLVG generates high-quality multi-view videos according to the corresponding condition constraints such as 3D bounding boxes or frame-level text descriptions. Compared to the best models with similar capabilities, our framework achieves improvements of 48.2\% in FID and 35.2\% in FVD.
Poster
yunjiang xu · Yupeng Ouyang · Lingzhi Li · Jin Wang · Benyuan Yang

[ Exhibit Hall I ]

Abstract
Collaborative perception systems overcome single-vehicle limitations in long-range detection and occlusion scenarios by integrating multi-agent sensory data, improving accuracy and safety. However, frequent cooperative interactions and real-time requirements impose stringent bandwidth constraints. Previous work has shown that query-based instance-level interaction reduces bandwidth demands and manual priors; however, LiDAR-focused implementations in collaborative perception remain underdeveloped, with performance still trailing state-of-the-art approaches. To bridge this gap, we propose INSTINCT (instance-level interaction architecture), a novel collaborative perception framework featuring three core components: 1) a quality-aware filtering mechanism for high-quality instance feature selection; 2) a dual-branch detection routing scheme to decouple collaboration-irrelevant and collaboration-relevant instances; and 3) a Cross Agent Local Instance Fusion module to aggregate local hybrid instance features. Additionally, we enhance the ground truth (GT) sampling technique to facilitate training with diverse hybrid instance features. Extensive experiments across multiple datasets demonstrate that INSTINCT achieves superior performance. Specifically, our method achieves accuracy improvements of 13.23\%/32.24\% on DAIR-V2X and V2V4Real while reducing the communication bandwidth to 1/281 and 1/264 compared to state-of-the-art methods. The code will be released soon.
Poster
Sandro Papais · Letian Wang · Brian Cheong · Steven Waslander

[ Exhibit Hall I ]

Abstract
We introduce ForeSight, a novel joint detection and forecasting framework for vision-based 3D perception in autonomous vehicles. Traditional approaches treat detection and forecasting as separate sequential tasks, limiting their ability to leverage temporal cues from past forecasts. ForeSight addresses this limitation with a multi-task streaming and bidirectional learning approach, allowing detection and forecasting to share query memory and propagate information seamlessly. The forecast-aware detection transformer enhances spatial reasoning by integrating trajectory predictions from a multiple hypothesis forecast memory queue, while the streaming forecast transformer improves temporal consistency using past forecasts and refined detections. Unlike tracking-based methods, ForeSight eliminates the need for explicit object association, reducing error propagation with a tracking-free model that efficiently scales across multi-frame sequences. Experiments on the nuScenes dataset show that ForeSight achieves state-of-the-art performance, with an EPA of 54.9\%, surpassing previous methods by 9.3\%, while also attaining the highest mAP among multi-view detection models and maintaining competitive motion forecasting accuracy.
Poster
Soham Dasgupta · Shanthika Naik · Preet Savalia · Sujay Kumar Ingle · Avinash Sharma

[ Exhibit Hall I ]

Abstract
Dynamic garment reconstruction from monocular video is an important yet challenging task due to the complex dynamics and unconstrained nature of the garments. Recent advancements in neural rendering have enabled high-quality geometric reconstruction with image/video supervision. However, implicit representation methods that use volume rendering often provide smooth geometry and fail to model high-frequency details. While template reconstruction methods model explicit geometry, they use vertex displacement for deformation which results in artifacts. Addressing these limitations, we propose NGD, a Neural Gradient-based Deformation method to reconstruct dynamically evolving textured garments from monocular videos. Additionally, we propose a novel adaptive remeshing strategy for modeling dynamically evolving surfaces like wrinkles and pleats of the skirt, leading to high-quality reconstruction. Finally, we learn dynamic texture maps to capture per-frame lighting and shadow effects. We provide extensive qualitative and quantitative evaluations to demonstrate significant improvements over existing SOTA methods and provide high-quality garment reconstructions.
Poster
Soonbin Lee · Fangwen Shu · Yago Sanchez de la Fuente · Thomas Schierl · Cornelius Hellge

[ Exhibit Hall I ]

Abstract
3D Gaussian Splatting is a recognized method for 3D scene representation, known for its high rendering quality and speed. However, its substantial data requirements present challenges for practical applications. In this paper, we introduce an efficient compression technique that significantly reduces storage overhead by using compact representation. We propose a unified architecture that combines point cloud data and feature planes through a progressive tri-plane structure. Our method utilizes 2D feature planes, enabling continuous spatial representation. To further optimize these representations, we incorporate entropy modeling in the frequency domain, specifically designed for standard video codecs. We also propose channel-wise bit allocation to achieve a better trade-off between bitrate consumption and feature plane representation. Consequently, our model effectively leverages spatial correlations within the feature planes to enhance rate-distortion performance using standard, non-differentiable video codecs. Experimental results demonstrate that our method outperforms existing methods in data compactness while maintaining high rendering quality.
Poster
Yingsi Qin · Aswin Sankaranarayanan · Matthew O'Toole

[ Exhibit Hall I ]

Abstract
A lens brings a $\textit{single}$ plane into focus on a planar sensor; hence, parts of the scene that are outside this planar focus plane are resolved on the sensor under defocus. Can we break this precept by enabling a lens that can change its depth-of-field arbitrarily? This work investigates the design and implementation of such a computational lens with spatially-selective focusing. Our design uses an optical arrangement of Lohmann lenses and phase spatial light modulators to allow each pixel to focus onto a different depth. We extend classical techniques used in autofocusing to the spatially-varying scenario where the depth map is iteratively estimated using contrast and disparity cues, enabling the camera to progressively shape its depth-of-field to the scene's depth. By obtaining an optical all-in-focus image, our technique advances upon a broad swathe of prior work ranging from depth-from-focus/defocus to coded aperture techniques in two key aspects: the ability to bring an entire scene in focus simultaneously, and the ability to maintain the highest possible spatial resolution.
Poster
Xinyu Zhou · Peiqi Duan · Yeliduosi Xiaokaiti · Chao Xu · Boxin Shi

[ Exhibit Hall I ]

Abstract
Visual vibrometry has emerged as a powerful technique for remote acquisition of audio signals and the physical properties of materials. To capture high-frequency vibrations, frame-based visual vibrometry approaches often require a high-speed video camera and bright lighting to compensate for the short exposure time. In this paper, we introduce event-based visual vibrometry, a new high-speed visual vibration sensing method using an event camera. Exploiting the high temporal resolution, dynamic range, and low bandwidth characteristics of event cameras, event-based visual vibrometry achieves high-speed vibration sensing under common lighting conditions with enhanced data efficiency. Specifically, we leverage a hybrid camera system and propose an event-based subtle motion estimation framework that integrates an optimization-based approach for estimating coarse motion within short time intervals and a neural network to mitigate the inaccuracies in the coarse motion estimation. We demonstrate our method by capturing vibration caused by audio sources and estimating material properties for various objects.
Poster
Xiang Xu · Lingdong Kong · Song Wang · Chuanwei Zhou · Qingshan Liu

[ Exhibit Hall I ]

Abstract
LiDAR representation learning aims to extract rich structural and semantic information from large-scale, readily available datasets, reducing reliance on costly human annotations. However, existing LiDAR representation strategies often overlook the inherent spatiotemporal cues in LiDAR sequences, limiting their effectiveness. In this work, we propose LiMA, a novel long-term image-to-LiDAR Memory Aggregation framework that explicitly captures longer range temporal correlations to enhance LiDAR representation learning. LiMA comprises three key components: 1) a Cross-View Aggregation module that aligns and fuses overlapping regions across neighboring camera views, constructing a more unified and redundancy-free memory bank; 2) a Long-Term Feature Propagation mechanism that efficiently aligns and integrates multi-frame image features, reinforcing temporal coherence during LiDAR representation learning; and 3) a Cross-Sequence Memory Alignment strategy that enforces consistency across driving sequences, improving generalization to unseen environments. LiMA maintains high pretraining efficiency and incurs no additional computational overhead during downstream tasks. Extensive experiments on mainstream LiDAR-based perception benchmarks demonstrate that LiMA significantly improves both LiDAR semantic segmentation and 3D object detection. We hope this work inspires more effective pretraining paradigms for autonomous driving. The code will be made publicly accessible for future research.
Poster
Zipei Ma · Junzhe Jiang · Yurui Chen · Li Zhang

[ Exhibit Hall I ]

Abstract
The realistic reconstruction of street scenes is critical for developing real-world simulators in autonomous driving. Most existing methods rely on object pose annotations, using these poses to reconstruct dynamic objects and move them during the rendering process. This dependence on high-precision object annotations limits large-scale and extensive scene reconstruction. To address this challenge, we propose Bézier curve Gaussian splatting (BézierGS), which represents the motion trajectories of dynamic objects using learnable Bézier curves. This approach fully leverages the temporal information of dynamic objects and, through learnable curve modeling, automatically corrects pose errors. By introducing additional supervision on dynamic object rendering and inter-curve consistency constraints, we achieve reasonable and accurate separation and reconstruction of scene elements. Extensive experiments on the Waymo Open Dataset and the nuPlan benchmark demonstrate that BézierGS outperforms state-of-the-art alternatives in both dynamic and static scene components reconstruction and novel view synthesis.
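A minimal sketch of a learnable Bézier trajectory for a dynamic object's centre, evaluated in the Bernstein basis so gradients flow back to the control points; the cubic degree and the toy usage are assumptions, not the paper's exact parameterization.

```python
# Hedged sketch: a differentiable Bézier curve with learnable control points.
import math
import torch

def bezier(control_pts: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    """control_pts: (n+1, 3) control points; t: (T,) in [0, 1] -> (T, 3) positions."""
    n = control_pts.shape[0] - 1
    coeffs = []
    for k in range(n + 1):
        # Bernstein basis polynomial B_{k,n}(t)
        coeffs.append(math.comb(n, k) * t.pow(k) * (1 - t).pow(n - k))
    basis = torch.stack(coeffs, dim=1)          # (T, n+1)
    return basis @ control_pts                  # (T, 3)

ctrl = torch.nn.Parameter(torch.randn(4, 3))    # learnable cubic control points
timestamps = torch.linspace(0, 1, 10)           # normalized frame times
positions = bezier(ctrl, timestamps)            # object centres at each frame
positions.sum().backward()                      # gradients flow to the curve parameters
```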
Poster
Wenting Luan · Siqi Lu · Yongbin Zheng · Wanying XU · Lang Nie · Zongtan Zhou · Kang Liao

[ Exhibit Hall I ]

Abstract
The mainstream approach for correcting distortions in wide-angle images typically involves a cascading process of rectification followed by rectangling. These tasks address distorted image content and irregular boundaries separately, using two distinct pipelines. However, this independent optimization prevents the two stages from benefiting each other. It also increases susceptibility to error accumulation and misaligned optimization, ultimately degrading the quality of the rectified image and the performance of downstream vision tasks. In this work, we observe and verify that transformations based on motion representations (*e.g.*, Thin-Plate Spline) exhibit structural continuity in both rectification and rectangling tasks. This continuity enables us to establish their relationships through the perspective of structural morphing, allowing for an optimal solution within a single end-to-end framework. To this end, we propose ConBo-Net, a unified Content and Boundary modeling approach for one-stage wide-angle image correction. Our method jointly addresses distortion rectification and boundary rectangling in an end-to-end manner. To further enhance the model’s structural recovery capability, we incorporate physical priors based on the wide-angle camera model during training and introduce an ordinal geometric loss to enforce curvature monotonicity. Extensive experiments demonstrate that ConBo-Net outperforms state-of-the-art two-stage solutions. The code and dataset will be made available.
Poster
Jiahao LI · Xinhong Chen · Zhengmin JIANG · Qian Zhou · Yung-Hui Li · Jianping Wang

[ Exhibit Hall I ]

Abstract
Stereo matching achieves significant progress with iterative algorithms like RAFT-Stereo and IGEV-Stereo. However, these methods struggle in ill-posed regions with occlusions, textureless surfaces, or repetitive patterns, due to the lack of global context and geometric information needed for effective iterative refinement. To enable the existing iterative approaches to incorporate global context, we propose the Global Regulation and Excitation via Attention Tuning (GREAT) framework which encompasses three attention modules. Specifically, Spatial Attention (SA) captures the global context within the spatial dimension, Matching Attention (MA) extracts global context along epipolar lines, and Volume Attention (VA) works in conjunction with SA and MA to construct a more robust cost-volume excited by global context and geometric details. To verify the universality and effectiveness of this framework, we integrate it into several representative iterative stereo-matching methods, collectively denoted as GREAT-Stereo, and validate it through extensive experiments. This framework demonstrates superior performance in challenging ill-posed regions. Applied to IGEV-Stereo, among all published methods, our GREAT-IGEV ranks first on the Scene Flow test set, KITTI 2015, and ETH3D leaderboards, and achieves second on the Middlebury benchmark. Code for reproducibility will be available in the future.
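A minimal sketch of the matching-attention idea for rectified stereo, where epipolar lines coincide with image rows; the module layout and sizes are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class MatchingAttention(nn.Module):
    """Self-attention along epipolar lines (image rows for rectified stereo).

    Every pixel attends to all pixels on its own row, injecting global context
    into the features used to build the cost volume.
    """
    def __init__(self, channels: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        B, C, H, W = feat.shape
        rows = feat.permute(0, 2, 3, 1).reshape(B * H, W, C)   # one sequence per image row
        out, _ = self.attn(rows, rows, rows)
        out = out.reshape(B, H, W, C).permute(0, 3, 1, 2)
        return feat + out                                       # residual connection
```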
Poster
Shijie Li · Zhongyao Cheng · Rong Li · Shuai Li · Juergen Gall · Xun Xu · Xulei Yang

[ Exhibit Hall I ]

Abstract
Monocular Semantic Scene Completion (MonoSSC) reconstructs and interprets 3D environments from a single image, enabling diverse real-world applications. However, existing methods are often constrained by the local receptive field of Convolutional Neural Networks (CNNs), making it challenging to handle the non-uniform distribution of projected points and effectively reconstruct missing information caused by the 3D-to-2D projection. In this work, we introduce GA-MonoSSC, a hybrid architecture for MonoSSC that effectively captures global context in both the 2D image domain and 3D space. Specifically, we propose a Dual-Head Multi-Modality Encoder, which leverages a Transformer architecture to capture spatial relationships across all features in the 2D image domain, enabling more comprehensive 2D feature extraction. Additionally, we introduce the Frustum Mamba Decoder, built on the State Space Model (SSM), to efficiently capture long-range dependencies in 3D space. Furthermore, we propose a frustum reordering strategy within the Frustum Mamba Decoder to mitigate feature discontinuities in the reordered voxel sequence, ensuring better alignment with the scan mechanism of the SSM for improved 3D representation learning. We conduct extensive experiments on the widely used Occ-ScanNet and NYUv2 datasets, demonstrating that our proposed method achieves state-of-the-art performance, validating its effectiveness. The code will be …
Poster
Yuping Wang · Xiangyu Huang · Xiaokang Sun · Mingxuan Yan · Shuo Xing · Zhengzhong Tu · Jiachen Li

[ Exhibit Hall I ]

Abstract
We introduce UniOcc, a comprehensive, unified benchmark for occupancy forecasting (i.e., predicting future occupancies based on historical information) and current-frame occupancy prediction from camera images. UniOcc unifies data from multiple real-world datasets (i.e., nuScenes, Waymo) and high-fidelity driving simulators (i.e., CARLA, OpenCOOD), which provides 2D/3D occupancy labels with per-voxel flow annotations and support for cooperative autonomous driving. Unlike existing studies that rely on suboptimal pseudo labels for evaluation, UniOcc incorporates novel evaluation metrics that do not depend on ground-truth occupancy, enabling robust assessment of additional aspects of occupancy quality. Through extensive experiments on state-of-the-art models, we demonstrate that large-scale, diverse training data and explicit flow information significantly enhance occupancy prediction and forecasting performance. We will release UniOcc to facilitate research in safe and reliable autonomous driving.
Poster
Tianqi Liu · Zihao Huang · Zhaoxi Chen · Guangcong Wang · Shoukang Hu · Liao Shen · Huiqiang Sun · Zhiguo Cao · Wei Li · Ziwei Liu

[ Exhibit Hall I ]

Abstract
We present **Free4D**, a novel tuning-free framework for 4D scene generation from a single image. Existing methods either focus on object-level generation, making scene-level generation infeasible, or rely on large-scale multi-view video datasets for expensive training, with limited generalization ability due to the scarcity of 4D scene data. In contrast, our key insight is to distill pre-trained foundation models for consistent 4D scene representation, which offers promising advantages such as efficiency and generalizability. **1)** To achieve this, we first animate the input image using image-to-video diffusion models followed by 4D geometric structure initialization. **2)** To lift this coarse structure into spatial-temporal consistent multi-view videos, we design an adaptive guidance mechanism with a point-guided denoising strategy for spatial consistency and a novel latent replacement strategy for temporal coherence. **3)** To turn these generated observations into consistent 4D representation, we propose a modulation-based refinement to mitigate inconsistencies while fully leveraging the generated information. The resulting 4D representation enables real-time, controllable temporal-spatial rendering, marking a significant advancement in single-image-based 4D scene generation. Code will be released.
Poster
Yaopeng Lou · Liao Shen · Tianqi Liu · Jiaqi Li · Zihao Huang · Huiqiang Sun · Zhiguo Cao

[ Exhibit Hall I ]

Abstract
We present Multi-Baseline Gaussian Splatting (MuGS), a generalized feed-forward approach for novel view synthesis that effectively handles diverse baseline settings, including sparse input views with both small and large baselines. Specifically, we integrate features from Multi-View Stereo (MVS) and Monocular Depth Estimation (MDE) to enhance feature representations for generalizable reconstruction. Next, we propose a projection-and-sampling mechanism for deep depth fusion, which constructs a fine probability volume to guide the regression of the feature map. Furthermore, we introduce a reference-view loss to improve geometry and optimization efficiency. We leverage 3D Gaussian representations to accelerate training and inference while enhancing rendering quality. MuGS achieves state-of-the-art performance across multiple baseline settings and diverse scenarios ranging from simple objects (DTU) to complex indoor and outdoor scenes (RealEstate10K). We also demonstrate promising zero-shot performance on the LLFF and Mip-NeRF 360 datasets. Code will be released.
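The probability-volume-guided regression can be illustrated with a standard soft-argmax over depth hypotheses; this is a generic stand-in for the idea, not the paper's exact projection-and-sampling mechanism.

```python
import torch

def regress_depth(prob_volume: torch.Tensor, depth_hypotheses: torch.Tensor) -> torch.Tensor:
    """Soft-argmax depth regression from a probability volume.

    prob_volume: (B, D, H, W) unnormalized scores over D depth hypotheses.
    depth_hypotheses: (D,) candidate depths shared by all pixels.
    Returns (B, H, W) per-pixel expected depth.
    """
    probs = torch.softmax(prob_volume, dim=1)                      # normalize over the depth axis
    return (probs * depth_hypotheses.view(1, -1, 1, 1)).sum(dim=1)
```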
Poster
Guangting Zheng · Jiajun Deng · Xiaomeng Chu · Yu Yuan · Houqiang Li · Yanyong Zhang

[ Exhibit Hall I ]

Abstract
Recently, 3D Gaussian Splatting (3DGS) has reshaped the field of photorealistic 3D reconstruction, achieving impressive rendering quality and speed. However, when applied to large-scale street scenes, existing methods suffer from rapidly escalating per-viewpoint reconstruction costs as scene size increases, leading to significant computational overhead. After revisiting the conventional pipeline, we identify three key factors accounting for this issue: unnecessary local-to-global transformations, excessive 3D-to-2D projections, and inefficient rendering of distant content. To address these challenges, we propose S3R-GS, a 3DGS framework that Streamlines the pipeline for large-scale Street Scene Reconstruction, effectively mitigating these limitations. Moreover, most existing street 3DGS methods rely on ground-truth 3D bounding boxes to separate dynamic and static components, but 3D bounding boxes are difficult to obtain, limiting real-world applicability. To address this, we propose an alternative solution with 2D boxes, which are easier to annotate or can be predicted by off-the-shelf vision foundation models. Such designs together make S3R-GS readily adapt to large, in-the-wild scenarios. Extensive experiments demonstrate that S3R-GS enhances rendering quality and significantly accelerates reconstruction. Remarkably, when applied to videos from the challenging Argoverse2 dataset, it achieves state-of-the-art PSNR and SSIM, reducing reconstruction time to below 50\% (and even 20\%) of that of competing methods. The code will be released to …
Poster
Yingqi Tang · Zhuoran Xu · Zhaotie Meng · Erkang Cheng

[ Exhibit Hall I ]

Abstract
Although end-to-end autonomous driving (E2E-AD) technologies have made significant progress in recent years, their performance in closed-loop evaluation remains unsatisfactory. The potential of leveraging planning in query design and interaction has not yet been fully explored. In this paper, we introduce a multi-granularity planning query representation that integrates heterogeneous waypoints, including spatial, temporal, and driving-style waypoints across various sampling patterns. It provides additional supervision for trajectory prediction, enhancing precise closed-loop control for the ego vehicle. Additionally, we explicitly utilize the geometric properties of planning trajectories to effectively retrieve relevant image features based on physical locations using deformable attention. By combining these strategies, we propose a novel end-to-end autonomous driving framework, termed HiP-AD, which simultaneously performs perception, prediction, and planning within a unified decoder. HiP-AD enables comprehensive interaction by allowing planning queries to iteratively interact with perception queries in the BEV space while dynamically extracting image features from perspective views. Experiments demonstrate that HiP-AD outperforms all existing end-to-end autonomous driving methods on the closed-loop benchmark Bench2Drive and achieves competitive performance on the real-world dataset nuScenes. The code will be available upon acceptance.
Poster
Shenxing Wei · Jinxi Li · Yafei YANG · Siyuan Zhou · Bo Yang

[ Exhibit Hall I ]

Abstract
In this paper, we present a generalizable method for 3D surface reconstruction from raw point clouds or from 3D Gaussians pre-estimated by 3DGS from RGB images. Unlike existing coordinate-based methods which are often computationally intensive when rendering explicit surfaces, our proposed method, named **RayletDF**, introduces a new technique called raylet distance field, which aims to directly predict surface points from query rays. Our pipeline consists of three key modules: a raylet feature extractor, a raylet distance field predictor, and a multi-raylet blender. These components work together to extract fine-grained local geometric features, predict raylet distances, and aggregate multiple predictions to reconstruct precise surface points. We extensively evaluate our method on multiple public real-world datasets, demonstrating superior performance in surface reconstruction from point clouds or 3D Gaussians. Most notably, our method achieves exceptional generalization ability, successfully recovering 3D surfaces in a single forward pass across unseen datasets in testing.
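A minimal sketch of the raylet idea: each query ray predicts a distance to the surface, and a blender aggregates several raylet predictions into one surface point. The confidence-weighted blending below is an assumption for illustration, not the paper's exact blender.

```python
import numpy as np

def blend_raylet_predictions(origins, directions, distances, confidences):
    """Aggregate per-raylet surface predictions into one surface point.

    origins, directions: (K, 3) rays ("raylets") querying the same surface region.
    distances: (K,) predicted raylet distances along each ray.
    confidences: (K,) blending scores predicted by a network.
    Returns a single blended 3D surface point.
    """
    points = origins + distances[:, None] * directions       # (K, 3) per-raylet surface points
    weights = np.exp(confidences - confidences.max())
    weights = weights / weights.sum()                         # softmax over the K raylets
    return (weights[:, None] * points).sum(axis=0)
```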
Poster
Weiqi Zhang · Junsheng Zhou · Haotian Geng · Wenyuan Zhang · Liang Han

[ Exhibit Hall I ]

Abstract
3D Gaussian Splatting (3DGS) has demonstrated its advantages in achieving fast and high-quality rendering. As point clouds serve as a widely-used and easily accessible form of 3D representation, bridging the gap between point clouds and Gaussians becomes increasingly important. Recent studies have explored how to convert the colored points into Gaussians, but directly generating Gaussians from colorless 3D point clouds remains an unsolved challenge. In this paper, we propose GAP, a novel approach that gaussianizes raw point clouds into high-fidelity 3D Gaussians with text guidance. Our key idea is to design a multi-view optimization framework that leverages a depth-aware image diffusion model to synthesize consistent appearances across different viewpoints. To ensure geometric accuracy, we introduce a surface-anchoring mechanism that effectively constrains Gaussians to lie on the surfaces of 3D shapes during optimization. Furthermore, GAP incorporates a diffuse-based inpainting strategy that specifically targets the completion of hard-to-observe regions. We evaluate GAP on the Point-to-Gaussian generation task across varying complexity levels, from synthetic point clouds to challenging real-world scans, and even large-scale scenes.
Poster
Yesheng Zhang · Xu Zhao

[ Exhibit Hall I ]

Abstract
This work presents a novel framework for Visual Localization (VL), that is, regressing camera rays from query images to derive camera poses. As an overparameterized representation of the camera pose, camera rays possess superior robustness in optimization. Of particular importance, Camera Ray Regression (CRR) is privacy-preserving, rendering it a viable VL approach for real-world applications. Thus, we introduce DINO-based Multi-Mappers, coined DIMM, to achieve VL by CRR. DIMM utilizes DINO as a scene-agnostic encoder to obtain powerful features from images. To mitigate ambiguity, the features integrate both local and global perception, as well as potential geometric constraints. Then, a scene-specific mapper head regresses camera rays from these features. It incorporates a semantic attention module for soft fusion of multiple mappers, utilizing the rich semantic information in DINO features. In extensive experiments on both indoor and outdoor datasets, our method showcases impressive performance, revealing a promising direction for advancements in VL.
Poster
Haiyang Ying · Matthias Zwicker

[ Exhibit Hall I ]

Abstract
Edges are one of the most basic parametric primitives to describe structural information in 3D. In this paper, we study parametric 3D edge reconstruction from calibrated multi-view images. Previous methods usually reconstruct a 3D edge point set from multi-view 2D edge images, and then fit 3D edges to the point set. However, noise in the point set may cause gaps among fitted edges, and the recovered edges may not align with input multi-view images since the edge fitting depends only on the reconstructed 3D point set. To mitigate these problems, we propose SketchSplat, a method to reconstruct accurate, complete, and compact 3D edges via differentiable multi-view sketch splatting. We represent 3D edges as sketches, which are parametric lines and curves defined by attributes including control points, scales, and opacity. During edge reconstruction, we iteratively sample Gaussian points from a set of sketches and rasterize the Gaussians onto 2D edge images. Then the gradient of the error between the rasterized and input 2D edge images can be back-propagated to optimize the sketch attributes. Our method bridges 2D edge images and 3D edges in a differentiable manner, which ensures that 3D edges align well with 2D images and leads to accurate …
Poster
Chu Zhou · Yixin Yang · Junda Liao · Heng Guo · Boxin Shi · Imari Sato

[ Exhibit Hall I ]

Abstract
Polarization has found applications in various computer vision tasks by providing additional physical cues. However, due to the limitations of current imaging systems, polarimetric parameters are typically stored in discrete form, which is non-differentiable and limits their applicability in polarization-based vision. While current neural field methods have shown promise for continuous signal reconstruction, they struggle to model the intrinsic physical interdependencies among polarimetric parameters. In this work, we propose a physics-grounded representation scheme to represent polarimetric parameters as a unified complex-valued wavefunction. Tailored to this scheme, we propose a tuning-free fitting strategy along with a lightweight complex-valued neural network, enabling property-preserved reconstruction. Experimental results show that our method achieves state-of-the-art performance and facilitates smooth polarized image rendering and flexible resolution adjustments.
Poster
Yuchong Chen · Jian Yu · Shaoyan Gai · Zeyu Cai · Feipeng Da

[ Exhibit Hall I ]

Abstract
In structured light systems, measurement accuracy tends to decline significantly when evaluating complex textured surfaces, particularly at boundaries between different colors. To address this issue, this paper conducts a detailed analysis to develop an error model that characterizes the relationship between phase error and image characteristics, specifically the blur level, grayscale value, and grayscale gradient. Based on this model, a high-precision approach for measuring complex textured targets is introduced, based on multiple filtering. This approach first applies a sequence of filters to vary the blur level of the captured patterns, allowing calculation of phase differences under different blur conditions. Then, these phase differences are used in the constructed error model to identify the critical parameter causing phase errors. Finally, phase recovery is performed using the calibrated parameter, effectively reducing errors caused by complex textures. Experimental comparisons show that this method reduces the Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) by 40.31% and 40.78%, respectively. In multiple experiments, its performance generally surpassed that of existing methods, demonstrating improved accuracy and robustness.
Poster
Han Ling · Yinghui Sun · Xian Xu · Quansen Sun

[ Exhibit Hall I ]

Abstract
3D Gaussian Splatting (3DGS) has become one of the most promising 3D reconstruction technologies. However, label noise in real-world scenarios—such as moving objects, non-Lambertian surfaces, and shadows—often leads to reconstruction errors. Existing 3DGS-based anti-noise reconstruction methods either fail to separate noise effectively or require scene-specific fine-tuning of hyperparameters, making them difficult to apply in practice. This paper re-examines the problem of anti-noise reconstruction from the perspective of epistemic uncertainty, proposing a novel framework, OCSplats. By combining key technologies such as hybrid noise assessment and observation-based cognitive correction, OCSplats significantly improves the accuracy of noise classification in areas with cognitive differences. Moreover, to address the issue of varying noise proportions in different scenarios, we have designed a label noise classification pipeline based on dynamic anchor points. This pipeline enables OCSplats to be applied simultaneously to scenarios with vastly different noise proportions without adjusting parameters. Extensive experiments demonstrate that OCSplats consistently achieves leading reconstruction performance and precise label noise classification in scenes of different complexity levels. Code will be available.
Poster
Runjia Li · Philip Torr · Andrea Vedaldi · Tomas Jakab

[ Exhibit Hall I ]

Abstract
We propose a novel approach for long-term autoregressive scene generation in the form of a camera-conditioned video stream. Existing methods either rely on explicit geometry estimation in inpainting-based approaches, which suffer from geometric inaccuracies, or use a limited context window in video-based approaches, which struggle with long-term coherence. To address these limitations, we introduce Surfel-Indexed Memory of Views (SIMView), a mechanism that anchors past views to surface elements (surfels) they previously observed. This allows us to retrieve and condition novel view generation on the most relevant past views rather than just the latest ones. By leveraging information about the scene's geometric structure, our method significantly enhances long-term scene consistency while reducing computational overhead. We evaluate our approach on challenging long-term scene synthesis benchmarks, demonstrating superior performance in scene coherence and camera control compared to existing methods.
Poster
Jiacheng Chen · Ziyu Jiang · Mingfu Liang · Bingbing Zhuang · Jong-Chyi Su · Sparsh Garg · Ying Wu · Manmohan Chandraker

[ Exhibit Hall I ]

Abstract
Video generation for driving scenes has gained increasing attention due to its broad range of applications, including autonomous driving, robotics, and mixed reality. However, generating high-quality, long-horizon, and 3D-consistent videos remains a challenge. We propose AutoScape, a framework designed for long-horizon driving scene generation. The framework comprises two stages: 1) Keyframe Generation, which anchors global scene appearance and geometry by autoregressively generating 3D-consistent keyframes using a joint RGB-D diffusion model, and 2) Interpolation, which employs a video diffusion model to generate dense frames conditioned on consecutive keyframes, ensuring temporal continuity and geometric consistency. With three innovative design choices to guarantee 3D consistency (RGB-D Diffusion, 3D Information Conditioning, and Warp Consistent Guidance), AutoScape achieves superior performance, generating realistic and geometrically consistent driving videos of up to 20 seconds at 12 FPS. Specifically, it improves the FID and FVD scores over the prior state-of-the-art by 48.6% and 43.0%, respectively, setting a new benchmark for long-horizon video generation in driving scenes.
Poster
Chenjian Gao · Lihe Ding · Rui Han · Zhanpeng Huang · Zibin Wang · Tianfan Xue

[ Exhibit Hall I ]

Abstract
Inserting 3D objects into videos is a longstanding challenge in computer graphics with applications in augmented reality, virtual try-on, and video composition. Achieving both temporal consistency and realistic lighting remains difficult, particularly in dynamic scenarios with complex object motion, perspective changes, and varying illumination. While 2D diffusion models have shown promise for producing photorealistic edits, they often struggle with maintaining temporal coherence across frames. Conversely, traditional 3D rendering methods excel in spatial and temporal consistency but fall short in achieving photorealistic lighting. In this work, we propose a hybrid object insertion pipeline that combines the strengths of both paradigms. Specifically, we focus on inserting bracelets into dynamic wrist scenes, leveraging the high temporal consistency of 3D Gaussian Splatting (3DGS) for initial rendering and refining the results using a 2D diffusion-based enhancement model to ensure realistic lighting interactions. Our method introduces a shading-driven pipeline that separates intrinsic object properties (albedo, shading, reflectance) and refines both shading and sRGB images for photorealism. To maintain temporal coherence, we optimize the 3DGS model with multi-frame weighted adjustments. This is the first approach to synergize 3D rendering and 2D diffusion for video object insertion, offering a robust solution for realistic and consistent video editing.
Poster
Ruida Zhang · Chengxi Li · Chenyangguang Zhang · Xingyu Liu · Haili Yuan · Yanyan Li · Xiangyang Ji · Gim Hee Lee

[ Exhibit Hall I ]

Abstract
Realistic scene reconstruction in driving scenarios poses significant challenges due to fast-moving objects. Most existing methods rely on labor-intensive manual labeling of object poses to reconstruct dynamic objects in canonical space and move them based on these poses during rendering. While some approaches attempt to use 3D object trackers to replace manual annotations, the limited generalization of 3D trackers - caused by the scarcity of large-scale 3D datasets - results in inferior reconstructions in real-world settings. In contrast, 2D foundation models demonstrate strong generalization capabilities. To eliminate the reliance on 3D trackers and enhance robustness across diverse environments, we propose a stable object tracking module by leveraging associations from 2D deep trackers within a 3D object fusion strategy. We address inevitable tracking errors by further introducing a motion learning strategy in an implicit feature space that autonomously corrects trajectory errors and recovers missed detections. Experimental results on Waymo-NOTR and KITTI show that our method outperforms existing approaches. Our code will be made publicly available.
Poster
Hanyang Kong · Xingyi Yang · Xinchao Wang

[ Exhibit Hall I ]

Abstract
3D Gaussian Splatting (3DGS) has recently emerged as an efficient representation for high-quality 3D reconstruction and rendering. Despite its superior rendering quality and speed, 3DGS heavily relies on the assumption of geometric consistency among input images. In real-world scenarios, violations of this assumption—such as occlusions, dynamic objects, or camera blur—often lead to reconstruction artifacts and rendering inaccuracies. To address these challenges, we introduce RogSplat, a robust framework that leverages generative models to enhance the reliability of 3DGS. Specifically, RogSplat identifies and rectifies occluded regions during the optimization of unstructured scenes. Outlier regions are first detected using our proposed fused features and then accurately inpainted by the proposed RF-Refiner, ensuring reliable reconstruction of occluded areas while preserving the integrity of visible regions. Extensive experiments demonstrate that RogSplat achieves state-of-the-art reconstruction quality on the RobustNeRF and NeRF-on-the-go datasets, significantly outperforming existing methods in challenging real-world scenarios involving dynamic objects.
Poster
Yida Wang · Xueyang Zhang · Kun Zhan · Peng Jia · XianPeng Lang

[ Exhibit Hall I ]

Abstract
Neural surface reconstruction faces critical challenges in achieving geometrically accurate and visually coherent results under complex real-world conditions. We present a unified framework that simultaneously resolves multi-view radiance inconsistencies, enhances low-textured surface recovery, and preserves fine structural details through three fundamental innovations. First, our SDF-guided visibility factor $\mathbb{V}$ establishes continuous occlusion reasoning to eliminate reflection-induced ambiguities in multi-view supervision. Second, we introduce local geometry constraints via ray-aligned patch analysis $\mathbb{P}$, enforcing planarity in textureless regions while maintaining edge sensitivity through adaptive feature weighting. Third, we reformulate Eikonal regularization with rendering-prioritized relaxation, enabling detail preservation by conditioning geometric smoothness on local radiance variations. Unlike prior works that address these aspects in isolation, our method achieves synergistic optimization where multi-view consistency, surface regularity, and structural fidelity mutually reinforce without compromise. Extensive experiments across synthetic and real-world datasets demonstrate state-of-the-art performance, with quantitative improvements of 21.4\% in Chamfer distance over reflection-aware baselines and 2.32 dB PSNR gains against neural rendering counterparts. Qualitative results showcase unprecedented reconstruction quality for challenging cases including specular instruments, urban layouts with thin structures, and Lambertian surfaces with sub-millimeter details. Our code will be publicly released to facilitate research in unified neural surface recovery.
Poster
Sicong Du · Jiarun Liu · Qifeng Chen · Hao-Xiang Chen · Tai-Jiang Mu · Sheng Yang

[ Exhibit Hall I ]

Abstract
A single-pass driving clip frequently results in incomplete scanning of the road structure, making reconstructed scene expansion a critical requirement for sensor simulators to effectively regress driving actions. Although contemporary 3D Gaussian Splatting (3DGS) techniques achieve remarkable reconstruction quality, their direct extension through the integration of diffusion priors often introduces cumulative physical inconsistencies and compromises training efficiency. To address these limitations, we present RGE-GS, a novel expansive reconstruction framework that synergizes diffusion-based generation with reward-guided Gaussian integration. The RGE-GS framework incorporates two key innovations: First, we propose a reward network that learns to identify and prioritize consistently generated patterns prior to reconstruction phases, thereby enabling selective retention of diffusion outputs for spatial stability. Second, during the reconstruction process, we devise a differentiated training strategy that automatically adjusts the Gaussian optimization progress according to scene convergence metrics, achieving better convergence than baseline methods. Extensive evaluations on publicly available datasets demonstrate that RGE-GS achieves state-of-the-art performance in reconstruction quality.
Poster
Wenjing Bian · Axel Barroso-Laguna · Tommaso Cavallari · Victor Prisacariu · Eric Brachmann

[ Exhibit Hall I ]

Abstract
Scene coordinate regression (SCR) models have proven to be powerful implicit scene representations for 3D vision, enabling visual relocalization and structure-from-motion. SCR models are trained specifically for one scene. If training images imply insufficient multi-view constraints to recover the scene geometry, SCR models degenerate. We present a probabilistic reinterpretation of training SCR models, which allows us to infuse high-level reconstruction priors. We investigate multiple such priors, ranging from simple priors over the distribution of reconstructed depth values to learned priors over plausible scene coordinate configurations. For the latter, we train a 3D point cloud diffusion model on a large corpus of indoor scans. Our priors push predicted 3D scene points towards a more plausible geometry at each training step to increase their likelihood. On three indoor datasets our priors help learning better scene representations, resulting in more coherent scene point clouds, higher registration rates and better camera poses, with a positive effect on down-stream tasks such as novel view synthesis and camera relocalization.
Poster
Juncheng Mu · Chengwei Ren · Weixiang Zhang · Liang Pan · Xiao-Ping Zhang · Yue Gao

[ Exhibit Hall I ]

Abstract
Learning cross-modal correspondences is essential for image-to-point cloud (I2P) registration. Existing methods achieve this mostly by utilizing metric learning to enforce feature alignment across modalities, disregarding the inherent modality gap between image and point data. Consequently, this paradigm struggles to ensure accurate cross-modal correspondences. To this end, inspired by the cross-modal generation success of recent large diffusion models, we propose **Diff$^2$I2P**, a fully **Diff**erentiable **I2P** registration framework, leveraging a novel and effective **Diff**usion prior for bridging the modality gap. Specifically, we propose a Control-Side Score Distillation (CSD) technique to distill knowledge from a depth-conditioned diffusion model to directly optimize the predicted transformation. However, the gradients on the transformation fail to backpropagate onto the cross-modal features due to the non-differentiability of correspondence retrieval and the PnP solver. To address this, we further propose a Deformable Correspondence Tuning (DCT) module to estimate the correspondences in a differentiable way, followed by the transformation estimation using a differentiable PnP solver. With these two designs, the Diffusion model serves as a strong prior to guide the cross-modal feature learning of image and point cloud for forming robust correspondences, which significantly improves the registration. Extensive experimental results demonstrate that **Diff$^2$I2P** consistently outperforms state-of-the-art I2P registration methods, achieving …
Poster
Conghao Wong · Ziqian Zou · Beihao Xia

[ Exhibit Hall I ]

Abstract
Learning to forecast trajectories of intelligent agents has attracted increasing attention recently. However, it remains challenging to accurately account for agents' intentions and social behaviors when forecasting and, in particular, to simulate the unique randomness within each of those components in an explainable and decoupled way. Inspired by vibration systems and their resonance properties, we propose the Resonance (short for Re) model to encode and forecast pedestrian trajectories in the form of ``co-vibrations''. It decomposes trajectory modifications and randomness into multiple vibration portions to simulate agents' reactions to each individual cause, and forecasts trajectories as the superposition of these independent vibrations. Also, benefiting from such vibrations and their spectral properties, representations of social interactions can be learned by emulating resonance phenomena, further enhancing explainability. Experiments on multiple datasets have verified its usefulness both quantitatively and qualitatively.
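A minimal sketch of the superposition idea, assuming each vibration portion is a sinusoidal correction attributed to a single cause; the actual model learns these portions and their spectra rather than fixing them by hand.

```python
import numpy as np

def vibration_portion(t: np.ndarray, amplitude: np.ndarray, freq: float, phase: float) -> np.ndarray:
    """One co-vibration term: a sinusoidal 2D correction attributed to a single cause."""
    return np.sin(2 * np.pi * freq * t + phase)[:, None] * amplitude[None, :]

def forecast(base: np.ndarray, portions: list) -> np.ndarray:
    """Forecast = coarse base trajectory (T, 2) plus the superposition of independent portions."""
    return base + np.sum(np.stack(portions, axis=0), axis=0)

# Hypothetical usage: a straight-line intention plus one lateral reaction to a neighbour.
t = np.linspace(0.0, 4.0, 20)                          # 20 future timestamps
base = np.stack([t, np.zeros_like(t)], axis=1)
social = vibration_portion(t, np.array([0.0, 0.3]), freq=0.5, phase=0.0)
pred = forecast(base, [social])
```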
Poster
Lin Zeng · Boming Zhao · Jiarui Hu · Xujie Shen · Ziqiang Dang · Hujun Bao · Zhaopeng Cui

[ Exhibit Hall I ]

Abstract
Novel view synthesis with neural models has advanced rapidly in recent years, yet adapting these models to scene changes remains an open problem. Existing methods are either labor-intensive, requiring extensive model retraining, or fail to capture detailed types of changes over time. In this paper, we present GaussianUpdate, a novel approach that combines 3D Gaussian representation with continual learning to address these challenges. Our method effectively updates the Gaussian radiance fields with current data while preserving information from past scenes. Unlike existing methods, GaussianUpdate explicitly models different types of changes through a novel multi-stage update strategy. Additionally, we introduce a visibility-aware continual learning approach with generative replay, enabling self-aware updating without the need to store images. Experiments on the benchmark dataset demonstrate that our method achieves superior, real-time rendering and can visualize changes across different times.
Poster
Zhimin Liao · Ping Wei · Ruijie Zhang · Shuaijia Chen · Haoxuan Wang · Ziyang Ren

[ Exhibit Hall I ]

Abstract
Forecasting the evolution of 3D scenes and generating unseen scenarios through occupancy-based world models offers substantial potential to enhance the safety of autonomous driving systems. While tokenization has revolutionized image and video generation, efficiently tokenizing complex 3D scenes remains a critical challenge for 3D world models. To address this, we propose $I^{2}$-World, an efficient framework for 4D occupancy forecasting. Our method decouples scene tokenization into intra-scene and inter-scene tokenizers. The intra-scene tokenizer employs a multi-scale residual quantization strategy to hierarchically compress 3D scenes while preserving spatial details. The inter-scene tokenizer residually aggregates temporal dependencies across timesteps. This dual design retains the compactness of 3D tokenizers while capturing the dynamic expressiveness of 4D approaches. Unlike decoder-only GPT-style autoregressive models, $I^{2}$-World adopts an encoder-decoder architecture. The encoder aggregates spatial context from the current scene and predicts a transformation matrix to guide future scene generation. The decoder, conditioned on this matrix and historical tokens, ensures temporal consistency during generation. Experiments demonstrate that $I^{2}$-World achieves state-of-the-art performance, surpassing existing approaches by $\textbf{41.8}$% in 4D occupancy forecasting with exceptional efficiency—requiring only $\textbf{2.9 GB}$ of training memory and achieving real-time inference at $\textbf{94.8 FPS}$.
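A minimal sketch of residual quantization with a stack of codebooks, ignoring the spatial multi-scale aspect of the intra-scene tokenizer; the codebook shapes and nearest-neighbour assignment are assumptions for illustration, not the paper's tokenizer.

```python
import torch

def residual_quantize(feat: torch.Tensor, codebooks: list) -> tuple:
    """Residual quantization: each level quantizes what the previous levels missed.

    feat: (N, C) scene features to tokenize.
    codebooks: list of (K_i, C) code embeddings, coarse to fine.
    Returns (per-level token indices, reconstructed features).
    """
    residual = feat
    recon = torch.zeros_like(feat)
    indices = []
    for codebook in codebooks:
        dists = torch.cdist(residual, codebook)    # (N, K_i) distances to every code
        idx = dists.argmin(dim=1)                  # nearest-code assignment
        quantized = codebook[idx]                  # (N, C)
        recon = recon + quantized                  # accumulate the reconstruction
        residual = residual - quantized            # pass the remainder to the next level
        indices.append(idx)
    return indices, recon
```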
Poster
Jungmin Lee · Seonghyuk Hong · Juyong Lee · Jaeyoon Lee · Jongwon Choi

[ Exhibit Hall I ]

Abstract
Multi-modal data fusion plays a crucial role in integrating diverse physical properties. While RGB images capture external visual features, they lack internal features, whereas X-ray images reveal internal structures but lack external details. To bridge this gap, we propose \textit{InsideOut}, a novel 3DGS framework that integrates RGB and X-ray data to represent the structure and appearance of objects. Our approach consists of three key components: internal structure training, hierarchical fitting, and detail-preserving refinement. First, RGB and radiative Gaussian splats are trained to capture surface structure. Then, hierarchical fitting ensures scale and positional synchronization between the two modalities. Next, cross-sectional images are incorporated to learn internal structures and refine layer boundaries. Finally, the aligned Gaussian splats receive color from RGB Gaussians, and fine Gaussians are duplicated to enhance surface details. Experiments conducted on a newly collected dataset of paired RGB and X-ray images demonstrate the effectiveness of \textit{InsideOut} in accurately representing internal and external structures.
Poster
chenghui Lu · Dilong Li · Jianlong Kwan · Ziyi Chen · Haiyan Guan

[ Exhibit Hall I ]

Abstract
Point cloud oversegmentation, as a fundamental preprocessing step in point cloud understanding, is a challenging task due to its joint requirements of spatial proximity and semantic similarity. Most existing works struggle to efficiently group semantically consistent points into superpoints while maintaining spatial proximity. In this paper, we propose a novel serialization-based point cloud oversegmentation method, which leverages serialization to avoid complex spatial queries, directly accessing neighboring points through sequence locality for similarity matching and superpoint clustering. Specifically, we first serialize point clouds along a Hilbert curve and partition them into multiple spatially continuous initial segments. Then, to guarantee the internal semantic consistency of superpoints, we design an adaptive update algorithm that clusters superpoints by matching feature similarities between neighboring segments and updates features via Cross-Attention. Experimental results show that the proposed method achieves state-of-the-art performance in point cloud oversegmentation across multiple large-scale indoor and outdoor datasets. Moreover, the proposed method can be flexibly adapted to the semantic segmentation task, and achieves promising performance.
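A minimal sketch of the serialization step, using a Z-order (Morton) curve as a simpler stand-in for the Hilbert curve; the voxel size and segment length are illustrative assumptions. The idea is the same: 1D sequence locality approximates 3D spatial proximity, so contiguous chunks of the serialized order give spatially continuous initial segments.

```python
import numpy as np

def morton_code(ix: int, iy: int, iz: int, bits: int = 10) -> int:
    """Interleave the bits of 3D grid indices (Z-order curve index)."""
    code = 0
    for b in range(bits):
        code |= ((ix >> b) & 1) << (3 * b)
        code |= ((iy >> b) & 1) << (3 * b + 1)
        code |= ((iz >> b) & 1) << (3 * b + 2)
    return code

def serialize_and_segment(points: np.ndarray, voxel: float = 0.05, seg_size: int = 64):
    """Order points along a space-filling curve and cut the sequence into initial segments."""
    grid = np.floor((points - points.min(axis=0)) / voxel).astype(np.int64)
    codes = np.array([morton_code(*g) for g in grid])
    order = np.argsort(codes)                                       # serialization
    segments = [order[i:i + seg_size] for i in range(0, len(order), seg_size)]
    return order, segments
```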
Poster
Yang LI · Jinglu Wang · Lei Chu · Xiao Li · Shiu-hong Kao · Ying-Cong Chen · Yan Lu

[ Exhibit Hall I ]

Abstract
The advent of 3D Gaussian Splatting (3DGS) has advanced 3D scene reconstruction and novel view synthesis. With the growing interest in interactive applications that need immediate feedback, online 3DGS reconstruction in real-time is in high demand. However, no existing method yet meets the demand, due to three main challenges: the absence of predetermined camera parameters, the need for generalizable 3DGS optimization, and the necessity of reducing redundancy. We propose StreamGS, an online generalizable 3DGS reconstruction method for unposed image streams, which progressively transforms image streams into 3D Gaussian streams by predicting and aggregating per-frame Gaussians. Our method overcomes the limitation of the initial point reconstruction (DUSt3R) in tackling out-of-domain (OOD) issues by introducing a content-adaptive refinement. The refinement enhances cross-frame consistency by establishing reliable pixel correspondences between adjacent frames. Such correspondences further aid in merging redundant Gaussians through cross-frame feature aggregation. The density of Gaussians is thereby reduced, empowering online reconstruction by significantly lowering computational and memory costs. Extensive experiments on diverse datasets have demonstrated that StreamGS achieves quality on par with optimization-based approaches but does so 150 times faster, and exhibits superior generalizability in handling OOD scenes.
Poster
Baojie Fan · Xiaotian Li · Yuhan Zhou · Yuyu Jiang · Jiandong Tian · Huijie Fan

[ Exhibit Hall I ]

Abstract
The multi-modal 3D semantic occupancy task provides a comprehensive understanding of the scene and has received considerable attention in the field of autonomous driving. However, existing methods mainly focus on processing large-scale voxels, which bring high computational costs and degrade details. Additionally, they struggle to accurately capture occluded targets and distant information. In this paper, we propose a novel LiDAR-Camera 3D semantic occupancy prediction framework called RIOcc, with collaborative feature refinement and multi-scale cross-modal fusion transformer. Specifically, RIOcc encodes multi-modal data into a unified Bird's Eye View (BEV) space, which reduces computational complexity and enhances the efficiency of feature alignment. Then, multi-scale feature processing substantially expands the receptive fields. Meanwhile, in the LiDAR branch, we design the Dual-branch Pooling (DBP) to adaptively enhance geometric features across both the Channel and Grid dimensions. In the camera branch, the Wavelet and Semantic Encoders are developed to extract high-level semantic features with abundant edge and structural information. Finally, to facilitate effective cross-modal complementarity, we develop the Deformable Dual-Attention (DDA) module. Extensive experiments demonstrate that RIOcc achieves state-of-the-art performance, with 54.2 mIoU and 25.9 mIoU on the Occ3D-nuScenes and nuScenes-Occupancy datasets, respectively.
Poster
Simon Niedermayr · Christoph Neuhauser · Rüdiger Westermann

[ Exhibit Hall I ]

Abstract
We introduce an image upscaling technique tailored for 3D Gaussian Splatting (3DGS) on lightweight GPUs. Compared to 3DGS, it achieves significantly higher rendering speeds and reduces artifacts commonly observed in 3DGS reconstructions. Our technique upscales low-resolution 3DGS renderings with a marginal increase in cost by directly leveraging the analytical image gradients of Gaussians for gradient-based bicubic spline interpolation. The technique is agnostic to the specific 3DGS implementation, achieving novel view synthesis at rates 3×–4× higher than the baseline implementation. Through extensive experiments on multiple datasets, we showcase the performance improvements and high reconstruction fidelity attainable with gradient-aware upscaling of 3DGS images. We further demonstrate the integration of gradient-aware upscaling into the gradient-based optimization of a 3DGS model and analyze its effects on reconstruction quality and performance.
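A minimal 1D sketch of gradient-based cubic Hermite interpolation, i.e. upscaling from sample values plus analytic derivatives; the paper applies a bicubic 2D analogue to 3DGS renderings, so this is only an illustration of the underlying spline idea.

```python
import numpy as np

def hermite_upscale_1d(values: np.ndarray, grads: np.ndarray, factor: int) -> np.ndarray:
    """Upscale a 1D signal with cubic Hermite interpolation using analytic derivatives.

    values: (N,) low-resolution samples (e.g. one row of a rendered image).
    grads:  (N,) analytic derivatives of the signal at those samples (spacing = 1).
    factor: integer upscaling factor.
    """
    out = []
    for i in range(len(values) - 1):
        p0, p1 = values[i], values[i + 1]
        m0, m1 = grads[i], grads[i + 1]
        for k in range(factor):
            t = k / factor
            # Cubic Hermite basis functions
            h00 = 2 * t**3 - 3 * t**2 + 1
            h10 = t**3 - 2 * t**2 + t
            h01 = -2 * t**3 + 3 * t**2
            h11 = t**3 - t**2
            out.append(h00 * p0 + h10 * m0 + h01 * p1 + h11 * m1)
    out.append(values[-1])
    return np.array(out)
```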
Poster
Changsong Lei · Yaqian Liang · Shaofeng Wang · Jiajia Dai · Yong-Jin Liu

[ Exhibit Hall I ]

Abstract
Digital orthodontics represents a prominent and critical application of computer vision technology in the medical field. So far, the labor-intensive process of collecting clinical data, particularly acquiring paired 3D orthodontic teeth models, has constituted a crucial bottleneck for developing tooth arrangement neural networks. Although numerous general 3D shape generation models have been proposed, most of them focus on single-object generation and are insufficient for generating anatomically structured teeth models, each comprising 24-32 segmented teeth. In this paper, we propose TeethGenerator, a novel two-stage framework designed to synthesize paired 3D teeth models pre- and post-orthodontic, aiming to facilitate the training of downstream tooth arrangement networks. Specifically, our approach consists of two key modules: (1) a teeth shape generation module that leverages a diffusion model to learn the distribution of morphological characteristics of teeth, enabling the generation of diverse post-orthodontic teeth models; and (2) a teeth style generation module that synthesizes corresponding pre-orthodontic teeth models by incorporating desired styles as conditional inputs. Extensive qualitative and quantitative experiments demonstrate that our synthetic dataset aligns closely with the distribution of real orthodontic data, and significantly improves tooth alignment performance when combined with real data for training. The code and dataset will be made …
Poster
Saimouli Katragadda · Cho-Ying Wu · Yuliang Guo · Xinyu Huang · Guoquan Huang · Liu Ren

[ Exhibit Hall I ]

Abstract
To enable AI agents to interact seamlessly with both humans and 3D environments, they must not only perceive the 3D world accurately but also align human language with 3D spatial representations. While prior work has made significant progress by integrating language features into geometrically detailed 3D scene representations using 3D Gaussian Splatting (GS), these approaches rely on computationally intensive offline preprocessing of language features for each input image, limiting adaptability to new environments. In this work, we introduce Online Language Splatting, the first framework to achieve online, near real-time, open-vocabulary language mapping within a 3DGS-SLAM system without requiring pre-generated language features. The key challenge lies in efficiently fusing high-dimensional language features into 3D representations while balancing the computation speed, memory usage, rendering quality and open-vocabulary capability. To this end, we innovatively design: (1) a high-resolution CLIP embedding module capable of generating detailed language feature maps in 18 ms per frame, (2) a two-stage online auto-encoder that compresses 768-dimensional CLIP features to 15 dimensions while preserving open-vocabulary capabilities, and (3) a color-language disentangled optimization approach to improve rendering quality. Experimental results show that our online method not only surpasses the state-of-the-art offline methods in accuracy but also achieves more than $40\times$ efficiency boost, demonstrating …
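A minimal sketch of compressing 768-D CLIP features to a 15-D per-Gaussian code with a small autoencoder; the layer sizes, single-stage design, and cosine reconstruction loss are assumptions for illustration, not the paper's two-stage online auto-encoder.

```python
import torch
import torch.nn as nn

class LangFeatureAutoencoder(nn.Module):
    """Tiny autoencoder mapping per-pixel 768-D CLIP features to a compact 15-D code."""
    def __init__(self, in_dim: int = 768, code_dim: int = 15):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(), nn.Linear(128, code_dim))
        self.decoder = nn.Sequential(nn.Linear(code_dim, 128), nn.ReLU(), nn.Linear(128, in_dim))

    def forward(self, x: torch.Tensor) -> tuple:
        code = self.encoder(x)       # compact code that could be stored on the Gaussians
        recon = self.decoder(code)   # recovered high-dimensional language feature
        return code, recon

def recon_loss(recon: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Cosine reconstruction loss: open-vocabulary matching depends on feature direction."""
    return 1.0 - torch.cosine_similarity(recon, target, dim=-1).mean()
```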
Poster
Min Kim · Younho Jeon · Sungho Jo

[ Exhibit Hall I ]

Abstract
Wearable Inertial Measurement Units (IMUs) allow non-intrusive motion tracking, but limited sensor placements can introduce uncertainty in capturing detailed full-body movements. Existing methods mitigate this issue by selecting more physically plausible motion patterns but do not directly address inherent uncertainties in the data. We introduce the Probabilistic Inertial Poser (ProbIP), a novel probabilistic model that transforms sparse IMU data into human motion predictions without physical constraints. ProbIP utilizes RU-Mamba blocks to predict a matrix Fisher distribution over rotations, effectively estimating both rotation matrices and associated uncertainties. To refine the motion distribution across network layers, our Progressive Distribution Narrowing (PDN) technique enables stable learning over a diverse range of motions. Experimental results demonstrate that ProbIP achieves state-of-the-art performance on multiple public datasets with six IMU sensors and yields competitive outcomes even with fewer sensors. Our contributions include the development of ProbIP with RU-Mamba blocks for probabilistic motion estimation, applying PDN for uncertainty reduction, and evidence of superior results with six and reduced sensor configurations.
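A minimal sketch of how a predicted matrix Fisher parameter can be turned into a mode rotation and a concentration estimate via SVD; the variable names are illustrative, and the paper's RU-Mamba blocks are not reproduced here.

```python
import numpy as np

def matrix_fisher_mode(F: np.ndarray) -> tuple:
    """Mode rotation and concentration of a matrix Fisher distribution on SO(3).

    F: (3, 3) parameter matrix predicted by a network; p(R) is proportional to exp(tr(F^T R)).
    Returns the most likely rotation matrix and the proper singular values, whose
    magnitudes indicate how concentrated (i.e. certain) the distribution is.
    """
    U, S, Vt = np.linalg.svd(F)
    # Flip the last axis if needed so the projection is a proper rotation (det = +1).
    D = np.diag([1.0, 1.0, np.linalg.det(U @ Vt)])
    R_mode = U @ D @ Vt
    proper_singular_values = S * np.diag(D)
    return R_mode, proper_singular_values
```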
Poster
Linshen Liu · Boyan Su · Junyue Jiang · Guanlin Wu · Cong Guo · Ceyu Xu · Hao Frank Yang

[ Exhibit Hall I ]

Abstract
The paper introduces the Edge-based Mixture-of-Experts (MoE) Collaborative Computing (EMC2) system, the first multimodal MoE framework designed to address the conflicting requirements of low latency and high accuracy in diverse traffic scenarios for autonomous driving safety. EMC2’s key innovation is its scenario-aware computing architecture optimized for edge devices, which adaptively fuses LiDAR and image inputs by leveraging the complementary strengths of sparse 3D point clouds and dense 2D pixel grids. Specifically, an adaptive multimodal data bridge is designed to preprocess LiDAR and image data using customized multi-scale pooling. A scenario-adaptive dispatcher then routes these fused features to specialized experts based on object clarity and distance. Three collaborative expert models with complementary encoder-decoder architectures are designed and trained using a novel hierarchical multimodal loss and balanced sampling strategies. Then, in the inference stage, EMC2 incorporates hardware-software co-optimization, spanning CPU thread allocation, GPU memory management, and computational graph optimization, to collaboratively enable efficient deployment on edge computing devices. Extensive evaluations conducted on open-source datasets demonstrate EMC2's superior performance, achieving an average accuracy improvement of 3.58% and an impressive 159.06% inference speedup compared to 15 leading methods on Jetson platforms. Such enhancements clearly meet the real-time operational expectations for autonomous vehicles, directly …
Poster
Jianing Zhang · Jiayi Zhu · Feiyu Ji · Xiaokang Yang · Xiaoyun Yuan

[ Exhibit Hall I ]

Abstract
Metalenses offer significant potential for ultra-compact computational imaging but face challenges from complex optical degradation and computational restoration difficulties. Existing methods typically rely on precise optical calibration or massive paired datasets, which are non-trivial for real-world imaging systems. Furthermore, lack of control over the inference process often results in undesirable hallucinated artifacts. We introduce Degradation-Modeled Multipath Diffusion for tunable metalens photography, leveraging powerful natural image priors from pretrained models instead of large datasets. Our framework uses positive, neutral, and negative-prompt paths to balance high-frequency detail generation, structural fidelity, and suppression of metalens-specific degradation, alongside pseudo data augmentation. A tunable decoder enables controlled trade-offs between fidelity and perceptual quality. Additionally, a spatially varying degradation-aware attention (SVDA) module adaptively models complex optical and sensor-induced degradation. Finally, we design and build a millimeter-scale MetaCamera for real-world validation. Extensive results show that our approach outperforms state-of-the-art methods, achieving high-fidelity and sharp image reconstruction. More materials: https://dmdiff.github.io/.
Poster
Baijun Ye · Minghui Qin · Saining Zhang · Moonjun Gong · Shaoting Zhu · Hao Zhao · Hang Zhao

[ Exhibit Hall I ]

Abstract
Occupancy is crucial for autonomous driving, providing essential geometric priors for perception and planning. However, existing methods predominantly rely on LiDAR-based occupancy annotations, which limits scalability and prevents leveraging vast amounts of potential crowdsourced data for auto-labeling. To address this, we propose GS-Occ3D, a scalable vision-only framework that directly reconstructs occupancy. Vision-only occupancy reconstruction poses significant challenges due to sparse viewpoints, dynamic scene elements, severe occlusions, and long-horizon motion. Existing vision-based methods primarily rely on mesh representations, which suffer from incomplete geometry and additional post-processing, limiting scalability. To overcome these issues, GS-Occ3D optimizes an explicit occupancy representation using an Octree-based Gaussian Surfel formulation, ensuring efficiency and scalability. Additionally, we decompose scenes into static background, ground, and dynamic objects, enabling tailored modeling strategies: (1) Ground is explicitly reconstructed as a dominant structural element, significantly improving large-area consistency; (2) Dynamic vehicles are separately modeled to better capture motion-related occupancy patterns. Extensive experiments on the Waymo dataset demonstrate that GS-Occ3D achieves state-of-the-art geometry reconstruction results. We successfully curate vision-only binary occupancy ground truth across diverse urban scenes and validate its effectiveness for downstream occupancy models on the Occ3D-Waymo dataset. Our results highlight the potential of large-scale vision-based occupancy reconstruction as a new paradigm …
Poster
Wuyang Li · Wentao Pan · Xiaoyuan Liu · Zhendong Luo · Chenxin Li · Hengyu Liu · Din Tsai · Mu Chen · Yixuan Yuan

[ Exhibit Hall I ]

Abstract
Miniaturized endoscopy has advanced accurate visual perception within the human body. Prevailing research remains limited to conventional cameras employing convex lenses, whose millimetre-scale thickness imposes serious physical impediments on micro-level clinical applications. Recently, with the emergence of meta-optics, ultra-micro imaging based on metalenses (micron-scale) has garnered great attention, serving as a promising solution. However, due to the physical differences of metalenses, there is a large gap in data acquisition and algorithm research. In light of this, we aim to bridge this unexplored gap, advancing novel metalens endoscopy. First, we establish datasets for metalens endoscopy and conduct preliminary optical simulation, identifying two derived optical issues that physically adhere to strong optical priors. Second, we propose MetaScope, a novel optics-driven neural network tailored for metalens endoscopy. MetaScope comprises two novel designs: Optics-informed Intensity Adjustment (OIA), rectifying intensity decay by learning optical embeddings, and Optics-informed Chromatic Correction (OCC), mitigating chromatic aberration by learning spatial deformations informed by learned Point Spread Function (PSF) distributions. To enhance joint learning, we deploy a gradient-guided distillation to adaptively transfer knowledge from the foundational model. Extensive experiments demonstrate that our method surpasses state-of-the-art methods in metalens segmentation and restoration by …
Poster
Changxing Liu · Genjia Liu · Zijun Wang · Jinchang Yang · Siheng Chen

[ Exhibit Hall I ]

Abstract
Vehicle-to-vehicle (V2V) cooperative autonomous driving holds great promise for improving safety by addressing the perception and prediction uncertainties inherent in single-agent systems. However, traditional cooperative methods are constrained by rigid collaboration protocols and limited generalization to unseen interactive scenarios. While LLM-based approaches offer generalized reasoning capabilities, their challenges in spatial planning and unstable inference latency hinder their direct application in cooperative driving. To address these limitations, we propose CoLMDriver, the first full-pipeline LLM-based cooperative driving system, enabling effective language-based negotiation and real-time driving control. CoLMDriver features a parallel driving pipeline with two key components: (i) an LLM-based negotiation module under an actor-critic paradigm, which continuously refines cooperation policies through feedback from previous decisions of all vehicles; and (ii) an intention-guided waypoint generator, which translates negotiation outcomes into executable waypoints. Additionally, we introduce InterDrive, a CARLA-based simulation benchmark comprising 10 challenging interactive driving scenarios for evaluating V2V cooperation. Experimental results demonstrate that CoLMDriver significantly outperforms existing approaches, achieving an 11% higher success rate across diverse highly interactive V2V driving scenarios. The code will be released.
Poster
Aoxiang Fan · Corentin Dumery · Nicolas Talabot · Pascal Fua

[ Exhibit Hall I ]

Abstract
Neural Radiance Fields (NeRF) has emerged as a compelling framework for scene representation and 3D recovery. To improve its performance on real-world data, depth regularization has proven to be among the most effective techniques. However, depth estimation models not only require expensive 3D supervision during training but also suffer from generalization issues. As a result, depth estimates can be erroneous in practice, especially for outdoor unbounded scenes. In this paper, we propose to employ view-consistent distributions instead of fixed depth value estimates to regularize NeRF training. Specifically, the distribution is computed by utilizing both low-level color features and high-level distilled features from foundation models at the 2D pixel locations obtained by projecting per-ray sampled 3D points. By sampling from the view-consistent distributions, an implicit regularization is imposed on the training of NeRF. We also propose a novel depth-pushing loss that works in conjunction with the sampling technique to jointly provide effective regularization that eliminates failure modes. Extensive experiments conducted on various scenes from public datasets demonstrate that our proposed method generates significantly better novel view synthesis results than state-of-the-art NeRF variants as well as different depth regularization methods.
Poster
Ruangrawee Kitichotkul · Shashwath Bharadwaj · Joshua Rapp · Yanting Ma · Alexander Mehta · Vivek Goyal

[ Exhibit Hall I ]

Abstract
Conventional wisdom suggests that single-photon lidar (SPL) should operate in low-light conditions to minimize dead-time effects. Many methods have been developed to mitigate these effects in synchronous SPL systems. However, solutions for free-running SPL remain limited despite the advantage of reduced histogram distortion from dead times. To improve the accuracy of free-running SPL, we propose a computationally efficient joint maximum likelihood estimator of the signal flux, the background flux, and the depth, along with a complementary regularization framework that incorporates a learned point cloud score model as a prior. Simulations and experiments demonstrate that free-running SPL yields lower estimation errors than its synchronous counterpart under identical conditions, with our regularization further improving accuracy.
Poster
Shuofeng Sun · Haibin Yan

[ Exhibit Hall I ]

Abstract
Farthest Point Sampling (FPS) is widely used in existing point-based models because it effectively preserves structural integrity during downsampling. However, it incurs significant computational overhead, severely impacting the model's inference efficiency. Random sampling and grid sampling are considered \textbf{faster downsampling methods}; however, these fast downsampling methods may lose geometric information during downsampling due to their overly simplistic and fixed rules, which can negatively affect model performance. To address this issue, we propose FastAdapter, which aggregates local contextual information through a small number of anchor points and facilitates interactions across spatial and layer dimensions, ultimately feeding this information back into the downsampled point cloud to mitigate the information degradation caused by fast downsampling methods. In addition to using FastAdapter to enhance model performance in methods that already employ fast downsampling, we aim to explore a more challenging yet valuable application scenario. Specifically, we focus on pre-trained models that utilize FPS, embedding FastAdapter and replacing FPS with random sampling for lightweight fine-tuning. This approach aims to significantly improve inference speed while keeping performance relatively unchanged. Experimental results on ScanNet, S3DIS, and SemanticKITTI demonstrate that our method effectively mitigates the geometric information degradation issues caused by fast …
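As a rough point of reference for the FPS-versus-random-sampling trade-off discussed above, the following minimal sketch contrasts the two samplers on a synthetic point cloud. The function names are illustrative and not taken from the paper; FPS is O(N·m) and geometry-aware, while random sampling is cheap but geometry-agnostic.

```python
import numpy as np

def farthest_point_sampling(points, m):
    """Iteratively pick the point farthest from the already-selected set (O(N*m))."""
    n = points.shape[0]
    selected = np.zeros(m, dtype=int)
    dist = np.full(n, np.inf)
    selected[0] = np.random.randint(n)
    for i in range(1, m):
        dist = np.minimum(dist, np.linalg.norm(points - points[selected[i - 1]], axis=1))
        selected[i] = int(np.argmax(dist))
    return points[selected]

def random_sampling(points, m):
    """Uniformly subsample m points; fast but may miss thin structures."""
    idx = np.random.choice(points.shape[0], m, replace=False)
    return points[idx]

pts = np.random.rand(8192, 3)
fps_pts = farthest_point_sampling(pts, 512)   # structure-preserving, slower
rnd_pts = random_sampling(pts, 512)           # cheap, geometry-agnostic
print(fps_pts.shape, rnd_pts.shape)
```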
Poster
Xiangbin Wei · Yuanfeng Wang · Ao XU · Lingyu Zhu · Dongyong Sun · Keren Li · Yang Li · Qi Qin

[ Exhibit Hall I ]

Abstract
Building on recent advances in Bayesian statistics and image denoising, we propose Noise2Score3D, a fully unsupervised framework for point cloud denoising. Noise2Score3D learns the score function of the underlying point cloud distribution directly from noisy data, eliminating the need for clean data during training. Using Tweedie's formula, our method performs denoising in a single step, avoiding the iterative processes used in existing unsupervised methods and thus improving both accuracy and efficiency. Additionally, we introduce Total Variation for Point Clouds as a denoising quality metric, which allows for the estimation of unknown noise parameters. Experimental results demonstrate that Noise2Score3D achieves state-of-the-art performance among unsupervised learning methods on standard benchmarks in terms of Chamfer distance and point-to-mesh metrics. Noise2Score3D also demonstrates strong generalization ability beyond its training datasets. By addressing the generalization issue and the absence of clean training data in learning-based methods, our approach paves the way for learning-based point cloud denoising in real-world applications.
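To make the single-step mechanism concrete: for Gaussian noise, Tweedie's formula states E[x | y] = y + σ²∇_y log p(y), so one evaluation of the score of the noisy distribution suffices. The toy sketch below uses an analytic score for a Gaussian prior purely to verify the identity; in the unsupervised setting of the paper a learned network would replace `score_noisy`.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, tau, sigma = 0.0, 1.0, 0.3          # prior mean/std of clean points, noise std

# Toy "clean" point cloud drawn from a Gaussian prior, then corrupted with noise.
x = rng.normal(mu, tau, size=(2048, 3))
y = x + rng.normal(0.0, sigma, size=x.shape)

def score_noisy(y):
    # Analytic score of the noisy marginal p(y) = N(mu, (tau^2 + sigma^2) I).
    # A learned score network would replace this in practice.
    return -(y - mu) / (tau**2 + sigma**2)

# Tweedie's formula: E[x | y] = y + sigma^2 * grad_y log p(y)  (single denoising step).
x_hat = y + sigma**2 * score_noisy(y)

print("noisy MSE   :", np.mean((y - x) ** 2))
print("denoised MSE:", np.mean((x_hat - x) ** 2))
```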
Poster
Jun Yin · Pengyu Zeng · Licheng Shen · Miao Zhang · Jing Zhong · Yuxing Han · Shuai Lu

[ Exhibit Hall I ]

Abstract
Image-based 3D reconstruction has made significant progress in typical scenarios, achieving high fidelity in capturing intricate textures. However, in the Architecture, Engineering, and Construction (AEC) design stages, existing technologies still face considerable challenges, particularly in handling specific window-to-wall ratios, ensuring window detail consistency, and enabling interactive editing. To address this research gap and encourage greater community attention on this practical architectural design problem, we propose a new task: Editable and Consistent Single-View 3D Reconstruction of Buildings with Specific Window-to-Wall Ratios. To accomplish this: 1) We introduce the ArchiSet dataset, the first public, real-world architectural design dataset, including 13,728 3D building forms in point cloud, voxel, and mesh formats together with window-to-wall ratio information, providing comprehensive support for 3D architectural design research. The dataset also contains over 1,482,624 images of three types (sketches, color block diagrams, and renderings), accompanied by paired window masks for detailed evaluation. 2) We evaluate state-of-the-art single-view 3D reconstruction algorithms on ArchiSet, identifying several limitations, such as the loss of volumetric detail, incomplete window details, and limited editability. 3) We introduce BuildingMesh, a diffusion model specifically designed for generating and editing 3D architectural forms from a single image with customizable window-to-wall ratios, suitable for dynamic architectural design workflows. We …
Poster
Radu Beche · Sergiu Nedevschi

[ Exhibit Hall I ]

Abstract
The development of aerial holistic scene understanding algorithms is hindered by the scarcity of comprehensive datasets that enable both semantic and geometric reconstruction. While synthetic datasets offer an alternative, existing options exhibit task-specific limitations, unrealistic scene compositions, and rendering artifacts that compromise real-world applicability. We introduce ClaraVid, a synthetic aerial dataset specifically designed to overcome these limitations. Comprising 16,917 high-resolution images captured at 4032×3024 from multiple viewpoints across diverse landscapes, ClaraVid provides dense depth maps, panoptic segmentation, sparse point clouds, and dynamic object masks, while mitigating common rendering artifacts. To further advance neural reconstruction, we introduce the Delentropic Scene Profile (DSP), a novel complexity metric derived from differential entropy analysis, designed to quantitatively assess scene difficulty and inform reconstruction tasks. Utilizing DSP, we systematically benchmark neural reconstruction methods, uncovering a consistent, measurable correlation between scene complexity and reconstruction accuracy. Empirical results indicate that higher delentropy strongly correlates with increased reconstruction errors, validating DSP as a reliable complexity prior. ClaraVid will be publicly released to support UAV research.
Poster
Francesco Milano · Manuel Lopez-Antequera · Naina Dhingra · Roland Siegwart · Robert Thiel

[ Exhibit Hall I ]

Abstract
Recovering a 3D surface from its surface normal map, a problem known as normal integration, is a key component for photometric shape reconstruction techniques such as shape-from-shading and photometric stereo. The vast majority of existing approaches for normal integration handle only implicitly the presence of depth discontinuities and are limited to orthographic or ideal pinhole cameras. In this paper, we propose a novel formulation that allows modeling discontinuities explicitly and handling generic central cameras. Our key idea is based on a local planarity assumption, that we model through constraints between surface normals and ray directions. Compared to existing methods, our approach more accurately approximates the relation between depth and surface normals, achieves state-of-the-art results on the standard normal integration benchmark, and is the first to directly handle generic central camera models.
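For context on what "normal integration" solves, here is a minimal least-squares sketch of the classic orthographic formulation (depth gradients from normals, then a linear solve), which is the baseline setting the paper generalizes to discontinuities and generic central cameras. The function name and the dense solver are illustrative choices for a tiny grid, not the paper's method.

```python
import numpy as np

def integrate_normals_orthographic(normals):
    """Least-squares surface-from-normals under an orthographic camera.
    normals: (H, W, 3) unit normals with positive n_z; returns depth z up to a constant."""
    H, W, _ = normals.shape
    p = -normals[..., 0] / normals[..., 2]      # dz/dx
    q = -normals[..., 1] / normals[..., 2]      # dz/dy
    n = H * W
    rows, cols, vals, rhs = [], [], [], []
    def idx(y, x): return y * W + x
    r = 0
    for y in range(H):
        for x in range(W - 1):                  # forward difference in x
            rows += [r, r]; cols += [idx(y, x + 1), idx(y, x)]; vals += [1.0, -1.0]
            rhs.append(p[y, x]); r += 1
    for y in range(H - 1):
        for x in range(W):                      # forward difference in y
            rows += [r, r]; cols += [idx(y + 1, x), idx(y, x)]; vals += [1.0, -1.0]
            rhs.append(q[y, x]); r += 1
    A = np.zeros((r, n)); A[rows, cols] = vals
    z, *_ = np.linalg.lstsq(A, np.array(rhs), rcond=None)
    return z.reshape(H, W)

# Sanity check on a synthetic slanted plane z = 0.3x + 0.2y.
H, W = 12, 12
yy, xx = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
gx, gy = 0.3, 0.2
nrm = np.dstack([-gx * np.ones((H, W)), -gy * np.ones((H, W)), np.ones((H, W))])
nrm /= np.linalg.norm(nrm, axis=2, keepdims=True)
z = integrate_normals_orthographic(nrm)
gt = gx * xx + gy * yy
print(np.allclose(z - z.mean(), gt - gt.mean(), atol=1e-6))
```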
Poster
Chong Cheng · Sicheng Yu · Zijian Wang · Yifan Zhou · Hao Wang

[ Exhibit Hall I ]

Abstract
3D Gaussian Splatting (3DGS) has become a popular solution in SLAM due to its high-fidelity and real-time novel view synthesis performance. However, some previous 3DGS SLAM methods employ a differentiable rendering pipeline for tracking but lack geometric priors in outdoor scenes. Other approaches introduce separate tracking modules, but they accumulate errors under significant camera movement, leading to scale drift. To address these challenges, we propose a robust RGB-only outdoor 3DGS SLAM method: S3PO-GS. Technically, we establish a self-consistent tracking module anchored in the 3DGS pointmap, which avoids cumulative scale drift and achieves more precise and robust tracking with fewer iterations. Additionally, we design a patch-based pointmap dynamic mapping module, which introduces geometric priors while avoiding scale ambiguity. This significantly enhances tracking accuracy and the quality of scene reconstruction, making it particularly suitable for complex outdoor environments. Our experiments on the Waymo, KITTI, and DL3DV datasets demonstrate that S3PO-GS achieves state-of-the-art results in novel view synthesis and outperforms other 3DGS SLAM methods in tracking accuracy.
Poster
Yiyu Li · Haoyuan Wang · Ke Xu · Gerhard Hancke · Rynson W.H. Lau

[ Exhibit Hall I ]

Abstract
This paper presents SeHDR, a novel high dynamic range 3D Gaussian Splatting (HDR-3DGS) approach for generating HDR novel views given multi-view LDR images. Unlike existing methods that typically require the multi-view LDR input images to be captured from different exposures, which are tedious to capture and more likely to suffer from errors (e.g., object motion blurs and calibration/alignment inaccuracies), our approach learns the HDR scene representation from multi-view LDR images of a single exposure. Our key insight to this ill-posed problem is that by first estimating **Bracketed 3D Gaussians** (i.e., with different exposures) from single-exposure multi-view LDR images, we may then be able to merge these bracketed 3D Gaussians into an HDR scene representation. Specifically, SeHDR first learns base 3D Gaussians from single-exposure LDR inputs, where the spherical harmonics parameterize colors in a linear color space. We then estimate multiple 3D Gaussians with identical geometry but varying linear colors conditioned on exposure manipulations. Finally, we propose the Differentiable Neural Exposure Fusion (NeEF) to integrate the base and estimated 3D Gaussians into HDR Gaussians for novel view rendering. Extensive experiments demonstrate that SeHDR outperforms existing methods as well as carefully designed baselines.
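To illustrate why "bracketed" estimates are useful, the sketch below shows a conventional weighted merge of a bracketed LDR stack into linear radiance (a classic exposure-fusion analogue in image space). It is only an intuition pump under the assumption of a linear camera response; SeHDR's Differentiable Neural Exposure Fusion operates on 3D Gaussians, not pixels.

```python
import numpy as np

def merge_bracketed_ldr(ldr_stack, exposure_times):
    """Classic weighted HDR merge of a bracketed LDR stack (linear response assumed).
    ldr_stack: (K, H, W, 3) values in [0, 1]; exposure_times: (K,) seconds."""
    eps = 1e-6
    weights = 1.0 - np.abs(2.0 * ldr_stack - 1.0)        # hat weight: trust mid-tones
    radiance = ldr_stack / exposure_times[:, None, None, None]
    return (weights * radiance).sum(axis=0) / (weights.sum(axis=0) + eps)

# Toy example: three synthetic exposures of the same linear radiance map.
rng = np.random.default_rng(1)
scene = rng.uniform(0.0, 4.0, size=(64, 64, 3))          # ground-truth radiance
times = np.array([0.1, 0.4, 1.6])
ldr = np.clip(scene[None] * times[:, None, None, None], 0.0, 1.0)
hdr = merge_bracketed_ldr(ldr, times)
print("mean relative error:", np.mean(np.abs(hdr - scene) / (scene + 1e-3)))
```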
Poster
Wang Liu · Wei Gao

[ Exhibit Hall I ]

Abstract
Information quantization has been widely adopted in multimedia content, such as images, videos, and point clouds. The goal of information quantization is to achieve efficient storage and transmission by reducing data precision or redundancy. However, the information distortion caused by quantization will lead to the degradation of signal fidelity and the performance of downstream tasks. This paper focuses on the geometry quantization distortion of point clouds and proposes a unified learning-based quality enhancement framework for omni-scene point clouds. Based on the characteristics of geometry quantization distortion, we analyze and find that existing upsampling methods are not competitive in dealing with point reduction and geometry displacement caused by coordinate quantization. Therefore, we design a general rooting-growing-pruning paradigm to efficiently perceive the geometry feature of quantized point clouds and improve the quality significantly. In addition, a novel loss constraint term related to the quantization step parameter is proposed to further improve quality and accelerate model convergence. To the best of our knowledge, this is the first unified quality enhancement framework for object and scene point clouds with coordinate quantization. Extensive experiments verify the superiority of the proposed method on multi-scale point clouds with different levels of quantization distortion, including object (ModelNet40, 8iVFB) …
Poster
Reza Rezaeian · Moein Heidari · Reza Azad · Dorit Merhof · Hamid Soltanian-Zadeh · Ilker Hacihaliloglu

[ Exhibit Hall I ]

Abstract
Implicit Neural Representation (INR), leveraging a neural network to transform coordinate input into corresponding attributes, has recently driven significant advances in several vision-related domains. However, the performance of INR is heavily influenced by the choice of the nonlinear activation function used in its multilayer perceptron (MLP) architecture. To date, multiple nonlinearities have been investigated, but current INRs still face limitations in capturing high-frequency components and diverse signal types. We show that these challenges can be alleviated by introducing a novel approach in INR architecture. Specifically, we propose SL$^{2}$A-INR, a hybrid network that combines a single-layer learnable activation function with an MLP that uses traditional ReLU activations. Our method performs superior across diverse tasks, including image representation, 3D shape reconstruction, and novel view synthesis. Through comprehensive experiments, SL$^{2}$A-INR sets new benchmarks in accuracy, quality, and robustness for INR.
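The hybrid design described above (one layer with a learnable activation feeding a plain ReLU MLP) can be sketched as follows. The sinusoidal parameterization of the learnable activation is an assumption for illustration only; the abstract does not specify how SL$^{2}$A-INR parameterizes it.

```python
import torch
import torch.nn as nn

class LearnableActivation(nn.Module):
    """Per-channel activation built from learnable sinusoidal coefficients.
    One plausible instantiation of a 'learnable activation'; not the paper's exact design."""
    def __init__(self, channels, num_freqs=8):
        super().__init__()
        self.freqs = nn.Parameter(torch.linspace(1.0, float(num_freqs), num_freqs))
        self.coeffs = nn.Parameter(torch.randn(channels, num_freqs) * 0.1)

    def forward(self, x):                          # x: (..., channels)
        basis = torch.sin(x.unsqueeze(-1) * self.freqs)   # (..., C, F)
        return (basis * self.coeffs).sum(-1)

class HybridINR(nn.Module):
    """First layer with a learnable activation, followed by a plain ReLU MLP."""
    def __init__(self, in_dim=2, hidden=256, out_dim=3, depth=3):
        super().__init__()
        self.first = nn.Linear(in_dim, hidden)
        self.act = LearnableActivation(hidden)
        layers = []
        for _ in range(depth):
            layers += [nn.Linear(hidden, hidden), nn.ReLU()]
        layers += [nn.Linear(hidden, out_dim)]
        self.mlp = nn.Sequential(*layers)

    def forward(self, coords):                     # coords in [-1, 1]^in_dim
        return self.mlp(self.act(self.first(coords)))

model = HybridINR()
coords = torch.rand(1024, 2) * 2 - 1
print(model(coords).shape)                         # torch.Size([1024, 3])
```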
Poster
Jialong Wu · Marco Braun · Dominic Spata · Matthias Rottmann

[ Exhibit Hall I ]

Abstract
Scene flow provides crucial motion information for autonomous driving. Recent LiDAR scene flow models utilize the rigid-motion assumption at the instance level, assuming objects are rigid bodies. However, these instance-level methods are not suitable for sparse radar point clouds. In this work, we present a novel **T**raffic-**A**ware **R**adar **S**cene flow estimation method, named TARS, which utilizes the motion rigidity at the traffic level. To address the challenges in radar scene flow, we perform object detection and scene flow jointly and boost the latter. We incorporate the feature map from the object detector, trained with detection losses, to make radar scene flow aware of the environment and road users. Therefrom, we construct a Traffic Vector Field (TVF) in the feature space, enabling a holistic traffic-level scene understanding in our scene flow branch. When estimating the scene flow, we consider both point-level motion cues from point neighbors and traffic-level consistency of rigid motion within the space. TARS outperforms the state of the art on a proprietary dataset and the View-of-Delft dataset, improving the benchmarks by 23% and 15%, respectively.
Poster
Yuval Haitman · Oded Bialer

[ Exhibit Hall I ]

Abstract
Radar-based object detection is essential for autonomous driving due to radar's long detection range. However, the sparsity of radar point clouds, especially at long range, poses challenges for accurate detection. Existing methods increase point density through temporal aggregation with ego-motion compensation, but this approach introduces scatter from dynamic objects, degrading detection performance. We propose DoppDrive, a novel Doppler-Driven temporal aggregation method that enhances radar point cloud density while minimizing scatter. Points from previous frames are shifted radially according to their dynamic Doppler component to eliminate radial scatter, with each point assigned a unique aggregation duration based on its Doppler and angle to minimize tangential scatter. DoppDrive is a point cloud density enhancement step applied before detection, compatible with any detector, and we demonstrate that it significantly improves object detection performance across various detectors and datasets.
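The core shift operation can be pictured with a short sketch: previous-frame radar points are moved along their radial direction by the measured Doppler velocity times the time gap, which removes radial scatter from dynamic objects (ego-motion compensation is assumed to have been applied already, and tangential scatter remains, which is what the per-point aggregation duration in the paper addresses). Names here are illustrative.

```python
import numpy as np

def doppler_shift_points(points, radial_vel, dt):
    """Shift previous-frame radar points along their radial direction by their
    dynamic Doppler component before temporal aggregation.
    points: (N, 3) in the ego frame; radial_vel: (N,) m/s (positive = receding)."""
    ranges = np.linalg.norm(points, axis=1, keepdims=True)
    radial_dir = points / np.maximum(ranges, 1e-6)
    return points + radial_dir * radial_vel[:, None] * dt

# Toy example: a point 40 m ahead receding at 10 m/s, aggregated 0.2 s later.
prev_points = np.array([[40.0, 0.0, 0.0], [15.0, 5.0, 0.0]])
doppler = np.array([10.0, 0.0])      # second point is static
print(doppler_shift_points(prev_points, doppler, dt=0.2))
```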
Poster
Wenbin Teng · Gonglin Chen · Haiwei Chen · Yajie Zhao

[ Exhibit Hall I ]

Abstract
Recent progress in 3D reconstruction has enabled realistic 3D models from dense image captures, yet challenges persist with sparse views, often leading to artifacts in unseen areas. Recent works leverage Video Diffusion Models (VDMs) to generate dense observations, filling the gaps when only sparse views are available for 3D reconstruction tasks. A significant limitation of these methods is their slow sampling speed when using VDMs. In this paper, we present FVGen, a novel framework that addresses this challenge by enabling fast novel view synthesis using VDMs in as few as 4 sampling steps. We propose a novel video diffusion model distillation method that distills a multi-step denoising teacher model into a few-step denoising student model using Generative Adversarial Networks (GANs) and softened reverse KL-divergence minimization. Extensive experiments on real-world datasets show that, compared to prior works, our framework generates the same number of novel views with similar (or even better) visual quality while reducing sampling time by more than 90\%. FVGen significantly improves time efficiency for downstream reconstruction tasks, particularly when working with sparse input views (more than 2) where pre-trained VDMs need to be run multiple times to achieve better spatial coverage. Our code will be released upon acceptance …
Poster
Fengchen He · Dayang Zhao · Hao Xu · Tingwei Quan · Shaoqun zeng

[ Exhibit Hall I ]

Abstract
Many studies utilize dual-pixel (DP) sensor phase characteristics for various applications, such as depth estimation and deblurring. However, since the DP image features are entirely determined by the camera hardware, DP-depth paired datasets are very scarce, especially when performing depth estimation on customized cameras. To overcome this, studies simulate DP images using ideal optical system models. However, these simulations often violate real optical propagation laws, leading to poor generalization to real DP data. To address this, we investigate the domain gap between simulated and real DP data, and propose solutions using the Simulating DP images from ray tracing (Sdirt) scheme. The Sdirt scheme generates realistic DP images via ray tracing and integrates them into the depth estimation training pipeline. Experimental results show that models trained with Sdirt-simulated images generalize better to real DP data. The code and simulated datasets will be available on GitHub.
Poster
Chamin Hewa Koneputugodage · Dylan Campbell · Stephen Gould

[ Exhibit Hall I ]

Abstract
Recent methods for point cloud surface normal estimation predominantly use the generalized winding number field induced by the normals. Optimizing the field towards satisfying desired properties, such as the input points being on the surface defined by the field, provides a principled way to obtain globally consistent surface normals. However, we show that the existing winding number formulation for point clouds is a poor approximation near the input surface points, diverging as the query point approaches a surface point. This is problematic for methods that rely on the accuracy and stability of this approximation, requiring heuristics to compensate. Instead, we derive a more accurate approximation that is properly bounded and converges to the correct value. We then examine two distinct approaches that optimize for globally consistent normals using point cloud winding numbers. We show how the original unbounded formulation influences key design choices in both methods and demonstrate that substituting our formulation yields substantive improvements with respect to normal estimation and surface reconstruction accuracy.
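For readers unfamiliar with the quantity being discussed, the sketch below evaluates the standard point-cloud generalized winding number, w(q) = (1/4π) Σᵢ aᵢ nᵢ·(pᵢ − q)/‖pᵢ − q‖³, on an oriented sphere sampling. Note the behavior at the near-surface query, which illustrates the divergence the paper identifies; the paper's corrected, bounded approximation is not reproduced here.

```python
import numpy as np

def winding_number(query, points, normals, areas):
    """Point-cloud generalized winding number at query locations.
    points/normals: (N, 3) oriented surface samples; areas: (N,) per-point surface area."""
    d = points[None, :, :] - query[:, None, :]            # (Q, N, 3)
    r = np.linalg.norm(d, axis=-1)                        # (Q, N)
    contrib = np.einsum("qnk,nk->qn", d, normals) / np.maximum(r, 1e-9) ** 3
    return (areas[None, :] * contrib).sum(axis=1) / (4.0 * np.pi)

# Oriented samples on a unit sphere (outward normals), equal area weights.
rng = np.random.default_rng(0)
p = rng.normal(size=(4000, 3)); p /= np.linalg.norm(p, axis=1, keepdims=True)
n = p.copy()
a = np.full(len(p), 4.0 * np.pi / len(p))
q = np.array([[0.0, 0.0, 0.0],      # deep inside   -> ~1
              [0.0, 0.0, 2.0],      # outside       -> ~0
              [0.0, 0.0, 0.999]])   # near surface  -> unstable for this estimator
print(winding_number(q, p, n, a))
```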
Poster
Karlo Koledic · Luka Petrovic · Ivan Marković · Ivan Petrovic

[ Exhibit Hall I ]

Abstract
Generalizing metric monocular depth estimation presents a significant challenge due to its ill-posed nature, while the entanglement between camera parameters and depth amplifies issues further, hindering multi-dataset training and zero-shot accuracy. This challenge is particularly evident in autonomous vehicles and mobile robotics, where data is collected with fixed camera setups, limiting the geometric diversity. Yet, this context also presents an opportunity: the fixed relationship between the camera and the ground plane imposes additional perspective geometry constraints, enabling depth regression via vertical image positions of objects. However, this cue is highly susceptible to overfitting, thus we propose a novel canonical representation that maintains consistency across varied camera setups, effectively disentangling depth from specific parameters and enhancing generalization across datasets. We also propose a novel architecture that adaptively and probabilistically fuses depths estimated via object size and vertical image position cues. A comprehensive evaluation demonstrates the effectiveness of the proposed approach on five autonomous driving datasets, achieving accurate metric depth estimation for varying resolutions, aspect ratios and camera setups. Notably, we achieve comparable accuracy to existing zero-shot methods, despite training on a single dataset with a single-camera setup.
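The ground-plane cue mentioned above has a simple closed form for a pinhole camera with a horizontal optical axis mounted at known height h: a ground contact point seen at image row v (below the horizon row c_y) lies at depth Z = f_y·h/(v − c_y). The sketch below is only this textbook cue; the paper's contribution is a canonical representation and probabilistic fusion that avoid overfitting to any one such fixed setup.

```python
def depth_from_vertical_position(v, fy, cy, cam_height):
    """Ground-plane depth from the vertical image position of a ground contact point,
    assuming a pinhole camera with a horizontal optical axis at height cam_height."""
    assert v > cy, "point must lie below the horizon line"
    return fy * cam_height / (v - cy)

# Example: fy = 1200 px, principal point row cy = 540, camera 1.5 m above ground.
for v in (600, 700, 900):
    print(v, "->", round(depth_from_vertical_position(v, 1200.0, 540.0, 1.5), 2), "m")
```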
Poster
Minghao Wen · Shengjie Wu · Kangkan Wang · Dong Liang

[ Exhibit Hall I ]

Abstract
3D Gaussian Splatting-based 3D editing has demonstrated impressive performance in recent years. However, multi-view editing often exhibits significant local inconsistency, especially in areas of non-rigid deformation, which leads to local artifacts, texture blurring, or semantic variations in edited 3D scenes. We also found that existing editing methods, which rely entirely on text prompts, make the editing process a "one-shot deal", making it difficult for users to flexibly control the degree of editing. In response to these challenges, we present InterGSEdit, a novel framework for high-quality 3DGS editing via interactively selecting key views according to users' preferences. We propose a CLIP-based Semantic Consistency Selection (CSCS) strategy to adaptively screen a group of semantically consistent reference views for each user-selected key view. Then, the cross-attention maps derived from the reference views are used in a weighted Gaussian Splatting unprojection to construct the 3D Geometry-Consistent Attention Prior ($GAP^{3D}$). We project $GAP^{3D}$ to obtain 3D-constrained attention, which is fused with 2D cross-attention via an Attention Fusion Network (AFN). AFN employs an adaptive attention strategy that prioritizes 3D-constrained attention for geometric consistency during early inference, and gradually prioritizes 2D cross-attention maps in diffusion for fine-grained features during later inference. Extensive experiments demonstrate that InterGSEdit achieves state-of-the-art performance, delivering consistent, high-fidelity 3DGS editing with improved user …
Poster
Yexin Huang · Yongbin Lin · Lishengsa Yue · Zhihong Yao · Jie Wang

[ Exhibit Hall I ]

Abstract
Human-machine interaction technology requires not only the distribution of human visual attention but also the prediction of the gaze point trajectory. We introduce $\textbf{PILOT}$, a programmatic imitation learning approach that predicts a driver’s eye movements based on a set of rule-based conditions. These conditions—derived from driving operations and traffic flow characteristics—define how gaze shifts occur. They are initially identified through incremental synthesis, a heuristic search method, and then refined via L-BFGS, a numerical optimization technique. These human-readable rules enable us to understand drivers’ eye movement patterns and make efficient and explainable predictions. We also propose $\textbf{DATAD}$, a dataset that covers 12 types of autonomous driving takeover scenarios, collected from 60 participants and comprising approximately 600,000 frames of gaze point data. Compared to existing eye-tracking datasets, DATAD includes additional driving metrics and surrounding traffic flow characteristics, providing richer contextual information for modeling gaze behavior. Experimental evaluations of PILOT on DATAD demonstrate superior accuracy and faster prediction speeds compared to four baseline models. Specifically, PILOT reduces the MSE of predicted trajectories by 39.91\% to 88.02\% and improves the accuracy of gaze object predictions by 13.99\% to 55.06\%. Moreover, PILOT achieves these gains with approximately 30\% lower prediction time, offering both more accurate …
Poster
Yiyang Chen · Shanshan Zhao · Lunhao Duan · Changxing Ding · Dacheng Tao

[ Exhibit Hall I ]

Abstract
Diffusion-based models, widely used in text-to-image generation, have proven effective in 2D representation learning. Recently, this framework has been extended to 3D self-supervised learning by constructing a conditional point generator for enhancing 3D representations. However, its performance remains constrained by the 3D diffusion model, which is trained on the available 3D datasets with limited size. We hypothesize that the robust capabilities of text-to-image diffusion models, particularly Stable Diffusion (SD), which is trained on large-scale datasets, can help overcome these limitations. To investigate this hypothesis, we propose PointSD, a framework that leverages the SD model for 3D self-supervised learning. By replacing the SD model's text encoder with a 3D encoder, we train a point-to-image diffusion model that allows point clouds to guide the denoising of rendered noisy images. With the trained point-to-image diffusion model, we use noise-free images as the input and point clouds as the condition to extract SD features. Next, we train a 3D backbone by aligning its features with these SD features, thereby facilitating direct semantic learning. Comprehensive experiments on downstream point cloud tasks and ablation studies demonstrate that the SD model can enhance point cloud self-supervised learning. Our code will be publicly available.
Poster
Kota Shimomura · Masaki Nambata · Atsuya Ishikawa · Ryota Mimura · Takayuki Kawabuchi · Takayoshi Yamashita · Koki Inoue

[ Exhibit Hall I ]

Abstract
Although autonomous driving systems demonstrate high perception performance, they still face limitations when handling rare situations or complex road structures. Since existing road infrastructures are designed for human drivers, safety improvements are typically introduced only after accidents occur. This reactive approach poses a significant challenge for autonomous systems, which require proactive risk mitigation. To address this issue, we propose OD-RASE, a framework for enhancing the safety of autonomous driving systems by detecting road structures that cause traffic accidents and connecting these findings to infrastructure development. First, we formalize an ontology based on specialized domain knowledge of road traffic systems. In parallel, we generate infrastructure improvement proposals using a large-scale visual language model (LVLM) and use ontology-driven data filtering to enhance their reliability. This process automatically annotates improvement proposals on pre-accident road images, leading to the construction of a new dataset. Furthermore, we introduce the Baseline approach (OD-RASE model), which leverages LVLM and a diffusion model to produce both infrastructure improvement proposals and generated images of the improved road environment. Our experiments demonstrate that ontology-driven data filtering enables highly accurate prediction of accident-causing road structures and the corresponding improvement plans. We believe that this work contributes to the overall safety of …
Poster
Eunjin Son · HyungGi Jo · Wookyong Kwon · Sang Jun Lee

[ Exhibit Hall I ]

Abstract
Omnidirectional stereo matching (OSM) estimates $360^\circ$ depth by performing stereo matching on multi-view fisheye images. Existing methods assume a unimodal depth distribution, matching each pixel to a single object. However, this assumption constrains the sampling range, causing over-smoothed depth artifacts, especially at object boundaries. To address these limitations, we propose MDP-Omni, a novel OSM network that leverages parameter-free multimodal depth priors. Specifically, we introduce a depth prior-based sampling method, which adjusts the sampling range without additional parameters. Furthermore, we present the azimuth-based multi-view volume fusion module to build a single cost volume. It mitigates false matches caused by occlusions in warped multi-view volumes. Experimental results demonstrate that MDP-Omni significantly improves existing methods, particularly in capturing fine details.
Poster
Rui Yu · Xianghang Zhang · Runkai Zhao · Huaicheng Yan · Meng Wang

[ Exhibit Hall I ]

Abstract
End-to-end autonomous driving has recently seen rapid development, exerting a profound influence on both industry and academia. However, existing work places excessive focus on ego-vehicle status as its sole learning objective and lacks planning-oriented understanding, which limits the robustness of the overall decision-making process. In this work, we introduce DistillDrive, an end-to-end knowledge distillation-based autonomous driving model that leverages diversified instance imitation to enhance multi-mode motion feature learning. Specifically, we employ a planning model based on structured scene representations as the teacher model, leveraging its diversified planning instances as multi-objective learning targets for the end-to-end model. Moreover, we incorporate reinforcement learning to enhance the optimization of state-to-decision mappings, while utilizing generative modeling to construct planning-oriented instances, fostering intricate interactions within the latent space. We validate our model on the nuScenes and NAVSIM datasets, achieving a 50\% reduction in collision rate and a 3-point improvement in closed-loop performance compared to the baseline model.
Poster
Xi Li · Tong Rao · Cihui Pan

[ Exhibit Hall I ]

Abstract
Recent feature matching methods have achieved remarkable performance but lack efficiency consideration. In this paper, we revisit the mainstream detector-free matching pipeline and improve all its stages considering both accuracy and efficiency. We propose an Efficient Deep feature Matching network, EDM. We first adopt a deeper CNN with fewer dimensions to extract multi-level features. Then we present a Correlation Injection Module that conducts feature transformation on high-level deep features, and progressively injects feature correlations from global to local for efficient multi-scale feature aggregation, improving both speed and performance. In the refinement stage, a novel lightweight bidirectional axis-based regression head is designed to directly predict subpixel-level correspondences from latent features, avoiding the significant computational cost of explicitly locating keypoints on high-resolution local feature heatmaps. Moreover, effective selection strategies are introduced to enhance matching accuracy. Extensive experiments show that our EDM achieves competitive matching accuracy on various benchmarks and exhibits excellent efficiency, offering valuable best practices for real-world applications. The code will be made publicly available.
Poster
Zhensheng Yuan · Haozhi Huang · Zhen Xiong · Di Wang · Guanghua Yang

[ Exhibit Hall I ]

Abstract
We present a resource-efficient framework that enables fast reconstruction and real-time rendering of urban-level scenarios while maintaining robustness against appearance variations across multi-view captures. Our approach begins with scene partitioning for parallel training, employing a visibility-based image selection strategy to optimize resource utilization. A controllable level-of-detail (LOD) strategy regulates Gaussian density during training and rendering to balance quality, memory efficiency, and performance. An appearance transformation module mitigates inconsistencies across images while enabling flexible adjustments. Additionally, we utilize enhancement modules, such as depth regularization, scale regularization, and anti-aliasing, to improve reconstruction fidelity. Experimental results demonstrate that our method effectively reconstructs urban-scale scenes and outperforms previous approaches in both efficiency and quality.
Poster
Elisabetta Fedele · Boyang Sun · Francis Engelmann · Marc Pollefeys · Leonidas Guibas

[ Exhibit Hall I ]

Abstract
We present SuperDec, an approach for compact 3D scene representations based on geometric primitives, namely superquadrics. While most recent works leverage geometric primitives to obtain photorealistic 3D scene representations, we propose to leverage them to obtain a compact yet expressive representation. We solve the problem locally on individual objects and leverage the capabilities of instance segmentation methods to scale our solution to full 3D scenes. In doing so, we design a new architecture that efficiently decomposes point clouds of arbitrary objects into a compact set of superquadrics. We train our architecture on ShapeNet and demonstrate its generalization capabilities on object instances extracted from the ScanNet++ dataset as well as on full Replica scenes. Finally, we show how a compact representation based on superquadrics can be useful for a diverse range of downstream applications, including robotic tasks and controllable visual content generation and editing.
Poster
Dengke Zhang · Fagui Liu · Quan Tang

[ Exhibit Hall I ]

Abstract
Open-vocabulary semantic segmentation aims to assign semantic labels to each pixel without being constrained by a predefined set of categories. While Contrastive Language-Image Pre-training (CLIP) excels in zero-shot classification, it struggles to align image patches with category embeddings because of its incoherent patch correlations. This study reveals that inter-class correlations are the main reason for impairing CLIP's segmentation performance. Accordingly, we propose CorrCLIP, which reconstructs the scope and value of patch correlations. Specifically, CorrCLIP leverages the Segment Anything Model (SAM) to define the scope of patch interactions, reducing inter-class correlations. To mitigate the problem that SAM-generated masks may contain patches belonging to different classes, CorrCLIP incorporates self-supervised models to compute coherent similarity values, suppressing the weight of inter-class correlations. Additionally, we introduce two additional branches to strengthen patch features’ spatial details and semantic representation. Finally, we update segmentation maps with SAM-generated masks to improve spatial consistency. Based on the improvement across patch correlations, feature representations, and segmentation maps, CorrCLIP achieves superior performance across eight benchmarks.
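A rough sketch of the "restrict the scope of patch correlations with SAM masks" idea follows: patch-to-patch affinity is kept only within the same segment, zeroing inter-segment (and hence most inter-class) terms. This illustrates only the scope restriction; CorrCLIP additionally reweights correlation values with self-supervised features, which is not reproduced here, and all names are illustrative.

```python
import numpy as np

def mask_restricted_correlation(patch_feats, segment_ids):
    """Restrict patch-to-patch correlation to pairs within the same SAM segment.
    patch_feats: (P, D) features; segment_ids: (P,) integer segment labels."""
    f = patch_feats / np.linalg.norm(patch_feats, axis=1, keepdims=True)
    sim = f @ f.T                                         # cosine similarity
    sim = np.maximum(sim, 0.0)                            # keep only positive affinities
    same_segment = segment_ids[:, None] == segment_ids[None, :]
    sim = np.where(same_segment, sim, 0.0)                # drop inter-segment terms
    return sim / sim.sum(axis=1, keepdims=True)           # row-normalized affinity

rng = np.random.default_rng(0)
feats = rng.normal(size=(16, 64))
segs = np.repeat(np.arange(4), 4)                         # four SAM masks, 4 patches each
print(mask_restricted_correlation(feats, segs).shape)
```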
Poster
Kang DU · Zhihao Liang · Yulin Shen · Zeyu Wang

[ Exhibit Hall I ]

Abstract
Gaussian Splatting (GS) has become an effective representation for photorealistic rendering, but the information about geometry, material, and lighting is entangled and requires illumination decomposition for editing.Current GS-based approaches face significant challenges in disentangling complex light-geometry-material interactions under non-Lambertian conditions, particularly when handling specular reflections and shadows.We present GS-ID, a novel end-to-end framework that achieves comprehensive illumination decomposition by integrating adaptive light aggregation with diffusion-based material priors.In addition to a learnable environment map that captures ambient illumination, we model complex local lighting conditions by adaptively aggregating a set of anisotropic and spatially-varying spherical Gaussian mixtures during optimization.To better model shadow effects, we associate a learnable unit vector with each splat to represent how multiple light sources cause the shadow, further enhancing lighting and material estimation.Together with intrinsic priors from diffusion models, GS-ID significantly reduces light-geometry-material ambiguity and achieves state-of-the-art illumination decomposition performance.Experiments also show that GS-ID effectively supports various downstream applications such as relighting and scene composition.
Poster
Shuangkang Fang · I-Chao Shen · Takeo Igarashi · Yufeng Wang · ZeSheng Wang · Yi Yang · Wenrui Ding · Shuchang Zhou

[ Exhibit Hall I ]

Abstract
We introduce NeRF-GS, a novel framework that jointly optimizes Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS). This framework leverages the inherent continuous spatial representation of NeRF to mitigate several limitations of 3DGS, including sensitivity to Gaussian initialization, limited spatial awareness, and weak inter-Gaussian correlations, thereby enhancing its performance. In NeRF-GS, we revisit the design of 3DGS and progressively align its spatial features with NeRF, enabling both representations to be optimized within the same scene through shared 3D spatial information. We further address the formal distinctions between the two approaches by optimizing residual vectors for both implicit features and Gaussian positions to enhance the personalized capabilities of 3DGS. Experimental results on benchmark datasets show that NeRF-GS surpasses existing methods and achieves state-of-the-art performance. This outcome confirms that NeRF and 3DGS are complementary rather than competing, offering new insights into hybrid approaches that combine 3DGS and NeRF for efficient 3D scene representation.
Poster
Jiamin WU · Kenkun Liu · Xiaoke Jiang · Yuan Yao · Lei Zhang

[ Exhibit Hall I ]

Abstract
In this work, we introduce UniGS, a novel 3D Gaussian reconstruction and novel view synthesis model that predicts a high-fidelity representation of 3D Gaussians from an arbitrary number of posed sparse-view images. Previous methods often regress 3D Gaussians locally on a per-pixel basis for each view and then transfer them to world space and merge them through point concatenation. In contrast, our approach models unitary 3D Gaussians in world space and updates them layer by layer. To leverage information from multi-view inputs for updating the unitary 3D Gaussians, we develop a DETR (DEtection TRansformer)-like framework, which treats 3D Gaussians as queries and updates their parameters by performing multi-view cross-attention (MVDFA) across multiple input images, which are treated as keys and values. This approach effectively avoids the 'ghosting' issue and allocates more 3D Gaussians to complex regions. Moreover, since the number of 3D Gaussians used as decoder queries is independent of the number of input views, our method allows an arbitrary number of multi-view images as input without causing memory explosion or requiring retraining. Extensive experiments validate the advantages of our approach, showcasing superior performance over existing methods quantitatively (improving PSNR by 4.2 dB when trained on Objaverse and tested on the GSO benchmark) and qualitatively.
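The DETR-like update described above can be sketched as one decoder layer in which a fixed set of learnable Gaussian queries cross-attends to tokens from however many views are available. The layer layout, head dimensions, and the 14-parameter Gaussian head below are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn as nn

class GaussianQueryDecoderLayer(nn.Module):
    """One DETR-style decoder layer: unitary 3D Gaussian queries attend to tokens
    from all input views (keys/values); the query count is independent of view count."""
    def __init__(self, dim=256, heads=8, num_queries=1024):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.to_gaussian = nn.Linear(dim, 3 + 3 + 4 + 1 + 3)   # xyz, scale, quat, opacity, rgb

    def forward(self, multiview_tokens):          # (B, V*T, dim) tokens from V views
        B = multiview_tokens.shape[0]
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        q = self.norm1(q + self.cross_attn(q, multiview_tokens, multiview_tokens)[0])
        q = self.norm2(q + self.ffn(q))
        return self.to_gaussian(q)                # (B, num_queries, 14)

layer = GaussianQueryDecoderLayer()
tokens = torch.randn(2, 4 * 196, 256)             # e.g., 4 views, 196 tokens each
print(layer(tokens).shape)                        # torch.Size([2, 1024, 14])
```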
Poster
Hongyang Sun · Qinglin Yang · Jiawei Wang · Zhen Xu · Chen Liu · Yida Wang · Kun Zhan · Hujun Bao · Xiaowei Zhou · Sida Peng

[ Exhibit Hall I ]

Abstract
Recent advances in differentiable rendering have significantly improved dynamic street scene reconstruction. However, the complexity of large-scale scenarios and dynamic elements, such as vehicles and pedestrians, remains a substantial challenge. Existing methods often struggle to scale to large scenes or accurately model arbitrary dynamics. To address these limitations, we propose Hierarchy UGP, which constructs a hierarchical structure consisting of a root level, sub-scenes level, and primitive level, using Unified Gaussian Primitive (UGP) defined in 4D space as the representation. The root level serves as the entry point to the hierarchy. At the sub-scenes level, the scene is spatially divided into multiple sub-scenes, with various elements extracted. At the primitive level, each element is modeled with UGPs, and its global pose is controlled by a motion prior related to time. This hierarchical design greatly enhances the model's capacity, enabling it to model large-scale scenes. Additionally, our UGP allows for the reconstruction of both rigid and non-rigid dynamics. We conducted experiments on Dynamic City, our proprietary large-scale dynamic street scene dataset, as well as the public Waymo dataset. Experimental results demonstrate that our method achieves state-of-the-art performance. We plan to release the accompanying code and the Dynamic City dataset as open resources …
Poster
Ziyang Ren · Ping Wei · Shangqi Deng · Haowen Tang · Jiapeng Li · Huan Li

[ Exhibit Hall I ]

Abstract
Pedestrian trajectory prediction is crucial for many intelligent tasks. While existing methods predict future trajectories from fixed-frame historical observations, they are limited by the observational perspective and the need for extensive historical information, resulting in prediction delays and inflexible generalization in real-time systems. In this paper, we propose a novel task called Transferable Online Pedestrian Trajectory Prediction (TOTP), which synchronously predicts future trajectories with variable observations and enables effective task transfer under different observation constraints. To advance TOTP modeling, we propose a Temporal-Adaptive Mamba Latent Diffusion (TAMLD) model. It utilizes the Social-Implicit Mamba Synthesizer to extract motion states with social interaction and refine temporal representations through Temporal-Aware Distillation. A Trend-Conditional Mamba Decomposer generates the motion latent distribution of the future motion trends and predicts future motion trajectories through sampling decomposition. We utilize Motion-Latent Mamba Diffusion to reconstruct the latent space disturbed by imbalanced temporal noise. Our method achieves state-of-the-art results on multiple datasets and tasks, showcasing temporal adaptability and strong generalization.
Poster
Xuying Zhang · Yupeng Zhou · Kai Wang · Yikai Wang · Zhen Li · Daquan Zhou · Shaohui Jiao · Qibin Hou · Ming-Ming Cheng

[ Exhibit Hall I ]

Abstract
Multi-view synthesis serves as a fundamental component in creating high-quality 3D assets. We observe that the existing works represented by the Zero123 series typically struggle to maintain cross-view consistency, especially when handling views with significantly different camera poses. To overcome this challenge, we present AR-1-to-3, a novel paradigm to progressively generate the target views in an autoregressive manner. Rather than producing multiple discrete views of a 3D object from a single-view image and a set of camera poses or multiple views simultaneously under specified camera conditions, AR-1-to-3 starts from generating views closer to the input view, which is utilized as contextual information to prompt the generation of farther views. In addition, we propose two image conditioning strategies, termed as Stacked-LE and LSTM-GE, to encode previously generated sequence views and provide pixel-wise spatial guidance and high-level semantic information for the generation of current target views. Extensive experiments on several publicly available 3D datasets show that our method can synthesize more consistent 3D views and produce high-quality 3D assets that closely mirror the given image. Code and pre-trained weights will be open-sourced.
Poster
Fabian Perez · Sara Rojas Martinez · Carlos Hinojosa · Hoover Rueda-Chacón · Bernard Ghanem

[ Exhibit Hall I ]

Abstract
Neural Radiance Field (NeRF)-based segmentation methods focus on object semantics and rely solely on RGB data, lacking intrinsic material properties. This limitation restricts accurate material perception, which is crucial for robotics, augmented reality, simulation, and other applications. We introduce UnMix-NeRF, a framework that integrates spectral unmixing into NeRF, enabling joint hyperspectral novel view synthesis and unsupervised material segmentation. Our method models spectral reflectance via diffuse and specular components, where a learned dictionary of global endmembers represents pure material signatures, and per-point abundances capture their distribution. For material segmentation, we use spectral signature predictions along learned endmembers, allowing unsupervised material clustering. Additionally, UnMix-NeRF enables scene editing by modifying learned endmember dictionaries for flexible material-based appearance manipulation. Extensive experiments validate our approach, demonstrating superior spectral reconstruction and material segmentation to existing methods. The associated data and code for reproduction will be made publicly available.
Poster
Zebin He · Mx Yang · Shuhui Yang · Yixuan Tang · Tao Wang · Kaihao Zhang · Guanying Chen · Lliu Yuhong · Jie Jiang · Chunchao Guo · Wenhan Luo

[ Exhibit Hall I ]

Abstract
Physically-based rendering (PBR) has become a cornerstone in modern computer graphics, enabling realistic material representation and lighting interactions in 3D scenes. In this paper, we present MaterialMVP, a novel end-to-end model for generating PBR textures from 3D meshes and image prompts, addressing key challenges in multi-view material synthesis. Our approach leverages Reference Attention to extract and encode informative latent from the input reference images, enabling intuitive and controllable texture generation. We also introduce a Consistency-Regularized Training strategy to enforce stability across varying viewpoints and illumination conditions, ensuring illumination-invariant and geometrically consistent results. Additionally, we propose Dual-Channel Material Generation, which separately optimizes albedo and metallic-roughness (MR) textures while maintaining precise spatial alignment with the input images through Multi-Channel Aligned Attention. Learnable material embeddings are further integrated to capture the distinct properties of albedo and MR. Experimental results demonstrate that our model generates PBR textures with realistic behavior across diverse lighting scenarios, outperforming existing methods in both consistency and quality for scalable 3D asset creation.
Poster
Jinhua Zhang · Hualian Sheng · Sijia Cai · Bing Deng · Qiao Liang · Wen Li · Ying Fu · Jieping Ye · Shuhang Gu

[ Exhibit Hall I ]

Abstract
Controllable generation is considered a potentially vital approach to address the challenge of annotating 3D data, and the precision of such controllable generation becomes particularly imperative in the context of data production for autonomous driving. Existing methods focus on the integration of diverse generative information into controlling inputs, utilizing frameworks such as GLIGEN or ControlNet, to produce commendable outcomes in controllable generation. However, such approaches intrinsically restrict generation performance to the learning capacities of predefined network architectures. In this paper, we explore the innovative integration of controlling information and introduce PerLDiff (\textbf{Per}spective-\textbf{L}ayout \textbf{Diff}usion Models), a novel method for effective street view image generation that fully leverages perspective 3D geometric information. Our PerLDiff employs 3D geometric priors to guide the generation of street view images with precise object-level control within the network learning process, resulting in a more robust and controllable output. Moreover, it demonstrates superior controllability compared to alternative layout control methods. Empirical results justify that our PerLDiff markedly enhances the precision of controllable generation on the NuScenes and KITTI datasets.
Poster
Zhongpai Gao · Benjamin Planche · Meng Zheng · Anwesa Choudhuri · Terrence Chen · Ziyan Wu

[ Exhibit Hall I ]

Abstract
Real-time rendering of dynamic scenes with view-dependent effects remains a fundamental challenge in computer graphics. While recent advances in Gaussian Splatting have shown promising results separately handling dynamic scenes (4DGS) and view-dependent effects (6DGS), no existing method unifies these capabilities while maintaining real-time performance. We present 7D Gaussian Splatting (7DGS), a unified framework representing scene elements as seven-dimensional Gaussians spanning position (3D), time (1D), and viewing direction (3D). Our key contribution is an efficient conditional slicing mechanism that transforms 7D Gaussians into view- and time-conditioned 3D Gaussians, maintaining compatibility with existing 3D Gaussian Splatting pipelines while enabling joint optimization. Experiments demonstrate that 7DGS outperforms prior methods by up to 7.36 dB in PSNR while achieving real-time rendering (401 FPS) on challenging dynamic scenes with complex view-dependent effects.
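The "conditional slicing" of a higher-dimensional Gaussian at a given time and viewing direction follows the standard Gaussian conditioning identity; the sketch below conditions a 7D Gaussian (3D position plus a 4D time/direction block) and also returns the marginal density of the conditioning block, which can act as a visibility/opacity weight. This is only the textbook identity used to illustrate the idea; the paper's efficient mechanism and parameterization may differ.

```python
import numpy as np

def condition_gaussian(mu, cov, b_value, a_dim=3):
    """Slice a joint Gaussian N(mu, cov) over [a; b] at b = b_value.
    Returns the conditional N(mu_a|b, cov_a|b) and the marginal density of b.
    Here a = 3D position, b = (time, view direction)."""
    mu_a, mu_b = mu[:a_dim], mu[a_dim:]
    Saa = cov[:a_dim, :a_dim]
    Sab = cov[:a_dim, a_dim:]
    Sbb = cov[a_dim:, a_dim:]
    Sbb_inv = np.linalg.inv(Sbb)
    diff = b_value - mu_b
    mu_cond = mu_a + Sab @ Sbb_inv @ diff
    cov_cond = Saa - Sab @ Sbb_inv @ Sab.T
    k = len(mu_b)
    weight = np.exp(-0.5 * diff @ Sbb_inv @ diff) / np.sqrt((2 * np.pi) ** k * np.linalg.det(Sbb))
    return mu_cond, cov_cond, weight

rng = np.random.default_rng(0)
A = rng.normal(size=(7, 7))
cov7 = A @ A.T + 7 * np.eye(7)          # a valid 7x7 covariance
mu7 = np.zeros(7)
tb = np.array([0.5, 0.0, 0.0, 1.0])     # query time and view-direction block
mu3, cov3, w = condition_gaussian(mu7, cov7, tb)
print(mu3.shape, cov3.shape, round(float(w), 6))
```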
Poster
Shakiba Kheradmand · Delio Vicini · George Kopanas · Dmitry Lagun · Kwang Moo Yi · Mark Matthews · Andrea Tagliasacchi

[ Exhibit Hall I ]

Abstract
3D Gaussian splatting (3DGS) is a popular radiance field method, with many application-specific extensions. Most variants rely on the same core algorithm: depth-sorting of Gaussian splats then rasterizing in primitive order. This ensures correct alpha compositing, but can cause rendering artifacts due to built-in approximations. Moreover, for a fixed representation, sorted rendering offers little control over render cost and visual fidelity. For example, and counter-intuitively, rendering a lower-resolution image is not necessarily faster. In this work, we address the above limitations by combining 3D Gaussian splatting with stochastic rasterization. Concretely, we leverage an unbiased Monte Carlo estimator of the volume rendering equation. This removes the need for sorting, and allows for accurate 3D blending of overlapping Gaussians. The number of Monte Carlo samples further imbues 3DGS with a way to trade off computation time and quality. We implement our method using OpenGL shaders, enabling efficient rendering on modern GPU hardware. At a reasonable visual quality, our method renders more than four times faster than sorted rasterization.
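The contrast between sorted compositing and a sort-free Monte Carlo estimate can be seen on a single ray. In the stochastic-transparency-style sketch below, each splat is treated as opaque with probability equal to its alpha, and a sample returns the nearest "present" splat's color; the expectation of this estimator equals the sorted compositing result, and evaluating it only needs an order-independent minimum. This is a generic illustration, not necessarily the paper's exact estimator of the volume rendering equation.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 6
depth = rng.uniform(1.0, 5.0, K)            # per-splat depth along one ray
alpha = rng.uniform(0.1, 0.6, K)            # per-splat opacity after projection
color = rng.uniform(0.0, 1.0, (K, 3))
background = np.zeros(3)

def sorted_compositing():
    """Conventional front-to-back alpha compositing over depth-sorted splats."""
    order = np.argsort(depth)
    c, T = np.zeros(3), 1.0
    for i in order:
        c += T * alpha[i] * color[i]
        T *= 1.0 - alpha[i]
    return c + T * background

def stochastic_estimate(num_samples=20000):
    """Sort-free Monte Carlo estimate: each splat is opaque w.p. alpha; a sample
    returns the color of the nearest 'present' splat (order-independent min)."""
    acc = np.zeros(3)
    for _ in range(num_samples):
        present = rng.random(K) < alpha
        if present.any():
            acc += color[np.argmin(np.where(present, depth, np.inf))]
        else:
            acc += background
    return acc / num_samples

print(sorted_compositing())
print(stochastic_estimate())   # converges to the sorted result as samples grow
```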
Poster
Guan Luo · Jianfeng Zhang

[ Exhibit Hall I ]

Abstract
High-quality textured mesh reconstruction from sparse-view images remains a fundamental challenge in computer graphics and computer vision. Traditional large reconstruction models operate in a single-scale manner, forcing the models to simultaneously capture global structure and local details, often resulting in compromised reconstructed shapes. In this work, we propose MS3D, a novel multi-scale 3D reconstruction framework. At its core, our method introduces a hierarchical structured latent representation for multi-scale modeling, coupled with a multi-scale feature extraction and integration mechanism. This enables progressive reconstruction, effectively decomposing the complex task of detailed geometry reconstruction into a sequence of easier steps. This coarse-to-fine approach effectively captures multi-frequency details, learns complex geometric patterns, and generalizes well across diverse objects while preserving fine-grained details. Extensive experiments demonstrate MS3D outperforms state-of-the-art methods and is broadly applicable to both image- and text-to-3D generation. The entire pipeline reconstructs high-quality textured meshes in under five seconds.
Poster
Jie Chen · Zhangchi Hu · Peixi Wu · Huyue Zhu · Hebei Li · Xiaoyan Sun

[ Exhibit Hall I ]

Abstract
Dynamic scene reconstruction is a long-term challenge in 3D vision. Existing plane-based methods in dynamic Gaussian splatting suffer from an unsuitable low-rank assumption, causing feature overlap and poor rendering quality. Although 4D hash encoding provides a promising explicit representation without low-rank constraints, directly applying it to the entire dynamic scene leads to substantial hash collisions and redundancy. To address these challenges, we present DASH, a real-time dynamic scene rendering framework that employs 4D hash encoding coupled with self-supervised decomposition. Our approach begins with a self-supervised decomposition mechanism that separates dynamic and static components without manual annotations or precomputed masks. Next, we introduce a multiresolution 4D hash encoder for dynamic elements, providing an explicit representation that avoids the low-rank assumption. Finally, we present a spatio-temporal smoothness regularization strategy to mitigate unstable deformation artifacts. Comprehensive experiments on real-world datasets demonstrate that DASH achieves state-of-the-art dynamic rendering performance, exhibiting significantly enhanced visual quality at real-time speeds of 264 FPS on a single 4090 GPU. The code will be made publicly available.
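To give a feel for what a 4D hash encoding is, the sketch below extends an Instant-NGP-style multiresolution hash table to (x, y, z, t) with nearest-cell lookup. The table sizes, primes, and nearest-cell (rather than interpolated) lookup are simplifying assumptions for illustration; DASH applies such an encoder only to the dynamic component after its self-supervised decomposition.

```python
import torch
import torch.nn as nn

class Hash4DEncoder(nn.Module):
    """Minimal multiresolution 4D (x, y, z, t) hash encoding with nearest-cell lookup."""
    PRIMES = torch.tensor([1, 2654435761, 805459861, 3674653429], dtype=torch.long)

    def __init__(self, levels=4, base_res=8, growth=2.0, table_size=2**16, feat_dim=2):
        super().__init__()
        self.res = [int(base_res * growth**l) for l in range(levels)]
        self.table_size = table_size
        self.tables = nn.ParameterList(
            [nn.Parameter(torch.randn(table_size, feat_dim) * 1e-2) for _ in range(levels)])

    def forward(self, x):                            # x: (N, 4) coords in [0, 1]^4
        feats = []
        for res, table in zip(self.res, self.tables):
            cell = (x * res).long().clamp_(0, res - 1)               # integer grid cells
            scaled = cell * self.PRIMES.to(cell.device)
            h = (scaled[:, 0] ^ scaled[:, 1] ^ scaled[:, 2] ^ scaled[:, 3]) % self.table_size
            feats.append(table[h])
        return torch.cat(feats, dim=-1)              # (N, levels * feat_dim)

enc = Hash4DEncoder()
xyzt = torch.rand(1024, 4)
print(enc(xyzt).shape)                               # torch.Size([1024, 8])
```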
Poster
Yuqi Wu · Wenzhao Zheng · Sicheng Zuo · Yuanhui Huang · Jie Zhou · Jiwen Lu

[ Exhibit Hall I ]

Abstract
3D occupancy prediction provides a comprehensive description of the surrounding scenes and has become an essential task for 3D perception. Most existing methods focus on offline perception from one or a few views and cannot be applied to embodied agents that need to gradually perceive the scene through progressive embodied exploration. In this paper, we formulate an embodied 3D occupancy prediction task to target this practical scenario and propose a Gaussian-based EmbodiedOcc framework to accomplish it. We initialize the global scene with uniform 3D semantic Gaussians and progressively update local regions observed by the embodied agent. For each update, we extract semantic and structural features from the observed image and efficiently incorporate them via deformable cross-attention to refine the regional Gaussians. Finally, we employ Gaussian-to-voxel splatting to obtain the global 3D occupancy from the updated 3D Gaussians. Our EmbodiedOcc assumes an unknown (i.e., uniformly distributed) environment and maintains an explicit global memory of it with 3D Gaussians. It gradually gains knowledge through the local refinement of regional Gaussians, which is consistent with how humans understand new scenes through embodied exploration. We reorganize an EmbodiedOcc-ScanNet benchmark based on local annotations to facilitate the evaluation of the embodied 3D occupancy prediction task. Our …
Poster
Shaocheng Yan · Pengcheng Shi · Zhenjun Zhao · Kaixin Wang · Kuang Cao · Ji Wu · Jiayuan Li

[ Exhibit Hall I ]

Abstract
Robust estimation is essential in correspondence-based Point Cloud Registration (PCR). Existing methods using maximal clique search in compatibility graphs achieve high recall but suffer from exponential time complexity, limiting their use in time-sensitive applications. To address this challenge, we propose a fast and robust estimator, TurboReg, built upon a novel lightweight clique, TurboClique, and a highly parallelizable Pivot-Guided Search (PGS) algorithm. First, we define the TurboClique as a 3-clique within a highly-constrained compatibility graph. The lightweight nature of the 3-clique allows for efficient parallel searching, and the highly-constrained compatibility graph ensures robust spatial consistency for stable transformation estimation. Next, PGS selects matching pairs with high SC$^2$ scores as pivots, effectively guiding the search toward TurboCliques with higher inlier ratios. Moreover, the PGS algorithm has linear time complexity and is significantly more efficient than the maximal clique search with exponential time complexity. Extensive experiments show that TurboReg achieves state-of-the-art performance across multiple real-world datasets, with substantial speed improvements. For example, on the 3DMatch+FCGF dataset, TurboReg (1K) and TurboReg (0.5K) operate $208.22\times$ and $213.35\times$ faster than 3DMAC, respectively, while also enhancing recall. Our code is accessible at \href{https://anonymous.4open.science/r/TurboReg-FDB7/}{\texttt{TurboReg}}.
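The building blocks behind a "lightweight 3-clique" can be sketched directly: correspondences are pairwise compatible when they preserve pairwise distances (a necessary condition under any rigid transform), and three mutually compatible correspondences already suffice to estimate a rigid transform via the Kabsch/SVD solution. The distance-preservation test below stands in for the SC$^2$-based compatibility used by TurboReg, and the pivot-guided parallel search is not reproduced; names are illustrative.

```python
import numpy as np

def pairwise_compatible(p, q, i, j, tau=0.05):
    """Correspondences i and j are compatible if they approximately preserve distance."""
    return abs(np.linalg.norm(p[i] - p[j]) - np.linalg.norm(q[i] - q[j])) < tau

def rigid_from_three(p3, q3):
    """Kabsch/SVD estimate of (R, t) mapping p3 onto q3 from three correspondences."""
    pc, qc = p3.mean(0), q3.mean(0)
    H = (p3 - pc).T @ (q3 - qc)
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    return R, qc - R @ pc

# Toy data: a known rigid motion plus one gross outlier correspondence.
rng = np.random.default_rng(0)
R_gt = np.linalg.qr(rng.normal(size=(3, 3)))[0]
if np.linalg.det(R_gt) < 0:
    R_gt[:, 0] *= -1
t_gt = np.array([0.3, -0.2, 0.5])
p = rng.uniform(-1, 1, (6, 3))
q = p @ R_gt.T + t_gt
q[5] += np.array([1.0, 1.0, 0.0])            # index 5 is an outlier

clique = (0, 1, 2)                           # three mutually compatible inliers
assert all(pairwise_compatible(p, q, i, j) for i in clique for j in clique if i < j)
R_est, t_est = rigid_from_three(p[list(clique)], q[list(clique)])
print(np.allclose(R_est, R_gt, atol=1e-6), np.allclose(t_est, t_gt, atol=1e-6))
```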
Poster
Benjin Zhu · Xiaogang Wang · Hongsheng Li

[ Exhibit Hall I ]

Abstract
Scene synthesis plays a crucial role in autonomous driving by addressing data scarcity and closed-loop validation. Current approaches struggle to maintain temporal consistency in synthesized videos while preserving fine-grained details. We introduce ConsistentCity, a two-stage framework with a novel Semantic Flow-guided Diffusion Transformer (SF-DiT) that converts sequential BEV semantic maps into temporally consistent driving videos. Operating in a pretrained occupancy VQ-VAE latent space, our SF-DiT generates temporally consistent 3D occupancy, which provides guidance for controlled image and video diffusion for scene synthesis. To address temporal consistency, SF-DiT enhances standard DiT blocks with temporal semantic modeling through two designs: (1) A Semantic Flow Estimation module capturing scene motions (flow, uncertainty, and classification) from sequential BEV semantic maps, and (2) A Semantic Flow-Modulated Cross-Attention module that dynamically adapts attention based on semantic flow patterns. This integration of semantic flow modeling in DiT enables consistent scene evolution understanding. Evaluations of image and video synthesis on the nuScenes dataset demonstrate state-of-the-art performance with FID 8.3 and FVD 73.6, and superior temporal occupancy generation results on nuCraft and OpenOccupancy benchmarks.
Poster
Peixi Wu · Bosong Chai · Menghua Zheng · Wei Li · Zhangchi Hu · Jie Chen · Zheyu Zhang · Hebei Li · Xiaoyan Sun

[ Exhibit Hall I ]

Abstract
Bio-inspired Spiking Neural Networks (SNNs) provide an energy-efficient way to extract 3D spatio-temporal features. However, existing 3D SNNs have struggled with long-range dependencies until the recent emergence of Mamba, which offers superior computational efficiency and sequence modeling capability. In this work, we propose Spiking Point Mamba (SPM), the first Mamba-based SNN in the 3D domain. Due to the poor performance of simply transferring Mamba to 3D SNNs, SPM is designed to utilize both the sequence modeling capabilities of Mamba and the temporal feature extraction of SNNs. Specifically, we first introduce Hierarchical Dynamic Encoding (HDE), an improved direct encoding method that effectively introduces a dynamic temporal mechanism, thereby facilitating temporal interactions. Then, we propose a Spiking Mamba Block (SMB), which builds upon Mamba while learning inter-time-step features and minimizing information loss caused by spikes. Finally, to further enhance model performance, we adopt an asymmetric SNN-ANN architecture for spike-based pre-training and fine-tuning. Compared with the previous state-of-the-art SNN models, SPM improves OA by +6.2%, +6.1%, and +7.4% on three variants of ScanObjectNN, and boosts instance mIoU by +1.9% on ShapeNetPart. Meanwhile, its energy consumption is at least 3.5x lower than that of its ANN counterpart. The code will be made publicly available.
Poster
Yu Sheng · Jiajun Deng · Xinran Zhang · Yu Zhang · Bei Hua · Yanyong Zhang · Jianmin Ji

[ Exhibit Hall I ]

Abstract
A major breakthrough in 3D reconstruction is the feedforward paradigm to generate pixel-wise 3D points or Gaussian primitives from sparse, unposed images. To further incorporate semantics while avoiding the significant memory and storage costs of high-dimensional semantic features, existing methods extend this paradigm by associating each primitive with a compressed semantic feature vector. However, these methods have two major limitations: (a) the naively compressed feature compromises expressiveness, affecting the model's ability to capture fine-grained semantics, and (b) the pixel-wise primitive prediction introduces redundancy in overlapping areas, causing unnecessary memory overhead. To this end, we introduce SpatialSplat, a feedforward framework that produces redundancy-aware Gaussians and capitalizes on a dual-field semantic representation. Particularly, with the insight that primitives within the same instance exhibit high semantic consistency, we decompose the semantic representation into a coarse feature field that encodes uncompressed semantics with minimal primitives, and a fine-grained yet low-dimensional feature field that captures detailed inter-instance relationships. Moreover, we propose a selective Gaussian mechanism, which retains only essential Gaussians in the scene, effectively eliminating redundant primitives. Our proposed SpatialSplat learns accurate semantic information and detailed instance priors with more compact 3D Gaussians, making semantic 3D reconstruction more applicable. We conduct extensive experiments to evaluate our …
Poster
Jungho Lee · DongHyeong Kim · Dogyoon Lee · Suhwan Cho · Minhyeok Lee · Wonjoon Lee · Taeoh Kim · Dongyoon Wee · Sangyoun Lee

[ Exhibit Hall I ]

Abstract
3D Gaussian Splatting (3DGS) has gained significant attention for its high-quality novel view rendering, motivating research to address real-world challenges. A critical issue is the camera motion blur caused by movement during exposure, which hinders accurate 3D scene reconstruction. In this study, we propose CoMoGaussian, a Continuous Motion-Aware Gaussian Splatting that reconstructs precise 3D scenes from motion-blurred images while maintaining real-time rendering speed. Considering the complex motion patterns inherent in real-world camera movements, we predict continuous camera trajectories using neural ordinary differential equations (ODEs). To ensure accurate modeling, we employ rigid body transformations, which preserve the shape and size of the object but rely on the discrete integration of sampled frames. To better approximate the continuous nature of motion blur, we introduce a continuous motion refinement (CMR) transformation that refines rigid transformations by incorporating additional learnable parameters. By revisiting fundamental camera theory and leveraging advanced neural ODE techniques, we achieve precise modeling of continuous camera trajectories, leading to improved reconstruction accuracy. Extensive experiments demonstrate state-of-the-art performance both quantitatively and qualitatively on benchmark datasets, which include a wide range of motion blur scenarios, from moderate to extreme blur.
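A minimal sketch of the general recipe, not the paper's architecture: a small MLP predicts an se(3) twist (body velocity) as a function of time, and discrete exponential-map integration produces the rigid camera poses sampled within the exposure window. The network size, step count, and the plain Euler scheme are assumptions for illustration; the CMR refinement and the Gaussian rendering loss are omitted.

```python
import torch

def hat(w):
    """Map a 3-vector to its 3x3 skew-symmetric matrix."""
    H = torch.zeros(3, 3)
    H[0, 1], H[0, 2] = -w[2], w[1]
    H[1, 0], H[1, 2] = w[2], -w[0]
    H[2, 0], H[2, 1] = -w[1], w[0]
    return H

class TwistField(torch.nn.Module):
    """Tiny MLP predicting an se(3) body velocity (omega, v) at time t in [0, 1]."""
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(1, 32), torch.nn.SiLU(), torch.nn.Linear(32, 6))

    def forward(self, t):
        return self.net(t.view(1, 1)).view(6)

def integrate_poses(field, n_steps=8):
    """Euler integration of dT/dt = T * hat(xi(t)) over the exposure window,
    returning a 4x4 rigid pose for every sub-frame (shape/size preserving)."""
    T = torch.eye(4)
    poses, dt = [T], 1.0 / n_steps
    for k in range(n_steps):
        xi = field(torch.tensor(k * dt))
        twist = torch.zeros(4, 4)
        twist[:3, :3], twist[:3, 3] = hat(xi[:3]), xi[3:]
        T = T @ torch.matrix_exp(twist * dt)   # exponential map keeps T in SE(3)
        poses.append(T)
    return poses

poses = integrate_poses(TwistField())
print(len(poses), poses[-1].shape)   # 9 sub-frame poses, each a 4x4 rigid transform
```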
Poster
Hyeonjoong Jang · Dongyoung Choi · Donggun Kim · Woohyun Kang · Min H. Kim

[ Exhibit Hall I ]

Abstract
We propose a splat-based 3D scene reconstruction method from RGB-D input that effectively handles extreme motion blur, a frequent challenge in low-light environments. Under dim illumination, RGB frames often suffer from severe motion blur due to extended exposure times, causing traditional camera pose estimation methods, such as COLMAP, to fail. This results in inaccurate camera poses and blurry color input, compromising the quality of 3D reconstructions. Although recent 3D reconstruction techniques like Neural Radiance Fields and Gaussian Splatting have demonstrated impressive results, they rely on accurate camera trajectory estimation, which becomes challenging under fast motion or poor lighting conditions. Furthermore, rapid camera movement and the limited field of view of depth sensors reduce point cloud overlap, limiting the effectiveness of pose estimation with the ICP algorithm. To address these issues, we introduce a method that combines camera pose estimation and image deblurring using a Gaussian Splatting framework, leveraging both 3D Gaussian splats and depth inputs for enhanced scene representation. Our method first aligns consecutive RGB-D frames through optical flow and ICP, then refines camera poses and 3D geometry by adjusting Gaussian positions for optimal depth alignment. To handle motion blur, we model camera movement during exposure and deblur images by …
Poster
Du Chen · Liyi Chen · Zhengqiang ZHANG · Lei Zhang

[ Exhibit Hall I ]

Abstract
Implicit Neural Representation (INR) has been successfully employed for Arbitrary-scale Super-Resolution (ASR). However, INR-based models need to query the multi-layer perceptron module numerous times and render a pixel in each query, resulting in insufficient representation capability and computational efficiency. Recently, Gaussian Splatting (GS) has shown its advantages over INR in both visual quality and rendering speed in 3D tasks, which motivates us to explore whether GS can be employed for the ASR task. However, directly applying GS to ASR is exceptionally challenging because the original GS is an optimization-based method that overfits each individual scene, while in ASR we aim to learn a single model that can generalize to different images and scaling factors. We overcome these challenges by developing two novel techniques. Firstly, to generalize GS for ASR, we elaborately design an architecture to predict the corresponding image-conditioned Gaussians of the input low-resolution image in a feed-forward manner. Each Gaussian can fit the shape and direction of an area of complex textures, showing powerful representation capability. Secondly, we implement an efficient differentiable 2D GPU/CUDA-based scale-aware rasterization to render super-resolved images by sampling discrete RGB values from the predicted continuous Gaussians. Via end-to-end training, our optimized network, namely GSASR, can …
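To see why a continuous Gaussian field supports arbitrary-scale sampling, the toy below renders the same set of 2D Gaussians onto grids of different resolutions. It omits the feed-forward prediction network, opacities, and the GPU/CUDA scale-aware rasterizer described above; the additive splatting and parameter ranges are simplifying assumptions.

```python
import numpy as np

def render_gaussians_2d(means, scales, thetas, colors, H, W):
    """Additively render N anisotropic 2D Gaussians onto an HxW grid with
    pixel coordinates in [0, 1]^2 (a real rasterizer would tile and cull)."""
    ys, xs = np.meshgrid(np.linspace(0, 1, H), np.linspace(0, 1, W), indexing="ij")
    pix = np.stack([xs, ys], axis=-1).reshape(-1, 2)             # (H*W, 2)
    img = np.zeros((H * W, 3))
    for mu, s, th, c in zip(means, scales, thetas, colors):
        R = np.array([[np.cos(th), -np.sin(th)], [np.sin(th), np.cos(th)]])
        cov_inv = R @ np.diag(1.0 / s**2) @ R.T                  # anisotropic shape
        d = pix - mu
        w = np.exp(-0.5 * np.einsum("ni,ij,nj->n", d, cov_inv, d))
        img += w[:, None] * c
    return img.reshape(H, W, 3)

# The same predicted Gaussians can be sampled at any output resolution.
rng = np.random.default_rng(0)
N = 50
g = dict(means=rng.random((N, 2)), scales=rng.random((N, 2)) * 0.05 + 0.01,
         thetas=rng.random(N) * np.pi, colors=rng.random((N, 3)))
lr = render_gaussians_2d(**g, H=32, W=32)
hr = render_gaussians_2d(**g, H=96, W=96)    # x3 "super-resolution" of the same field
print(lr.shape, hr.shape)
```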
Poster
Alexander Ogren · Berthy Feng · Jihoon Ahn · Katherine Bouman · Chiara Daraio

[ Exhibit Hall I ]

Abstract
Wave propagation on the surface of a material contains information about physical properties beneath its surface. We propose a method for inferring the thickness and stiffness of a structure from just a video of waves on its surface. Our method works by extracting a dispersion relation from the video and then solving a physics-based optimization problem to find the best-fitting thickness and stiffness parameters. We validate our method on both simulated and real data, in both cases showing strong agreement with ground-truth measurements. Our technique provides a proof-of-concept for at-home health monitoring of medically-informative tissue properties, and it is further applicable to fields such as human-computer interaction.
Poster
Haiyang Bai · Jiaqi Zhu · Songru Jiang · Wei Huang · Tao Lu · Yuanqi Li · Jie Guo · Runze Fu · Yanwen Guo · Lijun Chen

[ Exhibit Hall I ]

Abstract
We propose a 3D Gaussian splatting-based framework for outdoor relighting that leverages intrinsic image decomposition to precisely integrate sunlight, sky radiance, and indirect lighting from unconstrained photo collections. Unlike prior methods that compress the per-image global illumination into a single latent vector, our approach simultaneously enables diverse shading manipulation and the generation of dynamic shadow effects. This is achieved through three key innovations: (1) a residual-based sun visibility extraction method to accurately separate direct sunlight effects, (2) a region-based supervision framework with a structural consistency loss for physically interpretable and coherent illumination decomposition, and (3) a ray-tracing-based technique for realistic shadow simulation. Extensive experiments demonstrate that our framework synthesizes novel views with competitive fidelity against state-of-the-art relighting solutions and produces more natural and multifaceted illumination and shadow effects.
Poster
Kailong Zhang · Youwei Lyu · Heng Guo · Si Li · Zhanyu Ma · Boxin Shi

[ Exhibit Hall I ]

Abstract
Polarization images facilitate image enhancement and 3D reconstruction tasks, but the limited accessibility of polarization cameras hinders their broader application. This gap drives the need for synthesizing photorealistic polarization images. The existing polarization simulator Mitsuba relies on a parametric polarization image formation model and requires extensive 3D assets covering shape and PBR materials, preventing it from generating large-scale photorealistic images. To address this problem, we propose PolarAnything, capable of synthesizing polarization images from a single RGB input with both photorealism and physical accuracy, eliminating the dependency on 3D asset collections. Drawing inspiration from the zero-shot performance of pretrained diffusion models, we introduce a diffusion-based generative framework with an effective representation strategy that preserves the fidelity of polarization properties. Extensive experiments show that our model not only generates high-quality polarization images but also effectively supports downstream tasks such as shape from polarization.
Poster
Jingjing Wang · Qirui Hu · Chong Bao · Yuke Zhu · Hujun Bao · Zhaopeng Cui · Guofeng Zhang

[ Exhibit Hall I ]

Abstract
We propose an outdoor scene dataset and a series of benchmarks based on it. Inverse rendering in urban scenes is pivotal for applications like autonomous driving and digital twins, yet it faces significant challenges due to complex illumination conditions, including multi-illumination and indirect light and shadow effects. However, the effects of these challenges on intrinsic decomposition and 3D reconstruction remain unexplored due to the lack of appropriate datasets. In this paper, we present LightCity, a novel high-quality synthetic urban dataset featuring diverse illumination conditions with realistic indirect light and shadow effects. LightCity encompasses over 300 sky maps with highly controllable illumination, over 50K images at varying scales spanning both street-level and aerial perspectives, and rich properties such as depth, normals, material components, and light and indirect light. Besides, we leverage LightCity to benchmark three fundamental tasks in urban environments and conduct a comprehensive analysis of these benchmarks, laying a robust foundation for advancing related research.
Poster
Xiaobiao Du · Yida Wang · Haiyang Sun · Zhuojie Wu · Hongwei Sheng · Shuyun Wang · Jiaying Ying · Ming Lu · Tianqing Zhu · Kun Zhan · Xin Yu

[ Exhibit Hall I ]

Abstract
3D cars are commonly used in self-driving systems, virtual/augmented reality, and games. However, existing 3D car datasets are either synthetic or low-quality, limiting their applications in practical scenarios and leaving a significant gap toward high-quality real-world 3D car datasets. In this paper, we propose the first large-scale 3D real car dataset, termed 3DRealCar, offering three distinctive features. (1) \textbf{High-Volume}: 2,500 cars are meticulously scanned by smartphones, obtaining car images and point clouds with real-world dimensions; (2) \textbf{High-Quality}: Each car is captured in an average of 200 dense, high-resolution 360-degree RGB-D views, enabling high-fidelity 3D reconstruction; (3) \textbf{High-Diversity}: The dataset contains various cars from over 100 brands, collected under three distinct lighting conditions, including reflective, standard, and dark. Additionally, we offer detailed car parsing maps for each instance to promote research in car parsing tasks. Moreover, we remove background point clouds and standardize the car orientation to a unified axis, so that reconstruction focuses only on the cars and rendering is controllable without background. We benchmark 3D reconstruction results with state-of-the-art methods across each lighting condition in 3DRealCar. Extensive experiments demonstrate that the standard lighting condition part of 3DRealCar can be used to produce a large number of high-quality 3D cars, improving various …
Poster
YU WEI · Jiahui Zhang · Xiaoqin Zhang · Ling Shao · Shijian Lu

[ Exhibit Hall I ]

Abstract
COLMAP-free 3D Gaussian Splatting (3D-GS) has recently attracted increasing attention due to its remarkable performance in reconstructing high-quality 3D scenes from unposed images or videos. However, it often struggles to handle scenes with complex camera trajectories featuring drastic rotations and translations across adjacent camera views, leading to degraded camera pose estimation and local minima in the joint optimization of camera poses and 3D-GS. We propose PCR-GS, an innovative COLMAP-free 3DGS technique that achieves superior 3D scene modeling and camera pose estimation via camera pose co-regularization. PCR-GS achieves regularization from two perspectives. The first is feature reprojection regularization, which extracts view-robust DINO features from adjacent camera views and aligns their semantic information for camera pose regularization. The second is wavelet-based frequency regularization, which exploits discrepancies in high-frequency details to further optimize the rotation matrix in camera poses. Extensive experiments over multiple real-world scenes show that the proposed PCR-GS achieves superior pose-free 3D-GS scene modeling under dramatic changes of camera trajectories.
Poster
Han-Hung Lee · Qinghong Han · Angel Chang

[ Exhibit Hall I ]

Abstract
In this paper, we explore the task of generating expansive outdoor scenes, ranging from city skyscrapers to medieval castles and houses. Unlike indoor scene generation, which has been a primary focus of prior work, outdoor scene generation presents unique challenges, including the wide variation in scene heights and the need for an efficient approach capable of rapidly producing large landscapes. To address this, we introduce an efficient representation that encodes scene chunks as homogeneous vector sets, offering better compression than spatially structured latents used in prior methods. Furthermore, we train an outpainting model under four conditional patterns to generate scene chunks in a zig-zag manner, enabling more coherent generation compared to prior work that relies on inpainting methods. This provides richer context and speeds up generation by eliminating extra diffusion steps. Finally, to facilitate this task, we curate NuiScene43, a small but high-quality set of scenes and preprocess them for joint training. Interestingly, when trained on scenes of varying styles, our model can blend vastly different scenes, such as rural houses and city skyscrapers, within the same scene.
Poster
Pei An · Jiaqi Yang · Muyao Peng · You Yang · Qiong Liu · Xiaolin Wu · Liangliang Nan

[ Exhibit Hall I ]

Abstract
Image-to-point-cloud (I2P) registration is a fundamental problem in computer vision, focusing on establishing 2D-3D correspondences between an image and a point cloud. The differential perspective-n-point (PnP) has been widely used to supervise I2P registration networks by enforcing the projective constraints on 2D-3D correspondences. However, differential PnP is highly sensitive to noise and outliers in the predicted correspondences. This issue hinders the effectiveness of correspondence learning. Inspired by the robustness of blind PnP against noise and outliers in correspondences, we propose an approximated blind PnP based correspondence learning approach. To mitigate the high computational cost of blind PnP, we simplify blind PnP to the more amenable task of minimizing the Chamfer distance between learned 2D and 3D keypoints, called MinCD-PnP. To effectively solve MinCD-PnP, we design a lightweight multi-task learning module, named MinCD-Net, which can be easily integrated into existing I2P registration architectures. Extensive experiments on 7-Scenes, RGBD-V2, ScanNet, and self-collected datasets demonstrate that MinCD-Net outperforms state-of-the-art methods and achieves a higher inlier ratio (IR) and registration recall (RR) in both cross-scene and cross-dataset settings. Source code will be released soon.
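The core supervision can be pictured as a symmetric 2D Chamfer distance between projected 3D keypoints and 2D keypoints. The PyTorch sketch below shows only that term, with a plain pinhole projection and random tensors standing in for keypoints predicted by an I2P network; the multi-task design of MinCD-Net is not reproduced.

```python
import torch

def chamfer_2d(a, b):
    """Symmetric Chamfer distance between 2D point sets a: (N, 2) and b: (M, 2)."""
    d = torch.cdist(a, b)                       # (N, M) pairwise distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

def project(pts_3d, K):
    """Pinhole projection of camera-frame points (N, 3) with intrinsics K (3, 3)."""
    uv = (K @ pts_3d.T).T
    return uv[:, :2] / uv[:, 2:3].clamp(min=1e-6)

# Stand-ins for the learned keypoints (hypothetical shapes and intrinsics).
K = torch.tensor([[500., 0., 320.], [0., 500., 240.], [0., 0., 1.]])
kpts_3d = (torch.rand(64, 3) + torch.tensor([0., 0., 2.])).requires_grad_()
kpts_2d = torch.rand(48, 2) * torch.tensor([640., 480.])

loss = chamfer_2d(project(kpts_3d, K), kpts_2d)
loss.backward()                                 # gradients flow back to the keypoints
print(float(loss), kpts_3d.grad.shape)
```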
Poster
Shadi Hamdan · Chonghao Sima · Zetong Yang · Hongyang Li · Fatma Guney

[ Exhibit Hall I ]

Abstract
How can we benefit from large models without sacrificing inference speed, a common dilemma in self-driving systems? A prevalent solution is a dual-system architecture, employing a small model for rapid, reactive decisions and a larger model for slower but more informative analyses. Existing dual-system designs often implement parallel architectures where inference is either directly conducted using the large model at each current frame or retrieved from previously stored inference results. However, these works still struggle to enable large models for a timely response to every online frame. Our key insight is to shift intensive computations of the current frame to previous time steps and perform a batch inference of multiple time steps to make large models respond promptly to each time step. To achieve the shifting, we introduce Efficiency through Thinking Ahead (ETA), an asynchronous system designed to: (1) propagate informative features from the past to the current frame using future predictions from the large model, (2) extract current frame features using a small model for real-time responsiveness, and (3) integrate these dual features via an action mask mechanism that emphasizes action-critical image regions. Evaluated on the Bench2Drive CARLA Leaderboard-v2 benchmark, ETA advances state-of-the-art performance by 8\% with a driving …
Poster
Zikun Xu · Shaobing Xu

[ Exhibit Hall I ]

Abstract
LiDAR-based 3D occupancy prediction algorithms have evolved rapidly with the advent of large-scale datasets. However, the full potential of the existing diverse datasets remains underutilized, as they are typically employed in isolation. Models trained on a single dataset often suffer considerable performance degradation when deployed to real-world scenarios or datasets involving disparate LiDARs. To address this limitation, we introduce \emph{MergeOcc}, a generalized pipeline designed to handle different LiDARs by leveraging multiple datasets concurrently. The gaps among LiDAR datasets primarily manifest in geometric disparities and semantic inconsistencies, which correspond to the fundamental components of datasets: data and labels. In response, MergeOcc incorporates a novel model architecture that features geometric realignment and semantic label mapping to facilitate multi-dataset training (MDT). The effectiveness of MergeOcc is validated through extensive experiments on two prominent datasets for autonomous vehicles: OpenOccupancy-nuScenes and SemanticKITTI. The results demonstrate its enhanced robustness and performance improvements across both types of LiDARs, outperforming several SOTA methods. Additionally, despite using an identical model architecture and hyper-parameter set, MergeOcc can significantly surpass the baselines thanks to its ability to learn from diverse datasets. To the best of our knowledge, this work presents the first cross-dataset 3D occupancy prediction pipeline that effectively bridges the domain …
Poster
Chengyu Zheng · Honghua Chen · Jin Huang · Mingqiang Wei

[ Exhibit Hall I ]

Abstract
Recent research leveraging large-scale pretrained diffusion models has demonstrated the potential of using diffusion features to establish semantic correspondences in images. Inspired by advancements in diffusion-based techniques, we propose a novel zero-shot method for refining point cloud registration algorithms. Our approach leverages correspondences derived from depth images to enhance point feature representations, eliminating the need for a dedicated training dataset. Specifically, we first project the point cloud into depth maps from multiple perspectives and extract implicit knowledge from a pretrained diffusion network as depth diffusion features. These features are then integrated with geometric features obtained from existing methods to establish more accurate correspondences between point clouds. By leveraging these refined correspondences, our approach achieves significantly improved registration accuracy. Extensive experiments demonstrate that our method not only enhances the performance of existing point cloud registration techniques but also exhibits robust generalization capabilities across diverse datasets.
Poster
Chang Qiu · Feipeng Da · Zilei Zhang

[ Exhibit Hall I ]

Abstract
The pretrain-finetune paradigm of pre-training a model on large amounts of image and text data and then fine-tuning it for a specific task has led to significant progress in many 2D image and natural language processing tasks. Similarly, pre-training on point cloud data can also enhance a model's performance and generalization ability. Therefore, in this paper, we propose a diffusion-model-based pre-training framework called PreDifPoint. It accomplishes the pre-training of the model's backbone network through a diffusion process of gradual denoising. We aggregate the latent features extracted from the backbone network, input them as conditions into the subsequent diffusion model, and guide the point-to-point mapping between noisy point clouds at neighboring time steps, so as to generate high-quality point clouds and, at the same time, better perform various downstream point cloud tasks. We also introduce a bi-directional covariate attention (DXCA-Attention) mechanism for capturing complex feature interactions, fusing local and global features, and improving the detail recovery of point clouds. In addition, we propose a density-adaptive sampling strategy, which helps the model dynamically adjust the sampling strategy between different time steps, and guide the model to …
Poster
Ziyang Leng · Jiawei Yang · Wenlong Yi · Bolei Zhou

[ Exhibit Hall I ]

Abstract
3D occupancy becomes a promising perception representation for autonomous driving to model the surrounding environment at a fine-grained scale. However, it remains challenging to efficiently aggregate 3D occupancy over time across multiple input frames due to the high processing cost and the uncertainty and dynamics of voxels. To address this issue, we propose ST-Occ, a scene-level occupancy representation learning framework that effectively learns the spatiotemporal feature with temporal consistency. ST-Occ consists of two core designs: a spatiotemporal memory that captures comprehensive historical information and stores it efficiently through a scene-level representation, and a memory attention that conditions the current occupancy representation on the spatiotemporal memory with a model of uncertainty and dynamic awareness. Our method significantly enhances the spatiotemporal representation learned for 3D occupancy prediction tasks by exploiting the temporal dependency between multi-frame inputs. Experiments show that our approach outperforms the state-of-the-art methods by a margin of 3 mIoU and reduces the temporal inconsistency by 29%. The code and model will be made publicly available.
Poster
Feng Qiao · Zhexiao Xiong · Eric Xing · Nathan Jacobs

[ Exhibit Hall I ]

Abstract
Stereo images are fundamental to numerous applications, including extended reality (XR) devices, autonomous driving, and robotics. Unfortunately, acquiring high-quality stereo images remains challenging due to the precise calibration requirements of dual-camera setups and the complexity of obtaining accurate, dense disparity maps. Existing stereo image generation methods typically focus on either visual quality for viewing or geometric accuracy for matching, but not both. We introduce GenStereo, a diffusion-based approach, to bridge this gap. The method includes two primary innovations: (1) conditioning the diffusion process on a disparity-aware coordinate embedding and a warped input image, allowing for more precise stereo alignment than previous methods, and (2) an adaptive fusion mechanism that intelligently combines the diffusion-generated image with a warped image, improving both realism and disparity consistency. Through extensive training on 11 diverse stereo datasets, GenStereo demonstrates strong generalization ability. GenStereo achieves state-of-the-art performance in both stereo image generation and unsupervised stereo matching tasks. Our framework eliminates the need for complex hardware setups while enabling high-quality stereo image generation, making it valuable for both real-world applications and unsupervised learning scenarios. The code will be made publicly available upon acceptance.
Poster
Sanghyun Son · Matheus Gadelha · Yang Zhou · Matthew Fisher · Zexiang Xu · Yi-Ling Qiao · Ming Lin · Yi Zhou

[ Exhibit Hall I ]

Abstract
Recent probabilistic methods for 3D triangular meshes have shown promise in capturing diverse shapes by managing mesh connectivity in a differentiable manner. However, these methods are often limited by high computational costs that scale disproportionately with the level of detail, restricting their applicability for complex shapes requiring high face density. In this work, we introduce a novel differentiable mesh processing method that addresses these computational challenges in both 2D and 3D. Our method reduces time complexity from $O(N)$ to $O(\log N)$ and requires significantly less memory than previous approaches, enabling us to handle far more intricate structures. Building on this innovation, we present a reconstruction algorithm capable of generating complex 2D and 3D shapes from point clouds or multi-view images. We demonstrate its efficacy on various objects exhibiting diverse topologies and geometric details.
Poster
Chang Liu · mingxuzhu mingxuzhu · Zheyuan Zhang · Linna Song · xiao zhao · Luo Qingliang · Qi Wang · Chufan Guo · Kuifeng Su

[ Exhibit Hall I ]

Abstract
End-to-end autonomous driving technology has recently become a focal point of research and application in autonomous driving. State-of-the-art (SOTA) methods are often trained and evaluated on the nuScenes dataset. However, the nuScenes dataset, introduced in 2019 for 3D perception tasks, faces several limitations (such as insufficient scale, simple scenes, and homogeneous driving behaviors) that restrict the upper-bound development of end-to-end autonomous driving algorithms. In light of these issues, we propose a novel, large-scale real-world dataset specifically designed for end-to-end autonomous driving tasks, named TAD-E2E, which is 25x larger than nuScenes, has 1.7x its scene complexity, and features a highly diverse range of driving behaviors. We replicated SOTA methods on the TAD-E2E dataset and observed that these methods no longer performed well, as expected. Additionally, in response to the challenging scenarios presented in the TAD-E2E dataset, we devised a multimodal sparse end-to-end method that significantly outperforms SOTA methods. Ablation studies demonstrate the effectiveness of our method, and we analyze the contributions of each module. The dataset and code will be made open source upon acceptance of the paper.
Poster
Juelin Zhu · Shuaibang Peng · Long Wang · Hanlin Tan · Yu Liu · Maojun Zhang · Shen Yan

[ Exhibit Hall I ]

Abstract
We propose a novel method for aerial visual localization over low Level-of-Detail (LoD) city models. The previous wireframe-alignment-based method LoD-Loc [97] has shown promising localization results leveraging LoD models. However, LoD-Loc mainly relies on high-LoD (LoD3 or LoD2) city models, while the majority of available models, as well as those that many countries plan to construct nationwide, are low-LoD (LoD1). Consequently, enabling localization on low-LoD city models could unlock drones' potential for global urban localization. To address this issue, we introduce LoD-Loc v2, which employs a coarse-to-fine strategy using explicit silhouette alignment to achieve accurate localization over low-LoD city models in the air. Specifically, given a query image, LoD-Loc v2 first applies a building segmentation network to extract building silhouettes. Then, in the coarse pose selection stage, we construct a pose cost volume by uniformly sampling pose hypotheses around a prior pose to represent the pose probability distribution. Each cost in the volume measures the degree of alignment between the projected and predicted silhouettes. We select the pose with the maximum value as the coarse pose. In the fine pose estimation stage, a particle filtering method incorporating a multi-beam tracking approach is used to efficiently explore the hypothesis space and obtain the final pose estimate. To further …
Poster
WEI-JER Chang · Masayoshi Tomizuka · Wei Zhan · Manmohan Chandraker · Francesco Pittaluga

[ Exhibit Hall I ]

Abstract
Evaluating autonomous vehicles with controllability enables scalable testing in counterfactual or structured settings, enhancing both efficiency and safety. We introduce LangTraj, a language-conditioned scene-diffusion model that simulates the joint behavior of all agents in traffic scenarios. By conditioning on natural language inputs, LangTraj provides flexible and intuitive control over interactive behaviors, generating nuanced and realistic scenarios. Unlike prior approaches that depend on domain-specific guidance functions, LangTraj incorporates language conditioning during training, facilitating more intuitive traffic simulation control. We propose a novel closed-loop training strategy for diffusion models, explicitly tailored to enhance stability and realism during closed-loop simulation. To support language-conditioned simulation, we develop Inter-Drive, a large-scale dataset with diverse and interactive labels for training language-conditioned diffusion models. Our dataset is built upon a scalable pipeline for annotating agent-agent interactions and single-agent behaviors, ensuring rich and varied supervision. Validated on the Waymo Motion Dataset, LangTraj demonstrates strong performance in realism, language controllability, and language-conditioned safety-critical simulation, establishing a new paradigm for flexible and scalable autonomous vehicle testing.
Poster
Yongjin Lee · Hyeon-Mun Jeong · Yurim Jeon · Sanghyun Kim

[ Exhibit Hall I ]

Abstract
Multi-modal sensor fusion in Bird’s Eye View (BEV) representation has become the leading approach for 3D object detection. However, existing methods often rely on depth estimators or transformer encoders to transform image features into BEV space, which reduces robustness or introduces significant computational overhead. Moreover, the insufficient geometric guidance in view transformation results in ray-directional misalignments, limiting the effectiveness of BEV representations. To address these challenges, we propose Efficient View Transformation (EVT), a novel 3D object detection framework that constructs a well-structured BEV representation, improving both accuracy and efficiency. Our approach focuses on two key aspects. First, Adaptive Sampling and Adaptive Projection (ASAP), which utilizes LiDAR guidance to generate 3D sampling points and adaptive kernels, enables more effective transformation of image features into BEV space and a refined BEV representation. Second, an improved query-based detection framework, incorporating group-wise mixed query selection and geometry-aware cross-attention, effectively captures both the common properties and the geometric structure of objects in the transformer decoder. On the nuScenes test set, EVT achieves state-of-the-art performance of 75.3\% NDS with real-time inference speed.
Poster
Shiyong Liu · Xiao Tang · Zhihao Li · Yingfan He · Chongjie Ye · Jianzhuang Liu · Binxiao Huang · Shunbo Zhou · Xiaofei Wu

[ Exhibit Hall I ]

Abstract
In large-scale scene reconstruction using 3D Gaussian splatting, it is common to partition the scene into multiple smaller regions and reconstruct them individually. However, existing division methods are occlusion-agnostic, meaning that each region may contain areas with severe occlusions. As a result, the cameras within those regions are less correlated, leading to a low average contribution to the overall reconstruction. In this paper, we propose an occlusion-aware scene division strategy that clusters training cameras based on their positions and co-visibilities to acquire multiple regions. Cameras in such regions exhibit stronger correlations and a higher average contribution, facilitating high-quality scene reconstruction. We further propose a region-based rendering technique to accelerate large scene rendering, which culls Gaussians invisible to the region where the viewpoint is located. Such a technique significantly speeds up the rendering without compromising quality. Extensive experiments on multiple large scenes show that our method achieves superior reconstruction results with faster rendering speeds compared to existing state-of-the-art approaches.
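A minimal sketch of the occlusion-aware grouping idea, assuming co-visibility is available as shared SfM track IDs per camera and that scikit-learn is installed: cameras are clustered with a precomputed affinity that mixes co-visibility with spatial proximity. The IoU-style co-visibility measure, the mixing weight, and the use of spectral clustering are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def covisibility_affinity(cam_points, positions, sigma=5.0, alpha=0.5):
    """Blend co-visibility (shared 3D tracks) with spatial proximity.
    cam_points: list of sets of track ids per camera; positions: (N, 3) centers."""
    n = len(cam_points)
    covis = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            inter = len(cam_points[i] & cam_points[j])
            union = len(cam_points[i] | cam_points[j]) or 1
            covis[i, j] = covis[j, i] = inter / union           # IoU of observed tracks
    d = np.linalg.norm(positions[:, None] - positions[None], axis=-1)
    prox = np.exp(-(d / sigma) ** 2)
    return alpha * covis + (1 - alpha) * prox

# Toy example: 40 cameras observing random subsets of 500 tracks.
rng = np.random.default_rng(0)
positions = rng.random((40, 3)) * 20
cam_points = [set(rng.choice(500, size=60, replace=False)) for _ in range(40)]
A = covisibility_affinity(cam_points, positions)
labels = SpectralClustering(n_clusters=4, affinity="precomputed",
                            random_state=0).fit_predict(A)
print(np.bincount(labels))    # number of cameras assigned to each region
```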
Poster
Ziliang Miao · Runjian Chen · Yixi Cai · Buwei He · Wenquan Zhao · Wenqi Shao · Bo Zhang · Fu Zhang

[ Exhibit Hall I ]

Abstract
Moving object segmentation (MOS) on LiDAR point clouds is crucial for autonomous systems like self-driving vehicles. Previous supervised approaches rely heavily on costly manual annotations, while LiDAR sequences naturally capture temporal motion cues that can be leveraged for self-supervised learning. In this paper, we propose Temporal Overlapping Prediction (TOP), a self-supervised pre-training method that alleviates the labeling burden for MOS. TOP explores the temporal overlapping points that are commonly observed by current and adjacent scans, and learns spatiotemporal representations by predicting the occupancy states of temporal overlapping points. Moreover, we utilize current occupancy reconstruction as an auxiliary pre-training objective, which enhances the current structural awareness of the model. We conduct extensive experiments and observe that the conventional metric Intersection-over-Union (IoU) shows a strong bias toward objects with more scanned points, which might neglect small or distant objects. To compensate for this bias, we introduce an additional metric called mIoU_obj to evaluate object-level performance. Experiments on nuScenes and SemanticKITTI show that TOP outperforms both the supervised training-from-scratch baseline and other self-supervised pre-training baselines by up to 28.77% relative improvement, demonstrating strong transferability across LiDAR setups and generalization to other tasks. Code and pre-trained models will be publicly available upon publication.
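As a rough illustration of what "temporal overlapping points" means, the sketch below marks points of the current scan whose voxel is also occupied in an adjacent scan, using a simple voxel hash. The voxel size and toy data are arbitrary assumptions; the occupancy-state prediction objective and the mIoU_obj metric are not reproduced here.

```python
import numpy as np

def temporal_overlap_mask(scan_t, scan_adj, voxel=0.2):
    """Mark points of the current scan whose voxel is also occupied in an
    adjacent scan. scan_t: (N, 3), scan_adj: (M, 3) LiDAR points."""
    to_keys = lambda pts: [tuple(v) for v in np.floor(pts / voxel).astype(int)]
    shared = set(to_keys(scan_adj))
    return np.array([k in shared for k in to_keys(scan_t)])

# Toy example: a static structure seen in both scans plus a blob seen only once.
rng = np.random.default_rng(0)
static = rng.random((2000, 3)) * 20
scan_t = np.vstack([static + rng.normal(0, 0.02, static.shape),
                    rng.random((200, 3)) * 2 + 30])     # extra points only in scan_t
scan_adj = static + rng.normal(0, 0.02, static.shape)
mask = temporal_overlap_mask(scan_t, scan_adj)
print(int(mask.sum()), "of", len(mask), "points overlap with the adjacent scan")
```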
Poster
Hanshi Wang · Jin Gao · Weiming Hu · Zhipeng Zhang

[ Exhibit Hall I ]

Abstract
We present the first work demonstrating that a pure Mamba block can achieve efficient Dense Global Fusion, while guaranteeing top performance for camera-LiDAR multi-modal 3D object detection. Our motivation stems from the observation that existing fusion strategies are constrained by their inability to simultaneously achieve efficiency, long-range modeling, and retention of complete scene information. Inspired by recent advances in state-space models (SSMs) and linear attention, we leverage their linear complexity and long-range modeling capabilities to address these challenges. However, this is non-trivial since our experiments reveal that simply adopting efficient linear-complexity methods does not necessarily yield improvements and may even degrade performance. We attribute this degradation to the loss of height information during multi-modal alignment, leading to deviations in sequence order. To resolve this, we propose height-fidelity LiDAR encoding that preserves precise height information through voxel compression in continuous space, thereby enhancing camera-LiDAR alignment. Subsequently, we introduce the Hybrid Mamba Block, which leverages the enriched height-informed features to conduct local and global contextual learning. By integrating these components, our method achieves state-of-the-art performance with a top-tier NDS score of 75.0 on the nuScenes validation benchmark, even surpassing methods that utilize high-resolution inputs. Meanwhile, our method maintains efficiency, achieving faster inference speed …
Poster
Weihong Pan · Xiaoyu Zhang · Hongjia Zhai · Xiaojun Xiang · Hanqing Jiang · Guofeng Zhang

[ Exhibit Hall I ]

Abstract
3D Gaussian Splatting (3DGS) has demonstrated impressive performance in novel view synthesis and real-time rendering. However, it heavily relies on high-quality initial sparse points from Structure-from-Motion (SfM), which often struggles in textureless regions, degrading the geometry and visual quality of 3DGS. To address this limitation, we propose a novel initialization pipeline, achieving high-fidelity reconstruction from dense image sequences without relying on SfM-derived point clouds. Specifically, we first propose an effective depth alignment method to align the estimated monocular depth with depth rendered from an under-optimized coarse Gaussian model using an unbiased depth rasterization approach, and ensemble them afterward. After that, to efficiently process dense image sequences, we incorporate a progressive segmented initialization process to generate the initial points. Extensive experiments demonstrate the superiority of our method over previous approaches. Notably, our method outperforms the SfM-based method by a 14.4% improvement in LPIPS on the Mip-NeRF360 dataset and a 30.7% improvement on the Tanks and Temples dataset.
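The depth-alignment step can be pictured as fitting a per-image scale and shift so that the estimated monocular depth agrees with the depth rendered from the coarse Gaussian model. The sketch below shows that closed-form least-squares fit under the assumption of a purely affine discrepancy; the unbiased depth rasterization and the subsequent ensembling are omitted.

```python
import numpy as np

def align_depth(mono_depth, rendered_depth, valid):
    """Closed-form scale s and shift b so that s * mono_depth + b best matches
    the depth rendered from the coarse Gaussian model on valid pixels."""
    x, y = mono_depth[valid].ravel(), rendered_depth[valid].ravel()
    A = np.stack([x, np.ones_like(x)], axis=1)
    (s, b), *_ = np.linalg.lstsq(A, y, rcond=None)
    return s * mono_depth + b, (s, b)

# Toy example: monocular depth is correct only up to an unknown affine transform.
rng = np.random.default_rng(0)
true_depth = rng.random((64, 64)) * 5 + 1
mono = (true_depth - 0.7) / 2.5 + rng.normal(0, 0.01, true_depth.shape)
valid = np.ones_like(mono, dtype=bool)
aligned, (s, b) = align_depth(mono, true_depth, valid)
print(round(s, 2), round(b, 2))   # ~2.5 and ~0.7, recovering the affine mapping
```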
Poster
Shengpeng Wang · Yulong Xie · Qing Liao · Wei Wang

[ Exhibit Hall I ]

Abstract
Millimeter-wave radar for state estimation is gaining significant attention for its affordability and reliability in harsh conditions. Existing localization solutions typically rely on post-processed radar point clouds as landmark points. Nonetheless, the inherent sparsity of radar point clouds, ghost points from multi-path effects, and limited angle resolution in single-chirp radar severely degrade state estimation performance. To address these issues, we propose S$^3$E, a \textbf{S}elf-\textbf{S}upervised \textbf{S}tate \textbf{E}stimator that employs more richly informative radar signal spectra to bypass sparse points and fuses complementary inertial information to achieve accurate localization. S$^3$E fully explores the association between \textit{exteroceptive} radar and \textit{proprioceptive} inertial sensor to achieve complementary benefits. To deal with limited angle resolution, we introduce a novel cross-fusion technique that enhances spatial structure information by exploiting subtle rotational shift correlations across heterogeneous data. The experimental results demonstrate our method achieves robust and accurate performance without relying on localization ground truth supervision. To the best of our knowledge, this is the first attempt to achieve state estimation by fusing radar spectra and inertial data in a complementary self-supervised manner. Codes will be released on GitHub.
Poster
Yaoye Zhu · Zhe Wang · Yan Wang

[ Exhibit Hall I ]

Abstract
As cooperative systems that leverage roadside cameras to assist autonomous vehicle perception become increasingly widespread, large-scale precise calibration of infrastructure cameras has become a critical issue. Traditional manual calibration methods are often time-consuming, labor-intensive, and may require road closures. This paper proposes MamV2XCalib, the first V2X-based infrastructure camera calibration method with the assistance of vehicle-side LiDAR. MamV2XCalib only requires autonomous vehicles equipped with LiDAR to drive near the cameras to be calibrated in the infrastructure, without the need for specific reference objects or manual intervention. We also introduce a new targetless LiDAR-camera calibration method, which combines multi-scale features and a 4D correlation volume to estimate the correlation between vehicle-side point clouds and roadside images. We model the temporal information and estimate the rotation angles with Mamba, effectively addressing calibration failures in V2X scenarios caused by defects in the vehicle-side data (such as occlusions) and large differences in viewpoint. We evaluate MamV2XCalib on the V2X-Seq and TUMTraf-V2X real-world datasets, demonstrating the effectiveness and robustness of our V2X-based automatic calibration approach. Compared to previous LiDAR-camera methods designed for calibration on one car, our approach achieves better and more stable calibration performance in V2X scenarios with fewer parameters. Code will be made publicly …
Poster
Hyojun Go · Byeongjun Park · Hyelin Nam · Byung-Hoon Kim · Hyungjin Chung · Changick Kim

[ Exhibit Hall I ]

Abstract
We propose VideoRFSplat, a direct text-to-3D model leveraging a video generation model to generate realistic 3D Gaussian Splatting (3DGS) for unbounded real-world scenes. To generate diverse camera poses and unbounded spatial extent of real-world scenes, while ensuring generalization to arbitrary text prompts, previous methods fine-tune 2D generative models to jointly model camera poses and multi-view images. However, these methods suffer from instability when extending 2D generative models to joint modeling due to the modality gap, which necessitates additional models to stabilize training and inference. In this work, we propose an architecture and a sampling strategy to jointly model multi-view images and camera poses when fine-tuning a video generation model. Our core idea is a dual-stream architecture that attaches a dedicated pose generation model alongside a pre-trained video generation model via communication blocks, generating multi-view images and camera poses through separate streams. This design reduces interference between the pose and image modalities. Additionally, we propose an asynchronous sampling strategy that denoises camera poses faster than multi-view images, allowing rapidly denoised poses to condition multi-view generation, reducing mutual ambiguity and enhancing cross-modal consistency. Trained on multiple large-scale real-world datasets (RealEstate10K, MVImgNet, DL3DV-10K, ACID), VideoRFSplat outperforms existing text-to-3D direct generation methods that heavily …
Poster
Guosheng Zhao · Xiaofeng Wang · Chaojun Ni · Zheng Zhu · Wenkang Qin · Guan Huang · Xingang Wang

[ Exhibit Hall I ]

Abstract
Combining reconstruction models with generative models has emerged as a promising paradigm for closed-loop simulation in autonomous driving. For example, ReconDreamer has demonstrated remarkable success in rendering large-scale maneuvers. However, a significant gap remains between the generated data and real-world sensor observations, particularly in terms of fidelity for structured elements, such as the ground surface. To address these challenges, we propose ReconDreamer++, an enhanced framework that significantly improves the overall rendering quality by mitigating the domain gap and refining the representation of the ground surface. Specifically, ReconDreamer++ introduces the Novel Trajectory Deformable Network (NTDNet), which leverages learnable spatial deformation mechanisms to bridge the domain gap between synthesized novel views and original sensor observations. Moreover, for structured elements such as the ground surface, we preserve geometric prior knowledge in 3D Gaussians, and the optimization process focuses on refining appearance attributes while preserving the underlying geometric structure. Experimental evaluations conducted on multiple datasets (Waymo, nuScenes, PandaSet, and EUVS) confirm the superior performance of ReconDreamer++. Specifically, on Waymo, ReconDreamer++ achieves performance comparable to Street Gaussians for the original trajectory while significantly outperforming ReconDreamer on novel trajectories. In particular, it achieves substantial improvements, including a 6.1\% increase in NTA-IoU, a 23.0\% improvement in FID, and …
Poster
JUNHONG MIN · YOUNGPIL JEON · Jimin Kim · Minyong Choi

[ Exhibit Hall I ]

Abstract
Accurate and scalable stereo matching remains a critical challenge, particularly for high-resolution images requiring both fine-grained disparity estimation and computational efficiency. While recent methods have made progress, achieving global and local consistency alongside computational efficiency remains difficult. Transformer-based models effectively capture long-range dependencies but suffer from high computational overhead, while cost volume-based iterative methods rely on local correlations, limiting global consistency and scalability to high resolutions and large disparities. To address these issues, we introduce S$^2$M$^2$, a Scalable Stereo Matching Model that achieves high accuracy, efficiency, and generalization without compromise. Our approach integrates a multi-resolution transformer framework, enabling effective information aggregation across different scales. Additionally, we propose a new loss function that enhances disparity estimation by concentrating probability on feasible matches. Beyond disparity prediction, S$^2$M$^2$ jointly estimates occlusion and confidence maps, leading to more robust and interpretable depth estimation. Unlike prior methods that rely on dataset-specific tuning, S$^2$M$^2$ is trained from scratch without dataset-specific adjustments, demonstrating strong generalization across diverse benchmarks. Extensive evaluations on Middlebury v3, ETH3D, and our high-fidelity synthetic dataset establish new state-of-the-art results.
Poster
Lukas Höllein · Aljaz Bozic · Michael Zollhöfer · Matthias Nießner

[ Exhibit Hall I ]

Abstract
We present 3DGS-LM, a new method that accelerates the reconstruction of 3D Gaussian Splatting (3DGS) by replacing its ADAM optimizer with a tailored Levenberg-Marquardt (LM). Existing methods reduce the optimization time by decreasing the number of Gaussians or by improving the implementation of the differentiable rasterizer. However, they still rely on the ADAM optimizer to fit Gaussian parameters of a scene in thousands of iterations, which can take up to an hour. To this end, we change the optimizer to LM that runs in conjunction with the 3DGS differentiable rasterizer. For efficient GPU parallelization, we propose a caching data structure for intermediate gradients that allows us to efficiently calculate Jacobian-vector products in custom CUDA kernels. In every LM iteration, we calculate update directions from multiple image subsets using these kernels and combine them in a weighted mean. Overall, our method is 20% faster than the original 3DGS while obtaining the same reconstruction quality. Our optimization is also agnostic to other methods that accelerate 3DGS, thus enabling even faster speedups compared to vanilla 3DGS.
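A tiny NumPy caricature of the optimizer change described above: each image subset contributes a damped Gauss-Newton update direction, and the directions are combined in a weighted mean. The real method evaluates Jacobian-vector products with cached intermediate gradients in custom CUDA kernels rather than forming dense normal equations as done here; the damping value, weights, and toy fitting problem are assumptions.

```python
import numpy as np

def lm_step(residual_fn, jacobian_fn, params, subsets, lam=1e-2):
    """One damped (Levenberg-Marquardt-style) update: solve the normal
    equations per subset, then combine the update directions in a weighted
    mean (weights = number of residuals per subset)."""
    deltas, weights = [], []
    for idx in subsets:
        r = residual_fn(params, idx)                 # (m,)
        J = jacobian_fn(params, idx)                 # (m, p)
        H = J.T @ J + lam * np.eye(J.shape[1])       # damped Gauss-Newton Hessian
        deltas.append(np.linalg.solve(H, -J.T @ r))
        weights.append(len(r))
    return params + np.average(deltas, axis=0, weights=np.array(weights, float))

# Toy example: fit y = a*x + b with LM updates over two disjoint "image subsets".
rng = np.random.default_rng(0)
x = rng.random(100)
y = 3.0 * x + 0.5 + rng.normal(0, 0.01, 100)
res = lambda p, idx: p[0] * x[idx] + p[1] - y[idx]
jac = lambda p, idx: np.stack([x[idx], np.ones(len(idx))], axis=1)
params = np.zeros(2)
for _ in range(20):
    params = lm_step(res, jac, params, [np.arange(0, 50), np.arange(50, 100)])
print(params.round(2))   # ~[3.0, 0.5]
```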
Poster
Leonard Bruns · Axel Barroso-Laguna · Tommaso Cavallari · Áron Monszpart · Sowmya Munukutla · Victor Prisacariu · Eric Brachmann

[ Exhibit Hall I ]

Abstract
Scene coordinate regression (SCR) has established itself as a promising learning-based approach to visual relocalization. After mere minutes of scene-specific training, SCR models estimate camera poses of query images with high accuracy. Still, SCR methods fall short of the generalization capabilities of more classical feature-matching approaches. When imaging conditions of query images, such as lighting or viewpoint, are too different from the training views, SCR models fail. Failing to generalize is an inherent limitation of previous SCR frameworks, since their training objective is to encode the training views in the weights of the coordinate regressor itself. The regressor essentially overfits to the training views, by design. We propose to separate the coordinate regressor and the map representation into a generic transformer and a scene-specific map code. This separation allows us to pre-train the transformer on tens of thousands of scenes. More importantly, it allows us to train the transformer to generalize from fixed map codes to unseen query images during pre-training. We demonstrate on multiple challenging relocalization datasets that our method, ACE-G, leads to significantly increased robustness while keeping the computational footprint attractive.
Poster
Xin Wei · Qin Yang · Yijie Fang · Mingrui Zhu · Nannan Wang

[ Exhibit Hall I ]

Abstract
Test-time adaptation (TTA) methods effectively address domain shifts by dynamically adapting pre-trained models to target domain data during online inference. While effective for 2D images, TTA struggles with 3D point clouds due to their irregular and unordered nature. Existing 3D TTA methods often involve complex high-dimensional optimization tasks, such as patch reconstruction or per-point transformation learning in the spatial domain, which require access to additional training data. In contrast, we propose Graph Spectral Domain Test-Time Adaptation (GSDTTA), a novel approach for 3D point cloud classification that shifts adaptation to the graph spectral domain, enabling more efficient adaptation by capturing global structural properties with fewer parameters. Point clouds in the target domain are represented as outlier-aware graphs and transformed into the graph spectral domain by the Graph Fourier Transform (GFT). For efficiency, we only optimize the lowest 10\% of frequency components, which capture the majority of the point cloud’s energy. An inverse GFT (IGFT) is then applied to reconstruct the adapted point cloud with the graph spectral-driven point shift. Additionally, an eigenmap-guided self-training strategy is introduced to iteratively optimize both spectral adjustment and model parameters. Experimental results and ablation studies on benchmark datasets demonstrate the effectiveness of GSDTTA, outperforming existing TTA methods for 3D …
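The graph-spectral machinery can be sketched in a few lines: build a kNN graph over the point cloud, take the Laplacian eigenbasis as the Graph Fourier Transform, adjust only the lowest ~10% of spectral coefficients, and apply the inverse transform. In the sketch below the low-frequency shift is sampled randomly purely for illustration, whereas GSDTTA optimizes it with eigenmap-guided self-training; the outlier-aware graph construction is also omitted.

```python
import numpy as np

def graph_fourier_adapt(points, k=8, keep=0.1, shift_scale=0.01, seed=0):
    """kNN graph -> Laplacian eigenbasis (GFT) -> perturb the lowest `keep`
    fraction of spectral coefficients -> inverse GFT. points: (N, 3)."""
    n = len(points)
    d = np.linalg.norm(points[:, None] - points[None], axis=-1)
    nbr = np.argsort(d, axis=1)[:, 1:k + 1]          # k nearest neighbours (excl. self)
    W = np.zeros((n, n))
    W[np.repeat(np.arange(n), k), nbr.ravel()] = 1.0
    W = np.maximum(W, W.T)                           # symmetric adjacency
    L = np.diag(W.sum(1)) - W                        # combinatorial graph Laplacian
    _, U = np.linalg.eigh(L)                         # GFT basis, low frequencies first
    coeff = U.T @ points                             # spectral coefficients, (n, 3)
    m = max(1, int(keep * n))                        # lowest ~10% of frequencies
    coeff[:m] += np.random.default_rng(seed).normal(0, shift_scale, (m, 3))
    return U @ coeff                                 # inverse GFT -> adapted cloud

pts = np.random.default_rng(0).random((256, 3))
adapted = graph_fourier_adapt(pts)
print(np.abs(adapted - pts).max())   # small, smooth (low-frequency) deformation
```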
Poster
Haiping Wang · Yuan Liu · Ziwei Liu · Wenping Wang · Zhen Dong · Bisheng Yang

[ Exhibit Hall I ]

Abstract
In this paper, we propose VistaDream, a novel framework to reconstruct a 3D scene from a single-view image. Recent diffusion models enable generating high-quality novel-view images from a single-view input image. Most existing methods only concentrate on building the consistency between the input image and the generated images while losing the consistency between the generated images. VistaDream addresses this problem by a two-stage pipeline. In the first stage, VistaDream builds a global coarse 3D scaffold by zooming out a little step with inpainted boundaries and an estimated depth map. Then, on this global scaffold, we use iterative diffusion-based RGB-D inpainting to generate novel-view images to inpaint the holes of the scaffold. In the second stage, we further enhance the consistency between the generated novel-view images by a novel training-free Multiview Consistency Sampling (MCS) that introduces multi-view consistency constraints in the reverse sampling process of diffusion models. Experimental results demonstrate that without training or fine-tuning existing diffusion models, VistaDream achieves high-quality scene reconstruction and novel view synthesis using a single-view image and outperforms baseline methods by a large margin.
Poster
Alberto Jaenal · Paula Carbó Cubero · Jose Araujo · André Mateus

[ Exhibit Hall I ]

Abstract
The growing presence of vision-based systems in the physical world comes with a major requirement: highly accurate estimation of the pose, a task typically addressed through methods based on local features. All available feature-based localization solutions are designed under the assumption that the same feature is used for mapping and localization. However, as the implementation provided by each vendor is based on heterogeneous feature extraction algorithms, collaboration between different devices is not straightforward or may even be impossible. Although there are some alternatives, such as re-extracting the features or reconstructing the image from them, these are impractical or costly to implement in a real pipeline. To overcome this, and inspired by the seminal work Cross-Descriptor [12], we propose Cross-Feature, a method that applies a patch-based training strategy to a simple MLP that projects features into a common embedding space. As a consequence, our proposal makes it possible to establish suitable correspondences between features computed through heterogeneous algorithms, e.g., SIFT [23] and SuperPoint [9]. We experimentally demonstrate the validity of Cross-Feature by evaluating it on tasks such as Image Matching, Visual Localization, and a new Collaborative Visual Localization and Mapping scenario. We believe this is the first step towards full Visual Localization interoperability. …
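A minimal sketch of the idea, assuming a contrastive patch-level objective and using random tensors as stand-ins for SIFT and SuperPoint descriptors of the same patches: each extractor gets its own small MLP head that maps descriptors into a shared, L2-normalized embedding space. The head sizes, temperature, and InfoNCE-style loss are assumptions for illustration, not the training recipe of Cross-Feature.

```python
import torch

class CrossFeatureHead(torch.nn.Module):
    """Per-extractor MLP projecting a vendor-specific descriptor (e.g. 128-d
    SIFT or 256-d SuperPoint) into a shared embedding space."""
    def __init__(self, in_dim, emb_dim=128):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(in_dim, 256), torch.nn.ReLU(),
            torch.nn.Linear(256, emb_dim))

    def forward(self, desc):
        return torch.nn.functional.normalize(self.net(desc), dim=-1)

# Patch-based training sketch: descriptors of the same patch from two different
# extractors should land close together in the common space.
sift_head, sp_head = CrossFeatureHead(128), CrossFeatureHead(256)
opt = torch.optim.Adam(list(sift_head.parameters()) + list(sp_head.parameters()), lr=1e-3)
sift_desc = torch.randn(512, 128)   # stand-ins for descriptors of 512 shared patches
sp_desc = torch.randn(512, 256)
for _ in range(5):
    a, b = sift_head(sift_desc), sp_head(sp_desc)
    logits = a @ b.T / 0.07                          # cosine similarity / temperature
    loss = torch.nn.functional.cross_entropy(logits, torch.arange(len(a)))
    opt.zero_grad(); loss.backward(); opt.step()
print(float(loss))   # matching across extractors is then a nearest-neighbour search
```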
Poster
Anjun Hu · Richard Tomsett · Valentin Gourmet · Massimo Camplani · Jas Kandola · Hanting Xie

[ Exhibit Hall I ]

Abstract
We present MiDSummer, a two-stage framework for generating immersive Gaussian Splatting scenes that leverages multiple diffusion guidance signals to enable structured layout control, enhanced physical realism, and improved visual quality. While 3D scene generation has seen significant recent advances, current approaches could benefit from: (1) achieving precise, reliable layout control while preserving open-world generalization and physical plausibility, (2) balancing high-level semantic reasoning with low-level, directly controllable geometric constraints, and (3) effectively utilizing layout knowledge for visual refinement. Our work addresses these challenges through a structured two-stage planning-assembly framework. For planning, we introduce a dual layout diffusion guidance approach to bridge semantic reasoning and geometric controllability. Our approach uniquely integrates LLMs' open-vocabulary reasoning with Graph Diffusion Models' (GDM) geometric precision by incorporating multi-level self-consistency scores over scene graph structures and layout bounding box parameters. This fusion enables fine-grained control over scene composition while ensuring physical plausibility and faithful prompt interpretation. For assembly, we propose a layout-guided optimization technique for scene refinement. We effectively incorporate layout priors obtained during the planning stage into a Stable Diffusion (SD)-based refinement process that jointly optimizes camera trajectories and scene splats. This layout-aware joint optimization, constrained by multi-view consistency, produces visually compelling immersive scenes that are structurally coherent and …
Poster
Zhaonan Wang · Manyi Li · Changhe Tu

[ Exhibit Hall I ]

Abstract
3D Gaussian Splatting (3DGS) has witnessed exponential adoption across diverse applications, driving a critical need for semantic-aware 3D Gaussian representations to enable scene understanding and editing tasks. Existing approaches typically attach semantic features to a collection of free Gaussians and distill the features via differentiable rendering, leading to noisy segmentation and a messy selection of Gaussians. In this paper, we introduce AG$^2$aussian, a novel framework that leverages an anchor-graph structure to organize semantic features and regulate Gaussian primitives. Our anchor-graph structure not only promotes compact and instance-aware Gaussian distributions, but also facilitates graph-based propagation, achieving a clean and accurate instance-level Gaussian selection. Extensive validation across four applications, i.e. interactive click-based query, open-vocabulary text-driven query, object removal editing, and physics simulation, demonstrates the advantages of our approach and its benefits to various applications. The experiments and ablation studies further evaluate the effectiveness of the key designs of our approach.
Poster
Jon Nyffeler · Federico Tombari · Daniel Barath

[ Exhibit Hall I ]

Abstract
Understanding and structuring outdoor environments in 3D is critical for numerous applications, including robotics, urban planning, and autonomous navigation. In this work, we propose a pipeline to construct hierarchical 3D scene graphs from outdoor data, consisting of posed images and 3D reconstructions. Our approach systematically extracts and organizes objects and their subcomponents, enabling representations that span from entire buildings to their facades and individual windows. By leveraging geometric and semantic relationships, our method efficiently groups objects into meaningful hierarchies while ensuring robust spatial consistency. We integrate efficient feature extraction, hierarchical object merging, and relationship inference to generate structured scene graphs that capture both global and local dependencies. Our approach scales to large outdoor environments while maintaining efficiency, and we demonstrate its effectiveness on real-world datasets. We also demonstrate that these constructed outdoor scene graphs are beneficial for downstream applications, such as 3D scene alignment. The code will be made public.
Poster
Shin Ishihara · Imari Sato

[ Exhibit Hall I ]

Abstract
Hyperspectral imaging has proven effective for appearance inspection because it can identify material compositions and reveal hidden features. Similarly, direct/indirect separation provides essential information about surface appearance and internal conditions, including layer structures and scattering behaviors. This paper presents a novel illumination system incorporating dispersive optics to unify both advantages for scene analyses. In general, achieving distinct direct/indirect separation requires multiple images with varying patterns. In a hyperspectral scenario, using a hyperspectral camera or tunable filters extends exposure and measurement times, hindering practical application. Our proposed system enables illumination with a wavelength-dependent, spatially shifted pattern. With proper consideration of reflectance differences, we demonstrate that robust separation of direct and indirect components for each wavelength can be achieved using a single hyperspectral image taken under one illumination pattern. Furthermore, we demonstrate that analyzing the observed differences across wavelengths contributes to estimating depth.
Poster
Yujie Xue · Huilong Pi · Jiapeng Zhang · Qin Yunchuan · Zhuo Tang · Kenli Li · Ruihui Li

[ Exhibit Hall I ]

Abstract
Vision-based semantic scene completion (SSC) is able to predict complex scene information from limited 2D images, which has attracted widespread attention. Currently, SSC methods typically construct unified voxel features containing both geometry and semantics, which leads different depth positions in occluded regions to share the same 2D semantic information, resulting in ambiguous semantic segmentation. To address this problem, we propose SDFormer, a novel SAM-assisted Dual-channel Voxel Transformer framework for SSC. We uncouple the task based on its multi-objective nature and construct two parallel sub-networks: a semantic constructor (SC) and a geometric refiner (GR). The SC utilizes the Segment Anything Model (SAM) to construct dense semantic voxel features from reliable visible semantic information in the image. The GR accurately predicts depth positions and then further adjusts the semantic output from SAM. Additionally, we design a Semantic Calibration Affinity to enhance semantic-aware transformations in the SC. Within the GR, we design Shape Segments Interactive and Learnable Mask Generation modules to emphasize the spatial location of semantics and obtain fine-grained voxel information. Extensive qualitative and quantitative results on the SemanticKITTI and SSCBench-KITTI-360 datasets show that our method outperforms state-of-the-art approaches.
Poster
Sihang Li · Siqi Tan · Bowen Chang · Jing Zhang · Chen Feng · Yiming Li

[ Exhibit Hall I ]

Abstract
Visual localization, which estimates a camera's pose within a known scene, is a fundamental capability for autonomous systems. While absolute pose regression (APR) methods have shown promise for efficient inference, they often struggle with generalization. Recent approaches attempt to address this through data augmentation with varied viewpoints, yet they overlook a critical factor: appearance diversity. In this work, we identify appearance variation as the key to robust localization. Specifically, we first lift real 2D images into 3D Gaussian Splats with varying appearance and deblurring capabilities, enabling the synthesis of diverse training data that varies not just in poses but also in environmental conditions such as lighting and weather. To fully unleash the potential of the appearance-diverse data, we build a two-branch joint training pipeline with an adversarial discriminator to bridge the syn-to-real gap. Extensive experiments demonstrate that our approach significantly outperforms state-of-the-art methods, reducing translation and rotation errors by 50% and 22% on indoor datasets, and 37% and 42% on outdoor datasets. Most notably, our method shows remarkable robustness in dynamic driving scenarios under varying weather conditions and in day-to-night scenarios, where previous APR methods fail.
Poster
Qi Zhang · Chi Huang · Qian Zhang · Nan Li · Wei Feng

[ Exhibit Hall I ]

Abstract
The latest advancements in scene relighting have been predominantly driven by inverse rendering with 3D Gaussian Splatting (3DGS). However, existing methods remain overly reliant on densely sampled images under static illumination conditions, which is prohibitively expensive and even impractical in real-world scenarios. In this paper, we propose SU-RGS, a novel framework that learns relightable 3D Gaussian Splatting from Sparse views under Unconstrained illuminations, to address this challenge by jointly optimizing 3DGS representations, surface materials, and environment illuminations (i.e., unknown and varying lighting conditions during training) using only sparse input views. Firstly, SU-RGS presents a varying appearance rendering strategy, enabling each 3D Gaussian to take on different colors under different lightings. Next, SU-RGS establishes multi-view semantic consistency by constructing hierarchical semantic pseudo-labels across views, providing additional supervision and facilitating sparse inverse rendering under unconstrained illuminations. Additionally, we introduce an adaptive transient object perception component that integrates scene geometry and semantics in a fine-grained manner, to quantify and eliminate the uncertainty of the foreground. Extensive experiments on both synthetic and real-world challenging datasets demonstrate the effectiveness of SU-RGS, achieving state-of-the-art performance for scene inverse rendering by learning 3DGS from only sparse views under unconstrained illuminations.
Poster
Yusen XIE · Zhenmin Huang · Jin Wu · Jun Ma

[ Exhibit Hall I ]

Abstract
In this paper, we introduce GS-LIVM, a real-time photo-realistic LiDAR-Inertial-Visual mapping framework with Gaussian Splatting tailored for outdoor scenes. Compared to existing methods based on Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS), our approach enables real-time photo-realistic mapping while ensuring high-quality image rendering in large-scale unbounded outdoor environments. In this work, Gaussian Process Regression (GPR) is employed to mitigate the issues resulting from sparse and unevenly distributed LiDAR observations. The voxel-based 3D Gaussian map representation facilitates real-time dense mapping in large outdoor environments, with acceleration provided by custom CUDA kernels. Moreover, the overall framework is designed in a covariance-centered manner, where the estimated covariance is used to initialize the scale and rotation of 3D Gaussians, as well as to update the parameters of the GPR. We evaluate our algorithm on several outdoor datasets, and the results demonstrate that our method achieves state-of-the-art performance in terms of mapping efficiency and rendering quality. The source code is available on GitHub.
Poster
Xin Jin · Haisheng Su · Cong Ma · Kai Liu · Wei Wu · Fei HUI · Junchi Yan

[ Exhibit Hall I ]

Abstract
Lidar-based 3D detection is one of the most popular research fields in autonomous driving. 3D detectors typically detect specific targets in a scene according to the pattern formed by the spatial distribution of point clouds. However, existing voxel-based methods usually adopt MLPs and global pooling (e.g., PointNet, CenterPoint) as the voxel feature encoder, which makes it less effective at extracting detailed spatial structure information from raw points, leading to information loss and inferior performance. In this paper, we propose a novel graph-based transformer to encode voxel features by condensing the full and detailed geometry of the raw points, termed GeoFormer. We first represent the points within a voxel as a graph based on relative distances to capture their spatial geometry. Then, we introduce a geometry-guided transformer architecture to encode voxel features, where adjacent geometric clues are used to re-weight point feature similarities, enabling more effective extraction of geometric relationships between point pairs at varying distances. We highlight that GeoFormer is a plug-and-play module which can be seamlessly integrated to enhance the performance of existing voxel-based detectors. Extensive experiments conducted on three popular outdoor datasets demonstrate that our GeoFormer achieves state-of-the-art performance in both effectiveness and robustness comparisons.
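A toy sketch of the geometry-guided attention idea for the points of a single voxel: pairwise relative distances are mapped to a learnable bias that re-weights the attention scores before a voxel feature is pooled. The bias form, feature dimension, and pooling choice are assumptions, not the paper's exact architecture.

```python
# Toy sketch of geometry-guided attention over the points of a single voxel:
# pairwise point distances become a learnable bias added to the attention
# scores. Dimensions and the exact bias form are assumptions.
import torch
import torch.nn as nn

class GeoAttention(nn.Module):
    def __init__(self, dim=32):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.dist_to_bias = nn.Linear(1, 1)   # learnable mapping from distance to attention bias
        self.scale = dim ** -0.5

    def forward(self, feats, xyz):
        # feats: (N, dim) point features, xyz: (N, 3) point coordinates
        attn = (self.q(feats) @ self.k(feats).t()) * self.scale   # (N, N) feature similarities
        dists = torch.cdist(xyz, xyz).unsqueeze(-1)               # (N, N, 1) pairwise distances
        attn = attn + self.dist_to_bias(dists).squeeze(-1)        # geometric re-weighting
        attn = attn.softmax(dim=-1)
        return attn @ self.v(feats)                               # (N, dim) encoded point features

points_in_voxel = torch.randn(16, 3)
point_feats = torch.randn(16, 32)
voxel_feat = GeoAttention()(point_feats, points_in_voxel).max(dim=0).values  # pooled voxel feature
```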
Poster
Yuntao Chen · Yuqi Wang · Zhaoxiang Zhang

[ Exhibit Hall I ]

Abstract
World model-based searching and planning are widely recognized as a promising path toward human-level physical intelligence. However, current driving world models primarily rely on video diffusion models, which specialize in visual generation but lack the flexibility to incorporate other modalities like action. In contrast, autoregressive transformers have demonstrated exceptional capability in modeling multimodal data. Our work aims to unify both driving world simulation and trajectory planning into a single sequence modeling problem. We introduce a multimodal driving language based on interleaved image and action tokens, and develop DrivingGPT to learn joint world modeling and planning through standard next-token prediction. Our DrivingGPT demonstrates strong performance in both action-conditioned video generation and end-to-end planning, outperforming strong baselines on the large-scale nuPlan and NAVSIM benchmarks.
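A schematic sketch of the interleaved multimodal driving language described above: image tokens and discretized action tokens are flattened into one sequence with a shared vocabulary so a standard decoder can be trained by next-token prediction. Vocabulary sizes, tokens per frame, and the offset scheme are illustrative assumptions.

```python
# Schematic sketch of an interleaved image/action token sequence for
# next-token prediction. Vocabulary sizes and tokens-per-frame are made up.
import torch

IMG_VOCAB = 8192          # codes from a (hypothetical) image tokenizer
ACT_VOCAB = 256           # discretized action bins
TOKENS_PER_FRAME = 64

def interleave(image_tokens, action_tokens):
    """image_tokens: (T, TOKENS_PER_FRAME), action_tokens: (T,) -> flat token sequence."""
    seq = []
    for img, act in zip(image_tokens, action_tokens):
        seq.extend(img.tolist())
        seq.append(IMG_VOCAB + int(act))   # offset actions into a shared vocabulary
    return torch.tensor(seq)

frames = torch.randint(0, IMG_VOCAB, (4, TOKENS_PER_FRAME))
actions = torch.randint(0, ACT_VOCAB, (4,))
sequence = interleave(frames, actions)

# Standard next-token prediction targets: predict token t+1 from tokens <= t.
inputs, targets = sequence[:-1], sequence[1:]
```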
Poster
Liuyue Xie · Jiancong Guo · Ozan Cakmakci · Andre Araujo · Laszlo A. A. Jeni · zhiheng jia

[ Exhibit Hall I ]

Abstract
Accurate camera calibration is a fundamental task for 3D perception, especially when dealing with real-world, in-the-wild environments where complex optical distortions are common. Existing methods often rely on pre-rectified images or calibration patterns, which limits their applicability and flexibility. In this work, we introduce AlignDiff, a novel framework that addresses these challenges by jointly modeling camera intrinsic and extrinsic parameters using a generic ray camera model. Unlike previous approaches, AlignDiff shifts focus from semantic to geometric features, enabling more accurate modeling of local distortions. Concretely, AlignDiff is a diffusion model conditioned on geometric priors, enabling the simultaneous estimation of camera distortions and scene geometry. To enhance distortion prediction, we incorporate edge-aware attention, focusing the model on geometric features around image edges rather than semantic content. Furthermore, to enhance generalizability to real-world captures, we incorporate a large database of ray-traced lenses containing over three thousand samples. This database characterizes the distortion inherent in a diverse variety of lens forms. Our experiments demonstrate that the proposed method significantly reduces the angular error of estimated ray bundles by $\sim 8.2^\circ$ and improves overall calibration accuracy, outperforming existing approaches on challenging, real-world datasets.
Poster
Shuchao Pang · Zhenghan Chen · Shen Zhang · Liming Lu · Siyuan Liang · Anan Du · Yongbin Zhou

[ Exhibit Hall I ]

Abstract
Deep neural networks for 3D point clouds have been demonstrated to be vulnerable to adversarial examples. Previous 3D adversarial attack methods often exploit certain information about the target models, such as model parameters or outputs, to generate adversarial point clouds. However, in realistic scenarios, it is challenging to obtain any information about the target models under conditions of absolute security. Therefore, we focus on transfer-based attacks, where generating adversarial point clouds does not require any information about the target models. Based on our observation that the critical features used for point cloud classification are consistent across different DNN architectures, we propose CFG, a novel transfer-based black-box attack method that improves the transferability of adversarial point clouds via the proposed **C**ritical **F**eature **G**uidance. Specifically, our method regularizes the search of adversarial point clouds by computing the importance of the extracted features, prioritizing the corruption of critical features that are likely to be adopted by diverse architectures. Further, we explicitly constrain the maximum deviation extent of the generated adversarial point clouds in the loss function to ensure their imperceptibility. Extensive experiments conducted on the ModelNet40 and ScanObjectNN benchmark datasets demonstrate that the proposed CFG outperforms the state-of-the-art attack methods by a large …
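A rough, hypothetical sketch of the critical-feature idea on a toy surrogate classifier: channel importance is scored by gradient times activation of a pooled global feature, and an attack would prioritize corrupting the highest-scoring channels. The surrogate network and this particular importance measure are assumptions, not the paper's definition of Critical Feature Guidance.

```python
# Rough sketch of "critical feature" importance on a toy surrogate point-cloud
# classifier: channel importance = |gradient x activation| of the pooled global
# feature; an attack would prioritize corrupting high-importance channels.
# The surrogate model and importance definition are assumptions.
import torch
import torch.nn as nn

surrogate = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 40))  # toy PointNet-like head

points = torch.randn(1024, 3)                              # one point cloud
global_feat = surrogate[:2](points).max(dim=0).values      # (64,) max-pooled global feature
global_feat.retain_grad()                                  # keep gradients on the non-leaf feature
logits = surrogate[2](global_feat)
logits[logits.argmax()].backward()                         # gradient of the predicted class score

importance = (global_feat * global_feat.grad).abs()        # gradient x activation per channel
critical_channels = importance.topk(8).indices             # channels an attack would target
```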
Poster
Xiaoyu Zhang · Weihong Pan · Xiaojun Xiang · Hongjia Zhai · Liyang Zhou · Hanqing Jiang · Guofeng Zhang

[ Exhibit Hall I ]

Abstract
3D Gaussian Splatting (3DGS) has drawn significant attention for its advantages in rendering speed and quality. Most existing methods still rely on the image-wise loss and training paradigm because of its intuitive nature in the Splatting algorithm. However, image-wise loss lacks multi-view constraints, which are generally essential for optimizing 3D appearance and geometry. To address this, we propose RT-Loss along with a tile-based training paradigm, which uses randomly sampled tiles to integrate multi-view appearance and structural constraints in 3DGS. Additionally, we introduce an tile-based adaptive densification control strategy tailored for our training paradigm. Extensive experiments show that our approach consistently improves performance metrics while maintaining efficiency across various benchmark datasets.
Poster
Xuemeng Yang · Licheng Wen · Tiantian Wei · Yukai Ma · Jianbiao Mei · Xin Li · Wenjie Lei · Daocheng Fu · Pinlong Cai · Min Dou · Liang He · Yong Liu · Botian Shi · Yu Qiao

[ Exhibit Hall I ]

Abstract
This paper introduces DriveArena, the first high-fidelity closed-loop simulation system designed for driving agents navigating real-world scenarios. DriveArena comprises two core components: Traffic Manager, a traffic simulator capable of generating realistic traffic flow on any global street map, and World Dreamer, a high-fidelity conditional generative model with infinite auto-regression. DriveArena supports closed-loop simulation using road networks from cities worldwide, enabling the generation of diverse traffic scenarios with varying styles. This powerful synergy empowers any driving agent capable of processing real-world images to navigate in DriveArena's simulated environment. Furthermore, DriveArena features a flexible, modular architecture, allowing for multiple implementations of its core components and driving agents. Serving as a highly realistic arena for these players, our work provides a valuable platform for developing and evaluating driving agents across diverse and challenging scenarios. DriveArena takes a significant leap forward in leveraging generative models for driving simulation platforms, opening new avenues for closed-loop evaluation of autonomous driving systems. Codes of DriveArena are attached to the supplementary material. Project Page: https://blindpaper.github.io/DriveArena/
Poster
Jack Langerman · Denis Rozumny · Yuzhong Huang · Dmytro Mishkin

[ Exhibit Hall I ]

Abstract
"What cannot be measured cannot be improved," while likely never uttered by Lord Kelvin, effectively summarizes the purpose of this work. This paper presents a detailed evaluation of automated metrics for evaluating structured 3D reconstructions. Pitfalls of each metric are discussed, and a thorough analysis through the lens of expert 3D modelers' preferences is presented. A set of systematic "unit tests" is proposed to empirically verify desirable properties, and context-aware recommendations as to which metric to use depending on application are provided. Finally, a learned metric distilled from human expert judgments is proposed and analyzed.
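The "unit test" idea can be illustrated with simple empirical property checks on a candidate reconstruction metric; the Chamfer-style metric and the three properties below are illustrative assumptions rather than the paper's actual test suite.

```python
# Illustrative "unit tests" for a candidate 3D-reconstruction metric: empirical
# checks of properties one might require (identity, symmetry, sensitivity to
# perturbation). The chamfer-style metric and the chosen properties are
# assumptions for illustration, not the paper's test suite.
import numpy as np

def chamfer(a, b):
    """Symmetric mean nearest-neighbour distance between two point sets."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return d.min(axis=1).mean() + d.min(axis=0).mean()

rng = np.random.default_rng(0)
pred = rng.normal(size=(200, 3))
gt = rng.normal(size=(200, 3))

assert np.isclose(chamfer(gt, gt), 0.0)                    # identity
assert np.isclose(chamfer(pred, gt), chamfer(gt, pred))    # symmetry
assert chamfer(gt + 0.1, gt) < chamfer(gt + 1.0, gt)       # larger error -> larger score
print("all metric property checks passed")
```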
Poster
Jiaru Zhong · Jiahao Wang · Jiahui Xu · Xiaofan Li · Zaiqing Nie · Haibao Yu

[ Exhibit Hall I ]

Abstract
Cooperative perception aims to address the inherent limitations of single autonomous driving systems through information exchange among multiple agents. Previous research has primarily focused on single-frame perception tasks. However, the more challenging cooperative sequential perception tasks, such as cooperative 3D multi-object tracking, have not been thoroughly investigated. Therefore, we propose CoopTrack, a fully instance-level end-to-end framework for cooperative tracking, featuring learnable instance association, which fundamentally differs from existing approaches. CoopTrack transmits sparse instance-level features that significantly enhance perception capabilities while maintaining low transmission costs. Furthermore, the framework comprises three key components: Multi-Dimensional Feature Extraction (MDFE), Cross-Agent Alignment (CAA), and Graph-Based Association (GBA), which collectively enable comprehensive instance representation with semantic and motion features, and adaptive cross-agent association and fusion based on graph learning. Experiments on the V2X-Seq dataset demonstrate that, benefiting from its sophisticated design, CoopTrack achieves state-of-the-art performance, with 39.0\% mAP and 32.8\% AMOTA. Codes and visualization results are provided in the supplementary materials.
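A simplified sketch of the cross-agent association step: an affinity matrix between instance embeddings (plain cosine similarity here, standing in for learned graph features) is solved as an assignment problem. The paper's association is learned end-to-end; this hand-crafted version only conveys the structure.

```python
# Simplified sketch of cross-agent instance association: build an affinity
# matrix from instance embeddings (cosine similarity as a stand-in for learned
# graph features) and solve the assignment. This approximates the idea only.
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(feats_a, feats_b, min_affinity=0.5):
    a = feats_a / np.linalg.norm(feats_a, axis=1, keepdims=True)
    b = feats_b / np.linalg.norm(feats_b, axis=1, keepdims=True)
    affinity = a @ b.T                               # (Na, Nb) cosine affinities
    rows, cols = linear_sum_assignment(-affinity)    # maximize total affinity
    return [(r, c) for r, c in zip(rows, cols) if affinity[r, c] > min_affinity]

ego_instances = np.random.randn(5, 64)               # instance features from one agent
infra_instances = np.random.randn(6, 64)              # instance features from another agent
matches = associate(ego_instances, infra_instances)   # kept (ego, infra) index pairs
```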
Poster
Fanjie Kong · Yitong Li · Weihuang Chen · Chen Min · Yizhe Li · Zhiqiang Gao · Haoyang Li · Zhongyu Guo · Hongbin Sun

[ Exhibit Hall I ]

Abstract
The rise of embodied intelligence and multi-modal large language models has led to exciting advancements in the field of autonomous driving, establishing it as a prominent research focus in both academia and industry. However, when confronted with intricate and ambiguous traffic scenarios, the lack of logical reasoning and cognitive decision-making capabilities remains the primary challenge impeding the realization of embodied autonomous driving. Although Vision Language Models (VLMs) have enhanced the deep semantic understanding of autonomous driving systems, they exhibit notable limitations in decision explainability when handling rare and long-tail traffic scenarios. In this paper, we propose VLR-Driver, a novel multi-modal Vision-Language-Reasoning (VLR) framework based on Chain of Thought (CoT) for embodied autonomous driving. The framework employs a spatiotemporal CoT reasoning approach to recursively analyze potential safety risks and driving intentions of other agents, thereby delivering an efficient and transparent decision-making process. Furthermore, we construct a multi-modal reasoning-decision dataset to support the advancement of hierarchical reasoning of VLMs in autonomous driving. Closed-loop experiments conducted in CARLA demonstrate that the VLR-Driver significantly outperforms state-of-the-art end-to-end methods. Notably, key metrics such as driving score improved by 17.5\%, while the success rate improved by 22.2\%, offering a more transparent, reliable, and secure solution for …
Poster
Yuwen Du · Anning Hu · Zichen Chao · Yifan Lu · Junhao Ge · Genjia Liu · Wei-Tao Wu · Lanjun Wang · Siheng Chen

[ Exhibit Hall I ]

Abstract
Roadside Collaborative Perception refers to a system where multiple roadside units collaborate to pool their perceptual data, assisting vehicles in enhancing their environmental awareness. Existing roadside perception methods concentrate on model design but overlook data issues like calibration errors, sparse information, and multi-view consistency, leading to poor performance on recently published datasets. To significantly enhance roadside collaborative perception and address critical data issues, we present RoCo-Sim, the first simulation framework for roadside collaborative perception. RoCo-Sim is capable of generating diverse, multi-view consistent simulated roadside data through dynamic foreground editing and full-scene style transfer of a single image. RoCo-Sim consists of four components: (1) Camera Extrinsic Optimization ensures accurate 3D to 2D projection for roadside cameras; (2) A novel Multi-View Occlusion-Aware Sampler (MOAS) determines the placement of diverse digital assets within 3D space; (3) DepthSAM innovatively models foreground-background relationships from single-frame fixed-view images, ensuring multi-view consistency of the foreground; and (4) A Scalable Post-Processing Toolkit generates more realistic and enriched scenes through style transfer and other enhancements. RoCo-Sim significantly improves roadside 3D object detection, outperforming SOTA methods by \textbf{83.74\%} on Rcooper-Intersection and \textbf{83.12\%} on TUMTraf-V2X for AP70. RoCo-Sim fills a critical gap in roadside perception simulation. Code and pre-trained models will be released …
Poster
Sacha Ichbiah · Anshuman Sinha · Fabrice Delbary · Hervé Turlier

[ Exhibit Hall I ]

Abstract
Traditional methods for biological shape inference, such as deep learning (DL) and active contour models, face limitations in 3D. DL requires large labeled datasets, which are difficult to obtain, while active contour models rely on fine-tuned hyperparameters for intensity attraction and regularization. We introduce deltaMic, a novel 3D differentiable renderer for fluorescence microscopy. By leveraging differentiable Fourier-space convolution, deltaMic accurately models the image formation process, integrating a parameterized microscope point spread function and a mesh-based object representation. Unlike DL-based segmentation, it directly optimizes shape and microscopy parameters to fit real microscopy data, removing the need for large datasets or heuristic priors. To enhance efficiency, we develop a GPU-accelerated Fourier transform for triangle meshes, significantly improving speed. We demonstrate deltaMic’s ability to reconstruct cellular shapes from synthetic and real microscopy images, providing a robust tool for 3D segmentation and biophysical modeling. This work bridges physics-based rendering with modern optimization techniques, offering a new paradigm for microscopy image analysis and inverse biophysical modeling.
Poster
WEIMING ZHANG · Dingwen Xiao · Lei Chen · Lin Wang

[ Exhibit Hall I ]

Abstract
Entity Segmentation (ES) aims at identifying and segmenting distinct entities within an image without the need for predefined class labels. This characteristic makes ES well-suited to open-world applications with adaptation to diverse and dynamically changing environments, where new and previously unseen entities may appear frequently. Existing ES methods either require large annotated datasets or high training costs, limiting their scalability and adaptability. Recently, the Segment Anything Model (SAM), especially in its Automatic Mask Generation (AMG) mode, has shown potential for holistic image segmentation. However, it struggles with over-segmentation and under-segmentation, making it less effective for ES. In this paper, we introduce E-SAM, a novel training-free framework that exhibits exceptional ES capability. Specifically, we first propose Multi-level Mask Generation (MMG) that hierarchically processes SAM's AMG outputs to generate reliable object-level masks while preserving fine details at other levels. Entity-level Mask Refinement (EMR) then refines these object-level masks into accurate entity-level masks. That is, it separates overlapping masks to address the redundancy issues inherent in SAM's outputs and merges similar masks by evaluating entity-level consistency. Lastly, Under-Segmentation Refinement (USR) addresses under-segmentation by generating additional high-confidence masks fused with EMR outputs to produce the final ES map. These three modules are seamlessly optimized …
Poster
Yiqing Shen · Bohan Liu · Chenjia Li · Lalithkumar Seenivasan · Mathias Unberath

[ Exhibit Hall I ]

Abstract
Reasoning segmentation (RS) aims to identify and segment objects of interest based on implicit text queries. As such, RS is a catalyst for embodied AI agents, enabling them to interpret high-level commands without requiring explicit step-by-step guidance. However, current RS approaches rely heavily on the visual perception capabilities of multimodal large language models (LLMs), leading to several major limitations. First, they struggle with queries that require multiple steps of reasoning or those that involve complex spatial/temporal relationships. Second, they necessitate LLM fine-tuning, which may require frequent updates to maintain compatibility with contemporary LLMs and may increase risks of catastrophic forgetting during fine-tuning. Finally, being primarily designed for static images or offline video processing, they scale poorly to online video data. To address these limitations, we propose an agent framework that disentangles perception and reasoning for online video RS without LLM fine-tuning. Our innovation is the introduction of a just-in-time digital twin concept, where -- given an implicit query -- an LLM plans the construction of a low-level scene representation from high-level video using specialist vision models. We refer to this approach to creating a digital twin as "just-in-time" because the LLM planner will anticipate the need for specific information and …
Poster
Enyu Liu · En Yu · Sijia Chen · Wenbing Tao

[ Exhibit Hall I ]

Abstract
3D Semantic Scene Completion (SSC) has gained increasing attention due to its pivotal role in 3D perception. Recent advancements have primarily focused on refining voxel-level features to construct 3D scenes. However, treating voxels as the basic interaction units inherently limits the utilization of class-level information, which is proven critical for enhancing the granularity of completion results. To address this, we propose Disentangling Instance and Scene Contexts (DISC), a novel dual-stream paradigm that enhances learning for both instance and scene categories through separated optimization. Specifically, we replace voxel queries with discriminative class queries, which incorporate class-specific geometric and semantic priors. Additionally, we exploit the intrinsic properties of classes to design specialized decoding modules, facilitating targeted interactions and efficient class-level information flow. Experimental results demonstrate that DISC achieves state-of-the-art (SOTA) performance on both SemanticKITTI and SSCBench-KITTI-360 benchmarks, with mIoU scores of 17.35 and 20.55, respectively. Remarkably, DISC even outperforms multi-frame SOTA methods using only single-frame input and significantly improves instance category performance, surpassing both single-frame and multi-frame SOTA instance mIoU by 17.9% and 11.9%, respectively, on the SemanticKITTI hidden test.
Poster
Loick Chambon · Eloi Zablocki · Alexandre Boulch · Mickael Chen · Matthieu Cord

[ Exhibit Hall I ]

Abstract
Understanding the 3D geometry and semantics of driving scenes is critical for developing safe autonomous vehicles. Recent advances in 3D occupancy prediction have improved scene representation but often suffer from spatial inconsistencies, leading to floating artifacts and poor surface localization. Existing voxel-wise losses (e.g., cross-entropy) fail to enforce geometric coherence. In this paper, we propose GaussRender, a module that improves 3D occupancy learning by enforcing projective consistency. Our key idea is to project both predicted and ground-truth 3D occupancy into 2D camera views, where we apply supervision. Our method penalizes 3D configurations that produce inconsistent 2D projections, thereby enforcing a more coherent and geometrically plausible 3D structure. To achieve this efficiently, we leverage differentiable rendering with Gaussian splatting. GaussRender seamlessly integrates with existing architectures while maintaining efficiency and requiring no inference-time modifications. Extensive evaluations on multiple benchmarks (SurroundOcc-nuScenes, Occ3D nuScenes, SSCBench-KITTI360) demonstrate that GaussRender significantly improves geometric fidelity across various 3D occupancy models (TPVFormer, SurroundOcc, Symphonies), achieving state-of-the-art results, particularly on surface-sensitive metrics. The code and models will be open-sourced.
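A crude stand-in for the projective-consistency idea: occupied voxel centers from predicted and ground-truth grids are projected into a camera view and the resulting 2D maps are compared. The actual method renders with differentiable Gaussian splatting, which this non-differentiable sketch does not reproduce; the camera intrinsics and grid size are arbitrary.

```python
# Crude, non-differentiable stand-in for projective supervision: occupied voxel
# centres from predicted and ground-truth occupancy grids are projected into a
# camera view and the resulting 2D maps are compared. The actual method uses
# differentiable Gaussian-splatting rendering; intrinsics here are arbitrary.
import numpy as np

def project_occupancy(occ, K, H=64, W=64, voxel_size=0.5):
    """occ: (X, Y, Z) binary grid -> (H, W) binary image under a pinhole camera."""
    pts = np.argwhere(occ > 0.5).astype(np.float64) * voxel_size   # occupied voxel centres
    pts[:, 2] += 1.0                                               # push the grid in front of the camera
    uvw = (K @ pts.T).T
    uv = (uvw[:, :2] / uvw[:, 2:3]).astype(int)
    img = np.zeros((H, W))
    valid = (uv[:, 0] >= 0) & (uv[:, 0] < W) & (uv[:, 1] >= 0) & (uv[:, 1] < H)
    img[uv[valid, 1], uv[valid, 0]] = 1.0
    return img

K = np.array([[50.0, 0, 32], [0, 50.0, 32], [0, 0, 1]])
pred_occ = np.random.rand(32, 32, 4) > 0.9
gt_occ = np.random.rand(32, 32, 4) > 0.9
view_loss = np.abs(project_occupancy(pred_occ, K) - project_occupancy(gt_occ, K)).mean()
```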
Poster
Chen Chen · Zhirui Wang · Taowei Sheng · Yi Jiang · Yundu Li · Peirui Cheng · Luning Zhang · Kaiqiang Chen · Yanfeng Hu · Xue Yang · Xian Sun

[ Exhibit Hall I ]

Abstract
Existing vision-based 3D occupancy prediction methods are inherently limited in accuracy due to their exclusive reliance on street-view imagery, neglecting the potential benefits of incorporating satellite views. We propose **SA-Occ**, **the first** Satellite-Assisted 3D occupancy prediction model, which leverages GPS & IMU to integrate historical yet readily available satellite imagery into real-time applications, effectively mitigating limitations of ego-vehicle perceptions, involving occlusions and degraded performance in distant regions. To address the core challenges of cross-view perception, we propose: 1) **Dynamic-Decoupling Fusion**, which resolves inconsistencies in dynamic regions caused by the temporal asynchrony between satellite and street views; 2) **3D-Proj Guidance**, a module that enhances 3D feature extraction from inherently 2D satellite imagery; and 3) **Uniform Sampling Alignment**, which aligns the sampling density between street and satellite views. Evaluated on Occ3D-nuScenes, SA-Occ achieves state-of-the-art performance, especially among single-frame methods, with a 39.05% mIoU (a 6.97% improvement), while incurring only 6.93 ms of additional latency per frame. Our code and newly curated dataset will be publicly available.
Poster
Jin Cao · Hongrui Wu · Ziyong Feng · Hujun Bao · Xiaowei Zhou · Sida Peng

[ Exhibit Hall I ]

Abstract
This paper tackles the challenge of robust reconstruction, i.e., the task of reconstructing a 3D scene from a set of inconsistent multi-view images. Some recent works have attempted to simultaneously remove image inconsistencies and perform reconstruction by integrating image degradation modeling into neural 3D scene representations. However, these methods rely heavily on dense observations for robustly optimizing model parameters. To address this issue, we propose to decouple robust reconstruction into two subtasks: restoration and reconstruction, which naturally simplifies the optimization process. To this end, we introduce UniVerse, a unified framework for robust reconstruction based on a video diffusion model. Specifically, UniVerse first converts inconsistent images into initial videos, then uses a specially designed video diffusion model to restore them into consistent images, and finally reconstructs the 3D scenes from these restored images. Compared with case-by-case per-view degradation modeling, the diffusion model learns a general scene prior from large-scale data, making it applicable to diverse image inconsistencies. Extensive experiments on both synthetic and real-world datasets demonstrate the strong generalization capability and superior performance of our method in robust reconstruction. Moreover, UniVerse can control the style of the reconstructed 3D scene. The code will be released for reproducibility.
Poster
Minsu Kim · Subin Jeon · In Cho · Mijin Yoo · Seon Joo Kim

[ Exhibit Hall I ]

Abstract
Recent advances in novel view synthesis (NVS) have enabled real-time rendering with 3D Gaussian Splatting (3DGS). However, existing methods struggle with artifacts and missing regions when rendering unseen viewpoints, limiting seamless scene exploration. To address this, we propose a 3DGS-based pipeline that generates additional training views to enhance reconstruction. We introduce an information-gain-driven virtual camera placement strategy to maximize scene coverage, followed by video diffusion priors to refine rendered results. Fine-tuning 3D Gaussians with these enhanced views significantly improves reconstruction quality. To evaluate our method, we present Wild-Explore, a benchmark designed for challenging scene exploration. Experiments demonstrate that our approach outperforms existing 3DGS-based methods, enabling high-quality, artifact-free rendering from arbitrary viewpoints.
Poster
Zijie Wang · Weiming Zhang · Wei Zhang · Xiao Tan · hongxing liu · Yaowei Wang · Guanbin Li

[ Exhibit Hall I ]

Abstract
Centerline graphs, crucial for path planning in autonomous driving, are traditionally learned using deterministic methods. However, these methods often lack spatial reasoning and struggle with occluded or invisible centerlines. Generative approaches, despite their potential, remain underexplored in this domain. We introduce LaneDiffusion, a novel generative paradigm for centerline graph learning. LaneDiffusion innovatively employs diffusion models to generate lane centerline priors at the Bird's Eye View (BEV) feature level, instead of directly predicting vectorized centerlines. Our method integrates a Lane Prior Injection Module (LPIM) and a Lane Prior Diffusion Module (LPDM) to effectively construct diffusion targets and manage the diffusion process. Furthermore, vectorized centerlines and topologies are then decoded from these prior-injected BEV features. Extensive evaluations on the nuScenes and Argoverse2 datasets demonstrate that LaneDiffusion significantly outperforms existing methods, achieving improvements of 4.2%, 4.6%, 4.7%, 6.4% and 1.8% on fine-grained point-level metrics (GEO F1, TOPO F1, JTOPO F1, APLS and SDA) and 2.3%, 6.4%, 6.8% and 2.1% on segment-level metrics (IoU, mAP_{cf}, DET_{l} and TOP_{ll}). These results establish state-of-the-art performance in centerline graph learning, offering new insights into generative models for this task.
Poster
Bozhong Zheng · Jinye Gan · Xiaohao Xu · Xintao Chen · Wenqiao Li · Xiaonan Huang · Na Ni · Yingna Wu

[ Exhibit Hall I ]

Abstract
3D point cloud anomaly detection is essential for robust vision systems but is challenged by pose variations and complex geometric anomalies. Existing patch-based methods often suffer from geometric fidelity issues due to discrete voxelization or projection-based representations, limiting fine-grained anomaly localization. We introduce Pose-Aware Signed Distance Field (PASDF), a novel framework that integrates 3D anomaly detection and repair by learning a continuous, pose-invariant shape representation. PASDF leverages a Pose Alignment Module for canonicalization and an SDF Network to dynamically incorporate pose, enabling implicit learning of high-fidelity anomaly repair templates from the continuous SDF. This facilitates precise pixel-level anomaly localization through an Anomaly-Aware Scoring Module. Crucially, the continuous 3D representation in PASDF extends beyond detection, facilitating in-situ anomaly repair. Experiments on Real3D-AD and Anomaly-ShapeNet demonstrate state-of-the-art performance, achieving high object-level AUROC scores of 80.2% and 90.0%, respectively. These results highlight the effectiveness of continuous geometric representations in advancing 3D anomaly detection and facilitating practical anomaly region repair. Our code will be released to drive further research.
Poster
Hao Ju · Shaofei Huang · Si Liu · Zhedong Zheng

[ Exhibit Hall I ]

Abstract
Existing approaches to drone visual geo-localization predominantly adopt the image-based setting, where a single drone-view snapshot is matched with images from other platforms. Such task formulation, however, underutilizes the inherent video output of the drone and is sensitive to occlusions and viewpoint disparity. To address these limitations, we formulate a new video-based drone geo-localization task and propose the Video2BEV paradigm. This paradigm transforms the video into a Bird's Eye View (BEV), simplifying the subsequent inter-platform matching process. In particular, we employ Gaussian Splatting to reconstruct a 3D scene and obtain the BEV projection. Different from the existing transform methods, e.g., polar transform, our BEVs preserve more fine-grained details without significant distortion. To facilitate the discriminative intra-platform representation learning, our Video2BEV paradigm also incorporates a diffusion-based module for generating hard negative samples. To validate our approach, we introduce UniV, a new video-based geo-localization dataset that extends the image-based University-1652 dataset. UniV features flight paths at $30^\circ$ and $45^\circ$ elevation angles with increased frame rates of up to 10 frames per second (FPS). Extensive experiments on the UniV dataset show that our Video2BEV paradigm achieves competitive recall rates and outperforms conventional video-based methods. Compared to other competitive methods, our proposed approach exhibits robustness …
Poster
Caner Korkmaz · Brighton Nuwagira · Baris Coskunuzer · Tolga Birdal

[ Exhibit Hall I ]

Abstract
We present CuMPerLay, a novel differentiable vectorization layer that enables the integration of Cubical Multiparameter Persistence (CMP) into deep learning pipelines. While CMP presents a natural and powerful way to work topologically with images, its use is hindered by the complexity of multifiltration structures as well as the *vectorization* of CMP. In the face of these challenges, we introduce a new algorithm for vectorizing MP homologies of cubical complexes. Our CuMPerLay decomposes the CMP into a combination of individual, learnable single-parameter persistences, where the bifiltration functions are jointly learned. Thanks to this differentiability, its robust topological feature vectors can be seamlessly used within state-of-the-art architectures such as Swin Transformers. We establish theoretical guarantees for the stability of our vectorization under generalized Wasserstein metrics. Our experiments on benchmark medical imaging datasets show the benefit of CuMPerLay on classification performance, particularly in limited-data scenarios. Overall, CuMPerLay offers a promising direction for integrating global structural information into deep networks for structured image analysis.
Poster
Xiangzeng Liu · CHI WANG · Guanglu Shi · Xiaodong Zhang · Qiguang Miao · Miao Fan

[ Exhibit Hall I ]

Abstract
Local feature matching remains a fundamental challenge in computer vision. Recent Area to Point Matching (A2PM) methods have improved matching accuracy. However, existing research based on this framework relies on inefficient pixel-level comparisons and complex graph matching that limit scalability. In this work, we introduce the Semantic and Geometric-aware Descriptor Network (SGAD), which fundamentally rethinks area-based matching by generating highly discriminative area descriptors that enable direct matching without complex graph optimization. This approach significantly improves both accuracy and efficiency of area matching. We further improve the performance of area matching through a novel supervision strategy that decomposes the area matching task into classification and ranking subtasks. Finally, we introduce the Hierarchical Containment Redundancy Filter (HCRF) to eliminate overlapping areas by analyzing containment graphs. SGAD demonstrates remarkable performance gains, reducing runtime by 60$\times$ (0.82s vs. 60.23s) compared to MESA. Extensive evaluations show consistent improvements across multiple point matchers: SGAD+LoFTR reduces runtime compared to DKM, while achieving higher accuracy (0.82s vs. 1.51s, 65.98 vs. 61.11) in outdoor pose estimation, and SGAD+ROMA delivers +7.39\% AUC@5$^\circ$ in indoor pose estimation, establishing a new state-of-the-art.
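A minimal sketch of matching area descriptors directly, without graph optimization: mutual nearest neighbours under cosine similarity. The descriptor dimensionality and the confidence threshold are assumptions for illustration.

```python
# Minimal sketch of matching area descriptors directly (no graph optimization):
# mutual nearest neighbours under cosine similarity. Descriptor dimensionality
# and the similarity threshold are assumptions for illustration.
import torch
import torch.nn.functional as F

def match_areas(desc_a, desc_b, thresh=0.7):
    a = F.normalize(desc_a, dim=1)
    b = F.normalize(desc_b, dim=1)
    sim = a @ b.t()                              # (Na, Nb) cosine similarities
    nn_ab = sim.argmax(dim=1)                    # best B-area for each A-area
    nn_ba = sim.argmax(dim=0)                    # best A-area for each B-area
    matches = []
    for i, j in enumerate(nn_ab.tolist()):
        if nn_ba[j].item() == i and sim[i, j] > thresh:   # mutual and confident
            matches.append((i, j))
    return matches

areas_img1 = torch.randn(12, 256)   # stand-in area descriptors for image 1
areas_img2 = torch.randn(10, 256)   # stand-in area descriptors for image 2
print(match_areas(areas_img1, areas_img2))
```

Matched areas would then be handed to a point matcher (e.g., LoFTR or RoMa) restricted to each area pair.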
Poster
Mohamed El Amine Boudjoghra · Ivan Laptev · Angela Dai

[ Exhibit Hall I ]

Abstract
With the growing ease of capture of real-world 3D scenes, effective editing becomes essential for the use of captured 3D scan data in various graphics applications.We present ScanEdit, which enables functional editing of complex, real-world 3D scans from natural language text prompts.By leveraging the high-level reasoning capabilities of large language models (LLMs), we construct a hierarchical scene graph representation for an input 3D scan given its instance decomposition. We develop a hierarchically-guided, multi-stage prompting approach using LLMs to decompose general language instructions (that can be vague, without referencing specific objects) into specific, actionable constraints that are applied to our scene graph. Our scene optimization integrates LLM-guided constraints along with 3D-based physical plausibility objectives, enabling the generation of edited scenes that align with a variety of input prompts, from abstract, functional-based goals to more detailed, specific instructions. This establishes a foundation for intuitive, text-driven 3D scene editing in real-world scenes.
Poster
Yuval Grader · Hadar Averbuch-Elor

[ Exhibit Hall I ]

Abstract
Floorplans provide a compact representation of the building’s structure, revealing not only layout information but also detailed semantics such as the locations of windows and doors. However, contemporary floorplan localization techniques mostly focus on matching depth-based structural cues, ignoring the rich semantics communicated within floorplans. In this work, we introduce a semantic-aware localization framework that jointly estimates depth and semantic rays, consolidating over both for predicting a structural-semantic probability volume. Our probability volume is constructed in a coarse-to-fine manner: We first sample a small set of rays to obtain an initial low-resolution probability volume. We then refine these probabilities by performing a denser sampling only in high-probability regions and process the refined values for predicting a 2D location and orientation angle. We conduct an evaluation on two standard floorplan localization benchmarks. Our experiments demonstrate that our approach substantially outperforms state-of-the-art methods, achieving significant improvements in recall metrics compared to prior works. Moreover, we demonstrate that our framework can easily incorporate additional metadata such as room labels, enabling additional gains in both accuracy and efficiency. We will release our code and trained models.
Poster
Chuanyu Fu · Yuqi Zhang · Kunbin Yao · Guanying Chen · Yuan Xiong · Chuan Huang · Shuguang Cui · Xiaochun Cao

[ Exhibit Hall I ]

Abstract
3D Gaussian Splatting (3DGS) has gained significant attention for its real-time, photo-realistic rendering in novel-view synthesis and 3D modeling. However, existing methods struggle with accurately modeling scenes affected by transient objects, leading to artifacts in the rendered images. We identify that the Gaussian densification process, while enhancing scene detail capture, unintentionally contributes to these artifacts by growing additional Gaussians that model transient disturbances. To address this, we propose RobustSplat, a robust solution based on two critical designs. First, we introduce a delayed Gaussian growth strategy that prioritizes optimizing static scene structure before allowing Gaussian splitting/cloning, mitigating overfitting to transient objects in early optimization. Second, we design a scale-cascaded mask bootstrapping approach that first leverages lower-resolution feature similarity supervision for reliable initial transient mask estimation, taking advantage of its stronger semantic consistency and robustness to noise, and then progresses to high-resolution supervision to achieve more precise mask prediction. Extensive experiments on multiple challenging datasets show that our method outperforms existing methods, clearly demonstrating the robustness and effectiveness of our method. Our code will be made publicly available.
Poster
Yingyan Li · Yuqi Wang · Yang Liu · Jiawei He · Lue Fan · Zhaoxiang Zhang

[ Exhibit Hall I ]

Abstract
End-to-end autonomous driving has achieved remarkable progress by integrating perception, prediction, and planning into a fully differentiable framework. Yet, to fully realize its potential, an effective online trajectory evaluation is indispensable to ensure safety. By forecasting the future outcomes of a given trajectory, trajectory evaluation becomes much more effective. This goal can be achieved by employing a world model to capture environmental dynamics and predict future states. Therefore, we propose an end-to-end driving framework **WoTE**, which leverages a BEV **Wo**rld model to predict future BEV states for **T**rajectory **E**valuation. The proposed BEV world model is latency-efficient compared to image-level world models and can be seamlessly supervised using off-the-shelf BEV-space traffic simulators. We validate our framework on both the NAVSIM benchmark and the closed-loop Bench2Drive benchmark based on the CARLA simulator, achieving state-of-the-art performance. The code will be released.
Poster
Yuval Nissan · Marc Pollefeys · Daniel Barath

[ Exhibit Hall I ]

Abstract
We propose a method for affine rectification of an image plane by leveraging changes in local scales and orientations under projective distortion. Specifically, we derive a novel linear constraint that directly relates pairs of points with orientations to the parameters of a projective transformation. This constraint is combined with an existing linear constraint on local scales, leading to highly robust rectification. The method reduces to solving a system of linear equations, enabling an efficient algebraic least-squares solution. It requires only two local scales and two local orientations, which can be extracted from, e.g., SIFT features. Unlike prior approaches, our method does not impose restrictions on individual features, does not require class segmentation, and makes no assumptions about feature interrelations. It is compatible with any feature detector that provides local scale or orientation. Furthermore, combining scaled and oriented points with line segments yields a highly robust algorithm that outperforms baselines. Extensive experiments show the effectiveness of our approach on real-world images, including repetitive patterns, building facades, and text-based content.
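A sketch of the algebraic least-squares step only: once each local-scale or local-orientation observation has been converted into a linear constraint row on the projective parameters (the row construction follows the paper and is not reproduced here; the numbers below are placeholders), the estimate comes from a single least-squares solve.

```python
# Sketch of the algebraic least-squares step. Each local scale / orientation
# measurement contributes one linear constraint row (a_i, b_i) on the projective
# (vanishing-line) parameters; how rows are built from measurements follows the
# paper and is not reproduced here -- these numbers are placeholders only.
import numpy as np

A = np.array([
    [0.8, -0.1, 1.0],
    [0.3,  0.5, 1.0],
    [-0.2, 0.9, 1.0],
    [0.6,  0.4, 1.0],
])
b = np.array([0.05, -0.02, 0.01, 0.03])

# Two scales and two orientations give enough rows for an algebraic
# least-squares estimate of the rectification parameters.
params, *_ = np.linalg.lstsq(A, b, rcond=None)
print(params)
```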
Poster
Guangzhao He · Yuxi Xiao · Zhen Xu · Xiaowei Zhou · Sida Peng

[ Exhibit Hall I ]

Abstract
Registering an object shape to a sequence of point clouds undergoing non-rigid deformation is a long-standing challenge. The key difficulties stem from two factors: (i) the presence of local minima due to the non-convexity of registration objectives, especially under noisy or partial inputs, which hinders accurate and robust deformation estimation, and (ii) error accumulation over long sequences, leading to tracking failures. To address these challenges, we adopt a scalable data-driven approach and propose an efficient feed-forward model trained on large deformation datasets. It is designed to handle noisy and partial inputs while effectively leveraging temporal information for accurate and consistent sequential registration. The key to our design is predicting a sequence of deformation graphs through a two-stage pipeline, which first estimates frame-wise coarse graph nodes for robust initialization, before refining their trajectories over time in a sliding-window fashion. Extensive experiments show that our proposed approach (i) outperforms the previous state of the art on both the DeformingThings4D and D-FAUST datasets, and (ii) achieves more than a 4x speedup compared to the previous best, offering a significant efficiency improvement.

Poster
Mengwei Xie · Shuang Zeng · Xinyuan Chang · Xinran Liu · Zheng Pan · Mu Xu · Xing Wei

[ Exhibit Hall I ]

Abstract
Accurate lane topology is essential for autonomous driving, yet traditional methods struggle to model the complex, non-linear structures—such as loops and bidirectional lanes—prevalent in real-world road structure. We present SeqGrowGraph, a novel framework that learns lane topology as a chain of graph expansions, inspired by human map-drawing processes. Representing the lane graph as a directed graph $G=(V,E)$, with intersections ($V$) and centerlines ($E$), SeqGrowGraph incrementally constructs this graph by introducing one node at a time. At each step, an adjacency matrix ($A$) expands from $n \times n$ to $(n+1) \times (n+1)$ to encode connectivity, while a geometric matrix ($M$) captures centerline shapes as quadratic Bézier curves. The graph is serialized into sequences, enabling a transformer model to autoregressively predict the chain of expansions, guided by a depth-first search ordering. Evaluated on nuScenes and Argoverse 2 datasets, SeqGrowGraph achieves state-of-the-art performance.
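A minimal sketch of the chain-of-expansions idea: the adjacency matrix grows from n x n to (n+1) x (n+1) as each node is introduced, and the expansions are serialized into a flat sequence a transformer could predict autoregressively. Bezier geometry, the DFS ordering, and the token layout are omitted or assumed here.

```python
# Minimal sketch of growing a lane graph one node at a time: the adjacency
# matrix expands from n x n to (n+1) x (n+1) at each step, and the chain of
# expansions is serialized into a flat sequence for autoregressive prediction.
# Geometry (Bezier control points) is omitted; the token layout is assumed.
import numpy as np

def expand_adjacency(A, new_row, new_col):
    """Grow A from n x n to (n+1) x (n+1) with the connectivity of the new node."""
    n = A.shape[0]
    A_new = np.zeros((n + 1, n + 1), dtype=int)
    A_new[:n, :n] = A
    A_new[n, :n] = new_row        # edges from the new node to existing nodes
    A_new[:n, n] = new_col        # edges from existing nodes to the new node
    return A_new

A = np.zeros((1, 1), dtype=int)                       # start from a single node
steps = [([1], [0]), ([0, 1], [1, 0])]                # two illustrative expansions
sequence = []
for new_row, new_col in steps:
    A = expand_adjacency(A, np.array(new_row), np.array(new_col))
    sequence += new_row + new_col + [-1]              # -1 acts as a step-separator token

print(A)          # final adjacency matrix
print(sequence)   # flat sequence a transformer would learn to predict
```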
Poster
Xiaoxue Chen · Bhargav Chandaka · Chih-Hao Lin · Ya-Qin Zhang · David Forsyth · Hao Zhao · Shenlong Wang

[ Exhibit Hall I ]

Abstract
We present InvRGB+L, a novel inverse rendering model that reconstructs large, relightable, and dynamic scenes from a single RGB+LiDAR sequence. Conventional inverse graphics methods rely primarily on RGB observations and use LiDAR mainly for geometric information, often resulting in suboptimal material estimates due to visible light interference. We find that LiDAR’s intensity values—captured with active illumination in a different spectral range—offer complementary cues for robust material estimation under variable lighting. Inspired by this, InvRGB+L leverages LiDAR intensity cues to overcome challenges inherent in RGB-centric inverse graphics through two key innovations: (1) a novel physics-based LiDAR shading model and (2) RGB–LiDAR material consistency losses. The model produces novel-view RGB and LiDAR renderings of urban and indoor scenes and supports relighting, night simulations, and dynamic object insertions—achieving results that surpass current state-of-the-art methods in both scene-level urban inverse rendering and LiDAR simulation.
Poster
Yuanyuan Gao · Hao Li · Jiaqi Chen · Zhihang Zhong · Zhengyu Zou · Dingwen Zhang · Xiao Sun · Junwei Han

[ Exhibit Hall I ]

Abstract
Despite its significant achievements in large-scale scene reconstruction, 3D Gaussian Splatting still faces substantial challenges, including slow processing, high computational costs, and limited geometric accuracy. These core issues arise from its inherently unstructured design and the absence of efficient parallelization. To overcome these challenges simultaneously, we introduce \textbf{CityGS-\(\mathcal{X}\)}, a scalable architecture built on a novel parallelized hybrid hierarchical 3D representation (PH$^2$-3D). As an early attempt, CityGS-\(\mathcal{X}\) abandons the cumbersome merge-and-partition process and instead adopts a newly-designed batch-level multi-task rendering process. This architecture enables efficient multi-GPU rendering through dynamic Level-of-Detail voxel allocations, significantly improving scalability and performance. To further enhance both overall quality and geometric accuracy, CityGS-\(\mathcal{X}\) presents a progressive RGB-Depth-Normal training strategy. This approach enhances 3D consistency by jointly optimizing appearance and geometry representation through multi-view constraints and off-the-shelf depth priors within batch-level training. Through extensive experiments, CityGS-\(\mathcal{X}\) consistently outperforms existing methods in terms of faster training times, larger rendering capacities, and more accurate geometric details in large-scale scenes. Notably, CityGS-\(\mathcal{X}\) can train and render a 5,000+ image scene with only 4\(\times\)4090 GPUs in just 5 hours, while many other methods even encounter Out-Of-Memory (OOM) issues during rendering, making CityGS-\(\mathcal{X}\) a more accessible and scalable solution for this task.
Poster
Yujeong Chae · Heejun Park · Hyeonseong Kim · Kuk-Jin Yoon

[ Exhibit Hall I ]

Abstract
Robust 3D object detection across diverse weather conditions is crucial for safe autonomous driving, and RADAR is increasingly leveraged for its resilience in adverse weather. Recent advancements have explored 4D RADAR and LiDAR-RADAR fusion to enhance 3D perception capabilities, specifically targeting weather robustness. However, existing methods often handle Doppler in ways that are not well-suited for multi-modal settings or lack tailored encoding strategies, hindering effective feature fusion and performance. To address these shortcomings, we propose a novel Doppler-aware LiDAR-4D RADAR fusion (DLRFusion) framework for robust 3D object detection. We introduce a multi-path iterative interaction module that integrates LiDAR, RADAR power, and Doppler, enabling a structured feature fusion process. Doppler highlights dynamic regions, refining RADAR power and enhancing LiDAR features across multiple stages, improving detection confidence. Extensive experiments on the K-RADAR dataset demonstrate that our approach effectively exploits Doppler information, achieving state-of-the-art performance in both normal and adverse weather conditions. The code will be made publicly available.
Poster
Mingfang Zhang · Ryo Yonetani · Yifei Huang · Liangyang Ouyang · Ruicong Liu · Yoichi Sato

[ Exhibit Hall I ]

Abstract
This paper presents a novel inertial localization framework named Egocentric Action-aware Inertial Localization (EAIL), which leverages egocentric action cues from head-mounted IMU signals to localize the target individual within a 3D point cloud. Human inertial localization is challenging due to IMU sensor noise that causes trajectory drift over time. The diversity of human actions further complicates IMU signal processing by introducing various motion patterns. Nevertheless, we observe that some actions observed through the head-mounted IMU correlate with spatial environmental structures (e.g., bending down to look inside an oven, washing dishes next to a sink), thereby serving as spatial anchors to compensate for the localization drift. The proposed EAIL framework learns such correlations via hierarchical multi-modal alignment. By assuming that the 3D point cloud of the environment is available, it contrastively learns modality encoders that align short-term egocentric action cues in IMU signals with local environmental features in the point cloud. These encoders are then used in reasoning the IMU data and the point cloud over time and space to perform inertial localization. Interestingly, these encoders can further be utilized to recognize the corresponding sequence of actions as a by-product. Extensive experiments demonstrate the effectiveness of the proposed framework over state-of-the-art …
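A sketch of the hierarchical alignment's core ingredient, contrastive (symmetric InfoNCE) alignment between short-term IMU windows and local point-cloud features. The stand-in MLP encoders, input shapes, and temperature are assumptions.

```python
# Sketch of contrastively aligning short-term IMU action cues with local
# point-cloud features (symmetric InfoNCE). Encoders are stand-in MLPs; input
# shapes and the temperature are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

imu_encoder = nn.Sequential(nn.Flatten(), nn.Linear(6 * 50, 128))   # 50 steps of 6-axis IMU
pcl_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 256, 128))  # 256 local points

def clip_style_loss(imu_batch, pcl_batch, temperature=0.07):
    zi = F.normalize(imu_encoder(imu_batch), dim=1)
    zp = F.normalize(pcl_encoder(pcl_batch), dim=1)
    logits = zi @ zp.t() / temperature
    targets = torch.arange(len(zi))
    # matched (IMU window, local point cloud) pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

loss = clip_style_loss(torch.randn(8, 6, 50), torch.randn(8, 3, 256))
loss.backward()
```

At inference, the aligned embeddings would let IMU-observed actions be matched against environment structures to anchor the drifting trajectory.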
Poster
Kaiwen Zhang · Zhenyu Tang · Xiaotao Hu · Xingang Pan · Xiaoyang Guo · Yuan Liu · Jingwei Huang · Li Yuan · Qian Zhang · XIAOXIAO LONG · Xun Cao · Wei Yin

[ Exhibit Hall I ]

Abstract
Diffusion models have demonstrated exceptional visual quality in video generation, making them promising for autonomous driving world modeling. However, existing video diffusion-based world models struggle with flexible-length, long-horizon predictions and integrating trajectory planning. This is because conventional video diffusion models rely on global joint distribution modeling of fixed-length frame sequences rather than sequentially constructing localized distributions at each timestep. In this work, we propose Epona, an autoregressive diffusion world model that enables localized spatiotemporal distribution modeling through two key innovations: 1) Decoupled spatiotemporal factorization that separates temporal dynamics modeling from fine-grained future world generation, and 2) Modular trajectory and video prediction that seamlessly integrate motion planning with visual modeling in an end-to-end framework. Our architecture enables high-resolution, long-duration generation while introducing a novel chain-of-forward training strategy to address error accumulation in autoregressive loops. Experimental results demonstrate state-of-the-art performance with 7.4\% FVD improvement and minutes longer prediction duration compared to prior works. The learned world model further serves as a real-time motion planner, outperforming strong end-to-end planners on NAVSIM benchmarks.
Poster
Jiazhe Guo · Yikang Ding · Xiwu Chen · Shuo Chen · Bohan Li · Yingshuang Zou · Xiaoyang Lyu · Feiyang Tan · Xiaojuan Qi · Zhiheng Li · Hao Zhao

[ Exhibit Hall I ]

Abstract
Current generative models struggle to synthesize dynamic 4D driving scenes that simultaneously support temporal extrapolation and spatial novel view synthesis (NVS) without per-scene optimization. A key challenge lies in finding an efficient and generalizable geometric representation that seamlessly connects temporal and spatial synthesis. To address this, we propose DiST-4D, the first disentangled spatiotemporal diffusion framework for 4D driving scene generation, which leverages metric depth as the core geometric representation. DiST-4D decomposes the problem into two diffusion processes: DiST-T, which predicts future metric depth and multi-view RGB sequences directly from past observations, and DiST-S, which enables spatial NVS by training only on existing viewpoints while enforcing cycle consistency. This cycle consistency mechanism introduces a forward-backward rendering constraint, reducing the generalization gap between observed and unseen viewpoints. Metric depth is essential for both reliable forecasting and accurate spatial NVS, as it provides a view-consistent geometric representation that generalizes well to unseen perspectives. Experiments demonstrate that DiST-4D achieves state-of-the-art performance in both temporal prediction and NVS tasks, while also delivering competitive performance in planning-related evaluations. The code is available in the supplementary.
Poster
Zefu Lin · Wenbo Chen · Xiaojuan Jin · Yuran Yang · Lue Fan · YIXIN ZHANG · Yufeng Zhang · Zhaoxiang Zhang

[ Exhibit Hall I ]

Abstract
Unmanned Aerial Vehicle (UAV) swarm systems necessitate efficient collaborative perception mechanisms for diverse operational scenarios. Current Bird's Eye View (BEV)-based approaches exhibit two main limitations: bounding-box representations fail to capture complete semantic and geometric information of the scene, and their performance significantly degrades when encountering undefined or occluded objects. To address these limitations, we propose a novel multi-UAV collaborative occupancy prediction framework. Our framework effectively preserves 3D spatial structures and semantics by integrating a Spatial-Aware Feature Encoder and Cross-Agent Feature Integration. To enhance efficiency, we further introduce Altitude-Aware Feature Reduction to compactly represent scene information, along with a Dual-Mask Perceptual Guidance mechanism to adaptively select features and reduce communication overhead. Due to the absence of suitable benchmark datasets, we extend three datasets for evaluation: two virtual datasets (Air-to-Pred-Occ and UAV3D-Occ) and one real-world dataset (GauUScene-Occ). Experimental results demonstrate that our method achieves state-of-the-art accuracy, significantly outperforming existing collaborative methods while reducing communication overhead to a fraction of that of previous approaches.
Poster
Lilika Makabe · Hiroaki Santo · Fumio Okura · Michael Brown · Yasuyuki Matsushita

[ Exhibit Hall I ]

Abstract
This paper introduces a practical and accurate calibration method for camera spectral sensitivity using a diffraction grating. Accurate calibration of camera spectral sensitivity is crucial for various computer vision tasks, including color correction, illumination estimation, and material analysis. Unlike existing approaches that require specialized narrow-band filters or reference targets with known spectral reflectances, our method only requires an uncalibrated diffraction grating sheet, readily available off-the-shelf. By capturing images of the direct illumination and its diffracted pattern through the grating sheet, our method estimates both the camera's spectral sensitivity and the diffraction grating parameters in a closed-form manner. Experiments on synthetic and real-world data demonstrate that our approach outperforms reference target-based methods, underscoring its effectiveness and practicality.
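As a rough illustration of the linear measurement model behind spectral sensitivity calibration (not the paper's closed-form diffraction-grating solution), one can recover a per-channel sensitivity curve from observed responses and per-wavelength radiance samples by regularized least squares; the matrix shapes, smoothness weight, and function name below are assumptions.

import numpy as np

def estimate_sensitivity(L, y, smooth=1e-2):
    """L: (N, W) spectral radiance samples per observation; y: (N,) responses of one color channel."""
    W = L.shape[1]
    D = np.diff(np.eye(W), n=2, axis=0)           # (W-2, W) second-difference smoothness operator
    A = np.vstack([L, np.sqrt(smooth) * D])
    b = np.concatenate([y, np.zeros(D.shape[0])])
    s, *_ = np.linalg.lstsq(A, b, rcond=None)     # solve y ≈ L s with a smooth s
    return np.clip(s, 0.0, None)                  # physical sensitivities are non-negative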
Poster
Tianli Liao · Chenyang Zhao · Lei Li · Heling Cao

[ Exhibit Hall I ]

Abstract
Seam cutting has shown significant effectiveness in the composition phase of image stitching, particularly for scenarios involving parallax. However, conventional implementations typically position seam-cutting as a downstream process contingent upon successful image alignment. This approach inherently assumes the existence of locally aligned regions where visually plausible seams can be established. Current alignment methods frequently fail to satisfy this prerequisite in large parallax scenarios despite considerable research efforts dedicated to improving alignment accuracy. In this paper, we propose an alignment-compensation paradigm that dissociates seam quality from initial alignment accuracy by integrating a Local Patch Alignment Module (LPAM) into the seam-cutting pipeline. Concretely, given the aligned images with an estimated initial seam, our method first identifies low-quality pixels along the seam through a seam quality assessment, then performs localized SIFT-flow alignment on the critical patches enclosing these pixels. Finally, we recomposite the aligned patches using adaptive seam-cutting and merge them into the original aligned images to generate the final mosaic. Comprehensive experiments on large parallax stitching datasets demonstrate that LPAM significantly enhances stitching quality while maintaining computational efficiency.
Poster
Yifan Lu · Xuanchi Ren · Jiawei Yang · Tianchang Shen · Jay Zhangjie Wu · Jun Gao · Yue Wang · Siheng Chen · Mike Chen · Sanja Fidler · Jiahui Huang

[ Exhibit Hall I ]

Abstract
We present InfiniCube, a scalable and controllable method to generate unbounded and dynamic 3D driving scenes with high fidelity. Previous methods for scene generation are constrained either by their applicability to indoor scenes or by their lack of controllability. In contrast, we take advantage of recent advances in 3D and video generative models to achieve large dynamic scene generation that allows flexible controls through HD maps, vehicle bounding boxes, and text descriptions. First, we construct a map-conditioned 3D voxel generative model to unleash its power for unbounded voxel world generation. Then, we re-purpose a video model and ground it on the voxel world through a set of pixel-aligned guidance buffers, synthesizing a consistent appearance on long-video generation for large-scale scenes. Finally, we propose a fast feed-forward approach that employs both voxel and pixel branches to lift videos to dynamic 3D Gaussians with controllable objects. Our method can generate controllable and realistic 3D driving scenes, and extensive experiments validate the effectiveness of our model design. Code will be released upon acceptance.
Poster
Kangan Qian · Jinyu Miao · Xinyu Jiao · Ziang Luo · Zheng Fu · Yining Shi · Yunlong Wang · Kun Jiang · Diange Yang

[ Exhibit Hall I ]

Abstract
Reliable spatial and motion perception is essential for safe autonomous navigation. Recently, class-agnostic motion prediction on bird's-eye view (BEV) cell grids derived from LiDAR point clouds has gained significant attention. However, existing frameworks typically perform cell classification and motion prediction on a per-pixel basis, neglecting important motion field priors such as rigidity constraints, temporal consistency, and future interactions between agents. These limitations lead to degraded performance, particularly in sparse and distant regions. To address these challenges, we introduce $\textbf{PriorMotion}$, an innovative generative framework designed for class-agnostic motion prediction that integrates essential motion priors by modeling them as distributions within a structured latent space. Specifically, our method captures structured motion priors using raster-vector representations and employs a variational autoencoder with distinct dynamic and static components to learn future motion distributions in the latent space. Experiments on the nuScenes dataset demonstrate that $\textbf{PriorMotion}$ outperforms state-of-the-art methods across both traditional metrics and our newly proposed evaluation criteria. Notably, we achieve improvements of approximately 15.24\% in accuracy for fast-moving objects, a 3.59\% increase in generalization, a reduction of 0.0163 in motion stability, and a 31.52\% reduction in prediction errors in distant regions. Further validation on FMCW LiDAR sensors confirms the robustness of our approach.
Poster
Qingyuan Zhou · Yuehu Gong · Weidong Yang · Jiaze Li · Yeqi Luo · Baixin Xu · Shuhao Li · Ben Fei · Ying He

[ Exhibit Hall I ]

Abstract
Novel view synthesis (NVS) and surface reconstruction (SR) are essential tasks in 3D Gaussian Splatting (3D-GS). Despite recent progress, these tasks are often addressed independently, with GS-based rendering methods struggling under diverse light conditions and failing to produce accurate surfaces, while GS-based reconstruction methods frequently compromise rendering quality. This raises a central question: must rendering and reconstruction always involve a trade-off? To address this, we propose $MGSR$, a 2D/3D $M$utual-boosted $G$aussian Splatting for $S$urface $R$econstruction that enhances both rendering quality and 3D reconstruction accuracy. MGSR introduces two branches—one based on 2D-GS and the other on 3D-GS. The 2D-GS branch excels in surface reconstruction, providing precise geometry information to the 3D-GS branch. Leveraging this geometry, the 3D-GS branch employs a geometry-guided illumination decomposition module that captures reflected and transmitted components, enabling realistic rendering under varied light conditions. Using the transmitted component as supervision, the 2D-GS branch also achieves high-fidelity surface reconstruction. Throughout the optimization process, the 2D-GS and 3D-GS branches undergo alternating optimization, providing mutual supervision. Prior to this, each branch completes an independent warm-up phase, with an early stopping strategy implemented to reduce computational costs. We evaluate MGSR on a diverse set of synthetic and real-world datasets, at both object …
Poster
Zhenxin Li · Shihao Wang · Shiyi Lan · Zhiding Yu · Zuxuan Wu · Jose M. Alvarez

[ Exhibit Hall I ]

Abstract
End-to-end autonomous driving research currently faces a critical challenge in bridging the gap between open-loop training and closed-loop deployment. Current approaches are trained to predict trajectories in an open-loop environment, which struggle with quick reactions to other agents in closed-loop environments and risk generating kinematically infeasible plans due to the gap between open-loop training and closed-loop driving. In this paper, we introduce Hydra-NeXt, a novel multi-branch planning framework that unifies trajectory prediction, control prediction, and a trajectory refinement network in one model. Unlike current open-loop trajectory prediction models that only handle general-case planning, Hydra-NeXt further utilizes a control decoder to focus on short-term actions, which enables faster responses to dynamic situations and reactive agents. Moreover, we propose the Trajectory Refinement module to augment and refine the planning decisions by effectively adhering to kinematic constraints in closed-loop environments. This unified approach bridges the gap between open-loop training and closed-loop driving, demonstrating superior performance of 65.89 Driving Score (DS) and 48.20\% Success Rate (SR) on the Bench2Drive dataset without relying on external experts for data collection. Hydra-NeXt surpasses the previous state-of-the-art by 22.98 DS and 17.49 SR, marking a significant advancement in autonomous driving.
Poster
Jiayuan Lu · Rengan Xie · Zixuan Xie · Zhizhen Wu · Dianbing Xi · Qi Ye · Rui Wang · Hujun Bao · Yuchi Huo

[ Exhibit Hall I ]

Abstract
Realistic images are usually produced by simulating light transportation results of 3D scenes using rendering engines. This framework can precisely control the output but is usually weak at producing photo-like images. Alternatively, diffusion models have seen great success in photorealistic image generation by leveraging priors from large datasets of real-world images but lack affordance controls. Promisingly, the recent ControlNet enables flexible control of the diffusion model without degrading its generation quality. In this work, we introduce IntrinsicControlNet, an intrinsically controllable image generation framework that enables easily generating photorealistic images from precise and explicit control, similar to a rendering engine, by using intrinsic images such as material properties, geometric details, and lighting as network inputs. Beyond this, we notice that there is a domain gap between the synthetic and real-world datasets, and therefore, naively blending these datasets yields domain confusion. To address this problem, we present a cross-domain control architecture that extracts control information from synthetic datasets, and control and content information from real-world datasets. This bridges the domain gap between real-world and synthetic datasets, enabling the blending or editing of 3D assets and real-world photos to support various interesting applications. Experiments and user studies demonstrate that our method can generate …
Poster
Byeongjun Park · Hyojun Go · Hyelin Nam · Byung-Hoon Kim · Hyungjin Chung · Changick Kim

[ Exhibit Hall I ]

Abstract
Recent progress in 3D/4D scene generation emphasizes the importance of physical alignment throughout video generation and scene reconstruction. However, existing methods improve the alignment separately at each stage, making it difficult to manage subtle misalignments arising from another stage. Here, we present SteerX, a zero-shot inference-time steering method that unifies scene reconstruction into the generation process, tilting data distributions toward better geometric alignment. To this end, we introduce two geometric reward functions for 3D/4D scene generation by using pose-free feed-forward scene reconstruction models. Through extensive experiments, we demonstrate the effectiveness of SteerX in improving 3D/4D scene generation.
Poster
Wenhao Zhang · Hao Zhu · Delong Wu · Di Kang · Linchao Bao · Xun Cao · Zhan Ma

[ Exhibit Hall I ]

Abstract
Pursuing a continuous visual representation that offers flexible frequency modulation and fast rendering speed has recently garnered increasing attention in the fields of 3D vision and graphics. However, existing representations often rely on frequency guidance or complex neural network decoding, leading to spectrum loss or slow rendering. To address these limitations, we propose **WIPES**, a universal **W**avelet-based v**I**sual **P**rimitiv**ES** for representing multi-dimensional visual signals. Building on the spatial-frequency localization advantages of wavelets, WIPES effectively captures both the low-frequency "forest" and the high-frequency "trees." Additionally, we develop a wavelet-based differentiable rasterizer to achieve fast visual rendering. Experimental results on various visual tasks, including 2D image representation, 5D static and 6D dynamic novel view synthesis, demonstrate that WIPES, as a visual primitive, offers higher rendering quality and faster inference than INR-based methods, and outperforms Gaussian-based representations in rendering quality.
Poster
tianyu zhang · Haobo Jiang · jian Yang · Jin Xie

[ Exhibit Hall I ]

Abstract
Point cloud interpolation aims to recover intermediate frames for temporally smoothing a point cloud sequence. However, real-world challenges, such as uneven or large scene motions, cause existing methods to struggle with limited interpolation precision. To address this, we introduce DiffPCI, a novel diffusion interpolation model that formulates the frame interpolation task as a progressive denoising diffusion process. Training DiffPCI involves two key stages: a forward interpolation diffusion process and a reverse interpolation denoising process. In the forward process, the clean intermediate frame is progressively transformed into a noisy one through continuous Gaussian noise injection. The reverse process then focuses on training a denoiser to gradually refine this noisy frame back to the ground-truth frame. In particular, we derive a point cloud interpolation-specific variational lower bound as our optimization objective for denoiser training. Furthermore, to alleviate the interpolation error especially in highly dynamic scenes, we develop a novel full-scale, dual-branch denoiser that enables more comprehensive front-back frame information fusion for robust bi-directional interpolation. Extensive experiments demonstrate that DiffPCI significantly outperforms current state-of-the-art frame interpolation methods (e.g. 27\% and 860\% reduction in Chamfer Distance and Earth Mover’s Distance in Nuscenes).
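For reference, the forward (noising) process described above follows the standard DDPM formulation; the sketch below applies it to an intermediate point-cloud frame, with a generic noise schedule rather than DiffPCI's actual settings.

import torch

def forward_diffuse(x0, t, alpha_bar):
    """x0: clean intermediate frame (N, 3); t: timestep index; alpha_bar: cumulative products of (1 - beta)."""
    noise = torch.randn_like(x0)
    xt = alpha_bar[t].sqrt() * x0 + (1.0 - alpha_bar[t]).sqrt() * noise
    return xt, noise   # the denoiser is trained to recover the clean frame (or the noise) from xt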
Poster
Zixiang Ai · Zhenyu Cui · Yuxin Peng · Jiahuan Zhou

[ Exhibit Hall I ]

Abstract
Pre-trained point cloud analysis models have shown promising advancements in various downstream tasks, yet their effectiveness typically suffers from low-quality point clouds (i.e., noise and incompleteness), a common issue in real-world data due to casual object occlusions and unsatisfactory data collected by 3D sensors. To this end, existing methods focus on enhancing point cloud quality by developing dedicated denoising and completion models. However, due to the isolation between the point cloud enhancement tasks and downstream tasks, these methods fail to work in various real-world domains. In addition, the conflicting objectives of the point cloud denoising and completion tasks further limit the ensemble paradigm's ability to preserve critical geometric features in real scenarios. To tackle the above challenges, we propose a unified point-level prompting method that reformulates point cloud denoising and completion as a prompting mechanism, enabling robust analysis in a parameter-efficient manner. We start by introducing a Rectification Prompter to adapt to noisy points through the predicted rectification vector prompts, effectively filtering noise while preserving intricate geometric features essential for accurate analysis. Sequentially, we further incorporate a Completion Prompter to generate auxiliary point prompts based on the rectified point clouds, facilitating their robustness and adaptability. Finally, a Shape-Aware Unit …
Poster
Yuxin Deng · Kaining Zhang · Linfeng Tang · Jiaqi Yang · Jiayi Ma

[ Exhibit Hall I ]

Abstract
Establishing dense correspondences is crucial yet computationally expensive in many multi-view tasks. Although state-of-the-art dense matchers typically adopt a coarse-to-fine scheme to mitigate the computational cost, their efficiency is often compromised by the use of heavy models with redundant feature representations, which are essential for desirable results. In this work, we introduce adaptive refinement gathering, which significantly alleviates such computational burdens without sacrificing too much accuracy. The pipeline consists of: (i) a context-aware offset estimator, which exploits content information from rough features to enhance offset decoding accuracy; (ii) a locally consistent match rectifier, which corrects erroneous initial matches with local consistency; and (iii) a locally consistent upsampler, which mitigates over-smoothing at depth-discontinuous edges. Additionally, we propose an adaptive gating strategy, combined with the nature of local consistency, to dynamically modulate the contribution of different components and pixels, enabling adaptive gradient backpropagation and fully unleashing the network's capacity. Compared to the state-of-the-art, our lightweight network, termed ArgMatch, achieves competitive performance on MegaDepth, while using 90% fewer parameters, 73% less computation time, and 84% lower memory cost.
Poster
Baihui Xiao · Chengjian Feng · Zhijian Huang · Feng yan · Yujie Zhong · Lin Ma

[ Exhibit Hall I ]

Abstract
Collecting real-world data for rare high-risk scenarios, long-tailed driving events, and complex interactions remains challenging, leading to poor performance of existing autonomous driving systems in these critical situations. In this paper, we propose SimBoost, which improves real-world driving in critical situations by utilizing simulated hard cases. First, we develop a simulated dataset called Hard-case Augmented Synthetic Scenarios (HASS), which covers 13 high-risk edge-case categories, as well as balanced environmental conditions such as day/night and sunny/rainy. Second, we introduce Scenario-aware Prompt Engineering (SPE) and an Image-to-Ego Encoder (I2E Encoder) to enable multimodal large language models to effectively learn real-world challenging driving skills from HASS, by adapting to environmental deviations and hardware differences between real-world and simulated scenarios. Extensive experiments are conducted on nuScenes, where SimBoost improves driving performance in challenging scenarios by about 50%, achieving state-of-the-art results in real-world open-loop planning. Qualitative results further demonstrate the effectiveness of SimBoost in better managing rare high-risk driving scenarios.
Poster
Takahiro Kushida · Kenichiro Tanaka

[ Exhibit Hall I ]

Abstract
This paper introduces a novel method for detailed 3D shape reconstruction utilizing thermal polarization cues. Unlike state-of-the-art methods, the proposed approach is independent of illumination, material properties, and heating processes. In this paper, we formulate a general theory of polarization observation and show that long-wave infrared (LWIR) polarimetric imaging is free from the ambiguities that affect visible polarization analyses. Subsequently, we propose a method for recovering detailed 3D shapes using thermal polarimetric images, showing that our approach effectively reconstructs fine details on heterogeneous materials and outperforms existing techniques.
Poster
Bo-Hsu Ke · You-Zhe Xie · Yu-Lun Liu · Wei-Chen Chiu

[ Exhibit Hall I ]

Abstract
3D scene representation methods like Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) have significantly advanced novel view synthesis. As these methods become prevalent, addressing their vulnerabilities becomes critical. We analyze 3DGS robustness against image-level poisoning attacks and propose a novel density-guided poisoning method. Our method strategically injects Gaussian points into low-density regions identified via Kernel Density Estimation (KDE), embedding viewpoint-dependent illusory objects clearly visible from poisoned views while minimally affecting innocent views. Additionally, we introduce an adaptive noise strategy to disrupt multi-view consistency, further enhancing attack effectiveness. We propose a KDE-based evaluation protocol to assess attack difficulty systematically, enabling objective benchmarking for future research. Extensive experiments demonstrate our method's superior performance compared to state-of-the-art techniques.
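The density-guided selection step can be pictured with a plain kernel density estimate over the existing Gaussian centers, keeping the lowest-density candidate locations as injection sites; the function below is an illustrative sketch with assumed shapes, not the paper's implementation.

import numpy as np
from scipy.stats import gaussian_kde

def lowest_density_candidates(gaussian_centers, candidates, k=100):
    """gaussian_centers: (N, 3) existing 3DGS means; candidates: (M, 3) query positions."""
    kde = gaussian_kde(gaussian_centers.T)        # KDE fitted on the scene's Gaussian centers
    density = kde(candidates.T)                   # (M,) estimated density at each candidate
    return candidates[np.argsort(density)[:k]]    # k lowest-density positions for injection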
Poster
Chin-Yang Lin · Cheng Sun · Fu-En Yang · Min-Hung Chen · Yen-Yu Lin · Yu-Lun Liu

[ Exhibit Hall I ]

Abstract
LongSplat addresses critical challenges in novel view synthesis (NVS) from casually captured long videos characterized by irregular camera motion, unknown camera poses, and expansive scenes. Current methods often suffer from pose drift, inaccurate geometry initialization, and severe memory limitations. To address these issues, we introduce LongSplat, a robust unposed 3D Gaussian Splatting framework featuring: (1) Incremental Joint Optimization that concurrently optimizes camera poses and 3D Gaussians to avoid local minima and ensure global consistency; (2) a Tracking and Alignment Module leveraging learned 3D priors, which combines correspondence-guided PnP initialization with photometric refinement for accurate camera tracking; and (3) an adaptive Octree Anchor Formation mechanism that dynamically adjusts anchor densities, significantly reducing memory usage. Extensive experiments on challenging benchmarks (Tanks and Temples, Free, and Hike datasets) demonstrate that LongSplat achieves state-of-the-art results, substantially improving rendering quality, pose accuracy, and computational efficiency compared to prior approaches.
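The correspondence-guided PnP initialization mentioned in (2) can be sketched with OpenCV's RANSAC-PnP; the inputs (3D points from the learned prior, 2D keypoints, intrinsics) and the threshold below are placeholders, not LongSplat's actual module.

import cv2
import numpy as np

def init_pose(pts3d, pts2d, K):
    """pts3d: (N, 3) scene points; pts2d: (N, 2) matched keypoints; K: (3, 3) intrinsics."""
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        pts3d.astype(np.float32), pts2d.astype(np.float32), K, None,
        reprojectionError=3.0, flags=cv2.SOLVEPNP_ITERATIVE)
    R, _ = cv2.Rodrigues(rvec)                    # rotation matrix for the initial camera pose
    return ok, R, tvec, inliers                   # the pose is then refined photometrically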
Poster
Chaojun Ni · Xiaofeng Wang · Zheng Zhu · Weijie Wang · Haoyun Li · Guosheng Zhao · Jie Li · Wenkang Qin · Guan Huang · Wenjun Mei

[ Exhibit Hall I ]

Abstract
Interactive 3D generation is gaining momentum and capturing extensive attention for its potential to create immersive virtual experiences. However, a critical challenge in current 3D generation technologies lies in achieving real-time interactivity. To address this issue, we introduce WonderTurbo, the first real-time interactive 3D scene generation framework capable of generating novel perspectives of 3D scenes within 0.72 seconds. Specifically, WonderTurbo accelerates both geometric and appearance modeling in 3D scene generation. In terms of geometry, we propose StepSplat, an innovative method that constructs efficient 3D geometric representations through dynamic updates, each taking only 0.26 seconds. Additionally, we design QuickDepth, a lightweight depth completion module that provides consistent depth input for StepSplat, further enhancing geometric accuracy. For appearance modeling, we develop FastPaint, a two-step diffusion model tailored for instant inpainting, which focuses on maintaining spatial appearance consistency. Experimental results demonstrate that WonderTurbo achieves a remarkable 15$\times$ speedup compared to baseline methods, while preserving excellent spatial consistency and delivering high-quality output.
Poster
Jingqiao Xiu · Yicong Li · Na Zhao · Han Fang · Xiang Wang · Angela Yao

[ Exhibit Hall I ]

Abstract
View-Guided Point Cloud Completion (VG-PCC) aims to reconstruct complete point clouds from partial inputs by referencing single-view images. While existing VG-PCC models perform well on in-class predictions, they exhibit significant performance drops when generalizing to unseen categories. We identify two key limitations underlying this challenge: (1) Current encoders struggle to bridge the substantial modality gap between images and point clouds. Consequently, their learned representations often lack robust cross-modal alignment and over-rely on superficial class-specific patterns. (2) Current decoders refine global structures holistically, overlooking local geometric patterns that are class-agnostic and transferable across categories. To address these issues, we present a novel generalizable VG-PCC framework for unseen categories based on Geometric Alignment and Prior Modulation (GAPM). First, we introduce a Geometry Aligned Encoder that lifts reference images into 3D space via depth maps for natural alignment with the partial point clouds. This reduces dependency on class-specific RGB patterns that hinder generalization to unseen classes. Second, we propose a Prior Modulated Decoder that incorporates class-agnostic local priors to reconstruct shapes on a regional basis. This allows the adaptive reuse of learned geometric patterns that promote generalization to unseen classes. Extensive experiments validate that GAPM consistently outperforms existing models on both seen and, …
Poster
Mao Mao · Xujie Shen · Guyuan Chen · Boming Zhao · Jiarui Hu · Hujun Bao · Zhaopeng Cui

[ Exhibit Hall I ]

Abstract
Neural 3D modeling and novel view synthesis with Neural Radiance Fields (NeRF) or 3D Gaussian Splatting (3DGS) typically requires the multi-view images with wide baselines and accurate camera poses as input. However, scenarios with accidental camera motions are rarely studied. In this paper, we propose AccidentalGS, the first method for neural 3D modeling and novel view synthesis from accidental camera motions. To achieve this, we present a novel joint optimization framework that considers geometric and photometric errors, using a simplified camera model for stability. We also introduce a novel online adaptive depth-consistency loss to prevent the overfitting of the Gaussian model to input images. Extensive experiments on both synthetic and real-world datasets show that AccidentalGS achieves more accurate camera poses and realistic novel views compared to existing methods, and supports 3D modeling and neural rendering even for the Moon with telescope-like images.
Poster
Zhixuan Liu · Haokun Zhu · Rui Chen · Jonathan Francis · Soonmin Hwang · Ji Zhang · Jean Oh

[ Exhibit Hall I ]

Abstract
We introduce a novel diffusion-based approach for generating privacy-preserving digital twins of multi-room indoor environments from depth images only. Central to our approach is a novel Multi-view Overlapped Scene Alignment with Implicit Consistency (MOSAIC) model that explicitly considers cross-view dependencies within the same scene in the probabilistic sense. MOSAIC operates through a novel inference-time optimization that avoids the error accumulation common in sequential approaches and the single-room constraints of panorama-based approaches. MOSAIC scales to complex scenes with zero extra training and provably reduces the variance during denoising processes when more overlapping views are added, leading to improved generation quality. Experiments show that MOSAIC outperforms state-of-the-art baselines on image fidelity metrics in reconstructing complex multi-room environments.
Poster
Tianao Li · Manxiu Cui · Cheng Ma · Emma Alexander

[ Exhibit Hall I ]

Abstract
Photoacoustic computed tomography (PACT) is a non-invasive imaging modality, similar to ultrasound, with wide-ranging medical applications. Conventional PACT images are degraded by wavefront distortion caused by the heterogeneous speed of sound (SOS) in tissue. Accounting for these effects can improve image quality and provide medically useful information, but measuring the SOS directly is burdensome and the existing joint reconstruction method is computationally expensive. Traditional supervised learning techniques are currently inaccessible in this data-starved domain. In this work, we introduce an efficient, self-supervised joint reconstruction method that recovers SOS and high-quality images using a differentiable physics model to solve the semi-blind inverse problem. The SOS, parametrized by either a pixel grid or a neural field (NF), is updated directly by backpropagation. Our method removes SOS aberrations more accurately and 35x faster than the current SOTA. We demonstrate the success of our method quantitatively in simulation and qualitatively on experimentally-collected and in-vivo data.
Poster
Shubhendu Jena · Amine Ouasfi · Mae Younes · Adnane Boukhayma

[ Exhibit Hall I ]

Abstract
We present a method for sparse-view reconstruction with surface element splatting that runs within 2 minutes on a consumer-grade GPU. While few methods address sparse radiance field learning from noisy or unposed sparse cameras, shape recovery remains relatively underexplored in this setting. Several radiance and shape learning test-time optimization methods address the sparse posed setting by learning data priors or using combinations of external monocular geometry priors. Differently, we propose an efficient and simple pipeline harnessing a single recent 3D foundation model. We leverage its various task heads, notably point maps and camera initializations to instantiate a bundle-adjusting 2D Gaussian Splatting (2DGS) model, and image correspondences to guide camera optimization during 2DGS training. Key to our contribution is a novel formulation of splatted color variance along rays, which can be computed efficiently. Reducing this moment in training leads to more accurate shape reconstructions. We demonstrate state-of-the-art performance in the sparse uncalibrated setting on reconstruction and novel view synthesis benchmarks based on established multi-view datasets.
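One way to read the "splatted color variance along rays" is as the second moment of per-splat colors under the usual alpha-compositing weights; the numpy sketch below computes that generic moment for a single ray, and the normalization is an assumption rather than the paper's exact formulation.

import numpy as np

def splatted_color_variance(colors, alphas, eps=1e-8):
    """colors: (K, 3) per-splat colors along one ray, front to back; alphas: (K,) opacities."""
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))   # transmittance before each splat
    w = alphas * trans                                               # alpha-compositing weights
    w_sum = w.sum() + eps
    mean = (w[:, None] * colors).sum(0) / w_sum                      # normalized composited color
    var = (w[:, None] * (colors - mean) ** 2).sum(0) / w_sum         # weighted color variance per channel
    return var.sum()                                                 # scalar moment to penalize in training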
Poster
Hanlin Li · Wenming Weng · Yueyi Zhang · Zhiwei Xiong

[ Exhibit Hall I ]

Abstract
Scene flow provides the fundamental information of the scene dynamics. Existing scene flow estimation methods typically rely on the correlation between only a consecutive point cloud pair, limiting them to the instantaneous state of the scene and leaving them to face challenges in real-world scenarios with factors like occlusion, noise, and diverse motion of background and foreground. In this paper, we study the joint sequential scene flow estimation and future scene flow prediction on point cloud sequences. The expanded sequential input introduces long-term and high-order motion information. We propose GenFlow3D, a recurrent neural network model which integrates diffusion in the decoder to better incorporate the two tasks and enhance the ability to extract general motion patterns. A transformer-based denoising network is adopted to help capture useful information. Depending on the input point clouds, discriminative condition signals are generated to guide the diffusion decoder to switch among different modes specific to scene flow estimation and prediction in a multi-scale manner. GenFlow3D is evaluated on the real-world datasets Nuscenes and Argoverse 2, and demonstrates superior performance compared with the existing methods.
Poster
Edoardo Palladin · Samuel Brucker · Filippo Ghilotti · Praveen Narayanan · Mario Bijelic · Felix Heide

[ Exhibit Hall I ]

Abstract
Outside of urban hubs, autonomous cars and trucks have to master driving on intercity highways. Safe, long-distance highway travel at speeds exceeding 100 km/h demands perception distances of at least 250 m, which is about five times the 50–100 m typically addressed in city driving, to allow sufficient planning and braking margins. Increasing the perception range also makes it possible to extend autonomy from light two-ton passenger vehicles to large-scale forty-ton trucks, which need a longer planning horizon due to their high inertia. However, most existing perception approaches focus on shorter ranges and rely on Bird's Eye View (BEV) representations, which incur quadratic increases in memory and compute costs as distance grows. To overcome this limitation, we build on top of a sparse representation and introduce an efficient 3D encoding of multi-modal and temporal features, along with a novel self-supervised pre-training scheme that enables large-scale learning from unlabeled camera-LiDAR data. Our approach extends perception distances to 250 meters and achieves a 26.6% improvement in mAP for object detection and a 30.5% decrease in Chamfer Distance for LiDAR forecasting compared to existing methods, at distances of up to 250 meters.
Poster
Katja Schwarz · Norman Müller · Peter Kontschieder

[ Exhibit Hall I ]

Abstract
Synthesizing consistent and photorealistic 3D scenes is an open problem in computer vision. Video diffusion models generate impressive videos but cannot directly synthesize 3D representations, i.e., lack 3D consistency in the generated sequences. In addition, directly training generative 3D models is challenging due to a lack of 3D training data at scale. In this work, we present Generative Gaussian Splatting (GGS) – a novel approach that integrates a 3D representation with a pre-trained latent video diffusion model. Specifically, our model synthesizes a feature field parameterized via 3D Gaussian primitives. The feature field is then either rendered to feature maps and decoded into multi-view images, or directly upsampled into a 3D radiance field. We evaluate our approach on two common benchmark datasets for scene synthesis, RealEstate10K and ScanNet++, and find that our proposed GGS model significantly improves both the 3D consistency of the generated multi-view images, and the quality of the generated 3D scenes over all relevant baselines. Compared to a similar model without 3D representation, GGS improves FID on the generated 3D scenes by ~20% on both RealEstate10K and ScanNet++.
Poster
hao si · Ehsan Javanmardi · Manabu Tsukada

[ Exhibit Hall I ]

Abstract
Collaborative perception enables vehicles to overcome individual perception limitations by sharing information, allowing them to see further and through occlusions. In real-world scenarios, models on different vehicles are often heterogeneous due to manufacturer variations. Existing methods for heterogeneous collaborative perception address this challenge by fine-tuning adapters or the entire network to bridge the domain gap. However, these methods are impractical in real-world applications, as each new collaborator must undergo joint training with the ego vehicle on a dataset before inference, or the ego vehicle stores models for all potential collaborators in advance. Therefore, we pose a new question: Can we tackle this challenge directly during inference, eliminating the need for joint training? To answer this, we introduce Progressive Heterogeneous Collaborative Perception (PHCP), a novel framework that formulates the problem as few-shot unsupervised domain adaptation. Unlike previous work, PHCP dynamically aligns features by self-training an adapter during inference, eliminating the need for labeled data and joint training. Extensive experiments on the OPV2V dataset demonstrate that PHCP achieves strong performance across diverse heterogeneous scenarios. Notably, PHCP achieves performance comparable to SOTA methods trained on the entire dataset while using only a small amount of unlabeled data.
Poster
Zhirui Gao · Renjiao Yi · YaQiao Dai · Xuening Zhu · Wei Chen · Kai Xu · Chenyang Zhu

[ Exhibit Hall I ]

Abstract
This paper presents an end-to-end framework for reconstructing 3D parametric curves directly from multi-view edge maps. Contrasting with existing two-stage methods that follow a sequential ``edge point cloud reconstruction and parametric curve fitting'' pipeline, our one-stage approach optimizes 3D parametric curves directly from 2D edge maps, eliminating error accumulation caused by the inherent optimization gap between disconnected stages. However, parametric curves inherently lack suitability for rendering-based multi-view optimization, necessitating a complementary representation that preserves their geometric properties while enabling differentiable rendering. We propose a novel bi-directional coupling mechanism between parametric curves and edge-oriented Gaussian components. This tight correspondence formulates a curve-aware Gaussian representation, CurveGaussian, that enables differentiable rendering of 3D curves, allowing direct optimization guided by multi-view evidence. Furthermore, we introduce a dynamically adaptive topology optimization framework during training to refine curve structures through linearization, merging, splitting, and pruning operations. Comprehensive evaluations on the ABC dataset and real-world benchmarks demonstrate our one-stage method's superiority over two-stage alternatives, particularly in producing cleaner and more robust reconstructions. Additionally, by directly optimizing parametric curves, our method significantly reduces the parameter count during training, achieving both higher efficiency and superior performance compared to existing approaches.
Poster
Ruiqian Li · Siyuan Shen · Suan Xia · Ziheng Wang · Xingyue Peng · Chengxuan Song · Yingsheng Zhu · Tao Wu · Shiying Li · Jingyi Yu

[ Exhibit Hall I ]

Abstract
High quality and high speed videography using Non-Line-of-Sight (NLOS) imaging benefit autonomous navigation, collision prevention, and post-disaster search and rescue tasks. Current solutions have to balance between the frame rate and image quality. High frame rates, for example, can be achieved by reducing either per-point scanning time or scanning density, but at the cost of lowering the information density at individual frames. Fast scanning process further reduces the signal-to-noise ratio and different scanning systems exhibit different distortion characteristics. In this work, we design and employ a new Transient Transformer architecture called TransiT to achieve real-time NLOS recovery under fast scans. TransiT directly compresses the temporal dimension of input transients to extract features, reducing computation costs and meeting high frame rate requirements. It further adopts a feature fusion mechanism as well as employs a spatial-temporal Transformer to help capture features of NLOS transient videos. Moreover, TransiT applies transfer learning to bridge the gap between synthetic and real-measured data. In real experiments, TransiT manages to reconstruct from sparse transients of $16 \times 16$ measured at an exposure time of 0.4 ms per point to NLOS videos at a $64 \times 64$ resolution at 10 frames per second. We will make our code …
Poster
Xu Cao · Takafumi Taketomi

[ Exhibit Hall I ]

Abstract
We propose a neural inverse rendering approach to reconstruct 3D shape, spatially varying BRDF, and lighting parameters from multi-view images captured under varying lighting conditions. Unlike conventional multi-view photometric stereo (MVPS) methods, our approach does not rely on geometric, reflectance, or lighting cues derived from single-view photometric stereo. Instead, we jointly optimize all scene properties end-to-end to directly reproduce raw image observations. We represent both geometry and SVBRDF as neural implicit fields and incorporate shadow-aware volume rendering with physics-based shading. Experiments show that our method outperforms MVPS methods guided by high-quality normal maps and enables photorealistic rendering from novel viewpoints under novel lighting conditions. Our method reconstructs intricate surface details for objects with challenging reflectance properties using view-unaligned OLAT images, which conventional MVPS methods cannot handle.
Poster
songru Yang · Zhenwei Shi · Zhengxia Zou

[ Exhibit Hall I ]

Abstract
Understanding movements in multi-agent scenarios is a fundamental problem in intelligent systems. Previous research assumes complete and synchronized observations. However, real-world partial observation caused by occlusions leads to inevitable model failure, which demands a unified framework for coexisting trajectory prediction, imputation, and recovery. Unlike previous attempts that handled observed and unobserved behaviors in a coupled manner, we explore a decoupled denoising diffusion modeling paradigm with a unidirectional information valve to separate the interference from uncertain behaviors. Building on this, we propose a Unified Masked Trajectory Diffusion model (UniMTD) for arbitrary levels of missing observations. We design a unidirectional attention as a valve unit to control the direction of information flow between the observed and masked areas, gradually refining the missing observations toward a real-world distribution. We construct it into a unidirectional MoE structure to handle varying proportions of missing observations. A Cached Diffusion model is further designed to improve generation quality while reducing computation and time overhead. Our method achieves substantial gains on both human motion and vehicle traffic scenarios. UniMTD achieves a 65% improvement in minADE20 and reaches SOTA with advantages of 98%, 50%, 73%, and 29% across 4 fidelity metrics on out-of-boundary, velocity, and trajectory length. Our code …
Poster
Hugo Blanc · Jean-Emmanuel Deschaud · Alexis Paljic

[ Exhibit Hall I ]

Abstract
RayGauss has recently achieved state-of-the-art results on synthetic and indoor scenes, representing radiance and density fields with irregularly distributed elliptical basis functions rendered via volume ray casting using a Bounding Volume Hierarchy (BVH). However, its computational cost prevents real-time rendering on real-world scenes. Our approach, RayGaussX, builds on RayGauss by introducing key contributions that significantly accelerate both training and inference. Specifically, we incorporate volumetric rendering acceleration strategies such as empty-space skipping and adaptive sampling, enhance ray coherence, and introduce scale regularization to reduce false-positive intersections. Additionally, we propose a new densification criterion that improves density distribution in distant regions, leading to enhanced graphical quality on larger scenes. As a result, RayGaussX achieves 5× to 12× faster training and 50× to 80× higher rendering speeds (FPS) on real-world datasets while improving visual quality by up to +0.56 dB in PSNR. The code will soon be publicly available on GitHub.
Poster
Paul Engstler · Aleksandar Shtedritski · Iro Laina · Christian Rupprecht · Andrea Vedaldi

[ Exhibit Hall I ]

Abstract
In this paper, we address the challenge of generating 3D worlds from textual descriptions. We propose SynCity, a training-free and optimization-free approach, which leverages the geometric precision of pre-trained 3D generative models and the artistic versatility of 2D image generators to create large, high-quality 3D spaces. While most current 3D generative models are object-centric and cannot generate large-scale worlds, we show how 3D and 2D generators can be combined to generate ever-expanding scenes. Through a tile-based grid approach, we allow fine-grained control over the layout and the appearance of scenes. The world is generated tile-by-tile, and each new tile is generated within its world-context and then fused with the scene. SynCity generates compelling and immersive scenes that are rich in detail and diversity.
Poster
Pou-Chun Kung · Skanda Harisha · Ram Vasudevan · Aline Eid · Katherine A. Skinner

[ Exhibit Hall I ]

Abstract
High-fidelity 3D scene reconstruction plays a crucial role in autonomous driving by enabling novel data generation from existing datasets. This allows simulating safety-critical scenarios and augmenting training datasets without incurring further data collection costs. While recent advances in radiance fields have demonstrated promising results in 3D reconstruction and sensor data synthesis using cameras and LiDAR, their potential for radar remains largely unexplored. Radar is crucial for autonomous driving due to its robustness in adverse weather conditions like rain, fog, and snow, where optical sensors often struggle. Although the state-of-the-art radar-based neural representation shows promise for 3D driving scene reconstruction, it performs poorly in scenarios with significant radar noise, including receiver saturation and multipath reflection. Moreover, it is limited to synthesizing preprocessed, noise-excluded radar images, failing to address realistic radar data synthesis. To address these limitations, this paper proposes RadarSplat, which integrates Gaussian Splatting with novel radar noise modeling to enable realistic radar data synthesis and enhanced 3D reconstruction. Compared to the state-of-the-art, RadarSplat achieves superior radar image synthesis (+3.5 PSNR / 2.3x SSIM) and improved geometric reconstruction (-48% RMSE / 2.3x Accuracy), demonstrating its effectiveness in generating high-fidelity radar data and scene reconstruction.
Poster
Elias Marks · Lucas Nunes · Federico Magistri · Matteo Sodano · Rodrigo Marcuzzi · Lars Zimmermann · Jens Behley · Cyrill Stachniss

[ Exhibit Hall I ]

Abstract
The natural world presents complex organic structures, such as tree canopies, that humans can interpret even when only partially visible. Understanding tree structures is key for forest monitoring, orchard management, and automated harvesting applications. However, reconstructing tree topologies from sensor data, called tree skeletonization, remains a challenge for computer vision approaches. Traditional methods for tree skeletonization rely on handcrafted features, regression, or generative models, whereas recent advances focus on deep learning approaches. Existing methods often struggle with occlusions caused by dense foliage, limiting their applicability over the annual vegetation cycle. Furthermore, the lack of real-world data with reference information limits the evaluation of these methods to synthetic datasets, which does not validate generalization to real environments. In this paper, we present a novel approach for tree skeletonization that combines a generative denoising diffusion probabilistic model for predicting node positions and branch directions with a classical minimum spanning tree algorithm to infer tree skeletons from 3D point clouds, even with strong occlusions. Additionally, we provide a dataset of an apple orchard with 280 trees scanned 10 times during the growing season with corresponding reference skeletons, enabling quantitative evaluation. Experiments show the superior performance of our …
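The classical half of the pipeline, connecting predicted skeleton nodes with a minimum spanning tree, can be sketched as below; the k-nearest-neighbor graph construction and parameter values are assumptions for illustration.

import numpy as np
from scipy.spatial import cKDTree
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import minimum_spanning_tree

def skeleton_edges(nodes, k=8):
    """nodes: (N, 3) predicted skeleton node positions; returns MST edges as index pairs."""
    tree = cKDTree(nodes)
    dists, idx = tree.query(nodes, k=k + 1)       # first neighbor of each node is the node itself
    rows = np.repeat(np.arange(len(nodes)), k)
    cols = idx[:, 1:].ravel()
    graph = csr_matrix((dists[:, 1:].ravel(), (rows, cols)), shape=(len(nodes), len(nodes)))
    mst = minimum_spanning_tree(graph).tocoo()    # spanning forest if the kNN graph is disconnected
    return np.stack([mst.row, mst.col], axis=1)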
Poster
Zhimin Chen · Xuewei Chen · Xiao Guo · Yingwei Li · Longlong Jing · Liang Yang · Bing Li

[ Exhibit Hall I ]

Abstract
Recently, multi-modal masked autoencoders (MAE) have been introduced in 3D self-supervised learning, offering enhanced feature learning by leveraging both 2D and 3D data to capture richer cross-modal representations. However, these approaches have two limitations: (1) they inefficiently require both 2D and 3D modalities as inputs, even though the inherent multi-view properties of 3D point clouds already contain the 2D modality; (2) the input 2D modality causes the reconstruction learning to unnecessarily rely on visible 2D information, hindering 3D geometric representation learning. To address these challenges, we propose a 3D to Multi-View Learner (Multi-View ML) that only utilizes 3D modalities as inputs and effectively captures rich spatial information in 3D point clouds. Specifically, we first project 3D point clouds to multi-view 2D images at the feature level based on 3D-based pose. Then, we introduce two components: (1) a 3D to multi-view autoencoder that reconstructs point clouds and multi-view images from 3D and projected 2D features; (2) a multi-scale multi-head (MSMH) attention mechanism that facilitates local-global information interactions in each decoder transformer block through attention heads at various scales. Additionally, a novel two-stage self-training strategy is proposed to align 2D and 3D representations. Empirically, our method significantly outperforms state-of-the-art counterparts across various downstream tasks, …
Poster
Emanuele Giacomini · Luca Di Giammarino · Lorenzo De Rebotti · Giorgio Grisetti · Martin R. Oswald

[ Exhibit Hall I ]

Abstract
LiDARs provide accurate geometric measurements, making them valuable for ego-motion estimation and reconstruction tasks. Despite this success, maintaining an accurate and lightweight representation of the environment still poses challenges. Both classic and NeRF-based solutions have to trade off accuracy against memory and processing time. In this work, we build on recent advancements in Gaussian Splatting methods to develop a novel LiDAR odometry and mapping pipeline that relies exclusively on Gaussian primitives for its scene representation. Leveraging spherical projection, we drive the refinement of the primitives solely from LiDAR measurements. Experiments show that our approach matches current registration performance, while achieving SOTA results for mapping tasks with minimal GPU requirements. This efficiency makes it a strong candidate for further exploration and potential adoption in real-time robotic estimation tasks.
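The spherical projection used to organize LiDAR measurements can be pictured with the standard range-image mapping below; the resolution and vertical field of view are placeholder values, not the paper's configuration.

import numpy as np

def spherical_projection(points, H=64, W=1024, fov_up_deg=3.0, fov_down_deg=-25.0):
    """points: (N, 3) LiDAR points in the sensor frame; returns an (H, W) range image (0 where empty)."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.linalg.norm(points, axis=1) + 1e-8
    yaw = np.arctan2(y, x)                         # azimuth in [-pi, pi]
    pitch = np.arcsin(z / r)                       # elevation angle
    fov_up, fov_down = np.radians(fov_up_deg), np.radians(fov_down_deg)
    u = ((1.0 - (yaw + np.pi) / (2 * np.pi)) * W).astype(int) % W
    v = ((fov_up - pitch) / (fov_up - fov_down) * H).clip(0, H - 1).astype(int)
    img = np.zeros((H, W), dtype=np.float32)
    img[v, u] = r                                  # later points overwrite earlier ones in the same cell
    return img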
Poster
Moslem Yazdanpanah · Ali Bahri · Mehrdad Noori · Sahar Dastani · Gustavo Vargas Hakim · David OSOWIECHI · Ismail Ayed · Christian Desrosiers

[ Exhibit Hall I ]

Abstract
Test-time adaptation (TTA) is crucial for mitigating performance degradation caused by distribution shifts in 3D point cloud classification. In this work, we introduce Token Purging (PG), a novel backpropagation-free approach that removes tokens highly affected by domain shifts before they reach attention layers. Unlike existing TTA methods, PG operates at the token level, ensuring robust adaptation without iterative updates. We propose two variants: PG-SP, which leverages source statistics, and PG-SF, a fully source-free version relying on CLS-token-driven adaptation. Extensive evaluations on ModelNet40-C, ShapeNet-C, and ScanObjectNN-C demonstrate that PG-SP achieves an average of +10.3\% higher accuracy than state-of-the-art backpropagation-free methods, while PG-SF sets new benchmarks for source-free adaptation. Moreover, PG is 12.4× faster and 5.5× more memory efficient than our baseline, making it suitable for real-world deployment.
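A backpropagation-free purging step like the one described can be pictured as scoring each token by its deviation from stored source-domain channel statistics and keeping only the least-shifted tokens before attention; the scoring rule, keep ratio, and names below are illustrative assumptions, not PG's actual criterion.

import torch

def purge_tokens(tokens, src_mean, src_std, keep_ratio=0.8):
    """tokens: (B, N, D) point-cloud tokens; src_mean, src_std: (D,) source-domain statistics."""
    z = (tokens - src_mean) / (src_std + 1e-6)                # standardized deviation from source stats
    score = z.abs().mean(dim=-1)                              # (B, N) per-token shift score
    k = int(keep_ratio * tokens.shape[1])
    keep = score.topk(k, dim=1, largest=False).indices        # indices of the least-shifted tokens
    keep = keep.sort(dim=1).values                            # preserve original token order
    return torch.gather(tokens, 1, keep.unsqueeze(-1).expand(-1, -1, tokens.shape[-1]))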
Poster
Michael Steiner · Thomas Köhler · Lukas Radl · Felix Windisch · Dieter Schmalstieg · Markus Steinberger

[ Exhibit Hall I ]

Abstract
Although 3D Gaussian Splatting (3DGS) has revolutionized 3D reconstruction, it still faces challenges such as aliasing, projection artifacts, and view inconsistencies, primarily due to the simplification of treating splats as 2D entities. We argue that incorporating full 3D evaluation of Gaussians throughout the 3DGS pipeline can effectively address these issues while preserving rasterization efficiency. Specifically, we introduce an adaptive 3D smoothing filter to mitigate aliasing and present a stable view-space bounding method that eliminates popping artifacts when Gaussians extend beyond the view frustum. Furthermore, we promote tile-based culling to 3D with screen-space planes, accelerating rendering and reducing sorting costs for hierarchical rasterization. Our method achieves state-of-the-art quality on in-distribution evaluation sets and significantly outperforms other approaches for out-of-distribution views. Our qualitative evaluations further demonstrate the effective removal of aliasing, distortions, and popping artifacts, ensuring real-time, artifact-free rendering.
Poster
David Stotko · Reinhard Klein

[ Exhibit Hall I ]

Abstract
The reconstruction of three-dimensional dynamic scenes is a well-established yet challenging task within the domain of computer vision. In this paper, we propose a novel approach that combines the domains of 3D geometry reconstruction and appearance estimation for physically based rendering and present a system that is able to perform both tasks for fabrics, utilizing only a single monocular RGB video sequence as input. In order to obtain realistic and high-quality deformations and renderings, a physical simulation of the cloth geometry and differentiable rendering are employed. In this paper, we introduce two novel regularization terms for the 3D reconstruction task that improve the plausibility of the reconstruction. In comparison with the most recent methods in the field, we have reduced the error in the 3D reconstruction by a factor of $2.64$ while requiring a medium runtime of $30$ min per scene. Furthermore, the optimized motion achieves sufficient quality to perform an appearance estimation of the deforming object, recovering sharp details from this single monocular RGB video.
Poster
Junkai Deng · Hanting Niu · Jiaze Li · Fei Hou · Ying He

[ Exhibit Hall I ]

Abstract
Reconstruction from multi-view images is a fundamental challenge in computer vision that has been extensively studied over the past decades. Recently, neural radiance fields have driven significant advancements, especially through methods using implicit functions and volume rendering, achieving high levels of accuracy. A core component of these methods is the mapping that transforms an implicit function's output into corresponding volume densities. Despite its critical role, this mapping has received limited attention in existing literature. In this paper, we provide a comprehensive and systematic study of mapping functions, examining their properties and representations. We first outline the necessary conditions for the mapping function and propose a family of functions that meet these criteria, to ensure first-order unbiasedness. We further demonstrate that the mappings employed by NeuS and VolSDF, two representative neural implicit surface techniques, are special cases within this broader family. Building on our theoretical framework, we introduce several new mapping functions and evaluate their effectiveness through numerical experiments. Our approach offers a fresh perspective on this well-established problem, opening avenues for the development of new techniques in the field.
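For concreteness, one well-known member of this family of mappings is VolSDF's Laplace-CDF transform, sigma(x) = alpha * Psi_beta(-d(x)) with d the signed distance; the sketch below is that textbook mapping, not the new mapping functions introduced in the paper.

import torch

def volsdf_density(sdf, alpha=1.0, beta=0.1):
    """sdf: signed distance values (negative inside the surface); returns volume density."""
    # Laplace CDF with scale beta, evaluated at -sdf
    return alpha * torch.where(
        sdf <= 0,
        1.0 - 0.5 * torch.exp(sdf / beta),   # inside: density approaches alpha
        0.5 * torch.exp(-sdf / beta),        # outside: density decays to zero
    )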
Poster
Tongfan Guan · Jiaxin Guo · Chen Wang · Yun-Hui Liu

[ Exhibit Hall I ]

Abstract
Monocular and stereo depth estimation offer complementary strengths: monocular methods capture rich contextual priors but lack geometric precision, while stereo approaches leverage epipolar geometry but struggle with ambiguities such as reflective or textureless surfaces. Despite their synergies, these paradigms remain largely disjoint in practice. We introduce OmniDepth, a unified framework that bridges both through iterative bidirectional alignment of their latent representations. At its core, a novel cross-attentive alignment mechanism dynamically synchronizes monocular contextual cues with disparity hypothesis representations during stereo reasoning. This mutual alignment resolves stereo ambiguities (e.g., specular surfaces) by injecting monocular structure priors while refining monocular depth with stereo geometry. Extensive experiments demonstrate state-of-the-art results: OmniDepth reduces zero-shot generalization error by $\!>\!40\%$ on Middlebury and ETH3D compared to leading stereo methods, while addressing longstanding failure cases on transparent and reflective surfaces. By harmonizing multi-view geometry with monocular context, OmniDepth advances robust 3D perception that transcends modality-specific limitations. Code and models will be released.
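To make the cross-attentive alignment idea concrete, here is a minimal, hypothetical sketch of one attention pass in which disparity-hypothesis tokens attend to monocular context tokens; the module name, dimensions, and single-pass structure are assumptions for illustration, not the OmniDepth architecture.

```python
import torch
import torch.nn as nn

class CrossAttentiveAlignment(nn.Module):
    """Hypothetical sketch: stereo (disparity-hypothesis) tokens attend to
    monocular context tokens, so each hypothesis is refined by monocular priors;
    a second pass in the reverse direction would refine the monocular branch."""
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, stereo_feat, mono_feat):
        # stereo_feat, mono_feat: (B, N, C) token sequences, e.g. flattened H*W
        refined, _ = self.attn(query=stereo_feat, key=mono_feat, value=mono_feat)
        return self.norm(stereo_feat + refined)

if __name__ == "__main__":
    B, N, C = 2, 64, 128
    layer = CrossAttentiveAlignment(C)
    out = layer(torch.randn(B, N, C), torch.randn(B, N, C))
    print(out.shape)  # torch.Size([2, 64, 128])
```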
Poster
Yuki Urakawa · Yoshihiro Watanabe

[ Exhibit Hall I ]

Abstract
Among structured-light methods, the phase-shifting approach enables high-resolution and high-accuracy measurements using a minimum of three patterns. However, its performance is significantly affected when dynamic and complex-shaped objects are measured, as motion artifacts and phase inconsistencies can degrade accuracy. In this study, we propose an enhanced phase-shifting method that incorporates neural inverse rendering to enable the 3D measurement of moving objects. To effectively capture object motion, we introduce a displacement field into the rendering model, which accurately represents positional changes and mitigates motion-induced distortions. Additionally, to achieve high-precision reconstruction with fewer phase-shifting patterns, we designed a multiview-rendering framework that utilizes multiple cameras in conjunction with a single projector. Comparisons with state-of-the-art methods and various ablation studies demonstrated that our method accurately reconstructs the shapes of moving objects, even with a small number of patterns, using only simple, well-known phase-shifting patterns.
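For readers unfamiliar with phase shifting, the snippet below shows the textbook three-step phase recovery that such systems build on; the shift convention is an assumption, and the paper's contribution lies in the neural inverse rendering, displacement field, and multiview framework around this step, not in the formula itself.

```python
import numpy as np

def three_step_phase(i1, i2, i3):
    """Classic three-step phase shifting with shifts of -120/0/+120 degrees:
    I_k = A + B*cos(phi + delta_k). Returns the wrapped phase in (-pi, pi]."""
    return np.arctan2(np.sqrt(3.0) * (i1 - i3), 2.0 * i2 - i1 - i3)

if __name__ == "__main__":
    phi_true, A, B = 1.2, 0.5, 0.4
    deltas = np.array([-2 * np.pi / 3, 0.0, 2 * np.pi / 3])
    i1, i2, i3 = A + B * np.cos(phi_true + deltas)
    print(three_step_phase(i1, i2, i3))  # ~1.2
```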
Poster
Tianshu Huang · Akarsh Prabhakara · Chuhan Chen · Jay Karhade · Deva Ramanan · Matthew O'Toole · Anthony Rowe

[ Exhibit Hall I ]

Abstract
mmWave radars are compact, inexpensive, and durable sensors that are robust to occlusions and work regardless of environmental conditions, such as weather and darkness. However, this comes at the cost of poor angular resolution, especially for inexpensive single-chip radars, which are typically used in automotive and indoor sensing applications. Although many have proposed learning-based methods to mitigate this weakness, no standardized foundational models or large datasets for the mmWave radar have emerged, and practitioners have largely trained task-specific models from scratch using relatively small datasets. In this paper, we collect (to our knowledge) the largest available raw radar dataset with 1M samples (29 hours) and train a foundational model for 4D single-chip radar, which can predict 3D occupancy and semantic segmentation with quality that is typically only possible with much higher resolution sensors. We demonstrate that our Generalizable Radar Transformer (GRT) generalizes across diverse settings, can be fine-tuned for different tasks, and shows logarithmic data scaling of 20\% per $10\times$ data. We also run extensive ablations on common design decisions, and find that using raw radar data significantly outperforms widely-used lossy representations, equivalent to a $10\times$ increase in training data. Finally, we estimate a total data requirement of $\approx$100M samples (3000 …
Poster
Hamadi Chihaoui · Paolo Favaro

[ Exhibit Hall I ]

Abstract
Zero-shot image restoration (IR) methods based on pretrained diffusion models have recently achieved significant success. These methods typically require at least a parametric form of the degradation model. However, in real-world scenarios, the degradation may be too complex to define explicitly. To handle this general case, we introduce the Diffusion Image Prior (DIIP). We take inspiration from the Deep Image Prior (DIP), since it can be used to remove artifacts without the need for an explicit degradation model. However, in contrast to DIP, we find that pretrained diffusion models offer a much stronger prior, despite being trained without knowledge from corrupted data. We show that the optimization process in DIIP first reconstructs a clean version of the image before eventually overfitting to the degraded input, but it does so for a broader range of degradations than DIP. In light of this result, we propose a blind image restoration (IR) method based on early stopping, which does not require prior knowledge of the degradation model. We validate DIIP on various degradation-blind IR tasks, including JPEG artifact removal, deblurring, denoising and super-resolution with state-of-the-art results.
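The early-stopping idea can be sketched generically as follows: fit a generative prior to the degraded observation and stop before it starts reproducing the degradation. The sketch below uses a plain CNN and a fixed stopping iteration purely as stand-ins; DIIP itself relies on a pretrained diffusion model and its own stopping behavior, so treat this only as a schematic of the early-stopping logic.

```python
import torch
import torch.nn as nn

def restore_with_early_stopping(degraded, prior_net, steps=2000, stop_at=600, lr=1e-3):
    """Schematic DIP/DIIP-style loop: fit a generative prior to the degraded
    image and stop early, before the prior starts reproducing the degradation.
    `stop_at` stands in for whatever early-stopping criterion is actually used."""
    z = torch.randn(1, 32, *degraded.shape[-2:], requires_grad=False)
    opt = torch.optim.Adam(prior_net.parameters(), lr=lr)
    best = None
    for it in range(steps):
        out = prior_net(z)
        loss = torch.mean((out - degraded) ** 2)
        opt.zero_grad()
        loss.backward()
        opt.step()
        if it == stop_at:            # early stop: return the clean-ish estimate
            best = out.detach()
            break
    return best

if __name__ == "__main__":
    net = nn.Sequential(nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
                        nn.Conv2d(64, 3, 3, padding=1))
    y = torch.rand(1, 3, 64, 64)     # toy degraded observation
    x_hat = restore_with_early_stopping(y, net, steps=700)
    print(x_hat.shape)
```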
Poster
Tobias Fischer · Samuel Rota Bulò · Yung-Hsu Yang · Nikhil Keetha · Lorenzo Porzi · Norman Müller · Katja Schwarz · Jonathon Luiten · Marc Pollefeys · Peter Kontschieder

[ Exhibit Hall I ]

Abstract
3D Gaussian splatting enables high-quality novel view synthesis (NVS) at real-time frame rates. However, its quality drops sharply as we depart from the training views. Thus, very dense captures involving many images are needed to match the high-quality expectations of some applications, e.g. Virtual Reality (VR). However, dense captures are very laborious and expensive to obtain. Existing works have explored using 2D generative models to alleviate this requirement by distillation or generating additional training views. These methods are often conditioned only on a handful of reference input views and thus do not fully exploit the available 3D information, leading to inconsistent generation results and reconstruction artifacts. To tackle this problem, we propose a multi-view, flow-matching model that learns a flow to connect novel views generated from possibly-sparse reconstructions to renderings that we expect from dense reconstructions. This enables augmenting scene captures with generated novel views to improve the overall reconstruction quality. Our model is trained on a novel dataset of 3.6M image pairs and can process up to 45 views at 540x960 resolution (91K tokens) on one H100 GPU in a single forward pass. Our pipeline consistently improves NVS in few-view and many-view scenarios, leading to higher-quality reconstructions than prior works …
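As background on flow matching (independent of the multi-view conditioning above), a generic training step looks like the following: sample a time, form the linear interpolant between a source and a target sample, and regress the predicted velocity onto their difference. The toy network and tensor shapes are assumptions for illustration, not the paper's model.

```python
import torch
import torch.nn as nn

def flow_matching_loss(v_net, x0, x1):
    """Generic (rectified-flow style) flow-matching step: sample t, form the
    linear interpolant x_t = (1-t)*x0 + t*x1, and regress the velocity field
    onto the constant target x1 - x0."""
    B = x0.shape[0]
    t = torch.rand(B, 1, 1, 1)
    x_t = (1.0 - t) * x0 + t * x1
    target = x1 - x0
    pred = v_net(x_t, t.flatten())
    return torch.mean((pred - target) ** 2)

class TinyVelocityNet(nn.Module):
    """Minimal velocity network: concatenates a time channel onto the input."""
    def __init__(self, ch=3):
        super().__init__()
        self.body = nn.Conv2d(ch + 1, ch, 3, padding=1)

    def forward(self, x, t):
        t_map = t.view(-1, 1, 1, 1).expand(-1, 1, *x.shape[-2:])
        return self.body(torch.cat([x, t_map], dim=1))

if __name__ == "__main__":
    x0, x1 = torch.randn(4, 3, 32, 32), torch.randn(4, 3, 32, 32)
    print(flow_matching_loss(TinyVelocityNet(), x0, x1).item())
```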
Poster
Haoyi Duan · Hong-Xing Yu · Sirui Chen · Li Fei-Fei · Jiajun Wu

[ Exhibit Hall I ]

Abstract
We introduce the WorldScore benchmark, the first unified benchmark for world generation. We decompose world generation into a sequence of next-scene generation tasks with explicit camera trajectory-based layout specifications, enabling unified evaluation of diverse approaches from 3D and 4D scene generation to video generation models. The WorldScore benchmark encompasses a curated dataset of 3,000 test examples that span diverse worlds: indoor and outdoor, static and dynamic, photorealistic and stylized. The WorldScore metric evaluates generated worlds through three key aspects: controllability, quality, and dynamics. Through extensive evaluation of 19 representative models, including both open-source and closed-source implementations, we reveal key insights and challenges for each category of models. We will open-source WorldScore, including evaluation metrics, datasets, and generated videos.
Poster
Alan Liang · Lingdong Kong · Dongyue Lu · Youquan Liu · Jian Fang · Huaici Zhao · Wei Tsang Ooi

[ Exhibit Hall I ]

Abstract
With the rise of robotics, LiDAR-based 3D object detection has garnered significant attention in both academia and industry. However, existing datasets and methods predominantly focus on vehicle-mounted platforms, leaving other autonomous platforms underexplored. To bridge this gap, we introduce Pi3DET, the first benchmark featuring LiDAR data and 3D bounding box annotations collected from multiple platforms: vehicle, quadruped, and drone, thereby facilitating research in 3D object detection for non-vehicle platforms as well as cross-platform 3D detection. Based on Pi3DET, we propose a novel cross-platform adaptation framework that transfers knowledge from the well-studied vehicle platform to other platforms. This framework achieves perspective-invariant 3D detection through robust alignment at both geometric and feature levels. Additionally, we establish a benchmark to evaluate the resilience and robustness of current 3D detectors in cross-platform scenarios, providing valuable insights for developing adaptive 3D perception systems. Extensive experiments validate the effectiveness of our approach on challenging cross-platform tasks, demonstrating substantial gains over existing adaptation methods. We hope this work paves the way for a generalizable and unified 3D perception system across diverse and complex environments. Our Pi3DET dataset, cross-platform benchmark suite, and annotation toolkit will be made publicly available.
Poster
Zhixin Cheng · Jiacheng Deng · Xinjun Li · Xiaotian Yin · Bohao Liao · Baoqun Yin · Wenfei Yang · Tianzhu Zhang

[ Exhibit Hall I ]

Abstract
Detection-free methods typically follow a coarse-to-fine pipeline, extracting image and point cloud features for patch-level matching and refining dense pixel-to-point correspondences. However, differences in feature channel attention between images and point clouds may lead to degraded matching results, ultimately impairing registration accuracy. Furthermore, similar structures in the scene could lead to redundant correspondences in cross-modal matching. To address these issues, we propose the Channel Adaptive Adjustment Module (CAA) and the Global Optimal Selection Module (GOS). CAA enhances intra-modal features and suppresses cross-modal sensitivity, while GOS replaces local selection with global optimization. Experiments on RGB-D Scenes V2 and 7-Scenes demonstrate the superiority of our method, achieving state-of-the-art performance in image-to-point cloud registration.
Poster
Yehonathan Litman · Fernando De la Torre · Shubham Tulsiani

[ Exhibit Hall I ]

Abstract
Recent approaches for 3D relighting have shown promise in integrating 2D image relighting generative priors to alter the appearance of a 3D representation while preserving the underlying structure. Nevertheless, generative priors used for 2D relighting that directly relight from an input image do not take advantage of intrinsic properties of the subject that can be inferred, or cannot consider multi-view data at scale, leading to subpar relighting. In this paper, we propose LightSwitch, a novel finetuned material-relighting diffusion framework that efficiently relights an arbitrary number of input images to a target lighting condition while incorporating cues from inferred intrinsic properties. By using multi-view and material information cues together with a scalable denoising scheme, our method consistently and efficiently relights dense multi-view data of objects with diverse material compositions. We show that our 2D relighting prediction quality exceeds previous state-of-the-art relighting priors that directly relight from images. We further demonstrate that LightSwitch matches or outperforms state-of-the-art diffusion inverse rendering methods in relighting synthetic and real objects in as little as 2 minutes. We will publicly release our model and code.
Poster
Yunsong Zhou · Naisheng Ye · William Ljungbergh · Tianyu Li · Jiazhi Yang · Zetong Yang · Hongzi Zhu · Christoffer Petersson · Hongyang Li

[ Exhibit Hall I ]

Abstract
Controllable scene generation could reduce the cost of diverse data collection substantially for autonomous driving. Prior works formulate traffic layout generation as a predictive process, either by denoising entire sequences at once or by iteratively predicting the next frame. However, full sequence denoising hinders online reaction, while the latter's short-sighted next-frame prediction lacks precise goal-state guidance. Further, the learned model struggles to generate complex or challenging scenarios due to the large number of safe and ordinary driving behaviors in open datasets. To overcome these, we introduce Nexus, a decoupled scene generation framework that improves reactivity and goal conditioning by simulating both ordinary and challenging scenarios from fine-grained tokens with independent noise states. At the core of the decoupled pipeline is the integration of a partial noise-masking training strategy and a noise-aware schedule that ensures timely environmental updates throughout the denoising process. To complement challenging scenario generation, we collect a dataset consisting of complex corner cases. It covers 540 hours of simulated data, including high-risk interactions such as cut-in, sudden braking, and collision. Nexus achieves superior generation realism while preserving reactivity and goal orientation, with a 40\% reduction in displacement error. We further demonstrate that Nexus improves closed-loop planning by 20\% …
Poster
Renzhi He · Haowen Zhou · Yubei Chen · Yi Xue

[ Exhibit Hall I ]

Abstract
Volumetric reconstruction of label-free living cells from non-destructive optical microscopic images reveals cellular metabolism in native environments. However, current optical tomography techniques require hundreds of 2D images to reconstruct a 3D volume, hindering them from intravital imaging of biological samples undergoing rapid dynamics. This poses a challenge of reconstructing the entire volume of semi-transparent biological samples from sparse views due to the restricted viewing angles of microscopes and the limited number of measurements. In this work, we develop Neural Volumetric Prior (NVP) for high-fidelity volumetric reconstruction of semi-transparent biological samples from sparse-view microscopic images. NVP integrates explicit and implicit neural representations and incorporates the physical prior of diffractive optics. We validate NVP on both simulated data and experimentally captured microscopic images. Compared to previous methods, NVP significantly reduces the required number of images by nearly 50-fold and processing time by 3-fold while maintaining state-of-the-art performance. NVP is the first technique to enable volumetric reconstruction of label-free biological samples from sparse-view microscopic images, paving the way for real-time 3D imaging of dynamically changing biological samples.
Poster
Yuhang Lu · Jiadong Tu · Yuexin Ma · Xinge Zhu

[ Exhibit Hall I ]

Abstract
End-to-end autonomous driving has emerged as a promising approach to unify perception, prediction, and planning within a single framework, reducing information loss and improving adaptability. However, existing methods often rely on fixed and sparse trajectory supervision, limiting their ability to capture the hierarchical reasoning process that human drivers naturally employ. To bridge this gap, we propose ReAL-AD, a Reasoning-Augmented Learning framework that structures decision-making in autonomous driving based on the three-tier human cognitive model: \textbf{Driving Strategy}, \textbf{Driving Decision}, and \textbf{Driving Operation}, where Vision-Language Models (VLMs) are incorporated to enhance situational awareness and structured reasoning across these levels. Specifically, we introduce: (1) the \textbf{Strategic Reasoning Injector}, which formulates high-level driving strategies by interpreting complex traffic contexts from VLM-generated insights; (2) the \textbf{Tactical Reasoning Integrator}, which refines strategic intent into interpretable tactical choices such as lane changes, overtaking, and speed adjustments; and (3) the \textbf{Hierarchical Trajectory Decoder}, which progressively translates tactical decisions into precise control actions for smooth and human-like trajectory execution. Extensive evaluations show that integrating our framework improves planning accuracy and safety by over 30\%, making end-to-end autonomous driving more interpretable and aligned with human-like hierarchical reasoning.
Poster
Songchun Zhang · Huiyao Xu · Sitong Guo · Zhongwei Xie · Hujun Bao · Weiwei Xu · Changqing Zou

[ Exhibit Hall I ]

Abstract
Novel view synthesis (NVS) boosts immersive experiences in computer vision and graphics. Existing techniques, though they have progressed, rely on dense multi-view observations, restricting their applicability. This work takes on the challenge of reconstructing photorealistic 3D scenes from sparse or single-view inputs. We introduce SpatialCrafter, a framework that leverages the rich knowledge in video diffusion models to generate plausible additional observations, thereby alleviating reconstruction ambiguity. Through a trainable camera encoder and an epipolar attention mechanism for explicit geometric constraints, we achieve precise camera control and 3D consistency, further reinforced by a unified scale estimation strategy to handle scale discrepancies across datasets. Furthermore, by integrating monocular depth priors with semantic features in the video latent space, our framework directly regresses 3D Gaussian primitives and efficiently processes long-sequence features using a hybrid network structure. Extensive experiments show our method enhances sparse view reconstruction and restores the realistic appearance of 3D scenes.
Poster
Jianfei Jiang · Qiankun Liu · Haochen Yu · Hongyuan Liu · Liyong Wang · Jiansheng Chen · Huimin Ma

[ Exhibit Hall I ]

Abstract
Learning-based Multi-View Stereo (MVS) methods aim to predict depth maps for a sequence of calibrated images to recover dense point clouds. However, existing MVS methods often struggle with challenging regions, such as textureless regions and reflective surfaces, where feature matching fails. In contrast, monocular depth estimation inherently does not require feature matching, allowing it to achieve robust relative depth estimation in these regions. To bridge this gap, we propose MonoMVSNet, a novel monocular feature and depth guided MVS network that integrates powerful priors from a monocular foundation model into multi-view geometry. Firstly, the monocular feature of the reference view is integrated into source view features by the attention mechanism with a newly designed cross-view position embedding. Then, the monocular depth of the reference view is aligned to dynamically update the depth candidates for edge regions during the sampling procedure. Finally, a relative consistency loss is further designed based on the monocular depth to supervise the depth prediction. Extensive experiments demonstrate that MonoMVSNet achieves state-of-the-art performance on the DTU and Tanks-and-Temples datasets, ranking first on the Tanks-and-Temples Intermediate and Advanced benchmarks among all published methods.
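One standard way to bring relative monocular depth into the metric frame of a multi-view prediction, which the dynamic depth-candidate update above presupposes, is a closed-form scale-and-shift fit. The sketch below shows that generic alignment step under simple assumptions; it is not MonoMVSNet's specific update rule.

```python
import numpy as np

def align_scale_shift(mono_depth, ref_depth, mask):
    """Fit s, t minimizing || s*mono + t - ref ||^2 over valid pixels and return
    the aligned monocular depth. A standard trick for putting relative monocular
    depth into the metric frame of an MVS/reference depth prediction."""
    m = mono_depth[mask].reshape(-1)
    d = ref_depth[mask].reshape(-1)
    A = np.stack([m, np.ones_like(m)], axis=1)      # (N, 2) design matrix
    (s, t), *_ = np.linalg.lstsq(A, d, rcond=None)
    return s * mono_depth + t

if __name__ == "__main__":
    mono = np.random.rand(48, 64)
    gt = 2.5 * mono + 0.3                           # synthetic metric depth
    mask = np.random.rand(48, 64) > 0.7             # sparse "reliable" pixels
    aligned = align_scale_shift(mono, gt, mask)
    print(np.abs(aligned - gt).max())               # ~0
```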
Poster
Xin Zhou · DINGKANG LIANG · Sifan Tu · Xiwu Chen · Yikang Ding · Dingyuan Zhang · Feiyang Tan · Hengshuang Zhao · Xiang Bai

[ Exhibit Hall I ]

Abstract
Driving World Models (DWMs) have become essential for autonomous driving by enabling future scene prediction. However, existing DWMs are limited to scene generation and fail to incorporate scene understanding, which involves interpreting and reasoning about the driving environment. In this paper, we present a unified Driving World Model named HERMES. We seamlessly integrate 3D scene understanding and future scene evolution (generation) through a unified framework in driving scenarios. Specifically, HERMES leverages a Bird's-Eye View (BEV) representation to consolidate multi-view spatial information while preserving geometric relationships and interactions. We also introduce world queries, which incorporate world knowledge into BEV features via causal attention in the Large Language Model, enabling contextual enrichment for understanding and generation tasks. We conduct comprehensive studies on nuScenes and OmniDrive-nuScenes datasets to validate the effectiveness of our method. HERMES achieves state-of-the-art performance, reducing generation error by 32.4% and improving understanding metrics such as CIDEr by 8.0%. The model and code will be made available.
Poster
XINJIE ZHANG · Zhening Liu · Yifan Zhang · Xingtong Ge · Dailan He · Tongda Xu · Yan Wang · Zehong Lin · Shuicheng YAN · Jun Zhang

[ Exhibit Hall I ]

Abstract
4D Gaussian Splatting (4DGS) has recently emerged as a promising technique for capturing complex dynamic 3D scenes with high fidelity. It utilizes a 4D Gaussian representation and a GPU-friendly rasterizer, enabling rapid rendering speeds. Despite its advantages, 4DGS faces significant challenges, notably the requirement of millions of 4D Gaussians, each with extensive associated attributes, leading to substantial memory and storage cost. This paper introduces a memory-efficient framework for 4DGS. We streamline the color attribute by decomposing it into a per-Gaussian direct color component with only 3 parameters and a shared lightweight alternating current color predictor. This approach eliminates the need for spherical harmonics coefficients, which typically involve up to 144 parameters in classic 4DGS, thereby creating a memory-efficient 4D Gaussian representation. Furthermore, we introduce an entropy-constrained Gaussian deformation technique that uses a deformation field to expand the action range of each Gaussian and integrates an opacity-based entropy loss to limit the number of Gaussians, thus forcing our model to use as few Gaussians as possible to fit a dynamic scene well. With simple half-precision storage and zip compression, our framework achieves a storage reduction by approximately 190$\times$ and 125$\times$ on the Technicolor and Neural 3D Video datasets, respectively, compared to …
Poster
Daehee Park · Monu Surana · Pranav Desai · Ashish Mehta · Reuben John · Kuk-Jin Yoon

[ Exhibit Hall I ]

Abstract
Predicting future trajectories of dynamic traffic agents is crucial in autonomous systems. While data-driven methods enable large-scale training, they often underperform on rarely observed tail samples, yielding a long-tail problem. Prior works have tackled this by modifying model architectures, such as using a hypernetwork. In contrast, we propose refining the training procedure to unlock each model’s potential without altering its structure. To this end, we introduce the Generative Active Learning for Trajectory prediction (GALTraj), which iteratively identifies tail samples and augments them via a controllable generative diffusion model. By incorporating the augmented samples in each iteration, we directly mitigate dataset imbalance. To ensure effective augmentation, we design a new tail-aware generation method that categorizes agents (tail, head, relevant) and applies tailored guidance of the diffusion model. It enables producing diverse and realistic trajectories that preserve tail characteristics while respecting traffic constraints. Unlike prior traffic simulation methods focused on producing diverse scenarios, ours is the first to show how simulator-driven augmentation can benefit long-tail learning for trajectory prediction. Experiments on multiple trajectory datasets (WOMD, Argoverse2) with popular backbones (QCNet, MTR) confirm that our method significantly boosts performance on tail samples and also enhances accuracy on head samples.
Poster
Yueh-Cheng Liu · Lukas Höllein · Matthias Nießner · Angela Dai

[ Exhibit Hall I ]

Abstract
Surface reconstruction is fundamental to computer vision and graphics, enabling applications in 3D modeling, mixed reality, robotics, and more. Existing approaches based on volumetric rendering obtain promising results, but optimize on a per-scene basis, resulting in a slow optimization that can struggle to model under-observed or textureless regions. We introduce QuickSplat, which learns data-driven priors to generate dense initializations for 2D gaussian splatting optimization of large-scale indoor scenes. This provides a strong starting point for the reconstruction, which accelerates the convergence of the optimization and improves the geometry of flat wall structures. We further learn to jointly estimate the densification and update of the scene parameters during each iteration; our proposed densifier network predicts new Gaussians based on the rendering gradients of existing ones, removing the need for densification heuristics. Extensive experiments on large-scale indoor scene reconstruction demonstrate the superiority of our data-driven optimization. Concretely, we accelerate runtime by 8x, while decreasing depth errors by 48% in comparison to state-of-the-art methods.
Poster
Yingde Song · Zongyuan Yang · Baolin Liu · yongping xiong · Sai Chen · Lan Yi · Zhaohe Zhang · Xunbo Yu

[ Exhibit Hall I ]

Abstract
Light Field Displays (LFDs), despite significant advances in hardware technology supporting larger fields of view and multiple viewpoints, still face a critical challenge of limited content availability. Producing autostereoscopic 3D content on these displays requires refracting multi-perspective images into different spatial angles, with strict demands for spatial consistency across views, which is technically challenging for non-experts. Existing image/video generation models and radiance field-based methods cannot directly generate display content that meets the strict requirements of light field display hardware from a single 2D resource. We introduce the first generative framework ${\rm \bf EYE}^{3}$ specifically designed for 3D light field displays, capable of converting any 2D images, videos, or texts into high-quality display content tailored for these screens. The framework employs a point-based representation rendered through off-axis perspective, ensuring precise light refraction and alignment with the hardware's optical requirements. To maintain consistent 3D coherence across multiple viewpoints, we finetune a video diffusion model to fill occluded regions based on the rendered masks. Experimental results demonstrate that our approach outperforms state-of-the-art methods, significantly simplifying content creation for LFDs. With broad potential in industries such as entertainment, advertising, and immersive display technologies, our method offers a robust solution to content scarcity and greatly enhances the …
Poster
Rongqing Li · Changsheng Li · Ruilin Lv · Yuhang Li · Yang Gao · Xiaolu Zhang · JUN ZHOU

[ Exhibit Hall I ]

Abstract
Trajectory prediction aims to forecast an agent's future trajectories based on its historical observed trajectories, which is a critical task for various applications such as autonomous driving, robotics, and surveillance systems. Most existing trajectory prediction methods assume that the observed trajectories collected for forecasting are clean. However, in real-world scenarios, noise is inevitably introduced into the observations, resulting in the collapse of the existing approaches. Therefore, it is essential to perform robust trajectory prediction based on noisy observations, which is a more practical scenario. In this paper, we propose **NATRA**, a **N**oise-**A**gnostic framework capable of tackling the problem of **TRA**jectory prediction with arbitrary types of noisy observations. Specifically, we put forward a mutual information-based mechanism to denoise the original noisy observations. It optimizes the produced trajectories to exhibit a pattern that closely resembles the clean trajectory pattern while deviating from the noisy one. Considering that the trajectory structure may be destroyed when optimizing mutual information alone, we introduce an additional reconstruction loss to preserve the structure information of the produced observed trajectories. Moreover, we propose a ranking loss to further enhance the performance. Because NATRA does not rely on any specific module tailored to particular noise distributions, it …
Poster
Jiaxu Wan · Hong Zhang · Ziqi He · Yangyan Deng · Qishu Wang · Ding Yuan · Yifan Yang

[ Exhibit Hall I ]

Abstract
Point transformers have demonstrated remarkable progress in 3D understanding through expanded receptive fields (RF), but further expanding the RF leads to dilution in group attention and decreases detailed feature extraction capability. Proxies, which serve as abstract representations for simplifying feature maps, enable a global RF. However, existing proxy-based approaches face critical limitations: global proxies incur quadratic complexity for large-scale point clouds and suffer positional ambiguity, while local proxy alternatives struggle with 1) unreliable sampling from geometrically diverse point clouds, 2) inefficient proxy interaction computation, and 3) imbalanced local-global information fusion. To address these challenges, we propose Sparse Proxy Point Transformer (SP$^{2}$T) -- a local proxy-based dual-stream point transformer with three key innovations: First, for reliable sampling, spatial-wise proxy sampling with vertex-based associations enables robust sampling on geometrically diverse point clouds. Second, for efficient proxy interaction, sparse proxy attention with a table-based relative bias effectively achieves the interaction with efficient map-reduce computation. Third, for local-global information fusion, our dual-stream architecture maintains local-global balance through parallel branches. Comprehensive experiments reveal that SP$^{2}$T sets state-of-the-art results with acceptable latency on indoor and outdoor 3D comprehension benchmarks, demonstrating marked improvement (+3.8\% mIoU vs. SPoTr@S3DIS, +22.9\% mIoU vs. PointASNL@Sem.KITTI) compared to other proxy-based point cloud …
Poster
Zhaojie Zeng · Yuesong Wang · Chao Yang · Tao Guan · Lili Ju

[ Exhibit Hall I ]

Abstract
Implicit Neural Representation (INR) has demonstrated remarkable advances in the field of image representation but demands substantial GPU resources. GaussianImage recently pioneered the use of Gaussian Splatting to mitigate this cost; however, its slow training process limits its practicality, and the fixed number of Gaussians per image limits its adaptability to varying information entropy. To address these issues, we propose in this paper a generalizable and self-adaptive image representation framework based on 2D Gaussian Splatting. Our method employs a network to quickly generate a coarse Gaussian representation, followed by minimal fine-tuning steps, achieving rendering quality comparable to GaussianImage while significantly reducing training time. Moreover, our approach dynamically adjusts the number of Gaussian points based on image complexity to further enhance flexibility and efficiency in practice. Experiments on DIV2K and Kodak datasets show that our method matches or exceeds GaussianImage’s rendering performance with far fewer iterations and shorter training times. Specifically, our method reduces the training time by up to one order of magnitude while achieving superior rendering performance with the same number of Gaussians.
Poster
Hyunjoon Lee · Joonkyu Min · Jaesik Park

[ Exhibit Hall I ]

Abstract
3D Gaussian Splatting (3DGS) has begun incorporating rich information from 2D foundation models. However, most approaches rely on a bottom-up optimization process that treats raw 2D features as ground truth, incurring increased computational costs and an excessive number of Gaussians. We propose a top-down pipeline for constructing compact and fast 3D feature fields, namely, \Ours{}. We first perform a weighted fusion of multi-view features with a pre-trained 3DGS. The aggregated feature captures spatial cues by integrating information across views, mitigating the ambiguity in 2D features. This top-down design enables a per-Gaussian autoencoder strategy to compress high-dimensional features into a 3D latent space, significantly balancing feature expressiveness and memory efficiency. Finally, we introduce an adaptive sparsification method that merges Gaussians to reduce complexity, ensuring efficient representation without unnecessary detail. Our approach produces a competitive 3D feature field using only about 10\% of the Gaussians compared to existing feature-embedded 3DGS methods.
Poster
Bo-Lun Huang · Tzu-Hsiang Ni · Feng-Kai Huang · Hong-Han Shuai · Wen-Huang Cheng

[ Exhibit Hall I ]

Abstract
Accurate and stable lane detection is crucial for the reliability of autonomous driving systems. A core challenge lies in predicting lane positions in complex scenarios, such as curved roads or when markings are ambiguous or absent. Conventional approaches leverage deep learning techniques to extract both high-level and low-level visual features, aiming to achieve a comprehensive understanding of the driving environment. However, these methods often rely on predefined anchors within a single-pass model, limiting their adaptability. The one-shot prediction paradigm struggles with precise lane estimation in challenging scenarios, such as curved roads or adverse conditions like low visibility at night. To address these limitations, we propose a novel cold diffusion-based framework that initializes lane predictions with predefined anchors and iteratively refines them. This approach retains the flexibility and progressive refinement capabilities of diffusion models while overcoming the constraints of traditional hot diffusion techniques. To further enhance the model’s coarse-to-fine refinement capabilities, we introduce a multi-resolution image processing strategy, where images are analyzed at different timesteps to capture both global and local lane structure details. Besides, we incorporate a learnable noise variance schedule, enabling the model to dynamically adjust its learning process based on multi-resolution inputs. Experimental results demonstrate that our method significantly improves detection accuracy …
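For context, cold diffusion replaces Gaussian noise with a deterministic degradation; a generic sampling loop in the style of Bansal et al. is sketched below with a toy degradation that blends predictions toward fixed anchors. The restoration oracle, anchor blending, and step count are illustrative assumptions, not the paper's lane model.

```python
import numpy as np

def cold_diffusion_refine(restore_fn, anchors, T=10):
    """Generic cold-diffusion sampling (the 'improved' update of Bansal et al.):
    start from the fully degraded state (here: the anchors), then repeatedly
    predict a clean estimate and re-degrade it to the next, lighter level.
    degrade(x0, t) blends x0 toward the anchors with strength t/T."""
    def degrade(x0, t):
        a = t / T
        return (1.0 - a) * x0 + a * anchors

    x = anchors.copy()                       # t = T: pure anchors
    for t in range(T, 0, -1):
        x0_hat = restore_fn(x, t)            # model's clean prediction
        x = x - degrade(x0_hat, t) + degrade(x0_hat, t - 1)
    return x

if __name__ == "__main__":
    target = np.linspace(0.0, 1.0, 8)        # toy "true lane" offsets
    anchors = np.zeros(8)                    # predefined anchor positions
    oracle = lambda x, t: target             # stand-in for the learned model
    print(np.abs(cold_diffusion_refine(oracle, anchors) - target).max())  # 0.0
```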
Poster
Jeongyun Kim · Seunghoon Jeong · Giseop Kim · Myung-Hwan Jeon · Eunji Jun · Ayoung Kim

[ Exhibit Hall I ]

Abstract
Understanding the 3D geometry of transparent objects from RGB images is challenging due to their inherent physical properties, such as reflection and refraction. To address these difficulties, especially in scenarios with sparse views and dynamic environments, we introduce a novel 2D Gaussian Splatting-based depth reconstruction method for transparent objects. Our key insight lies in separating transparent objects from the background, enabling focused optimization of Gaussians corresponding to the object. We mitigate artifacts with an object‐aware loss that places Gaussians in obscured regions, ensuring coverage of invisible surfaces while reducing overfitting. Furthermore, we incorporate a physics-based simulation that refines the reconstruction in just a few seconds, effectively handling object removal and chain‐reaction movement of remaining objects without the need for rescanning. Our model was evaluated on both synthetic and real-world sequences, and it consistently demonstrated robust improvements over existing GS-based state-of-the-art methods. In comparison with baselines, our model reduces the mean absolute error by over 45\% for the synthetic TRansPose sequences. Furthermore, despite being updated using only one image, our model reaches a $\delta < 2.5$ cm accuracy of 48.46\%, nearly double that of baselines, which use six images.
Poster
Wenjie Chang · Hanzhi Chang · Yueyi Zhang · Wenfei Yang · Tianzhu Zhang

[ Exhibit Hall I ]

Abstract
Indirect Time-of-Flight (iToF) cameras are popular for 3D perception because they are cost-effective and easy to deploy. They emit modulated infrared signals to illuminate the scene and process the received signals to generate amplitude and phase images. The depth is calculated from the phase using the modulation frequency. However, the obtained depth often suffers from noise caused by multi-path interference (MPI), low signal-to-noise ratio (SNR), and depth wrapping. Building on recent advancements in neural scene representations, which have shown great potential in 3D modeling from multi-view RGB images, we propose leveraging this approach to reconstruct 3D representations from noisy iToF data. Our method utilizes the multi-view consistency of amplitude and phase maps, averaging information from all input views to generate an accurate scene representation. Considering the impact of infrared illumination, we propose a new rendering scheme for amplitude maps based on a signed distance function (SDF) and introduce a neural lighting function to model the appearance variations caused by active illumination. We also incorporate a phase-guided sampling strategy and a wrapping-aware phase-to-depth loss to utilize raw phase information and mitigate depth wrapping. Additionally, we add a noise-weight loss to prevent excessive smoothing of information across noisy multi-view measurements. Experiments conducted on synthetic and real-world datasets demonstrate that …
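The depth-wrapping ambiguity mentioned above follows directly from the iToF phase-to-depth relation; the short snippet below states that standard relation and the resulting unambiguous range (the example modulation frequency is illustrative).

```python
import numpy as np

C = 299_792_458.0  # speed of light, m/s

def phase_to_depth(phase, f_mod, k=0):
    """iToF phase-to-depth: depth = c * (phase + 2*pi*k) / (4*pi*f_mod),
    where k is the unknown wrapping index; k = 0 gives the wrapped depth."""
    return C * (phase + 2.0 * np.pi * k) / (4.0 * np.pi * f_mod)

def unambiguous_range(f_mod):
    """Depths beyond c / (2*f_mod) wrap around, which is the 'depth wrapping'
    ambiguity discussed in the abstract."""
    return C / (2.0 * f_mod)

if __name__ == "__main__":
    f = 20e6                          # 20 MHz modulation
    print(unambiguous_range(f))       # ~7.49 m
    print(phase_to_depth(np.pi, f))   # half the unambiguous range, ~3.75 m
```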
Poster
Ranran Huang · Krystian Mikolajczyk

[ Exhibit Hall I ]

Abstract
We introduce SPFSplat, an efficient framework for 3D Gaussian Splatting from sparse multi-view images, requiring no ground-truth poses during both training and inference. Our method simultaneously predicts Gaussians and camera poses from unposed images in a canonical space within a single feed-forward step. During training, the pose head estimates the poses at target views, which are supervised through the image rendering loss. Additionally, a reprojection loss is introduced to ensure alignment between Gaussians and the estimated poses of input views, reinforcing geometric consistency. This pose-free training paradigm and efficient one-step feed-forward inference make SPFSplat well-suited for practical applications. Despite the absence of pose supervision, our self-supervised SPFSplat achieves state-of-the-art performance in novel view synthesis, even under significant viewpoint changes. Furthermore, it surpasses recent methods trained with geometry priors in relative pose estimation, demonstrating its effectiveness in both 3D scene reconstruction and camera pose learning.
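One plausible form of the reprojection term described above, assuming a simple pinhole camera model, is to project predicted Gaussian centers with the estimated pose of an input view and penalize the distance to their source pixels. The sketch below illustrates that idea; it is not SPFSplat's exact formulation.

```python
import torch

def reprojection_loss(centers, pixels, K, T_cam_world):
    """Project predicted 3D Gaussian centers into an input view with its
    estimated pose and penalize the distance to the source pixels.
    centers: (N, 3) world-space points; pixels: (N, 2) pixel coordinates;
    K: (3, 3) intrinsics; T_cam_world: (4, 4) world-to-camera pose."""
    N = centers.shape[0]
    homog = torch.cat([centers, torch.ones(N, 1)], dim=1)   # (N, 4)
    cam = (T_cam_world @ homog.T).T[:, :3]                  # camera coordinates
    proj = (K @ cam.T).T
    uv = proj[:, :2] / proj[:, 2:3].clamp(min=1e-6)         # perspective divide
    return torch.mean(torch.sum((uv - pixels) ** 2, dim=1))

if __name__ == "__main__":
    K = torch.tensor([[500.0, 0.0, 64.0], [0.0, 500.0, 64.0], [0.0, 0.0, 1.0]])
    T = torch.eye(4)
    pts = torch.tensor([[0.0, 0.0, 2.0], [0.1, -0.1, 3.0]])
    px = torch.tensor([[64.0, 64.0], [80.666, 47.333]])
    print(reprojection_loss(pts, px, K, T))  # near zero
```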
Poster
Chen Chen · Kangcheng Bin · Hu Ting · Jiahao Qi · Xingyue Liu · Tianpeng Liu · Zhen Liu · Yongxiang Liu · Ping Zhong

[ Exhibit Hall I ]

Abstract
Unmanned aerial vehicle (UAV)-based object detection with visible (RGB) and infrared (IR) images facilitates robust around-the-clock detection, driven by advancements in deep learning techniques and the availability of high-quality datasets. However, existing datasets struggle to fully capture real-world complexity due to limited imaging conditions. To this end, we introduce a high-diversity dataset ATR-UMOD covering varying scenarios, spanning altitudes from 80m to 300m, angles from 0° to 75°, and all-day, all-year time variations in rich weather and illumination conditions. Moreover, each RGB-IR image pair is annotated with 6 condition attributes, offering valuable high-level contextual information. To meet the challenge raised by such diverse conditions, we propose a novel prompt-guided condition-aware dynamic fusion (PCDF) to adaptively reassign multimodal contributions by leveraging annotated condition cues. By encoding imaging conditions as text prompts, PCDF effectively models the relationship between conditions and multimodal contributions through a task-specific soft-gating transformation. A prompt-guided condition-decoupling module further ensures applicability in practice when condition annotations are unavailable. Experiments on the ATR-UMOD dataset reveal the effectiveness of PCDF.
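A hypothetical sketch of prompt-guided soft gating is given below: an embedding of the condition prompt predicts per-channel weights that reassign RGB and IR contributions before fusion. Module names, dimensions, and the sigmoid gate are assumptions for illustration, not the PCDF design.

```python
import torch
import torch.nn as nn

class PromptGatedFusion(nn.Module):
    """Hypothetical condition-aware soft gating: an embedding of the imaging
    condition prompt predicts per-channel weights that reassign the
    contributions of RGB and IR features before they are merged."""
    def __init__(self, feat_dim=256, prompt_dim=512):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(prompt_dim, feat_dim), nn.Sigmoid())

    def forward(self, rgb_feat, ir_feat, prompt_emb):
        # rgb_feat, ir_feat: (B, C, H, W); prompt_emb: (B, prompt_dim)
        g = self.gate(prompt_emb).unsqueeze(-1).unsqueeze(-1)   # (B, C, 1, 1)
        return g * rgb_feat + (1.0 - g) * ir_feat

if __name__ == "__main__":
    fuse = PromptGatedFusion()
    out = fuse(torch.randn(2, 256, 32, 32), torch.randn(2, 256, 32, 32),
               torch.randn(2, 512))
    print(out.shape)   # torch.Size([2, 256, 32, 32])
```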
Poster
Chengbo Wang · Guozheng Ma · Yizhen Lao · Yifei Xue

[ Exhibit Hall I ]

Abstract
3D Gaussian Splatting (3DGS) has emerged as a powerful technique for novel view synthesis, demonstrating remarkable capability in high-fidelity scene reconstruction through its Gaussian primitive representations. However, the computational overhead induced by the massive number of primitives poses a significant bottleneck to training efficiency. To overcome this challenge, we propose Group Training, a simple yet effective strategy that organizes Gaussian primitives into manageable groups, optimizing training efficiency and improving rendering quality. This approach shows universal compatibility with existing 3DGS frameworks, including vanilla 3DGS and Mip-Splatting, consistently achieving accelerated training while maintaining superior synthesis quality. Extensive experiments reveal that our straightforward Group Training strategy achieves up to 30% faster convergence and improved rendering quality across diverse scenarios.
Poster
Tongyan Hua · Lutao Jiang · Ying-Cong Chen · Wufan Zhao

[ Exhibit Hall I ]

Abstract
Recent advancements in generative models have enabled 3D urban scene generation from satellite imagery, unlocking promising applications in gaming, digital twins, and beyond. However, most existing methods rely heavily on neural rendering techniques, which hinder their ability to produce detailed 3D structures on a broader scale, largely due to the inherent structural ambiguity derived from relatively limited 2D observations. To address this challenge, we propose Sat2City, a novel framework that synergizes the representational capacity of sparse voxel grids with latent diffusion models, tailored specifically for our novel 3D city dataset. Our approach is enabled by three key components: (1) a cascaded latent diffusion framework that progressively recovers 3D city structures from satellite imagery, (2) a Re-Hash operation at its Variational Autoencoder (VAE) bottleneck to compute multi-scale feature grids for stable appearance optimization, and (3) an inverse sampling strategy enabling implicit supervision for smooth appearance transitioning. To overcome the challenge of collecting real-world city-scale 3D models with high-quality geometry and appearance, we introduce a dataset of synthesized large-scale 3D cities paired with satellite-view height maps. Validated on this dataset, our framework generates detailed 3D structures from a single satellite image, achieving superior fidelity compared to existing city generation models.
Poster
Carter Sifferman · Yiquan Li · Yiming Li · Fangzhou Mu · Michael Gleicher · Mohit Gupta · Yin Li

[ Exhibit Hall I ]

Abstract
We aim to recover the geometry of 3D parametric scenes using very few depth measurements from low-cost, commercially available time-of-flight sensors. These sensors offer very low spatial resolution (i.e., a single pixel), but image a wide field-of-view per pixel and capture detailed time-of-flight data in the form of time-resolved photon counts. This time-of-flight data encodes rich scene information and thus enables recovery of simple scenes from sparse measurements. We investigate the feasibility of using a distributed set of few measurements (e.g., as few as 15 pixels) to recover the geometry of simple parametric scenes with a strong prior, such as estimating the 6D pose of a known object. To achieve this, we design a method that utilizes both feed-forward prediction to infer scene parameters, and differentiable rendering within an analysis-by-synthesis framework to refine the scene parameter estimate. We develop hardware prototypes and demonstrate that our method effectively recovers object pose given an untextured 3D model in both simulations and controlled real-world captures, and show promising initial results for other parametric scenes. We additionally conduct experiments to explore the limits and capabilities of our imaging solution.
Poster
Ying-Tian Liu · Jiajun Li · Yu-Tao Liu · Xin Yu · Yuan-Chen Guo · Yanpei Cao · Ding Liang · Ariel Shamir · Song-Hai Zhang

[ Exhibit Hall I ]

Abstract
Quad meshes play a crucial role in computer graphics applications, yet automatically generating high-quality quad meshes remains challenging. Traditional quadrangulation approaches rely on local geometric features and manual constraints, often producing suboptimal mesh layouts that fail to capture global shape semantics. We introduce NeuFrameQ, a novel learning-based framework for scalable and generalizable mesh quadrangulation via frame field prediction. We first create a large-scale dataset of high-quality quad meshes with various shapes to serve as priors of domain knowledge. Empowered by this dataset, we employ a connectivity-agnostic learning approach that operates on point clouds with normals, enabling robust processing of complex mesh geometries. By decomposing frame field prediction into direction regression and magnitude estimation tasks, we effectively handle the ill-posed nature in frame field estimation. We also employ the polyvector representation and computing mechanism in both tasks to handle the inherent ambiguities in frame field representation. Extensive experiments demonstrate that NeuFrameQ produces high-quality quad meshes with superior semantic alignment, also for geometries derived from neural fields. Our method significantly advances the state of the art in automatic quad mesh generation, bridging the gap between neural content creation and production-ready geometric assets.
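As background on why frame-field regression is ambiguous: a cross field is invariant to 90° rotations, and a standard remedy, closely related to the polyvector representation mentioned above, is to encode a direction θ through cos 4θ and sin 4θ. The snippet below illustrates that encoding; it is not NeuFrameQ's exact parameterization.

```python
import numpy as np

def encode_cross_direction(theta):
    """Encode a cross-field direction as (cos 4θ, sin 4θ), so that directions
    differing by 90° map to the same code, removing the symmetry ambiguity."""
    return np.array([np.cos(4.0 * theta), np.sin(4.0 * theta)])

def decode_cross_direction(code):
    """Recover a representative angle in [0, π/2) from the 4θ encoding."""
    return 0.25 * np.arctan2(code[1], code[0]) % (np.pi / 2.0)

if __name__ == "__main__":
    for theta in (0.3, 0.3 + np.pi / 2, 0.3 + np.pi):   # 90°-rotated copies
        c = encode_cross_direction(theta)
        print(np.round(c, 4), round(float(decode_cross_direction(c)), 4))
    # all three rows print the same code and the same representative angle 0.3
```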
Poster
Xin You · Runze Yang · Chuyan Zhang · Zhongliang Jiang · JIE YANG · Nassir Navab

[ Exhibit Hall I ]

Abstract
The temporal interpolation task for 4D medical imaging plays a crucial role in the clinical practice of respiratory motion modeling. Following the simplified linear-motion hypothesis, existing approaches adopt optical flow-based models to interpolate intermediate frames. However, realistic respiratory motions should be nonlinear and quasi-periodic with specific frequencies. Motivated by this property, we resolve the temporal interpolation task from the frequency perspective, and propose a Fourier basis-guided Diffusion model, termed FB-Diff. Specifically, due to the regular motion discipline of respiration, physiological motion priors are introduced to describe general characteristics of temporal data distributions. Then a Fourier motion operator is elaborately devised to extract Fourier bases by incorporating physiological motion priors and case-specific spectral information in the feature space of a Variational Autoencoder. Well-learned Fourier bases can better simulate respiratory motions with motion patterns of specific frequencies. Conditioned on starting and ending frames, the diffusion model further leverages well-learned Fourier bases via the basis interaction operator, which promotes the temporal interpolation task in a generative manner. Extensive results demonstrate that FB-Diff achieves state-of-the-art (SOTA) perceptual performance with better temporal consistency while maintaining promising reconstruction metrics. Anonymous codes are available.
Poster
Yuheng Du · Sheng Yang · Lingxuan Wang · Zhenghua.Hou Zhenghua.Hou · Chengying Cai · Zhitao Tan · Mingxia Chen · Shi-Sheng Huang · Qiang Li

[ Exhibit Hall I ]

Abstract
While recent online HD mapping methods relieve burdened offline pipelines and solve map freshness, they remain limited by perceptual inaccuracies, occlusion in dense traffic, and an inability to fuse multi-agent observations. We propose RTMap to enhance these single-traversal methods by persistently crowdsourcing a multi-traversal HD map as a self-evolving memory. On onboard agents, RTMap simultaneously addresses three core challenges in an end-to-end fashion: (1) uncertainty-aware positional modeling for HD map elements, (2) probabilistic-aware localization w.r.t. the crowdsourced prior-map, and (3) real-time detection of possible road structural changes. Experiments on several public autonomous driving datasets demonstrate solid performance in both prior-aided map quality and localization accuracy, showing that RTMap robustly serves downstream prediction and planning modules while gradually and asynchronously improving the accuracy and freshness of the crowdsourced prior map.
Poster
Rui Song · Chenwei Liang · Yan Xia · Walter Zimmer · Hu Cao · Holger Caesar · Andreas Festag · Alois Knoll

[ Exhibit Hall I ]

Abstract
Dynamic scene rendering opens new avenues in autonomous driving by enabling closed-loop simulations with photorealistic data, which is crucial for validating end-to-end algorithms. However, the complex and highly dynamic nature of traffic environments presents significant challenges in accurately rendering these scenes. In this paper, we introduce a novel 4D Gaussian Splatting (4DGS) approach, which incorporates context and temporal deformation awareness to improve dynamic scene rendering. Specifically, we employ a 2D semantic segmentation foundation model to self-supervise the 4D semantic features of Gaussians, ensuring meaningful contextual embedding. Simultaneously, we track the temporal deformation of each Gaussian across adjacent frames. By aggregating and encoding both semantic and temporal deformation features, each Gaussian is equipped with cues for potential deformation compensation within 3D space, facilitating a more precise representation of dynamic scenes. Experimental results show that our method improves 4DGS's ability to capture fine details in dynamic scene rendering for autonomous driving and outperforms other self-supervised methods in 4D reconstruction and novel view synthesis. Furthermore, CoDa-4DGS deforms semantic features with each Gaussian, enabling broader applications.
Poster
Chi-Jui Ho · Yash Belhe · Steve Rotenberg · Ravi Ramamoorthi · Tzu-Mao Li · Nicholas Antipa

[ Exhibit Hall I ]

Abstract
End-to-end optimization, which integrates differentiable optics simulators with computational algorithms, enables the joint design of hardware and software in data-driven imaging systems. However, existing methods usually compromise physical accuracy by neglecting wave optics or off-axis effects due to the high computational cost of modeling both aberration and diffraction. This limitation raises concerns about the robustness of optimized designs. In this paper, we propose a differentiable optics simulator that accurately and efficiently models aberration and diffraction in compound optics and allows us to analyze the role and impact of diffraction in end-to-end optimization. Experimental results demonstrate that compared with ray-optics-based optimization, diffraction-aware optimization improves system robustness to diffraction blur. Through accurate wave optics modeling, we also apply the simulator to optimize the Fizeau interferometer and free form optics elements. These findings underscore the importance of accurate wave optics modeling in robust end-to-end optimization.
Poster
Yuheng Liu · Xinke Li · Yuning Zhang · Lu Qi · Xin Li · Wenping Wang · Chongshou Li · Xueting Li · Ming-Hsuan Yang

[ Exhibit Hall I ]

Abstract
Three-dimensional scene generation is crucial in computer vision, with applications spanning autonomous driving, gaming and the metaverse. Current methods either lack user control or rely on imprecise, non-intuitive conditions. In this work, we propose a method that uses scene graphs—an accessible, user-friendly control format—to generate outdoor 3D scenes. We develop an interactive system that transforms a sparse scene graph into a dense BEV (Bird's Eye View) Embedding Map, which guides a conditional diffusion model to generate 3D scenes that match the scene graph description. During inference, users can easily create or modify scene graphs to generate large-scale outdoor scenes. We create a large-scale dataset with paired scene graphs and 3D semantic scenes to train the BEV embedding and diffusion models. Experimental results show that our approach consistently produces high-quality 3D urban scenes closely aligned with the input scene graphs. To the best of our knowledge, this is the first approach to generate 3D outdoor scenes conditioned on scene graphs. Code and dataset will be released upon acceptance.
Poster
Peiqi Chen · Lei Yu · Yi Wan · Yingying Pei · Xinyi Liu · YongxiangYao YongxiangYao · Yingying Zhang · Lixiang Ru · Liheng Zhong · Jingdong Chen · Ming Yang · Yongjun Zhang

[ Exhibit Hall I ]

Abstract
Semi-dense feature matching methods have shown strong performance in challenging scenarios. However, the existing pipeline relies on a global search across the entire feature map to establish coarse matches, limiting further improvements in accuracy and efficiency. Motivated by this limitation, we propose a novel pipeline, CasP, which leverages cascaded correspondence priors for guidance. Specifically, the matching stage is decomposed into two progressive phases, bridged by a region-based selective cross-attention mechanism designed to enhance feature discriminability. In the second phase, one-to-one matches are determined by restricting the search range to the one-to-many prior areas identified in the first phase. Additionally, this pipeline benefits from incorporating high-level features, which helps reduce the computational costs of low-level feature extraction. The acceleration gains of CasP increase with higher resolution, and our lite model achieves a speedup of $\sim2.2\times$ at a resolution of 1152 compared to the most efficient method, ELoFTR. Furthermore, extensive experiments demonstrate its superiority in geometric estimation, particularly with impressive cross-domain generalization. These advantages highlight its potential for latency-sensitive and high-robustness applications, such as SLAM and UAV systems.
Poster
Yufei Han · Bowen Tie · Heng Guo · Youwei Lyu · Si Li · Boxin Shi · Yunpeng Jia · Zhanyu Ma

[ Exhibit Hall I ]

Abstract
Efficient shape reconstruction for surfaces with complex reflectance properties is crucial for real-time virtual reality. While 3D Gaussian Splatting (3DGS)-based methods offer fast novel view rendering by leveraging their explicit surface representation, their reconstruction quality lags behind that of implicit neural representations, particularly when recovering surfaces with complex reflectance. To address these problems, we propose PolGS, a $\underline{Pol}$arimetric $\underline{G}$aussian $\underline{S}$platting model allowing fast reflective surface reconstruction in 10 minutes. By integrating polarimetric constraints into the 3DGS framework, PolGS effectively separates specular and diffuse components, enhancing reconstruction quality for challenging reflective materials. Experimental results on synthetic and real-world datasets validate the effectiveness of our method.
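For context, polarimetric constraints build on the standard transmitted-radiance sinusoid, which relates the intensity behind a linear polarizer to the degree and angle of linear polarization. The snippet below states that textbook model together with a four-angle Stokes recovery; it is background only and independent of the PolGS splatting formulation.

```python
import numpy as np

def polarizer_intensity(i_mean, dolp, aolp, pol_angle):
    """Standard transmitted-radiance sinusoid: intensity behind a linear
    polarizer at angle `pol_angle`, given the mean transmitted intensity,
    degree of linear polarization (DoLP) and angle of linear polarization (AoLP)."""
    return i_mean * (1.0 + dolp * np.cos(2.0 * (pol_angle - aolp)))

def estimate_dolp_aolp(i0, i45, i90, i135):
    """Recover DoLP/AoLP from four polarizer angles via the Stokes parameters."""
    s0 = 0.5 * (i0 + i45 + i90 + i135)
    s1, s2 = i0 - i90, i45 - i135
    return np.hypot(s1, s2) / s0, 0.5 * np.arctan2(s2, s1)

if __name__ == "__main__":
    angles = np.deg2rad([0, 45, 90, 135])
    obs = polarizer_intensity(0.8, 0.4, np.deg2rad(30), angles)
    print(estimate_dolp_aolp(*obs))   # ~(0.4, 0.5236 rad)
```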
Poster
Zeyu Yang · Zijie Pan · Yuankun Yang · Xiatian Zhu · Li Zhang

[ Exhibit Hall I ]

Abstract
Driving view synthesis along free-form trajectories is essential for realistic driving simulations, enabling closed-loop evaluation of end-to-end driving policies. Existing methods excel at view interpolation along recorded paths but struggle to generalize to novel trajectories due to limited viewpoints in driving videos. To tackle this challenge, we propose DriveX, a novel free-form driving view synthesis framework that progressively distills a generative prior into the 3D Gaussian model during its optimization. Within this framework, we utilize a video diffusion model to refine the degraded novel trajectory renderings from the in-training Gaussian model, while the restored videos in turn serve as additional supervision for optimizing the 3D Gaussian. Concretely, we craft an inpainting-based video restoration task, which can disentangle the identification of degraded regions from the generative capability of the diffusion model and remove the need to simulate specific degradation patterns when training the diffusion model. To further enhance the consistency and fidelity of generated contents, the pseudo ground truth is progressively updated with gradually improved novel trajectory rendering, allowing both components to co-adapt and reinforce each other while minimizing the disruption on the optimization. By tightly integrating 3D scene representation with generative prior, DriveX achieves high-quality view synthesis beyond recorded …
Poster
Yanzhe Lyu · Kai Cheng · Kang Xin · Xuejin Chen

[ Exhibit Hall I ]

Abstract
Recently, 3D Gaussian Splatting (3D-GS) has prevailed in novel view synthesis, achieving high fidelity and efficiency. However, it often struggles to capture rich details and complete geometry. Our analysis reveals that the 3D-GS densification operation lacks adaptiveness and faces a dilemma between geometry coverage and detail recovery. To address this, we introduce a novel densification operation, residual split, which adds a downscaled Gaussian as a residual. Our approach is capable of adaptively retrieving details and complementing missing geometry. To further support this method, we propose a pipeline named ResGS. Specifically, we integrate a Gaussian image pyramid for progressive supervision and implement a selection scheme that prioritizes the densification of coarse Gaussians over time. Extensive experiments demonstrate that our method achieves SOTA rendering quality. Consistent performance improvements can be achieved by applying our residual split on various 3D-GS variants, underscoring its versatility and potential for broader application in 3D-GS-based applications.
Poster
Tianci Wen · Zhiang Liu · Yongchun Fang

[ Exhibit Hall I ]

Abstract
3D Gaussian splatting (3D-GS) has recently revolutionized novel view synthesis in the simultaneous localization and mapping (SLAM) problem. However, most existing algorithms fail to fully capture the underlying structure, resulting in structural inconsistency. Additionally, they struggle with abrupt appearance variations, leading to inconsistent visual quality. To address these problems, we propose SEGS-SLAM, a structure-enhanced 3D Gaussian Splatting SLAM, which achieves high-quality photorealistic mapping. Our main contributions are two-fold. First, we propose a structure-enhanced photorealistic mapping (SEPM) framework that, for the first time, leverages highly structured point clouds to initialize structured 3D Gaussians, leading to significant improvements in rendering quality. Second, we propose Appearance-from-Motion embedding (AfME), enabling 3D Gaussians to better model image appearance variations across different camera poses. Extensive experiments on monocular, stereo, and RGB-D datasets demonstrate that SEGS-SLAM significantly outperforms state-of-the-art (SOTA) methods in photorealistic mapping quality, e.g., an improvement of $19.86\%$ in PSNR over MonoGS on the TUM RGB-D dataset for monocular cameras. The project page is available at https://segs-slam.github.io/.
Poster
Xi Cheng · Ruiqi Lei · Di Huang · Zhichao Liao · Fengyuan Piao · Yan Chen · Pingfa Feng · Long ZENG

[ Exhibit Hall I ]

Abstract
Parametric point clouds are sampled from CAD shapes and are becoming increasingly common in industrial manufacturing. Most existing CAD-specific deep learning methods focus only on geometric features, while overlooking constraints that are inherent and important in CAD shapes. This limits their ability to discern CAD shapes with similar appearance but different constraints. To tackle this challenge, we first analyze the importance of constraints via a simple validation experiment. Then, we introduce a deep-learning-friendly constraint representation with three vectorized components, and design a constraint-aware feature learning network (CstNet), which includes two stages. Stage 1 extracts constraint features from B-Rep data or point clouds based on local shape information, enabling better generalization to unseen datasets after model pre-training. Stage 2 employs attention layers to adaptively adjust the weights of the three constraint components, facilitating the effective utilization of constraints. In addition, we built the first multi-modal parametric-purpose dataset, i.e., Param20K, comprising about 20K shape instances of 75 classes. On this dataset, we performed classification and rotation-robustness experiments, and CstNet achieved 3.52\% and 26.17\% absolute improvements in instance accuracy over the state-of-the-art methods, respectively. To the best of our knowledge, CstNet is the first constraint-aware deep learning method tailored …
Poster
Hengzhe Jin · Lang Nie · Chunyu Lin · Xiaomei Feng · Yao Zhao

[ Exhibit Hall I ]

Abstract
We propose $\textit{PixelStitch}$, a pixel-wise bidirectional warp that learns to stitch images while preserving structure in an unsupervised paradigm. To produce natural stitched images, we first determine the middle plane through homography decomposition and globally project the original images toward the desired plane. Compared with a unidirectional homography transformation, this evenly spreads projective distortion across the two views and decreases the proportion of invalid pixels. Then, bidirectional optical flow fields are established to carry out residual pixel-wise deformation with projection-weighted natural coefficients, encouraging pixel motions to be as unnoticeable as possible in non-overlapping regions while transitioning smoothly into overlapping areas. Crucially, this flexible deformation enables $\textit{PixelStitch}$ to align large-parallax images and preserve the structural integrity of non-overlapping contents. To obtain high-quality stitched images in the absence of labels, a comprehensive unsupervised objective function is proposed to simultaneously encourage content alignment, structure preservation, and bidirectional consistency. Finally, extensive experiments are conducted to show our superiority over existing state-of-the-art (SoTA) methods in quantitative metrics, qualitative appearance, and generalization ability. The code will be available.
Poster
Ruiyuan Gao · Kai Chen · Bo Xiao · Lanqing HONG · Zhenguo Li · Qiang Xu

[ Exhibit Hall I ]

Abstract
The rapid advancement of diffusion models has greatly improved video synthesis, especially in controllable video generation, which is vital for applications like autonomous driving. Although DiT with 3D VAE has become a standard framework for video generation, it introduces challenges in controllable driving video generation, especially for frame-wise geometric control, rendering existing methods ineffective. To address these issues, we propose MagicDrive-V2, a novel approach that integrates the MVDiT block and spatial-temporal conditional encoding to enable multi-view video generation and precise geometric control. Additionally, we introduce an efficient method for obtaining contextual descriptions for videos to support diverse textual control, along with a progressive training strategy using mixed video data to enhance training efficiency and generalizability. Consequently, MagicDrive-V2 enables multi-view driving video synthesis with 3.3× resolution and 4× frame count (compared to current SOTA), rich contextual control, and geometric controls. Extensive experiments demonstrate MagicDrive-V2’s ability, unlocking broader applications in autonomous driving. Project page: [magicdrive-v2.github.io](https://magicdrive-v2.github.io/)
Poster
Junjie Zhang · Haisheng Su · Feixiang Song · Sanping Zhou · Wei Wu · Junchi Yan · Nanning Zheng

[ Exhibit Hall I ]

Abstract
Detecting 3D objects accurately from multi-view 2D images is a challenging yet essential task in the field of autonomous driving. Current methods resort to integrating depth prediction to recover the spatial information for object query decoding, which necessitates explicit supervision from LiDAR points during the training phase. However, the predicted depth quality remains unsatisfactory, e.g., depth discontinuities at object boundaries and indistinct small objects, mainly caused by the sparse supervision of projected points and the use of high-level image features for depth prediction. Besides, cross-view consistency and scale invariance are also overlooked in previous methods. In this paper, we introduce Frequency-aware Positional Depth Embedding (FreqPDE) to equip 2D image features with spatial information for the 3D detection transformer decoder, which is obtained through three main modules. Specifically, the Frequency-aware Spatial Pyramid Encoder (FSPE) constructs a feature pyramid by combining high-frequency edge clues and low-frequency semantics from different levels respectively. Then the Cross-view Scale-invariant Depth Predictor (CSDP) estimates the pixel-level depth distribution with cross-view and efficient channel attention mechanisms. Finally, the Positional Depth Encoder (PDE) combines the 2D image features and 3D position embeddings to generate the 3D depth-aware features for query decoding. Additionally, hybrid depth …
Poster
Anik Sarker · Alan Asbeck

[ Exhibit Hall I ]

Abstract
Existing methods for rotation estimation between two spherical ($\mathbb{S}^2$) patterns typically rely on maximizing the spherical cross-correlation between two spherical functions. However, these approaches exhibit computational complexity greater than cubic, $O(n^3)$, with respect to the rotation-space discretization and lack extensive evaluation under significant outlier contamination. To this end, we propose a rotation estimation algorithm between two spherical patterns with linear time complexity $O(n)$. Unlike existing spherical-function-based methods, we explicitly represent spherical patterns as discrete 3D point sets on the unit sphere, reformulating rotation estimation as spherical point-set alignment (i.e., the Wahba problem for 3D unit vectors). Given the geometric nature of our formulation, our spherical pattern alignment algorithm naturally aligns with the Wahba problem framework for 3D unit vectors. Specifically, we introduce three novel algorithms: (1) SPMC (Spherical Pattern Matching by Correlation), (2) FRS (Fast Rotation Search), and (3) a hybrid approach (SPMC+FRS) that combines the advantages of the previous two methods. Our experiments demonstrate that in the $\mathbb{S}^2$ domain and in correspondence-free settings, our algorithms are over 10x faster and over 10x more accurate than current state-of-the-art methods for the Wahba problem with outliers. We validate our approach through extensive simulations on a new dataset of spherical patterns, the ``Robust Vector …
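For reference, the Wahba problem for 3D unit vectors mentioned above has a classic closed-form SVD solution when correspondences and weights are known; the sketch below shows that textbook solution (not the paper's SPMC or FRS algorithms), assuming NumPy.

import numpy as np

def solve_wahba(a, b, weights=None):
    """Classic SVD solution to the Wahba problem: find rotation R minimizing
    sum_i w_i * ||b_i - R a_i||^2 for unit vectors a_i, b_i (rows of a and b)."""
    if weights is None:
        weights = np.ones(len(a))
    B = (weights[:, None] * b).T @ a          # 3x3 attitude profile matrix
    U, _, Vt = np.linalg.svd(B)
    d = np.sign(np.linalg.det(U) * np.linalg.det(Vt))
    R = U @ np.diag([1.0, 1.0, d]) @ Vt       # enforce det(R) = +1
    return R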
Poster
Yang Yang · Dongni Mao · Hiroaki Santo · Yasuyuki Matsushita · Fumio Okura

[ Exhibit Hall I ]

Abstract
We develop a neural parametric model for 3D plant leaves for the modeling and reconstruction of plants, which are essential for agriculture and computer graphics. While parametric modeling has been actively studied for human and animal shapes, plant leaves present unique challenges due to their diverse shapes and flexible deformation, making common approaches inapplicable. To address this problem, we introduce a learning-based parametric model, NeuraLeaf, which disentangles leaf geometry into 2D base shapes and 3D deformations. Since the base shapes represent flattened 2D leaves, this allows learning from rich sources of 2D leaf image datasets, and also has the advantage of simultaneously learning texture aligned with the geometry. To model the 3D deformation, we propose a novel skeleton-free skinning model and a newly captured 3D leaf dataset called DeformLeaf. We establish a parametric deformation space by converting the sample-wise skinning parameters into a compact latent representation, allowing for flexible and efficient modeling of leaf deformations. We show that NeuraLeaf successfully generates a wide range of leaf shapes with deformation, resulting in accurate model fitting to 3D observations like depth maps and point clouds. Our implementation and datasets will be released upon acceptance.
Poster
Xianqi Wang · Hao Yang · Gangwei Xu · Junda Cheng · Min Lin · Yong Deng · Jinliang Zang · Yurui Chen · Xin Yang

[ Exhibit Hall I ]

Abstract
State-of-the-art supervised stereo matching methods have achieved remarkable performance on various benchmarks. However, their generalization to real-world scenarios remains challenging due to the scarcity of annotated real-world stereo data. In this paper, we propose ZeroStereo, a novel stereo image generation pipeline for zero-shot stereo matching. Our approach synthesizes high-quality right images from arbitrary single images by leveraging pseudo disparities generated by a monocular depth estimation model. Unlike previous methods that address occluded regions by filling missing areas with neighboring pixels or random backgrounds, we fine-tune a diffusion inpainting model to recover missing details while preserving semantic structure. Additionally, we propose Training-Free Confidence Generation, which mitigates the impact of unreliable pseudo labels without additional training, and Adaptive Disparity Selection, which ensures a diverse and realistic disparity distribution while preventing excessive occlusion and foreground distortion. Experiments demonstrate that models trained with our pipeline achieve state-of-the-art zero-shot generalization across multiple datasets while using a training-data volume comparable to that of Scene Flow.
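A minimal sketch of the basic operation such a pipeline builds on: forward-warping a left image with a pseudo disparity map to obtain a right view plus an occlusion mask to be inpainted. Collisions between source pixels are resolved naively here, and all names are illustrative assumptions rather than the paper's code.

import numpy as np

def forward_warp_left_to_right(left, disparity):
    """Forward-warp a left image to a right view using per-pixel disparity.
    Pixels that receive no source pixel remain zero (to be inpainted later)."""
    h, w = disparity.shape
    right = np.zeros_like(left)
    filled = np.zeros((h, w), dtype=bool)
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    xr = np.round(xs - disparity).astype(int)    # destination column in the right view
    valid = (xr >= 0) & (xr < w)
    right[ys[valid], xr[valid]] = left[ys[valid], xs[valid]]
    filled[ys[valid], xr[valid]] = True
    return right, ~filled                        # warped image and occlusion mask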
Poster
Hanzhi Zhong · Zhiyu Xiang · Ruoyu Xu · Jingyun Fu · Peng Xu · Shaohong Wang · Zhihao · Tianyu Pu · Eryun Liu

[ Exhibit Hall I ]

Abstract
4D radar has received significant attention in autonomous driving thanks to its robustness under adverse weather. Due to the sparse points and noisy measurements of 4D radar, most existing research tackles the 3D object detection task by integrating camera images and performing modality fusion in BEV space. However, the potential of the radar and of the fusion mechanism is still largely unexplored, hindering performance improvement. In this study, we propose a cross-view two-stage fusion network called CVFusion. In the first stage, we design a radar-guided iterative (RGIter) BEV fusion module to generate high-recall 3D proposal boxes. In the second stage, we aggregate features from multiple heterogeneous views, including points, image, and BEV, for each proposal. These comprehensive instance-level features greatly help refine the proposals and generate high-quality predictions. Extensive experiments on public datasets show that our method outperforms the previous state-of-the-art methods by a large margin, with 9.10\% and 3.68\% mAP improvements on View-of-Delft (VoD) and TJ4DRadSet, respectively. Our code will be made publicly available.
Poster
Zican Wang · Michael Fischer · Tobias Ritschel

[ Exhibit Hall I ]

Abstract
We derive methods to compute higher-order differentials (Hessians and Hessian-vector products) of the rendering operator. Our approach is based on importance sampling of a convolution that represents the differentials of rendering parameters and is shown to be applicable to both rasterization and path tracing. We demonstrate that this information improves convergence when used in higher-order optimizers such as Newton or Conjugate Gradient, relative to a gradient descent baseline, in several inverse rendering tasks.
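In general, a Hessian-vector product can be obtained by differentiating a gradient-vector inner product a second time; the sketch below shows this generic double-backward construction in PyTorch (an assumption for illustration, not the paper's importance-sampled estimator).

import torch

def hvp(loss_fn, params, vec):
    """Hessian-vector product via double backward: H v = d/dparams (grad . v)."""
    params = params.detach().requires_grad_(True)
    loss = loss_fn(params)
    grad = torch.autograd.grad(loss, params, create_graph=True)[0]
    return torch.autograd.grad((grad * vec).sum(), params)[0]

# Example: quadratic loss, Hessian is 2*I, so the product is ~ 2*v.
f = lambda p: (p ** 2).sum()
p = torch.randn(5)
v = torch.randn(5)
print(hvp(f, p, v))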
Poster
Yan Li · Yang Xu · Changhao Chen · Zhongchen Shi · Wei Chen · Liang Xie · Hongbo Chen · Erwei Yin

[ Exhibit Hall I ]

Abstract
Inertial tracking (IT), independent of the environment and external infrastructure, has long been the ideal solution for providing location services to humans. Despite significant strides in inertial tracking empowered by deep learning, prevailing neural inertial tracking predominantly utilizes conventional spatial-temporal features from inertial measurements. Unfortunately, the frequency-domain dimension is usually overlooked in the current literature. To this end, in this paper, we propose a Multi-Domain Mixture of Experts model for Neural Inertial Tracking, named M$^2$EIT. Specifically, M$^2$EIT first leverages ResNet as a spatial decomposition expert to capture spatial relationships between multivariate time series, and a State Space Model (SSM)-based Bi-Mamba as the other expert, which focuses on learning temporal correlations. For the frequency-domain mapping, we then introduce a wavelet-based frequency decomposition expert, which decomposes IMU samples into low-frequency and high-frequency bands using the Haar wavelet transform to simulate motion patterns at different temporal scales. To bridge the semantic gap across multiple domains and integrate them adaptively, we design the Multi-Representation Alignment Router (MAR), which consists of a dual cross-domain translation layer followed by a dynamic router, to achieve multi-domain semantic alignment and optimize expert contributions. Extensive experiments conducted on three real-world datasets demonstrate that the proposed M$^2$EIT can achieve SOTA …
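A minimal sketch of a single-level Haar decomposition of one IMU channel into low- and high-frequency bands, the basic transform the frequency expert is described as using; the signal, padding, and number of levels are assumptions for illustration.

import numpy as np

def haar_dwt_1d(x):
    """Single-level Haar wavelet transform of a 1D signal (even length assumed):
    returns (approximation = low-frequency band, detail = high-frequency band)."""
    x = np.asarray(x, dtype=float)
    even, odd = x[0::2], x[1::2]
    low = (even + odd) / np.sqrt(2.0)    # smooth / low-frequency coefficients
    high = (even - odd) / np.sqrt(2.0)   # detail / high-frequency coefficients
    return low, high

imu_channel = np.sin(np.linspace(0, 8 * np.pi, 256)) + 0.1 * np.random.randn(256)
low_band, high_band = haar_dwt_1d(imu_channel)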
Poster
Akshay Krishnan · Xinchen Yan · Vincent Casser · Abhijit Kundu

[ Exhibit Hall I ]

Abstract
We introduce Orchid, a unified latent diffusion model that learns a joint appearance-geometry prior to generate color, depth, and surface normal images in a single diffusion process. This unified approach is more efficient and coherent than current pipelines that use separate models for appearance and geometry. Orchid is versatile—it directly generates color, depth, and normal images from text, supports joint monocular depth and normal estimation with color-conditioned finetuning, and seamlessly inpaints large 3D regions by sampling from the joint distribution. It leverages a novel Variational Autoencoder (VAE) that jointly encodes RGB, relative depth, and surface normals into a shared latent space, combined with a latent diffusion model that denoises these latents. Our extensive experiments demonstrate that Orchid delivers competitive performance against SOTA task-specific geometry prediction methods, even surpassing them in normal-prediction accuracy and depth-normal consistency. It also inpaints color-depth-normal images jointly, with more qualitative realism than existing multi-step methods.
Poster
Wonseok Roh · Hwanhee Jung · JongWook Kim · Seunggwan Lee · Innfarn Yoo · Andreas Lugmayr · Seunggeun Chi · Karthik Ramani · Sangpil Kim

[ Exhibit Hall I ]

Abstract
Recently, generalizable feed-forward methods based on 3D Gaussian Splatting have gained significant attention for their potential to reconstruct 3D scenes using finite resources. These approaches create a 3D radiance field, parameterized by per-pixel 3D Gaussian primitives, from just a few images in a single forward pass. Unlike multi-view methods that benefit from cross-view correspondences, 3D scene reconstruction with a single-view image remains an underexplored area. In this work, we introduce CATSplat, a novel generalizable transformer-based framework designed to break through the inherent constraints in monocular settings. First, we propose leveraging textual guidance from a visual-language model to complement insufficient information from single-view image features. By incorporating scene-specific contextual details from text embeddings through cross-attention, we pave the way for context-aware 3D scene reconstruction beyond relying solely on visual cues. Moreover, we advocate utilizing spatial guidance from 3D point features toward comprehensive geometric understanding under monocular settings. With 3D priors, image features can capture rich structural insights for predicting 3D Gaussians without multi-view techniques. Extensive experiments on large-scale datasets demonstrate the state-of-the-art performance of CATSplat in single-view 3D scene reconstruction with high-quality novel view synthesis.
Poster
QingleiCao · Ziyao Tang · Xiaoqin Tang

[ Exhibit Hall I ]

Abstract
X-ray imaging, based on penetration, enables detailed visualization of internal structures. Building on this capability, existing implicit 3D reconstruction methods have adapted the NeRF model and its variants for internal CT reconstruction. However, these approaches often neglect the significance of objects' anatomical priors for implicit learning, limiting both reconstruction precision and learning efficiency, particularly in ultra-sparse view scenarios. To address these challenges, we propose a novel 3D CT reconstruction framework that employs a 'target prior' derived from the object's projection data to enhance implicit learning. Our approach integrates positional and structural encoding to facilitate voxel-wise implicit reconstruction, utilizing the target prior to guide voxel sampling and enrich structural encoding. This dual strategy significantly boosts both learning efficiency and reconstruction quality. Additionally, we introduce a CUDA-based algorithm for rapid estimation of high-quality 3D target priors from sparse-view projections. Experiments utilizing projection data from a complex abdominal dataset demonstrate that the proposed model substantially enhances learning efficiency, outperforming the current leading model, NAF, by a factor of ten. In terms of reconstruction quality, it also exceeds the most accurate model, NeRP, achieving PSNR improvements of 3.57 dB, 5.42 dB, and 5.70 dB with 10, 20, and 30 projections, respectively. The code is …
Poster
Mengmeng Wang · Haonan Wang · Yulong Li · Xiangjie Kong · Jiaxin Du · Feng Xia · Guojiang Shen

[ Exhibit Hall I ]

Abstract
3D LiDAR-based single object tracking (SOT) relies on sparse and irregular point clouds, posing challenges from geometric variations in scale, motion patterns, and structural complexity across object categories. Current category-specific approaches achieve good accuracy but are impractical for real-world use, requiring separate models for each category and showing limited generalization. To tackle these issues, we propose TrackAny3D, the first framework to transfer large-scale pretrained 3D models for category-agnostic 3D SOT. We first integrate parameter-efficient adapters to bridge the gap between pretraining and tracking tasks while preserving geometric priors. Then, we introduce a Mixture-of-Geometry-Experts (MoGE) architecture that adaptively activates specialized subnetworks based on distinct geometric characteristics. Additionally, we design a temporal context optimization strategy that incorporates learnable temporal tokens and a dynamic mask weighting module to propagate historical information and mitigate temporal drift. Experiments on three commonly-used benchmarks show that TrackAny3D establishes new state-of-the-art performance on category-agnostic 3D SOT, demonstrating strong generalization and competitiveness. We hope this work will enlighten the community on the importance of unified models and further expand the use of large-scale pretrained models in this field. The source code will be released.
Poster
ziyu zhang · Binbin Huang · Hanqing Jiang · Liyang Zhou · Xiaojun Xiang · Shuhan Shen

[ Exhibit Hall I ]

Abstract
We propose Quadratic Gaussian Splatting (QGS), a novel representation that replaces static primitives with deformable quadric surfaces (e.g., ellipse, paraboloids) to capture intricate geometry. Unlike prior works that rely on Euclidean distance for primitive density modeling—a metric misaligned with surface geometry under deformation—QGS introduces geodesic distance-based density distributions. This innovation ensures that density weights adapt intrinsically to the primitive’s curvature, preserving consistency during shape changes (e.g., from planar disks to curved paraboloids). By solving geodesic distances in closed form on quadric surfaces, QGS enables surface-aware splatting, where a single primitive can represent complex curvature that previously required dozens of planar surfels, potentially reducing memory usage while maintaining real-time rendering via efficient ray-quadric intersection. Experiments on DTU, Tanks and Temples, and MipNeRF360 datasets demonstrate state-of-the-art surface reconstruction, with QGS reducing geometric error (chamfer distance) by 33% over 2DGS and 27% over GOF on the DTU dataset. Crucially, QGS retains competitive appearance quality, bridging the gap between geometric precision and visual fidelity for applications like robotics and immersive reality.
Poster
Sarosij Bose · Arindam Dutta · Sayak Nag · Junge Zhang · Jiachen Li · Konstantinos Karydis · Amit Roy-Chowdhury

[ Exhibit Hall I ]

Abstract
Reconstructing 3D scenes from a single image is a fundamentally ill-posed task due to the severely under-constrained nature of the problem. Consequently, when the scene is rendered from novel camera views, particularly in unseen regions far away from the input camera, existing single-image-to-3D reconstruction methods render incoherent and blurry views. In this work, we address these inherent limitations in existing single-image-to-3D scene feedforward networks. To alleviate the poor performance due to insufficient information beyond the input image’s view, we leverage a strong generative prior in the form of a pre-trained latent video diffusion model for iterative refinement of a coarse scene represented by optimizable Gaussian parameters. To ensure that the style and texture of the generated images align with those of the input image, we incorporate on-the-fly Fourier-style transfer between the generated images and the input image. Additionally, we design a semantic uncertainty quantification module that calculates the per-pixel entropy and yields uncertainty maps, which are used to guide the refinement process starting from the most confident pixels while discarding the remaining highly uncertain ones. We conduct extensive experiments on real-world scene datasets, including in-domain RealEstate-10K and out-of-domain KITTI-v2, showing that our approach can provide more realistic …
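A minimal sketch of per-pixel entropy as an uncertainty measure over a softmax probability map, with a quantile threshold to keep the most confident pixels; the array shapes and the threshold choice are assumptions, not the paper's exact module.

import numpy as np

def entropy_uncertainty(prob_map, keep_quantile=0.5, eps=1e-8):
    """prob_map: (H, W, C) per-pixel class probabilities.
    Returns the entropy map and a mask of the most confident (low-entropy) pixels."""
    entropy = -np.sum(prob_map * np.log(prob_map + eps), axis=-1)   # (H, W)
    threshold = np.quantile(entropy, keep_quantile)
    confident = entropy <= threshold      # low entropy = high confidence
    return entropy, confident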
Poster
Jian Shi · Peter Wonka

[ Exhibit Hall I ]

Abstract
We present \textit{VoxelKP}, a novel fully sparse network architecture tailored for human keypoint estimation in LiDAR data. The key challenge is that objects are distributed sparsely in 3D space, while human keypoint detection requires detailed local information wherever humans are present. First, we introduce a dual-branch \textit{fully sparse spatial-context block} where the spatial branch focuses on learning the local spatial correlations between keypoints within each human instance, while the context branch aims to retain the global spatial information. Second, we use a \textit{spatially aware multi-scale BEV fusion} technique to leverage absolute 3D coordinates when projecting 3D voxels to a 2D grid encoding a bird's eye view for better preservation of the global context of each human instance. We evaluate our method on the Waymo dataset and achieve an improvement of $27\%$ on the MPJPE metric compared to the state-of-the-art, \textit{HUM3DIL}, trained on the same data, and $12\%$ against the state-of-the-art, \textit{GC-KPL}, pretrained on a $25\times$ larger dataset. To the best of our knowledge, \textit{VoxelKP} is the first single-staged, fully sparse network that is specifically designed for addressing the challenging task of 3D keypoint estimation from LiDAR data, achieving state-of-the-art performance. Our code is available at \url{https://}.
Poster
Weikang Wang · Tobias Weißberg · Nafie El Amrani · Florian Bernard

[ Exhibit Hall I ]

Abstract
Chirality information (i.e., information that allows distinguishing left from right) is ubiquitous for various data modes in computer vision, including images, videos, point clouds, and meshes. Contrary to symmetry, for which there has been a lot of research in the image domain, chirality information in shape analysis (point clouds and meshes) has remained underdeveloped. Although many shape vertex descriptors have shown appealing properties (e.g., robustness to rigid-body pose transformations), they are not able to disambiguate between left and right symmetric parts. Considering the ubiquity of chirality information in different shape analysis problems and the lack of chirality-aware features within current shape descriptors, developing a chirality feature extractor becomes necessary and urgent. In this paper, building on the recent framework Diff3f, we propose an unsupervised chirality feature extraction pipeline to decorate shape vertices with chirality-aware information extracted from 2D foundation models. Quantitative and qualitative results from various experiments and downstream tasks, including left-right disentanglement, shape matching, and part segmentation, conducted on a variety of datasets prove the effectiveness and usefulness of our extracted chirality features. The code will be available once this work is accepted.
Poster
Muleilan Pei · Shaoshuai Shi · Xuesong Chen · Xu Liu · Shaojie Shen

[ Exhibit Hall I ]

Abstract
Motion forecasting for on-road traffic agents presents both a significant challenge and a critical necessity for ensuring safety in autonomous driving systems. In contrast to most existing data-driven approaches that directly predict future trajectories, we rethink this task from a planning perspective, advocating a "First Reasoning, Then Forecasting" strategy that explicitly incorporates behavior intentions as spatial guidance for trajectory prediction. To achieve this, we introduce an interpretable, reward-driven intention reasoner grounded in a novel query-centric Inverse Reinforcement Learning (IRL) scheme. Our method first encodes traffic agents and scene elements into a unified vectorized representation, then aggregates contextual features through a query-centric paradigm. This enables the derivation of a reward distribution, a compact yet informative representation of the target agent's behavior within the given scene context via IRL. Guided by this reward heuristic, we perform policy rollouts to reason about multiple plausible intentions, providing valuable priors for subsequent trajectory generation. Finally, we develop a hierarchical DETR-like decoder integrated with bidirectional selective state space models to produce accurate future trajectories along with their associated probabilities. Extensive experiments on the large-scale Argoverse and nuScenes motion forecasting datasets demonstrate that our approach significantly enhances trajectory prediction confidence, achieving highly competitive performance relative to state-of-the-art …
Poster
ChangWon Kang · Jisong Kim · Hongjae Shin · Junseo Park · Jun Won Choi

[ Exhibit Hall I ]

Abstract
Multi-task learning (MTL) has emerged as a promising approach to jointly optimize multiple perception tasks in autonomous driving, but existing methods suffer from feature interference and inefficient task-specific learning. In this paper, we introduce MAESTRO, a novel query-based framework that explicitly generates task-specific features to mitigate feature interference and improve efficiency in multi-task 3D perception. Our model consists of three key components: Semantic Query Generator (SQG), Task-Specific Feature Generator (TSFG), and Scene Query Aggregator (SQA). SQG generates query features and decomposes them into foreground and background queries to facilitate selective feature sharing. TSFG refines task-specific features by integrating decomposed queries with voxel features while suppressing irrelevant information. The detection and map heads generate task-aware queries, which SQA aggregates with the initially extracted queries from SQG to enhance semantic occupancy prediction. Extensive evaluations on the nuScenes benchmark show that MAESTRO achieves state-of-the-art performance across all tasks. Our model overcomes the performance trade-off among tasks in multi-task learning, where improving one task often hinders others, and sets a new benchmark in multi-task 3D perception.
Poster
Pei He · Lingling Li · Licheng Jiao · Ronghua Shang · Fang Liu · Shuang Wang · Xu Liu · wenping ma

[ Exhibit Hall I ]

Abstract
Domain generalization in 3D segmentation is a critical challenge in deploying models to unseen environments. Current methods mitigate the domain shift by augmenting the data distribution of point clouds. However, the model learns global geometric patterns in point clouds while ignoring the category-level distribution and alignment. In this paper, a category-level geometry learning framework is proposed to explore the domain-invariant geometric features for domain generalized 3D semantic segmentation. Specifically, Category-level Geometry Embedding (CGE) is proposed to perceive the fine-grained geometric properties of point cloud features, which constructs the geometric properties of each class and couples geometric embedding to semantic learning. Secondly, Geometric Consistent Learning (GCL) is proposed to simulate the latent 3D distribution and align the category-level geometric embeddings, allowing the model to focus on the geometric invariant information to improve generalization. Experimental results verify the effectiveness of the proposed method, which has very competitive segmentation accuracy compared with the state-of-the-art domain generalized point cloud methods. The code will be available.
Poster
Wenhao Xu · Wenming Weng · Yueyi Zhang · Ruikang Xu · Zhiwei Xiong

[ Exhibit Hall I ]

Abstract
Deformable 3D Gaussian Splatting (3D-GS) is limited by missing intermediate motion information due to the low temporal resolution of RGB cameras. To address this, we introduce the first approach combining event cameras, which capture high-temporal-resolution, continuous motion data, with deformable 3D-GS for dynamic scene reconstruction. We observe that threshold modeling for events plays a crucial role in achieving high-quality reconstruction. Therefore, we propose a GS-Threshold Joint Modeling strategy, creating a mutually reinforcing process that greatly improves both 3D reconstruction and threshold modeling. Moreover, we introduce a Dynamic-Static Decomposition strategy that first identifies dynamic areas by exploiting the inability of static Gaussians to represent motions, then applies a buffer-based soft decomposition to separate dynamic and static areas. This strategy accelerates rendering by avoiding unnecessary deformation in static areas, and focuses on dynamic areas to enhance fidelity. Additionally, we contribute the first event-inclusive 4D benchmark with synthetic and real-world dynamic scenes, on which our method achieves state-of-the-art performance.
Poster
Andrea Conti · Matteo Poggi · Valerio Cambareri · Martin Oswald · Stefano Mattoccia

[ Exhibit Hall I ]

Abstract
Time-of-Flight (ToF) sensors provide efficient active depth sensing at relatively low power budgets; among such designs, only very sparse measurements from low-resolution sensors are considered to meet the increasingly limited power constraints of mobile and AR/VR devices. However, such extreme sparsity levels limit the seamless usage of ToF depth in SLAM. In this work, we propose ToF-Splatting, the first 3D Gaussian Splatting-based SLAM pipeline tailored to effectively use very sparse ToF input data. Our approach improves upon the state of the art by introducing a multi-frame integration module, which produces dense depth maps by merging cues from extremely sparse ToF depth, monocular color, and multi-view geometry. Extensive experiments on both synthetic and real sparse ToF datasets demonstrate the viability of our approach, as it achieves state-of-the-art tracking and mapping performance on reference datasets.
Poster
Jingming He · Chongyi Li · Shiqi Wang · Sam Kwong

[ Exhibit Hall I ]

Abstract
Recent works propose extending 3DGS with semantic feature vectors for simultaneous semantic segmentation and image rendering. However, these methods often treat the semantic and rendering branches separately, relying solely on 2D supervision while ignoring the 3D Gaussian geometry. Moreover, current adaptive strategies adapt the Gaussian set based solely on rendering gradients, which can be insufficient in subtle or textureless regions. In this work, we propose a joint enhancement framework for 3D semantic Gaussian modeling that synergizes the semantic and rendering branches. Firstly, unlike conventional point cloud shape encoding, we introduce an anisotropic 3D Gaussian Chebyshev descriptor using the Laplace–Beltrami operator to capture fine-grained 3D shape details, thereby distinguishing objects with similar appearances and reducing reliance on potentially noisy 2D guidance. In addition, rather than relying solely on rendering gradients, we adaptively adjust Gaussian allocation and spherical harmonics (SH) using local semantic and shape signals, enhancing rendering efficiency through selective resource allocation. Finally, we employ a cross-scene knowledge transfer module to continuously update learned shape patterns, enabling faster convergence and robust representations without relearning shape information from scratch for each new scene. Experiments on multiple datasets demonstrate improvements in segmentation accuracy and rendering quality while maintaining high rendering frame rates.
Poster
Giacomo Meanti · Thomas Ryckeboer · Michael Arbel · Julien Mairal

[ Exhibit Hall I ]

Abstract
Inverse problems provide a fundamental framework for image reconstruction tasks, spanning, for instance, deblurring, calibration, and low-light enhancement. While widely used, they often assume full knowledge of the forward model---an unrealistic expectation---and collecting ground-truth and measurement pairs is time-consuming and labor-intensive. Without paired supervision or an invertible forward model, solving inverse problems becomes significantly more challenging and error-prone. To address this, strong priors have traditionally been introduced to regularize the problem, enabling solutions from single images alone. In this work, however, we demonstrate that with minimal assumptions on the forward model and by leveraging small, unpaired clean and degraded datasets, we can achieve good estimates of the true degradation. We employ conditional flow matching to efficiently model the degraded data distribution and explicitly learn the forward model using a tailored distribution-matching loss. Through experiments on uniform and non-uniform deblurring tasks, we show that our method outperforms both single-image blind and unsupervised approaches, narrowing the gap to non-blind methods. We also showcase the effectiveness of our method with a proof of concept for automatic lens calibration---a real-world application traditionally requiring time-consuming experiments and specialized equipment. In contrast, our approach achieves this with minimal data acquisition effort.
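A minimal sketch of a standard conditional flow-matching training step with linear interpolation paths, the generic objective referred to above; the velocity network and the pairing with Gaussian noise are placeholder assumptions, and the paper's tailored distribution-matching loss for the forward model is not reproduced here.

import torch

def cfm_step(velocity_net, x_data):
    """One conditional flow-matching step with linear (rectified-flow style)
    paths: x_t = (1 - t) * x0 + t * x1, regression target is x1 - x0."""
    x1 = x_data                                   # samples from the target distribution
    x0 = torch.randn_like(x1)                     # base Gaussian samples
    t = torch.rand(x1.shape[0], *([1] * (x1.dim() - 1)), device=x1.device)
    xt = (1.0 - t) * x0 + t * x1
    target_velocity = x1 - x0
    pred = velocity_net(xt, t.flatten())          # velocity_net is a placeholder model
    return torch.nn.functional.mse_loss(pred, target_velocity)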
Poster
Jonas Mirlach · Lei Wan · Andreas Wiedholz · Hannan Keen · Andreas Eich

[ Exhibit Hall I ]

Abstract
In autonomous driving, the integration of roadside perception systems is essential for overcoming occlusion challenges and enhancing the safety of vulnerable road users (VRUs). While LiDAR and visual (RGB) sensors are commonly used, thermal imaging remains underrepresented in datasets, despite its acknowledged advantages for VRU detection in extreme lighting conditions. In this paper, we present R-LiViT, the first dataset to combine LiDAR, RGB, and thermal imaging from a roadside perspective, with a strong focus on VRUs. R-LiViT captures three intersections during both day and night, ensuring a diverse dataset. It includes 10,000 LiDAR frames and 2,400 temporally and spatially aligned RGB and thermal images across over 150 traffic scenarios, with 6 and 8 annotated classes respectively, providing a comprehensive resource for tasks such as object detection and tracking. The dataset and the code for reproducing our evaluation results are made publicly available.
Poster
Bowen Wang · Yafei Wang · Wei Gong · Siheng Chen · Genjia Liu · Minhao Xiong · Chin Long Ng

[ Exhibit Hall I ]

Abstract
Whether autonomous driving can effectively handle challenging scenarios such as bad weather and complex traffic environments is still in doubt. One of the critical difficulties is that single-view perception makes it hard to obtain complementary perceptual information in multi-condition scenes, such as those involving occlusion and congestion. To investigate the advantages of collaborative perception in high-risk driving scenarios, we construct a multi-condition dataset for large-range vehicle-infrastructure cooperative perception, called V2XScenes, which includes seven typical multi-modal layouts along a successive road section. In particular, each selected scene is labeled with a specific condition description, and we provide unique object tracking numbers across the entire road section and sequential frames to ensure consistency. Comprehensive cooperative perception benchmarks of 3D object detection and tracking for large-range roadside scenes are summarized, and the quantitative results based on the state of the art demonstrate the effectiveness of collaborative perception in challenging scenes. The data and benchmark code of V2XScenes will be released.
Poster
Bingyi Liu · Jian Teng · Hongfei Xue · Enshu Wang · Chuanhui Zhu · Pu Wang · Libing Wu

[ Exhibit Hall I ]

Abstract
Collaborative perception significantly enhances individual vehicle perception performance through the exchange of sensory information among agents. However, real-world deployment faces challenges due to bandwidth constraints and inevitable calibration errors during information exchange. To address these issues, we propose mmCooper, a novel multi-agent, multi-stage, communication-efficient, and collaboration-robust cooperative perception framework. Our framework leverages a multi-stage collaboration strategy that dynamically and adaptively balances intermediate- and late-stage information to share among agents, enhancing perceptual performance while maintaining communication efficiency. To support robust collaboration despite potential misalignments and calibration errors, our framework prevents misleading low-confidence sensing information from transmission and refines the received detection results from collaborators to improve accuracy. The extensive evaluation results on both real-world and simulated datasets demonstrate the effectiveness of the mmCooper framework and its components.
Poster
Chaesong Park · Eunbin Seo · JihyeonHwang · Jongwoo Lim

[ Exhibit Hall I ]

Abstract
In this paper, we introduce SC-Lane, a novel slope-aware and temporally consistent heightmap estimation framework for 3D lane detection. Unlike previous approaches that rely on fixed slope anchors, SC-Lane adaptively determines the optimal fusion of slope-specific height features, improving robustness to diverse road geometries. To achieve this, we propose a Slope-Aware Adaptive Feature module that dynamically predicts the optimal weights for integrating multi-slope representations into a unified heightmap. Additionally, a Height Consistency Module enforces temporal coherence, ensuring stable and accurate height estimation across consecutive frames, which is crucial for real-world driving scenarios. To evaluate the effectiveness of SC-Lane, we introduce a LiDAR-derived heightmap dataset and adopt standard evaluation metrics, including Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and threshold-based accuracy. While these metrics are widely used in surface and depth estimation, their application to road height estimation has been underexplored. Extensive experiments on the OpenLane benchmark demonstrate that SC-Lane significantly improves both height estimation and 3D lane detection, achieving state-of-the-art performance with an F-score of 64.3% — outperforming existing methods by a notable margin. These results highlight SC-Lane’s potential for enhancing the reliability of autonomous driving perception. The code and dataset used in this study will be made publicly available …
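A minimal sketch of the three heightmap metrics named above (MAE, RMSE, threshold-based accuracy); the absolute error threshold and the valid-pixel mask are assumptions for illustration.

import numpy as np

def heightmap_metrics(pred, gt, valid_mask, delta=0.1):
    """MAE, RMSE, and threshold accuracy (fraction of valid pixels whose absolute
    height error is below `delta`; the unit of `delta` is assumed to be meters)."""
    err = np.abs(pred[valid_mask] - gt[valid_mask])
    mae = err.mean()
    rmse = np.sqrt((err ** 2).mean())
    acc = (err < delta).mean()
    return {"MAE": mae, "RMSE": rmse, f"acc@{delta}": acc}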
Poster
Binbin Xiang · Maciej Wielgosz · Stefano Puliti · Kamil Král · Martin Krůček · Azim Missarov · Rasmus Astrup

[ Exhibit Hall I ]

Abstract
The segmentation of forest LiDAR 3D point clouds, including both individual tree and semantic segmentation, is fundamental for advancing forest management and ecological research. However, current approaches often struggle with the complexity and variability of natural forest environments. We present ForestFormer3D, a new unified and end-to-end framework designed for precise individual tree and semantic segmentation. ForestFormer3D incorporates ISA-guided query point selection, a score-based block merging strategy during inference, and a one-to-many association mechanism for effective training. By combining these new components, our model achieves state-of-the-art performance for individual tree segmentation on the newly introduced FOR-instanceV2 dataset, which spans diverse forest types and regions. Additionally, ForestFormer3D generalizes well to unseen test sets (Wytham woods and LAUTx), showcasing its robustness across different forest conditions and sensor modalities. The FOR-instanceV2 dataset and the ForestFormer3D code will be released post-acceptance.
Poster
Andrea Simonelli · Norman Müller · Peter Kontschieder

[ Exhibit Hall I ]

Abstract
The increasing availability of digital 3D environments, whether through image reconstruction, generation, or scans obtained via lasers or robots, is driving innovation across various fields. Among the numerous applications, there is a significant demand for those that enable 3D interaction, such as 3D Interactive Segmentation, which is useful for tasks like object selection and manipulation. Additionally, there is a persistent need for solutions that are efficient, precise, and consistently perform well across diverse settings, particularly in unseen environments and with unfamiliar objects. In this work, we introduce a method that consistently surpasses previous state-of-the-art techniques on both in-domain and out-of-domain datasets. Our simple approach integrates a voxel-based sparse encoder with a lightweight transformer-based decoder that implements implicit click fusion, achieving superior performance and maximizing efficiency. Our method demonstrates substantial improvements on benchmark datasets, including ScanNet, ScanNet++, S3DIS, and KITTI-360, and also on unseen geometric distributions such as Gaussian Splatting.
Poster
Bhavya Goyal · Felipe Gutierrez-Barragan · Wei Lin · Andreas Velten · Yin Li · Mohit Gupta

[ Exhibit Hall I ]

Abstract
LiDAR-based 3D sensors provide point clouds, a canonical 3D representation used in various 3D scene understanding tasks. Modern LiDARs face key challenges in various real-world scenarios such as long-distance or low-albedo objects, producing sparse or erroneous point clouds. These errors, which are rooted in the noisy raw LiDAR measurements, get propagated to downstream perception models, resulting in potentially severe loss of accuracy. This is because conventional 3D processing pipelines used to construct point clouds from raw LiDAR measurements do not retain the uncertainty information available in the raw sensor data. We propose a novel 3D scene representation called Probabilistic Point Clouds (PPC) where each point is augmented with a probability attribute that encapsulates the measurement uncertainty (confidence) in raw data. We further introduce inference approaches that leverage PPC for robust 3D object detection; these methods are versatile and can be used as computationally lightweight drop-in modules in 3D inference pipelines. We demonstrate, via both simulations and real captures, that PPC-based 3D inference methods outperform several baselines with LiDAR as well as camera-LiDAR fusion models, across challenging indoor and outdoor scenarios involving small, distant, and low-albedo objects, as well as strong ambient light.
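An illustrative data structure for a point cloud in which every point carries a confidence attribute, in the spirit of the PPC representation described above; the field names and the simple confidence filter are assumptions, not the paper's inference modules.

import numpy as np
from dataclasses import dataclass

@dataclass
class ProbabilisticPointCloud:
    xyz: np.ndarray            # (N, 3) point coordinates
    intensity: np.ndarray      # (N,) LiDAR return intensity
    confidence: np.ndarray     # (N,) per-point measurement confidence in [0, 1]

    def filter(self, min_conf=0.2):
        """Drop points below a confidence threshold (illustrative use only)."""
        keep = self.confidence >= min_conf
        return ProbabilisticPointCloud(self.xyz[keep], self.intensity[keep],
                                       self.confidence[keep])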
Poster
Andreas Engelhardt · Mark Boss · Vikram Voleti · Chun-Han Yao · Hendrik Lensch · Varun Jampani

[ Exhibit Hall I ]

Abstract
We present Stable Video Materials 3D (SViM3D), a framework to predict multi-view consistent physically based rendering (PBR) materials, given a single image. Recently, video diffusion models have been successfully used to reconstruct 3D objects from a single image efficiently. However, reflectance is still represented by simple material models or needs to be estimated in additional pipeline steps to enable relighting and controlled appearance edits. We extend a latent video diffusion model to output spatially-varying PBR parameters and surface normals jointly with each generated RGB view based on explicit camera control. This unique setup allows for direct relighting in a 2.5D setting, and for generating a 3D asset using our model as neural prior. We introduce various mechanisms to this pipeline that improve quality in this ill-posed setting. We show state-of-the-art relighting and novel view synthesis performance on multiple object-centric datasets. Our method generalizes to diverse image inputs, enabling the generation of relightable 3D assets useful in AR/VR, movies, games and other visual media.
Poster
YIWEN CHEN · Hieu (Hayden) Nguyen · Vikram Voleti · Varun Jampani · Huaizu Jiang

[ Exhibit Hall I ]

Abstract
We introduce HouseCrafter, a novel approach that can lift a 2D floorplan into a complete large 3D indoor scene (e.g., a house). Our key insight is to adapt a 2D diffusion model, which is trained on web-scale images, to generate consistent multi-view color (RGB) and depth (D) images across different locations of the scene. Specifically, the RGB-D images are generated autoregressively in batches along sampled locations derived from the floorplan. At each step, the diffusion model conditions on previously generated images to produce new images at nearby locations. The global floorplan and attention design in the diffusion model ensures the consistency of the generated images, from which a 3D scene can be reconstructed. Through extensive evaluation on the 3D-FRONT dataset, we demonstrate that HouseCrafter can generate high-quality house-scale 3D scenes. Ablation studies also validate the effectiveness of different design choices. We will release our code and model weights.
Poster
Junyan Ye · Jun He · Weijia Li · Zhutao Lv · Yi Lin · Jinhua Yu · Haote Yang · Conghui He

[ Exhibit Hall I ]

Abstract
Ground-to-aerial image synthesis focuses on generating realistic aerial images from corresponding ground street view images while maintaining consistent content layout, simulating a top-down view. The significant viewpoint difference leads to domain gaps between views, and dense urban scenes limit the visible range of street views, making this cross-view generation task particularly challenging. In this paper, we introduce SkyDiffusion, a novel cross-view generation method for synthesizing aerial images from street view images, utilizing a diffusion model and the Bird’s-Eye View (BEV) paradigm. The Curved-BEV method in SkyDiffusion converts street-view images into a BEV perspective, effectively bridging the domain gap, and employs a "multi-to-one" mapping strategy to address occlusion issues in dense urban scenes. Next, SkyDiffusion designed a BEV-guided diffusion model to generate content-consistent and realistic aerial images. Additionally, we introduce a novel dataset, Ground2Aerial-3, designed for diverse ground-to-aerial image synthesis applications, including disaster scene aerial synthesis, low-altitude UAV image synthesis, and historical high-resolution satellite image synthesis tasks. Experimental results demonstrate that SkyDiffusion outperforms state-of-the-art methods on cross-view datasets across natural (CVUSA), suburban (CVACT), urban (VIGOR-Chicago), and various application scenarios (G2A-3), achieving realistic and content-consistent aerial image generation. The code, datasets and more information of this work can be found at https://skydiffusion0307.github.io/.
Poster
Xiaobao Wei · Qingpo Wuwu · Zhongyu Zhao · Zhuangzhe Wu · Nan Huang · Ming Lu · ningning ma · Shanghang Zhang

[ Exhibit Hall I ]

Abstract
Photorealistic reconstruction of street scenes is essential for developing real-world simulators in autonomous driving. While recent methods based on 3D/4D Gaussian Splatting (GS) have demonstrated promising results, they still encounter challenges in complex street scenes due to the unpredictable motion of dynamic objects. Current methods typically decompose street scenes into static and dynamic objects, learning the Gaussians in either a supervised manner (e.g., w/ 3D bounding-box) or a self-supervised manner (e.g., w/o 3D bounding-box). However, these approaches do not effectively model the motions of dynamic objects (e.g., the motion speed of pedestrians is clearly different from that of vehicles), resulting in suboptimal scene decomposition. To address this, we propose Explicit Motion Decomposition (EMD), which models the motions of dynamic objects by introducing learnable motion embeddings to the Gaussians, enhancing the decomposition in street scenes. The proposed plug-and-play EMD module compensates for the lack of motion modeling in self-supervised street Gaussian splatting methods. We also introduce tailored training strategies to extend EMD to supervised approaches. Comprehensive experiments demonstrate the effectiveness of our method, achieving state-of-the-art novel view synthesis performance in self-supervised settings. The code will be released.
Poster
Mingyuan Sun · Zheng Fang · Jiaxu Wang · Kun-Yi Zhang · Qiang Zhang · Renjing Xu

[ Exhibit Hall I ]

Abstract
We present GravlensX, an innovative method for rendering black holes with gravitational lensing effects using neural networks. The methodology involves training neural networks to fit the spacetime around black holes and then employing these trained models to generate the paths of light rays affected by gravitational lensing. This enables efficient and scalable simulations of black holes, significantly decreasing the time required for rendering compared to traditional methods. We validate our approach through extensive rendering of multiple black hole systems with the superposed Kerr metric, demonstrating its capability to produce accurate visualizations with a $15\times$ reduction in computational time. Our findings suggest that neural networks offer a promising alternative for rendering complex astrophysical phenomena, potentially paving a new path to astronomical visualization. Our code will be open-source soon.
Poster
Xinlong Ding · Hongwei Yu · Jiawei Li · Feifan Li · Yu Shang · Bochao Zou · Huimin Ma · Jiansheng Chen

[ Exhibit Hall I ]

Abstract
Camera pose estimation is a fundamental computer vision task that is essential for applications like visual localization and multi-view stereo reconstruction. In the object-centric scenarios with sparse inputs, the accuracy of pose estimation can be significantly influenced by background textures that occupy major portions of the images across different viewpoints. In light of this, we introduce the Kaleidoscopic Background Attack (KBA), which uses identical segments to form discs with multi-fold radial symmetry. These discs maintain high similarity across different viewpoints, enabling effective attacks on pose estimation models even with natural texture segments. Additionally, a projected orientation consistency loss is proposed to optimize the kaleidoscopic segments, leading to significant enhancement in the attack effectiveness. Experimental results show that adversarial kaleidoscopic backgrounds optimized by KBA can effectively attack various camera pose estimation models.
Poster
Zitong Zhang · Suranjan Gautam · Rui Yu

[ Exhibit Hall I ]

Abstract
Generating immersive 360° indoor panoramas from 2D top-down views has applications in virtual reality, interior design, real estate, and robotics. This task is challenging due to the lack of explicit 3D structure and the need for geometric consistency and photorealism. We propose Top2Pano, an end-to-end model for synthesizing realistic indoor panoramas from top-down views. Our method estimates volumetric occupancy to infer 3D structures, then uses volumetric rendering to generate coarse color and depth panoramas. These guide a diffusion-based refinement stage using ControlNet, enhancing realism and structural fidelity. Evaluations on two datasets show Top2Pano outperforms baselines, effectively reconstructing geometry, occlusions, and spatial arrangements. It also generalizes well, producing high-quality panoramas from schematic floorplans. Our results highlight Top2Pano's potential in bridging top-down views with immersive indoor synthesis.
Poster
Yuxin CHENG · Binxiao Huang · Taiqiang Wu · Wenyong Zhou · Chenchen Ding · Zhengwu Liu · Graziano Chesi · Ngai Wong

[ Exhibit Hall I ]

Abstract
3D Gaussian inpainting, a critical technique for numerous applications in virtual reality and multimedia, has made significant progress with pretrained diffusion models. However, ensuring multi-view consistency, an essential requirement for high-quality inpainting, remains a key challenge. In this work, we present PAInpainter, a novel approach designed to advance 3D Gaussian inpainting by leveraging perspective-aware content propagation and consistency verification across multi-view inpainted images. Our method iteratively refines inpainting and optimizes the 3D Gaussian representation with multiple views adaptively sampled from a perspective graph. By propagating inpainted images as prior information and verifying consistency across neighboring views, PAInpainter substantially enhances global consistency and texture fidelity in restored 3D scenes. Extensive experiments demonstrate the superiority of PAInpainter over existing methods. Our approach achieves superior 3D inpainting quality, with PSNR scores of 26.03 dB and 29.51 dB on the SPIn-NeRF and NeRFiller datasets, respectively, highlighting its effectiveness and generalization capability.
Poster
Liang Han · Xu Zhang · Haichuan Song · Kanle Shi · Liang Han · Zhizhong Han

[ Exhibit Hall I ]

Abstract
Surface reconstruction from sparse views aims to reconstruct a 3D shape or scene from few RGB images. However, existing generalization-based methods do not generalize well on views that were unseen during training, while the reconstruction quality of overfitting-based methods is still limited by the limited geometry clues. To address this issue, we propose SparseRecon, a novel neural implicit reconstruction method for sparse views with volume rendering-based feature consistency and an uncertainty-guided depth constraint. Firstly, we introduce a feature consistency loss across views to constrain the neural implicit field. This design alleviates the ambiguity caused by insufficient consistency information across views and ensures completeness and smoothness in the reconstruction results. Secondly, we employ an uncertainty-guided depth constraint to complement the feature consistency loss in areas with occlusion and insignificant features, which recovers geometry details for better reconstruction quality. Experimental results demonstrate that our method outperforms the state-of-the-art methods, producing high-quality geometry from sparse-view input, especially in scenarios with small overlapping views.
Poster
Zihui Gao · Jia-Wang Bian · Guosheng Lin · Hao Chen · Chunhua Shen

[ Exhibit Hall I ]

Abstract
Surface reconstruction and novel view rendering from sparse-view images are challenging. Signed Distance Function (SDF)-based methods struggle with fine details, while 3D Gaussian Splatting (3DGS)-based approaches lack global geometry coherence. We propose a novel hybrid method that combines both strengths: SDF captures coarse geometry to enhance 3DGS-based rendering, while newly rendered images from 3DGS refine SDF details for accurate surface reconstruction. As a result, our method surpasses state-of-the-art approaches in surface reconstruction and novel view synthesis on the DTU and MobileBrick datasets. The code will be released upon acceptance.
Poster
Jianyun Xu · Song Wang · Ziqian Ni · Chunyong Hu · Sheng Yang · Jianke Zhu · Qiang Li

[ Exhibit Hall I ]

Abstract
We present SAM4D, a multi-modal and temporal foundation model designed for promptable segmentation across camera and LiDAR streams. Unified Multi-modal Positional Encoding (UMPE) is introduced to align camera and LiDAR features in a shared 3D space, enabling seamless cross-modal prompting and interaction. Additionally, we propose Motion-aware Cross-modal Memory Attention (MCMA), which leverages ego-motion compensation to enhance temporal consistency and long-horizon feature retrieval, ensuring robust segmentation across dynamically changing autonomous driving scenes. To avoid annotation bottlenecks, we develop a multi-modal automated data engine that synergizes VFM-driven video masklets, spatiotemporal 4D reconstruction, and cross-modal masklet fusion. This framework generates camera-LiDAR aligned pseudo-labels at a speed orders of magnitude faster than human annotation while preserving VFM-derived semantic fidelity in point cloud representations. We conduct extensive experiments on the constructed Waymo-4DSeg dataset, which demonstrate the powerful cross-modal segmentation ability of the proposed SAM4D and its great potential for data annotation.
Poster
Chuxin Wang · Yixin Zha · Wenfei Yang · Tianzhu Zhang

[ Exhibit Hall I ]

Abstract
Recently, Mamba-based methods have demonstrated impressive performance in point cloud representation learning by leveraging State Space Models (SSMs), which offer efficient context modeling with linear complexity. However, these methods still face two key issues that limit the potential of SSMs: destroying the adjacency of 3D points during SSM processing, and failing to retain long-sequence memory as the input length increases in downstream tasks. To address these issues, we propose StruMamba3D, a novel paradigm for self-supervised point cloud representation learning. It enjoys several merits. First, we design spatial states and use them as proxies to preserve spatial dependencies among points. Second, we enhance the SSM with a state-wise update strategy and incorporate a lightweight convolution to facilitate interactions between spatial states for efficient structure modeling. Third, our method reduces the sensitivity of pre-trained Mamba-based models to varying input lengths by introducing a sequence-length-adaptive strategy. Experimental results across four downstream tasks showcase the superior performance of our method. In addition, our method attains SOTA accuracy of 95.1\% on ModelNet40 and 92.75\% on the most challenging split of ScanObjectNN without a voting strategy.
Poster
In Cho · Youngbeom Yoo · Subin Jeon · Seon Joo Kim

[ Exhibit Hall I ]

Abstract
Constructing a compressed latent space through a variational autoencoder (VAE) is the key to efficient 3D diffusion models. This paper introduces COD-VAE, a VAE that encodes 3D shapes into a COmpact set of 1D latent vectors without sacrificing quality. COD-VAE introduces a two-stage autoencoder scheme to improve compression and decoding efficiency. First, our encoder block progressively compresses point clouds into compact latent vectors via intermediate point patches. Second, our triplane-based decoder reconstructs dense triplanes from latent vectors instead of directly decoding neural fields, significantly reducing the computational overhead of neural field decoding. Finally, we propose uncertainty-guided token pruning, which allocates resources adaptively by skipping computations in simpler regions and improves decoder efficiency. Experimental results demonstrate that COD-VAE achieves 16$\times$ compression compared to the baseline while maintaining quality. This enables a $20.8\times$ speedup in generation, highlighting that a large number of latent vectors is not a prerequisite for high-quality reconstruction and generation.
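As a loose illustration of uncertainty-guided token pruning (hypothetical names and shapes, not COD-VAE's actual code), one can keep only the most uncertain latent tokens for the expensive decoding path and skip the rest:

```python
import torch

def prune_tokens(tokens, uncertainty, keep_ratio=0.25):
    # tokens:      (B, N, D) latent tokens entering the decoder
    # uncertainty: (B, N) predicted per-token uncertainty; low values mark "simple" regions
    k = max(1, int(tokens.shape[1] * keep_ratio))
    idx = uncertainty.topk(k, dim=1).indices                  # most uncertain tokens per sample
    kept = torch.gather(tokens, 1, idx.unsqueeze(-1).expand(-1, -1, tokens.shape[2]))
    return kept, idx                                          # skipped tokens take a cheap shortcut path
```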
Poster
Nithin Gopalakrishnan Nair · Srinivas Kaza · Xuan Luo · Jungyeon Park · Stephen Lombardi · Vishal Patel

[ Exhibit Hall I ]

Abstract
Large transformer-based models have made significant progress in generalizable novel view synthesis (NVS) from sparse input views, generating novel viewpoints without the need for test-time optimization. However, these models are constrained by the limited diversity of publicly available scene datasets, making most real-world (in-the-wild) scenes out-of-distribution. To overcome this, we incorporate synthetic training data generated from diffusion models, which improves generalization across unseen domains. While synthetic data offers scalability, we identify artifacts introduced during data generation as a key bottleneck affecting reconstruction quality. To address this, we propose a token disentanglement process within the transformer architecture, enhancing feature separation and ensuring more effective learning. This refinement not only improves reconstruction quality over standard transformers but also enables scalable training with synthetic data. As a result, our method outperforms existing models on both in-dataset and cross-dataset evaluations, achieving state-of-the-art results across multiple benchmarks while significantly reducing computational costs.
Poster
Wenjie Huang · Qi Yang · Shuting Xia · He Huang · Yiling Xu · Zhu Li

[ Exhibit Hall I ]

Abstract
Existing AI-based point cloud compression methods struggle with dependence on specific training data distributions, which limits their real-world deployment. Implicit Neural Representation (INR) methods solve the above problem by encoding overfitted network parameters into the bitstream, resulting in more distribution-agnostic results. However, due to limitations in encoding time and decoder size, current INR-based methods only consider lossy geometry compression. In this paper, we propose the first INR-based lossless point cloud geometry compression method, called Lossless Implicit Neural Representations for Point Cloud Geometry Compression (LINR-PCGC). To accelerate encoding, we design a group-of-point-clouds-level coding framework with an effective network initialization strategy, which reduces encoding time by around 60%. A lightweight coding network based on multiscale SparseConv, consisting of scale context extraction, child node prediction, and model compression modules, is proposed to realize fast inference and a compact decoder size. Experimental results show that our method consistently outperforms traditional and AI-based methods: for example, at convergence on the MVUB dataset, our method reduces the bitstream by approximately 21.21% compared to G-PCC TMC13v23 and by 21.95% compared to SparsePCGC.
Poster
Miaowei Wang · Changjian Li · Amir Vaxman

[ Exhibit Hall I ]

Abstract
We introduce Canonical Consolidation Fields (CanFields), a novel method that interpolates arbitrary-length sequences of independently sampled 3D point clouds into a unified, continuous, and coherent deforming shape. Unlike prior methods that oversmooth geometry or produce topological and geometric artifacts, CanFields jointly optimizes fine-detailed geometry and deformation in an unsupervised fitting with two novel bespoke modules. First, we introduce a dynamic consolidator module that adjusts the input and assigns confidence scores, balancing the optimization of the canonical shape and its motion. Second, we represent the motion as a diffeomorphic flow parameterized by a smooth velocity field. We validate the robustness and accuracy of CanFields on more than 50 diverse sequences, demonstrating its superior performance even with missing regions, noisy raw scans, and sparse data. The code is available in the supplemental material and will be made publicly available upon publication.
Poster
Chen Shi · Shaoshuai Shi · Kehua Sheng · Bo Zhang · Li Jiang

[ Exhibit Hall I ]

Abstract
Data-driven learning has advanced autonomous driving, yet task-specific models struggle with out-of-distribution scenarios due to their narrow optimization objectives and reliance on costly annotated data. We present DriveX, a self-supervised world model that learns generalizable scene dynamics and holistic representations (geometric, semantic, and motion) from large-scale driving videos. DriveX introduces Panoptic Scene Modeling (PSM), a module that unifies multimodal supervision—3D point cloud forecasting, 2D semantic representation, and image generation—to capture comprehensive scene evolution. To simplify learning complex dynamics, we propose a decoupled latent world modeling strategy that separates world representation learning from future state decoding, augmented by dynamic-aware ray sampling to enhance motion modeling. For downstream adaptation, we design Future Spatial Attention (FSA), a unified paradigm that dynamically aggregates spatiotemporal features from DriveX’s predictions to enhance task-specific inference. Extensive experiments demonstrate DriveX’s effectiveness: it achieves significant improvements in 3D future point cloud prediction over prior work, while attaining state-of-the-art results on diverse tasks including occupancy prediction, flow estimation, and end-to-end driving. These results validate DriveX’s capability as a general-purpose world model, paving the way for robust and unified autonomous driving frameworks.
Poster
Giwon Lee · Wooseong Jeong · Daehee Park · Jaewoo Jeong · Kuk-Jin Yoon

[ Exhibit Hall I ]

Abstract
Motion planning is a crucial component of autonomous robot driving. While various trajectory datasets exist, effectively utilizing them for a target domain remains challenging due to differences in agent interactions and environmental characteristics. Conventional approaches, such as domain adaptation or ensemble learning, leverage multiple source datasets but suffer from domain imbalance, catastrophic forgetting, and high computational costs. To address these challenges, we propose Interaction-Merged Motion Planning (IMMP), a novel approach that leverages parameter checkpoints trained on different domains during adaptation to the target domain. IMMP follows a two-step process: pre-merging to capture agent behaviors and interactions, sufficiently extracting diverse information from the source domain, followed by merging to construct an adaptable model that efficiently transfers diverse interactions to the target domain. Our method is evaluated on various planning benchmarks and models, demonstrating superior performance compared to conventional approaches.
Poster
Tianyu Hong · Xiaobo Zhou · Wenkai Hu · Qi Xie · Zhihui Ke · Tie Qiu

[ Exhibit Hall I ]

Abstract
Collaborative perception is considered a promising approach to address the inherent limitations of single-vehicle systems by sharing data among vehicles, thereby enhancing performance in perception tasks such as bird’s‐eye view (BEV) semantic segmentation. However, existing methods share the entire dense, scene-level BEV feature, which contains significant redundancy and lacks height information, ultimately leading to unavoidable bandwidth waste and performance degradation. To address these challenges, we present $\textit{GSCOOP}$, the first collaborative semantic segmentation framework that leverages sparse, object-centric 3D Gaussians to fundamentally overcome communication bottlenecks. By representing scenes with compact Gaussians that preserve complete spatial information, $\textit{GSCOOP}$ achieves both high perception accuracy and communication efficiency. To further optimize transmission, we introduce the Priority-Based Gaussian Selection (PGS) module to adaptively select critical Gaussians and a Semantic Gaussian Compression (SGC) module to compress Gaussian attributes with minimal overhead. Extensive experiments on OPV2V and V2X-Seq demonstrate that GSCOOP achieves state-of-the-art performance, even with more than $500\times$ lower communication volume.
Poster
Yupeng Zheng · Pengxuan Yang · Zebin Xing · Qichao Zhang · Yuhang Zheng · Yinfeng Gao · Pengfei Li · Teng Zhang · Zhongpu Xia · Peng Jia · XianPeng Lang · Dongbin Zhao

[ Exhibit Hall I ]

Abstract
End-to-end autonomous driving directly generates planning trajectories from raw sensor data, yet it typically relies on costly perception supervision to extract scene information. A critical research challenge arises: constructing an informative driving world model to enable perception annotation-free, end-to-end planning via self-supervised learning. In this paper, we present World4Drive, an end-to-end autonomous driving framework that employs vision foundation models to build latent world models for generating and evaluating multi-modal planning trajectories. Specifically, World4Drive first extracts scene features, including driving intention and world latent representations enriched with spatial-semantic priors provided by vision foundation models. It then generates multi-modal planning trajectories based on current scene features and driving intentions, and predicts multiple intention-driven future states within the latent space. Finally, it introduces a world model selector module to evaluate and select the best trajectory. We achieve perception annotation-free, end-to-end planning through self-supervised alignment between actual future observations and predicted observations reconstructed from the latent space. World4Drive achieves state-of-the-art performance without manual perception annotations on both the open-loop nuScenes and closed-loop NavSim benchmarks, demonstrating an 18.1% relative reduction in L2 error, a 46.7% lower collision rate, and a 3.75× speedup.
Poster
Chengchang Tian · Jianwei Ma · Yan Huang · Zhanye Chen · Honghao Wei · Hui Zhang · Wei Hong

[ Exhibit Hall I ]

Abstract
Feature-level fusion shows promise in collaborative perception (CP) through balanced performance and communication bandwidth trade-off. However, its effectiveness critically relies on input feature quality. The acquisition of high-quality features faces domain gaps from hardware diversity and deployment conditions, alongside temporal misalignment from transmission delays. These challenges degrade feature quality with cumulative effects throughout the collaborative network. In this paper, we present the Domain-And-Time Alignment (DATA) network, designed to systematically align features while maximizing their semantic representations for fusion. Specifically, we propose a Consistency-preserving Domain Alignment Module (CDAM) that reduces domain gaps through proximal-region hierarchical downsampling and observability-constrained discriminator. We further propose a Progressive Temporal Alignment Module (PTAM) to handle transmission delays via multi-scale motion modeling and two-stage compensation. Building upon the aligned features, an Instance-focused Feature Aggregation Module (IFAM) is developed to enhance semantic representations. Extensive experiments demonstrate that DATA achieves state-of-the-art performance on three typical datasets, maintaining robustness with severe communication delays and pose errors.
Poster
Chengwei Ren · Fan Zhang · Liangchao Xu · Liang Pan · Ziwei Liu · Wenping Wang · Xiao-Ping Zhang · Yuan Liu

[ Exhibit Hall I ]

Abstract
3D Gaussian Splatting (3DGS) is a prevailing technique for reconstructing large-scale 3D scenes, such as a room, a block, or even a city, from multiview images for novel view synthesis. Such large-scale scenes are not static: changes constantly happen in them, like a new building being built or a new decoration being set up. To keep the reconstructed 3D Gaussian fields up-to-date, a naive way is to reconstruct the whole scene after every change, which is extremely costly and inefficient. In this paper, we propose a new method called GauUpdate that allows partially updating an old 3D Gaussian field with new objects from a new 3D Gaussian field. However, simply inserting the new objects leads to inconsistent appearances because the old and new Gaussian fields may have different lighting environments from each other. GauUpdate addresses this problem by applying inverse rendering techniques in the 3DGS to recover both the materials and environmental lights. Based on the materials and lighting, we relight the new objects in the old 3D Gaussian field for consistent global illumination. For an accurate estimation of the materials and lighting, we put additional constraints on the materials and lighting conditions, that these two fields share the same …
Poster
Binjian Xie · Pengju Zhang · Hao Wei · Yihong Wu

[ Exhibit Hall I ]

Abstract
Single-view 3D reconstruction is a fundamental problem in computer vision, having a significant impact on downstream tasks such as autonomous driving, virtual reality and augmented reality. However, existing single-view reconstruction methods are unable to reconstruct the regions outside the input field-of-view or the areas occluded by visible parts. In this paper, we propose Hi-Gaussian, which employs feed-forward 3D Gaussians for efficient and generalizable single-view 3D reconstruction. A Normalized Spherical Projection module is introduced following an Encoder-Decoder network in our model, assigning a larger range to the transformed spherical coordinates, which can enlarge the field of view during scene reconstruction. Besides, to reconstruct occluded regions behind the visible part, we introduce a novel Hierarchical Gaussian Sampling strategy, utilizing two layers of Gaussians to hierarchically represent 3D scenes. We first use a pre-trained monocular depth estimation model to provide depth initialization for $leader$ Gaussians, and then leverage the $leader$ Gaussians to estimate the distribution followed by $follower$ Gaussians, which can flexibly move into occluded areas. Extensive experiments show that our method outperforms other methods for scene reconstruction and novel view synthesis, on both outdoor and indoor datasets.
Poster
Haoang Lu · Yuanqi Su · Xiaoning Zhang · Longjun Gao · Yu Xue · Le Wang

[ Exhibit Hall I ]

Abstract
This paper introduces VisHall3D, a novel two-stage framework for monocular semantic scene completion that aims to address the issues of feature entanglement and geometric inconsistency prevalent in existing methods. VisHall3D decomposes the scene completion task into two stages: reconstructing the visible regions (vision) and inferring the invisible regions (hallucination). In the first stage, VisFrontierNet, a visibility-aware projection module, is introduced to accurately trace the visual frontier while preserving fine-grained details. In the second stage, OcclusionMAE, a hallucination network, is employed to generate plausible geometries for the invisible regions using a noise injection mechanism. By decoupling scene completion into these two distinct stages, VisHall3D effectively mitigates feature entanglement and geometric inconsistency, leading to significantly improved reconstruction quality. The effectiveness of VisHall3D is validated through extensive experiments on two challenging benchmarks: SemanticKITTI and SSCBench-KITTI-360. VisHall3D achieves state-of-the-art performance, outperforming previous methods by a significant margin and paving the way for more accurate and reliable scene understanding in autonomous driving and other applications.
Poster
Yan Xia · Yunxiang Lu · Rui Song · Oussema Dhaouadi · Joao F. Henriques · Daniel Cremers

[ Exhibit Hall I ]

Abstract
We tackle the problem of localizing traffic cameras within a 3D reference map and propose a novel image-to-point cloud registration (I2P) method, TrafficLoc, in a coarse-to-fine matching fashion. To overcome the lack of large-scale real-world intersection datasets, we first introduce Carla Intersection, a new simulated dataset with 75 urban and rural intersections in Carla. We find that current I2P methods struggle with cross-modal matching under large viewpoint differences, especially at traffic intersections. TrafficLoc thus employs a novel Geometry-guided Attention Loss (GAL) to focus only on the corresponding geometric regions under different viewpoints during 2D-3D feature fusion. To address feature inconsistency in paired image patch-point groups, we further propose Inter-intra Contrastive Learning (ICL) to enhance the separation of 2D patch and 3D group features within each modality, and introduce Dense Training Alignment (DTA) with soft-argmax to improve position regression. Extensive experiments show our TrafficLoc greatly improves the performance over the SOTA I2P methods (**up to 86%**) on Carla Intersection and generalizes well to real-world data. TrafficLoc also achieves new SOTA performance on the KITTI and NuScenes datasets, demonstrating its superiority across both in-vehicle and traffic cameras. The code and dataset will be available upon acceptance.
Poster
Xiangdong Zhang · Shaofeng Zhang · Junchi Yan

[ Exhibit Hall I ]

Abstract
Point cloud learning, especially in a self-supervised way without manual labels, has gained growing attention in both the vision and learning communities due to its potential utility in a wide range of applications. Most existing generative approaches for point cloud self-supervised learning focus on recovering masked points from visible ones within a single view. Since a two-view pre-training paradigm inherently introduces greater diversity and variance, it may enable more challenging and informative pre-training. Inspired by this, we explore the potential of two-view learning in this domain. In this paper, we propose Point-PQAE, a cross-reconstruction generative paradigm that first generates two decoupled point clouds/views and then reconstructs one from the other. To achieve this goal, we develop a crop mechanism for point cloud view generation for the first time and further propose a novel positional encoding to represent the 3D relative position between the two decoupled views. Cross-reconstruction significantly increases the difficulty of pre-training compared to self-reconstruction, which enables our method to surpass previous single-modal self-reconstruction methods in 3D self-supervised learning. Specifically, it outperforms the self-reconstruction baseline (Point-MAE) by 6.5%, 7.0%, and 6.7% on three variants of ScanObjectNN with the Mlp-Linear evaluation protocol. Source code will be released.
Poster
Chensheng Peng · Ido Sobol · Masayoshi Tomizuka · Kurt Keutzer · Chenfeng Xu · Or Litany

[ Exhibit Hall I ]

Abstract
We present a novel framework for training 3D image-conditioned diffusion models using only 2D supervision. Recovering 3D structure from 2D images is inherently ill-posed due to the ambiguity of possible reconstructions, making generative models a natural choice. However, most existing 3D generative models rely on full 3D supervision, which is impractical due to the scarcity of large-scale 3D datasets. To address this, we propose leveraging sparse-view supervision as a scalable alternative. While recent reconstruction models use sparse-view supervision with differentiable rendering to lift 2D images to 3D, they are predominantly deterministic, failing to capture the diverse set of plausible solutions and producing blurry predictions in uncertain regions. A key challenge in training 3D diffusion models with 2D supervision is that the standard training paradigm requires both the denoising process and supervision to be in the same modality. We address this by decoupling the noisy samples being denoised from the supervision signal, allowing the former to remain in 3D while the latter is provided in 2D. Our approach leverages suboptimal predictions from a deterministic image-to-3D model—acting as a "teacher"—to generate noisy 3D inputs, enabling effective 3D diffusion training without requiring full 3D ground truth. We validate our framework on both object-level …
Poster
Xiangyu Han · Zhen Jia · Boyi Li · Yan Wang · Boris Ivanovic · Yurong You · Lingjie Liu · Yue Wang · Marco Pavone · Chen Feng · Yiming Li

[ Exhibit Hall I ]

Abstract
Photorealistic simulators are essential for the training and evaluation of vision-centric autonomous vehicles (AVs). At their core is Novel View Synthesis (NVS), a crucial capability that generates diverse unseen viewpoints to accommodate the broad and continuous pose distribution of AVs. Recent advances in radiance fields, such as 3D Gaussian Splatting, achieve photorealistic rendering at real-time speeds and have been widely used in modeling large-scale driving scenes. However, their performance is commonly evaluated using an interpolated setup with highly correlated training and test views. In contrast, extrapolation, where test views largely deviate from training views, remains underexplored, limiting progress in generalizable simulation technology. To address this gap, we leverage publicly available AV datasets with multiple traversals, multiple vehicles, and multiple cameras to build the first Extrapolated Urban View Synthesis (EUVS) benchmark. Meanwhile, we conduct both quantitative and qualitative evaluations of state-of-the-art NVS methods across different evaluation settings. Our results show that current NVS methods are prone to overfitting to training views. Besides, incorporating diffusion priors and improving geometry cannot fundamentally improve NVS under large view changes, highlighting the need for more robust approaches and large-scale training. We will release the data to help advance self-driving and urban robotics simulation technology.
Poster
Chiao-An Yang · Raymond A. Yeh

[ Exhibit Hall I ]

Abstract
Facial landmark detection is an important task in computer vision with numerous downstream applications, such as head pose estimation, expression analysis, face swapping, etc. Heatmap regression-based methods have been a strong contender in achieving state-of-the-art results in this task. These methods involve computing the argmax over the heatmaps to predict a landmark. As argmax is not differentiable, to enable end-to-end training on deep-nets, these methods rely on a differentiable approximation of argmax, namely Soft-argmax. In this work, we revisit this long-standing choice of using Soft-argmax and find that it may not be necessary. Instead, we propose an alternative training objective based on the classic structured prediction framework. Empirically, our method achieves state-of-the-art performance on three facial landmark benchmarks (WFLW, COFW, and 300W) with faster training convergence by roughly $2.2\times$ while maintaining intuitive design choices in our model.
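For reference, the Soft-argmax that the abstract revisits is the standard differentiable relaxation of argmax over a heatmap; a minimal generic sketch (not the paper's proposed structured-prediction objective):

```python
import torch

def soft_argmax_2d(heatmaps, beta=100.0):
    # heatmaps: (B, K, H, W) landmark heatmaps -> expected (x, y) coordinates, shape (B, K, 2)
    B, K, H, W = heatmaps.shape
    probs = torch.softmax(beta * heatmaps.reshape(B, K, -1), dim=-1).reshape(B, K, H, W)
    ys = torch.arange(H, dtype=probs.dtype, device=probs.device)
    xs = torch.arange(W, dtype=probs.dtype, device=probs.device)
    exp_y = (probs.sum(dim=3) * ys).sum(dim=2)   # expected row index, (B, K)
    exp_x = (probs.sum(dim=2) * xs).sum(dim=2)   # expected column index, (B, K)
    return torch.stack([exp_x, exp_y], dim=-1)
```

Because the coordinates are computed as an expectation over a softmax rather than a hard argmax, gradients flow through the heatmap, which is what makes end-to-end training possible and what this work argues may not be necessary.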
Poster
Tianhang Cheng · Albert Zhai · Evan Chen · Rui Zhou · Yawen Deng · Zitong Li · Kejie Zhao · Janice Shiu · Qianyu Zhao · Yide Xu · Xinlei Wang · Yuan Shen · Sheng Wang · Lisa Ainsworth · Kaiyu Guan · Shenlong Wang

[ Exhibit Hall I ]

Abstract
Learning 3D parametric shape models of objects has gained popularity in vision and graphics and has shown broad utility in 3D reconstruction, generation, understanding, and simulation. While powerful models exist for humans and animals, equally expressive approaches for modeling plants are lacking. In this work, we present Demeter, a data-driven parametric model that encodes key factors of plant morphology, including topology, shape, articulation, and deformation, into a compact learned representation. Unlike previous parametric models, Demeter handles varying shape topology across various species and models three sources of shape variation: articulation, subcomponent shape variation, and non-rigid deformation. To advance crop plant modeling, we collected a large-scale, ground-truthed dataset from a soybean farm as a testbed. Experiments show that Demeter effectively synthesizes shapes, reconstructs structures, and simulates biophysical processes. Code and data will be open-sourced.
Poster
Tianrui Lou · Xiaojun Jia · Siyuan Liang · Jiawei Liang · Ming Zhang · Yanjun Xiao · Xiaochun Cao

[ Exhibit Hall I ]

Abstract
Physical adversarial attack methods expose the vulnerabilities of deep neural networks and pose a significant threat to safety-critical scenarios such as autonomous driving. Camouflage-based physical attack is a more promising approach compared to the patch-based attack, offering stronger adversarial effectiveness in complex physical environments. However, most prior work relies on mesh priors of the target object and virtual environments constructed by simulators, which are time-consuming to obtain and inevitably differ from the real world. Moreover, due to the limitations of the backgrounds in training images, previous methods often fail to produce multi-view robust adversarial camouflage and tend to fall into sub-optimal solutions. Due to these reasons, prior work lacks adversarial effectiveness and robustness across diverse viewpoints and physical environments. We propose a physical attack framework based on 3D Gaussian Splatting (3DGS), named PGA, which provides rapid and precise reconstruction with few images, along with photo-realistic rendering capabilities. Our framework further enhances cross-view robustness and adversarial effectiveness by preventing mutual and self-occlusion among Gaussians and employing a min-max optimization approach that adjusts the imaging background of each viewpoint, helping the algorithm filter out non-robust adversarial features. Extensive experiments validate the effectiveness and superiority of PGA.
Poster
Katie Luo · Minh-Quan Dao · Zhenzhen Liu · Mark Campbell · Wei-Lun (Harry) Chao · Kilian Weinberger · Ezio Malis · Vincent FREMONT · Bharath Hariharan · Mao Shan · Stewart Worrall · Julie Stephany Berrio Perez

[ Exhibit Hall I ]

Abstract
Vehicle-to-everything (V2X) collaborative perception has emerged as a promising solution to address the limitations of single-vehicle perception systems. However, existing V2X datasets are limited in scope, diversity, and quality. To address these gaps, we present Mixed Signals, a comprehensive V2X dataset featuring 45.1k point clouds and 240.6k bounding boxes collected from three connected autonomous vehicles (CAVs) equipped with two different configurations of LiDAR sensors, plus a roadside unit with dual LiDARs. Our dataset provides point clouds and bounding box annotations across 10 classes, ensuring reliable data for perception training. We provide detailed statistical analysis on the quality of our dataset and extensively benchmark existing V2X methods on it. Mixed Signals is ready-to-use, with precise alignment and consistent annotations across time and viewpoints. We hope our work advances research in the emerging, impactful field of V2X perception.
Poster
Zijun Lin · Shuting He · Cheston Tan · Bihan Wen

[ Exhibit Hall I ]

Abstract
Sequential grounding in 3D point clouds (SG3D) refers to locating sequences of objects by following text instructions for a daily activity with detailed steps. Current 3D visual grounding (3DVG) methods treat text instructions with multiple steps as a whole, without extracting useful temporal information from each step. However, the instructions in SG3D often contain pronouns such as "it", "here" and "the same" to make language expressions concise. This requires grounding methods to understand the context and retrieve relevant information from previous steps to correctly locate object sequences. Due to the lack of an effective module for collecting related historical information, state-of-the-art 3DVG methods face significant challenges in adapting to the SG3D task. To fill this gap, we propose GroundFlow — a plug-in module for temporal reasoning on 3D point cloud sequential grounding. Firstly, we demonstrate that integrating GroundFlow improves the task accuracy of 3DVG baseline methods by a large margin (+7.5\% and +10.2\%) in the SG3D benchmark, even outperforming a 3D large language model pre-trained on various datasets. Furthermore, we selectively extract both short-term and long-term step information based on its relevance to the current instruction, enabling GroundFlow to take a comprehensive view of historical information and maintain its temporal …
Poster
Teng Li · Guangcong Zheng · Rui Jiang · Shuigenzhan Shuigenzhan · Tao Wu · Yehao Lu · Yining Lin · Chuanyun Deng · Yepan Xiong · Min Chen · Lin Cheng · Xi Li

[ Exhibit Hall I ]

Abstract
Recent advancements in camera-trajectory-guided image-to-video generation offer higher precision and better support for complex camera control compared to text-based approaches. However, they also introduce significant usability challenges, as users often struggle to provide precise camera parameters when working with arbitrary real-world images without knowledge of their depth or scene scale. To address these real-world application issues, we propose RealCam-I2V, a novel diffusion-based video generation framework that integrates monocular metric depth estimation to establish 3D scene reconstruction in a preprocessing step. During training, the reconstructed 3D scene enables scaling camera parameters from relative to metric scale, ensuring compatibility and scale consistency across diverse real-world images. At inference, RealCam-I2V offers an intuitive interface where users can precisely draw camera trajectories by dragging within the 3D scene. To further enhance precise camera control and scene consistency, we propose scene-constrained noise shaping, which shapes high-level noise and allows the framework to maintain dynamic and coherent video generation in lower-noise stages. RealCam-I2V achieves significant improvements in controllability and video quality on RealEstate10K and out-of-domain images. It further enables applications such as camera-controlled looping video generation and generative frame interpolation.
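A simplified, hypothetical sketch of the relative-to-metric rescaling idea (the actual RealCam-I2V preprocessing may differ): align the scale of relative camera translations using the ratio between a metric monocular depth estimate and the relative depth of the same reference view.

```python
import torch

def to_metric_scale(relative_translations, relative_depth, metric_depth):
    # relative_translations: list of (3,) camera translation vectors in relative scale
    # relative_depth, metric_depth: (H, W) depth maps of the same reference view
    scale = (metric_depth.median() / relative_depth.median()).item()   # robust global scale factor
    return [t * scale for t in relative_translations]
```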
Poster
Tiankai Chen · Yushu Li · Adam Goodge · Fei Teng · Xulei Yang · Tianrui Li · Xun Xu

[ Exhibit Hall I ]

Abstract
Out-of-distribution (OOD) detection in 3D point cloud data remains a challenge, particularly in applications where safe and robust perception is critical. While existing OOD detection methods have shown progress for 2D image data, extending these to 3D environments involves unique obstacles. This paper introduces a training-free framework that leverages Vision-Language Models (VLMs) for effective OOD detection in 3D point clouds. By constructing a graph based on class prototypes and testing data, we exploit the data manifold structure to enhance the effectiveness of VLMs for 3D OOD detection. We propose a novel Graph Score Propagation (GSP) method that incorporates prompt clustering and self-training negative prompting to improve OOD scoring with VLMs. Our method is also adaptable to few-shot scenarios, providing options for practical applications. We demonstrate that GSP consistently outperforms state-of-the-art methods on both synthetic and real-world datasets for 3D point cloud OOD detection.
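A minimal sketch in the spirit of graph-based score propagation (hypothetical interface; the prompt clustering and negative prompting components are omitted): smooth initial VLM scores over a similarity graph built from class prototypes and test samples.

```python
import torch

def propagate_scores(affinity, init_scores, alpha=0.9, iters=20):
    # affinity:    (N, N) non-negative similarity matrix over prototypes and test points
    # init_scores: (N, C) initial per-class scores, e.g. from VLM prompts
    A = affinity / affinity.sum(dim=1, keepdim=True).clamp(min=1e-8)   # row-normalize the graph
    scores = init_scores.clone()
    for _ in range(iters):
        scores = alpha * A @ scores + (1.0 - alpha) * init_scores      # propagate, anchored to init
    return scores
```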
Poster
Chong Xia · Shengjun Zhang · Fangfu Liu · Chang Liu · Khodchaphun Hirunyaratsameewong · Yueqi Duan

[ Exhibit Hall I ]

Abstract
Perpetual 3D scene generation aims to produce long-range and coherent 3D view sequences, which is applicable for long-term video synthesis and 3D scene reconstruction. Existing methods follow a "navigate-and-imagine" fashion and rely on outpainting for successive view expansion. However, the generated view sequences suffer from semantic drift issue derived from the accumulated deviation of the outpainting module. To tackle this challenge, we propose ScenePainter, a new framework for semantically consistent 3D scene generation, which aligns the outpainter's scene-specific prior with the comprehension of the current scene. To be specific, we introduce a hierarchical graph structure dubbed SceneConceptGraph to construct relations among multi-level scene concepts, which directs the outpainter for consistent novel views and can be dynamically refined to enhance diversity. Extensive experiments demonstrate that our framework overcomes the semantic drift issue and generates more consistent and immersive 3D view sequences.
Poster
Hao-Yu Hou · Chun-Yi Lee · Motoharu Sonogashira · Yasutomo Kawanishi

[ Exhibit Hall I ]

Abstract
The ability to abstract complex 3D environments into simplified and structured representations is crucial across various domains. 3D semantic scene graphs (SSGs) achieve this by representing objects as nodes and their interrelationships as edges, facilitating high-level scene understanding. Existing methods for 3D SSG generation, however, face significant challenges, including high computational demands and non-incremental processing that hinder their suitability for real-time open-world applications. To address this issue, in this work, we propose FROSS (**F**aster-than-**R**eal-Time **O**nline 3D **S**emantic **S**cene Graph Generation), an innovative method for online, faster-than-real-time 3D SSG generation that leverages the direct lifting of 2D scene graphs to 3D space and represents objects as 3D Gaussian distributions. This framework eliminates the dependency on precise and computationally-intensive point cloud processing. Furthermore, we extend the Replica dataset with inter-object relationship annotations, creating the ReplicaSSG dataset for comprehensive evaluation of FROSS. The experimental results from evaluations on the ReplicaSSG and 3DSSG datasets show that FROSS can achieve superior performance while being orders of magnitude faster than prior 3D SSG generation methods.
Poster
Qing Li · Huifang Feng · Xun Gong · Liang Han

[ Exhibit Hall I ]

Abstract
Estimating normals for noisy point clouds is a persistent challenge in 3D geometry processing, particularly for end-to-end oriented normal estimation. Existing methods generally address relatively clean data and rely on supervised priors to fit local surfaces within specific neighborhoods. In this paper, we propose a novel approach for learning normals from noisy point clouds through local gradient-aware surface filtering. Our method projects noisy points onto the underlying surface by utilizing normals and distances derived from an implicit function constrained by local gradients. We start by introducing a distance measurement operator for global surface fitting on noisy data, which integrates projected distances along normals. Following this, we develop an implicit field-based filtering approach for surface point construction, adding projection constraints on these points during filtering. To address issues of over-smoothing and gradient degradation, we further incorporate local gradient consistency constraints, as well as local gradient orientation and aggregation. Comprehensive experiments on normal estimation, surface reconstruction, and point cloud denoising demonstrate the state-of-the-art performance of our method. The source code and trained models will be made publicly available.
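The projection step described above, moving a noisy point onto the underlying surface along the local gradient of an implicit function, can be sketched generically as follows (assuming a callable implicit field `sdf`; this is not the paper's full filtering pipeline):

```python
import torch

def project_to_surface(points, sdf):
    # points: (N, 3) noisy input points; sdf: callable mapping (N, 3) -> (N,) distances
    pts = points.clone().requires_grad_(True)
    dist = sdf(pts)
    grad = torch.autograd.grad(dist.sum(), pts)[0]                   # local gradient of the field
    normals = grad / grad.norm(dim=1, keepdim=True).clamp(min=1e-8)  # normalized directions
    return (pts - dist.unsqueeze(1) * normals).detach()              # step onto the zero level set
```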
Poster
Mai Su · Zhongtao Wang · Huishan Au · Yilong Li · Xizhe Cao · Chengwei Pan · Yisong Chen · Guoping Wang

[ Exhibit Hall I ]

Abstract
3DGS is an emerging and increasingly popular technology in the field of novel view synthesis. Its highly realistic rendering quality and real-time rendering capabilities make it promising for various applications. However, when applied to large-scale aerial urban scenes, 3DGS methods suffer from issues such as excessive memory consumption, slow training times, prolonged partitioning processes, and significant degradation in rendering quality due to the increased data volume. To tackle these challenges, we introduce $\textbf{HUG}$, a novel approach that enhances data partitioning and reconstruction quality by leveraging a hierarchical neural Gaussian representation. We first propose a visibility-based data partitioning method that is simple yet highly efficient, significantly outperforming existing methods in speed. Then, we introduce a novel hierarchical weighted training approach, combined with other optimization strategies, to substantially improve reconstruction quality. Our method achieves state-of-the-art results on one synthetic dataset and four real-world datasets.
Poster
Bin Rao · Haicheng Liao · Yanchen Guan · Chengyue Wang · Bonan Wang · Jiaxun Zhang · Zhenning Li

[ Exhibit Hall I ]

Abstract
Accurately predicting the future trajectories of traffic agents is essential in autonomous driving. However, due to the inherent imbalance in trajectory distributions, tail data in natural datasets often represents more complex and hazardous scenarios. Existing studies typically rely solely on a base model’s prediction error, without considering the diversity and uncertainty of long-tail trajectory patterns. We propose an adaptive momentum and decoupled contrastive learning framework (AMD), which integrates unsupervised and supervised contrastive learning strategies. By leveraging an improved momentum contrast learning (MoCo-DT) and decoupled contrastive learning (DCL) module, our framework enhances the model’s ability to recognize rare and complex trajectories. Additionally, we design four types of trajectory random augmentation methods and introduce an online iterative clustering strategy, allowing the model to dynamically update pseudo-labels and better adapt to the distributional shifts in long-tail data. We propose three different criteria to define long-tail trajectories and conduct extensive comparative experiments on the nuScenes and ETH/UCY datasets. The results show that AMD not only achieves optimal performance in long-tail trajectory prediction but also demonstrates outstanding overall prediction accuracy.
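For context, the "momentum" in MoCo-style contrastive learning refers to an exponential-moving-average key encoder; a minimal generic sketch (the standard MoCo update, not the MoCo-DT variant proposed here):

```python
import torch

@torch.no_grad()
def momentum_update(query_encoder, key_encoder, m=0.999):
    # EMA update: the key encoder slowly tracks the query encoder (both are torch.nn.Modules)
    for q_p, k_p in zip(query_encoder.parameters(), key_encoder.parameters()):
        k_p.data.mul_(m).add_(q_p.data, alpha=1.0 - m)
```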
Poster
Junhao Ge · Zuhong Liu · Longteng Fan · Yifan Jiang · Jiaqi Su · Yiming Li · Zhejun Zhang · Siheng Chen

[ Exhibit Hall I ]

Abstract
End-to-end (E2E) autonomous driving (AD) models require diverse, high-quality data to perform well across various driving scenarios. However, collecting large-scale real-world data is expensive and time-consuming, making high-fidelity synthetic data essential for enhancing data diversity and model robustness. Existing driving simulators for synthetic data generation have significant limitations: game-engine-based simulators struggle to produce realistic sensor data, while NeRF-based and diffusion-based methods face efficiency challenges. Additionally, recent simulators designed for closed-loop evaluation provide limited interaction with other vehicles, failing to simulate complex real-world traffic dynamics. To address these issues, we introduce SceneCrafter, a realistic, interactive, and efficient AD simulator based on 3D Gaussian Splatting (3DGS). SceneCrafter not only efficiently generates realistic driving logs across diverse traffic scenarios but also enables robust closed-loop evaluation of end-to-end models. Experimental results demonstrate that SceneCrafter serves as both a reliable evaluation platform and an efficient data generator that significantly improves end-to-end model generalization.
Poster
Gangwei Xu · Jiaxin Liu · Xianqi Wang · Junda Cheng · Yong Deng · Jinliang Zang · Yurui Chen · Xin Yang

[ Exhibit Hall I ]

Abstract
State-of-the-art stereo matching methods typically use costly 3D convolutions to aggregate a full cost volume, but their computational demands make mobile deployment challenging. Directly applying 2D convolutions for cost aggregation often results in edge blurring, detail loss, and mismatches in textureless regions. Some complex operations, like deformable convolutions and iterative warping, can partially alleviate this issue; however, they are not mobile-friendly, limiting their deployment on mobile devices. In this paper, we present a novel bilateral aggregation network (BANet) for mobile stereo matching that produces high-quality results with sharp edges and fine details using only 2D convolutions. Specifically, we first separate the full cost volume into detailed and smooth volumes using a spatial attention map, then perform detailed and smooth aggregations accordingly, ultimately fusing both to obtain the final disparity map. Additionally, to accurately identify high-frequency detailed regions and low-frequency smooth/textureless regions, we propose a new scale-aware spatial attention module. Experimental results demonstrate that our BANet-2D significantly outperforms other mobile-friendly methods, achieving 35.3\% higher accuracy on the KITTI 2015 leaderboard than MobileStereoNet-2D, with faster runtime on mobile devices. The extended 3D version, BANet-3D, achieves the highest accuracy among all real-time methods on high-end GPUs.
Poster
Nicolai Hermann · Jorge Condor · Piotr Didyk

[ Exhibit Hall I ]

Abstract
Modern reconstruction techniques can effectively model complex 3D scenes from sparse 2D views. However, automatically assessing the quality of novel views and identifying artifacts is challenging due to the lack of ground truth images and the limitations of No-Reference image metrics in predicting reliable artifact maps. The absence of such metrics hinders assessment of the quality of novel views and limits the adoption of post-processing techniques, such as inpainting, to enhance reconstruction quality. To tackle this, recent work has established a new category of metrics (Cross-Reference), predicting image quality solely by leveraging context from alternate viewpoint captures. In this work, we propose a new Cross-Reference metric, Puzzle Similarity, which is designed to localize artifacts in novel views. Our approach utilizes image patch statistics from the input views to establish a scene-specific distribution, later used to identify poorly reconstructed regions in the novel views. Given the lack of good measures to evaluate Cross-Reference methods in the context of 3D reconstruction, we collected a novel human-labeled dataset of artifact and distortion maps in unseen reconstructed views. Through this dataset, we demonstrate that our method achieves state-of-the-art localization of artifacts in novel views, correlating with human assessment, even without aligned references. We can …
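A highly simplified sketch of the patch-statistics idea (hypothetical interface; the actual Puzzle Similarity metric is more involved): score each patch of a rendered novel view by its distance to the closest patch feature gathered from the input views.

```python
import torch

def patch_novelty(train_patches, test_patches):
    # train_patches: (N, D) patch features from the input (training) views
    # test_patches:  (M, D) patch features from a rendered novel view
    dists = torch.cdist(test_patches, train_patches)   # (M, N) pairwise feature distances
    return dists.min(dim=1).values                     # high values flag likely artifacts
```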
Poster
Lening Wang · Wenzhao Zheng · Dalong Du · Yunpeng Zhang · Yilong Ren · Han Jiang · Zhiyong Cui · Haiyang Yu · Jie Zhou · Shanghang Zhang

[ Exhibit Hall I ]

Abstract
Simulating driving environments in 4D is crucial for developing accurate and immersive autonomous driving systems. Despite progress in generating driving scenes, challenges in transforming views and modeling the dynamics of space and time remain. To tackle these issues, we propose a fresh methodology that reconstructs real-world driving environments and utilizes a generative network to enable 4D simulation. This approach builds continuous 4D point cloud scenes by leveraging surround-view data from autonomous vehicles. By separating the spatial and temporal elements, it creates smooth keyframe sequences. Furthermore, video generation techniques are employed to produce lifelike 4D simulation videos from any given perspective. To extend the range of possible viewpoints, we incorporate training using decomposed camera poses, which allows for enhanced modeling of distant scenes. Additionally, we merge camera trajectory data to synchronize 3D points across consecutive frames, fostering a richer understanding of the evolving scene. With training across multiple scene levels, our method is capable of simulating scenes from any viewpoint and offers deep insight into the evolution of scenes over time in a consistent spatial-temporal framework. In comparison with current methods, this approach excels in maintaining consistency across views, background coherence, and overall accuracy, significantly contributing to the development of more …
Poster
Markus Knoche · Daan de Geus · Bastian Leibe

[ Exhibit Hall I ]

Abstract
Predicting the motion of other agents in a scene is highly relevant for autonomous driving, as it allows a self-driving car to anticipate. Inspired by the success of decoder-only models for language modeling, we propose DONUT, a Decoder-Only Network for Unrolling Trajectories. Different from existing encoder-decoder forecasting models, we encode historical trajectories and predict future trajectories with a single autoregressive model. This allows the model to make iterative predictions in a consistent manner, and ensures that the model is always provided with up-to-date information, enhancing the performance. Furthermore, inspired by multi-token prediction for language modeling, we introduce an 'overprediction' strategy that gives the network the auxiliary task of predicting trajectories at longer temporal horizons. This allows the model to better anticipate the future, and further improves the performance. With experiments, we demonstrate that our decoder-only approach outperforms the encoder-decoder baseline, and achieves new state-of-the-art results on the Argoverse 2 single-agent motion forecasting benchmark. Code will be made available upon acceptance.
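The decoder-only, iterative prediction scheme can be pictured with a generic autoregressive rollout (a sketch assuming a hypothetical `model` that maps a trajectory prefix to per-step predictions; the overprediction head is omitted):

```python
import torch

def unroll(model, history, steps):
    # history: (B, T, 2) observed agent positions; each prediction is fed back as new context
    traj, preds = history, []
    for _ in range(steps):
        nxt = model(traj)[:, -1:, :]            # (B, 1, 2) next-step prediction
        preds.append(nxt)
        traj = torch.cat([traj, nxt], dim=1)    # the model always sees up-to-date information
    return torch.cat(preds, dim=1)              # (B, steps, 2) unrolled future trajectory
```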
Poster
Dominik Scheuble · Hanno Holzhüter · Steven Peters · Mario Bijelic · Felix Heide

[ Exhibit Hall I ]

Abstract
Lidar has become crucial for autonomous driving, providing high-resolution 3D scans that are key for accurate scene understanding. To this end, lidar sensors measure the time-resolved full waveforms from the returning laser light, which a subsequent digital signal processor (DSP) converts to point clouds by identifying peaks in the waveform. Conventional automotive lidar DSP pipelines process each waveform individually, ignoring potentially valuable context from neighboring waveforms. As a result, lidar point clouds are prone to artifacts from low signal-to-noise ratio (SNR) regions, highly reflective objects, and environmental conditions like fog. While leveraging neighboring waveforms has been investigated extensively in transient imaging, the application has been limited to scientific or experimental hardware. In this work, we propose a learned DSP that directly processes full waveforms using a transformer architecture, leveraging features from adjacent waveforms to generate high-fidelity multi-echo point clouds. To assess our method, we modify a conventional automotive lidar and capture data in real-world driving scenarios. Furthermore, we collect dedicated test sets in a weather chamber to assess our method in different environmental conditions. Trained on both synthetic and real data, the method improves Chamfer distance by 32 cm and 20 cm compared to on-device peak finding methods and existing …
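The Chamfer distance reported above is the standard symmetric nearest-neighbour distance between point clouds; a minimal reference implementation:

```python
import torch

def chamfer_distance(p, q):
    # p: (N, 3) and q: (M, 3) point clouds
    d = torch.cdist(p, q)                                            # (N, M) pairwise distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()   # symmetric average
```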
Poster
Pierre-André Brousseau · Sébastien Roy

[ Exhibit Hall I ]

Abstract
Absolute depth estimation from a single-camera sequence of images is a relevant task, given that mobile machines increasingly rely on vision to navigate. Deep learning for stereo matching has been demonstrated to improve performance for stereo-rectified depth estimation, but these methods require straightforward left-right camera setups. This work proposes to introduce deep stereo matching to two views of a monocular image sequence obtained from a camera in motion in a static scene. This paper introduces a novel and principled spherical epipolar rectification model, which handles all camera motions. This rectification model is differentiable and allows self-supervised deep stereo matching algorithms to compute disparity and recover depth, given known camera pose. This paper also introduces a spherical crop operation which limits rectified image size and allows for competitive absolute depth estimation performance. This results in a spherical rectification model that is demonstrated to provide metric depth and compete favorably with a current state-of-the-art monocular depth estimator.
Poster
Sankeerth Durvasula · Sharanshangar Muhunthan · Zain Moustafa · Richard Chen · Ruofan Liang · Yushi Guan · Nilesh Ahuja · Nilesh Jain · Selvakumar Panneer · Nandita Vijaykumar

[ Exhibit Hall I ]

Abstract
3D Gaussian Splatting (3DGS) is a state-of-the-art technique to model real-world scenes with high quality and real-time rendering. Typically, a higher quality representation can be achieved by using a large number of 3D Gaussians. However, using large 3D Gaussian counts significantly increases the GPU device memory needed to store model parameters. A large model thus requires powerful GPUs with high memory capacities for training and suffers from higher training/rendering latencies due to inefficient memory access and data movement. In this work, we introduce ContraGS, a method to enable training directly on compressed 3DGS representations without reducing the Gaussian count, and thus with little loss in model quality. ContraGS leverages codebooks to compactly store a set of Gaussian parameter vectors throughout the training process, thereby significantly reducing memory consumption. While codebooks have been demonstrated to be highly effective at compressing fully trained 3DGS models, directly training using codebook representations is an unsolved challenge. ContraGS solves the problem of learning non-differentiable parameters in codebook-compressed representations by posing parameter estimation as a Bayesian inference problem. To this end, ContraGS provides a framework that effectively uses MCMC sampling to sample over a posterior distribution of these compressed representations. We demonstrate that ContraGS …
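To see where the memory saving comes from, consider plain codebook quantization of per-Gaussian attribute vectors (a sketch of the storage idea only; the MCMC-based handling of the non-differentiable assignments described above is not shown):

```python
import torch

def quantize(params, codebook):
    # params:   (N, D) per-Gaussian attribute vectors
    # codebook: (K, D) shared codewords, with K much smaller than N
    indices = torch.cdist(params, codebook).argmin(dim=1)   # nearest codeword per Gaussian, (N,)
    return indices                                          # store indices + codebook instead of (N, D)

def dequantize(indices, codebook):
    return codebook[indices]                                # (N, D) reconstructed parameters
```

Storing N integer indices plus a small K×D codebook in place of the full N×D float matrix is what reduces GPU memory during training.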
Poster
Sijie Wang · Siqi Li · Yawei Zhang · Shangshu Yu · Shenghai Yuan · Rui She · Quanjiang Guo · JinXuan Zheng · Ong Howe · Leonrich Chandra · Shrivarshann Srijeyan · Aditya Sivadas · Toshan Aggarwal · Heyuan Liu · Hongming Zhang · CHEN CHUJIE · JIANG JUNYU · Lihua Xie · Wee Peng Tay

[ Exhibit Hall I ]

Abstract
Multi-modal perception is essential for unmanned aerial vehicle (UAV) operations, as it enables a comprehensive understanding of the UAVs' surrounding environment. However, most existing multi-modal UAV datasets are primarily biased toward localization and 3D reconstruction tasks, or only support map-level semantic segmentation due to the lack of frame-wise annotations for both camera images and LiDAR point clouds. This limitation prevents them from being used for high-level scene understanding tasks. To address this gap and advance multi-modal UAV perception, we introduce UAVScenes, a large-scale dataset designed to benchmark various tasks across both 2D and 3D modalities. Our benchmark dataset is built upon the well-calibrated multi-modal UAV dataset MARS-LVIG, originally developed only for simultaneous localization and mapping (SLAM). We enhance this dataset by providing manually labeled semantic annotations for both frame-wise images and LiDAR point clouds, along with accurate 6-degree-of-freedom (6-DoF) poses. These additions enable a wide range of UAV perception tasks, including segmentation, depth estimation, 6-DoF localization, place recognition, and novel view synthesis (NVS).
Poster
Jiahui Ren · Mochu Xiang · Jiajun Zhu · Yuchao Dai

[ Exhibit Hall I ]

Abstract
Wide-baseline panorama reconstruction has emerged as a highly effective and pivotal approach for not only achieving geometric reconstruction of the surrounding 3D environment, but also generating highly realistic and immersive novel views. Although existing methods have shown remarkable performance across various benchmarks, they are predominantly reliant on accurate pose information. In practical real-world scenarios, the acquisition of precise pose often requires additional computational resources and is highly susceptible to noise. These limitations hinder the broad applicability and practicality of such methods. In this paper, we present PanoSplatt3R, an unposed wide-baseline panorama reconstruction method. We extend and adapt the foundational reconstruction pretrainings from the perspective domain to the panoramic domain, thus enabling powerful generalization capabilities. To ensure a seamless and efficient domain-transfer process, we introduce RoPE rolling, which spans rolled coordinates in rotary positional embeddings across different attention heads, keeping modifications to RoPE's mechanism minimal while modeling the horizontal periodicity of panorama images. Comprehensive experiments demonstrate that PanoSplatt3R, even in the absence of pose information, significantly outperforms current state-of-the-art methods. This superiority is evident in both the generation of high-quality novel views and the accuracy of depth estimation, thereby showcasing its great potential for practical applications.
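One plausible reading of "rolled coordinates across attention heads" (purely illustrative and hypothetical; the paper's exact RoPE rolling formulation may differ) is to give each head a cyclically shifted copy of the horizontal position indices, wrapped modulo the panorama width to respect its periodicity:

```python
import torch

def rolled_positions(width, num_heads):
    # width: number of horizontal positions in the equirectangular panorama
    base = torch.arange(width)
    offsets = torch.linspace(0, width, steps=num_heads + 1)[:-1].long()   # one roll per head
    return torch.stack([(base + off) % width for off in offsets])         # (num_heads, width)
```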
Poster
Changha Shin · Woong Oh Cho · Seon Joo Kim

[ Exhibit Hall I ]

Abstract
360° visual content is widely shared on platforms such as YouTube and plays a central role in virtual reality, robotics, and autonomous navigation. However, consumer-grade dual-fisheye systems consistently yield imperfect panoramas due to inherent lens separation and angular distortions. In this work, we introduce a novel calibration framework that incorporates a dual-fisheye camera model into the 3D Gaussian Splatting pipeline. Our approach not only simulates the realistic visual artifacts produced by dual-fisheye cameras but also enables the synthesis of seamlessly rendered 360° images. By jointly optimizing 3D Gaussian parameters alongside calibration variables that emulate lens gaps and angular distortions, our framework transforms imperfect omnidirectional inputs into flawless novel view synthesis. Extensive evaluations on real-world datasets confirm that our method produces seamless renderings—even from imperfect images—and outperforms existing 360° rendering models.
Poster
Wanshui Gan · Fang Liu · Hongbin Xu · Ningkai Mo · Naoto Yokoya

[ Exhibit Hall I ]

Abstract
We introduce GaussianOcc, a systematic method that investigates Gaussian splatting for fully self-supervised and efficient 3D occupancy estimation in surround views. First, traditional methods for self-supervised 3D occupancy estimation still require ground-truth 6D poses from sensors during training. To address this limitation, we propose the Gaussian Splatting for Projection (GSP) module to provide accurate scale information for fully self-supervised training from adjacent-view projection. Additionally, existing methods rely on volume rendering for final 3D voxel representation learning using 2D signals (depth maps, semantic maps), which is both time-consuming and less effective. We propose Gaussian Splatting from Voxel space (GSV) to leverage the fast rendering properties of Gaussian splatting. As a result, the proposed GaussianOcc method enables fully self-supervised (no ground-truth pose) 3D occupancy estimation with competitive performance and low computational cost (2.7 times faster in training and 5 times faster in rendering).
Poster
Kai Ye · Chong Gao · Guanbin Li · Wenzheng Chen · Baoquan Chen

[ Exhibit Hall I ]

Abstract
Recent 3D Gaussian Splatting (3DGS) representations have demonstrated remarkable performance in novel view synthesis; furthermore, material-lighting disentanglement on 3DGS enables relighting and adapts the representation to broader applications. While the general approach to such disentanglement is to integrate differentiable physically-based rendering (PBR) techniques to jointly recover BRDF materials and environment lighting, achieving a precise disentanglement remains inherently difficult due to the challenge of accurately modeling light transport. Existing approaches typically approximate Gaussian points' normals, which constitutes an implicit geometric constraint. However, they usually suffer from inaccuracies in normal estimation that subsequently degrade light transport, resulting in noisy material decomposition and flawed relighting results. To address this, we propose GeoSplatting, a novel approach that augments 3DGS with explicit geometry guidance for precise light transport modeling. By differentiably constructing a surface-grounded 3DGS from an optimizable mesh, our approach leverages well-defined mesh normals and the opaque mesh surface, and additionally facilitates the use of mesh-based ray tracing techniques for efficient, occlusion-aware light transport calculations. This enhancement ensures precise material decomposition while preserving the efficiency and high-quality rendering capabilities of 3DGS. Comprehensive evaluations across diverse datasets demonstrate the effectiveness of GeoSplatting, highlighting its superior efficiency and state-of-the-art inverse rendering performance.
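A hedged sketch of the general idea of grounding Gaussians on an optimizable mesh (the sampling scheme and names below are illustrative assumptions, not the paper's construction): Gaussian centres are barycentric combinations of mesh vertices and inherit the face normal, so both are differentiable functions of the vertex positions.

# Toy sketch: place Gaussian centres on an optimizable mesh so that centres and
# normals are differentiable functions of the vertex positions.
import torch

verts = torch.tensor([[0., 0., 0.], [1., 0., 0.], [0., 1., 0.]],
                     requires_grad=True)            # optimizable mesh vertices
faces = torch.tensor([[0, 1, 2]])                   # one triangle

def gaussians_from_mesh(verts, faces, per_face=4):
    v0, v1, v2 = (verts[faces[:, i]] for i in range(3))        # (F, 3) each
    # Random barycentric coordinates, uniform over each triangle.
    u = torch.rand(faces.shape[0], per_face, 2)
    flip = u.sum(-1) > 1
    u[flip] = 1 - u[flip]
    a, b = u[..., :1], u[..., 1:]
    centers = (1 - a - b) * v0[:, None] + a * v1[:, None] + b * v2[:, None]
    # Face normal shared by all Gaussians on the face (well defined, unlike
    # normals approximated from per-splat covariances).
    n = torch.cross(v1 - v0, v2 - v0, dim=-1)
    normals = torch.nn.functional.normalize(n, dim=-1)[:, None].expand_as(centers)
    return centers.reshape(-1, 3), normals.reshape(-1, 3)

centers, normals = gaussians_from_mesh(verts, faces)
centers.sum().backward()           # gradients reach the mesh vertices
print(verts.grad)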
Poster
Soumyadipta Banerjee · Jiaul Paik · Debashis Sen

[ Exhibit Hall I ]

Abstract
A translation framework that produces images as if they were captured with a telephoto lens, from images captured with a wide-angle lens, will help reduce the need for complex, expensive and bulky lenses on smartphones. To this end, we propose an image-to-image translation pipeline to simulate the lens compression and perspective adjustment associated with this reconstruction, where the size of the main subject in the images remains the same. We judiciously design depth-based image layering, layer-wise in-painting, redundancy reduction and layer scaling modules to construct the desired telephoto image, where the pipeline parameters are estimated by a convolutional network. Our approach is consistent with the corresponding optical transformation, and hence contents behind the main subject are enlarged and those in front of it are diminished, achieving lens compression with appropriate perspective adjustment. Our pipeline performs well qualitatively and quantitatively on several source-target image pairs we have captured solely for this task, and also on in-the-wild images. We show that it can simulate the different amounts of lens compression associated with targeted $2\times$, $4\times$, $8\times$ changes in the focal length. Further, the pipeline is demonstrated to be effective for a sub-class of the lens-compression problem: portrait perspective distortion correction. We also provide …
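For intuition, the per-layer magnification implied by a focal-length change with the subject size held fixed follows from the pinhole model; the short calculation below is derived from basic perspective geometry (not taken from the paper) and shows how layers behind the subject grow toward the full zoom factor while layers in front shrink.

# Illustrative per-layer scale factors for simulated lens compression, assuming
# a pinhole model: increasing the focal length by a factor "ratio" and stepping
# back so the main subject keeps its size magnifies a layer at distance d by
#   s(d) = ratio * d / (d + D * (ratio - 1)),
# where D is the subject distance (both distances from the original camera).
D = 2.0                      # subject distance in metres
ratio = 2.0                  # focal-length change, e.g. a targeted 2x zoom
for d in [1.0, 2.0, 4.0, 10.0, 100.0]:
    s = ratio * d / (d + D * (ratio - 1.0))
    print(f"layer at {d:6.1f} m -> scale {s:.2f}")
# The subject layer (d = D) stays at scale 1.0, far layers approach the full
# 2.0x enlargement (the familiar background compression), and layers in front
# of the subject (d < D) shrink.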
Poster
Fangfu Liu · Hao Li · Jiawei Chi · Hanyang Wang · Minghui Yang · Fudong Wang · Yueqi Duan

[ Exhibit Hall I ]

Abstract
Recovering 3D structures with open-vocabulary scene understanding from 2D images is a fundamental but daunting task. Recent developments have achieved this by performing per-scene optimization with embedded language information. However, they heavily rely on the calibrated dense-view reconstruction paradigm, thereby suffering from severe rendering artifacts and implausible semantic synthesis when limited views are available. In this paper, we introduce a novel generative framework, coined LangScene-X, to unify and generate 3D-consistent multi-modality information for reconstruction and understanding. Powered by the generative capability of creating more consistent novel observations, we can build generalizable 3D language-embedded scenes from only sparse views. Specifically, we first train a TriMap video diffusion model that can generate appearance (RGBs), geometry (normals), and semantics (segmentation maps) from sparse inputs through progressive knowledge integration. Furthermore, we propose a Language Quantized Compressor (LQC), trained on large-scale image datasets, to efficiently encode language embeddings, enabling cross-scene generalization without per-scene retraining. Finally, we reconstruct the language surface fields by aligning language information onto the surface of 3D scenes, enabling open-ended language queries. Extensive experiments on real-world data demonstrate the superiority of our LangScene-X over state-of-the-art methods in terms of quality and generalizability.
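As a generic illustration of quantized compression of language embeddings, the snippet below is a standard VQ-VAE-style layer with a straight-through estimator, not the paper's LQC architecture; sizes and names are assumptions.

# Minimal vector-quantization layer as a stand-in for compressing
# high-dimensional language embeddings into a small codebook.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    def __init__(self, num_codes=512, dim=64, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        self.beta = beta

    def forward(self, z):                               # z: (B, dim) embeddings
        d = torch.cdist(z, self.codebook.weight)        # (B, num_codes)
        idx = d.argmin(dim=1)                           # nearest code per input
        z_q = self.codebook(idx)
        # Codebook + commitment losses (VQ-VAE style).
        loss = F.mse_loss(z_q, z.detach()) + self.beta * F.mse_loss(z, z_q.detach())
        # Straight-through: forward pass uses z_q, gradients flow back to z.
        z_q = z + (z_q - z).detach()
        return z_q, idx, loss

vq = VectorQuantizer()
lang = torch.randn(8, 64)                               # stand-in CLIP-like embeddings
z_q, idx, loss = vq(lang)
print(z_q.shape, idx.shape, loss.item())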
Poster
Yunzhe Shao · Xinyu Yi · Lu Yin · Shihui Guo · Jun-Hai Yong · Feng Xu

[ Exhibit Hall I ]

Abstract
This paper proposes a novel method called MagShield, designed to address the issue of magnetic interference in sparse inertial motion capture (MoCap) systems. Existing Inertial Measurement Unit (IMU) systems are prone to orientation estimation errors in magnetically disturbed environments, limiting their practical application in real-world scenarios. To address this problem, MagShield employs a "detect-then-correct" strategy, first detecting magnetic disturbances through multi-IMU joint analysis, and then correcting orientation errors using human motion priors. MagShield can be integrated with most existing sparse inertial MoCap systems, improving their performance in magnetically disturbed environments. Experimental results demonstrate that MagShield significantly enhances the accuracy of motion capture under magnetic interference and exhibits good compatibility across different sparse inertial MoCap systems. Code will be released.
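A toy version of the "detect" stage, under assumptions not stated in the abstract (known nominal field magnitude, per-IMU orientation estimates, simple thresholds), might look as follows; it is not the paper's detector.

# Toy magnetic-disturbance check: compare each magnetometer reading against the
# nominal Earth-field magnitude and check that all IMUs, rotated into a common
# frame, agree on the field direction.
import numpy as np

EARTH_FIELD_UT = 50.0        # assumed nominal local field magnitude (microtesla)
MAG_TOL = 0.2                # 20% magnitude tolerance (assumption)
ANGLE_TOL_DEG = 15.0         # allowed spread of field directions across IMUs

def disturbed(mags_body, R_world_body):
    """mags_body: (N, 3) magnetometer readings, one per IMU, in body frames.
    R_world_body: (N, 3, 3) orientation estimates (body -> world)."""
    norms = np.linalg.norm(mags_body, axis=1)
    if np.any(np.abs(norms - EARTH_FIELD_UT) > MAG_TOL * EARTH_FIELD_UT):
        return True                                   # magnitude off nominal
    world = np.einsum('nij,nj->ni', R_world_body, mags_body)
    world /= np.linalg.norm(world, axis=1, keepdims=True)
    mean_dir = world.mean(axis=0)
    mean_dir /= np.linalg.norm(mean_dir)
    angles = np.degrees(np.arccos(np.clip(world @ mean_dir, -1.0, 1.0)))
    return bool(np.any(angles > ANGLE_TOL_DEG))       # directions disagree

clean = np.tile([30.0, 0.0, 40.0], (6, 1))            # |B| = 50, consistent field
print(disturbed(clean, np.tile(np.eye(3), (6, 1, 1))))  # False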
Poster
Hongyi Zhang · Laurie Bose · Jianing Chen · Piotr Dudek · Walterio Mayol-Cuevas

[ Exhibit Hall I ]

Abstract
Pixel Processor Arrays (PPAs) are vision sensors that embed data and processing into every pixel element. PPAs can execute visual processing directly at the point of light capture and output only sparse, high-level information. This is in sharp contrast with the conventional visual pipeline, where whole images must be transferred from sensor to processor. This sparse data readout also provides several major benefits, such as higher frame rate, lower energy consumption and lower bandwidth requirements. In this work, we demonstrate generation, matching and storage of binary descriptors for visual keypoint features entirely on the PPA, with no need to output images for external processing, making our approach inherently privacy-aware. Our method spreads descriptors across multiple pixel-processors, which allows for significantly larger descriptors than any prior pixel-processing work. These large descriptors can be used for a range of tasks such as place and object recognition. We demonstrate in-pixel feature matching accuracy of up to $\sim$94.5%, at $\sim$210 fps, across a range of datasets, with a greater than $100\times$ reduction in data transfer and bandwidth requirements over traditional cameras.
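For readers unfamiliar with binary descriptors, the matching arithmetic is plain Hamming distance on packed bits. The host-side sketch below shows only that arithmetic on ordinary arrays and does not model the in-pixel distribution of descriptors across processors.

# Host-side illustration of binary-descriptor matching with Hamming distance.
import numpy as np

rng = np.random.default_rng(0)

def pack(bits):                       # (N, D) {0,1} -> (N, D//8) uint8
    return np.packbits(bits, axis=1)

def hamming(a, b):                    # packed descriptors -> (Na, Nb) distances
    x = np.bitwise_xor(a[:, None, :], b[None, :, :])
    return np.unpackbits(x, axis=2).sum(axis=2)

D = 256                                              # descriptor length in bits
database = pack(rng.integers(0, 2, size=(100, D), dtype=np.uint8))
query = pack(rng.integers(0, 2, size=(5, D), dtype=np.uint8))

dist = hamming(query, database)
best = dist.argmin(axis=1)                           # nearest stored descriptor
print(best, dist[np.arange(5), best])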
Poster
Dongki Jung · Jaehoon Choi · Yonghan Lee · Dinesh Manocha

[ Exhibit Hall I ]

Abstract
We present a novel 3D mapping pipeline for large-scale indoor environments. To address the significant challenges in large-scale indoor scenes, such as prevalent occlusions and textureless regions, we propose IM360, a novel approach that leverages the wide field of view of omnidirectional images and integrates the spherical camera model into the Structure-from-Motion (SfM) pipeline. Our SfM utilizes dense matching features specifically designed for 360$^\circ$ images, demonstrating superior capability in image registration. Furthermore, with the aid of mesh-based neural rendering techniques, we introduce a texture optimization method that refines texture maps and accurately captures view-dependent properties by combining diffuse and specular components. We evaluate our pipeline on large-scale indoor scenes, demonstrating its effectiveness in real-world scenarios. In practice, IM360 demonstrates superior performance, achieving a 3.5 PSNR increase in textured mesh reconstruction. We attain state-of-the-art performance in terms of camera localization and registration on Matterport3D and Stanford2D3D.
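The spherical camera model referred to here maps a viewing direction to longitude/latitude and then to equirectangular pixel coordinates; a minimal, generic version (not the paper's SfM code, axis conventions are assumptions) is:

# Minimal spherical (equirectangular) camera projection.
import numpy as np

def project_equirect(X_world, R, t, width, height):
    """R, t: world -> camera rotation and translation; returns (u, v) pixels."""
    x = R @ X_world + t
    d = x / np.linalg.norm(x)                  # unit viewing direction
    lon = np.arctan2(d[0], d[2])               # [-pi, pi], 0 = optical axis (+z)
    lat = np.arcsin(np.clip(d[1], -1.0, 1.0))  # [-pi/2, pi/2], +y assumed down
    u = (lon / (2 * np.pi) + 0.5) * width
    v = (lat / np.pi + 0.5) * height
    return u, v

u, v = project_equirect(np.array([1.0, 0.0, 2.0]), np.eye(3), np.zeros(3), 2048, 1024)
print(u, v)    # a point right of the optical axis lands right of the image centre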
Poster
Siddharth Tourani · Jayarami Gurram · Akash Kumbar · Satyajit Tourani · Nishant Goyal · Madhava Krishna · Dinesh Reddy Narapureddy · Muhammad Haris Khan

[ Exhibit Hall I ]

Abstract
Dynamic scene rendering and reconstruction play a crucial role in computer vision and augmented reality. Recent methods based on 3D Gaussian Splatting (3DGS) have enabled accurate modeling of dynamic urban scenes, but they require both camera and LiDAR data, ground-truth 3D segmentations, and motion data in the form of tracklets or pre-defined object templates such as SMPL. In this work, we explore whether a combination of 2D object-agnostic priors in the form of depth and point tracking, coupled with a signed distance function (SDF) representation for dynamic objects, can be used to relax some of these requirements. We present a novel approach that integrates Signed Distance Functions (SDFs) with 3D Gaussian Splatting (3DGS) to create a more robust object representation by harnessing the strengths of both methods. Our unified optimization framework enhances the geometric accuracy of 3D Gaussian splatting and improves deformation modeling within the SDF, resulting in a more adaptable and precise representation. We demonstrate that our method achieves near state-of-the-art performance in rendering metrics even without LiDAR data on urban scenes. Furthermore, when incorporating LiDAR, our approach surpasses existing methods in reconstructing and generating novel views across diverse object categories, without ground-truth 3D motion …
Poster
xinyi zheng · Steve Zhang · Weizhe Lin · Fan Zhang · Walterio Mayol-Cuevas · Yunze Liu · Junxiao Shen

[ Exhibit Hall I ]

Abstract
Current state-of-the-art 3D reconstruction models face limitations in building extra-large-scale outdoor scenes, primarily due to the lack of sufficiently large-scale and detailed datasets. In this paper, we present an extra-large, fine-grained dataset with 10 billion points composed of 41,006 drone-captured high-resolution aerial images, covering 20 diverse and culturally significant scenes from worldwide locations such as the Cambridge campus, the Pyramids, and the Forbidden City. Compared to existing datasets, ours offers significantly larger scale and higher detail, uniquely suited for fine-grained 3D applications. Each scene contains an accurate spatial layout and comprehensive structural information, supporting detailed 3D reconstruction tasks. By reconstructing environments using these detailed images, our dataset supports multiple applications, including outputs in the widely adopted COLMAP format, establishing a novel benchmark for evaluating state-of-the-art large-scale Gaussian Splatting methods. The dataset’s flexibility encourages innovations and supports model plug-ins, paving the way for future 3D breakthroughs. All datasets and code will be open-sourced for community use.
Poster
Xidan Zhang · Yihan Zhuang · Qian Guo · Haodong Yang · Xuelin Qian · Gong Cheng · Junwei Han · Zhongling Huang

[ Exhibit Hall I ]

Abstract
Approaches for improving generative adversarial networks (GANs) training under a few samples have been explored for natural images. However, these methods have limited effectiveness for synthetic aperture radar (SAR) images, as they do not account for the unique electromagnetic scattering properties of SAR. To remedy this, we propose a physics-inspired regularization method dubbed $\Phi$-GAN, which incorporates the ideal point scattering center (PSC) model of SAR with two physical consistency losses. The PSC model approximates SAR targets using physical parameters, ensuring that $\Phi$-GAN generates SAR images consistent with real physical properties while preventing discriminator overfitting by focusing on PSC-based decision cues. To embed the PSC model into GANs for end-to-end training, we introduce a physics-inspired neural module capable of estimating the physical parameters of SAR targets efficiently. This module retains the interpretability of the physical model and can be trained with limited data. We propose two physical loss functions: one for the generator, guiding it to produce SAR images with physical parameters consistent with real ones, and one for the discriminator, enhancing its robustness by basing decisions on PSC attributes. We evaluate $\Phi$-GAN across several conditional GAN (cGAN) models, demonstrating state-of-the-art performance in data-scarce scenarios on three SAR image datasets.
Poster
Clément Chadebec · Onur Tasar · Sanjeev Sreetharan · Benjamin Aubin

[ Exhibit Hall I ]

Abstract
In this paper, we introduce Latent Bridge Matching (LBM), a new, versatile and scalable method that relies on Bridge Matching in a latent space to achieve fast image-to-image translation. We show that the method can reach state-of-the-art results for various image-to-image tasks using only a single inference step. In addition to its efficiency, we also demonstrate the versatility of the method across different image translation tasks such as object removal, normal and depth estimation, and object relighting. We also derive a conditional framework of LBM and demonstrate its effectiveness by tackling the tasks of controllable image relighting and shadow generation.
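A heavily simplified, hedged sketch of a bridge-matching-style objective on paired latents follows; the endpoint-prediction parameterization, noise level, and one-step inference rule are illustrative choices, not LBM's.

# Simplified bridge-matching-style training on paired latents (x0 from the
# source image, x1 from the target): sample a point on a Brownian bridge
# between the pair and train a network to predict the target endpoint; a
# single call at t = 0 then maps a source latent directly to a target estimate.
import torch
import torch.nn as nn

dim, sigma = 16, 0.1
net = nn.Sequential(nn.Linear(dim + 1, 128), nn.SiLU(), nn.Linear(128, dim))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

def training_step(x0, x1):
    t = torch.rand(x0.shape[0], 1)
    eps = torch.randn_like(x0)
    x_t = (1 - t) * x0 + t * x1 + sigma * torch.sqrt(t * (1 - t)) * eps
    pred = net(torch.cat([x_t, t], dim=1))              # predict the endpoint x1
    loss = ((pred - x1) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# Toy paired data: the "translation" is a fixed linear map of the source latent.
W = torch.randn(dim, dim) * 0.3
for _ in range(200):
    x0 = torch.randn(64, dim)
    training_step(x0, x0 @ W)

x0 = torch.randn(4, dim)
one_step = net(torch.cat([x0, torch.zeros(4, 1)], dim=1))   # single inference step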
Poster
Maximilian Pittner · Joel Janai · Mario Faigle · Alexandru Condurache

[ Exhibit Hall I ]

Abstract
3D lane detection has emerged as a critical challenge in autonomous driving, encompassing identification and localization of lane markings and the 3D road surface. Conventional 3D methods detect lanes from dense Bird's-Eye-View (BEV) features, though erroneous transformations often result in a poor feature representation misaligned with the true 3D road surface. While recent sparse lane detectors have outperformed dense BEV approaches, they remain simple adaptations of the standard detection transformer, completely ignoring valuable lane-specific priors. Furthermore, existing methods fail to utilize historic lane observations, which have the potential to resolve ambiguities in situations of poor visibility. To address these challenges, we present SparseLaneSTP, a novel method that integrates both geometric properties of the lane structure and temporal information into a sparse lane transformer. It introduces a new lane-specific spatio-temporal attention mechanism, a continuous lane representation tailored for sparse architectures, as well as temporal regularization. Identifying the weaknesses of existing 3D lane datasets, we further introduce a precise and consistent 3D lane dataset using a simple yet effective auto-labeling strategy. Our experimental section proves the benefits of our contributions and demonstrates state-of-the-art performance across all detection and error metrics on existing 3D lane detection benchmarks as well as on our novel dataset. We aim to release …
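To make "continuous lane representation" concrete in the simplest possible way (an illustrative choice, not necessarily the paper's parameterization), a lane can be stored as low-degree polynomials of the longitudinal distance and sampled at arbitrary positions:

# A continuous 3D lane as lateral offset x(y) and height z(y), both low-degree
# polynomials of the longitudinal distance y.
import numpy as np

y_obs = np.linspace(5, 60, 12)                       # observed longitudinal samples
x_obs = 0.002 * y_obs**2 + 0.05 * y_obs + 1.5        # a gently curving lane
z_obs = 0.01 * y_obs                                 # slight uphill road surface

cx = np.polyfit(y_obs, x_obs, deg=3)                 # 8 lane parameters in total,
cz = np.polyfit(y_obs, z_obs, deg=3)                 # instead of dense point lists

y_query = np.linspace(5, 60, 200)                    # evaluate anywhere along y
lane_3d = np.stack([np.polyval(cx, y_query), y_query, np.polyval(cz, y_query)], axis=1)
print(lane_3d.shape)                                 # (200, 3)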
Poster
Mengkun She · Felix Seegräber · David Nakath · Patricia Schöntag · Kevin Köser

[ Exhibit Hall I ]

Abstract
We address the challenge of constructing a consistent and photorealistic Neural Radiance Field (NeRF) in inhomogeneously illuminated, scattering environments with unknown, co-moving light sources. While most existing works on underwater scene representation focus on homogeneous, globally illuminated scattering media, limited attention has been given to scenarios such as a robot exploring water deeper than a few tens of meters, where sunlight becomes insufficient. To address this, we propose a novel illumination field that is locally attached to the camera, enabling the capture of uneven lighting effects within the viewing frustum. We combine this with a volumetric representation of the medium into an overall method that effectively handles the interaction between the dynamic illumination field and the static scattering medium. Evaluation results demonstrate the effectiveness and flexibility of our approach. We release our code and dataset at link.
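For context, a fixed analytic version of a camera-attached light in a scattering medium (textbook underwater image formation, shown only to make the setting concrete; the paper instead learns the illumination field) is:

# Toy co-moving light model: a point light rigidly attached to the camera, with
# inverse-square falloff and exponential attenuation through the medium along
# both the light path and the view path. All constants are assumptions.
import numpy as np

def shaded_radiance(p_surface, albedo, cam_pos, light_offset, phi=10.0, sigma=0.08):
    light_pos = cam_pos + light_offset                 # light moves with the camera
    d_l = np.linalg.norm(p_surface - light_pos)        # light -> surface distance
    d_v = np.linalg.norm(cam_pos - p_surface)          # surface -> camera distance
    irradiance = phi / (4 * np.pi * d_l**2) * np.exp(-sigma * d_l)
    return albedo * irradiance * np.exp(-sigma * d_v)  # attenuated view path

cam = np.zeros(3)
print(shaded_radiance(np.array([0.0, 0.0, 3.0]), 0.6, cam, np.array([0.3, 0.0, 0.0])))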
Poster
Christophe Bolduc · Yannick Hold-Geoffroy · Jean-Francois Lalonde

[ Exhibit Hall I ]

Abstract
We present GaSLight, a method that generates spatially-varying lighting from regular images. Our method proposes using HDR Gaussian Splats as light source representation, marking the first time regular images can serve as light sources in a 3D renderer. Our two-stage process first enhances the dynamic range of images plausibly and accurately by leveraging the priors embedded in diffusion models. Next, we employ Gaussian Splats to model 3D lighting, achieving spatially variant lighting. Our approach yields state-of-the-art results on HDR estimations and their applications in illuminating virtual objects and scenes. To facilitate the benchmarking of images as light sources, we introduce a novel dataset of calibrated and unsaturated HDR to evaluate images as light sources. We assess our method using a combination of this novel dataset and an existing dataset from the literature. The code to reproduce our method will be available upon acceptance.
Poster
Timo Teufel · xilong zhou · Umar Iqbal · Pramod Rao · Pulkit Gera · Jan Kautz · Vladislav Golyanik · Christian Theobalt

[ Exhibit Hall I ]

Abstract
Simultaneous relighting and novel-view rendering of digital human representations is an important yet challenging task with numerous applications. However, progress in this area has been significantly limited due to the lack of publicly available, high-quality datasets, especially for full-body human captures. To address this critical gap, we introduce the HumanOLAT dataset, the first publicly accessible large-scale dataset providing multi-view One-Light-at-a-Time (OLAT) captures of full-body humans. The dataset includes HDR RGB frames under various illumination conditions, such as white light, environment maps, color gradients and fine-grained OLAT illuminations. Our evaluations on state-of-the-art relighting and novel-view synthesis methods underscore both the dataset's value and the significant challenges still present in accurately modeling complex human-centric appearance and lighting interactions. We believe that HumanOLAT will significantly facilitate future research, enabling rigorous benchmarking and advancements in both general and human-specific relighting and rendering techniques.
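The value of OLAT data comes from the linearity of light transport: an image under arbitrary lighting is a weighted sum of the one-light-at-a-time images, with weights read from an environment map sampled at the calibrated light directions. A minimal sketch with stand-in arrays (shapes and weights are assumptions) follows.

# Image-based relighting from OLAT captures via a weighted sum.
import numpy as np

rng = np.random.default_rng(0)
L, H, W = 32, 64, 64                       # number of lights, image size
olat = rng.random((L, H, W, 3))            # one HDR image per light
env_weights = rng.random(L) * 0.1          # environment map sampled at the
                                           # calibrated light directions
relit = np.tensordot(env_weights, olat, axes=1)   # (H, W, 3) relit image
print(relit.shape)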
Poster
Robin Swanson · Esther Y. H. Lin · Masen Lamb · Suresh Sivanandam · Kiriakos N. Kutulakos

[ Exhibit Hall I ]

Abstract
Astronomical telescopes suffer from a tradeoff between field-of-view (FoV) and image resolution: increasing the FoV leads to an optical field that is under-sampled by the science camera. This work presents a novel computational imaging approach to overcome this tradeoff by leveraging the existing adaptive optics (AO) systems in modern ground-based telescopes. Our key idea is to use the AO system’s deformable mirror to apply a series of learned, precisely controlled distortions to the optical wavefront, producing a sequence of images that exhibit distinct, high-frequency, sub-pixel shifts. These images can then be jointly upsampled to yield the final super-resolved image. Crucially, we show this can be done while simultaneously maintaining the core AO operation --- correcting for the unknown and rapidly changing wavefront distortions caused by Earth's atmosphere. To achieve this, we incorporate end-to-end optimization of both the induced mirror distortions and the upsampling algorithm, such that telescope-specific optics and temporal statistics of atmospheric wavefront distortions are accounted for. Our experimental results with a hardware prototype, as well as simulations, demonstrate significant SNR improvements of up to 12 dB over non-AO super-resolution baselines, using only existing telescope optics and no hardware modifications. Moreover, by using a precise bench-top replica of a …
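The classical baseline this builds on is multi-frame super-resolution from sub-pixel-shifted frames; a plain shift-and-add version (not the paper's learned, end-to-end pipeline) illustrates why distinct sub-pixel shifts carry extra information.

# Shift-and-add multi-frame super-resolution: low-resolution frames with known
# sub-pixel shifts are scattered onto a finer grid and averaged.
import numpy as np

def shift_and_add(frames, shifts, scale):
    """frames: (N, h, w); shifts: (N, 2) sub-pixel (dy, dx) of each frame."""
    N, h, w = frames.shape
    hr_sum = np.zeros((h * scale, w * scale))
    hr_cnt = np.zeros_like(hr_sum)
    ys, xs = np.mgrid[0:h, 0:w]
    for f, (dy, dx) in zip(frames, shifts):
        Y = np.clip(np.round((ys + dy) * scale).astype(int), 0, h * scale - 1)
        X = np.clip(np.round((xs + dx) * scale).astype(int), 0, w * scale - 1)
        np.add.at(hr_sum, (Y, X), f)
        np.add.at(hr_cnt, (Y, X), 1.0)
    return hr_sum / np.maximum(hr_cnt, 1.0)

frames = np.random.rand(4, 32, 32)
shifts = np.array([[0.0, 0.0], [0.0, 0.5], [0.5, 0.0], [0.5, 0.5]])
hr = shift_and_add(frames, shifts, scale=2)
print(hr.shape)        # (64, 64)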
Poster
Juhyung Ha · Vibhas Vats · Alimoor Reza · Soon-heung Jung · David Crandall

[ Exhibit Hall I ]

Abstract
Point-cloud upsampling aims to generate dense point sets from sparse or incomplete 3D data while preserving geometric fidelity. Most existing works follow the point-to-point (P2P) framework to produce denser point sets through iterative, fixed-scale upsampling, which can limit flexibility in handling various levels of detail in 3D models. Alternatively, voxel-based methods can dynamically upsample point density in voxel space but often struggle to preserve precise point locations due to quantization effects. In this work, we introduce the Hybrid-Voxel Point-cloud Upsampling Network (HVPUNet), an efficient framework for dynamic point-cloud upsampling that addresses the limitations of both point-based and voxel-based methods. HVPUNet integrates two key modules: (1) a Shape Completion Module to restore missing geometry by filling empty voxels, and (2) a Super-Resolution Module to enhance spatial resolution and capture finer surface details. Moreover, we adopt progressive refinement, operational voxel expansion, and implicit learning to improve efficiency in 3D reconstruction. Experimental results demonstrate that HVPUNet effectively upscales large scenes and reconstructs intricate geometry at significantly lower computational cost, providing a scalable and versatile solution for 3D reconstruction, super-resolution, and high-fidelity surface generation.
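The quantization effect mentioned above is easy to quantify: a voxelize/devoxelize round trip snaps points to voxel centres, with an error that scales with the voxel size (illustrative numbers only).

# Voxelization round-trip error versus voxel size.
import numpy as np

pts = np.random.rand(1000, 3)
for voxel in [0.1, 0.05, 0.01]:
    snapped = (np.floor(pts / voxel) + 0.5) * voxel      # voxel-centre positions
    err = np.linalg.norm(pts - snapped, axis=1).mean()
    print(f"voxel {voxel:.2f} -> mean quantization error {err:.4f}")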

Demonstration: Demos 6 Thu 23 Oct 02:30 p.m.  

  • Hybrid Rendering for Multimodal Autonomous Driving: Merging Neural and Physics-Based Simulation, Máté Tóth, Péter Kovács, Zoltán Bendefy, Zoltán Hortsin, Balázs Teréki, Tamás Matuszka
  • Ubiquitous Ultra-Efficient Mobile Video Super-resolution with Logic Gates, Felix Petersen, Louis Le Coeur
  • Consensus-Driven Active Model Selection, Justin Kay, Grant Van Horn, Subhransu Maji, Daniel Sheldon, Sara Beery