Poster
Bo Peng · Jie Lu · Guangquan Zhang · Zhen Fang
[ Exhibit Hall I ]
Abstract
This paper investigates the recently emerged problem of Language-assisted Image Clustering (LaIC), where textual semantics are leveraged to improve the discriminability of visual representations to facilitate image clustering. Due to the unavailability of true class names, one of the core challenges of LaIC lies in how to filter positive nouns, i.e., those semantically close to the images of interest, from unlabeled wild corpus data. Existing filtering strategies are predominantly based on the off-the-shelf feature space learned by CLIP; however, despite being intuitive, these strategies lack a rigorous theoretical foundation. To fill this gap, we propose a novel gradient-based framework, termed GradNorm, which is theoretically guaranteed and shows strong empirical performance. In particular, we measure the positiveness of each noun based on the magnitude of gradients back-propagated from the cross-entropy between the predicted target distribution and the softmax output. Theoretically, we provide a rigorous error bound to quantify the separability of positive nouns by GradNorm and prove that GradNorm naturally subsumes existing filtering strategies as extremely special cases of itself. Empirically, extensive experiments show that GradNorm achieves state-of-the-art clustering performance in various real-world settings.
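As a rough illustration of gradient-magnitude scoring, the sketch below uses only the standard identity that the gradient of a cross-entropy loss with respect to the logits equals softmax(z) − q when the target distribution q sums to one. The function names, the choice of the L1 norm, and the example inputs are assumptions for illustration; the abstract does not specify how logits or targets are built for each noun, so this is not the paper's actual formulation.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grad_norm_score(logits, target):
    """L1 norm of d(cross-entropy)/d(logits).

    For a target distribution that sums to 1, this gradient is
    softmax(logits) - target. Hypothetical helper, not the paper's API.
    """
    probs = softmax(logits)
    return sum(abs(p - q) for p, q in zip(probs, target))

# Prediction matches the target: the gradient vanishes.
matched = grad_norm_score([0.0, 0.0], [0.5, 0.5])
# Prediction confidently disagrees with the target: the norm approaches 2.
mismatched = grad_norm_score([10.0, -10.0], [0.0, 1.0])
```

How such a score is thresholded to separate positive from negative nouns is left open here, since the abstract only states that an error bound on separability exists.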
Poster
You Huang · Lichao Chen · Jiayi Ji · Liujuan Cao · Shengchuan Zhang · Rongrong Ji
[ Exhibit Hall I ]
Abstract
Interactive segmentation (IS) improves annotation efficiency by segmenting target regions from user prompts, with widespread applications in real-world scenarios. Current approaches face a critical trade-off: dense-token methods achieve superior accuracy and detail preservation but suffer from prohibitively slow processing on CPU devices, while the Segment Anything Model (SAM) advances the field with sparse prompt tokens for fast inference but compromises segmentation quality. In this paper, we propose Inter2Former to address this challenge by optimizing computation allocation in dense-token processing, which introduces four key enhancements. First, we propose Dynamic Prompt Embedding (DPE) that adaptively processes only regions of interest while avoiding additional overhead from background tokens. Second, we introduce Dynamic Hybrid Attention (DHA), which leverages previous segmentation masks to route tokens through either full attention ($O(N^2)$) for boundary regions or our proposed efficient BSQ attention ($O(N)$) for non-boundary regions. Third, we develop Hybrid Mixture of Experts (HMoE), which applies similar adaptive computation strategies in FFN modules with CPU-optimized parallel processing. Finally, we present Dynamic Local Upsampling (DLU), a reverse operation of DPE, which localizes objects with a lightweight MLP and performs fine-grained upsampling only in detected regions. Experimental results on high-precision IS benchmarks demonstrate that Inter2Former achieves SOTA performance with high …
Poster
Yuting He · Shuo Li
[ Exhibit Hall I ]
Abstract
Contrastive learning (CL) has become a cornerstone of self-supervised pretraining (SSP) in foundation models; however, extending CL to pixel-wise representation, which is crucial for medical vision, remains an open problem. Standard CL formulates SSP as a binary optimization problem (binary CL) where the excessive pursuit of feature dispersion leads to an "over-dispersion" problem, breaking pixel-wise feature correlation and thus disrupting the intra-class distribution. Our vector CL reformulates CL as a vector regression problem, enabling dispersion quantification in pixel-wise pretraining via modeling feature distances in regressing displacement vectors. To implement this novel paradigm, we propose the COntrast in VEctor Regression (**COVER**) framework. COVER establishes an extendable vector-based self-learning, enforces a consistent optimization flow from vector regression to distance modeling, and leverages a vector pyramid architecture for granularity adaptation, thus preserving pixel-wise feature correlations in SSP. Extensive experiments across 8 tasks, spanning 2 dimensions and 4 modalities, show that COVER significantly improves pixel-wise SSP, advancing generalizable medical visual foundation models. Codes will be publicly available at [GitHub].
Poster
Ming Hu · Kun yuan · Yaling Shen · feilong tang · Xiaohao Xu · Lin Zhou · Wei Li · Ying Chen · Zhongxing Xu · Zelin Peng · Siyuan Yan · Vinkle Srivastav · Diping Song · Tianbin Li · Danli Shi · Jin Ye · Nicolas Padoy · Nassir Navab · Junjun He · Zongyuan Ge
[ Exhibit Hall I ]
Abstract
Vision-language pretraining (VLP) enables open-world generalization beyond predefined labels, a critical capability in surgery due to the diversity of procedures, instruments, and patient anatomies. However, applying VLP to ophthalmic surgery presents unique challenges, including limited vision-language data, intricate procedural workflows, and the need for hierarchical understanding, ranging from fine-grained surgical actions to global clinical reasoning. To address these, we introduce OphVL, a large-scale, hierarchically structured dataset containing over 375K video-text pairs, making it 15× larger than existing surgical VLP datasets. OphVL captures a diverse range of ophthalmic surgical attributes, including surgical phases, operations, actions, instruments, medications, disease causes, surgical objectives, and postoperative care recommendations. By aligning short clips with detailed narratives and full-length videos with structured titles, OphVL provides both fine-grained surgical details and high-level procedural context. Building on OphVL, we propose OphCLIP, a hierarchical retrieval-augmented VLP framework. OphCLIP leverages silent surgical videos as a knowledge base, retrieving semantically relevant content to enhance narrated procedure learning. This enables OphCLIP to integrate explicit linguistic supervision with implicit visual knowledge, improving ophthalmic workflow modeling. Evaluations across 11 benchmark datasets for surgical phase recognition and multi-instrument identification demonstrate OphCLIP’s robust generalization and superior performance, establishing it as a foundation model for ophthalmic surgery.
Poster
Xiaokun Feng · Shiyu Hu · Xuchen Li · Dailing Zhang · Meiqi Wu · Jing Zhang · Xiaotang Chen · Kaiqi Huang
[ Exhibit Hall I ]
Abstract
Vision-language tracking aims to locate the target object in the video sequence using a template patch and a language description provided in the initial frame. To achieve robust tracking, especially in complex long-term scenarios that reflect real-world conditions as recently highlighted by MGIT, it is essential not only to characterize the target features but also to utilize the context features related to the target. However, the visual and textual target-context cues derived from the initial prompts generally align only with the initial target state. Due to their dynamic nature, target states are constantly changing, particularly in complex long-term sequences. It is intractable for these cues to continuously guide Vision-Language Trackers (VLTs). Furthermore, for the text prompts with diverse expressions, our experiments reveal that existing VLTs struggle to discern which words pertain to the target or the context, complicating the utilization of textual cues. In this work, we present a novel tracker named ATCTrack, which can obtain multimodal cues Aligned with the dynamic target states through comprehensive Target-Context feature modeling, thereby achieving robust tracking. Specifically, (1) for the visual modality, we propose an effective temporal visual target-context modeling approach that provides the tracker with timely visual cues. (2) For the textual …
Poster
Simone Alberto Peirone · Francesca Pistilli · Giuseppe Averta
[ Exhibit Hall I ]
Abstract
Human activities are particularly complex and variable, and this makes it challenging for deep learning models to reason about them. However, we note that such variability does have an underlying structure, composed of a hierarchy of patterns of related actions. We argue that such structure can emerge naturally from unscripted videos of human activities, and can be leveraged to better reason about their content. We present HiERO, a weakly-supervised method to enrich video segment features with the corresponding hierarchical activity threads. By aligning video clips with their narrated descriptions, HiERO infers contextual, semantic and temporal reasoning with a hierarchical architecture. We prove the potential of our enriched features with multiple video-text alignment benchmarks (EgoMCQ, EgoNLQ) with minimal additional training, and in zero-shot for procedure learning tasks (EgoProceL and Ego4D Goal-Step). Notably, HiERO achieves state-of-the-art performance in all the benchmarks, and for procedure learning tasks it outperforms fully-supervised methods by a large margin (+12.5% F1 on EgoProceL) in zero-shot. Our results prove the relevance of using knowledge of the hierarchy of human activities for multiple reasoning tasks in egocentric vision.
Poster
Kuniaki Saito · Donghyun Kim · Kwanyong Park · Atsushi Hashimoto · Yoshitaka Ushiku
[ Exhibit Hall I ]
Abstract
An image captioning model that flexibly switches its language pattern, e.g., descriptiveness and length, should be useful since it can be applied to diverse applications. However, despite the dramatic improvement in generative vision-language models, fine-grained control over the properties of generated captions is not easy for two reasons: (i) existing models are not given the properties as a condition during training and (ii) existing models cannot smoothly transition their language pattern from one state to the other. Given this challenge, we propose a new approach, CaptionSmiths, to acquire a single captioning model that can handle diverse language patterns. First, our approach quantifies three properties of each caption (length, descriptiveness, and uniqueness of a word) as continuous scalar values, without human annotation. Given the values, we represent the conditioning via interpolation between two endpoint vectors corresponding to the extreme states, e.g., one for a very short caption and one for a very long caption. Empirical results demonstrate that the resulting model can smoothly change the properties of the output captions and shows higher lexical alignment than baselines. For instance, CaptionSmiths reduces the error in controlling caption length by 506% despite better lexical alignment.
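The conditioning by interpolation between two endpoint vectors can be sketched as follows. This is a minimal illustration that assumes linear interpolation and a property value already normalized to [0, 1]; the abstract only says "interpolation", so both choices, along with the function and variable names, are hypothetical.

```python
def interpolate_condition(value, low_vec, high_vec):
    """Blend two endpoint vectors according to a scalar property value.

    value    -- property strength in [0, 1] (assumed normalization)
    low_vec  -- endpoint for one extreme, e.g. a very short caption
    high_vec -- endpoint for the other extreme, e.g. a very long caption
    """
    if not 0.0 <= value <= 1.0:
        raise ValueError("value must lie in [0, 1]")
    if len(low_vec) != len(high_vec):
        raise ValueError("endpoint vectors must have the same length")
    return [(1.0 - value) * a + value * b for a, b in zip(low_vec, high_vec)]

# Halfway between the two endpoints.
mid_condition = interpolate_condition(0.5, [1.0, 2.0], [3.0, 6.0])
```

Because the condition is a continuous function of the scalar, nearby property values yield nearby conditioning vectors, which is what allows a smooth transition between language patterns.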
Poster
Guangyu Ren · Hengyan Liu · Michalis Lazarou · Tania Stathaki
[ Exhibit Hall I ]
Abstract
Camouflaged scenes, where objects blend seamlessly into their environments, pose significant challenges to both human observers and computer vision systems. These objects match the background in color, texture, and shape, making them difficult to detect. To this end, we propose leveraging the Segment Anything Model (SAM) to tackle this challenging task effectively. Specifically, we show how to exploit SAM without requiring any manual prompts. At the core of our method lies the rich information extracted through multi-modal prompts. First, we generate an image caption using the BLIP model and obtain its text embedding through a text encoder. We then generate a visual embedding through the vision encoder of the BLIP model and use both as inputs to SAM to provide additional semantic information about the image. Finally, we propose two architectural novelties: a) we effectively integrate the multi-modal information in SAM through a multi-level adapter and b) we replace the dense embedding of SAM with the image embedding of its image encoder. Our method achieves new state-of-the-art performance in 11 out of 12 metrics in three benchmark datasets for camouflaged detection. Additionally, our method can be successfully adapted to other …
Poster
Li Caoshuo · Zengmao Ding · Xiaobin Hu · Bang Li · Donghao Luo · AndyPianWu AndyPianWu · Chaoyang Wang · Chengjie Wang · Taisong Jin · SevenShu SevenShu · Yunsheng Wu · Yongge Liu · Rongrong Ji
[ Exhibit Hall I ]
Abstract
As one of the earliest ancient languages, Oracle Bone Script (**OBS**) encapsulates the cultural records and intellectual expressions of ancient civilizations. Despite the discovery of approximately 4,500 OBS characters, only about 1,600 have been deciphered. The remaining undeciphered ones, with their complex structure and abstract imagery, pose significant challenges for interpretation. To address these challenges, this paper proposes a novel two-stage semantic typography framework, named **OracleFusion**. In the first stage, this approach leverages the Multimodal Large Language Model (MLLM) with enhanced Spatial Awareness Reasoning (SAR) to analyze the glyph structure of the OBS character and perform visual localization of key components. In the second stage, we introduce Oracle Structural Vector Fusion (**OSVF**), incorporating glyph structure constraints and glyph maintenance constraints to ensure the accurate generation of semantically enriched vector fonts. This approach preserves the objective integrity of the glyph structure, offering visually enhanced representations that assist experts in deciphering OBS. Extensive qualitative and quantitative experiments demonstrate that OracleFusion outperforms state-of-the-art baseline models in terms of semantics, visual appeal, and glyph maintenance, significantly enhancing both readability and aesthetic quality. Furthermore, OracleFusion provides expert-like insights on unseen oracle characters, making it a valuable tool for advancing the decipherment of OBS.
Poster
Haoning Wu · Ziheng Zhao · Ya Zhang · Yanfeng Wang · Weidi Xie
[ Exhibit Hall I ]
Abstract
Training medical image segmentation models for rare yet clinically significant imaging modalities is challenging due to the scarcity of annotated data, and manual mask annotations can be costly and labor-intensive to acquire. This paper investigates **leveraging generative models to synthesize training data, to train segmentation models for underrepresented modalities**, particularly on annotation-scarce MRI. Concretely, our contributions are threefold: (i) we introduce **MRGen-DB**, a large-scale radiology image-text dataset comprising extensive samples with rich metadata, including modality labels, attributes, regions, and organs information, with a subset having pixelwise mask annotations; (ii) we present **MRGen**, a diffusion-based data engine for controllable medical image synthesis, conditioned on text prompts and segmentation masks. MRGen can generate realistic images for diverse MRI modalities lacking mask annotations, facilitating segmentation training in low-source domains; (iii) extensive experiments across multiple modalities demonstrate that MRGen significantly improves segmentation performance on unannotated modalities by providing high-quality synthetic data. We believe that our method bridges a critical gap in medical image analysis, extending segmentation capabilities to scenarios where manual annotations are challenging to acquire. The codes, models, and data will be publicly available.
Poster
Fanhong Zeng · Huanan LI · Juntao Guan · Rui Fan · Tong Wu · Xilong Wang · Lai Rui
[ Exhibit Hall I ]
Abstract
To enable the deployment of Vision Transformers on resource-constrained mobile and edge devices, the development of efficient ViT models has attracted significant attention. Researchers have achieved remarkable improvements in accuracy and speed by optimizing attention mechanisms and integrating lightweight CNN modules. However, existing designs often overlook runtime overhead from memory-bound operations and the shift in feature characteristics from spatial-dominant to semantic-dominant as networks deepen. This work introduces TinyNeXt, a family of efficient hybrid ViTs for TinyML, featuring Lean Single-Head Self-Attention to minimize memory-bound operations, and a macro design tailored to feature characteristics at different stages. TinyNeXt strikes a better accuracy-speed trade-off across diverse tasks and hardware platforms, outperforming state-of-the-art models of comparable scale. For instance, our TinyNeXt-T achieves a remarkable 71.5% top-1 accuracy with only 1.0M parameters on ImageNet-1K. Furthermore, compared to recent efficient models like MobileViT-XXS and MobileViT-XS, TinyNeXt-S and TinyNeXt-M achieve 3.7%/0.5% higher accuracy, respectively, while running 2.1×/2.6× faster on Nvidia Jetson Nano.
Poster
Yuntao Shou · Xiangyong Cao · PeiqiangYan PeiqiangYan · Qiaohui Qiaohui · Qian Zhao · Deyu Meng
[ Exhibit Hall I ]
Abstract
In recent years, whole slide image (WSI)-based survival analysis has attracted much attention. In practice, WSIs usually come from different hospitals (or domains) and may have significant differences. These differences generally result in large gaps in distribution between different WSI domains, and thus the survival analysis models trained on one domain may fail to transfer to another. To address this issue, we propose a Dual-branch Encoder and Two-level Alignment (DETA) framework to explore both feature and category-level alignment between different WSI domains. Specifically, we first formulate the concerned problem as graph domain adaptation (GDA) using the graph representation of WSIs. Then, we construct a dual-branch graph encoder, including the message passing (MP) and the shortest path (SP) branches, to explicitly and implicitly extract semantic information from the graph-represented WSIs. To realize GDA, we propose a two-level alignment approach: at the category level, we develop a coupling technique by virtue of the dual-branch structure, leading to reduced divergence between the category distributions of the two domains; at the feature level, we introduce an adversarial perturbation strategy to better augment source-domain features, resulting in improved alignment in feature distribution. Extensive experiments have demonstrated the effectiveness of our proposed DETA framework in …
Poster
Ming Dai · Wenxuan Cheng · Jiang-Jiang Liu · Sen Yang · Wenxiao Cai · Yanpeng Sun · Wankou Yang
[ Exhibit Hall I ]
Abstract
Referring Image Segmentation (RIS) is a challenging task that aims to segment objects in an image based on natural language expressions. While prior studies have predominantly concentrated on improving vision-language interactions and achieving fine-grained localization, a systematic analysis of the fundamental bottlenecks in existing RIS frameworks remains underexplored. To bridge this gap, we propose **DeRIS**, a novel framework that decomposes RIS into two key components: *perception* and *cognition*. This modular decomposition facilitates a systematic analysis of the primary bottlenecks impeding RIS performance. Our findings reveal that the predominant limitation lies not in perceptual deficiencies, but in the insufficient multi-modal cognitive capacity of current models. To mitigate this, we propose a Loopback Synergy mechanism, which enhances the synergy between the perception and cognition modules, thereby enabling precise segmentation while simultaneously improving robust image-text comprehension. Additionally, we analyze and introduce a simple non-referent sample conversion data augmentation to address the long-tail distribution issue related to target existence judgement in general scenarios. Notably, **DeRIS** demonstrates inherent adaptability to both non- and multi-referent scenarios without requiring specialized architectural modifications, enhancing its general applicability.
Poster
Hongwei Lin · Dongyu Pan · Qiming Xia · Hai Wu · Cheng Wang · Siqi Shen · Chenglu Wen
[ Exhibit Hall I ]
Abstract
Recently, learning-based multi-agent cooperative perception has garnered widespread attention. However, the inherent vulnerabilities of neural networks, combined with the risks posed by cooperative communication as a wide-open backdoor, render these systems highly susceptible to adversarial attacks. Existing attack methods lack stealth as they perturb transmitted information indiscriminately, producing numerous false positives that are readily detected by consensus-based defenses. This paper proposes Pretend Benign (PB), a novel stealthy adversarial attack method that exploits vulnerabilities in cooperative perception to enable the attacker to disguise itself as a benign cooperator. To achieve this, we first introduce the Attack Region Selection (ARS) module, which divides the perception area into sub-regions based on confidence levels to pinpoint optimal attack locations. Then, we propose Multi-target Adversarial Perturbation Generation (MAPG), which maintains consensus, gains the victim’s trust, and thereby reverses the normal cooperative role of perception. To mitigate the latency in adversarial signal generation and communication, we further propose a real-time attack by predicting future information through historical feature flow. Extensive experiments on the OPV2V and V2XSet datasets demonstrate that PB effectively bypasses state-of-the-art defense methods, underscoring its stealth and efficacy.
Poster
Sunghyun Park · Jungsoo Lee · Shubhankar Borse · Munawar Hayat · Sungha Choi · Kyuwoong Hwang · Fatih Porikli
[ Exhibit Hall I ]
Abstract
While open-vocabulary semantic segmentation (OVSS) can segment an image into semantic regions based on arbitrarily given text descriptions even for classes unseen during training, it fails to understand personal texts (e.g., 'my mug cup') for segmenting regions of specific interest to users. This paper addresses challenges like recognizing 'my mug cup' among 'multiple mug cups'. To overcome this challenge, we introduce a novel task termed personalized open-vocabulary semantic segmentation and propose a text prompt tuning-based plug-in method designed to recognize personal visual concepts using a few pairs of images and masks, while maintaining the performance of the original OVSS. Based on the observation that reducing false predictions is essential when applying text prompt tuning to this task, our proposed method employs 'negative mask proposal' that captures visual concepts other than the personalized concept. We further improve the performance by enriching the representation of text prompts by injecting visual embeddings of the personal concept into them. This approach enhances personalized OVSS without compromising the original OVSS performance. We demonstrate the superiority of our method on our newly established benchmarks for this task, including FSS$^{per}$, CUB$^{per}$, and ADE$^{per}$.
Poster
Zheng Ziqiang · Wong Kwan · Binh-Son Hua · Jianbo Shi · Sai-Kit Yeung
[ Exhibit Hall I ]
Abstract
We investigate coral reef semantic segmentation, in which coral reefs are governed by multifaceted factors, such as genes, environmental changes, and internal interactions. Unlike segmenting structural units/instances, which are predictable and follow a set pattern (also referred to as commonsense or prior), segmenting coral reefs involves modeling the *self-repeated*, *asymmetric*, and *amorphous* distribution of elements, e.g., corals can grow in almost any shape and appearance. We revisited existing segmentation approaches and found that both the computer vision and coral reef research communities failed to incorporate the intrinsic properties of corals into model design. In this work, we propose a simple formulation for coral reef semantic segmentation: *segment* as the basis to model both *within-segment* and *cross-segment* affinities. We propose **CoralSRT**, a feature rectification module via self-supervised guidance, to reduce the stochasticity of coral features extracted by powerful foundation models (FMs), as demonstrated in the teaser figure. We incorporate the intrinsic properties of corals to strengthen within-segment affinity by guiding the features within the self-supervised segments to align with the centrality. We find that the features from FMs, which were optimized by various pretext tasks on significantly large-scale unlabeled or labeled data, already contain rich information for modeling both within-segment and cross-segment affinity, enabling the adaptation …
Poster
Olaf Dünkel · Artur Jesslen · Jiahao Xie · Christian Theobalt · Christian Rupprecht · Adam Kortylewski
[ Exhibit Hall I ]
Abstract
An important challenge when using computer vision models in the real world is to evaluate their performance in potential out-of-distribution (OOD) scenarios. While simple synthetic corruptions are commonly applied to test OOD robustness, they mostly do not capture nuisance shifts that occur in the real world. Recently, diffusion models have been applied to generate realistic images for benchmarking, but they are restricted to binary nuisance shifts. In this work, we introduce CNS-Bench, a Continuous Nuisance Shift Benchmark to quantify OOD robustness of image classifiers for continuous and realistic generative nuisance shifts. CNS-Bench allows generating a wide range of individual nuisance shifts in continuous severities by applying LoRA adapters to diffusion models. To remove failure cases, we propose a filtering mechanism that outperforms previous methods and hence enables a reliable benchmarking with generative models. With the proposed benchmark, we perform a large-scale study to evaluate the robustness of more than 40 classifiers under various nuisance shifts. Through carefully designed comparisons and analyses, we find that model rankings can change for varying shifts and shift scales, which cannot be captured when applying common binary shifts. Additionally, we show that evaluating the model performance on a continuous scale allows the identification of model …
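One common way to obtain a continuous shift strength from a LoRA adapter is to scale its low-rank update before adding it to the frozen weight, i.e. W' = W + s · (B A). The sketch below illustrates that idea on plain nested lists. Treating the severity as a scalar multiplier `scale` is an assumption made here for illustration; the abstract does not state how CNS-Bench controls severity, and the helper names are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    inner = len(b)
    cols = len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]

def apply_lora(weight, lora_b, lora_a, scale):
    """Return weight + scale * (lora_b @ lora_a).

    weight -- frozen base matrix (out x in)
    lora_b -- low-rank factor (out x r)
    lora_a -- low-rank factor (r x in)
    scale  -- scalar strength of the adapter; 0 recovers the base weight
    """
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]

base = [[1.0, 0.0], [0.0, 1.0]]
b_factor = [[1.0], [2.0]]   # rank-1 adapter
a_factor = [[3.0, 4.0]]
half_strength = apply_lora(base, b_factor, a_factor, 0.5)
```

Sweeping `scale` over a range of values then yields a family of models between the unmodified generator and the fully shifted one.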
Poster
Tiange Luo · Lajanugen Logeswaran · Justin Johnson · Honglak Lee
[ Exhibit Hall I ]
Abstract
We introduce RegionFocus, a visual test-time scaling approach that enhances GUI-based AI agents by leveraging visual cues to navigate the complexity of modern web interfaces. Understanding webpages is challenging due to the visual complexity of GUI images and the large number of interface elements, making accurate action selection difficult. Our approach dynamically zooms in on relevant regions, reducing background clutter and improving action accuracy without relying on extensive text-based reasoning. To support this process, we propose an image-as-history mechanism that visualizes key landmarks at each step, providing a transparent action record and enabling the agent to effectively choose among action candidates. Even with a simple region selection strategy, we observe significant performance gains of 31.7% on Screenspot-pro and 34.9% on WebVoyager benchmarks on top of a state-of-the-art open Vision Language Model Agent, highlighting the effectiveness of visual test-time scaling in interactive settings. Our code will be released publicly.
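Zooming in on a region implies that any action predicted on the zoomed view has to be mapped back to full-page coordinates. The sketch below shows that bookkeeping step under stated assumptions: the region is an axis-aligned rectangle, the zoomed view is a uniform resize of it, and all names are hypothetical. It does not reproduce how RegionFocus selects regions.

```python
def crop_point_to_page(point, region, crop_size):
    """Map a point in the zoomed crop back to page coordinates.

    point     -- (x, y) in pixels inside the zoomed crop
    region    -- (left, top, width, height) of the region on the page
    crop_size -- (width, height) of the zoomed crop shown to the agent
    """
    x, y = point
    left, top, width, height = region
    crop_w, crop_h = crop_size
    if not (0 <= x <= crop_w and 0 <= y <= crop_h):
        raise ValueError("point lies outside the crop")
    # Undo the zoom, then shift by the region's offset on the page.
    return (left + x * width / crop_w, top + y * height / crop_h)

# A click at the center of a 200x160 crop of a 50x40 region at (100, 200).
page_xy = crop_point_to_page((100, 80), (100, 200, 50, 40), (200, 160))
```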
Poster
Jiangming Shi · Xiangbo Yin · yeyunchen yeyunchen · Yachao Zhang · zhizhong zhang · Yuan Xie · Yanyun Qu
[ Exhibit Hall I ]
Abstract
Composed Image Retrieval (CIR) aims to retrieve a target image using a query that combines a reference image and a textual description, helping users express their intent more effectively. Despite significant advances in CIR methods, two unresolved problems remain: 1) existing methods overlook multi-schema interaction due to the lack of fine-grained explicit visual supervision, which hinders the capture of complex correspondences, and 2) existing methods overlook noisy negative pairs formed by potential corresponding query-target pairs, which increases confusion. To address these problems, we propose a Multi-schemA Proximity Network (MAPNet) for CIR, consisting of two key components: Multi-Schema Interaction (MSI) and Relaxed Proximity Loss (RPLoss). Specifically, MSI leverages textual descriptions as an implicit guide to establish correspondences between multiple objects and attributes in the reference and target images, enabling multi-schema interactions. Then, RPLoss further aligns the query and target features while avoiding the harmful effect of noisy negative pairs through a denoising and reweighting strategy. Comprehensive experiments conducted on CIRR, FashionIQ, and LaSCo demonstrate that MAPNet achieves competitive results against state-of-the-art CIR methods. The source code will be made publicly available after the paper is accepted.
Poster
Tianming Liang · Kun-Yu Lin · Chaolei Tan · Jianguo Zhang · Wei-Shi Zheng · Jian-Fang Hu
[ Exhibit Hall I ]
Abstract
Referring video object segmentation (RVOS) aims to segment target objects throughout a video based on a text description. This is challenging as it involves deep vision-language understanding, pixel-level dense prediction and spatiotemporal reasoning. Despite notable progress in recent years, existing methods still exhibit a noticeable gap when considering all these aspects. In this work, we propose **ReferDINO**, a strong RVOS model that inherits region-level vision-language alignment from foundational visual grounding models, and is further endowed with pixel-level dense perception and cross-modal spatiotemporal reasoning. In detail, ReferDINO integrates two key components: 1) a grounding-guided deformable mask decoder that utilizes location prediction to progressively guide mask prediction through differentiable deformation mechanisms; 2) an object-consistent temporal enhancer that injects pretrained time-varying text features into inter-frame interaction to capture object-aware dynamic changes. Moreover, a confidence-aware query pruning strategy is designed to accelerate object decoding without compromising model performance. Extensive experimental results on five benchmarks demonstrate that our ReferDINO significantly outperforms previous methods (e.g., +3.9% $\mathcal{J}\&\mathcal{F}$ on Ref-YouTube-VOS) while maintaining real-time inference speed (51 FPS). Code and models will be released.
Poster
Saarthak Kapse · Pushpak Pati · Srikar Yellapragada · Srijan Das · Rajarsi Gupta · Joel Saltz · Dimitris Samaras · Prateek Prasanna
[ Exhibit Hall I ]
Abstract
Pretraining a Multiple Instance Learning (MIL) aggregator enables the derivation of Whole Slide Image (WSI)-level embeddings from patch-level representations without supervision. While recent multimodal MIL pretraining approaches leveraging auxiliary modalities have demonstrated performance gains over unimodal WSI pretraining, the acquisition of these additional modalities necessitates extensive clinical profiling. This requirement increases costs and limits scalability in existing WSI datasets lacking such paired modalities. To address this, we propose Gigapixel Vision-Concept Knowledge Contrastive pretraining (GECKO), which aligns WSIs with a Concept Prior derived from the available WSIs. First, we derive an inherently interpretable concept prior by computing the similarity between each WSI patch and textual descriptions of predefined pathology concepts. GECKO then employs a dual-branch MIL network: one branch aggregates patch embeddings into a WSI-level deep embedding, while the other aggregates the concept prior to a corresponding WSI-level concept embedding. Both aggregated embeddings are aligned using a contrastive objective, thereby pretraining the entire dual-branch MIL model. Moreover, when auxiliary modalities such as transcriptomics data are available, GECKO seamlessly integrates them. Across five diverse tasks, GECKO consistently outperforms prior unimodal and multimodal pretraining approaches while also delivering clinically meaningful interpretability that bridges the gap between computational models and pathology expertise.
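The concept prior described above, a similarity between each patch and each concept description, can be pictured as a patch-by-concept matrix. The sketch below assumes cosine similarity between precomputed patch embeddings and text embeddings that live in a shared space; the abstract only says "similarity", so the metric, the embedding inputs, and the function names are assumptions for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def concept_prior(patch_embeddings, concept_embeddings):
    """Return a (num_patches x num_concepts) similarity matrix.

    Row i holds the similarity of patch i to every concept, so each
    column can be read as an activation of one named concept.
    """
    return [[cosine(p, c) for c in concept_embeddings]
            for p in patch_embeddings]

patches = [[1.0, 0.0], [0.0, 2.0]]    # toy patch embeddings
concepts = [[1.0, 0.0], [1.0, 1.0]]   # toy concept text embeddings
prior = concept_prior(patches, concepts)
```

Because every column corresponds to a named pathology concept, a representation aggregated from this matrix stays readable in terms of those concepts, which is the sense in which the prior is inherently interpretable.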
Poster
Qi Qin · Le Zhuo · Yi Xin · Ruoyi Du · Zhen Li · Bin Fu · Yiting Lu · Xinyue Li · Dongyang Liu · Xiangyang Zhu · Will Beddow · Erwann Millon · Victor Perez · Wenhai Wang · Yu Qiao · Bo Zhang · Xiaohong Liu · Hongsheng Li · Chang Xu · Peng Gao
[ Exhibit Hall I ]
Abstract
We introduce **Lumina-Image 2.0**, an advanced text-to-image (T2I) model that surpasses previous state-of-the-art methods across multiple benchmarks. Lumina-Image 2.0 is characterized by two key features: (1) *Unification* – it adopts a unified architecture (Unified Next-DiT) that treats text and image tokens as a joint sequence, enabling natural cross-modal interactions and allowing seamless task expansion. Besides, since high-quality captioners can provide semantically well-aligned text-image training pairs, we introduce a unified captioning system, Unified Captioner (UniCap), which can generate detailed and accurate multilingual captions for our model. This not only accelerates model convergence, but also enhances prompt adherence, multi-granularity prompt handling, and task expansion with customized prompt templates. (2) *Efficiency* – to improve the efficiency of our proposed model, we develop multi-stage progressive training strategies to optimize our model, alongside inference-time acceleration strategies without compromising image quality. We evaluate our model on academic benchmarks and T2I arenas, with results confirming that it matches or exceeds existing state-of-the-art models across various metrics, highlighting the effectiveness of our methods.
Poster
Doriand Petit · Steve Bourgeois · Vincent Gay-Bellile · Florian Chabot · Loïc Barthe
[ Exhibit Hall I ]
Abstract
3D semantic segmentation provides high-level scene understanding for applications in robotics, autonomous systems, etc. Traditional methods adapt exclusively to either task-specific goals (open-vocabulary segmentation) or scene content (unsupervised semantic segmentation). We propose DiSCO-3D, the first method addressing the broader problem of 3D Open-Vocabulary Sub-concepts Discovery, which aims to provide a 3D semantic segmentation that adapts to both the scene and user queries. We build DiSCO-3D on Neural Fields representations, combining unsupervised segmentation with weak open-vocabulary guidance. Our evaluations demonstrate that DiSCO-3D achieves effective performance in Open-Vocabulary Sub-concepts Discovery and exhibits state-of-the-art results in the edge cases of both open-vocabulary and unsupervised segmentation.
Poster
Sheng Ye · Xin Chen · Yan Zhang · Xianming Lin · Liujuan Cao
[ Exhibit Hall I ]
Abstract
Camouflaged object detection (COD) faces unique challenges where target boundaries are intrinsically ambiguous due to their textural similarity to backgrounds. Existing methods relying on single-modality features often produce fragmented predictions due to insufficient boundary constraints. To address this, we propose ESCNet with dynamically coupled edge-texture perception. Our framework introduces three core innovations that work in concert: 1) Adaptive Edge-Texture Perceptor (AETP), which creates an edge prediction behaviour where edge and texture information are mutually reinforcing based on the multi-scale features of the image integrated with the global semantic context of the Transformer; 2) Dual-Stream Feature Augmentor (DSFA), which dynamically adjusts the kernel sampling position according to the local texture complexity and edge orientation, thus accurately enhancing the feature information at fractal boundaries and amorphous texture locations; 3) Multi-Feature Modulation Module (MFMM), which establishes incremental fine-grained improvements for feature calibration and model prediction through enhanced characterisation of edge perception and hierarchical integration of multiple textures. This interconnected system forms a feedback loop where enhanced representations of edge perception enhance model texture prediction and vice versa. Our ESCNet demonstrates significant performance advantages on all three authoritative datasets. On the $F^w_\beta$ metric, ESCNet achieves 0.859 and 0.843 on the NC4K and CAMO datasets, respectively.
Poster
Zhenwei Shao · Mingyang Wang · Zhou Yu · Wenwen Pan · Yan Yang · Tao Wei · Hongyuan Zhang · Ning Mao · Chen Wei · Jun Yu
[ Exhibit Hall I ]
Abstract
Large vision-language models (VLMs) have demonstrated remarkable capabilities in open-world multimodal understanding, yet their high computational overheads pose great challenges for practical deployment. Some recent works have proposed methods to accelerate VLMs by pruning redundant visual tokens guided by the attention maps of the VLM's early layers. Despite the success of these token pruning methods, they still suffer from two major shortcomings: (i) considerable accuracy drop due to insensitive attention signals in early layers, and (ii) limited speedup when generating long responses (e.g., 30 tokens). To address the limitations above, we present TwigVLM, a simple and general architecture built by "growing" a lightweight twig upon an early layer of the base VLM. Compared with most existing VLM acceleration methods purely based on visual token pruning, our TwigVLM not only achieves better accuracy retention by employing a twig-guided token pruning (TTP) strategy, but also yields higher generation speed by utilizing a self-speculative decoding (SSD) strategy. Taking LLaVA-1.5-7B as the base VLM, experimental results show that TwigVLM preserves 96% of the original performance after pruning 88.9% of visual tokens and achieves a 154% speedup in generating long responses, delivering significantly better performance in terms of both accuracy and speed over the state-of-the-art VLM acceleration methods.
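To make the pruning ratio above concrete, here is a minimal sketch of generic top-k token pruning. It assumes 576 visual tokens enter the language model (an assumption about the LLaVA-1.5 setup, not stated in the abstract), in which case pruning 88.9% leaves 64 tokens. The function name `prune_tokens` and the synthetic scores are hypothetical, and this is plain score-based selection, not the authors' twig-guided strategy.

```python
def prune_tokens(scores, keep_ratio):
    """Return the sorted indices of the tokens kept by top-k pruning."""
    n_keep = max(1, round(len(scores) * keep_ratio))
    # Rank token indices from most to least important.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Restore the original token order among the survivors.
    return sorted(ranked[:n_keep])


# Synthetic importance scores: a permutation of 0..575 (hypothetical values).
scores = [(i * 37) % 576 for i in range(576)]
kept = prune_tokens(scores, keep_ratio=1 - 0.889)
print(len(kept))
```

Keeping the surviving indices in their original order preserves the positional structure of the remaining visual tokens, which is usually what a downstream decoder expects.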
Poster
Kecheng Chen · Xinyu Luo · Tiexin Qin · Jie Liu · Hui Liu · Victor Ho Fun Lee · Hong Yan · Haoliang Li
[ Exhibit Hall I ]
Abstract
Foundation medical segmentation models, with MedSAM being the most popular, have achieved promising performance across organs and lesions. However, MedSAM still suffers from compromised performance on specific lesions with intricate structures and appearance, as well as bounding box prompt-induced perturbations. Although current test-time adaptation (TTA) methods for medical image segmentation may tackle this issue, partial (e.g., batch normalization) or whole parametric updates restrict their effectiveness due to limited update signals or catastrophic forgetting in large models. Meanwhile, these approaches ignore the computational complexity during adaptation, which is particularly significant for modern foundation models. To this end, our theoretical analyses reveal that directly refining image embeddings is feasible to approach the same goal as parametric updates under the MedSAM architecture, which enables us to realize high computational efficiency and segmentation performance without the risk of catastrophic forgetting. Under this framework, we propose to encourage maximizing factorized conditional probabilities of the posterior prediction probability using a proposed distribution-approximated latent conditional random field loss combined with an entropy minimization loss. Experiments show that we achieve about 3% Dice score improvements across three datasets while reducing computational complexity by over 7 times.
Poster
Nicholas DiBrita · Jason Han · Tirthak Patel
[ Exhibit Hall I ]
Abstract
Research in quantum machine learning has recently proliferated due to the potential of quantum computing to accelerate machine learning. An area of machine learning that has not yet been explored is neural ordinary differential equation (neural ODE) based residual neural networks (ResNets), which aim to improve the effectiveness of neural networks using the principles of ordinary differential equations. In this work, we present our insights about why analog Rydberg atom quantum computers are especially well-suited for ResNets. We also introduce ResQ, a novel framework to optimize the dynamics of Rydberg atom quantum computers to solve classification problems in machine learning using analog quantum neural ODEs.
Poster
hahyeon choi · Junhoo Lee · Nojun Kwak
[ Exhibit Hall I ]
Abstract
Audio-Visual Localization (AVL) aims to identify sound-emitting sources within a visual scene. However, existing studies focus on image-level audio-visual associations, failing to capture temporal dynamics. Moreover, they assume simplified scenarios where sound sources are always visible and involve only a single object. To address these limitations, we propose AVATAR, a video-centric AVL benchmark that incorporates high-resolution temporal information. AVATAR introduces four distinct scenarios -- Single-sound, Mixed-sound, Multi-entity, and Off-screen -- enabling a more comprehensive evaluation of AVL models. Additionally, we present TAVLO, a novel video-centric AVL model that explicitly integrates temporal information. Experimental results show that conventional methods struggle to track temporal variations due to their reliance on global audio features and frame-level mappings. In contrast, TAVLO achieves robust and precise audio-visual alignment by leveraging high-resolution temporal modeling. Our work empirically demonstrates the importance of temporal dynamics in AVL and establishes a new standard for video-centric audio-visual localization.
Poster
Youngeun Kim · Seunghwan Lee · Aecheon Jung · Bogon Ryu · Sungeun Hong
[ Exhibit Hall I ]
Abstract
Model merging enables efficient multi-task models by combining task-specific fine-tuned checkpoints. However, storing multiple task-specific checkpoints requires significant memory, limiting scalability and restricting model merging to larger models and diverse tasks. In this paper, we propose quantizing task vectors (i.e., the difference between pre-trained and fine-tuned checkpoints) instead of quantizing fine-tuned checkpoints. We observe that task vectors exhibit a narrow weight range, enabling low-precision quantization (≤ 4 bit) within existing task vector merging frameworks. To further mitigate quantization errors within ultra-low bit precision (e.g., 2 bit), we introduce Residual Task Vector Quantization, which decomposes the task vector into a base vector and offset component. We allocate bits based on quantization sensitivity, ensuring precision while minimizing error within a memory budget. Experiments on image classification and dense prediction show our method maintains or improves model merging performance while using only 8% of the memory required for full-precision checkpoints. Code and quantized task vectors will be released.
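The narrow-range observation above can be illustrated with a small sketch. It uses plain uniform quantization and a simple two-stage (base plus offset) scheme; both are assumptions chosen for illustration and are not the authors' exact quantizer or decomposition. All names and weight values are hypothetical.

```python
def quantize(values, bits):
    """Uniformly quantize floats to 2**bits levels and return dequantized values."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return list(values)
    step = (hi - lo) / (2 ** bits - 1)
    return [lo + round((v - lo) / step) * step for v in values]


def residual_quantize(vector, base_bits, offset_bits):
    """Quantize a vector coarsely, then quantize the leftover error as an offset."""
    base = quantize(vector, base_bits)
    residual = [v - b for v, b in zip(vector, base)]
    offset = quantize(residual, offset_bits)
    return [b + o for b, o in zip(base, offset)]


def max_abs_error(a, b):
    return max(abs(x - y) for x, y in zip(a, b))


# Hypothetical weights: fine-tuning moves each weight only slightly.
pretrained = [0.50, -1.20, 0.80, 2.00, -0.30, 1.10]
finetuned = [0.52, -1.19, 0.77, 2.03, -0.31, 1.14]
task_vector = [f - p for f, p in zip(finetuned, pretrained)]

# Option A: quantize the fine-tuned checkpoint directly (wide value range).
err_direct = max_abs_error(finetuned, quantize(finetuned, bits=2))

# Option B: quantize the task vector (narrow range) and add it back.
tv_q = quantize(task_vector, bits=2)
err_base = max_abs_error(finetuned, [p + t for p, t in zip(pretrained, tv_q)])

# Option C: base plus offset quantization of the task vector.
tv_rq = residual_quantize(task_vector, base_bits=2, offset_bits=2)
err_res = max_abs_error(finetuned, [p + t for p, t in zip(pretrained, tv_rq)])

print(err_direct, err_base, err_res)
```

Because the task vector spans a much smaller range than the full weights, the same number of bits yields a much finer step size, and the second stage shrinks the remaining error further.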
Poster
Jiacheng Lu · Hui Ding · Shiyu Zhang · Guoping Huo
[ Exhibit Hall I ]
Abstract
MRI tumor segmentation remains a critical challenge in medical imaging, where volumetric analysis faces unique computational demands due to the complexity of 3D data. The spatially sequential arrangement of adjacent MRI slices provides valuable information that enhances segmentation continuity and accuracy, yet this characteristic remains underutilized in many existing models. The spatial correlations between adjacent MRI slices can be regarded as “temporal-like” data, similar to frame sequences in video segmentation tasks. To bridge this gap, we propose M-Net, a flexible framework specifically designed for sequential image segmentation. M-Net introduces the novel Mesh-Cast mechanism, which seamlessly integrates arbitrary sequential models into the processing of both channel and temporal information, thereby systematically capturing the inherent “temporal-like” spatial correlations between MRI slices and ensuring consistent segmentation across sequences. Additionally, we define an MRI sequential input pattern and design a Two-Phase Sequential (TPS) training strategy, which first focuses on learning common patterns across sequences before refining slice-specific feature extraction. This approach leverages temporal modeling techniques to preserve volumetric contextual information while avoiding the high computational cost of full 3D convolutions, thereby enhancing the generalizability and robustness of M-Net in sequential segmentation tasks. Experiments on the BraTS2019 and BraTS2023 datasets demonstrate that M-Net outperforms existing …
Poster
Yupeng Hu · Changxing Ding · Chang Sun · Shaoli Huang · Xiangmin Xu
[ Exhibit Hall I ]
Abstract
Open vocabulary Human-Object Interaction (HOI) detection is a challenging task that detects all <human, verb, object> triplets of interest in an image, even those that are not pre-defined in the training set. Existing approaches typically rely on output features generated by large Vision-Language Models (VLMs) to enhance the generalization ability of interaction representations. However, the visual features produced by VLMs are holistic and coarse-grained, which contradicts the nature of detection tasks. To address this issue, we propose a novel Bilateral Collaboration framework for open vocabulary HOI detection (BC-HOI). This framework includes an Attention Bias Guidance (ABG) component, which guides the VLM to produce fine-grained instance-level interaction features according to the attention bias provided by the HOI detector. It also includes a Large Language Model (LLM)-based Supervision Guidance (LSG) component, which provides fine-grained token-level supervision for the HOI detector by the LLM component of the VLM. LSG enhances the ability of ABG to generate high-quality attention bias. We conduct extensive experiments on two popular benchmarks: HICO-DET and V-COCO, consistently achieving superior performance in the open vocabulary and closed settings. The code will be released on GitHub.
Poster
Xiaolong Sun · Le Wang · Sanping Zhou · Liushuai Shi · Kun Xia · Mengnan Liu · Yabing Wang · Gang Hua
[ Exhibit Hall I ]
Abstract
Video temporal grounding is a critical video understanding task, which aims to localize moments relevant to a language description. The challenge of this task lies in distinguishing relevant and irrelevant moments. Previous methods focused on learning continuous features exhibit weak differentiation between foreground and background features. In this paper, we propose a novel Moment-Quantization based Video Temporal Grounding method (MQVTG), which quantizes the input video into various discrete vectors to enhance the discrimination between relevant and irrelevant moments. Specifically, MQVTG maintains a learnable moment codebook, where each video moment matches a codeword. Considering the visual diversity, i.e., various visual expressions for the same moment, MQVTG treats moment-codeword matching as a clustering process without using discrete vectors, avoiding the loss of useful information from direct hard quantization. Additionally, we employ effective prior-initialization and joint-projection strategies to enhance the maintained moment codebook. With its simple implementation, the proposed method can be integrated into existing temporal grounding models as a plug-and-play component. Extensive experiments on six popular benchmarks demonstrate the effectiveness and generalizability of MQVTG, significantly outperforming state-of-the-art methods. Further qualitative analysis shows that our method effectively groups relevant features and separates irrelevant ones, aligning with our goal of enhancing discrimination. The code …
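As a rough illustration of matching a moment to a codebook without hard quantization, the sketch below computes soft assignment weights from negative squared distances and returns the weighted mix of codewords. The softmax form, the temperature, and all names are assumptions made for illustration; the abstract does not specify how MQVTG performs its clustering-style matching.

```python
import math


def soft_assign(feature, codebook, temperature=1.0):
    """Return softmax weights over codewords and the resulting soft-quantized feature."""
    logits = [
        -sum((f - c) ** 2 for f, c in zip(feature, code)) / temperature
        for code in codebook
    ]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted combination of codewords instead of picking a single one.
    mixed = [
        sum(w * code[d] for w, code in zip(weights, codebook))
        for d in range(len(feature))
    ]
    return weights, mixed


# Hypothetical 2-D moment feature and three codewords.
codebook = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
weights, mixed = soft_assign([0.9, 0.1], codebook, temperature=0.1)
print(weights, mixed)
```

A lower temperature pushes the weights toward a one-hot assignment, while a higher one keeps more information from neighbouring codewords.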
Poster
Yongkun Du · Zhineng Chen · Hongtao Xie · Caiyan Jia · Yu-Gang Jiang
[ Exhibit Hall I ]
Abstract
Connectionist temporal classification (CTC)-based scene text recognition (STR) methods, e.g., SVTR, are widely employed in OCR applications, mainly due to their simple architecture, which only contains a visual model and a CTC-aligned linear classifier, and therefore fast inference. However, they generally exhibit worse accuracy than encoder-decoder-based methods (EDTRs) because they struggle with text irregularity and lack linguistic context. To address these challenges, we propose SVTRv2, a CTC model endowed with the ability to handle text irregularities and model linguistic context. First, a multi-size resizing strategy is proposed to resize text instances to appropriate predefined sizes, effectively avoiding severe text distortion. Meanwhile, we introduce a feature rearrangement module to ensure that visual features accommodate the requirement of CTC, thus alleviating the alignment puzzle. Second, we propose a semantic guidance module. It integrates linguistic context into the visual features, allowing the CTC model to leverage language information for improved accuracy. Moreover, this module can be omitted at the inference stage and would not increase the time cost. We extensively evaluate SVTRv2 in both standard and recent challenging benchmarks, where SVTRv2 is fairly compared to mainstream STR models across multiple scenarios, including different types of text irregularity, languages, long text, and whether pretraining is employed. The …
Poster
Yuan Gao · Sangwook Kim · Jianzhong You · Chris Mcintosh
[ Exhibit Hall I ]
Abstract
Medical decision-making requires integrating diverse medical information, from imaging to clinical narratives. These medical modalities are often acquired in a many-to-many manner. However, current medical vision-language pretraining models (Med-VLPMs) fail to directly account for this many-to-many mapping in their model training and embeddings. To address this, we present Probabilistic Modality-Enhanced Diagnosis (ProbMED), a multi-modal Med-VLPM that employs probabilistic contrastive learning to model distributions over embeddings rather than fixed-point, deterministic estimates. ProbMED aligns four distinct modalities (chest X-rays, electrocardiograms, echocardiograms, and clinical text) into a unified probabilistic embedding space. Our framework uses an InfoNCE objective with a probabilistic distance metric (Hellinger distance) to integrate inter-modality distributions. To improve intra-modality binding, we introduce a synthetic sampling loss powered by probabilistic embeddings to capture modality-specific mean and variance. Extensive experiments across 13 medical datasets demonstrate that our model outperforms state-of-the-art Med-VLPMs in cross-modality retrieval, zero-shot and few-shot classification. We also show the robust integration of multiple modalities for prognostication, demonstrating the improved intra- and inter-modality binding of multimodal medical data embeddings. The anonymized code can be found at https://anonymous.4open.science/r/probMED-8564.
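For readers unfamiliar with the Hellinger distance mentioned above, the sketch below evaluates its closed form for two Gaussians with diagonal covariance. Treating each embedding as a diagonal Gaussian given by a mean and a variance vector is an assumption for illustration; the abstract does not state the parameterization, and the function name is hypothetical.

```python
import math


def hellinger_diag_gaussians(mu1, var1, mu2, var2):
    """Hellinger distance between two diagonal-covariance Gaussians (value in [0, 1])."""
    log_bc = 0.0  # log of the Bhattacharyya coefficient, summed over dimensions
    for m1, v1, m2, v2 in zip(mu1, var1, mu2, var2):
        s = v1 + v2
        log_bc += 0.5 * math.log(2.0 * math.sqrt(v1 * v2) / s)
        log_bc -= (m1 - m2) ** 2 / (4.0 * s)
    return math.sqrt(max(0.0, 1.0 - math.exp(log_bc)))


# Identical distributions are at distance 0; far-apart ones approach 1.
d_same = hellinger_diag_gaussians([0.0, 1.0], [1.0, 4.0], [0.0, 1.0], [1.0, 4.0])
d_near = hellinger_diag_gaussians([0.0], [1.0], [1.0], [1.0])
d_far = hellinger_diag_gaussians([0.0], [1.0], [100.0], [1.0])
print(d_same, d_near, d_far)
```

Because the distance is bounded and symmetric, its negative can serve as a similarity score inside a contrastive objective such as InfoNCE, which is one plausible way the two pieces fit together.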
Poster
Zeyuan Yang · Delin Chen · Xueyang Yu · Maohao Shen · Chuang Gan
[ Exhibit Hall I ]
Abstract
Long video understanding poses unique challenges due to its temporal complexity and low information density. Recent works address this task by sampling numerous frames or incorporating auxiliary tools using LLMs, both of which result in high computational costs. In this work, we introduce a curiosity-driven video agent with self-exploration capability, dubbed "VCA". Built upon VLMs, VCA autonomously navigates video segments and efficiently builds a comprehensive understanding of complex video sequences. Instead of directly sampling frames, VCA employs a tree-search structure to explore video segments and collect frames. Rather than relying on external feedback or reward, VCA leverages the VLM's self-generated intrinsic reward to guide its exploration, enabling it to capture the most crucial information for reasoning. Experimental results on multiple long video benchmarks demonstrate our approach’s superior effectiveness and efficiency.
Poster
Yiwu Zhong · Zhuoming Liu · Yin Li · Liwei Wang
[ Exhibit Hall I ]
Abstract
Large language models (LLMs) have enabled the creation of multi-modal LLMs that exhibit strong comprehension of visual data such as images and videos. However, these models usually rely on extensive visual tokens from visual encoders, leading to high computational demands, which limits their applicability in resource-constrained environments and for long-context tasks. In this work, we propose a training-free adaptive inference method for multi-modal LLMs that can accommodate a broad range of efficiency requirements with a minimal performance drop. Our method consists of a) iterative token merging based on embedding similarity before LLMs, and b) progressive token pruning within LLM layers based on multi-modal importance. With a minimalist design, our method can be applied to both video and image LLMs. Extensive experiments on diverse video and image benchmarks demonstrate that our method substantially reduces computation load (e.g., a 7-fold reduction in FLOPs) while preserving the performance of video and image LLMs. Further, under a similar computational cost, our method outperforms the state-of-the-art methods in long video understanding (e.g., +4.6 on MLVU). Additionally, our in-depth analysis provides insights into token redundancy and LLM layer behaviors, offering guidance for future research in designing efficient multi-modal LLMs.
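Step a) above, merging tokens by embedding similarity, can be sketched as follows. This greedy variant averages runs of consecutive tokens whose cosine similarity to the running group mean stays above a threshold; the grouping rule, the threshold, and the function names are assumptions for illustration rather than the authors' procedure, and embeddings are assumed to be non-zero vectors.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_vector(group):
    return [sum(col) / len(group) for col in zip(*group)]


def merge_similar_tokens(tokens, threshold):
    """Greedily average consecutive tokens that stay similar to the current group."""
    if not tokens:
        return []
    merged = []
    group = [tokens[0]]
    for tok in tokens[1:]:
        if cosine(mean_vector(group), tok) >= threshold:
            group.append(tok)  # similar enough: absorb into the current group
        else:
            merged.append(mean_vector(group))  # close the group, start a new one
            group = [tok]
    merged.append(mean_vector(group))
    return merged


# Hypothetical 2-D token embeddings: two redundant pairs and one singleton.
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [1.0, 0.0]]
merged = merge_similar_tokens(tokens, threshold=0.9)
print(len(tokens), "->", len(merged))
```

Raising the threshold merges fewer tokens and keeps more detail, so a single knob of this kind is one way to trade accuracy for compute at inference time.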
Poster
Cong Wei · Yujie Zhong · yingsen zeng · Haoxian Tan · Yong Liu · Hongfa Wang · Yujiu Yang
[ Exhibit Hall I ]
Abstract
Boosted by Multi-modal Large Language Models (MLLMs), text-guided universal segmentation models for the image and video domains have made rapid progress recently. However, these methods are often developed separately for specific domains, overlooking the similarities in task settings and solutions across these two areas. In this paper, we define the union of referring segmentation and reasoning segmentation at both the image and video levels as Instructed Visual Segmentation (IVS). Correspondingly, we propose InstructSeg, an end-to-end segmentation pipeline equipped with MLLMs for IVS. Specifically, we employ an object-aware video perceiver to extract temporal and object information from reference frames, facilitating comprehensive video understanding. Additionally, we introduce vision-guided multi-granularity text fusion to better integrate global and detailed text information with fine-grained visual guidance. By leveraging multi-task and end-to-end training, InstructSeg demonstrates superior performance across diverse image and video segmentation tasks, surpassing both segmentation specialists and MLLM-based methods with a single model.
Poster
Cihang Peng · Qiming HOU · Zhong Ren · Kun Zhou
[ Exhibit Hall I ]
Abstract
We present ROVI, a high-quality synthetic dataset for instance-grounded text-to-image generation, created by labeling 1M curated web images. Our key innovation is a strategy called re-captioning, focusing on the pre-detection stage, where a VLM (Vision-Language Model) generates comprehensive visual descriptions that are then processed by an LLM (Large Language Model) to extract a flat list of potential categories for OVDs (Open-Vocabulary Detectors) to detect. This approach yields a global prompt inherently linked to instance annotations while capturing secondary visual elements humans typically overlook. Evaluations show that ROVI exceeds existing detection datasets in image quality and resolution while containing two orders of magnitude more categories with an open-vocabulary nature. For demonstrative purposes, a GLIGEN model trained on ROVI significantly outperforms state-of-the-art alternatives in instance grounding accuracy, prompt fidelity, and aesthetic quality. We will release our dataset and reproducible pipeline to facilitate future research.
Poster
Yogesh Kumar · Uday Agarwal · Manish Gupta · Anand Mishra
[ Exhibit Hall I ]
Abstract
Video-to-video moment retrieval (Vid2VidMR) is the task of localizing unseen events or moments in a target video using a query video. This task poses several challenges, such as the need for semantic frame-level alignment and modeling complex dependencies between query and target videos. To tackle this challenging problem, we introduce MATR (Moment Alignment TRansformer), a transformer-based model designed to capture semantic context as well as the temporal details necessary for precise moment localization. MATR conditions target video representations on query video features using dual-stage sequence alignment that encodes the required correlations and dependencies. These representations are then used to guide foreground/background classification and boundary prediction heads, enabling the model to accurately identify moments in the target video that semantically match with the query video. Additionally, to provide a strong task-specific initialization for MATR, we propose a self-supervised pre-training technique that involves training the model to localize random clips within videos. Extensive experiments demonstrate that MATR achieves notable performance improvements of 13.1% in R@1 and 8.1% in mIoU on an absolute scale compared to state-of-the-art methods on the popular ActivityNet-VRL dataset. Additionally, on our newly proposed dataset, SportsMoments, MATR shows a 14.7% gain in R@1 and a 14.4% gain in mIoU …
Poster
Heeji Yoon · Heeseong Shin · Eunbeen Hong · Hyunwook Choi · Hansang Cho · Daun Jeong · Seungryong Kim
[ Exhibit Hall I ]
Abstract
Semi-supervised instance segmentation poses challenges due to limited labeled data, causing difficulties in accurately localizing distinct object instances. Current teacher-student frameworks still suffer from performance constraints due to unreliable pseudo-label quality stemming from limited labeled data. While the Segment Anything Model (SAM) offers robust segmentation capabilities at various granularities, directly applying SAM introduces challenges such as class-agnostic predictions and potential over-segmentation. To address these complexities, we carefully integrate SAM into the semi-supervised instance segmentation framework, developing a novel distillation method that effectively captures the precise localization capabilities of SAM without compromising semantic recognition. Furthermore, we incorporate pseudo-label refinement as well as a specialized data augmentation with the refined pseudo-labels, resulting in superior performance. We establish state-of-the-art performance, and provide comprehensive experiments and ablation studies to validate the effectiveness of our proposed approach.
Poster
Boyu Chen · Zhengrong Yue · Siran Chen · Zikang Wang · Yang Liu · Peng Li · Yali Wang
[ Exhibit Hall I ]
Abstract
Existing Multimodal Large Language Models (MLLMs) encounter significant challenges in modeling the temporal context within long videos. Currently, mainstream agent-based methods use external tools (e.g., search engines, memory banks, OCR, retrieval models) to assist a single MLLM in answering long video questions. Despite such tool-based support, a solitary MLLM still offers only a partial understanding of long videos, resulting in limited performance. In order to better address long video tasks, we introduce LVAgent, the first framework enabling multi-round dynamic collaboration of MLLM agents in long video understanding. Our methodology consists of four key steps: 1) Selection: We pre-select appropriate agents from the model library to form optimal agent teams based on different tasks. 2) Perception: We design an effective retrieval scheme for long videos, improving the coverage of critical temporal segments while maintaining computational efficiency. 3) Action: Agents answer long video-related questions and exchange reasons. 4) Reflection: We evaluate each agent's performance in each round of discussion and optimize the agent team for dynamic collaboration. The agents iteratively refine their answers through this multi-round dynamic collaboration. LVAgent is the first agent-based method that outperforms all closed-source models (including GPT-4o) and open-source models (including InternVL-2.5 and Qwen2-VL) in …
Poster
Wei Suo · Ji Ma · Mengyang Sun · Lin Wu · PENG WANG · Yanning Zhang
[ Exhibit Hall I ]
Abstract
Although Large Vision-Language Models (LVLMs) have achieved impressive results, their high computational costs pose a significant barrier to wide application. To enhance inference efficiency, most existing approaches can be categorized as parameter-dependent or token-dependent strategies to reduce computational demands. However, parameter-dependent methods require retraining LVLMs to recover performance while token-dependent strategies struggle to consistently select the most relevant tokens. In this paper, we systematically analyze the above challenges and provide a series of valuable insights for inference acceleration. Based on these findings, we propose a novel framework, the Pruning All-Rounder (PAR). Unlike previous works, PAR develops a meta-router to adaptively organize pruning flows across both tokens and layers. Using self-supervised learning, our method achieves a superior balance between performance and efficiency. Notably, PAR is highly flexible, offering multiple pruning versions to address a range of pruning scenarios. The code for this work will be made publicly available.
Poster
Tan Pan · Zhaorui Tan · Kaiyu Guo · Dongli Xu · Weidi Xu · Chen Jiang · Xin Guo · Yuan Qi · Yuan Cheng
[ Exhibit Hall I ]
Abstract
3D medical image self-supervised learning (mSSL) holds great promise for medical analysis. Effectively supporting broader applications requires considering anatomical structure variations in location, scale, and morphology, which are crucial for capturing meaningful distinctions. However, previous mSSL methods partition images with fixed-size patches, often ignoring the structure variations. In this work, we introduce a novel perspective on 3D medical images with the goal of learning structure-aware representations. We assume that patches within the same structure share the same semantics (semantic consistency) while those from different structures exhibit distinct semantics (semantic discrepancy). Based on this assumption, we propose an mSSL framework named $S^2DC$, achieving Structure-aware Semantic Discrepancy and Consistency in two steps. First, $S^2DC$ enforces distinct representations for different patches to increase semantic discrepancy by leveraging an optimal transport strategy. Second, $S^2DC$ advances semantic consistency at the structural level based on neighborhood similarity distribution. By bridging patch-level and structure-level representations, $S^2DC$ achieves structure-aware representations. Thoroughly evaluated across 10 datasets, 4 tasks, and 3 modalities, our proposed method consistently outperforms the state-of-the-art methods in mSSL.
Poster
Ziling Wu · Armaghan Moemeni · Praminda Caleb-Solly
[ Exhibit Hall I ]
Abstract
Unsupervised object discovery (UOD) aims to detect and segment objects in 2D images without handcrafted annotations. Recent progress in self-supervised representation learning has led to some success in UOD algorithms. However, the absence of ground truth presents existing UOD methods with two challenges: 1) determining if a discovered region is foreground or background, and 2) knowing how many objects remain undiscovered. To address these two problems, previous solutions rely on foreground priors to distinguish if the discovered region is foreground, and conduct one or a fixed number of iterations of discovery. However, the existing foreground priors are heuristic and not always robust, and a fixed number of discoveries leads to under- or over-segmentation, since the number of objects in images varies. This paper introduces UnionCut, a robust foreground prior based on ensemble methods that detects the union of foreground areas of an image, allowing UOD algorithms to identify foreground objects and stop discovery once the majority of the foreground union in the image is segmented. On top of that, we propose UnionSeg, a vision transformer distilled from UnionCut that outputs the foreground union faster and more accurately. Our experiments show that by combining with UnionCut or UnionSeg, previous state-of-the-art UOD methods witness an …
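The two uses of a foreground union described above (filtering background regions and deciding when to stop) can be sketched as a simple loop. Masks are represented as sets of pixel ids, and the 50% thresholds for both the foreground test and the stopping rule are assumptions made for illustration; the names are hypothetical and this is not the UnionCut algorithm itself.

```python
def discover_objects(candidate_masks, foreground_union, coverage=0.5):
    """Accept foreground candidates until most of the foreground union is covered."""
    accepted = []
    covered = set()
    for mask in candidate_masks:
        # Stop once the accepted masks cover the majority of the union.
        if len(covered) > coverage * len(foreground_union):
            break
        overlap = mask & foreground_union
        # Treat a candidate as background if at most half of it lies in the union.
        if 2 * len(overlap) <= len(mask):
            continue
        accepted.append(mask)
        covered |= overlap
    return accepted


# Hypothetical image: pixels 0..15 form the foreground union.
union = set(range(16))
candidates = [
    set(range(0, 6)),        # foreground object
    {100, 101, 102, 103},    # background region, rejected
    set(range(6, 10)),       # foreground object
    set(range(10, 14)),      # never examined: coverage already exceeds 50%
]
accepted = discover_objects(candidates, union)
print(len(accepted))
```

The loop adapts the number of discoveries to the image content instead of using a fixed iteration count, which is the behaviour the abstract argues for.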
Poster
Ruchit Rawal · Reza Shirkavand · Heng Huang · Gowthami Somepalli · Tom Goldstein
[ Exhibit Hall I ]
Abstract
Video large language models (VideoLLMs) have not yet been widely deployed, largely due to their tendency to hallucinate. Typical benchmarks for VideoLLMs rely simply on multiple-choice questions. Unfortunately, it has been observed that VideoLLMs hallucinate far more aggressively on freeform text generation tasks like video captioning than they do on multiple-choice verification tasks. To address this weakness, we propose ARGUS, a VideoLLM benchmark that measures freeform video captioning performance. By comparing VideoLLM outputs to human ground truth captions, ARGUS quantifies dual metrics. First, we measure the rate of hallucinations in the form of incorrect statements about video content or temporal relationships. Second, we measure the rate at which the model omits important descriptive details. Together, these dual metrics form a comprehensive view of video captioning performance.
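The dual metrics above can be illustrated with a toy computation. The sketch assumes captions have already been split into atomic statements and that a statement counts as supported only on an exact match with a reference statement; a real benchmark would need a far more tolerant matching step, and nothing here reflects how ARGUS actually scores captions. All names and example statements are hypothetical.

```python
def caption_metrics(model_statements, reference_statements):
    """Return (hallucination_rate, omission_rate) under exact statement matching."""
    model = set(model_statements)
    reference = set(reference_statements)
    # Hallucination: share of model statements with no support in the reference.
    hallucination = len(model - reference) / len(model) if model else 0.0
    # Omission: share of reference statements the model never mentions.
    omission = len(reference - model) / len(reference) if reference else 0.0
    return hallucination, omission


model_out = ["a dog runs", "a man waves", "a car stops"]
reference = ["a dog runs", "a man waves", "a ball bounces", "the sun sets"]
hallucination_rate, omission_rate = caption_metrics(model_out, reference)
print(hallucination_rate, omission_rate)
```

Reporting both numbers matters because a model can lower one at the expense of the other, for example by writing very short captions that rarely hallucinate but omit most details.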
Poster
Shuo Jin · Siyue Yu · Bingfeng Zhang · Mingjie Sun · Yi Dong · Jimin XIAO
[ Exhibit Hall I ]
Abstract
Training-free open-vocabulary semantic segmentation has advanced with vision-language models like CLIP, which exhibit strong zero-shot abilities. However, CLIP's attention mechanism often wrongly emphasises specific image tokens, namely outliers, which results in irrelevant over-activation. Existing approaches struggle with these outliers that arise in intermediate layers and propagate through the model, ultimately degrading spatial perception. In this paper, we propose a Self-adaptive Feature Purifier framework (SFP) to suppress propagated outliers and enhance semantic representations for open-vocabulary semantic segmentation. Specifically, based on an in-depth analysis of attention responses between image and class tokens, we design a self-adaptive outlier mitigator to detect and mitigate outliers at each layer for propagated feature purification. In addition, we introduce a semantic-aware attention enhancer to augment attention intensity in semantically relevant regions, which strengthens the purified feature to focus on objects. Further, we introduce a hierarchical attention integrator to aggregate multi-layer attention maps to refine spatially coherent feature representations for final segmentation. Our proposed SFP enables robust outlier suppression and object-centric feature representation, leading to a more precise segmentation. Extensive experiments show that our method achieves state-of-the-art performance and surpasses existing methods by an average of 4.6% mIoU on eight segmentation benchmarks. The code will be released.
Poster
Giyeol Kim · Sooyoung Yang · Jihyong Oh · Myungjoo Kang · Chanho Eom
[ Exhibit Hall I ]
Abstract
Person search aims to jointly perform person detection and re-identification by localizing and identifying a query person within a gallery of uncropped scene images. Existing methods predominantly utilize ImageNet pre-trained backbones, which may be less effective at capturing the contextual and fine-grained features crucial for person search. Moreover, they rely on a shared backbone feature for both person detection and re-identification, leading to suboptimal features due to conflicting optimization objectives. Recently, diffusion models have emerged as powerful vision backbones, capturing rich visual priors from large-scale datasets. In this paper, we propose DiffPS (Diffusion Prior Knowledge for Person Search), a novel framework that leverages a frozen pre-trained diffusion model while eliminating the optimization conflict between the two sub-tasks. We analyze key properties of diffusion priors and propose three specialized modules: (i) Diffusion-Guided Region Proposal Network (DGRPN) for enhanced person localization, (ii) Multi-Scale Frequency Refinement Network (MSFRN) to mitigate shape bias, and (iii) Semantic-Adaptive Feature Aggregation Network (SFAN) to leverage text-aligned diffusion features. DiffPS sets a new state-of-the-art on CUHK-SYSU and PRW. Our code will be available online at the time of publication.
Poster
Yuhan Liu · Jingwen Fu · Yang Wu · Kangyi Wu · Pengna Li · Jiayi Wu · Sanping Zhou · Jingmin Xin
[ Exhibit Hall I ]
Abstract
Leveraging vision foundation models has emerged as a mainstream paradigm for improving the performance of image feature matching. However, previous works have ignored the misalignment that arises when foundation models are introduced into feature matching. The misalignment stems from the discrepancy between foundation models, which focus on single-image understanding, and the cross-image understanding that feature matching requires. Specifically, 1) the embeddings derived from commonly used foundation models deviate from the optimal embeddings required for feature matching; and 2) there is no effective mechanism for transferring single-image understanding ability to cross-image understanding. A significant consequence of the misalignment is that these methods struggle with multi-instance feature matching problems. To address this, we introduce a simple but effective framework called IMD (Image feature Matching with a pre-trained Diffusion model), which has two parts: 1) Unlike the dominant solutions, which employ contrastive-learning-based foundation models that emphasize global semantics, we integrate generative diffusion models to effectively capture instance-level details. 2) We leverage the prompt mechanism in generative models as a natural tunnel and propose a novel cross-image interaction prompting module to facilitate bidirectional information interaction between image pairs. To more accurately measure the misalignment, we propose a new benchmark called IMIM, which focuses on multi-instance scenarios. Our …
Poster
Sanghyun Jo · Seo Lee · Seungwoo Lee · Seohyung Hong · Hyungseok Seo · Kyungsu Kim
[ Exhibit Hall I ]
Abstract
Cell instance segmentation (CIS) is crucial for identifying individual cell morphologies in histopathological images, providing valuable insights for biological and medical research. While unsupervised CIS (UCIS) models aim to reduce the heavy reliance on labor-intensive image annotations, they fail to accurately capture cell boundaries, causing missed detections and poor performance. Recognizing the absence of error-free instances as a key limitation, we present COIN (COnfidence score-guided INstance distillation), a novel annotation-free framework with three key steps: (1) Increasing the sensitivity for the presence of error-free instances via unsupervised semantic segmentation with optimal transport, leveraging its ability to discriminate spatially minor instances, (2) Instance-level confidence scoring to measure the consistency between model prediction and refined mask and identify highly confident instances, offering an alternative to ground truth annotations, and (3) Progressive expansion of confidence with recursive self-distillation. Extensive experiments across six datasets show COIN outperforming existing UCIS methods, even surpassing semi- and weakly-supervised approaches across all metrics on the MoNuSeg and TNBC datasets. The code will be made available upon publication.
Poster
Walid Bousselham · Angie Boggust · Sofian Chaybouti · Hendrik Strobelt · Hilde Kuehne
[ Exhibit Hall I ]
Abstract
Vision Transformers (ViTs) have become a standard architecture in computer vision. However, because of their modeling of long-range dependencies through self-attention mechanisms, the explainability of these models remains a challenge. To address this, we propose LeGrad, an explainability method specifically designed for ViTs. LeGrad computes the gradient with respect to the attention maps of single ViT layers, considering the gradient itself as the explainability signal. We aggregate the signal over all layers, combining the activations of the last as well as intermediate tokens to produce the merged explainability map. This makes LeGrad a conceptually simple and easy-to-implement method to enhance the transparency of ViTs. We evaluate LeGrad in various setups, including segmentation, perturbation, and open-vocabulary settings, showcasing its improved spatial fidelity and its versatility compared to other SotA explainability methods. Code will be released.
Poster
Zhengyin Liang · Hui Yin · Min Liang · Qianqian Du · Ying Yang · Hua Huang
[ Exhibit Hall I ]
Abstract
Modality or domain distribution shifts pose formidable challenges in 3D semantic segmentation. Existing methods predominantly address either cross-modal or cross-domain adaptation in isolation, leading to insufficient exploration of semantic associations and complementary features in heterogeneous data. To bridge this gap, we present UniDxMD, a unified representation method for cross-modal unsupervised domain adaptation (UDA) in 3D semantic segmentation that simultaneously tackles both cross-modal and cross-domain adaptation objectives. Our core insight is deriving a unified discrete representation from heterogeneous data to mitigate distribution shifts, inspired by vector quantization. Specifically, we propose a differentiable, cluster-based soft quantization mechanism (CSQM) that maps heterogeneous data (spanning modalities and domains) into a shared discrete latent space. Then, we introduce latent space regularization (LSR), leveraging joint prototypes that satisfy semantic relational consistency as learnable anchors to enhance the compactness and semantic discriminability of the discrete latent space. Our method paves the way for advancing cross-modal UDA in 3D semantic segmentation towards the unified representation. Extensive results across four challenging cross-modal UDA scenarios demonstrate the superiority of our method, achieving state-of-the-art performance on multiple benchmarks. Code will be available publicly.
Poster
Rui Sun · Huayu Mai · Wangkai Li · Yujia Chen · Yuan Wang
[ Exhibit Hall I ]
Abstract
Semi-supervised semantic segmentation has attracted considerable attention as it alleviates the need for extensive pixel-level annotations. However, existing methods often overlook the potential optimization conflict between supervised and unsupervised learning objectives, leading to suboptimal performance. In this paper, we identify this under-explored issue and propose a novel Pareto Optimization Strategy (POS) to tackle it. POS aims to find a descent gradient direction that benefits both learning objectives, thereby facilitating model training. By dynamically assigning weights to the gradients at each iteration based on the model's learning status, POS effectively reconciles the intrinsic tension between the two objectives. Furthermore, we analyze POS from the perspective of gradient descent in random batch sampling and propose the Magnitude Enhancement Operation (MEO) to further unleash its potential by considering both direction and magnitude during gradient integration. Extensive experiments on challenging benchmarks demonstrate that integrating POS into existing semi-supervised segmentation methods yields consistent improvements across different data splits and architectures (CNN, Transformer), showcasing its effectiveness.
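The idea of a descent direction that benefits both objectives can be illustrated with the standard two-gradient min-norm weighting from multi-objective optimization. This is a minimal sketch under stated assumptions, not the POS algorithm itself: the abstract does not specify how POS computes its weights, and the function names `min_norm_weight` and `common_descent_direction` are hypothetical. Gradients are plain Python lists for illustration.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def min_norm_weight(g_sup, g_unsup):
    """Weight alpha in [0, 1] minimising ||alpha*g_sup + (1-alpha)*g_unsup||^2.

    Assumption: this closed form is the generic two-objective min-norm
    solution, used here as a stand-in for the weighting described in the text.
    """
    diff = [u - s for s, u in zip(g_sup, g_unsup)]  # g_unsup - g_sup
    denom = dot(diff, diff)
    if denom == 0.0:
        return 0.5  # identical gradients: every weight gives the same direction
    alpha = dot(diff, g_unsup) / denom
    return min(1.0, max(0.0, alpha))  # clip to the valid range


def common_descent_direction(g_sup, g_unsup):
    """Return (alpha, d) where d = alpha*g_sup + (1-alpha)*g_unsup."""
    alpha = min_norm_weight(g_sup, g_unsup)
    d = [alpha * s + (1.0 - alpha) * u for s, u in zip(g_sup, g_unsup)]
    return alpha, d
```

When the combined vector is non-zero, it has a non-negative inner product with both input gradients, so a small step along its negative does not increase either objective to first order.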
Poster
Xiaoling Hu · Xiangrui Zeng · Oula Puonti · Juan Iglesias · Bruce Fischl · Yaël Balbastre
[ Exhibit Hall I ]
Abstract
Domain randomization through synthesis is a powerful strategy to train networks that are unbiased with respect to the domain of the input images. Randomization allows networks to see a virtually infinite range of intensities and artifacts during training, thereby minimizing overfitting to appearance and maximizing generalization to unseen data. Although powerful, this approach relies on the accurate tuning of a large set of hyperparameters that govern the probabilistic distribution of the synthesized images. Instead of manually tuning these parameters, we introduce Learn2Synth, a novel procedure in which synthesis parameters are learned using a small set of real labeled data. Unlike methods that impose constraints to align synthetic data with real data (e.g., contrastive or adversarial techniques), which risk misaligning the image and its label map, we tune an augmentation engine such that a segmentation network trained on synthetic data has optimal accuracy when applied to real data. This approach allows the training procedure to benefit from real labeled examples, without ever using these real examples to train the segmentation network, which avoids biasing the network towards the properties of the training set. Specifically, we develop parametric and nonparametric strategies to enhance synthetic images in a way that improves the performance …
Poster
Chunxiao Li · Xiaoxiao Wang · Meiling Li · Boming Miao · Peng Sun · Yunjian Zhang · Xiangyang Ji · Yao Zhu
[ Exhibit Hall I ]
Abstract
With the rapid advancement of generative models, highly realistic image synthesis has posed new challenges to digital security and media credibility. Although AI-generated image detection methods have partially addressed these concerns, a substantial research gap remains in evaluating their performance under complex real-world conditions. This paper introduces the Real-World Robustness Dataset (RRDataset) for comprehensive evaluation of detection models across three dimensions: 1) Scenario Generalization – RRDataset encompasses high-quality images from seven major scenarios (War & Conflict, Disasters & Accidents, Political & Social Events, Medical & Public Health, Culture & Religion, Labor & Production, and everyday life), addressing existing dataset gaps from a content perspective. 2) Internet Transmission Robustness – examining detector performance on images that have undergone multiple rounds of sharing across various social media platforms. 3) Re-digitization Robustness – assessing model effectiveness on images altered through four distinct re-digitization methods. We benchmarked 17 detectors and 10 vision-language models (VLMs) on RRDataset and conducted a large-scale human study involving 192 participants to investigate human few-shot learning capabilities in detecting AI-generated images. The benchmarking results reveal the limitations of current AI detection methods under real-world conditions and underscore the importance of drawing on human adaptability to develop more robust detection algorithms. Our dataset …
Poster
Soonwoo Cha · Jiwoo Song · Juan Yeo · Hyunbin Jin · Taesup Kim
[ Exhibit Hall I ]
Abstract
Vision-language models such as CLIP have recently propelled open-vocabulary dense prediction tasks by enabling recognition of a broad range of visual concepts. However, CLIP still struggles with fine-grained, region-level understanding, hindering its effectiveness on these dense prediction tasks. We identify two pivotal factors required to address this limitation: semantic coherence and fine-grained vision-language alignment. Current adaptation methods often improve fine-grained alignment at the expense of semantic coherence, and often rely on extra modules or supervised fine-tuning. To overcome these issues, we propose Any-to-Any Self-Distillation (ATAS), a novel approach that simultaneously enhances semantic coherence and fine-grained alignment by leveraging a model’s own knowledge across all representation levels. Unlike prior methods, ATAS uses only unlabeled images and an internal self-distillation process to refine CLIP’s representations, preserving local semantic consistency while sharpening local detail recognition. On open-vocabulary object detection and semantic segmentation benchmarks, ATAS achieves substantial performance gains, outperforming baseline CLIP models. These results validate the effectiveness of our approach and underscore the importance of jointly maintaining semantic coherence and fine-grained alignment for advanced open-vocabulary dense prediction.
Poster
Yicheng Feng · Yijiang Li · Wanpeng Zhang · Sipeng Zheng · Hao Luo · Zihao Yue · Zongqing Lu
[ Exhibit Hall I ]
Abstract
We present VideoOrion, a Video Large Language Model (Video-LLM) that explicitly captures the key semantic information in videos—the spatial-temporal dynamics of objects throughout the videos. VideoOrion employs expert vision models to extract object dynamics through a detect-segment-track pipeline, encoding them into a set of object tokens by aggregating spatial-temporal object features. Our method addresses the persistent challenge in Video-LLMs of efficiently compressing high-dimensional video data into semantic tokens that are comprehensible to LLMs. Compared to prior methods which resort to downsampling the original video or aggregating visual tokens using resamplers, leading to information loss and entangled semantics, VideoOrion not only offers a more natural and efficient way to derive compact, disentangled semantic representations but also enables explicit object modeling of video content with minimal computational cost. Moreover, the introduced object tokens naturally allow VideoOrion to accomplish video-based referring tasks. Experimental results show that VideoOrion can learn to make good use of the object tokens, and achieves competitive results on both general video question answering and video-based referring benchmarks.
Poster
Juntao Chen · Wen Shen · Zhihua Wei · Lijun Sun · Hongyun Zhang
[ Exhibit Hall I ]
Abstract
Zero-shot Referring Expression Comprehension (REC) aims at locating an object described by a natural language query without training on task-specific datasets. Current approaches often utilize Vision-Language Models (VLMs) to perform region-text matching based on region proposals. However, this may degrade their performance, since VLMs often fail at relation understanding and isolated proposals inevitably lack global image context. To tackle these challenges, we first design a general formulation for code-based relation reasoning. It instructs Large Language Models (LLMs) to decompose complex relations and adaptively implement code for spatial and relation computation. Moreover, we directly extract region-text relevance from cross-modal attention maps in VLMs. Observing the inherent bias in VLMs, we further develop a simple yet effective bias deduction method, which enhances the capability of attention maps to align text with the corresponding regions. Experimental results on four representative datasets demonstrate the SOTA performance of our method. On the RefCOCO dataset centered on spatial understanding, our method achieves an average improvement of 10% over the previous zero-shot SOTA. Code will be released once our paper is accepted.
Poster
Hongchi Ma · Guanglei Yang · Debin Zhao · Yanli JI · Wangmeng Zuo
[ Exhibit Hall I ]
Abstract
Industrial visual inspection is crucial for detecting defects in manufactured products, but it traditionally relies on human operators, leading to inefficiencies. Industrial Visual Anomaly Detection (IVAD) has emerged as a promising solution, with methods such as zero-shot, few-shot, and reconstruction-based techniques. However, zero-shot methods struggle with subtle anomalies, and reconstruction-based methods fail to capture fine-grained details. Few-shot methods, which use limited samples and prompts, offer a more efficient approach. Despite their promise, challenges remain in managing intra-class variation among references and in effectively extracting more representative anomaly features. This paper presents Retrieval-enhanced Multi-modal Prompt Fusion Anomaly Detection (ReMP-AD), a framework that introduces Intra-Class Token Retrieval (ICTR) to reduce noise in the memory bank and Vision-Language Prior Fusion (VLPF) to guide the encoder in capturing more distinctive and relevant features of anomalies. Experiments on the VisA and MVTec-AD datasets demonstrate that ReMP-AD outperforms existing methods, achieving 97.8%/94.1% performance in 4-shot anomaly segmentation and classification. Our approach also shows strong results on the PCB-Bank dataset, highlighting its effectiveness in few-shot industrial anomaly detection.
Poster
Omkar Thawakar · Dmitry Demidov · Ritesh Thawkar · Rao Anwer · Mubarak Shah · Fahad Khan · Salman Khan
[ Exhibit Hall I ]
Abstract
Composed video retrieval is a challenging task that strives to retrieve a target video based on a query video and a textual description detailing specific modifications. Standard retrieval frameworks typically struggle to handle the complexity of fine-grained compositional queries and variations in temporal understanding, limiting their retrieval ability in the fine-grained setting. To address this issue, we introduce a novel dataset that captures both fine-grained and composed actions across diverse video segments, enabling more detailed compositional changes in retrieved video content. The proposed dataset, named Dense-WebVid-CoVR, consists of 1.6 million samples with dense modification text that is around seven times longer than that of its existing counterpart. We further develop a new model that integrates visual and textual information through Cross-Attention (CA) fusion using a grounded text encoder, enabling precise alignment between dense query modifications and target videos. The proposed model achieves state-of-the-art results, surpassing existing methods on all metrics. Notably, it achieves 71.3% Recall@1 in the visual+text setting and outperforms the state-of-the-art by 3.4%, highlighting its efficacy in leveraging detailed video descriptions and dense modification texts. Our proposed dataset, code, and model will be publicly released.
Poster
Yang Xiao · Wang Lu · Jie Ji · Ruimeng Ye · Li · Xiaolong Ma · Bo Hui
[ Exhibit Hall I ]
Abstract
The design of artificial neural networks (ANNs) is inspired by the structure of the human brain, and in turn, ANNs offer a potential means to interpret and understand brain signals. Existing methods primarily align brain signals with real-world signals using Mean Squared Error (MSE), which focuses solely on local point-wise alignment and ignores global matching, leading to coarse interpretations and inaccuracies in brain signal decoding. In this paper, we address these issues through optimal transport (OT) and theoretically demonstrate why OT provides a more effective alignment strategy than MSE. Specifically, we construct a transport plan between brain voxel embeddings and image embeddings, enabling more precise matching. By controlling the amount of transport, we mitigate the influence of redundant information. We apply our alignment model directly to the Brain Captioning task by feeding brain signals into a large language model (LLM) instead of images. Our approach achieves state-of-the-art performance across ten evaluation metrics, surpassing the previous best method by an average of 6.11% in single-subject training and 3.81% in cross-subject training. Additionally, we have uncovered several insightful conclusions that align with existing brain research. We unveil the redundancy and synergy of brain information processing through region masking and data dimensionality reduction visualization experiments. …
Poster
Joonmyung Choi · Sanghyeok Lee · Byungoh Ko · Eunseo Kim · Jihyung Kil · Hyunwoo Kim
[ Exhibit Hall I ]
Abstract
Transformers have demonstrated remarkable success across various vision tasks, yet the quadratic complexity of self-attention remains a challenge for efficient inference. To address this, previous works such as FlashAttention optimize GPU memory access, and token compression techniques have been explored to reduce computational cost by reducing the number of tokens. However, conventional token importance measures rely on additional learnable modules or attention maps, making them impractical in training-free settings and incompatible with FlashAttention, which does not expose intermediate attention maps in order to minimize memory access. Here, we propose a novel training-free, model-agnostic token importance criterion, representation shift, which quantifies the information injected by each operation. With representation shift, we can apply token compression on top of FlashAttention to further boost inference speed without requiring additional training or attention maps. This method also extends naturally beyond Transformers, e.g., to convolutional neural networks (CNNs). Extensive experiments demonstrate that representation shift, by enabling token compression with FlashAttention and CNNs, yields up to a 5.5$\times$ speed-up in video understanding. Through quantitative and qualitative experiments, we show that representation shift is a more robust alternative to conventional attention-based scores.
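A token importance score that needs no attention maps can be sketched as the change an operation makes to each token. The code below is a minimal illustration under stated assumptions: the abstract does not say which distance is used, so the L2 norm of the input-output difference is an assumption, and the names `representation_shift` and `keep_top_tokens` are hypothetical. Tokens are lists of floats, and the "operation" is represented only by its inputs and outputs.

```python
import math


def representation_shift(tokens_in, tokens_out):
    """Per-token score: L2 norm of (output - input) for one operation.

    Assumption: L2 distance stands in for whatever measure the method uses.
    """
    return [
        math.sqrt(sum((o - i) ** 2 for i, o in zip(t_in, t_out)))
        for t_in, t_out in zip(tokens_in, tokens_out)
    ]


def keep_top_tokens(tokens_out, scores, keep):
    """Keep the `keep` tokens with the largest scores, preserving token order."""
    ranked = sorted(range(len(scores)), key=lambda idx: scores[idx], reverse=True)
    kept = sorted(ranked[:keep])
    return kept, [tokens_out[i] for i in kept]
```

Because the score only compares the tensors entering and leaving a block, it can be computed around any layer, including a fused attention kernel or a convolution, without access to that layer's internals.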
Poster
SungMin Jang · Wonjun Kim
[ Exhibit Hall I ]
Abstract
Open-vocabulary 3D semantic segmentation has been actively studied by incorporating language features into 3D scene representations. Even though many methods have shown notable improvement in this task, they still have difficulty keeping language embeddings consistent across different views. This inconsistency often results in mislabeling, where different language embeddings are assigned to the same part of an object. To address this issue, we propose a simple yet powerful method that aligns language embeddings via identity information. The key idea is to place language embeddings for the same identity close together in the latent space while pushing them apart otherwise. This approach allows the same object to have identical language embeddings in novel views with accurate semantic masks, which are well aligned with the input text. Furthermore, we propose a progressive mask expanding scheme that enables more accurate extraction of semantic mask boundaries. This scheme is very effective in preserving the boundary shape of the target region by allowing the model to consider the local relationship between segments. Experimental results on benchmark datasets demonstrate that our method delivers state-of-the-art performance in open-vocabulary 3D semantic segmentation.
Poster
Yefei He · Feng Chen · Jing Liu · Wenqi Shao · Hong Zhou · Kaipeng Zhang · Bohan Zhuang
[ Exhibit Hall I ]
Abstract
The efficiency of large vision-language models (LVLMs) is constrained by the computational bottleneck of the attention mechanism during the prefill phase and the memory bottleneck of fetching the key-value (KV) cache in the decoding phase, particularly in scenarios involving high-resolution images or videos. Visual content often exhibits substantial redundancy, resulting in highly sparse attention maps within LVLMs. This sparsity can be leveraged to accelerate attention computation or compress the KV cache through various approaches. However, most studies address only one of these bottlenecks and do not adequately support dynamic adjustment of sparsity across distinct layers or tasks. In this paper, we present ZipVL, an efficient inference framework designed for LVLMs through a dynamic ratio allocation strategy of important tokens. This ratio is adaptively determined based on the layer-specific distribution of attention scores, rather than fixed hyper-parameters, thereby improving efficiency for less complex tasks while maintaining high performance for more challenging ones. We then select important tokens based on their normalized attention scores and apply sparse attention solely to those important tokens, reducing the latency in the prefill phase. Tokens deemed less important will be discarded to reduce KV cache size, alleviating the memory bottleneck in the decoding …
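One way a keep ratio can follow the distribution of attention scores, rather than being fixed, is to keep the smallest set of tokens whose normalized scores reach a target share of the total. The sketch below is an assumption made for illustration: the abstract does not state the selection rule, and the function name `select_important_tokens` and the threshold parameter `tau` are hypothetical.

```python
def select_important_tokens(attn_scores, tau=0.95):
    """Keep the fewest tokens whose normalized scores sum to at least tau.

    Returns (sorted kept indices, keep ratio). Assumption: a cumulative-mass
    threshold is one plausible reading of "adaptively determined" ratios.
    """
    total = sum(attn_scores)
    if total <= 0:
        raise ValueError("attention scores must have a positive sum")
    # Visit tokens from most to least attended.
    order = sorted(range(len(attn_scores)), key=lambda i: attn_scores[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += attn_scores[i] / total
        if mass >= tau:
            break
    return sorted(kept), len(kept) / len(attn_scores)
```

Under this rule a layer with peaked attention keeps few tokens, while a layer with flat attention keeps most of them, so the ratio varies per layer without a hand-set value for each one.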
Poster
Xiaoqi Wang · Clint Sebastian · Wenbin He · Liu Ren
[ Exhibit Hall I ]
Abstract
Recent advancements in large foundation models have driven the success of open-set image segmentation, a task focused on segmenting objects beyond predefined categories. Among various prompt types (such as points, boxes, texts, and visual references), visual reference segmentation stands out for its unique flexibility and strong zero-shot capabilities. Recently, several SAM-based methods have made notable progress in this task by automatically generating prompts to guide SAM. However, these methods often generate prompts at object boundaries due to a suboptimal prompt encoder, which results in instability and reduced robustness. In this work, we introduce ProSAM, a simple but effective method to address the stability challenges we identified in existing SAM-based visual reference segmentation approaches. By learning a variational prompt encoder to predict multivariate prompt distributions, ProSAM avoids generating prompts that lie in unstable regions, overcoming the instability caused by less robust prompts. Our approach consistently surpasses state-of-the-art methods on the Pascal-5$^i$ and COCO-20$^i$ datasets, providing a more robust solution for visual reference segmentation.
Poster
Victor Quétu · Zhu LIAO · Nour Hezbri · Fabio Pizzati · Enzo Tartaglione
[ Exhibit Hall I ]
Abstract
Although deep neural networks are well known for their outstanding performance in tackling complex tasks, their hunger for computational resources remains a significant hurdle: it raises energy-consumption issues and restricts deployment on resource-constrained devices, which prevents their widespread adoption. In this paper, we present an optimal transport-based method to reduce the depth of over-parametrized deep neural networks, alleviating their computational burden. More specifically, we propose a new regularization strategy based on the Max-Sliced Wasserstein distance to minimize the distance between the intermediate feature distributions in the neural network. We show that minimizing this distance enables the complete removal of intermediate layers in the network, achieving a better performance/depth trade-off compared to existing techniques. We assess the effectiveness of our method on traditional image classification setups and extend it to generative image models. Both source code and models will be released upon acceptance of the article.
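The distance underlying the regularizer can be sketched as follows. A sliced Wasserstein distance projects two sets of feature vectors onto a direction and compares the resulting 1D samples; the max-sliced variant takes the direction that gives the largest value. This is an illustrative sketch with several assumptions: it uses order 1, equal sample sizes, and a finite list of candidate directions supplied by the caller, so it only lower-bounds the true supremum over all unit directions. The function names are hypothetical and not taken from the paper.

```python
import math


def wasserstein_1d(xs, ys):
    """W1 distance between two equal-size 1D samples: mean gap of sorted values."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


def project(points, direction):
    """Project each point onto the unit vector along `direction`."""
    norm = math.sqrt(sum(c * c for c in direction))
    return [sum(p_i * c / norm for p_i, c in zip(p, direction)) for p in points]


def max_sliced_w1(feats_a, feats_b, directions):
    """Largest 1D W1 distance over the given candidate directions."""
    return max(
        wasserstein_1d(project(feats_a, d), project(feats_b, d))
        for d in directions
    )
```

In the setting the abstract describes, `feats_a` and `feats_b` would be features taken before and after a block; a value near zero suggests the block changes the feature distribution very little.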
Poster
Heeseok Jung · Jun-Hyeon Bak · Yujin Jeong · Gyugeun Lee · Jinwoo Ahn · Eun-Sol Kim
[ Exhibit Hall I ]
Abstract
In this paper, we propose a novel zero-shot compositional video understanding method inspired by how young children efficiently learn new concepts and flexibly expand their existing knowledge framework. While recent large-scale visual language models (VLMs) have achieved remarkable advancements and demonstrated impressive performance improvements across various tasks, they require massive amounts of data and computational resources. Moreover, despite their high benchmark performance, they often fail to solve simple zero-shot composition tasks, and VLMs designed for video data demand even greater computational resources. We introduce a new video representation learning method inspired by human compositional learning to address these challenges. Specifically, we demonstrate that achieving zero-shot compositional learning requires effective representation learning that disentangles given data into meaningful semantic units. We propose a novel method that learns such disentangled representations based on an information-theoretic measure. By optimizing coding rate reduction, we successfully learn spatio-temporally disentangled features from videos, one of the most challenging data types. Our approach significantly enhances compositional generalizability, demonstrating its effectiveness in zero-shot learning scenarios.
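For reference, coding rate reduction as it is commonly written in the maximal coding rate reduction literature is sketched below. The abstract does not state which variant the method optimizes, so the notation is an assumption for illustration: $Z \in \mathbb{R}^{d \times n}$ is the feature matrix, $\epsilon$ is the allowed distortion, and $Z_j$ holds the $n_j$ columns of $Z$ assigned to group $j$ out of $k$ groups.

$$
R(Z, \epsilon) = \frac{1}{2} \log\det\left(I + \frac{d}{n\epsilon^{2}} Z Z^{\top}\right), \qquad
\Delta R = R(Z, \epsilon) - \sum_{j=1}^{k} \frac{n_j}{n} R(Z_j, \epsilon),
$$

where $R(Z_j, \epsilon)$ is computed with $n_j$ in place of $n$. Maximizing $\Delta R$ expands the volume spanned by all features while compressing each group, which is one way to encourage features that separate into distinct units.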
Poster
Zhen Qu · Xian Tao · Xinyi Gong · ShiChen Qu · Xiaopei Zhang · Xingang Wang · Fei Shen · Zhengtao Zhang · Mukesh Prasad · Guiguang Ding
[ Exhibit Hall I ]
Abstract
Recent vision-language models (e.g., CLIP) have demonstrated remarkable class-generalizable ability to unseen classes in few-shot anomaly segmentation (FSAS), leveraging supervised prompt learning or fine-tuning on seen classes. However, their ability to generalize across categories mainly relies on prior knowledge of real seen anomaly samples. In this paper, we propose a novel framework, namely DictAS, which enables a unified model to detect visual anomalies in unseen object categories without any retraining on the target data, only employing a few normal reference images as visual prompts. The insight behind DictAS is to transfer dictionary lookup capabilities to the FSAS task for unseen classes via self-supervised learning, instead of merely memorizing normal and abnormal feature patterns from the training set. Specifically, DictAS mainly consists of three components: (1) **Dictionary Construction** - to simulate the index and content of a real dictionary by building it with normal reference image features. (2) **Dictionary Lookup** - to retrieve queried region features from the dictionary using a sparse lookup strategy. When the queried feature cannot be successfully retrieved from the dictionary, it is classified as an anomaly. (3) **Query Discrimination Regularization** - to enhance anomaly discrimination by making abnormal features harder to retrieve from the dictionary. To …
Poster
Chunhao Lu · Qiang Lu · Meichen Dong · Jake Luo
[ Exhibit Hall I ]
Abstract
Current end-to-end multi-modal models utilize different encoders and decoders to process input and output information. This separation hinders the joint representation learning of various modalities. To unify multi-modal processing, we propose a novel architecture called MDM (Multi-modal Diffusion Mamba). MDM utilizes a Mamba-based multi-step selection diffusion model to progressively generate and refine modality-specific information through a unified variational autoencoder for both encoding and decoding. This innovative approach allows MDM to achieve superior performance when processing high-dimensional data, particularly in generating high-resolution images and extended text sequences simultaneously. Our evaluations in areas such as image generation, image captioning, visual question answering, text comprehension, and reasoning tasks demonstrate that MDM significantly outperforms existing end-to-end models (e.g., MonoFormer, LlamaGen, and Chameleon) and competes effectively with SOTA models like GPT-4V, Gemini Pro, and Mistral. Our results validate MDM's effectiveness in unifying multi-modal processes while maintaining computational efficiency, establishing a new direction for end-to-end multi-modal architectures.
Poster
Peijun Bao · Chenqi Kong · SIYUAN YANG · Zihao Shao · Xinghao Jiang · Boon Ng · Meng Er · Alex Kot
[ Exhibit Hall I ]
Abstract
Temporal video grounding aims to localize the described temporal moment in an untrimmed video based on a natural language query. A major challenge of this task is its heavy reliance on labor-intensive annotations for training. Unlike existing works that directly train models on manually curated data, we propose a novel paradigm to reduce annotation costs: pretraining the model on unlabeled, real-world videos. Although this dataset is not perfectly accurate, it is easily scalable without requiring extensive manual effort. To support this, we introduce Temporal Video Grounding Pretraining (Vid-Group), a large-scale dataset collected with minimal human intervention, consisting of over 50K videos captured in the wild and 200K pseudo annotations. Direct pretraining on these imperfect pseudo annotations, however, presents significant challenges, including mismatched sentence-video pairs and imprecise temporal boundaries. To address these issues, we propose the ReCorrect algorithm, which comprises two main phases: semantics-guided refinement and memory-consensus correction. The semantics-guided refinement enhances the pseudo labels by leveraging semantic similarity with video frames to clean out unpaired data and make initial adjustments to temporal boundaries. In the following memory-consensus correction phase, a memory bank tracks the model predictions, progressively correcting the temporal boundaries based on consensus within the memory. Comprehensive experiments demonstrate …
Poster
Chengyu Tao · Xuanming Cao · Juan Du
[ Exhibit Hall I ]
Abstract
Industrial quality inspection plays a critical role in modern manufacturing by identifying defective products during production. While single-modality approaches using either 3D point clouds or 2D RGB images suffer from information incompleteness, multimodal anomaly detection offers promise through the complementary fusion of crossmodal data. However, existing methods face challenges in effectively integrating unimodal results and improving discriminative power. To address these limitations, we first reinterpret memory bank-based anomaly scores in single modalities as isotropic Euclidean distances in local feature spaces. Dynamically evolving from Euclidean metrics, we propose a novel $\underline{G}$eometry-$\underline{G}$uided $\underline{S}$core $\underline{F}$usion (G$^{2}$SF) framework that progressively learns an anisotropic local distance metric as a unified score for the fusion task. Through a geometric encoding operator, a novel Local Scale Prediction Network (LSPN) is proposed to predict direction-aware scaling factors that characterize first-order local feature distributions, thereby enhancing discrimination between normal and anomalous patterns. Additionally, we develop specialized loss functions and a score aggregation strategy from geometric priors to ensure both metric generalization and efficacy. Comprehensive evaluations on the MVTec-3D AD dataset demonstrate the state-of-the-art detection performance of our method with a low positive rate and better recall, which is essential in industrial applications, and detailed ablation analysis validates each component's contribution. (\textit{Code …
Poster
Tianyu Zou · Shengwu Xiong · Ruilin Yao · Yi Rong
[ Exhibit Hall I ]
Abstract
This paper studies the few-shot segmentation (FSS) task, which aims to segment objects belonging to unseen categories in a query image by learning a model on a small number of well-annotated support samples. Our analysis of two mainstream FSS paradigms reveals that the predictions made by prototype learning methods are usually conservative, while those of affinity learning methods tend to be more aggressive. This observation motivates us to balance the conservative and aggressive information captured by these two types of FSS frameworks so as to improve the segmentation performance. To achieve this, we propose a **P**rototype-**A**ffinity **H**ybrid **Net**work (PAHNet), which introduces a Prototype-guided Feature Enhancement (PFE) module and an Attention Score Calibration (ASC) module in each attention block of an affinity learning model (called affinity learner). These two modules utilize the predictions generated by a pre-trained prototype learning model (called prototype predictor) to enhance the foreground information in support and query image representations and suppress the mismatched foreground-background (FG-BG) relationships between them, respectively. In this way, the aggressiveness of the affinity learner can be effectively mitigated, thereby eventually increasing the segmentation accuracy of our PAHNet method. Experimental results show that PAHNet achieves new state-of-the-art performance across 1-shot and 5-shot settings …
Poster
Jieun Kim · Jinmyeong Kim · Yoonji Kim · Sung-Bae Cho
[ Exhibit Hall I ]
Abstract
Large vision-language models (LVLMs) often exhibit object hallucination, a phenomenon where models generate descriptions of non-existent objects within images. Prior methods have sought to mitigate this issue by adjusting model logits to reduce linguistic bias, but they often lack precise control over visual uncertainty, sometimes exacerbating hallucinations instead of mitigating them. To address this limitation, we propose a novel decoding strategy called fuzzy contrastive decoding (FuzzyCD) that uses Takagi-Sugeno fuzzy inference to refine hallucination control. FuzzyCD adaptively assigns weights to high-hallucination logits while mitigating unnecessary linguistic bias. Specifically, it transforms the log-probabilities of top-1 tokens from both standard and hallucination logits into a \textit{confidence} linguistic fuzzy set. Through Takagi-Sugeno fuzzy inference, it dynamically adjusts hallucination logits to prevent the model from over-relying on spurious linguistic patterns. Experimental results on object hallucination datasets demonstrate that hallucination is mitigated by 11\%p compared to conventional LVLMs. In-depth analyses highlight the effectiveness of FuzzyCD in enhancing the reliability of vision-language models.
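For context, the kind of logit adjustment that prior methods use can be sketched with a fixed contrastive weight. This is a minimal NumPy sketch of a generic fixed-weight contrastive decoding baseline, not FuzzyCD itself, which per the abstract replaces the fixed weight with one chosen by Takagi-Sugeno fuzzy inference; the function name and the parameter `alpha` are assumptions made for illustration.

```python
import numpy as np

def contrastive_logits(standard_logits, hallucination_logits, alpha=1.0):
    """Amplify the standard logits and subtract the hallucination-prone ones.

    alpha is a fixed contrast strength here; alpha = 0 leaves the standard
    logits unchanged, and larger values penalize tokens that the
    hallucination-prone pass favors.
    """
    s = np.asarray(standard_logits, dtype=float)
    h = np.asarray(hallucination_logits, dtype=float)
    return (1.0 + alpha) * s - alpha * h
```

With a fixed `alpha`, every decoding step is contrasted equally strongly, which is the lack of precise control that the abstract identifies in prior methods.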
Poster
Jianting Tang · Yubo Wang · Haoyu Cao · Linli Xu
[ Exhibit Hall I ]
Abstract
Mainstream Multimodal Large Language Models (MLLMs) achieve visual understanding by using a vision projector to bridge well-pretrained vision encoders and large language models (LLMs). The inherent gap between visual and textual modalities makes the embeddings from the vision projector critical for visual comprehension. However, current alignment approaches treat visual embeddings as contextual cues and merely apply auto-regressive supervision to textual outputs, neglecting the necessity of introducing equivalent direct visual supervision, which hinders the potential finer alignment of visual embeddings. In this paper, based on our analysis of the refinement process of visual embeddings in the LLM’s shallow layers, we propose BASIC, a method that utilizes refined visual embeddings within the LLM as supervision to directly guide the projector in generating initial visual embeddings. Specifically, the guidance is conducted from two perspectives: (i) optimizing embedding directions by reducing angles between initial and supervisory embeddings in semantic space; (ii) improving semantic matching by minimizing disparities between the logit distributions of both visual embeddings. Without additional supervisory models or artificial annotations, BASIC significantly improves the performance of MLLMs across a wide range of benchmarks, demonstrating the effectiveness of our introduced direct visual supervision.
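The two supervision signals described in (i) and (ii) can be sketched as follows. This is a minimal NumPy sketch under stated assumptions rather than the BASIC implementation: it assumes the angle term is one minus cosine similarity, that logits come from a dot product with a vocabulary embedding matrix (here the hypothetical argument `vocab_embeddings`), and that the distribution disparity is measured with a KL divergence.

```python
import numpy as np

def _log_softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

def direction_loss(initial, supervisory):
    """Mean of (1 - cosine similarity); zero when the directions coincide."""
    cos = np.sum(initial * supervisory, axis=-1) / (
        np.linalg.norm(initial, axis=-1) * np.linalg.norm(supervisory, axis=-1)
    )
    return float(np.mean(1.0 - cos))

def logit_matching_loss(initial, supervisory, vocab_embeddings):
    """Mean KL(p_supervisory || p_initial) over vocabulary logit distributions."""
    log_p = _log_softmax(supervisory @ vocab_embeddings.T)
    log_q = _log_softmax(initial @ vocab_embeddings.T)
    return float(np.mean(np.sum(np.exp(log_p) * (log_p - log_q), axis=-1)))
```

Both terms vanish when the initial embeddings already equal the supervisory ones, so in this sketch they only push the projector output where the two disagree.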
Poster
Corentin Dumery · Noa Ette · Aoxiang Fan · Ren Li · Jingyi Xu · Hieu Le · Pascal Fua
[ Exhibit Hall I ]
Abstract
Visual object counting is a fundamental computer vision task underpinning numerous real-world applications, from cell counting in biomedicine to traffic and wildlife monitoring. However, existing methods struggle to handle the challenge of stacked 3D objects in which most objects are hidden by those above them. To address this important yet underexplored problem, we propose a novel 3D counting approach that decomposes the task into two complementary subproblems - estimating the 3D geometry of the object stack and the occupancy ratio from multi-view images. By combining geometric reconstruction and deep learning-based depth analysis, our method can accurately count identical objects within containers, even when they are irregularly stacked. We validate our 3D Counting pipeline on diverse real-world and large-scale synthetic datasets, which we will release publicly to facilitate further research.
Poster
Xiaohang Zhan · Dingming Liu
[ Exhibit Hall I ]
Abstract
We propose a novel training-free image generation algorithm that precisely controls the occlusion relationships between objects in an image. Existing image generation methods typically rely on prompts to influence occlusion, which often lack precision. While layout-to-image methods provide control over object locations, they fail to address occlusion relationships explicitly. Given a pre-trained image diffusion model, our method leverages volume rendering principles to ``render'' the scene in latent space, guided by occlusion relationships and the estimated transmittance of objects. This approach does not require retraining or fine-tuning the image diffusion model, yet it enables accurate occlusion control due to its physics-grounded foundation. In extensive experiments, our method significantly outperforms existing approaches in terms of occlusion accuracy. Furthermore, we demonstrate that by adjusting the opacities of objects or concepts during rendering, our method can achieve a variety of effects, such as altering the transparency of objects, the density of mass (e.g., forests), the concentration of particles (e.g., rain, fog), the intensity of light, and the strength of lens effects, etc.
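The volume rendering principle referred to above can be illustrated with standard front-to-back alpha compositing. This is a minimal NumPy sketch, not the authors' algorithm: the per-object latents, the opacity maps, and the function name are assumptions, and how the method estimates transmittance inside a diffusion model is not specified in the abstract.

```python
import numpy as np

def composite_front_to_back(latents, alphas):
    """Blend per-object latents ordered from nearest to farthest.

    latents: (N, C, H, W) one latent per object, nearest object first.
    alphas:  (N, 1, H, W) per-object opacity in [0, 1].
    Returns the composited latent and the remaining transmittance.
    """
    latents = np.asarray(latents, dtype=float)
    alphas = np.asarray(alphas, dtype=float)
    out = np.zeros_like(latents[0])
    transmittance = np.ones_like(alphas[0])   # fraction of light not yet blocked
    for z, a in zip(latents, alphas):
        out += transmittance * a * z          # contribution visible from the camera
        transmittance = transmittance * (1.0 - a)
    return out, transmittance
```

A fully opaque front object hides everything behind it, while lowering its opacity lets the objects behind show through, which matches the transparency effects listed in the abstract.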
Poster
Chao Liu · Yangbo Jiang · Nenggan Zheng
[ Exhibit Hall I ]
Abstract
Extracting tubular structures from images is a widespread and challenging task in computer vision. To explore these continuous structures, iterative tracing methods offer a promising direction. However, in scenes with dense and blurred branches, existing tracing methods tend to jump to adjacent branches during the tracing process, leading to significant topological mistakes. The reason for this shortcoming is that the tracing model only focuses on the estimation of discrete nodes and ignores their connection attribution. To solve this problem, we introduce NETracer, a topology-aware iterative tracing method to improve the continuity and topological accuracy. In our approach, a node-edge estimation network with local connectivity loss is trained to produce the future nodes and their connective edges. Then, a geodesic distance-based search strategy is employed with the help of predicted edge cues to trace the future branches more accurately. Additionally, to comprehensively assess the effect of the tracing model, a new tracing metric is proposed to evaluate the local accuracy, continuity, and topological correctness of the traced branches. We demonstrate that our proposed method outperforms existing segmentation and tracing methods on five 2D road, vessel and 3D neuron datasets.
Poster
Yujia Tong · Yuze Wang · Jingling Yuan · Chuang Hu
[ Exhibit Hall I ]
Abstract
Model quantization enables efficient deployment of deep neural networks on edge devices through low-bit parameter representation, yet raises critical challenges for implementing machine unlearning (MU) under data privacy regulations. Existing MU methods designed for full-precision models fail to address two fundamental limitations in quantized networks: 1) Noise amplification from label mismatch during data processing, and 2) Gradient imbalance between forgotten and retained data during training. These issues are exacerbated by quantized models' constrained parameter space and discrete optimization. We propose Q-MUL, the first dedicated unlearning framework for quantized models. Our method introduces two key innovations: 1) Similar Labels assignment replaces random labels with semantically consistent alternatives to minimize noise injection, and 2) Adaptive Gradient Reweighting dynamically aligns parameter update contributions from forgotten and retained data. Through systematic analysis of quantized model vulnerabilities, we establish theoretical foundations for these mechanisms. Extensive evaluations on benchmark datasets demonstrate Q-MUL's superiority over existing approaches.
Poster
Yuan Tian · Shuo Wang · Rongzhao Zhang · Zijian Chen · Yankai Jiang · Chunyi Li · Xiangyang Zhu · Fang Yan · Qiang Hu · Xiaosong Wang · Guangtao Zhai
[ Exhibit Hall I ]
Abstract
Medical imaging has significantly advanced computer-aided diagnosis, yet its re-identification (ReID) risks raise critical privacy concerns, calling for de-identification (DeID) techniques. Unfortunately, existing DeID methods neither particularly preserve medical semantics, nor are flexibly adjustable towards different privacy levels. To address these issues, we propose a divide-and-conquer framework that comprises two steps: (1) \textbf{Identity-Blocking}, which blocks varying proportions of identity-related regions, to achieve different privacy levels; and (2) \textbf{Medical-Semantics-Compensation}, which leverages pre-trained Medical Foundation Models (MFMs) to extract medical semantic features to compensate the blocked regions. Moreover, recognizing that features from MFMs may still contain residual identity information, we introduce a \textbf{Minimum Description Length} principle-based feature decoupling strategy, to effectively decouple and discard such identity components. Extensive evaluations against existing approaches across seven datasets and three downstream tasks demonstrate our state-of-the-art performance.
Poster
Yuanhan Zhang · Yunice Chew · Yuhao Dong · Aria Leo · Bo Hu · Ziwei Liu
[ Exhibit Hall I ]
Abstract
Human intelligence requires both correctness and robustness, with the former being foundational for the latter. In video understanding, correctness ensures the accurate interpretation of visual content, and robustness maintains consistent performance in challenging conditions. Despite advances in video large language models (video LLMs), existing benchmarks inadequately reflect the gap between these models and human intelligence in maintaining correctness and robustness in video interpretation. We introduce the Video Turing Test (Video-TT), a benchmark designed to assess if video LLMs can interpret real-world videos as effectively as humans. Video-TT 1) differentiates between errors due to inadequate frame sampling and genuine gaps in understanding complex visual narratives, and 2) evaluates robustness against natural adversarial questions. Video-TT comprises 1,000 YouTube Shorts videos, each with one open-ended question and four adversarial questions that probe visual and narrative complexity. Our evaluation shows a significant gap between video LLMs and human performance, underscoring the need for benchmarks like Video-TT to advance video understanding.
Poster
Langyu Wang · Yingying Chen · Yiyuan Zhang · Ming Tang · Jinqiao Wang
[ Exhibit Hall I ]
Abstract
Weakly-supervised audio-visual video parsing (AVVP) aims to predict all modality-specific events and locate their temporal boundaries. Despite significant progress, due to the limitations of weak supervision and the deficiencies of the model architecture, existing methods struggle to simultaneously improve both segment-level and event-level prediction. In this work, we propose an audio-visual Mamba network with pseudo labeling aUGmentation (MUG) for emphasising the uniqueness of each segment and excluding the noise interference from the alternate modalities. Specifically, we annotate some of the pseudo-labels based on previous work. Using unimodal pseudo-labels, we perform cross-modal random combinations to generate new data, which can enhance the model’s ability to parse various segment-level event combinations. For feature processing and interaction, we employ an audio-visual Mamba network. The AV-Mamba enhances the ability to perceive different segments and excludes additional modal noise while sharing similar modal information. Our extensive experiments demonstrate that MUG improves state-of-the-art results on the LLP dataset, especially in visual metrics (e.g., gains of 2.8\% and 1.1\% in terms of Segment-level visual and Event-level visual metrics).
Poster
Xin Shen · Xinyu Wang · Lei Shen · Kaihao Zhang · Xin Yu
[ Exhibit Hall I ]
Abstract
Cross-view isolated sign language recognition (CV-ISLR) addresses the challenge of identifying isolated signs from viewpoints unseen during training, a problem aggravated by the scarcity of multi-view data in existing benchmarks. To bridge this gap, we introduce a novel two-stage framework comprising View Synthesis and Contrastive Multi-task View-Semantics Recognition. In the View Synthesis stage, we simulate unseen viewpoints by extracting 3D keypoints from the frontal-view training dataset and synthesizing common-view 2D skeleton sequences with virtual camera rotation, which enriches view diversity without the cost of multi-camera setups. However, direct training on these synthetic samples leads to limited improvement, as viewpoint-specific and semantics-specific features remain entangled. To overcome this drawback, the Contrastive Multi-task View-Semantics Recognition stage employs the cross-attention mechanism and contrastive learning objective, explicitly disentangling viewpoint-related information from sign semantics, thus obtaining robust view-invariant representations. We evaluate our approach on the MM-WLAuslan dataset, the first benchmark for CV-ISLR, and on our extended protocol (MTV-Test) that includes additional multi-view data captured in the wild. Experimental results demonstrate that our method not only improves the accuracy of frontal-view skeleton-based isolated sign language recognition, but also exhibits superior generalization to novel viewpoints. The MTV-Test set and code will be publicly released here.
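The View Synthesis stage described above can be sketched as a rotation followed by a projection. This is a minimal NumPy sketch, not the authors' pipeline: it assumes a rotation about the vertical (y) axis and an orthographic projection, and the function name and `yaw_degrees` parameter are hypothetical.

```python
import numpy as np

def synthesize_view(keypoints_3d, yaw_degrees):
    """Rotate 3D keypoints about the vertical axis and drop depth.

    keypoints_3d: (..., 3) array, e.g. (frames, joints, 3) in x, y, z order.
    Rotating the skeleton by +yaw is equivalent to moving a virtual camera
    by -yaw around the signer.
    Returns (..., 2) keypoints under an assumed orthographic projection.
    """
    theta = np.deg2rad(yaw_degrees)
    c, s = np.cos(theta), np.sin(theta)
    rotation = np.array([[c, 0.0, s],
                         [0.0, 1.0, 0.0],
                         [-s, 0.0, c]])
    rotated = np.asarray(keypoints_3d, dtype=float) @ rotation.T
    return rotated[..., :2]
```

Sweeping `yaw_degrees` over a range of angles yields several 2D skeleton sequences from a single frontal-view recording, without a multi-camera setup.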
Poster
Zeren Jiang · Chuanxia Zheng · Iro Laina · Diane Larlus · Andrea Vedaldi
[ Exhibit Hall I ]
Abstract
We introduce Geo4D, a method to repurpose video diffusion models for monocular 3D reconstruction of dynamic scenes. By leveraging the strong dynamic prior captured by such video models, Geo4D can be trained using only synthetic data while generalizing well to real data in a zero-shot manner. Geo4D predicts several complementary geometric modalities, namely point, depth, and ray maps. It uses a new multi-modal alignment algorithm to align and fuse these modalities, as well as multiple sliding windows, at inference time, thus obtaining robust and accurate 4D reconstruction of long videos. Extensive experiments across multiple benchmarks show that Geo4D significantly surpasses state-of-the-art video depth estimation methods, including recent methods such as MonST3R, which are also designed to handle dynamic scenes.
Poster
Xuehan Chen · Guangyu Ren · Tianhong Dai · Tania Stathaki · Hengyan Liu
[ Exhibit Hall I ]
Abstract
Foundation models, such as Segment Anything (SAM), have exhibited remarkable performance in conventional segmentation tasks, primarily due to their training on large-scale datasets. Nonetheless, challenges remain in specific downstream tasks, such as Camouflaged Object Detection (COD). Existing research primarily aims to enhance performance by integrating additional multimodal information derived from other foundation models. However, directly leveraging the information generated by these models may introduce additional biases due to domain shifts. To address this issue, we propose an Adaptive Refinement Module (ARM), which efficiently processes multimodal information and simultaneously enhances the refined mask prompt. Furthermore, we construct an auxiliary embedding that effectively exploits the intermediate information generated during ARM, providing SAM with richer feature representations. Experimental results indicate that our proposed architecture surpasses most state-of-the-art (SOTA) models in the COD task, particularly excelling in structured target segmentation.
Poster
Wenzheng Zeng · Difei Gao · Mike Zheng Shou · Hwee Tou Ng
[ Exhibit Hall I ]
Abstract
Video LLMs show great potential for video understanding, but still struggle with accurate temporal grounding for event-level perception. We observe that two main factors in video understanding (i.e., temporal grounding and textual response) form a logical hierarchy: accurate temporal evidence grounding lays the foundation for reliable textual response. However, existing works typically handle these two tasks in a coupled manner without a clear logical structure, leading to sub-optimal objectives. We address this from a factorized learning perspective. We first propose D$^2$VLM, a framework that \underline{d}ecouples the learning of these two tasks while also emphasizing their inherent \underline{d}ependency. We adopt a ``grounding then answering with evidence referencing'' paradigm and introduce evidence tokens for evidence grounding, which emphasize event-level visual semantic capture beyond the focus on timestamp representation in existing works. To further facilitate the learning of these two tasks, we introduce a novel factorized preference optimization (FPO) algorithm. Unlike standard preference optimization, FPO explicitly incorporates probabilistic temporal grounding modeling into the optimization objective, enabling preference learning for both temporal grounding and textual response. We also construct a synthetic dataset to address the lack of suitable datasets for factorized preference learning with explicit temporal grounding. Experiments on various tasks demonstrate the clear …
Poster
Bingchao Wang · Zhiwei Ning · Jianyu Ding · Xuanang Gao · Yin Li · Dongsheng Jiang · JIE YANG · Wei Liu
[ Exhibit Hall I ]
Abstract
CLIP has shown promising performance across many short-text tasks in a zero-shot manner. However, limited by the input length of the text encoder, CLIP struggles on downstream tasks with long-text inputs ($>77$ tokens). To improve long-text understanding while preserving short-text capabilities, we propose Fix-CLIP, which includes three novel modules: (1) A dual-branch training pipeline that aligns short and long texts with masked and raw images respectively, which boosts the long-text representation while preserving the short-text ability. (2) Multiple learnable regional prompts with unidirectional masks in Transformer layers for regional information extraction. (3) A hierarchical feature alignment module in the intermediate encoder layers to promote the consistency of multi-scale features. Furthermore, we collect 30M images and utilize existing MLLMs to synthesize long-text captions for training. Extensive experiments show that Fix-CLIP achieves state-of-the-art performance on both long-text and short-text retrieval benchmarks. For downstream applications, we reveal that Fix-CLIP's text encoder delivers promising performance in a plug-and-play manner for diffusion models with long-text input.
Poster
Mengdi Liu · Zhangyang Gao · Hong Chang · Stan Li · Shiguang Shan · Xilin Chen
[ Exhibit Hall I ]
Abstract
Understanding how genes influence phenotype across species is a fundamental challenge in genetic engineering, which will facilitate advances in various fields such as crop breeding, conservation biology, and personalized medicine. However, current phenotype prediction models are limited to individual species and rely on an expensive phenotype labeling process, making genotype-to-phenotype prediction a highly domain-dependent and data-scarce problem. To this end, we suggest taking images as morphological proxies, facilitating cross-species generalization through large-scale multimodal pretraining. We propose the first genotype-to-phenotype diffusion model (G2PDiffusion) that generates morphological images from DNA considering two critical evolutionary signals, i.e., multiple sequence alignments (MSA) and environmental contexts. The model contains three novel components: 1) an MSA retrieval engine that identifies conserved and co-evolutionary patterns; 2) an environment-aware MSA conditional encoder that effectively models complex genotype-environment interactions; and 3) an adaptive phenomic alignment module to improve genotype-phenotype consistency. Extensive experiments show that integrating evolutionary signals with environmental context enriches the model's understanding of phenotype variability across species, thereby offering a valuable and promising exploration into advanced AI-assisted genomic analysis.
Poster
Hyolim Kang · Yunsu Park · Youngbeom Yoo · Yeeun Choi · Seon Joo Kim
[ Exhibit Hall I ]
Abstract
We introduce Hierarchical Streaming Video Understanding, a task that combines online temporal action localization with free-form description generation. Given the scarcity of datasets with hierarchical and fine-grained temporal annotations, we demonstrate that LLMs can effectively group atomic actions into higher-level events, enriching existing datasets. We then propose OpenHOUSE (Open-ended Hierarchical Online Understanding System for Events), which extends streaming action perception beyond action classification. OpenHOUSE features a specialized streaming module that accurately detects boundaries between closely adjacent actions, nearly doubling the performance of direct extensions of existing methods. We envision the future of streaming action perception in the integration of powerful generative models, with OpenHOUSE representing a key step in that direction.
Poster
Mattia Segu · Marta Tintore Gazulla · Yongqin Xian · Luc Gool · Federico Tombari
[ Exhibit Hall I ]
Abstract
Scaling up model size and training data has advanced foundation models for instance-level perception, achieving state-of-the-art in-domain and zero-shot performance across object detection and segmentation. However, their high computational cost limits adoption on resource-constrained platforms. We first examine the limitations of existing architectures in enabling efficient edge deployment without compromising performance. We then introduce MOBIUS, a family of foundation models for universal instance segmentation, designed for Pareto-optimal downscaling to support deployment across devices ranging from high-end accelerators to mobile hardware. To reduce training and inference demands, we propose: (i) a bottleneck pixel decoder for efficient multi-scale and multi-modal fusion, (ii) a language-guided uncertainty calibration loss for adaptive decoder pruning, and (iii) a streamlined, unified training strategy. Unlike efficient baselines that trade accuracy for reduced complexity, MOBIUS reduces pixel and transformer decoder FLOPs by up to 55% and 75%, respectively, while maintaining state-of-the-art performance in just a third of the training iterations. MOBIUS establishes a new benchmark for efficient segmentation on both high-performance computing platforms and mobile devices.
Poster
Guoyizhe Wei · Rama Chellappa
[ Exhibit Hall I ]
Abstract
Vision Transformers (ViTs) have delivered remarkable progress through global self-attention, yet their quadratic complexity can become prohibitive for high-resolution inputs. In this work, we present ViT-Linearizer, a cross-architecture distillation framework that transfers rich ViT representations into a linear-time, recurrent-style model. Our approach leverages 1) activation matching, an intermediate constraint that encourages the student to align its token-wise dependencies with those produced by the teacher, and 2) masked prediction, a contextual reconstruction objective that requires the student to predict the teacher’s representations for unseen (masked) tokens, to effectively distill the quadratic self-attention knowledge into the student while maintaining efficient complexity. Empirically, our method provides notable speedups, particularly for high-resolution tasks, significantly addressing the hardware challenges in inference. Additionally, it elevates Mamba-based architectures’ performance on standard vision benchmarks, achieving a competitive 84.3% top-1 accuracy on ImageNet with a base-sized model. Our results underscore the good potential of RNN-based solutions for large-scale visual tasks, bridging the gap between theoretical efficiency and real-world practice.
Poster
Meng Tian · Shuo Yang · Xinxiao Wu
[ Exhibit Hall I ]
Abstract
Driven by large-scale contrastive vision-language pre-trained models such as CLIP, recent advancements in the image-text matching task have achieved remarkable success in representation learning. Due to image-level visual-language alignment, CLIP falls short in understanding fine-grained details such as object attributes and spatial relationships between objects. Recent efforts have attempted to compel CLIP to acquire structured visual representations by introducing prompt learning to achieve object-level alignment. While achieving promising results, they still lack the capability to perceive actions, which are crucial for describing the states or relationships between objects. Therefore, we propose to endow CLIP with fine-grained action-level understanding by introducing an LLM-enhanced action-aware multi-modal prompt-tuning method, incorporating the action-related external knowledge generated by large language models (LLMs). Specifically, we design an action triplet prompt and an action state prompt to exploit compositional semantic knowledge and state-related causal knowledge implicitly stored in LLMs. Subsequently, we propose an adaptive interaction module to aggregate attentive visual features conditioned on action-aware prompted knowledge for establishing discriminative and action-aware visual representations, which further improves the performance. Comprehensive experimental results on two benchmark datasets demonstrate the effectiveness of our method.
Poster
Weixian Lei · Jiacong Wang · Haochen Wang · Xiangtai Li · Jun Hao Liew · Jiashi Feng · Zilong Huang
[ Exhibit Hall I ]
Abstract
This paper introduces SAIL, a single transformer unified multimodal large language model (MLLM) that integrates raw pixel encoding and language decoding within a singular architecture. Unlike existing modular MLLMs, which rely on a pre-trained vision transformer (ViT), SAIL eliminates the need for a separate vision encoder, presenting a more minimalist architecture design. Instead of introducing novel architectural components, SAIL adapts mix-attention mechanisms and multimodal positional encodings to better align with the distinct characteristics of visual and textual modalities. We systematically compare SAIL's properties, including scalability, cross-modal information flow patterns, and visual representation capabilities, with those of modular MLLMs. By scaling both training data and model size, SAIL achieves performance comparable to modular MLLMs. Notably, the removal of pretrained ViT components enhances SAIL's scalability and results in significantly different cross-modal information flow patterns. Moreover, SAIL demonstrates strong visual representation capabilities, achieving results on par with ViT-22B in vision tasks such as semantic segmentation. Code and models will be released.
Poster
Pablo Garcia-Fernandez · Lorenzo Vaquero · Mingxuan Liu · Feng Xue · Daniel Cores · Nicu Sebe · Manuel Mucientes · Elisa Ricci
[ Exhibit Hall I ]
Abstract
Open-vocabulary object detection (OvOD) is set to revolutionize security screening by enabling systems to recognize any item in X-ray scans. However, developing effective OvOD models for X-ray imaging presents unique challenges due to data scarcity and the modality gap that prevents direct adoption of RGB-based solutions. To overcome these limitations, we propose RAXO, a training-free framework that repurposes off-the-shelf RGB OvOD detectors for robust X-ray detection. RAXO builds high-quality X-ray class descriptors using a dual-source retrieval strategy. It gathers relevant RGB images from the web and enriches them via a novel X-ray material transfer mechanism, eliminating the need for labeled databases. These visual descriptors replace text-based classification in OvOD, leveraging intra-modal feature distances for robust detection. Extensive experiments demonstrate that RAXO consistently improves OvOD performance, providing an average mAP increase of up to 17.0 points over base detectors. To further support research in this emerging field, we also introduce DET-COMPASS, a new benchmark featuring bounding box annotations for over 300 object categories, enabling large-scale evaluation of OvOD in X-ray. Code and dataset will be made available.
Poster
Hao LU · Yuting Zhang · Jiaqi Tang · Bowen Fu · Wenhang Ge · Wei Wei · Kaishun Wu · Ying-Cong Chen
[ Exhibit Hall I ]
Abstract
Remote Photoplethysmography (rPPG) enables non-contact extraction of physiological signals, providing significant advantages in medical monitoring, emotion recognition, and face anti-spoofing. However, the extraction of reliable rPPG signals is hindered by motion variations in real-world environments, leading to an entanglement issue. To address this challenge, we employ the Generalizable Gaussian Model (GGM) to disentangle geometry and chroma components with 4D Gaussian representations. Employing the GGM for robust rPPG estimation is non-trivial. First, there are no camera parameters in the dataset, which makes it impossible to render video from the 4D Gaussians. We therefore propose a "4D virtual camera" that constructs extra Gaussian parameters to describe view and motion changes, making it possible to render video with fixed virtual camera parameters. Second, the chroma component is still not explicitly decoupled in the 4D Gaussian representation. Explicit motion modeling (EMM) is designed to decouple the motion variation in an unsupervised manner, and explicit chroma modeling (ECM) is tailored to decouple specular, physiological, and noise signals. To validate our approach, we expand existing rPPG datasets to include various motion and illumination interference scenarios, demonstrating the effectiveness of our method in real-world settings. The code will be available after acceptance.
Poster
Seogkyu Jeon · Kibeom Hong · Hyeran Byun
[ Exhibit Hall I ]
Abstract
Recent domain generalized semantic segmentation (DGSS) studies have achieved notable improvements by distilling semantic knowledge from Vision-Language Models (VLMs). However, they overlook the semantic misalignment between visual and textual contexts, which arises due to the rigidity of a fixed context prompt learned on a single source domain. To this end, we present a novel domain generalization framework for semantic segmentation, namely Domain-aware Prompt-driven Masked Transformer (DPMFormer). First, we introduce domain-aware prompt learning to facilitate semantic alignment between visual and textual cues. Second, to capture various domain-specific properties with a single source dataset, we propose domain-aware contrastive learning along with a texture perturbation that diversifies the observable domains. Lastly, to establish a framework resilient against diverse environmental changes, we propose domain-robust consistency learning, which guides the model to minimize discrepancies between its predictions on the original and the augmented images. Through experiments and analyses, we demonstrate the superiority of the proposed framework, which establishes a new state-of-the-art on various DGSS benchmarks.
Poster
Yudong Liu · Jingwei Sun · Yueqian Lin · Jingyang Zhang · Ming Yin · Qinsi Wang · Jianyi Zhang · Hai Li · Yiran Chen
[ Exhibit Hall I ]
Abstract
Vision language models (VLMs) demonstrate strong capabilities in jointly processing visual and textual data. However, they often incur substantial computational overhead due to redundant visual information, particularly in long-form video scenarios. Existing approaches predominantly focus on either vision token pruning, which may overlook spatio-temporal dependencies, or keyframe selection, which identifies informative frames but discards others, thus disrupting contextual continuity. In this work, we propose KVTP (Keyframe-oriented Vision Token Pruning), a novel framework that overcomes the drawbacks of token pruning and keyframe selection. By adaptively assigning pruning rates based on frame relevance to the query, KVTP effectively retains essential contextual information while significantly reducing redundant computation. To thoroughly evaluate the long-form video understanding capacities of VLMs, we curated and reorganized subsets from VideoMME, EgoSchema, and NextQA into a unified benchmark named SparseKV-QA that highlights real-world scenarios with sparse but crucial events. Our experiments with VLMs of various scales show that KVTP can reduce token usage by 80% without compromising spatiotemporal and contextual consistency, significantly cutting computation while maintaining the performance. These results demonstrate our approach's effectiveness in efficient long-video processing, facilitating more scalable VLM deployment.
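The idea of assigning pruning rates by frame relevance can be illustrated with a minimal sketch. This is not the KVTP implementation: the function name `allocate_keep_counts`, the softmax weighting, and the `temperature` parameter are assumptions made only for illustration, and the relevance scores are taken as given.

```python
import math

def allocate_keep_counts(relevance, tokens_per_frame, keep_ratio, temperature=1.0):
    """Split a global token budget across frames in proportion to softmax(relevance).

    Hypothetical helper: the softmax weighting is an assumed allocation rule,
    not necessarily the one used by KVTP.
    """
    budget = int(round(keep_ratio * tokens_per_frame * len(relevance)))
    peak = max(relevance)  # subtract the max for numerical stability
    weights = [math.exp((r - peak) / temperature) for r in relevance]
    total = sum(weights)
    # Proportional share, capped at the number of tokens a frame actually has.
    return [min(tokens_per_frame, int(budget * w / total)) for w in weights]

# Three frames with 100 tokens each; the last frame is the most query-relevant.
counts = allocate_keep_counts([0.0, 0.0, 2.0], tokens_per_frame=100, keep_ratio=0.2)
```

Under this rule every frame keeps some tokens, so context is thinned rather than dropped, while the most relevant frame receives the largest share. Because shares are rounded down and capped, part of the budget may go unused.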
Poster
Han Wang · Yuxiang Nie · Yongjie Ye · Yanjie Wang · SHUAI LI · Haiyang Yu · Jinghui Lu · Can Huang
[ Exhibit Hall I ]
Abstract
The application of Large Vision-Language Models (LVLMs) for analyzing images and videos is an exciting and rapidly evolving field. In recent years, we've seen significant growth in high-quality image-text datasets for fine-tuning image understanding, but there is still a lack of comparable datasets for videos. Additionally, many VideoLLMs are extensions of single-image VLMs, which may not efficiently handle the complexities of longer videos. In this study, we introduce a large-scale synthetic dataset created from proprietary models, using carefully designed prompts to tackle a wide range of questions. We also explore a dynamic visual token compression architecture that strikes a balance between computational efficiency and performance. Our proposed Dynamic-VLM achieves state-of-the-art results across various video tasks and shows impressive generalization, setting new baselines in multi-image understanding. Notably, Dynamic-VLM delivers an absolute improvement of 2.7% over LLaVA-OneVision on VideoMME and 10.7% on MuirBench.
Poster
Min Yang · Zihan Jia · Zhilin Dai · Sheng Guo · Limin Wang
[ Exhibit Hall I ]
Abstract
Although big models have achieved good results in increasing numbers of vision tasks, efficient lightweight neural networks have received increasing attention due to their faster reasoning speed and easier deployment on mobile devices. However, existing video models still focus on the larger ViT architecture, and few works attempt to build efficient architectures. Since many efficient contrastive language-image pre-training (CLIP) models have shown strong zero-shot classification and retrieval capability, we attempt to fill the gap in video-text understanding models and propose MobileViCLIP, a fast and efficient video-text model with strong zero-shot reasoning capability that can be deployed on mobile devices. In particular, our MobileViCLIP-Small obtains zero-shot retrieval performance similar to that of InternVideo2-L14 on the text-to-video dataset MSR-VTT while being $46.7\times$ faster when deployed on a mobile device. Furthermore, MobileViCLIP-Small generalizes to the zero-shot action recognition task and obtains 1.0% better Top-1 accuracy than InternVideo2-S14 while being $5.6\times$ faster on a mobile device.
Poster
Junpeng Jing · Weixun Luo · Ye Mao · Krystian Mikolajczyk
[ Exhibit Hall I ]
Abstract
This paper introduces Stereo Any Video, a powerful framework for video stereo matching. It can estimate spatially accurate and temporally consistent disparities without relying on auxiliary information such as camera poses or optical flow. The strong capability is driven by rich priors from monocular video depth models, which are integrated with convolutional features to produce stable representations. To further enhance performance, key architectural innovations are introduced: all-to-all-pairs correlation, which constructs smooth and robust matching cost volumes, and temporal convex upsampling, which improves temporal coherence. These components collectively ensure robustness, accuracy, and temporal consistency, setting a new standard in video stereo matching. Extensive experiments demonstrate that our method achieves state-of-the-art performance across multiple datasets both qualitatively and quantitatively in zero-shot settings, as well as strong generalization to real-world indoor and outdoor scenarios. Code and models will be publicly released.
Poster
Yehao Lu · Minghe Weng · Zekang Xiao · Rui Jiang · Wei Su · Guangcong Zheng · Luping Luping · Xi Li
[ Exhibit Hall I ]
Abstract
The Mixture of Experts (MoE) architecture has excelled in Large Vision-Language Models (LVLMs), yet its potential in real-time open-vocabulary object detectors, which also leverage large-scale vision-language datasets but smaller models, remains unexplored. This work investigates this domain, revealing intriguing insights. In the shallow layers, experts tend to cooperate with diverse peers to expand the search space, while in the deeper layers, fixed collaborative structures emerge, where each expert maintains 2-3 fixed partners and distinct expert combinations are specialized in processing specific patterns. Concretely, we propose Dynamic-DINO, which extends Grounding DINO 1.5 Edge from a dense model to a dynamic inference framework via an efficient MoE-Tuning strategy. Additionally, we design a granularity decomposition mechanism to decompose the Feed-Forward Network (FFN) of the base model into multiple smaller expert networks, expanding the subnet search space. To prevent performance degradation at the start of fine-tuning, we further propose a pre-trained weight allocation strategy for the experts, coupled with a specific router initialization. During inference, only the input-relevant experts are activated to form a compact subnet. Experiments show that, pretrained with merely 1.56M open-source data, Dynamic-DINO outperforms Grounding DINO 1.5 Edge, pretrained on the private Grounding20M dataset.
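Activating only the input-relevant experts can be pictured with a generic top-k routing sketch. The abstract does not state the routing rule, so top-k selection with softmax gating is an assumption here, and `route_top_k` and `moe_forward` are hypothetical names rather than Dynamic-DINO code.

```python
import math

def route_top_k(router_logits, k):
    """Pick the k highest-scoring experts and softmax-normalize their gates.

    Assumed routing rule (generic top-k MoE), not taken from the paper.
    """
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    top = ranked[:k]
    peak = max(router_logits[i] for i in top)
    exps = {i: math.exp(router_logits[i] - peak) for i in top}
    norm = sum(exps.values())
    return {i: e / norm for i, e in exps.items()}

def moe_forward(x, experts, router_logits, k=2):
    """Evaluate only the selected experts and mix their outputs by gate weight."""
    gates = route_top_k(router_logits, k)
    return sum(g * experts[i](x) for i, g in gates.items())

# Toy scalar "experts"; the third is never evaluated for these logits.
experts = [lambda x: x, lambda x: 2 * x, lambda x: 100 * x]
output = moe_forward(2.0, experts, router_logits=[1.0, 1.0, -5.0], k=2)
```

Only the experts in `gates` are called, which is what makes the activated subnet smaller than the full expert pool.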
Poster
Qizhe Zhang · Aosong Cheng · Ming Lu · Renrui Zhang · Zhiyong Zhuo · Jiajun Cao · Shaobo Guo · Qi She · Shanghang Zhang
[ Exhibit Hall I ]
Abstract
Large vision-language models (VLMs) generally contain significantly more visual tokens than their textual counterparts, resulting in a considerable computational burden. Recent efforts have been made to tackle this issue by pruning visual tokens early within the language model. Most existing works use attention scores between text and visual tokens to assess the importance of visual tokens. However, in this study, we first analyze the text-visual attention in the language model and find that this score is not an ideal indicator for visual token pruning. Based on the analysis, we propose VisPruner, a plug-and-play method that utilizes visual cues for more effective token pruning in VLMs. Specifically, we first use visual attention to select a limited number of significant tokens. Then, we remove duplicate tokens from the remaining ones based on their similarity. By retaining diverse tokens alongside the initially selected important tokens, we maximally preserve the visual information of the input image. Experimental results demonstrate that our VisPruner sustains strong performance across various VLM architectures and reduction ratios, significantly outperforming existing methods based on text-visual attention. Notably, without any training, VisPruner can reduce the FLOPs of LLaVA-1.5-7B by 91% and inference latency by 75%, while maintaining comparable …
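The two-step selection described above (important tokens first, then de-duplicated remaining tokens) can be sketched roughly as follows. The greedy cosine-similarity threshold used for de-duplication is an assumption, since the abstract does not specify the procedure, and `prune_tokens`, `n_important`, `n_diverse`, and `sim_threshold` are hypothetical names.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def prune_tokens(features, attention, n_important, n_diverse, sim_threshold=0.9):
    """Return sorted indices of the visual tokens to keep.

    Step 1: keep the n_important tokens with the highest visual attention.
    Step 2: from the rest, greedily keep up to n_diverse tokens that are not
            near-duplicates of an already kept diverse token (assumed rule).
    """
    order = sorted(range(len(features)), key=lambda i: attention[i], reverse=True)
    important = order[:n_important]
    diverse = []
    for i in order[n_important:]:
        if len(diverse) == n_diverse:
            break
        if all(cosine(features[i], features[j]) < sim_threshold for j in diverse):
            diverse.append(i)
    return sorted(important + diverse)

features = [[1, 0], [0, 1], [0, 1], [1, 1], [-1, 0]]
attention = [0.9, 0.3, 0.2, 0.1, 0.05]
kept = prune_tokens(features, attention, n_important=1, n_diverse=2)
```

In this toy input, token 2 duplicates token 1 and is skipped, so the diverse slots go to tokens 1 and 3.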
Poster
Qiao Zhang · Mingwen Shao · Xinyuan Chen · Xiang Lv · Kai Xu
[ Exhibit Hall I ]
Abstract
The Mamba model excels in anomaly detection through efficient long-range dependency modeling and linear complexity. However, Mamba-based anomaly detectors still face two critical challenges: (1) insufficient modeling of diverse local features leading to inaccurate detection of subtle anomalies; (2) spatial-wise scanning mechanism disrupting the spatial continuity of large-scale anomalies, resulting in incomplete localization. To address these challenges, we propose Wave-MambaAD, a wavelet-driven state space model for unified subtle and large-scale anomaly detection. Firstly, to capture subtle anomalies, we design a high-frequency state space model that employs horizontal, vertical, and diagonal scanning mechanisms for processing directionally aligned high-frequency components, enabling precise anomaly detection through multidimensional feature extraction. Secondly, for comprehensive localization of large-scale anomalies, we propose a low-frequency state space model implementing channel-adaptive dynamic scanning mechanisms to maintain structural coherence in global contexts, which facilitates large-scale anomaly detection via adaptive feature integration. Finally, we develop a dynamic spatial enhancement block to improve anomalous feature representation by enhancing feature diversity through coordinated inter-channel communication and adaptive gating mechanisms. Comprehensive experiments on benchmark anomaly detection datasets show that Wave-MambaAD achieves competitive performance with fewer parameters and lower computational cost.
Poster
Yingyue Li · Bencheng Liao · Wenyu Liu · Xinggang Wang
[ Exhibit Hall I ]
Abstract
With the advancement of RNN models with linear complexity, the quadratic complexity challenge of transformers has the potential to be overcome. Notably, the emerging Mamba-2 has demonstrated competitive performance, bridging the gap between RNN models and transformers. However, due to sequential processing and vanishing gradients, RNN models struggle to capture long-range dependencies, leading to slow convergence, high resource demands, and suboptimal performance on downstream understanding and complex reasoning tasks. In this work, we introduce MaTVLM, a hybrid model that replaces a portion of the transformer decoder layers in a pre-trained VLM with Mamba-2 layers. By leveraging the inherent relationship between attention and Mamba-2, we initialize Mamba-2 with corresponding attention weights to accelerate convergence. We further enhance training efficiency through a single-stage distillation process, using the pre-trained VLM as a teacher model to transfer knowledge to MaTVLM. Additionally, we explore the impact of differential distillation losses within our training framework. Evaluations across multiple benchmarks demonstrate that MaTVLM achieves competitive performance against the teacher model and existing VLMs while outperforming both Mamba-based VLMs and models with similar parameter scales. Remarkably, MaTVLM attains up to 3.6× faster inference than the teacher model and reduces GPU memory consumption by 27.5%, all without compromising performance.
Poster
Tianyuan Qu · Longxiang Tang · Bohao PENG · Senqiao Yang · Bei Yu · Jiaya Jia
[ Exhibit Hall I ]
Abstract
The rise of Large Vision-Language Models (LVLMs) has significantly advanced video understanding. However, efficiently processing long videos remains a challenge due to the "Sampling Dilemma": low-density sampling risks missing critical information, while high-density sampling introduces redundancy. To address this issue, we introduce LSDBench, the first benchmark designed to evaluate LVLMs on long-video tasks by constructing high Necessary Sampling Density (NSD) questions, where NSD represents the minimum sampling density required to accurately answer a given question. LSDBench focuses on dense, short-duration actions to rigorously assess the sampling strategies employed by LVLMs. To tackle the challenges posed by high-NSD questions, we propose a novel Reasoning-Driven Hierarchical Sampling (RHS) framework, which combines global localization of question-relevant cues with local dense sampling for precise inference. Additionally, we develop a lightweight Semantic-Guided Frame Selector to prioritize informative frames, enabling RHS to achieve comparable or superior performance with significantly fewer sampled frames. Together, our LSDBench and RHS framework address the unique challenges of high-NSD long-video tasks, setting a new standard for evaluating and improving LVLMs in this domain.
Poster
Yongjian Wu · Yang Zhou · Jiya Saiyin · Bingzheng Wei · Yan Xu
[ Exhibit Hall I ]
Abstract
We propose VisTex-OVLM, a novel image-prompted object detection method that introduces visual textualization, a process that projects a few visual exemplars into the text feature space to enhance Object-level Vision-Language Models' (OVLMs) capability in detecting rare categories that are difficult to describe textually and nearly absent from their pre-training data, while preserving their pre-trained object-text alignment. Specifically, VisTex-OVLM leverages multi-scale textualizing blocks and a multi-stage fusion strategy to integrate visual information from visual exemplars, generating textualized visual tokens that effectively guide OVLMs alongside text prompts. Unlike previous methods, our method keeps the original architecture of the OVLM, preserving its generalization capabilities while enhancing performance in few-shot settings. VisTex-OVLM demonstrates superior performance across open-set datasets which have minimal overlap with OVLM's pre-training data and achieves state-of-the-art results on few-shot benchmarks PASCAL VOC and MSCOCO. The code will be released at VisTex-OVLM.
Poster
Yuxuan Yuan · Luyao Tang · Chaoqi Chen · Yixin Chen · Yue Huang · Xinghao Ding
[ Exhibit Hall I ]
Abstract
Although existing Single-Domain Generalized Object Detection (Single-DGOD) methods enable models to generalize to unseen domains, most assume that the training and testing data share the same label space. In real-world scenarios, unseen domains often introduce previously unknown objects, a challenge that has been largely overlooked. In this paper, we tackle the practical problem of Single-domain Generalizable Open-Set Object Detection (SG-OSOD), which addresses both unseen domains and unknown classes. We identify two key challenges: (1) detecting unknown classes with only known-class data, and (2) learning robust features to mitigate domain shift. To address these challenges, we propose the framework termed $\texttt{ASGS}$, which leverages adaptive subgraph structures to enhance the understanding of unknown scenes and classes. $\texttt{ASGS}$ consists of Subgraph-wise Unknown-class Learning (SUL) and Class-wise Embedding Compaction (CEC). SUL employs non-parametric methods to detect unknown samples and performs Adaptive Subgraph Searching (ASS) for high-order structural feature extraction, enabling domain-robust unknown class learning. Moreover, the CEC module enhances class discrimination robustness through contrastive learning, which results in more compact class clusters in unknown scenarios. Experimental results demonstrate the effectiveness of the proposed $\texttt{ASGS}$.
Poster
Yuhao Wang · Wei Xi
[ Exhibit Hall I ]
Abstract
Convolutional neural networks (ConvNets) with large effective receptive field (ERF), still in their early stages, have demonstrated promising effectiveness while constrained by high parameters and FLOPs costs and disrupted asymptotically Gaussian distribution (AGD) of ERF. This paper proposes an alternative paradigm: rather than merely employing extremely large ERF, it is more effective and efficient to expand the ERF while maintaining AGD of ERF by proper combination of smaller kernels, such as $7\times{7}$, $9\times{9}$, $11\times{11}$. This paper introduces a Three-layer Receptive Field Aggregator and designs a Layer Operator as the fundamental operator from the perspective of receptive field. The ERF can be expanded to the level of existing large-kernel ConvNets through the stack of proposed modules while maintaining AGD of ERF. Using these designs, we propose a universal ConvNet, termed UniConvNet. Extensive experiments on ImageNet-1K, COCO2017, and ADE20K demonstrate that UniConvNet outperforms state-of-the-art CNNs and ViTs across various vision recognition tasks for both lightweight and large-scale models with comparable throughput. Surprisingly, UniConvNet-T achieves $84.2\%$ ImageNet top-1 accuracy with $30M$ parameters and $5.1G$ FLOPs. UniConvNet-XL also shows competitive scalability to big data and large models, acquiring $88.4\%$ top-1 accuracy on ImageNet and $56.9\%$ on COCO.
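A small worked calculation shows why combining smaller kernels is attractive. It assumes a plain sequential stack of stride-1, undilated convolutions, which is a simplification and not the paper's Three-layer Receptive Field Aggregator; it also computes the theoretical receptive field, not the effective receptive field (ERF) the abstract discusses. The function names are hypothetical.

```python
def stacked_receptive_field(kernel_sizes):
    """Theoretical receptive field of sequential stride-1, undilated convolutions.

    Each k x k layer widens the field by k - 1 pixels along each axis.
    """
    rf = 1
    for k in kernel_sizes:
        rf += k - 1
    return rf

def depthwise_weights_per_channel(kernel_sizes):
    """Weights per channel if every layer is a depthwise k x k convolution (assumed)."""
    return sum(k * k for k in kernel_sizes)

rf = stacked_receptive_field([7, 9, 11])                # 25
stacked = depthwise_weights_per_channel([7, 9, 11])     # 49 + 81 + 121 = 251
single = depthwise_weights_per_channel([25])            # 625
```

Under these assumptions, the 7, 9, 11 stack covers the same 25 x 25 window as a single 25 x 25 kernel with 251 instead of 625 depthwise weights per channel.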
Poster
Rakshith Madhavan · Federica Arrigoni
[ Exhibit Hall I ]
Abstract
The viewing graph is a compact tool to encode the geometry of multiple views: nodes represent uncalibrated cameras and edges represent fundamental matrices (when available). Most research focuses on theoretical analyses, exploring for which viewing graphs it is possible (in principle) to retrieve cameras from fundamental matrices, in the sense that the problem admits a unique solution for noiseless data. However, the practical task of recovering cameras from noisy fundamental matrices is still open, as available methods are limited to special graphs (such as those covered by triplets). In this paper, we develop the first method that can deal with the recovery of cameras from noisy fundamental matrices in a general viewing graph. Experimental results demonstrate the promise of the proposed approach on a variety of synthetic and real scenarios.
Poster
Alex Costanzino · Pierluigi Zama Ramirez · Luigi Lella · Matteo Ragaglia · Alessandro Oliva · Giuseppe Lisanti · Luigi Stefano
[ Exhibit Hall I ]
Abstract
We propose SiM3D, the first benchmark considering the integration of multiview and multimodal information for comprehensive 3D anomaly detection and segmentation (ADS) where the task is to produce a voxel-based Anomaly Volume. Moreover, SiM3D focuses on a scenario of high interest in manufacturing: single-instance anomaly detection, where only one object, either real or synthetic, is available for training. In this respect, SiM3D stands out as the first ADS benchmark that addresses the challenge of generalizing from synthetic training data to real test data. SiM3D includes a novel multimodal multiview dataset acquired using top-tier industrial sensors and robots. The dataset features multiview high-resolution images (12 ${\tt Mpx}$) and point clouds ($\sim$7M points) for 333 instances of eight types of objects, alongside a CAD model for each type. We also provide manually annotated 3D segmentation GTs for anomalous test samples. To establish reference baselines for the proposed multiview 3D ADS task, we adapt prominent singleview methods and assess their performance using novel metrics that operate on Anomaly Volumes.
Poster
Xiwei Xuan · Ziquan Deng · Kwan-Liu Ma
[ Exhibit Hall I ]
Abstract
Training-free open-vocabulary semantic segmentation (OVS) aims to segment images given a set of arbitrary textual categories without costly model fine-tuning. Existing solutions often explore attention mechanisms of pre-trained models, such as CLIP, or generate synthetic data and design complex retrieval processes to perform OVS. However, their performance is limited by the capability of reliant models or the suboptimal quality of reference sets. In this work, we investigate the largely overlooked data quality problem for this challenging dense scene understanding task, and identify that a high-quality reference set can significantly benefit training-free OVS. With this observation, we introduce a data-quality-oriented framework, comprising a data pipeline to construct a reference set with well-paired segment-text embeddings and a simple similarity-based retrieval to unveil the essential effect of data. Remarkably, extensive evaluations on ten benchmark datasets demonstrate that our method outperforms all existing training-free OVS approaches, highlighting the importance of data-centric design for advancing OVS without training.
Poster
Hallee Wong · Jose Javier Gonzalez Ortiz · John Guttag · Adrian Dalca
[ Exhibit Hall I ]
Abstract
Medical researchers and clinicians often need to perform novel segmentation tasks on a set of related images. Existing methods for segmenting a new dataset are either interactive, requiring substantial human effort for each image, or require an existing set of previously labeled images. We introduce a system, MultiverSeg, that enables practitioners to rapidly segment an entire new dataset without requiring access to any existing labeled data from that task or domain. Along with the image to segment, the model takes user interactions such as clicks, bounding boxes or scribbles as input, and predicts a segmentation. As the user segments more images, those images and segmentations become additional inputs to the model, providing context. As the context set of labeled images grows, the number of interactions required to segment each new image decreases. We demonstrate that MultiverSeg enables users to interactively segment new datasets efficiently, by amortizing the number of interactions per image to achieve an accurate segmentation. Compared to using a state-of-the-art interactive segmentation method, MultiverSeg reduced the total number of clicks by 40% and scribble steps by 29% to achieve 90% Dice on sets of images from unseen tasks. We will release code and model weights.
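For reference, the Dice score used as the 90% accuracy target above measures overlap between a predicted and a ground-truth mask. A minimal sketch, assuming binary masks flattened to 0/1 sequences; the function name `dice_score` and the convention of returning 1.0 for two empty masks are choices made here for illustration.

```python
def dice_score(pred, target):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) for flat binary (0/1) masks."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Two empty masks agree perfectly; this convention is an assumption.
    return 1.0 if total == 0 else 2.0 * intersection / total

score = dice_score([1, 1, 0, 0], [1, 0, 0, 0])  # 2 * 1 / (2 + 1)
```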
Poster
Xinyang Zhou · Fanyue Wei · Lixin Duan · Angela Yao · Wen Li
[ Exhibit Hall I ]
Abstract
Given a textual query along with a corresponding video, moment retrieval aims to localize the moments relevant to the query within the video. While commendable results have been demonstrated by existing transformer-based approaches, predicting the accurate temporal span of the target moment is still a major challenge. This paper reveals that a crucial reason stems from the spurious correlation between the text query and the moment context. Namely, the model makes predictions by overly associating queries with background frames rather than distinguishing target moments. To address this issue, we propose a dynamic learning approach for moment retrieval, where two strategies are designed to mitigate the spurious correlation. First, we introduce a novel video synthesis approach to construct a dynamic context for the queried moment, enabling the model to attend to the target moment of the corresponding query across dynamic backgrounds. Second, to alleviate the over-association with backgrounds, we enhance representations temporally by incorporating text-dynamics interaction, which encourages the model to align text with target moments through complementary dynamic representations. With the proposed method, our model significantly alleviates the spurious correlation issue in moment retrieval and establishes new state-of-the-art performance on two popular benchmarks, i.e., QVHighlights and Charades-STA. …
Poster
Soorena Salari · Arash Harirpoush · Hassan Rivaz · Yiming Xiao
[ Exhibit Hall I ]
Abstract
Anatomical landmark detection in medical images is essential for various clinical and research applications, including disease diagnosis and surgical planning. However, manual landmark annotation is time-consuming and requires significant expertise. Existing deep learning (DL) methods often require large amounts of well-annotated data, which are costly to acquire. In this paper, we introduce CABLD, a novel self-supervised DL framework for 3D brain landmark detection in unlabeled scans with varying contrasts by using only a single reference example. To achieve this, we employed an inter-subject landmark consistency loss with an image registration loss while introducing a 3D convolution-based contrast augmentation strategy to promote model generalization to new contrasts. Additionally, we utilize an adaptive mixed loss function to schedule the contributions of different sub-tasks for optimal outcomes. We demonstrate the proposed method with the intricate task of MRI-based 3D brain landmark detection. With comprehensive experiments on four diverse clinical and public datasets, including both T1w and T2w MRI scans at different MRI field strengths, we demonstrate that CABLD outperforms the state-of-the-art methods in terms of mean radial errors (MREs) and success detection rates (SDRs). Our framework provides a robust and accurate solution for anatomical landmark detection, reducing the need for extensively annotated datasets …
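The two evaluation metrics named above can be written down compactly. This is a generic sketch of their usual definitions, not the paper's evaluation code: landmarks are assumed to be 3D coordinates in a shared unit (for example millimetres), and the function names and the `radius` parameter are illustrative.

```python
import math

def mean_radial_error(pred, truth):
    """Mean Euclidean distance between predicted and ground-truth landmarks."""
    return sum(math.dist(p, t) for p, t in zip(pred, truth)) / len(pred)

def success_detection_rate(pred, truth, radius):
    """Fraction of landmarks predicted within `radius` of the ground truth."""
    hits = sum(math.dist(p, t) <= radius for p, t in zip(pred, truth))
    return hits / len(pred)

pred = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0)]
truth = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
mre = mean_radial_error(pred, truth)                    # (0 + 5) / 2
sdr = success_detection_rate(pred, truth, radius=2.0)   # 1 of 2 landmarks
```

A lower MRE and a higher SDR both indicate better localization; SDR is typically reported at several radii.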
Poster
Tinghan Yang · Md Ashiqur Rahman · Raymond A. Yeh
[ Exhibit Hall I ]
Abstract
Symmetry is one of the most fundamental geometric cues in computer vision, and detecting it has been an ongoing challenge. With the recent advances in vision-language models, i.e., CLIP, we investigate whether a pre-trained CLIP model can aid symmetry detection by leveraging the additional symmetry cues found in the natural image descriptions. We propose CLIPSym, which leverages CLIP's image and language encoders and a rotation-equivariant decoder based on a hybrid of Transformer and $G$-Convolution to detect rotation and reflection symmetries. To fully utilize CLIP's language encoder, we have developed a novel prompting technique called Semantic-Aware Prompt Grouping (SAPG), which aggregates a diverse set of frequent object-based prompts to better integrate the semantic cues for symmetry detection. Empirically, we show that CLIPSym outperforms the current state-of-the-art on three standard symmetry detection datasets (DENDI, SDRW, and LDRS). Finally, we conduct detailed ablations verifying the benefits of CLIP's pre-training, the proposed equivariant decoder, and the SAPG technique.
Poster
Haiwen Diao · Xiaotong Li · Yufeng Cui · Yueze Wang · Haoge Deng · Ting Pan · Wenxuan Wang · Huchuan Lu · Xinlong Wang
[ Exhibit Hall I ]
Abstract
Existing encoder-free vision-language models (VLMs) are rapidly narrowing the performance gap with their encoder-based counterparts, highlighting the promising potential for unified multimodal systems with structural simplicity and efficient deployment. We systematically clarify the performance gap between VLMs using pre-trained vision encoders, discrete tokenizers, and minimalist visual layers from scratch, deeply excavating the under-examined characteristics of encoder-free VLMs. We develop efficient strategies for encoder-free VLMs that rival mainstream encoder-based ones. After an in-depth investigation, we launch EVEv2.0, a new and improved family of encoder-free VLMs. We show that: (i) Properly decomposing and hierarchically associating vision and language within a unified model reduces interference between modalities. (ii) A well-designed training strategy enables effective optimization for encoder-free VLMs. Through extensive evaluation, our EVEv2.0 represents a thorough study for developing a decoder-only architecture across modalities, demonstrating superior data efficiency and strong vision-reasoning capability.
Poster
Zhichuan Wang · Yang Zhou · Zhe Liu · Rui Yu · Song Bai · Yulong Wang · Xinwei He · Xiang Bai
[ Exhibit Hall I ]
Abstract
Open-set 3D object retrieval (3DOR) is an emerging task aiming to retrieve 3D objects of unseen categories beyond the training set. Existing methods typically utilize all modalities (i.e., voxels, point clouds, multi-view images) and train specific backbones before fusion. However, they still struggle to produce generalized representations due to insufficient 3D training data. Being contrastively pre-trained on web-scale image-text pairs, CLIP inherently produces generalized representations for a wide range of downstream tasks. Building upon it, we present a simple yet effective framework named Describe, Adapt and Combine (DAC) by taking only multi-view images for open-set 3DOR. DAC innovatively synergizes a CLIP model with a multi-modal large language model (MLLM) to learn generalized 3D representations, where the MLLM is used for dual purposes. First, it describes the seen category information to align with CLIP's training objective for adaptation during training. Second, it provides external hints about unknown objects complementary to visual cues during inference. To improve the synergy, we introduce an Additive-Bias Low-Rank adaptation (AB-LoRA), which alleviates overfitting and further enhances the generalization to unseen categories. With only multi-view images, DAC significantly surpasses prior art by an average of +10.01% mAP on four open-set 3DOR datasets. Moreover, its generalization is also …
Poster
Junhao Dong · Piotr Koniusz · Liaoyuan Feng · Yifei Zhang · Hao Zhu · Weiming Liu · Xinghua Qu · Yew-Soon Ong
[ Exhibit Hall I ]
Abstract
Vision-Language Models (VLMs) enjoy superb zero-shot performance but are vulnerable to adversarial attacks posing security risks. Adversarially robust fine-tuning enhances zero-shot robustness on new datasets while preserving the natural performance of pre-trained VLMs. However, prior methods use sample-wise adversarial fine-tuning, neglecting the underlying second-order statistics that represent entire groups of samples. This leads to a feature-level discrepancy between clean and adversarial samples of their augmented variants. Thus, we propose to represent groups of samples as subspaces to capture distributions and turn the traditional sample-wise adversarial fine-tuning into its distributional counterpart. For each image, we build distributions from (i) a clean sample with its augmentations and (ii) their adversarial counterparts. For text, we build distributions from (iii) a clean prompt and its synonymous prompts and (iv) their adversarial counterparts. We then perform alignment between image and text subspaces, and "adversarial" subspaces are also aligned toward "clean" subspaces. Thus, all samples underlying these distributions (think infinite number) also get aligned, leading to generalizable robustness. Evaluations on 15 datasets are provided.
Poster
Jungeun Kim · Hyeongwoo Jeon · Jongseong Bae · Ha Young Kim
[ Exhibit Hall I ]
Abstract
Sign language translation (SLT) is a challenging task that involves translating sign language images into spoken language. For SLT models to perform this task successfully, they must bridge the modality gap and identify subtle variations in sign language components to understand their meanings accurately. To address these challenges, we propose a novel gloss-free SLT framework called Multimodal Sign Language Translation (MMSLT), which leverages the representational capabilities of off-the-shelf multimodal large language models (MLLMs). Specifically, we use MLLMs to generate detailed textual descriptions of sign language components. Then, through our proposed multimodal-language pre-training module, we integrate these description features with sign video features to align them within the spoken sentence space. Our approach achieves state-of-the-art performance on benchmark datasets PHOENIX14T and CSL-Daily, highlighting the potential of MLLMs to be utilized effectively in SLT.
Poster
Haoji Zhang · Yiqin Wang · Yansong Tang · Yong Liu · Jiashi Feng · Xiaojie Jin
[ Exhibit Hall I ]
Abstract
Benefiting from the advances in large language models and cross-modal alignment, existing multimodal large language models have achieved prominent performance in image and short video understanding. However, the understanding of long videos is still challenging, as their long-context nature results in significant computational and memory overhead. Most existing work treats long videos in the same way as short videos, which is not efficient enough for real-world applications and is difficult to generalize to even longer videos. To address these issues, we propose Flash-VStream, an efficient video language model capable of processing extremely long videos and responding to user queries in real time. Particularly, we design a Flash Memory module, containing a low-capacity context synopsis memory to aggregate long-context temporal information and model the distribution of information density, and a high-capacity detail augmentation memory to retrieve detailed spatial information based on this distribution. Compared to existing models, Flash-VStream achieves significant reductions in inference latency. Extensive experiments on long video benchmarks and comprehensive video benchmarks, i.e., EgoSchema, MLVU, LVBench, MVBench and Video-MME, demonstrate the state-of-the-art performance and outstanding efficiency of our method. All code, models, and datasets will be made publicly available.
Poster
Junqi Ge · Ziyi Chen · Jintao Lin · Jinguo Zhu · Xihui Liu · Jifeng Dai · Xizhou Zhu
[ Exhibit Hall I ]
Abstract
Vision-Language Models (VLMs) have shown promising capabilities in handling various multimodal tasks, yet they struggle in long-context scenarios, particularly tasks involving videos, high-resolution images, or lengthy image-text documents. In our work, we first conduct an empirical analysis of VLMs' long-context capabilities using our augmented long-context multimodal datasets. Our findings reveal that directly applying the positional encoding mechanism used for textual tokens to visual tokens is suboptimal, and VLM performance degrades sharply when the position encoding exceeds the model's context window. To address this, we propose Variable Visual Position Encoding (V2PE), a novel positional encoding approach that employs variable and smaller increments for visual tokens, enabling more efficient management of long multimodal sequences. Our experiments demonstrate the effectiveness of V2PE in enhancing VLMs' ability to effectively understand and reason over long multimodal contexts. We further integrate V2PE with our augmented long-context multimodal datasets to fine-tune the open-source VLMs. The fine-tuned model achieves strong performance on both standard and long-context multimodal tasks. Notably, when the sequence length of the training dataset is increased to 256K tokens, the model is capable of processing multimodal sequences up to 1M tokens, highlighting its potential for real-world long-context applications. We shall release the code, model weights, …
Poster
Runhao Zeng · Jiaqi Mao · Minghao Lai · Vu Phan · Yanjie Dong · Wei Wang · Qi Chen · Xiping Hu
[ Exhibit Hall I ]
Abstract
The video grounding (VG) task focuses on locating specific moments in a video based on a query, usually in text form. However, traditional VG struggles with scenarios such as streaming video or queries that use visual cues. To fill this gap, we present a new task named Online Video Grounding with Hybrid-modal Queries (OVG-HQ), which enables online segment localization using text, images, video segments, and their combinations. This task poses two new challenges: limited context in online settings and modality imbalance during training, where dominant modalities overshadow weaker ones. To address these, we propose OVG-HQ-Unify, a unified framework featuring a Parametric Memory Block (PMB) that uses neural network parameters to dynamically retain past context and a cross-modal distillation strategy that guides the learning of non-dominant modalities. This design enables a single model to effectively handle hybrid-modal queries. Due to the lack of suitable datasets, we construct QVHighlights-Unify, an expanded dataset with multi-modal queries. Besides, since offline metrics overlook prediction timeliness, we adapt them to the online setting, introducing oR@$n$, IoU=$m$, and online mean Average Precision (omAP) to evaluate both accuracy and efficiency. Experiments show that our OVG-HQ-Unify outperforms existing models, offering a robust solution for online, hybrid-modal video grounding. We will release …
Poster
Shiwei Zhang · Qi Zhou · Wei Ke
[ Exhibit Hall I ]
Abstract
Text-guided zero-shot object counting leverages vision-language models (VLMs) to count objects of an arbitrary class given by a text prompt. Existing approaches for this challenging task only utilize local patch-level features to fuse with the text feature, ignoring the important influence of the global image-level feature. In this paper, we propose a universal strategy that can exploit both local patch-level features and the global image-level feature simultaneously. Specifically, to improve the localization ability of VLMs, we propose Text-guided Local Ranking. Based on the prior knowledge that foreground patches have higher similarity with the text prompt, a new local-text rank loss is designed to increase the differences between the similarity scores of foreground and background patches, which pushes foreground and background patches apart. To enhance the counting ability of VLMs, Number-evoked Global Attention is introduced to first align the global image-level feature with multiple number-conditioned text prompts. Then, the one with the highest similarity is selected to compute cross-attention with the global image-level feature. Through extensive experiments on widely used datasets and methods, the proposed approach has demonstrated superior advancements in performance, generalization, and scalability. Furthermore, to better evaluate text-guided zero-shot object counting methods, we propose a dataset named ZSC-8K, which is larger and …
Poster
Yichi Zhang · Le Xue · Wenbo Zhang · Lanlan Li · Yuchen Liu · Chen Jiang · Yuan Cheng · Yuan Qi
[ Exhibit Hall I ]
Abstract
Positron Emission Tomography (PET) is a powerful molecular imaging tool that plays a crucial role in modern medical diagnostics by visualizing radio-tracer distribution to reveal physiological processes. Accurate organ segmentation from PET images is essential for comprehensive multi-systemic analysis of interactions between different organs and pathologies. Existing segmentation methods are limited by insufficient annotation data and varying levels of annotation, resulting in weak generalization ability and difficulty in clinical application. Recent developments in segmentation foundation models have shown superior versatility across diverse segmentation tasks. Despite efforts at medical adaptation, these works primarily focus on structural medical images with detailed physiological structural information and exhibit limited generalization performance on molecular PET imaging. In this paper, we collect and construct PETS-5k, the largest PET segmentation dataset to date, comprising 5,731 three-dimensional whole-body PET images and encompassing over 1.3M 2D images. Based on the established dataset, we develop SegAnyPET, a modality-specific 3D foundation model for universal promptable segmentation from PET images. To address the challenge of discrepant annotation quality, we adopt a cross prompting confident learning (CPCL) strategy with an uncertainty-guided self-rectification process to robustly learn segmentation from high-quality labeled data and low-quality noisy labeled data for promptable segmentation. Experimental results demonstrate …
Poster
Jeongmin Yu · Susang Kim · Kisu Lee · Taekyoung Kwon · Won-Yong Shin · Ha Young Kim
[ Exhibit Hall I ]
Abstract
Recent face anti-spoofing (FAS) methods have shown remarkable cross-domain performance by employing vision-language models like CLIP. However, existing CLIP-based FAS models do not fully exploit CLIP’s patch embedding tokens, failing to detect critical spoofing clues. Moreover, these models rely on a single text prompt per class (e.g., 'live' or 'fake'), which limits generalization. To address these issues, we propose MVP-FAS, a novel framework incorporating two key modules: Multi-View Slot attention (MVS) and Multi-Text Patch Alignment (MTPA). Both modules utilize multiple paraphrased texts to generate generalized features and reduce dependence on domain-specific text. MVS extracts local detailed spatial features and global context from patch embeddings by leveraging diverse texts with multiple perspectives. MTPA aligns patches with multiple text representations to improve semantic robustness. Extensive experiments demonstrate that MVP-FAS achieves superior generalization performance, outperforming previous state-of-the-art methods on cross-domain datasets.
Poster
Wenxuan Zhu · Bing Li · Cheng Zheng · Jinjie Mai · Jun Chen · Letian Jiang · Abdullah Hamdi · Sara Rojas Martinez · Chia-Wen Lin · Mohamed Elhoseiny · Bernard Ghanem
[ Exhibit Hall I ]
Abstract
Multimodal Large Language Models (MLLMs) have demonstrated impressive 2D image/video understanding capabilities. However, there are no publicly standardized benchmarks to assess the abilities of MLLMs in understanding 4D objects. In this paper, we introduce 4D-Bench, the first benchmark to evaluate the capabilities of MLLMs in 4D object understanding, featuring tasks in 4D object Question Answering (4D object QA) and 4D object captioning. 4D-Bench provides 4D objects with diverse categories, high-quality annotations, and tasks necessitating multi-view spatial-temporal understanding, different from existing 2D image/video-based benchmarks. With 4D-Bench, we evaluate a wide range of open-source and closed-source MLLMs. The results from the 4D object captioning experiment indicate that MLLMs generally exhibit weaker temporal understanding compared to their appearance understanding. Notably, while open-source models approach closed-source performance in appearance understanding, they show larger performance gaps in temporal understanding. 4D object QA yields surprising findings: even with simple single-object videos, MLLMs perform poorly, with state-of-the-art GPT-4o achieving only 63% accuracy compared to the human baseline of 91%. These findings highlight a substantial gap in 4D object understanding and the need for further advancements in MLLMs.
Poster
Xiao Liang · Di Wang · Zhicheng Jiao · Ronghan Li · Pengfei Yang · Quan Wang · Tat-Seng Chua
[ Exhibit Hall I ]
Abstract
The rapid advancements in Vision Language Models (VLMs) have prompted the development of multi-modal medical assistant systems. Despite this progress, current models still have inherent probabilistic uncertainties, often producing erroneous or unverified responses—an issue with serious implications in medical applications. Existing methods aim to enhance the performance of Medical Vision Language Model (MedVLM) by adjusting model structure, fine-tuning with high-quality data, or through preference fine-tuning. However, these training-dependent strategies are costly and still lack sufficient alignment with clinical expertise. To address these issues, we propose an expert-in-the-loop framework named Expert-Controlled Classifier-Free Guidance (Expert-CFG) to align MedVLM with clinical expertise without additional training. This framework introduces an uncertainty estimation strategy to identify unreliable outputs. It then retrieves relevant references to assist experts in highlighting key terms and applies classifier-free guidance to refine the token embeddings of MedVLM, ensuring that the adjusted outputs are correct and align with expert highlights. Evaluations across three medical visual question answering benchmarks demonstrate that the proposed Expert-CFG, with 4.2B parameters and limited expert annotations, outperforms state-of-the-art models with 13B parameters. The results demonstrate the feasibility of deploying such a system in resource-limited settings for clinical use. The anonymous link to our project can be found in …
Poster
Jie Liu · Jiayi Shen · Pan Zhou · Jan-Jakob Sonke · Stratis Gavves
[ Exhibit Hall I ]
Abstract
Generalized Few-Shot Semantic Segmentation (GFSS) aims to extend a segmentation model to novel classes with only a few annotated examples while maintaining performance on base classes. Recently, pretrained vision-language models (VLMs) such as CLIP have been leveraged in GFSS to improve generalization on novel classes through multi-modal prototype learning. However, existing prototype-based methods are inherently deterministic, limiting the adaptability of learned prototypes to diverse samples, particularly for novel classes with scarce annotations. To address this, we propose the Probabilistic Prototype Calibration Network (PPCN), a probabilistic modeling framework over multi-modal prototypes from the pretrained CLIP, thus providing more adaptive prototype learning for GFSS. Specifically, PPCN first introduces a prototype calibration mechanism, which refines frozen textual prototypes with learnable visual calibration prototypes, leading to a more discriminative and adaptive representation. Furthermore, unlike deterministic prototype learning techniques, PPCN introduces distribution regularization over these calibration prototypes. This probabilistic formulation ensures structured and uncertainty-aware prototype learning, effectively mitigating overfitting to limited novel class data while enhancing generalization. Extensive experimental results on the PASCAL-5$^i$ and COCO-20$^i$ datasets demonstrate that our proposed PPCN significantly outperforms state-of-the-art approaches across both GFSS and class-incremental settings. The source code will be released publicly.
Poster
Xu Zheng · Yuanhuiyi Lyu · Lutao Jiang · Danda Pani Paudel · Luc Gool · Xuming Hu
[ Exhibit Hall I ]
Abstract
Fusing and balancing multi-modal inputs from novel sensors for dense prediction tasks, particularly semantic segmentation, is critically important yet remains a significant challenge. One major limitation is the tendency of multi-modal frameworks to over-rely on easily learnable modalities, a phenomenon referred to as unimodal dominance or bias. This issue becomes especially problematic in real-world scenarios where the dominant modality may be unavailable, resulting in severe performance degradation. To this end, we apply a simple but effective plug-and-play regularization term based on functional entropy, which introduces no additional parameters or modules. This term is designed to intuitively balance the contribution of each visual modality to the segmentation results. Specifically, we leverage the log-Sobolev inequality to bound functional entropy using functional Fisher information. By maximizing the information contributed by each visual modality, our approach mitigates unimodal dominance and establishes a more balanced and robust segmentation framework. A multi-scale regularization module is proposed to apply our plug-and-play term to both high-level features and segmentation predictions for more balanced multi-modal learning. Extensive experiments on three datasets demonstrate that our proposed method achieves superior performance, i.e., +13.94%, +3.25% and +3.64%, without introducing any additional parameters.
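For readers unfamiliar with the bound mentioned in this abstract, the following is the textbook Gaussian form of the log-Sobolev inequality, written for a non-negative function $f$ and the standard Gaussian measure $\mu$. The abstract does not state which measure, constants, or function the authors use, so the symbols $f$ and $\mu$ here are assumptions for illustration only.

$$
\mathrm{Ent}_\mu(f) \;=\; \int f \log f \, d\mu \;-\; \Big(\int f \, d\mu\Big) \log\Big(\int f \, d\mu\Big),
\qquad
\mathrm{Ent}_\mu(f) \;\le\; \frac{1}{2} \int \frac{\lVert \nabla f \rVert^2}{f} \, d\mu .
$$

The right-hand integral is the functional Fisher information of $f$. Since it upper-bounds the functional entropy, it gives a differentiable quantity that can stand in for the entropy in a regularization term.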
Poster
Akshat Ramachandran · Mingyu Lee · Huan Xu · Souvik Kundu · Tushar Krishna
[ Exhibit Hall I ]
Abstract
We present OuroMamba, the first data-free post-training quantization (DFQ) method for vision Mamba-based models (VMMs). We identify two key challenges in enabling DFQ for VMMs: (1) VMMs' recurrent state transitions restrict the capture of long-range interactions and lead to semantically weak synthetic data, and (2) VMM activations exhibit dynamic outlier variations across time-steps, rendering existing static PTQ techniques ineffective. To address these challenges, OuroMamba presents a two-stage framework: (1) OuroMamba-Gen, which generates semantically rich and meaningful synthetic data by applying contrastive learning on patch-level VMM features generated through neighborhood interactions in the latent state space, and (2) OuroMamba-Quant, which employs mixed-precision quantization with lightweight dynamic outlier detection during inference. Specifically, we present a thresholding-based outlier channel selection strategy for activations that is updated every time-step. Extensive experiments across vision and generative tasks show that our data-free OuroMamba surpasses existing data-driven PTQ techniques, achieving state-of-the-art performance across diverse quantization settings. Additionally, we implement efficient GPU kernels to achieve practical latency speedup of up to 2.36x. Code and synthetic dataset will be released upon acceptance.
Poster
Sabbir Ahmed · Jingtao Li · Weiming Zhuang · Chen Chen · Lingjuan Lyu
[ Exhibit Hall I ]
Abstract
Vision transformers (ViTs) have become widely popular due to their strong performance across various computer vision tasks. However, deploying ViTs on edge devices remains a persistent challenge due to their high computational demands, primarily caused by the overuse of self-attention layers with quadratic complexity together with the resource-intensive softmax operation. To resolve this challenge, linear self-attention has emerged as an efficient alternative. Nonetheless, current linear attention methods experience considerable performance degradation compared to softmax-based quadratic attention. Hence, we propose MixA, a novel mixed attention approach that enhances the efficiency of ViT models while maintaining comparable performance to softmax-based quadratic attention. MixA takes a pretrained ViT model, analyzes the significance of each attention layer, and selectively applies ReLU-based quadratic attention in the critical layers to ensure high model performance. To enhance efficiency, MixA selects the less critical layers and replaces them with our novel ReLU-based linear attention module called Stable Lightweight Linear Attention (SteLLA). SteLLA utilizes theoretically motivated normalization terms that improve the stability of prior ReLU-based linear attention, resulting in better performance (see Figure 1) while achieving significant speedup compared to softmax-based quadratic attention (see Figure 2). Experiments conducted on three benchmark vision tasks show that MixA …
Poster
Weiming Ren · Wentao Ma · Huan Yang · Cong Wei · Ge Zhang · Wenhu Chen
[ Exhibit Hall I ]
Abstract
State-of-the-art transformer-based large multimodal models (LMMs) struggle to handle hour-long video inputs due to the quadratic complexity of the causal self-attention operations, leading to high computational costs during training and inference. Existing token compression-based methods reduce the number of video tokens but often incur information loss and remain inefficient for extremely long sequences. In this paper, we explore an orthogonal direction to build a hybrid Mamba-Transformer model (VAMBA) that employs Mamba-2 blocks to encode video tokens with linear complexity. Without any token reduction, VAMBA can encode more than 1024 frames (640$\times$360) on a single GPU, while transformer-based models can only encode 256 frames. On long video input, VAMBA achieves at least 50% reduction in GPU memory usage during training and inference, and nearly doubles the speed per training step compared to transformer-based LMMs. Our experimental results demonstrate that VAMBA improves accuracy by 4.6% on the challenging hour-long video understanding benchmark LVBench over prior efficient video LMMs, and maintains strong performance on a broad spectrum of long and short video understanding tasks. Our code and model will be fully released to facilitate open research.
Poster
Shuhang Chen · Hangjie Yuan · Pengwei Liu · Hanxue Gu · Tao Feng · Dong Ni
[ Exhibit Hall I ]
Abstract
The Segment Anything Model (SAM) has demonstrated significant potential in medical image segmentation, yet its performance is limited when only a small amount of labeled data is available, even though medical data contains an abundance of valuable yet often overlooked hierarchical information. To address this limitation, we draw inspiration from self-supervised learning and propose SAMora, an innovative framework that captures hierarchical medical knowledge by applying complementary self-supervised learning objectives at the image, patch, and pixel levels. To fully exploit the complementarity of hierarchical knowledge within LoRAs, we introduce HL-Attn, a hierarchical fusion module that integrates multi-scale features while maintaining their distinct characteristics. SAMora is compatible with various SAM variants, including SAM2, SAMed and H-SAM. Experimental results on the Synapse, LA, and PROMISE12 datasets demonstrate that SAMora outperforms existing SAM variants, achieving state-of-the-art performance in both few-shot and fully-supervised settings, while reducing fine-tuning epochs by 90%.
Poster
Tao Gong · Qi Chu · Bin Liu · Zhou Wei · Nenghai Yu
[ Exhibit Hall I ]
Abstract
Zero-shot anomaly detection (ZSAD) requires detection models trained using auxiliary data to detect anomalies without any training sample in a target dataset. It is challenging since the models need to generalize to anomalies across different domains. Recently, CLIP-based anomaly detection methods, such as WinCLIP and AnomalyCLIP, have demonstrated superior performance in the ZSAD task, due to the strong zero-shot recognition of the CLIP model. However, they overlook the utilization of frequency information of images. In this paper, we find that frequency information could benefit the ZSAD task, since some properties of the anomaly area, such as appearance defects, can also be reflected in its frequency information. To this end, we propose Frequency Enhanced CLIP (FE-CLIP), which takes advantage of two different but complementary frequency-aware clues, (1) a Frequency-aware Feature Extraction adapter and (2) a Local Frequency Statistics adapter, in the visual encoder of CLIP, to deeply mine frequency information for the ZSAD task. We apply DCT as the frequency-domain transformation. Through comprehensive experiments, we show that the proposed FE-CLIP has good generalization across different domains and achieves superior zero-shot performance of detecting and segmenting anomalies in 10 datasets of highly diverse class semantics from various defect inspections and medical domains. Besides, the …
Poster
Kanoko Goto · Takumi Hirose · Mahiro Ukai · Shuhei Kurita · Nakamasa Inoue
[ Exhibit Hall I ]
Abstract
Referring expression comprehension (REC) aims to localize the target object described by a natural language expression. Recent advances in vision-language learning have led to significant performance improvements in REC tasks. However, localizing extremely small objects remains a considerable challenge despite its importance in real-world applications such as autonomous driving. To address this issue, we introduce a novel dataset and method for REC targeting small objects. First, we present the small object REC (SOREC) dataset, which consists of 100,000 pairs of referring expressions and corresponding bounding boxes for small objects in driving scenarios. Second, we propose the progressive-iterative zooming adapter (PIZA), an adapter module for parameter-efficient fine-tuning that enables models to progressively zoom in and localize small objects. In a series of experiments, we apply PIZA to GroundingDINO and demonstrate a significant improvement in accuracy on the SOREC dataset. Our dataset, code and pre-trained models are provided in the supplementary material and will be publicly released.
Poster
Zhen Xing · Qi Dai · Zejia Weng · Zuxuan Wu · Yu-Gang Jiang
[ Exhibit Hall I ]
Abstract
Text-guided video prediction (TVP) involves predicting the motion of future frames from the initial frame according to an instruction, which has wide applications in virtual reality, robotics, and content creation. Previous TVP methods make significant breakthroughs by adapting Stable Diffusion for this task. However, they struggle with frame consistency and temporal stability primarily due to the limited scale of video datasets. We observe that pretrained Image2Video diffusion models possess good video dynamics priors but lack fine-grained textual control. Hence, transferring pretrained models to leverage their video dynamic priors while injecting fine-grained control to generate controllable videos is both a meaningful and challenging task. To achieve this, we introduce the Multi-Modal Large Language Model (MLLM) to predict future video states based on initial frames and text instructions. More specifically, we design a dual query transformer (DQFormer) architecture, which integrates the instructions and frames into the conditional embeddings for future frame prediction. Additionally, we develop Temporal and Spatial Adapters that can quickly transfer general video diffusion models to specific scenarios with minimal training costs. Experimental results show that our method significantly outperforms state-of-the-art techniques on four datasets: Something Something V2, Epic Kitchen-100, Bridge Data, and UCF-101. Notably, AID achieves 91.2% and 55.5% FVD improvements on …
Poster
Xinyue Hao · Li · Shreyank Gowda · Robert Fisher · Jonathan Huang · Anurag Arnab · Laura Sevilla-Lara
[ Exhibit Hall I ]
Abstract
Video understanding has made huge strides in recent years, relying largely on the power of transformers. As this architecture is notoriously expensive and video data is highly redundant, research into improving efficiency has become particularly relevant. Some creative solutions include token selection and merging. While most methods succeed in reducing the cost of the model and maintaining accuracy, an interesting pattern arises: most methods do not outperform the baseline of randomly discarding tokens. In this paper we take a closer look at this phenomenon and observe five principles of the nature of visual tokens. For example, we observe that the value of tokens follows a clear Pareto distribution where most tokens have remarkably low value, and just a few carry most of the perceptual information. We build on these and further insights to propose a lightweight video model, LITE, that can select a small number of tokens effectively, outperforming state-of-the-art methods and existing baselines across datasets (Kinetics-400 and Something-Something-V2) in the challenging trade-off of computation (GFLOPs) vs accuracy. Experiments also show that LITE generalizes across datasets and even other tasks without the need for retraining.
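The random-discard baseline mentioned in this abstract can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `random_token_drop`, the fixed seed, and the clip dimensions (8 frames of 196 patches) are assumptions, and each token is represented by its index rather than an embedding.

```python
import random


def random_token_drop(tokens, keep, seed=0):
    """Return `keep` tokens chosen uniformly at random, in their original order."""
    if not 0 < keep <= len(tokens):
        raise ValueError("keep must be between 1 and len(tokens)")
    rng = random.Random(seed)
    # Sample indices without replacement, then sort so temporal/spatial order survives.
    kept_indices = sorted(rng.sample(range(len(tokens)), keep))
    return [tokens[i] for i in kept_indices]


# Hypothetical clip: 8 frames x 196 patches; each token is stood in for by its index.
tokens = list(range(8 * 196))
kept = random_token_drop(tokens, keep=392)
print(len(kept))
```

Because the selection ignores token content entirely, any learned selection method that fails to beat it is adding cost without adding information, which is the comparison the abstract highlights.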
Poster
Leon Sick · Dominik Engel · Sebastian Hartwig · Pedro Hermosilla · Timo Ropinski
[ Exhibit Hall I ]
Abstract
Traditionally, algorithms that learn to segment object instances in 2D images have heavily relied on large amounts of human-annotated data. Only recently, novel approaches have emerged tackling this problem in an unsupervised fashion. Generally, these approaches first generate pseudo-masks and then train a class-agnostic detector. While such methods deliver the current state of the art, they often fail to correctly separate instances overlapping in 2D image space since only semantics are considered. To tackle this issue, we instead propose to cut the semantic masks in 3D to obtain the final 2D instances by utilizing a point cloud representation of the scene. Furthermore, we derive a Spatial Importance function, which we use to resharpen the semantics along the 3D borders of instances. Nevertheless, these pseudo-masks are still subject to mask ambiguity. To address this issue, we further propose to augment the training of a class-agnostic detector with three Spatial Confidence components aiming to isolate a clean learning signal. With these contributions, our approach outperforms competing methods across multiple standard benchmarks for unsupervised instance segmentation and object detection.
Poster
Yun Wang · Longguang Wang · Chenghao Zhang · Yongjian Zhang · Zhanjie Zhang · Ao Ma · Chenyou Fan · Tin Lun Lam · Junjie Hu
[ Exhibit Hall I ]
Abstract
Recently, learning-based stereo matching networks have advanced significantly. However, they often lack robustness and struggle to achieve impressive cross-domain performance due to domain shifts and imbalanced disparity distributions among diverse datasets. Leveraging Vision Foundation Models (VFMs) can intuitively enhance the model's robustness, but integrating such models into stereo matching cost-effectively to fully realize their robustness remains a key challenge. To address this, we propose SMoEStereo, a novel framework that adapts VFMs for stereo matching through a tailored, scene-specific fusion of Low-Rank Adaptation (LoRA) and Mixture-of-Experts (MoE) modules. SMoEStereo introduces MoE-LoRA with adaptive ranks and MoE-Adapter with adaptive kernel sizes. The former dynamically selects optimal experts within MoE to adapt to varying scenes across domains, while the latter injects inductive bias into frozen VFMs to improve geometric feature extraction. Importantly, to mitigate computational overhead, we further propose a lightweight decision network that selectively activates MoE modules based on input complexity, balancing efficiency with accuracy. Extensive experiments demonstrate that our method exhibits state-of-the-art cross-domain and joint generalization across multiple benchmarks without dataset-specific adaptation.
Poster
Yuchen Guan · Chong Sun · Canmiao Fu · Zhipeng Huang · Chun Yuan · Chen Li
[ Exhibit Hall I ]
Abstract
Recent advancements in multimodal vision models have highlighted limitations in late-stage feature fusion and suboptimal query selection for hybrid-prompt open-world segmentation, alongside constraints from caption-derived vocabularies. To address these challenges, we propose \modelName, a text-guided visual Prompt DINO framework featuring three key innovations. First, we introduce an early fusion mechanism that unifies text/visual prompts and backbone features at the initial encoding stage, enabling deeper cross-modal interactions to resolve semantic ambiguities. Second, we design order-aligned query selection for DETR-based architectures, explicitly optimizing the structural alignment between text and visual queries during decoding to enhance semantic-spatial consistency. Third, we develop a generative data engine powered by the \textit{\rapLongName (\rapName)} model, which synthesizes 0.5B diverse training instances through a dual-path cross-verification pipeline, reducing label noise by 80.5% compared to conventional approaches. Extensive experiments demonstrate that \modelName achieves state-of-the-art performance on open-world detection benchmarks while significantly expanding semantic coverage beyond fixed-vocabulary constraints. Our work establishes a new paradigm for scalable multimodal detection and data generation in open-world scenarios. Data and code will be made available.
Poster
Kaisi Guan · Zhengfeng Lai · Yuchong Sun · Peng Zhang · Wei Liu · Xiaojiang Liu · Meng Cao · Ruihua Song
[ Exhibit Hall I ]
Abstract
Precisely evaluating semantic alignment between text prompts and generated videos remains a challenge in Text-to-Video (T2V) generation. Existing text-to-video alignment metrics such as CLIPScore only generate coarse-grained scores without fine-grained alignment details, failing to align with human preference. To address this limitation, we propose ETVA, a novel Evaluation method of Text-to-Video Alignment via fine-grained question generation and answering. First, a multi-agent system parses prompts into semantic scene graphs to generate atomic questions. Then we design a knowledge-augmented multi-stage reasoning framework for question answering, where an auxiliary LLM first retrieves relevant common-sense knowledge (e.g., physical laws), and then a video LLM answers the generated questions through a multi-stage reasoning mechanism. Extensive experiments demonstrate that ETVA achieves a Spearman's correlation coefficient of 58.47, showing a much higher correlation with human judgment than existing metrics, which attain only 31.0. We also construct a comprehensive benchmark specifically designed for text-to-video alignment evaluation, featuring 2k diverse prompts and 12k atomic questions spanning 10 categories. Through a systematic evaluation of 15 existing text-to-video models, we identify their key capabilities and limitations, paving the way for next-generation T2V generation. All code and datasets will be made publicly available soon.
Poster
Bo Liu · Ke Zou · Li-Ming Zhan · ZEXIN LU · Xiaoyu DONG · Chengqiang Xie · Yidi Chen · Jiannong Cao · Xiao-Ming Wu · Huazhu Fu
[ Exhibit Hall I ]
Abstract
Medical Visual Question Answering (Med-VQA) combines computer vision and natural language processing to automatically answer clinical inquiries about medical images. However, current Med-VQA datasets exhibit two significant limitations: (1) they often lack visual and textual explanations for answers, hindering comprehension for patients and junior doctors; (2) they typically offer a narrow range of question formats, inadequately reflecting the diverse requirements in practical scenarios. These limitations pose significant challenges to the development of a reliable and user-friendly Med-VQA system. To address these challenges, we introduce a large-scale, Groundable, and Explainable Medical VQA benchmark for chest X-ray diagnosis (GEMeX), featuring several innovative components: (1) a multi-modal explainability mechanism that offers detailed visual and textual explanations for each question-answer pair, thereby enhancing answer comprehensibility; (2) four question types (open-ended, closed-ended, single-choice, and multiple-choice) to better reflect practical needs. With 151,025 images and 1,605,575 questions, GEMeX is currently the largest chest X-ray VQA dataset. Evaluation of 12 representative large vision language models (LVLMs) on GEMeX reveals suboptimal performance, underscoring the dataset's complexity. Meanwhile, we propose a strong model by fine-tuning an existing LVLM on the GEMeX training set. The substantial performance improvement showcases the dataset's effectiveness. The benchmark is available at https://anonymous.4open.science/r/GEMeX.
Poster
Xianglin Qiu · Xiaoyang Wang · Zhen Zhang · Jimin XIAO
[ Exhibit Hall I ]
Abstract
Weakly supervised semantic segmentation (WSSS) aims to generate dense labels using sparse annotations, such as image-level labels. Existing class activation map (CAM) generation methods are able to roughly locate objects. However, due to the limited information provided by image-level labels, the bias activation problem, including over-activation, becomes another key obstacle in WSSS. To rectify such bias activation, we attempt to mine pixel-level class feature distribution information from the entire dataset. Specifically, we propose to use normalizing flow to model the class feature distribution of all pixels across the entire dataset and design a Bias-Resilient WSSS framework based on Normalizing Flow (BRNF). Normalizing flow has the ability to map complex distributions to normal distributions. Building upon it, we design an additional Gaussian mixture classifier which classifies pixels from the perspective of feature distributions, providing supplementary information to the conventional MLP-based classifier. In addition, we use this distribution to sample low-bias features as positive anchors for contrastive learning, thereby encouraging feature optimization toward the correct low-bias direction. Experimental results demonstrate that our method significantly outperforms existing baselines, achieving state-of-the-art performance on WSSS benchmarks. Code will be released soon.
Poster
Tim Elsner · Paula Usinger · Julius Nehring-Wirxel · Gregor Kobsik · Victor Czech · Yanjiang He · Isaak Lim · Leif Kobbelt
[ Exhibit Hall I ]
Abstract
In language processing, transformers benefit greatly from characters being condensed into word fragments, building outputs from a larger vocabulary of bigger pieces. This is often done with Byte Pair Encoding. In the context of images, tokenisation of visual data is usually limited to regular grids obtained from quantisation methods, not using such further abstraction of regions. Our work improves tokenisation of visual data by bringing Byte Pair Encoding from 1D to multiple dimensions, as a complementary add-on to existing compression. We achieve this through counting constellations of token pairs and replacing the most frequent token pair with a newly introduced token. Our approach only increases computation time by a factor of 2 for images, making it applicable even to large datasets like ImageNet within minutes on consumer hardware. This is a lossless preprocessing step. We further propose how networks can digest the new tokens that are no longer in a regular grid. Our evaluation shows improved training and inference performance of transformers on visual data achieved by compressing frequent constellations of tokens: The resulting sequences have more uniformly distributed information content, e.g. by condensing empty regions in an image into single tokens. As our experiments show, these condensed sequences are easier to …
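The pair-counting and replacement step described in this abstract can be illustrated on a small 2D token grid. The following is a minimal sketch under stated assumptions, not the authors' implementation: it considers only right and lower neighbours as "constellations", performs a single merge, and stores the grid as a dictionary keyed by position. The names `OFFSETS`, `count_pairs`, and `merge_most_frequent` are hypothetical.

```python
from collections import Counter

# Assumed constellations: right neighbour and lower neighbour only.
OFFSETS = [(0, 1), (1, 0)]

def count_pairs(cells):
    """Count (token, neighbour token, offset) triples over a sparse 2D grid."""
    counts = Counter()
    for (r, c), tok in cells.items():
        for dr, dc in OFFSETS:
            other = cells.get((r + dr, c + dc))
            if other is not None:
                counts[(tok, other, (dr, dc))] += 1
    return counts

def merge_most_frequent(cells, new_token):
    """Replace non-overlapping occurrences of the most frequent pair with new_token."""
    counts = count_pairs(cells)
    if not counts:
        return dict(cells), None
    (a, b, (dr, dc)), _ = counts.most_common(1)[0]
    merged = dict(cells)
    for (r, c) in sorted(cells):
        partner = (r + dr, c + dc)
        # Check the current state so already-merged cells are not reused.
        if merged.get((r, c)) == a and merged.get(partner) == b:
            merged[(r, c)] = new_token
            del merged[partner]
    return merged, (a, b, (dr, dc))

# Toy 2x4 grid: token 0 stands for an empty region.
grid = [[0, 0, 0, 0],
        [0, 0, 1, 2]]
cells = {(r, c): tok for r, row in enumerate(grid) for c, tok in enumerate(row)}
merged, pair = merge_most_frequent(cells, new_token=9)
```

In this toy grid the horizontal pair of empty tokens is the most frequent, so the eight original tokens shrink to five. Recording the merged pair and its offset keeps the step reversible, and the result no longer sits on a regular grid, which is why it is stored by position.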
Poster
Jaeseok Byun · Young Kyun Jang · Seokhyeon Jeong · Donghyun Kim · Taesup Moon
[ Exhibit Hall I ]
Abstract
Composed Image Retrieval (CIR) seeks to retrieve a target image by using a reference image and conditioning text specifying desired modifications. While recent approaches have shown steady performance improvements on existing CIR benchmarks, we argue that it remains unclear whether these gains genuinely reflect an enhanced compositional understanding of both visual and textual information. For example, current benchmarks do not explicitly consider negation cases and offer limited semantic diversity, with insufficient hard negatives to thoroughly evaluate the CIR task. To bridge this gap, we introduce Multimodal Arithmetic Benchmark for CIR (MA-CIR), a challenging CIR benchmark that integrates arithmetic types (negation, replacement, and addition) across seven complex semantic categories (e.g., spatial reasoning, object reasoning, etc.). Moreover, carefully constructed hard negatives are incorporated to assess models in a controlled setting. In MA-CIR, we observe that current CIR models struggle with negation (or replacement) arithmetic types and semantic types that require complex reasoning, indicating a potential reliance on object or entity information. To address this challenge, we propose leveraging strong text encoders, particularly those based on large language models (LLMs), in conjunction with carefully constructed text triplets that incorporate hard negatives to enhance compositional understanding. As a result, MA-CIR achieves a 14% gain while also improving R@1 on …
Poster
Xiwen Chen · Peijie Qiu · Wenhui Zhu · Hao Wang · Huayu Li · XUANZHAO DONG · Xiaotong Sun · Xiaobing Yu · Yalin Wang · Abolfazl Razi · Aristedis Sotiras
[ Exhibit Hall I ]
Abstract
While multiple instance learning (MIL) has been shown to be a promising approach for histopathological whole slide image (WSI) analysis, its reliance on permutation invariance significantly limits its capacity to effectively uncover semantic correlations between instances within WSIs. Based on our empirical and theoretical investigations, we argue that approaches that are not permutation-invariant but better capture spatial correlations between instances can offer more effective solutions. In light of these findings, we propose a novel alternative to existing MIL for WSI analysis by learning to restore the order of instances from their randomly shuffled arrangement. We term this task cracking an instance jigsaw puzzle, where semantic correlations between instances are uncovered. To tackle the instance jigsaw puzzles, we propose a novel Siamese network solution, which is theoretically justified by optimal transport theory. We validate the proposed method on WSI classification and survival prediction tasks, where the proposed method outperforms the recent state-of-the-art MIL competitors.
Poster
Guilian Chen · Huisi Wu · Jing Qin
[ Exhibit Hall I ]
Abstract
Automatic segmentation of polyps from colonoscopy videos is of great clinical significance as it can assist clinicians in making more accurate diagnoses and precise interventions. However, video polyp segmentation (VPS) poses significant challenges due to ambiguous boundaries between polyps and surrounding mucosal tissue, as well as variations in polyp scale, contrast, and position across consecutive frames. Moreover, to meet clinical requirements, the inference process must operate in real time to enable intraoperative tracking and guidance. In this paper, we propose a novel and efficient segmentation network, STDDNet, which integrates a spatial-aligned temporal modeling strategy and a discriminative dynamic representation learning mechanism, to comprehensively address these challenges by harnessing the advantages of Mamba. Specifically, a spatial-aligned temporal dependency propagation (STDP) module is developed to model temporal consistency from the consecutive frames based on a bidirectional scanning Mamba block. Furthermore, we design a discriminative dynamic feature extraction (DDFE) module to explore frame-wise dynamic information from the structural feature generated by the Mamba block. Such dynamic features can effectively deal with the variations across colonoscopy frames, providing more details for refined segmentation. We extensively evaluate STDDNet on two benchmark datasets, SUN-SEG and CVC-ClinicDB, demonstrating superior segmentation performance of our method over state-of-the-art methods while …
Poster
Seonghoon Yu · Junbeom Hong · Joonseok Lee · Jeany Son
[ Exhibit Hall I ]
Abstract
Visual grounding tasks, such as referring image segmentation (RIS) and referring expression comprehension (REC), aim to localize a target object based on a given textual description. The target object in an image can be described in multiple ways, reflecting diverse attributes such as color, position, and more. However, most existing methods rely on a single textual input, which captures only a fraction of the rich information available in the visual domain. This mismatch between rich visual details and sparse textual cues can lead to the misidentification of similar objects. To address this, we propose a novel visual grounding framework that $\textbf{leverages multiple latent expressions}$ generated from a single textual input by incorporating complementary visual details absent from the original description. Specifically, we introduce subject distributor and visual concept injector modules to embed both $\textbf{shared-subject and distinct-attributes}$ concepts into the latent representations, thereby capturing unique and target-specific visual cues. We also propose a positive-margin contrastive learning strategy to align all latent expressions with the original text while preserving subtle variations. Experimental results show that our method not only outperforms state-of-the-art RIS and REC approaches on multiple benchmarks but also achieves outstanding performance on the generalized referring expression segmentation (GRES) benchmark.
Poster
Weili Zeng · Ziyuan Huang · Kaixiang Ji · Yichao Yan
[ Exhibit Hall I ]
Abstract
Transformer-based models have driven significant advancements in Multimodal Large Language Models (MLLMs), yet their computational costs surge drastically when scaling resolution, training data, and model parameters. A key bottleneck stems from the proliferation of visual tokens required for fine-grained image understanding. We propose Skip-Vision, a unified framework addressing both training and inference inefficiencies in vision-language models. On top of conventional token compression approaches, our method introduces two complementary acceleration strategies. For training acceleration, we observe that Feed-Forward Network (FFN) computations on visual tokens induce marginal feature updates. This motivates our Skip-FFN strategy, which bypasses FFN layers for redundant visual tokens. For inference acceleration, we design a selective KV-cache removal mechanism that prunes the skipped key-value pairs during decoding while preserving model performance. Experimental results demonstrate that Skip-Vision reduces training time by up to 35%, inference FLOPs by 75%, and latency by 45%, while achieving comparable or superior performance to existing methods. Our work provides a practical solution for scaling high-performance MLLMs with enhanced efficiency.
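The idea of bypassing the FFN for selected tokens can be shown with a toy residual sub-layer. This is a minimal sketch under stated assumptions, not the Skip-Vision implementation: the abstract does not say how redundant visual tokens are identified, so the boolean `skip_mask` is simply given, and `block_ffn_with_skip` and `toy_ffn` are hypothetical names.

```python
def block_ffn_with_skip(tokens, skip_mask, ffn):
    """Residual FFN sub-layer in which flagged tokens take the identity path."""
    out = []
    for tok, skip in zip(tokens, skip_mask):
        if skip:
            # Skipped token: no FFN computation, features pass through unchanged.
            out.append(list(tok))
        else:
            update = ffn(tok)
            out.append([x + u for x, u in zip(tok, update)])
    return out

def toy_ffn(tok):
    """Stand-in for a real feed-forward network."""
    return [2.0 * x for x in tok]

tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
skip_mask = [False, True, True]  # assumed: one text token, two redundant visual tokens
result = block_ffn_with_skip(tokens, skip_mask, toy_ffn)
```

Only the first token is updated by the FFN here. The saving comes from never calling the FFN on the flagged tokens, which matters when visual tokens dominate the sequence.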
Poster
Ilan Naiman · Emanuel Baruch Baruch · Oron Anschel · Alon Shoshan · Igor Kviatkovsky · Manoj Aggarwal · Gerard Medioni
[ Exhibit Hall I ]
Abstract
In this work, we introduce long-video masked-embedding autoencoders (LV-MAE), a self-supervised learning framework for long video representation. Our approach treats short- and long-span dependencies as two separate tasks. Such decoupling allows for a more intuitive video processing where short-span spatiotemporal primitives are first encoded and are then used to capture long-range dependencies across consecutive video segments. To achieve this, we leverage advanced off-the-shelf multimodal encoders to extract representations from short segments within the long video, followed by pre-training a masked-embedding autoencoder capturing high-level interactions across segments. LV-MAE is highly efficient to train and enables the processing of much longer videos by alleviating the constraint on the number of input frames. Furthermore, unlike existing methods that typically pre-train on short-video datasets, our approach offers self-supervised pre-training using long video samples (e.g., 20+ minute video clips) at scale. Using LV-MAE representations, we achieve state-of-the-art results on three long-video benchmarks (LVU, COIN, and Breakfast), employing only a simple classification head for either attentive or linear probing. Finally, to assess LV-MAE pre-training and visualize its reconstruction quality, we leverage the video-language aligned space of short video representations to monitor LV-MAE through video-text retrieval. Our code will be made available upon publication.
Poster
Feixiang Wang · Shuang Yang · Shiguang Shan · Xilin Chen
[ Exhibit Hall I ]
Abstract
Audio-Visual Speech Enhancement (AVSE) leverages both audio and visual information to improve speech quality. Despite noisy real-world conditions, humans are generally able to perceive and interpret corrupted speech segments as clear. Research in cognitive science has shown how the brain merges auditory and visual inputs to achieve this. These studies uncover four key insights for AVSE, reflecting a hierarchical synergy of semantic and signal processes with visual cues enriching both levels: (1) Humans utilize high-level semantic context to reconstruct corrupted speech signals. (2) Visual cues are shown to strongly correlate with semantic information, enabling visual cues to facilitate semantic context modeling. (3) Visual appearance and vocal information jointly benefit identification, implying that visual cues strengthen low-level signal context modeling. (4) High-level semantic knowledge and low-level auditory processing operate concurrently, allowing the semantics to guide signal-level context modeling. Motivated by these insights, we propose CogCM, a cognition-inspired hierarchical contextual modeling framework. The CogCM framework includes three core modules: (1) A semantic context modeling module (SeCM) to capture high-level semantic context from both audio and visual modalities; (2) A signal context modeling module (SiCM) to model fine-grained temporal-spectral structures under multi-modal semantic context guidance; (3) A semantic-to-signal guidance module (SSGM) to leverage semantic context in guiding signal context modeling …
Poster
Lei Fan · Junjie Huang · Donglin Di · Anyang Su · Tianyou Song · Maurice Pagnucco · Yang Song
[ Exhibit Hall I ]
Abstract
For anomaly detection (AD), early approaches often train separate models for individual classes, yielding high performance but posing challenges in scalability and resource management. Recent efforts have shifted toward training a single model capable of handling multiple classes. However, directly extending early AD methods to multi-class settings often results in degraded performance. In this paper, we investigate this performance degradation observed in reconstruction-based methods, identifying the key issue: inter-class confusion. This confusion emerges when a model trained in multi-class scenarios incorrectly reconstructs samples from one class as another, thereby exacerbating reconstruction errors. To this end, we propose a simple yet effective modification, called class-aware contrastive learning (CCL). By explicitly leveraging raw object category information (e.g., carpet or wood) as supervised signals, we introduce local CL to refine multiscale dense features, and global CL to obtain more compact feature representations of normal patterns, thereby effectively adapting the models to multi-class settings. Experiments across four datasets (over 60 categories) validate the effectiveness of our approach, demonstrating significant improvements and superior performance compared to state-of-the-art methods. Notably, ablation studies indicate that pseudo-class labels can achieve comparable performance.
Poster
Qi Fan · Kaiqi Liu · Nian Liu · Hisham Cholakkal · Rao Anwer · Wenbin Li · Yang Gao
[ Exhibit Hall I ]
Abstract
Cross-domain few-shot segmentation (CD-FSS) aims to segment objects of novel classes in new domains, which is often challenging due to the diverse characteristics of target domains and the limited availability of support data. Most CD-FSS methods redesign and retrain in-domain FSS models using various domain-generalization techniques, which are effective but costly to train. To address these issues, we propose adapting informative model structures of the well-trained FSS model for target domains by learning domain characteristics from few-shot labeled support samples during inference, thereby eliminating the need for retraining. Specifically, we first adaptively identify domain-specific model structures by measuring parameter importance using a novel structure Fisher score in a data-dependent manner. Then, we progressively train the selected informative model structures with hierarchically constructed training samples, progressing from fewer to more support shots. The resulting Informative Structure Adaptation (ISA) method effectively addresses domain shifts and equips existing well-trained in-domain FSS models with flexible adaptation capabilities for new domains, eliminating the need to redesign or retrain CD-FSS models on base data. Extensive experiments validate the effectiveness of our method, demonstrating superior performance across multiple CD-FSS benchmarks.
Poster
Yuan Liu · Saihui Hou · Saijie Hou · Jiabao Du · Shibei Meng · Yongzhen Huang
[ Exhibit Hall I ]
Abstract
Image Difference Captioning (IDC) aims to generate natural language descriptions of subtle differences between image pairs, requiring both precise visual change localization and coherent semantic expression. Despite recent advancements, existing datasets often lack breadth and depth, limiting their applicability in complex and dynamic environments: (1) from a breadth perspective, current datasets are constrained to limited variations of objects in specific scenes, and (2) from a depth perspective, prior benchmarks often provide overly simplistic descriptions. To address these challenges, we introduce $\textbf{OmniDiff}$, a comprehensive dataset comprising 324 diverse scenarios—spanning real-world complex environments and 3D synthetic settings—with fine-grained human annotations averaging 60 words in length and covering 12 distinct change types. Building on this foundation, we propose $\textbf{M$^3$Diff}$, a $\textbf{M}$ulti$\textbf{M}$odal large language model enhanced by a plug-and-play $\textbf{M}$ulti-scale $\textbf{Diff}$erential Perception (MDP) module. This module improves the model's ability to accurately identify and describe inter-image differences while maintaining the foundational model's generalization capabilities. With the addition of the OmniDiff dataset, M$^3$Diff achieves state-of-the-art performance across multiple benchmarks, including Spot-the-Diff, IEdit, CLEVR-Change, CLEVR-DC, and OmniDiff, demonstrating significant improvements in cross-scenario difference recognition accuracy compared to existing methods. The dataset, code, and models will be made publicly available to support further research.
Poster
Tao Lei · Ziyao Yang · Xingwu wang · Yi Wang · Xuan Wang · FeimanSun FeimanSun · Asoke Nandi
[ Exhibit Hall I ]
Abstract
Existing semi-supervised learning methods typically mitigate the impact of unreliable predictions by suppressing low-confidence regions. However, these methods fail to explore which regions hold higher learning value and how to design adaptive learning strategies for these regions, thereby limiting the model's performance in critical areas. To address this issue, we propose a novel adaptive learning of high-value regions (ALHVR) framework. By exploiting the diversity of predictions from multi-branch networks, the prediction regions are classified into three types: reliable stable regions, reliable unstable regions, and unreliable stable regions. For high-value regions (reliable unstable regions and unreliable stable regions), different training strategies are designed. Specifically, for reliable unstable regions, we propose a confidence-guided cross-prototype consistency learning (CG-CPCL) module, which enforces prototype consistency constraints in the feature space. By leveraging confidence information, the high-confidence predictions from one network selectively supervise the low-confidence predictions of the other, thus helping the model learn inter-class discrimination more stably. Additionally, for unreliable stable regions, we design a dynamic teacher competition teaching (DTCT) module, which dynamically selects the most reliable pixels as teachers by evaluating the unperturbed predictions from both networks in real time. These selected pixels are then used to supervise perturbed predictions, thereby enhancing the model's learning …
Poster
Lingyu Chen · Yawen Zeng · Yue Wang · Peng Wan · Guo-chen Ning · Hongen Liao · Daoqiang Zhang · Fang Chen
[ Exhibit Hall I ]
Abstract
Conventional single-dataset training often fails with new data distributions, especially in ultrasound (US) image analysis due to limited data, acoustic shadows, and speckle noise. Therefore, constructing a universal framework for multiple heterogeneous US datasets is imperative. However, a key challenge arises: how to effectively mitigate inter-dataset interference while preserving dataset-specific discriminative features for robust downstream tasks? Previous approaches utilize either a single source-specific decoder or a domain adaptation strategy, but these methods suffer a decline in performance when applied to other domains. Considering this, we propose a Universal Collaborative Mixture of Heterogeneous Source-Specific Experts (COME). Specifically, COME establishes dual structure-semantic shared experts that create a universal representation space and then collaborate with source-specific experts to extract discriminative features by providing complementary features. This design enables robust generalization by leveraging cross-dataset experience distributions and providing universal US priors for small-batch or unseen data scenarios. Extensive experiments under three evaluation modes (single-dataset, intra-organ, and inter-organ integration datasets) demonstrate COME's superiority, achieving significant mean AP improvements over state-of-the-art methods.
Poster
George Ciubotariu · Zhuyun Zhou · Zongwei Wu · Radu Timofte
[ Exhibit Hall I ]
Abstract
We introduce MIORe and VAR-MIORe, novel multi-task datasets that address critical limitations in current benchmarks for motion restoration tasks. Our datasets capture a broad spectrum of motion scenarios—including complex ego-camera movements, dynamic multi-subject interactions, and depth-dependent blur effects—using high-frame-rate (1000 FPS) acquisition and professional-grade optics. By averaging variable numbers of frames based on computed optical flow metrics, MIORe generates consistent motion blur while preserving sharp inputs for video frame interpolation and optical flow estimation. VAR-MIORe further extends this framework by spanning a variable range of motion magnitudes, from minimal to extreme, establishing the first benchmark of its kind. Together, these datasets provide high-resolution, scalable ground truth that challenges existing algorithms under both controlled and adverse conditions, paving the way for next-generation research in non-uniform deblurring, video interpolation, and optical flow analysis.
Poster
Jiaxu Zhang · Xianfang Zeng · Xin Chen · Wei Zuo · Gang YU · Zhigang Tu
[ Exhibit Hall I ]
Abstract
We propose MikuDance, a diffusion-based pipeline incorporating mixed motion dynamics to animate stylized character art. MikuDance consists of two key techniques: Mixed Motion Modeling and Mixed-Control Diffusion, to address the challenges of high-dynamic motion and reference-guidance misalignment in character art animation. Specifically, a Scene Motion Tracking strategy is presented to explicitly model the dynamic camera in pixel-wise space, enabling unified character-scene motion modeling. Building on this, the Mixed-Control Diffusion implicitly aligns the scale and body shape of diverse characters with motion guidance, allowing flexible control of local character motion. Subsequently, a Motion-Adaptive Normalization module is incorporated to effectively inject global scene motion, paving the way for comprehensive character art animation. Through extensive experiments, we demonstrate the effectiveness and generalizability of MikuDance across various character art and motion guidance, consistently producing high-quality animations with remarkable motion dynamics.
Poster
Yasser Benigmim · Mohammad Fahes · Tuan-Hung Vu · Andrei Bursuc · Raoul de Charette
[ Exhibit Hall I ]
Abstract
Recent Open-Vocabulary Semantic Segmentation (OVSS) models extend the CLIP model to segmentation while maintaining the use of multiple templates (e.g., a photo of <class>, a sketch of a <class>, etc.) for constructing class-wise averaged text embeddings, acting as a classifier. In this paper, we challenge this status quo and investigate the impact of templates for OVSS. Empirically, we observe that for each class, there exist single-template classifiers significantly outperforming the conventional averaged classifier. We refer to them as class-experts. Given access to unlabeled images and without any training involved, we estimate these experts by leveraging the class-wise prediction entropy of single-template classifiers, selecting as class-wise experts those which yield the lowest entropy. All experts, each specializing in a specific class, collaborate in a newly proposed fusion method to generate more accurate OVSS predictions. Our plug-and-play method, coined FLOSS, is orthogonal and complementary to existing OVSS methods, offering a “free lunch” to systematically improve OVSS without labels and additional training. Extensive experiments demonstrate that FLOSS consistently boosts state-of-the-art methods on various OVSS benchmarks. Moreover, the selected expert templates can generalize well from one dataset to others sharing the same semantic categories, yet exhibiting distribution shifts. Additionally, we obtain satisfactory improvements under …
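The entropy-based expert selection described in this abstract can be sketched as follows. This is a hedged illustration, not the FLOSS code: it assumes that the "class-wise prediction entropy" of a template is the mean entropy over the unlabeled samples that this template assigns to the class, and it omits the fusion step, which the abstract does not detail. The function names and the toy probabilities are hypothetical.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(x * math.log(x) for x in p if x > 0.0)

def select_class_experts(probs_per_template, num_classes):
    """Pick, for each class, the template with the lowest class-wise entropy.

    probs_per_template[t][n] is the softmax output of the single-template
    classifier t on unlabeled sample (e.g. pixel) n.
    """
    experts = {}
    for c in range(num_classes):
        best_t, best_h = None, float("inf")
        for t, probs in enumerate(probs_per_template):
            # Assumption: only samples this template predicts as class c count.
            assigned = [p for p in probs
                        if max(range(num_classes), key=p.__getitem__) == c]
            if not assigned:
                continue
            h = sum(entropy(p) for p in assigned) / len(assigned)
            if h < best_h:
                best_t, best_h = t, h
        experts[c] = best_t
    return experts

# Two templates, two unlabeled samples, two classes (made-up numbers).
probs_per_template = [
    [[0.9, 0.1], [0.4, 0.6]],    # template 0: confident on class 0
    [[0.6, 0.4], [0.05, 0.95]],  # template 1: confident on class 1
]
experts = select_class_experts(probs_per_template, num_classes=2)
```

Here template 0 becomes the expert for class 0 and template 1 for class 1, since each is the more confident classifier on the samples it assigns to that class. No labels are used at any point.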
Poster
Weihao Yu · Xiaoqing Guo · Xinyu Liu · Yifan Liu · Hao Zheng · Yawen Huang · Yixuan Yuan
[ Exhibit Hall I ]
Abstract
Intraoperative 2D/3D registration, which aligns preoperative CT scans with intraoperative X-ray images, is critical for surgical navigation. However, existing methods require extensive preoperative training (several hours), making them unsuitable for emergency surgeries where minutes significantly impact patient outcomes. We present GaussianReg, a novel registration framework that achieves clinically acceptable accuracy within minutes of preprocessing. Unlike prior approaches that learn primarily from 2D projections, we explicitly utilize 3D information by representing CT volumes as sparse Gaussian primitives and propose an innovative ray-based registration approach. These primitives emit rays toward potential camera positions, creating a hypothesis space of viewpoints. The registration problem then reduces to identifying rays that best match the target X-ray through our cross-modality attention mechanism. We further introduce canonical ellipsoid ray parameterization for stable optimization, bipartite matching-based patch aggregation for computational efficiency, and network pruning to accelerate training. Extensive experiments demonstrate that GaussianReg achieves 10mm-level accuracy with only 10 minutes of training, compared to hours required by existing methods. Our approach thus offers a promising solution for emergency surgical scenarios where rapid adaptation to patient-specific anatomy is critical.
Poster
Yuchen Liu · Yaoming Wang · Bowen Shi · XIAOPENG ZHANG · Wenrui Dai · Chenglin Li · Hongkai Xiong · Qi Tian
[ Exhibit Hall I ]
Abstract
Vision encoders serve as the cornerstone of multimodal understanding. Single-encoder architectures like CLIP exhibit inherent constraints in generalizing across diverse multimodal tasks, while recent multi-encoder fusion methods introduce prohibitive computational overhead to achieve superior performance using complementary visual representations from multiple vision encoders. To address this, we propose a progressive pruning framework, namely **M**ulti-**E**ncoder Collabora**T**iv**E** t**O**ken p**R**uning (**METEOR**), that eliminates redundant visual tokens across the encoding, fusion, and decoding stages for multi-encoder MLLMs. For multi-vision encoding, we discard redundant tokens within each encoder via a rank-guided collaborative token assignment strategy. Subsequently, for multi-vision fusion, we combine the visual features from different encoders while reducing cross-encoder redundancy with cooperative pruning. Finally, we propose an adaptive token pruning method in the LLM decoding stage to further discard irrelevant tokens based on the text prompts, with a dynamically adjusted pruning ratio for specific task demands. To the best of our knowledge, this is the first successful attempt to achieve an efficient multi-encoder vision-language model with multi-stage pruning strategies. Extensive experiments on 11 benchmarks demonstrate the effectiveness of our proposed approach. Compared with EAGLE, a typical multi-encoder MLLM, **METEOR** reduces visual tokens by 76\% with only a 0.3\% average performance drop.
Poster
Qihang Fan · Huaibo Huang · Yuang Ai · Ran He
[ Exhibit Hall I ]
Abstract
As the core operator of Transformers, Softmax Attention exhibits excellent global modeling capabilities. However, its quadratic complexity limits its applicability to vision tasks. In contrast, Linear Attention shares a similar formulation with Softmax Attention while achieving linear complexity, enabling efficient global information modeling. Nevertheless, Linear Attention suffers from a significant performance degradation compared to standard Softmax Attention. In this paper, we analyze the underlying causes of this issue based on the formulation of Linear Attention. We find that, unlike Softmax Attention, Linear Attention entirely disregards the magnitude information of the Query ($Q$ or $\phi(Q)$). The absence of magnitude information prevents the attention score distribution from dynamically adapting as the Query scales. As a result, despite its structural similarity to Softmax Attention, Linear Attention exhibits a significantly different attention score distribution. Based on this observation, we propose **Magnitude-Aware Linear Attention** (MALA), which modifies the computation of Linear Attention to fully incorporate the Query’s magnitude. This adjustment allows MALA to generate an attention score distribution that closely resembles Softmax Attention while exhibiting a better-balanced structure. As a result, MALA surpasses Softmax Attention in performance while maintaining only linear complexity. We build Magnitude-Aware Vision Transformer (MAViT) based on MALA, achieving **84.7%** accuracy on …
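The claim that Linear Attention ignores the Query's magnitude can be checked numerically. The sketch below uses the standard normalized form of linear attention, in which the score for key $j$ is $\phi(q)\cdot\phi(k_j) / \sum_m \phi(q)\cdot\phi(k_m)$. It illustrates the problem only, not MALA itself; the helper names and toy vectors are assumptions, and the inputs to `linear_scores` are taken to be already feature-mapped with positive entries.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax_scores(q, keys):
    """Softmax attention scores of one query over a list of keys."""
    logits = [dot(q, k) for k in keys]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def linear_scores(phi_q, phi_keys):
    """Normalized linear attention scores; inputs assumed positive."""
    sims = [dot(phi_q, k) for k in phi_keys]
    z = sum(sims)
    return [s / z for s in sims]

# Hypothetical toy data with positive entries.
q = [1.0, 2.0]
keys = [[1.0, 0.5], [0.2, 1.5], [0.7, 0.7]]
q_scaled = [3.0 * x for x in q]  # same direction, three times the magnitude

# Linear attention: the scale factor cancels between numerator and denominator.
print(linear_scores(q, keys), linear_scores(q_scaled, keys))
# Softmax attention: a larger query magnitude sharpens the distribution.
print(softmax_scores(q, keys), softmax_scores(q_scaled, keys))
```

Because any positive scale factor on the query appears in both the numerator and the denominator of the linear form, the resulting distribution cannot adapt as the query grows, whereas the softmax distribution becomes more peaked.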
Poster
David Pujol-Perich · Sergio Escalera · Albert Clapés
[ Exhibit Hall I ]
Abstract
Video Temporal Grounding (VTG) involves Moment Retrieval (MR) and Highlight Detection (HD) based on textual queries. For this, most methods rely solely on final-layer features of frozen large pre-trained backbones, limiting their adaptability to new domains. While full fine-tuning is often impractical, parameter-efficient fine-tuning --and particularly side-tuning (ST)-- has emerged as an effective alternative. However, prior ST work approaches this problem from a frame-level refinement perspective, overlooking the inherently sparse nature of MR. To address this, we propose the Sparse-Dense Side-Tuner (SDST), the first anchor-free ST architecture for VTG. We also introduce the Reference-based Deformable Self-Attention, a novel mechanism that enhances the context modeling of the deformable attention --a key limitation of existing anchor-free methods. Additionally, we present the first effective integration of the InternVideo2 backbone into an ST framework, showing its profound implications for performance. Overall, our method significantly improves existing ST methods, achieving highly competitive or SOTA results on QVHighlights, TACoS, and Charades-STA, while reducing the parameter count by up to 73% w.r.t. the existing SOTA methods. The code will be made publicly available upon acceptance.
Poster
Jiawei Mao · Yuhan Wang · Yucheng Tang · Daguang Xu · Kang Wang · Yang Yang · Zongwei Zhou · Yuyin Zhou
[ Exhibit Hall I ]
Abstract
This paper presents **MedSegFactory**, a versatile medical synthesis framework trained across multiple modalities and tasks. The core of MedSegFactory is a dual-stream diffusion model, where one stream synthesizes medical images and the other generates corresponding segmentation masks. To ensure precise alignment between image-mask pairs, we introduce Joint Cross-Attention (JCA), enabling a collaborative denoising paradigm by dynamic cross-conditioning between streams. This bidirectional interaction allows both representations to guide each other's generation, enhancing consistency between generated pairs. MedSegFactory unlocks on-demand generation of paired medical images and segmentation masks through user-defined prompts that specify the target labels, imaging modalities, anatomical regions, and pathological conditions, facilitating scalable and high-quality data generation. This new paradigm of medical image synthesis enables seamless integration into diverse medical imaging workflows, enhancing both efficiency and accuracy. Extensive experiments show that MedSegFactory generates data of superior quality and usability, achieving competitive or state-of-the-art performance in 2D and 3D segmentation tasks while addressing data scarcity and regulatory constraints.
Poster
Shuai Liu · Peng Zhang · Shiwei Zhang · Wei Ke
[ Exhibit Hall I ]
Abstract
Open-set counting is garnering increasing attention due to its capability to enumerate objects of arbitrary category. It can be generally categorized into two methodologies: text-guided zero-shot counting methods and exemplar-guided few-shot counting methods. Previous text-guided zero-shot methods only provide limited object information through text, resulting in poor performance. Besides, though exemplar-guided few-shot approaches gain better results, they rely heavily on manually annotated visual exemplars, resulting in low efficiency and high labor intensity. Therefore, we propose CountSE, which simultaneously achieves high efficiency and high performance. CountSE is a new text-guided zero-shot object counting algorithm that generates multiple precise soft exemplars at different scales to enhance counting models driven solely by semantics. Specifically, to obtain richer object information and address the diversity in object scales, we introduce Semantic-guided Exemplar Selection, a module that generates candidate soft exemplars at various scales and selects those with high similarity scores. Then, to ensure accuracy and representativeness, Clustering-based Exemplar Filtering is introduced to refine the candidate exemplars by effectively eliminating inaccurate exemplars through clustering analysis. In the text-guided zero-shot setting, CountSE outperforms all state-of-the-art methods on the FSC-147 benchmark by at least 15\%. Additionally, experiments on two other widely used datasets demonstrate that CountSE significantly outperforms …
Poster
Xinyao Liu · Diping Song
[ Exhibit Hall I ]
Abstract
Multimodal large language models (MLLMs) demonstrate significant potential in the field of medical diagnosis. However, they face critical challenges in specialized domains such as ophthalmology, particularly the fragmentation of annotation granularity and inconsistencies in clinical reasoning logic, which hinder precise cross-modal understanding. This paper introduces **FundusExpert**, the first ophthalmology-specific MLLM with integrated positioning-diagnosis reasoning capabilities, along with **FundusGen**, a dataset constructed through the intelligent **Fundus-Engine** system. Fundus-Engine automates localization and leverages MLLM-based semantic expansion to integrate global disease classification, local object detection, and fine-grained feature analysis within a single fundus image. Additionally, by constructing a clinically aligned cognitive chain, it guides the model to generate interpretable reasoning paths. FundusExpert, fine-tuned with instruction data from FundusGen, achieves the best performance in ophthalmic question-answering tasks, surpassing the average accuracy of the 40B MedRegA by 26.6\%. It also excels in zero-shot report generation tasks, achieving a clinical consistency of 77.0\%, significantly outperforming GPT-4o's 47.6\%. Furthermore, we reveal a scaling law between data quality and model capability ($L \propto N^{0.33}$), demonstrating that the cognitive alignment annotations in FundusGen enhance data utilization efficiency. By integrating region-level localization with diagnostic reasoning chains, our work develops a scalable, clinically-aligned MLLM and explores a pathway toward bridging the visual-language gap in …
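As a rough reading of the reported exponent, the following worked example shows what a power law of this form implies. The abstract does not define either symbol, so taking $L$ as the capability measure and $N$ as the amount of training data is an assumption, as is the unspecified constant $c$.

$$
L(N) = c\,N^{0.33} \;\Rightarrow\; \frac{L(2N)}{L(N)} = 2^{0.33} \approx 1.26, \qquad \frac{L(10N)}{L(N)} = 10^{0.33} \approx 2.14
$$

Under this reading, doubling $N$ changes $L$ by a factor of about 1.26, and a tenfold increase in $N$ roughly doubles it.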
Poster
Yang Liu · Yufei Yin · Chenchen Jing · Muzhi Zhu · Hao Chen · Yuling Xi · Bo Feng · Hao Wang · Shiyu Li · Chunhua Shen
[ Exhibit Hall I ]
Abstract
In this work, we present COSINE, a unified open-world segmentation model that consolidates open-vocabulary segmentation and in-context segmentation with multi-modal prompts (e.g. text and image). COSINE exploits foundation models to extract representations for an input image and corresponding multi-modal prompts, and a SegDecoder to align these representations, model their interaction, and obtain masks specified by input prompts across different granularities. In this way, COSINE overcomes architectural discrepancies, divergent learning objectives, and distinct representation learning strategies of previous pipelines for open-vocabulary segmentation and in-context segmentation. Comprehensive experiments demonstrate that COSINE has significant performance improvements in both open-vocabulary and in-context segmentation tasks. Our exploratory analyses highlight that the synergistic collaboration between using visual and textual prompts leads to significantly improved generalization over single-modality approaches.
Poster
Xiaolei Wang · Xiaoyang Wang · Huihui Bai · ENG Gee LIM · Jimin XIAO
[ Exhibit Hall I ]
Abstract
Recent unsupervised distillation-based and reconstruction-based methods rely on the feature inconsistency of a frozen encoder and the corresponding learnable decoder to achieve anomaly localization. However, these methods have a critical limitation: decoders trained exclusively on normal samples reconstruct abnormal features unexpectedly well, leading to degraded detection performance. We identify this phenomenon as 'anomaly leakage' (AL): the decoder optimized by reconstruction loss tends to directly copy the encoded input, regardless of whether the input is a normal or abnormal feature. To address this challenge, we propose a novel framework that explicitly decouples encoded features into normal and abnormal components through a bounded invertible mapping in a prior latent space. Compared to previous methods, the invertible structure can eliminate anomalous information point-to-point without damaging the information of neighboring patches, improving reconstruction. Moreover, the framework suppresses the abnormal component before reconstructing features through inverse mapping. In this process, effective synthetic abnormal features are essential for training the decoupling process. Therefore, we propose to apply adversarial training to find suitable perturbations to simulate feature-level anomalies. Extensive experimental evaluations on benchmark datasets, including MVTec AD, VisA, and Real-IAD, demonstrate that our method achieves competitive performance compared to state-of-the-art approaches. The code will be made publicly …
Poster
Eunchan Jo · Dahyun Kang · Sanghyun Kim · Yunseon Choi · Minsu Cho
[ Exhibit Hall I ]
Abstract
We address the problem of few-shot pattern detection, which aims to detect all instances of a given pattern, typically represented by a few exemplars, from an input image. Although similar problems have been studied in few-shot object counting and detection (FSCD), previous methods and their benchmarks have narrowed patterns of interest to object categories and often fail to localize non-object patterns. In this work, we propose a simple yet effective detector based on template matching and regression, dubbed \ours. While previous FSCD methods typically represent given target exemplars as a spatially collapsed prototype, we revisit classic template matching and regression. It effectively preserves and leverages the spatial layout of exemplars in our minimalistic architecture, which consists of a few learnable layers of either convolutions or projections. We also introduce a new dataset, dubbed RPINE, which covers a wider range of patterns than existing object-centric datasets. Experiments on three benchmarks, RPINE, FSCD-147, and FSCD-LVIS, demonstrate that our method outperforms recent state-of-the-art methods, showing an outstanding generalization ability on cross-dataset evaluation.
Poster
Minghang Zheng · Yuxin Peng · Benyuan Sun · Yi Yang · Yang Liu
[ Exhibit Hall I ]
Abstract
In this paper, we tackle the task of online video temporal grounding (OnVTG), which requires the model to locate events related to a given natural language query within a video stream. Unlike regular video temporal grounding, OnVTG requires the model to make predictions without observing future frames. As online videos are streaming inputs and can go on indefinitely, it is impractical and inefficient to store all historical inputs. Existing OnVTG models employ memory to store recent historical video frame features and predict scores indicating whether the current frame corresponds to the start or end time of the target event. However, these methods lack effective event modeling and cannot retain long-term historical information, leading to low performance. To tackle these challenges, we propose hierarchical event memory for online video temporal grounding. We propose an event-based OnVTG framework that makes predictions based on event proposals that model event-level information with various durations. To efficiently preserve historically valuable event information, we introduce a hierarchical event memory that retains long-term low-redundant historical events, allowing the model to access both recent fine-grained information and long-term coarse-grained information. To enable the real-time prediction of the start time, we further propose a future prediction branch that …
Poster
Emmanuelle Bourigault · Amir Jamaludin · Abdullah Hamdi
[ Exhibit Hall I ]
Abstract
In the medical imaging domain, it is a fundamental challenge to collect large-scale labeled data due to privacy, involved logistics, and the high cost of labeling medical images. In this work, we present the UK Biobank Organs and Bones (UKBOB), the largest labeled dataset of body organs, comprising 51,761 3D MRI samples (17.9 M 2D images) and a total of more than 1.37 billion 2D segmentation masks of 72 organs based on the UK Biobank MRI dataset. We utilize automatic labeling, filter the labels with organ-specific filters, and manually annotate a subset of 300 MRIs with 11 abdominal classes to validate the quality (UKBOB-manual). This approach allows for scaling up the dataset collection while maintaining confidence in the labels. We further confirm the validity of the labels by the zero-shot generalization of trained models on the filtered UKBOB to other small labeled datasets from a similar domain (e.g., abdominal MRI). To further alleviate the effect of the noisy labels, we propose a novel Entropy Test-time Adaptation (ETTA) to refine the segmentation output. We use UKBOB to train a foundation model (_Swin-BOB_) for 3D medical image segmentation based on Swin-UNetr, achieving state-of-the-art results in several benchmarks in 3D medical imaging, …
Poster
Chunwei Wang · Guansong Lu · Junwei Yang · Runhui Huang · Jianhua Han · Lu Hou · Wei Zhang · Hang Xu
[ Exhibit Hall I ]
Abstract
In this paper, we introduce ILLUME, a unified multimodal large language model (MLLM) that seamlessly integrates multimodal understanding and generation capabilities within a single large language model through a unified next-token prediction formulation. To address the large dataset size typically required for image-text alignment, we propose to enhance data efficiency through the design of a vision tokenizer that incorporates semantic information and a progressive multi-stage training procedure. This approach reduces the dataset size to just 15M for pretraining -- over four times fewer than what is typically needed -- while achieving competitive or even superior performance compared with existing unified MLLMs, such as Janus. Additionally, to promote synergistic enhancement between understanding and generation capabilities, which is under-explored in previous works, we introduce a novel self-enhancing multimodal alignment scheme. This scheme supervises the MLLM to self-assess the consistency between text descriptions and self-generated images, helping the model interpret images more accurately and avoid unrealistic and incorrect predictions caused by misalignment in image generation. Based on our extensive experiments, our proposed ILLUME stands out and competes with state-of-the-art unified MLLMs and specialized models across various benchmarks for multimodal understanding, generation, and editing.
Poster
Ruitao Wu · Yifan Zhao · Jia Li
[ Exhibit Hall I ]
Abstract
Class-Incremental Semantic Segmentation (CISS) requires continuous learning of newly introduced classes while retaining knowledge of past classes. By abstracting mainstream methods into two stages (visual feature extraction and prototype-feature matching), we identify a more fundamental challenge termed catastrophic semantic entanglement. This phenomenon involves Prototype-Feature Entanglement caused by semantic misalignment during the incremental process, and Background-Increment Entanglement due to dynamic data evolution. Existing techniques, which rely on visual feature learning without sufficient cues to distinguish targets, introduce significant noise and errors. To address these issues, we introduce a Language-inspired Bootstrapped Disentanglement framework (LBD). We leverage the prior class semantics of pre-trained visual-language models (e.g., CLIP) to guide the model in autonomously disentangling features through Language-guided Prototypical Disentanglement and Manifold Mutual Background Disentanglement. The former guides the disentangling of new prototypes by treating hand-crafted text features as topological templates, while the latter employs multiple learnable prototypes and mask-pooling-based supervision for background-incremental class disentanglement. By incorporating soft prompt tuning and encoder adaptation modifications, we further bridge the capability gap of CLIP between dense and sparse tasks, achieving state-of-the-art performance on both Pascal VOC and ADE20k, particularly in multi-step scenarios.
Poster
Liwei Che · Qingze T Liu · Jing Jia · Weiyi Qin · Ruixiang Tang · Vladimir Pavlovic
[ Exhibit Hall I ]
Abstract
Despite their remarkable potential, Large Vision-Language Models (LVLMs) still face challenges with object hallucination, a problem where their generated outputs mistakenly incorporate objects that do not actually exist. Although most works focus on addressing this issue within the language-model backbone, our work shifts the focus to the image input source, investigating how specific image tokens contribute to hallucinations. Our analysis reveals that a small subset of image tokens with high attention scores are the main drivers of object hallucination. By removing these hallucinatory image tokens (only 1.5% of all image tokens), the issue can be effectively mitigated. This finding holds consistently across different models. Building on this insight, we introduce EAZY, a novel, training-free method that automatically identifies and Eliminates hAllucinations by Zeroing out hallucinatorY image tokens. We utilize EAZY for unsupervised object hallucination detection, achieving a 15% improvement compared to previous methods. Additionally, EAZY demonstrates remarkable effectiveness in mitigating hallucinations while preserving model utility and seamlessly adapting to various LVLM architectures.
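The basic operation the abstract describes, zeroing out a small set of high-attention image tokens, can be sketched as follows. This is an illustration only: the function name, the plain top-k criterion, and the toy values are assumptions, and the abstract does not say how EAZY actually decides which high-attention tokens are hallucinatory.

```python
def zero_out_top_tokens(token_embeddings, attention_scores, k):
    """Return a copy of the embeddings with the k highest-attention tokens zeroed.

    token_embeddings: list of per-token embedding vectors (lists of floats).
    attention_scores: one attention score per image token.
    k: number of tokens to zero out (hypothetical selection budget).
    """
    if len(token_embeddings) != len(attention_scores):
        raise ValueError("one attention score is required per token")
    ranked = sorted(
        range(len(attention_scores)),
        key=lambda i: attention_scores[i],
        reverse=True,
    )
    top = set(ranked[:k])
    return [
        [0.0] * len(emb) if i in top else list(emb)
        for i, emb in enumerate(token_embeddings)
    ]

# Hypothetical toy input: three image tokens with 2-d embeddings.
embeddings = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
scores = [0.1, 0.7, 0.2]
print(zero_out_top_tokens(embeddings, scores, k=1))
```

The input list is left unmodified, so the original and edited token sequences can both be passed to a model and their outputs compared.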
Poster
Jian Wang · Tianhong Dai · Bingfeng Zhang · Siyue Yu · ENG Gee LIM · Jimin XIAO
[ Exhibit Hall I ]
Abstract
Weakly Supervised Semantic Segmentation (WSSS) utilizes Class Activation Maps (CAMs) to extract spatial cues from image-level labels. However, CAMs highlight only the most discriminative foreground regions, leading to incomplete results. Recent Vision Transformer-based methods leverage class-patch attention to enhance CAMs, yet they still suffer from partial activation due to the token gap: classification-focused class tokens prioritize discriminative features, while patch tokens capture both discriminative and non-discriminative characteristics. This mismatch prevents class tokens from activating all relevant features, especially when discriminative and non-discriminative regions exhibit significant differences. To address this issue, we propose Optimal Transport-assisted Proxy Learning (OTPL), a novel framework that bridges the token gap by learning adaptive proxies. OTPL introduces two key strategies: (1) optimal transport-assisted proxy learning, which combines class tokens with their most relevant patch tokens to produce comprehensive CAMs, and (2) optimal transport-enhanced contrastive learning, aligning proxies with confident patch tokens for bounded proxy exploration. Our framework overcomes the limitation of class tokens in activating patch tokens, providing more complete and accurate CAM results. Experiments on WSSS benchmarks (PASCAL VOC and MS COCO) demonstrate that our method significantly improves the CAM quality and achieves state-of-the-art performances. The source code will be released.
Poster
Jiashuo Yu · Yue Wu · Meng Chu · Zhifei Ren · Zizheng Huang · Pei Chu · Ruijie Zhang · Yinan He · Qirui Li · Songze Li · Zhenxiang Li · Zhongying Tu · Conghui He · Yu Qiao · Yali Wang · Yi Wang · Limin Wang
[ Exhibit Hall I ]
Abstract
We present VRBench, the first long narrative video benchmark crafted for evaluating large models' multi-step reasoning capabilities, addressing limitations in existing evaluations that overlook temporal reasoning and procedural validity. It comprises 1,010 long videos (average duration 1.6 hours) along with 9,468 human-labeled multi-step question-answering pairs and 30,292 reasoning steps. These videos are curated via a multi-stage filtering process including expert inter-rater reviewing to prioritize plot coherence. We develop a human-AI collaborative framework that generates coherent reasoning processes, each requiring multiple temporally grounded steps, spanning seven types (e.g., event attribution, implicit inference). VRBench designs a multi-phase evaluation pipeline that evaluates models at both the outcome and process levels. Apart from the MCQs for the final results, we propose two metrics for process-level evaluation: (1) LLM-guided scoring for logical coherence and factual accuracy, and (2) stepwise multiple-choice question decomposition to validate causal progression. Through extensive evaluations of 12 LLMs and 16 VLMs on VRBench, we undertake a thorough analysis and provide valuable insights that advance the field of multi-step reasoning. VRBench will be publicly available.
Poster
Qing Jiang · Lin Wu · Zhaoyang Zeng · Tianhe Ren · Yuda Xiong · Yihao Chen · Liu Qin · Lei Zhang
[ Exhibit Hall I ]
Abstract
Humans are undoubtedly the most important participants in computer vision, and the ability to detect any individual given a natural language description, a task we define as referring to any person, holds substantial practical value. However, we find that existing models generally fail to achieve real-world usability, and current benchmarks are limited by their focus on one-to-one referring, which hinders progress in this area. In this work, we revisit this task from three critical perspectives: task definition, dataset design, and model architecture. We first identify five aspects of referable entities and three distinctive characteristics of this task. Next, we introduce HumanRef, a novel dataset designed to tackle these challenges and better reflect real-world applications. From a model design perspective, we integrate a multimodal large language model with an object detection framework, constructing a robust referring model named RexSeek. Experimental results reveal that state-of-the-art models, which perform well on commonly used benchmarks like RefCOCO/+/g, struggle with HumanRef due to their inability to detect multiple individuals. In contrast, RexSeek not only excels in human referring but also generalizes effectively to common object referring, making it broadly applicable across various perception tasks.
Poster
Yang Liu · Wentao Feng · Zhuoyao Liu · Shudong Huang · Jiancheng Lv
[ Exhibit Hall I ]
Abstract
Enabling Visual Semantic Models to effectively handle multi-view description matching has been a longstanding challenge. Existing methods typically learn a set of embeddings to find the optimal match for each view's text and compute similarity. However, the visual and text embeddings learned through these approaches have limited information capacity and are prone to interference from locally similar negative samples. To address this issue, we argue that the information capacity of embeddings is crucial and propose Dense-to-Sparse Feature Distilled Visual Semantic Embedding (D2S-VSE), which enhances the information capacity of sparse text by leveraging dense text distillation. Specifically, D2S-VSE is a two-stage framework. In the pre-training stage, we align images with dense text to enhance the information capacity of visual semantic embeddings. In the fine-tuning stage, we optimize two tasks simultaneously, distilling dense text embeddings to sparse text embeddings while aligning images and sparse texts, enhancing the information capacity of sparse text embeddings. Our proposed D2S-VSE model is extensively evaluated on the large-scale MS-COCO and Flickr30K datasets, demonstrating its superiority over recent state-of-the-art methods.
Poster
Wanting ZHANG · Zhenhui Ding · Guilian Chen · Huisi Wu · Jing Qin
[ Exhibit Hall I ]
Abstract
Accurate breast ultrasound (BUS) image segmentation is crucial for precise diagnosis and surgical planning, but it remains challenging largely due to the scarcity of labeled BUS images. Semi-supervised methods show promise by leveraging pseudo-labels to mitigate reliance on large-scale annotations. However, their performance is highly dependent on the quality of pseudo-labels, which is difficult to guarantee in BUS images due to inherent complexities such as low contrast, speckle noise, and artifacts. Previous studies primarily focus on refining pseudo-labels in one way or another, or on introducing auxiliary supervision; yet they overlook the potential of harnessing intrinsic pixel relations to enhance the robustness of semi-supervised segmentation. In this paper, we present a novel relation-aware semi-supervised model for BUS image segmentation, which is composed of two innovative components, an adjacent relation propagation (ARP) module and a cross-layer relation alignment (CRA) module, to comprehensively explore pixel relations and improve segmentation performance. The ARP propagates relations among adjacent pixels to reinforce the collaborative prediction of correlated pixels and enhance the model's awareness of local semantic consistency. The CRA aligns cross-layer pixel relations, employing deep-layer guidance to rectify erroneous correlations in shallow layers for noise suppression, while integrating multi-scale contexts to enable robust …
Poster
Chandan Yeshwanth · David Rozenberszki · Angela Dai
[ Exhibit Hall I ]
Abstract
Generating text descriptions of objects in 3D indoor scenes is an important building block of embodied understanding. Existing methods do this by describing objects at a single level of detail, which often does not capture fine-grained details such as varying textures, materials, and shapes of the parts of objects. We propose the task of expressive 3D captioning: given an input 3D scene, describe objects at multiple levels of detail: a high-level object description, and a low-level description of the properties of its parts. To produce such captions, we present ExCap3D, an expressive 3D captioning model which takes as input a 3D scan, and for each detected object in the scan, generates a fine-grained collective description of the parts of the object, along with an object-level description conditioned on the part-level description. We design ExCap3D to encourage semantic consistency between the generated text descriptions, as well as textual similarity in the latent space, to further increase the quality of the generated captions. To enable this task, we generated the ExCap3D Dataset by leveraging a visual-language model (VLM) for multi-view captioning. ExCap3D Dataset contains captions on the ScanNet++ dataset with varying levels of detail, comprising 190k text descriptions of 34k 3D objects in 947 indoor scenes. Our experiments …
Poster
JIAHE ZHAO · rongkun Zheng · Yi Wang · Helin WANG · Hengshuang Zhao
[ Exhibit Hall I ]
Abstract
In video Multimodal Large Language Models (video MLLMs), the visual encapsulation process plays a pivotal role in converting video contents into representative tokens for LLM input. While linear projectors are widely employed for encapsulation, they introduce semantic indistinctness and temporal incoherence when applied to videos. Conversely, the structure of resamplers shows promise in tackling these challenges, but an effective solution remains unexplored. Drawing inspiration from resampler structures, we introduce **DisCo**, a novel visual encapsulation method designed to yield semantically **dis**tinct and temporally **co**herent visual tokens for video MLLMs. DisCo integrates two key components: (1) A Visual Concept Discriminator (VCD) module, assigning unique semantics for visual tokens by associating them in pair with discriminative concepts in the video. (2) A Temporal Focus Calibrator (TFC) module, ensuring consistent temporal focus of visual tokens to video elements across every video frame. Through extensive experiments on multiple video MLLM frameworks, we demonstrate that DisCo remarkably outperforms previous state-of-the-art methods across a variety of video understanding benchmarks, while also achieving higher token efficiency thanks to the reduction of semantic indistinctness.
Poster
Jiesi Hu · Hanyang Peng · Yanwu Yang · Xutao Guo · Yang Shang · Pengcheng Shi · Chenfei Ye · Ting Ma
[ Exhibit Hall I ]
Abstract
In-context learning (ICL), a type of universal model, demonstrates exceptional generalization across a wide range of tasks without retraining by leveraging task-specific guidance from context, making it particularly effective for the intricate demands of neuroimaging. However, current ICL models, limited to 2D inputs and thus exhibiting suboptimal performance, struggle to extend to 3D inputs due to the high memory demands of ICL. In this regard, we introduce Neuroverse3D, an ICL model capable of performing multiple neuroimaging tasks in 3D (e.g., segmentation, denoising, inpainting). Neuroverse3D overcomes the large memory consumption associated with 3D inputs through adaptive parallel-sequential context processing and a U-shaped fusion strategy, allowing it to handle an unlimited number of context images. Additionally, we propose an optimized loss function to balance multi-task training and enhance focus on anatomical boundaries. Our study incorporates 43,674 3D multi-modal scans from 19 neuroimaging datasets and evaluates Neuroverse3D on 14 diverse tasks using held-out test sets. The results demonstrate that Neuroverse3D significantly outperforms existing ICL models and closely matches task-specific models, enabling flexible adaptation to medical center variations without retraining. The code and model weights will be made publicly available.
Poster
Trong-Thang Pham · AKASH AWASTHI · Saba Khan · Esteban Marti · Tien-Phat Nguyen · Khoa Vo · Minh Tran · Ngoc Son Nguyen · Cuong Van · Yuki Ikebe · Anh Nguyen · Anh Nguyen · Zhigang Deng · Carol Wu · Hien Nguyen · Ngan Le
[ Exhibit Hall I ]
Abstract
Understanding radiologists' eye movement during Computed Tomography (CT) reading is crucial for developing effective interpretable computer-aided diagnosis systems. However, CT research in this area has been limited by the lack of publicly available eye-tracking datasets and the three-dimensional complexity of CT volumes. To address these challenges, we present the first publicly available eye gaze dataset on CT, called CT-ScanGaze. Then, we introduce CT-Searcher, a novel 3D scanpath predictor designed specifically to process CT volumes and generate radiologist-like 3D fixation sequences, overcoming the limitations of current scanpath predictors that only handle 2D inputs. Since deep learning models benefit from a pretraining step, we develop a pipeline that converts existing 2D gaze datasets into 3D gaze data to pretrain CT-Searcher. Through both qualitative and quantitative evaluations on CT-ScanGaze, we demonstrate the effectiveness of our approach and provide a comprehensive assessment framework for 3D scanpath prediction in medical imaging. Code and data will be available for research purposes.
Poster
Zhibo Yang · Jun Tang · Zhaohai Li · Pengfei Wang · Jianqiang Wan · Humen Zhong · Xuejing Liu · Mingkun Yang · Peng Wang · Shuai Bai · Lianwen Jin · Junyang Lin
[ Exhibit Hall I ]
Abstract
Large Multimodal Models (LMMs) have demonstrated impressive performance in recognizing document images with natural language instructions. However, it remains unclear how far their literacy capabilities extend to inputs with rich structure and fine-grained visual challenges. The current landscape lacks a comprehensive benchmark to effectively measure the literate capabilities of LMMs. Existing benchmarks are often limited by narrow scenarios and specified tasks. To this end, we introduce CC-OCR, a comprehensive benchmark that possesses a diverse range of scenarios, tasks, and challenges. CC-OCR comprises four OCR-centric tracks: multi-scene text reading, multilingual text reading, document parsing, and key information extraction. It includes 39 subsets with 7,058 fully annotated images, of which 41% are sourced from real applications and released for the first time. We evaluate ten prominent LMMs and reveal both the strengths and weaknesses of these models, particularly in text grounding, multi-orientation, and hallucination of repetition. CC-OCR aims to comprehensively evaluate the capabilities of LMMs on OCR-centered tasks, facilitating continued progress in this crucial area.
Poster
I-Hsiang Chen · Hua-En Chang · Wei-Ting Chen · Jenq-Newng Hwang · Sy-Yen Kuo
[ Exhibit Hall I ]
Abstract
Domain Generalized Semantic Segmentation (DGSS) is a critical yet challenging task, as domain shifts in unseen environments can severely compromise model performance. While recent studies enhance feature alignment by projecting features into the source domain, they often neglect intrinsic latent domain priors, leading to suboptimal results. In this paper, we introduce PDAF, a Probabilistic Diffusion Alignment Framework that enhances the generalization of existing segmentation networks through probabilistic diffusion modeling. PDAF introduces a Latent Domain Prior (LDP) to capture domain shifts and uses this prior as a conditioning factor to align both source and unseen target domains. To achieve this, PDAF integrates into a pre-trained segmentation model and utilizes paired source and pseudo-target images to simulate latent domain shifts, enabling LDP modeling. The framework comprises three modules: the Latent Prior Extractor (LPE) predicts the LDP by supervising domain shifts; the Domain Compensation Module (DCM) adjusts feature representations to mitigate domain shifts; and the Diffusion Prior Estimator (DPE) leverages a diffusion process to estimate the LDP without requiring paired samples. This design enables PDAF to iteratively model domain shifts, progressively refining feature representations to enhance generalization under complex target conditions. Extensive experiments validate the effectiveness of PDAF across diverse and challenging urban …
Poster
Long Lian · Yifan Ding · Yunhao Ge · Sifei Liu · Hanzi Mao · Boyi Li · Marco Pavone · Ming-Yu Liu · Trevor Darrell · Adam Yala · Yin Cui
[ Exhibit Hall I ]
Abstract
Generating detailed and accurate descriptions for specific regions in images and videos remains a fundamental challenge for vision-language models. We introduce the Describe Anything Model (DAM), a model designed for detailed localized captioning (DLC). DAM preserves both local details and global context through two key innovations: a focal prompt, which ensures high-resolution encoding of targeted regions, and a localized vision backbone, which integrates precise localization with its broader context. To tackle the scarcity of high-quality DLC data, we propose a Semi-supervised learning (SSL)-based Data Pipeline (DLC-SDP). DLC-SDP starts with existing segmentation datasets and expands to unlabeled web images using SSL. We introduce DLC-Bench, a benchmark designed to evaluate DLC without relying on reference captions. DAM sets new state-of-the-art on 10 benchmarks spanning keyword-level, phrase-level, and detailed multi-sentence localized image and video captioning.
Poster
Shenghao Fu · Qize Yang · Yuan-Ming Li · Yi-Xing Peng · Kun-Yu Lin · Xihan Wei · Jian-Fang Hu · Xiaohua Xie · Wei-Shi Zheng
[ Exhibit Hall I ]
Abstract
Recent advances in Large Multi-modal Models (LMMs) are primarily focused on offline video understanding. Instead, streaming video understanding poses great challenges to recent models due to its time-sensitive, omni-modal and interactive characteristics. In this work, we aim to extend the streaming video understanding from a new perspective and propose a novel task named Visual Instruction Feedback in which models should be aware of visual contents and learn to extract instructions from them. For example, when users wave their hands to agents, agents should recognize the gesture and start conversations with welcome information. Thus, following instructions in visual modality greatly enhances user-agent interactions. To facilitate research, we define seven key subtasks highly relevant to visual modality and collect the ViSpeak-Instruct dataset for training and the ViSpeak-Bench for evaluation. Further, we propose the ViSpeak model, which is a SOTA streaming video understanding LMM with GPT-4o-level performance on various streaming video understanding benchmarks. After finetuning on our ViSpeak-Instruct dataset, ViSpeak is equipped with basic visual instruction feedback ability, serving as a solid baseline for future research. The model, code, and datasets will be made publicly available.
Poster
WonJun Moon · Cheol-Ho Cho · Woojin Jun · Minho Shim · Taeoh Kim · Inwoong Lee · Dongyoon Wee · Jae-Pil Heo
[ Exhibit Hall I ]
Abstract
In a retrieval system, simultaneously achieving search accuracy and efficiency is inherently challenging. This challenge is particularly pronounced in partially relevant video retrieval (PRVR), where incorporating more diverse context representations at varying temporal scales for each video enhances accuracy but increases computational and memory costs. To address this dichotomy, we propose a prototypical PRVR framework that encodes diverse contexts within a video into a fixed number of prototypes. We then introduce several strategies to enhance text association and video understanding within the prototypes, along with an orthogonal objective to ensure that the prototypes capture a diverse range of content. To keep the prototypes searchable via text queries while accurately encoding video contexts, we implement cross- and uni-modal reconstruction tasks. The cross-modal reconstruction task aligns the prototypes with textual features within a shared space, while the uni-modal reconstruction task preserves all video contexts during encoding. Additionally, we employ a video mixing technique to provide weak guidance to further align prototypes and associated textual representations. Extensive evaluations on TVR, ActivityNet-Captions, and QVHighlights validate the effectiveness of our approach without sacrificing efficiency.
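As a rough illustration of the "orthogonal objective" mentioned in the abstract above, the sketch below penalizes cosine similarity between distinct prototypes. The abstract does not give the actual loss, so the squared-cosine form and the names `normalize` and `orthogonality_penalty` are assumptions chosen for illustration, not the authors' implementation.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm (left unchanged if it is all zeros)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n > 0 else list(v)

def orthogonality_penalty(prototypes):
    """Sum of squared cosine similarities over ordered pairs of distinct prototypes.

    The value is zero when all prototypes are mutually orthogonal and grows
    as prototypes start to encode overlapping directions.
    """
    unit = [normalize(p) for p in prototypes]
    penalty = 0.0
    for i in range(len(unit)):
        for j in range(len(unit)):
            if i != j:
                cos = sum(a * b for a, b in zip(unit[i], unit[j]))
                penalty += cos * cos
    return penalty

# Hypothetical toy prototypes: the first pair is orthogonal, the second is collinear.
diverse = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
redundant = [[1.0, 0.0, 0.0], [3.0, 0.0, 0.0]]
```

Under this assumed form, `diverse` incurs no penalty while `redundant` is penalized, which is the behavior one would want from an objective that pushes a fixed number of prototypes toward covering different content.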
Poster
Yana Hasson · Pauline Luc · Liliane Momeni · Maks Ovsjanikov · Guillaume Le Moing · Alina Kuznetsova · Ira Ktena · Jennifer J. Sun · Skanda Koppula · Dilara Gokay · Joseph Heyward · Etienne Pot · Andrew Zisserman
[ Exhibit Hall I ]
Abstract
In recent years, there has been a proliferation of spatiotemporal foundation models for different scientific domains. While promising, these models are often domain-specific, limiting their applicability. Given that many spatiotemporal tasks can be represented as video modeling problems, video foundation models (ViFMs) hold considerable promise. However, it remains an open question to what extent the knowledge acquired on large-scale but potentially out-of-domain data can be effectively transferred across diverse scientific domains, and whether a single, pretrained ViFM can be competitive with domain-specific baselines. To address this, we introduce SciVid, a comprehensive benchmark comprising five **Sci**entific **Vid**eo tasks, across medical computer vision, animal behavior, and weather forecasting. We adapt six leading video models to SciVid using simple trainable readout modules, establishing strong baselines and demonstrating the potential for effective transfer learning. Specifically, we show that state-of-the-art results can be obtained in several applications by effectively transferring general-purpose representations from ViFM backbones. Furthermore, our results shed light on limitations of existing ViFMs, and highlight opportunities for the development of generalizable models for high-impact scientific applications. We will release our code to facilitate further research in cross-domain development of ViFMs.
Poster
Zheyuan Zhang · Wanying Dou · Linkai Peng · Hongyi Pan · Ulas Bagci · Boqing Gong
[ Exhibit Hall I ]
Abstract
Advertisement videos serve as a rich and valuable source of purpose-driven information, encompassing high-quality visual, textual, and contextual cues designed to engage viewers. They are often more complex than general videos of similar duration due to their structured narratives and rapid scene transitions, posing significant challenges to multi-modal large language models (MLLMs). In this work, we introduce VideoAds, the first dataset tailored for benchmarking the performance of MLLMs on advertisement videos. VideoAds comprises well-curated advertisement videos with complex temporal structures, accompanied by manually annotated diverse questions across three core tasks: visual finding, video summary, and visual reasoning. We propose a quantitative measure to compare VideoAds against existing benchmarks in terms of video complexity. Through extensive experiments, we find that Qwen2.5-VL-72B, an open-source MLLM, achieves 73.35% accuracy on VideoAds, outperforming GPT-4o (66.82%) and Gemini-1.5 Pro (69.66%); the two proprietary models fall behind the open-source model especially in video summarization and reasoning, but perform best in visual finding. Notably, human experts easily achieve a remarkable accuracy of 94.27%. These results underscore the necessity of advancing MLLMs' temporal modeling capabilities and highlight VideoAds as a potentially pivotal benchmark for future research in video-language understanding. The dataset and evaluation code will be publicly …
Poster
Runpeng Yu · Xinyin Ma · Xinchao Wang
[ Exhibit Hall I ]
Abstract
In MLLMs, visual perception refers to the process by which MLLMs encode visual inputs, such as images, and align them with the text embedding space. Currently, MLLMs still lack the capability to autonomously control their own visual perception processes. For example, they cannot selectively re-encode specific regions of an image or focus on information related to specific object categories. In this work, we propose the concept of the Visual Perception Token, aiming to empower MLLMs with a mechanism to control their visual perception processes. We design two types of Visual Perception Tokens, termed the Region Selection Token and the Vision Re-Encoding Token. MLLMs autonomously generate these tokens, just as they generate natural language tokens, and use them to trigger additional visual perception processes. The Region Selection Token explicitly identifies regions of interest that require further processing, while the Vision Re-Encoding Token utilizes its hidden states to guide an additional vision encoding process. Extensive experiments highlight the effectiveness of these tokens in enhancing spatial reasoning, fine-grained understanding, Text/OCR-related VQA, and a wide range of other visual tasks. On average, the introduction of Visual Perception Tokens improves the performance of a 2B model by 30.9%, increasing its score from 0.572 to 0.749, and even …
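The 30.9% figure quoted in the abstract above is a relative gain, not a difference in absolute score. A quick arithmetic check, using only the two scores stated in the abstract (the helper name `relative_improvement` is ours):

```python
def relative_improvement(before, after):
    """Relative gain of `after` over `before`, as a fraction of `before`."""
    return (after - before) / before

# Scores quoted in the abstract for the 2B model.
gain = relative_improvement(0.572, 0.749)
print(f"{gain:.1%}")  # prints "30.9%"
```

The absolute difference is 0.177 points, which is about 30.9% of the starting score of 0.572.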
Poster
Haochen Zhao · Jianwei Niu · Xuefeng Liu · Xiaozheng Xie · Li Kuang · Haotian Yang · Bin Dai · Hui Meng · Yong Wang
[ Exhibit Hall I ]
Abstract
Based on pseudo-labels, voxel-wise contrastive learning (VCL) is a prominent approach designed to learn effective feature representations for semi-supervised medical image segmentation. However, in multi-organ segmentation (MoS), the complex anatomical structures of certain organs often lead to many unreliable pseudo-labels. Directly applying VCL can introduce confirmation bias, resulting in poor segmentation performance. A common practice is to first transform these unreliable pseudo-labels into complementary ones, which represent classes that voxels are least likely to belong to, and then push voxels away from the generated complementary labels. However, we find that this approach may fail to allow voxels with unreliable pseudo-labels (unreliable voxels) to fully benefit from the advantages of VCL. In this paper, we propose DVCL, a novel distance-aware VCL method for semi-supervised MoS. DVCL is based on the observation that unreliable voxels, which may not form discriminative feature boundaries, still form clear clusters. Hence, voxels close to each other in the feature space ('neighbors') likely belong to the same semantic class, while distant ones ('outsiders') likely belong to different classes. In DVCL, we first identify neighbors and outsiders for all unreliable voxels, and then pull their neighbors into the same clusters while pushing outsiders away. In this way, unreliable …
Poster
Matic Fučka · Vitjan Zavrtanik · Danijel Skocaj
[ Exhibit Hall I ]
Abstract
Recent surface anomaly detection methods excel at identifying structural anomalies, such as dents and scratches, but struggle with logical anomalies, such as irregular or missing object components. The best-performing logical anomaly detection approaches rely on aggregated pretrained features or handcrafted descriptors (most often derived from composition maps), which discard spatial and semantic information, leading to suboptimal performance. We propose SALAD, a semantics-aware discriminative logical anomaly detection method that incorporates a newly proposed composition branch to explicitly model the distribution of object composition maps, consequently learning important semantic relationships. Additionally, we introduce a novel procedure for extracting composition maps that requires no hand-made labels or category-specific information, in contrast to previous methods. By effectively modelling the composition map distribution, SALAD significantly improves upon state-of-the-art methods on the standard benchmark for logical anomaly detection, MVTec LOCO, achieving an impressive image-level AUROC of 96.1%. Code will be released upon acceptance.
Poster
Shengcao Cao · Zijun Wei · Jason Kuen · Kangning Liu · Lingzhi Zhang · Jiuxiang Gu · HyunJoon Jung · Liangyan Gui · Yu-Xiong Wang
[ Exhibit Hall I ]
Abstract
Recent image segmentation models have advanced to segment images into high-quality masks for visual entities, and yet they cannot provide comprehensive semantic understanding for complex queries based on both language and vision. This limitation reduces their effectiveness in applications that require user-friendly interactions driven by vision-language prompts. To bridge this gap, we introduce a novel task of flexible referring expression segmentation (FRES). In this task, a model produces a group of masks based on arbitrary prompts specified by text only or text plus reference visual entities. To address this new challenge, we propose a novel framework to "Refer to Any Segmentation Mask Group" (RAS), which augments segmentation models with complex multimodal interactions and comprehension via a mask-centric large multimodal model. For training and benchmarking FRES models, we create datasets MaskGroups-2M and MaskGroups-HQ to include diverse mask groups specified by text and reference entities. Through extensive evaluation, we demonstrate superior performance of RAS on our new FRES task, as well as classic referring expression segmentation (RES) and generalized referring expression segmentation (GRES) tasks.
Poster
Ragav Sachdeva · Andrew Zisserman
[ Exhibit Hall I ]
Abstract
Comics have long been a popular form of storytelling, offering visually engaging narratives that captivate audiences worldwide. However, the visual nature of comics presents a significant barrier for visually impaired readers, limiting their access to these engaging stories. In this work, we provide a pragmatic solution to this accessibility challenge by developing an automated system that generates text-based literary narratives from manga comics. Our approach aims to create evocative and immersive prose that not only conveys the original narrative but also captures the depth and complexity of characters, their interactions, and the vivid settings in which they reside. To this end we make the following contributions: (1) We present a unified model, Magiv3, that excels at various functional tasks pertaining to comic understanding, such as localising panels, characters, texts, and speech-bubble tails, performing OCR, grounding characters etc. (2) We release human-annotated captions for over 3300 Japanese comic panels, along with character grounding annotations, and benchmark large vision-language models in their ability to understand comic images. (3) Finally, we demonstrate how integrating large vision-language models with Magiv3 can generate seamless literary narratives that allow visually impaired audiences to engage with the depth and richness of comic storytelling. Our code, trained model …
Poster
Pan Liu · Jinshi Liu
[ Exhibit Hall I ]
Abstract
While significant advances exist in pseudo-label generation for semi-supervised semantic segmentation, pseudo-label selection remains understudied. Existing methods typically use fixed confidence thresholds to retain high-confidence predictions as pseudo-labels. However, these methods cannot cope with network overconfidence tendency, where correct and incorrect predictions overlap significantly in high-confidence regions, making separation challenging and amplifying model cognitive bias. Meanwhile, the direct discarding of low-confidence predictions disrupts spatial-semantic continuity, causing critical context loss. We propose Confidence Separable Learning (CSL) to address these limitations. CSL formulates pseudo-label selection as a convex optimization problem within the confidence distribution feature space, establishing sample-specific decision boundaries to distinguish reliable from unreliable predictions. Additionally, CSL introduces random masking of reliable pixels to guide the network in learning contextual relationships from low-reliability regions, thereby mitigating the adverse effects of discarding uncertain predictions. Extensive experimental results on the Pascal VOC 2012 and Cityscapes benchmarks show that CSL performs favorably against state-of-the-art methods.
Poster
Seokho Han · Seoyeon Yoon · Jinhee Kim · Dongwei Wang · Kang Jeon · Huanrui Yang · Jong Hwan Ko
[ Exhibit Hall I ]
Abstract
As deep neural networks (DNNs) see increased deployment on mobile and edge devices, optimizing model efficiency has become crucial. Mixed-precision quantization is widely favored, as it offers a superior balance between efficiency and accuracy compared to uniform quantization. However, finding the optimal precision for each layer is challenging. Recent studies using bit-level training have shown promise, yet they often introduce substantial training complexity and high GPU memory requirements. In this paper, we propose Memory-Efficient Bit Sparsification Quantization (MSQ), a novel approach that addresses these limitations. MSQ applies a round-clamp quantizer and leverages least significant bit (LSB) regularization to induce sparsity in LSBs, enabling effective precision reduction without splitting parameters at the bit level, thereby minimizing memory use and training time. Additionally, MSQ incorporates Hessian information, allowing the simultaneous pruning of multiple LSBs to further enhance training efficiency. Experimental results show that MSQ effectively reduces resource demands while maintaining competitive accuracy and compression rates, making it a practical solution for training efficient DNNs on resource-constrained devices.
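To make the idea in the MSQ abstract above more concrete, here is a minimal sketch of a round-clamp quantizer and a penalty on least significant bits. The abstract does not specify the exact quantizer or regularizer, so everything here is an assumption for illustration: weights are taken to lie in [0, 1], the penalty is a plain L1 sum over the low bits of the integer codes, and the names `round_clamp_quantize`, `lsb_penalty`, and `droppable` are hypothetical. A trainable version would need a differentiable surrogate, which this sketch omits.

```python
import math

def round_clamp_quantize(w, bits):
    """Map a weight (assumed to lie in [0, 1]) to an integer code in [0, 2**bits - 1]."""
    levels = 2 ** bits - 1
    code = math.floor(w * levels + 0.5)   # round half up
    return min(max(code, 0), levels)      # clamp to the representable range

def lsb_penalty(weights, bits, k=1):
    """L1 penalty on the k least significant bits of each integer code."""
    return sum(round_clamp_quantize(w, bits) % (2 ** k) for w in weights)

def droppable(weights, bits, k=1):
    """True when the lowest k bits are zero for every weight, so they carry no information."""
    return lsb_penalty(weights, bits, k) == 0

# Hypothetical 4-bit layer whose codes are 4, 8, and 12: the two lowest bits are all zero.
sparse_layer = [4 / 15, 8 / 15, 12 / 15]
# Hypothetical 4-bit layer with codes 5 and 8: the lowest bit of the first code is set.
dense_layer = [5 / 15, 8 / 15]
```

In this toy setting `sparse_layer` could be stored with two fewer bits without changing any code, while `dense_layer` still uses its lowest bit. Note that the codes are computed on whole weights, in the spirit of the abstract's point that the parameters are not split at the bit level.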
Poster
Zhaorui Tan · Xi Yang · Tan Pan · TIANYI LIU · Chen Jiang · Xin Guo · Qiufeng Wang · Anh Nguyen · Yuan Qi · Kaizhu Huang · Yuan Cheng
[ Exhibit Hall I ]
Abstract
Variations in medical imaging modalities and individual anatomical differences pose challenges to cross-modality generalization in multi-modal tasks. Existing methods often concentrate exclusively on common anatomical patterns, thereby neglecting individual differences and consequently limiting their generalization performance. This paper emphasizes the critical role of learning individual-level invariance, i.e., personalized representation $\mathbb{X}_h$, to enhance multi-modality generalization under both homogeneous and heterogeneous settings. It reveals that mappings from individual anatomy to different medical modalities remain static across the population, which is implied in the personalization process. We propose a two-stage approach: pre-training with invariant representation $\mathbb{X}_h$ for personalization, then fine-tuning for diverse downstream tasks. We provide both theoretical and empirical evidence demonstrating the feasibility and advantages of personalization, showing that our approach yields greater generalizability and transferability across diverse multi-modal medical tasks compared to methods lacking personalization. Extensive experiments further validate that our approach significantly enhances performance in various generalization scenarios.
Poster
Jiaming Liu · Linghe Kong · Guihai Chen
[ Exhibit Hall I ]
Abstract
The Segment Anything Model (SAM) has shown impressive general-purpose segmentation performance on natural images, but its performance on camouflaged object detection (COD) is unsatisfactory. In this paper, we propose SAM-COD, which performs camouflaged object detection for RGB-D inputs. While keeping the SAM architecture intact, dual stream adapters are expanded on the image encoder to learn potential complementary information from RGB images and depth images, and the mask decoder and its depth replica are fine-tuned to perform dual-stream mask prediction. In practice, the dual stream adapters are embedded into the attention block of the image encoder in a parallel manner to facilitate the refinement and correction of the two types of image embeddings. To mitigate channel discrepancies arising from dual stream embeddings that do not directly interact with each other, we augment the association of dual stream embeddings using bidirectional knowledge distillation including a model distiller and a modal distiller. In addition, to predict the masks for RGB and depth attention maps, we hybridize the two types of image embeddings which are jointly learned with the prompt embeddings to update the initial prompt, and then feed them into the mask decoders to synchronize the consistency of image embeddings and prompt embeddings. Experimental results …
Poster
Yuanze Li · Shihao Yuan · Haolin Wang · Qizhang Li · Ming Liu · Chen Xu · Guangming Shi · Wangmeng Zuo
[ Exhibit Hall I ]
Abstract
Although recent methods have tried to introduce large multimodal models (LMMs) into industrial anomaly detection (IAD), their generalization in the IAD field is far inferior to that for general purposes. We summarize the main reasons for this gap into two aspects. On one hand, general-purpose LMMs lack cognition of defects in the visual modality, thereby failing to sufficiently focus on defect areas. Therefore, we propose to modify the AnyRes structure of the LLaVA model, providing the potential anomalous areas identified by existing IAD models to the LMMs. On the other hand, existing methods mainly focus on identifying defects by learning defect patterns or comparing with normal samples, yet they fall short of understanding the causes of these defects. Considering that the generation of defects is closely related to the manufacturing process, we propose a manufacturing-driven IAD paradigm. An instruction-tuning dataset for IAD (InstructIAD) and a data organization approach for Chain-of-Thought with manufacturing (CoT-M) are designed to leverage the manufacturing process for IAD. Based on the above two modifications, we present Triad, a novel LMM-based method incorporating an expert-guided region-of-interest tokenizer and manufacturing process for industrial anomaly detection. Extensive experiments show that our Triad not only demonstrates competitive performance against current …
Poster
Zhixuan Li · Hyunse Yoon · Sanghoon Lee · Weisi Lin
[ Exhibit Hall I ]
Abstract
Amodal segmentation aims to infer the complete shape of occluded objects, even when the occluded region's appearance is unavailable. However, current amodal segmentation methods lack the capability to interact with users through text input and struggle to understand or reason about implicit and complex purposes. While methods like LISA integrate multi-modal large language models (LLMs) with segmentation for reasoning tasks, they are limited to predicting only visible object regions and face challenges in handling complex occlusion scenarios. To address these limitations, we propose a novel task named amodal reasoning segmentation, aiming to predict the complete amodal shape of occluded objects while providing answers with elaborations based on user text input. We develop a generalizable dataset generation pipeline and introduce a new dataset focusing on daily life scenarios, encompassing diverse real-world occlusions. Furthermore, we present AURA (Amodal Understanding and Reasoning Assistant), a novel model with advanced global and spatial-level designs specifically tailored to handle complex occlusions. Extensive experiments validate AURA's effectiveness on the proposed dataset. The code, model, and dataset will be publicly released.
Poster
Jiaxuan Chen · Yu Qi · Yueming Wang · Gang Pan
[ Exhibit Hall I ]
Abstract
Neural decoding has recently made significant progress in reconstructing images and text from brain activity, yet seeking biologically valid semantic alignment between artificial models and the brain remains challenging. Large pre-trained foundation models such as CLIP excel at capturing rich semantic details in complex visual scenes. In contrast, due to selective attention, only part of the visual semantics in the stimulus may be preferentially represented in the neural patterns when subjects view images. Past studies have generally assumed that stimulus images and their evoked brain recordings are strictly semantically equivalent, potentially leading to semantic misalignment between supervision signals and neural recordings. To address this, we propose a novel self-adaptive semantic decoding method (Mind-SA), designed to dynamically detect the regions within stimulus images that the brain actually focuses on and use them as supervision to guide brain-to-text reconstruction. We find that the proposed Mind-SA can be used to reduce the semantic gap between supervision signals (i.e., stimulus images) and neural representations, thus enabling the reconstruction model to focus on the parts that the brain actually perceives. Experiments demonstrate that Mind-SA improves the quality of neural representations and achieves state-of-the-art brain-to-text performance.
Poster
HAILONG YAN · Ao Li · Xiangtao Zhang · Zhe Liu · Zenglin Shi · Ce Zhu · Le Zhang
[ Exhibit Hall I ]
Abstract
Recent advancements in deep neural networks have driven significant progress in image enhancement (IE). However, deploying deep learning models on resource-constrained platforms, such as mobile devices, remains challenging due to high computation and memory demands. To address these challenges and facilitate real-time IE on mobile, we introduce an extremely lightweight Convolutional Neural Network (CNN) framework with around 4K parameters. Our approach integrates re-parameterization with an Incremental Weight Optimization strategy to ensure efficiency. Additionally, we enhance performance with a Feature Self-Transform module and a Hierarchical Dual-Path Attention mechanism, optimized with a Local Variance-Weighted loss. With this efficient framework, we are the first to achieve real-time IE inference at up to 1,100 frames per second (FPS) while delivering competitive image quality, achieving the best trade-off between speed and performance across multiple IE tasks. The code will be released soon.
Poster
yingsen zeng · Zepeng Huang · Yujie Zhong · Chengjian Feng · Jie Hu · Lin Ma · Yang Liu
[ Exhibit Hall I ]
Abstract
Despite advances in general video understanding, Video Large Language Models (Video-LLMs) face challenges in precise temporal localization due to discrete time representations and limited temporally aware datasets. Existing methods for temporal expression either conflate time with text-based numerical values, add a series of dedicated temporal tokens, or regress time using specialized temporal grounding heads. To address these issues, we introduce DisTime, a lightweight framework designed to enhance temporal comprehension in Video-LLMs. This approach uses a learnable token to create a continuous embedding space for all time points and incorporates a Distribution-based Time Tokenizer that decodes timestamps into probability distributions. These distributions effectively resolve boundary ambiguities and translate into continuous time values. Additionally, we propose an automated annotation paradigm that combines the captioning capabilities of Video-LLMs with the localization expertise of dedicated temporal models to overcome temporal granularity limitations in existing datasets. This leads to the creation of InternVid-TG, a substantial dataset with 1.25M temporally grounded events across 179k videos, surpassing ActivityNet-Caption by 55 times. Extensive experiments demonstrate that DisTime achieves state-of-the-art performance across benchmarks in three time-sensitive tasks while maintaining competitive performance in Video QA tasks.
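The DisTime abstract above says that timestamps are decoded into probability distributions that then translate into continuous time values. One common way to do that, shown here purely as an assumed sketch rather than the paper's tokenizer, is to take the expectation of bin centers under a softmax distribution. The uniform binning and the names `softmax` and `expected_time` are illustrative choices.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_time(logits, duration):
    """Decode per-bin logits into one continuous timestamp in [0, duration].

    The video is split into len(logits) equal bins; the result is the
    probability-weighted average of the bin centers.
    """
    probs = softmax(logits)
    n = len(probs)
    centers = [(i + 0.5) * duration / n for i in range(n)]
    return sum(p * c for p, c in zip(probs, centers))

# Hypothetical 8-second clip with 4 bins (centers at 1, 3, 5, and 7 seconds).
confident = expected_time([0.0, 0.0, 50.0, 0.0], 8.0)      # mass on the third bin
ambiguous = expected_time([-50.0, 0.0, 0.0, -50.0], 8.0)   # mass split over two bins
```

With these toy logits, `confident` lands at roughly 5 seconds, while `ambiguous` lands at roughly 4 seconds, between the two competing bins. That is one way a distribution can express an uncertain boundary that a single discrete time token cannot.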
Poster
Ahmed Nassar · Matteo Omenetti · Maksym Lysak · Nikolaos Livathinos · Christoph Auer · Lucas Morin · Rafael Teixeira de Lima · Yusik Kim · A. Said Gurbuz · Michele Dolfi · Peter Staar
[ Exhibit Hall I ]
Abstract
We introduce SmolDocling, an ultra-compact vision-language model targeting end-to-end document conversion. Our model comprehensively processes entire pages by generating DocTags, a new universal markup format that captures all page elements in their full context with location. Unlike existing approaches that rely on large foundational models, or ensemble solutions that rely on handcrafted pipelines of multiple specialized models, SmolDocling offers an end-to-end conversion for accurately capturing content, structure and spatial location of document elements in a 256M-parameter vision-language model. SmolDocling exhibits robust performance in correctly reproducing document features such as code listings, tables, equations, charts, lists, and more across a diverse range of document types including business documents, academic papers, technical reports, patents, and forms, significantly extending beyond the commonly observed focus on scientific papers. Additionally, we contribute novel publicly sourced datasets for charts, tables, equations, and code recognition. Experimental results demonstrate that SmolDocling competes with other Vision Language Models that are up to 27 times larger in size, while reducing computational requirements substantially. The model weights and supplementary datasets will be publicly available upon acceptance.
Poster
Jiajia Li · Huisi Wu · Jing Qin
[ Exhibit Hall I ]
Abstract
Nuclei instance segmentation in histopathology images is a fundamental task in computational pathology. It is also a very challenging task due to complex nuclei morphologies, ambiguous boundaries, and staining variations. Existing methods often struggle to precisely delineate overlapping nuclei and handle class imbalance. We introduce WeaveSeg, a novel deep learning model for nuclei instance segmentation that significantly improves segmentation performance via synergistic integration of adaptive spectral feature refinement and iterative contrast-weaving. WeaveSeg features an adaptive spectral detail refinement (SAR) module for multi-scale feature enhancement via adaptive frequency component fusion, and an iterative contrast-weaving (ICW) module that progressively refines features through integrating contrastive attention, decoupled semantic context, and adaptive gating. Furthermore, we introduce a specialized uncertainty loss to explicitly model ambiguous regions, and a novel local contrast-based self-adaptive adjustment mechanism to accommodate dynamic feature distributions. Extensive experiments on MoNuSeg and CoNSeP demonstrate WeaveSeg's SOTA performance over existing models. Code will be publicly available.
Poster
Zitian Tang · Shijie Wang · Junho Cho · Jaewook Yoo · Chen Sun
[ Exhibit Hall I ]
Abstract
How multimodal large language models (MLLMs) perceive the visual world remains a mystery. At one extreme, object and relation modeling may be implicitly implemented with inductive biases, for example by treating objects as tokens. At the other extreme, empirical results reveal the surprising finding that simply performing visual captioning, which tends to ignore the spatial configuration of the objects, serves as a strong baseline for video understanding. We aim to answer the question: how can objects help video-language understanding in MLLMs? We tackle the question from the object representation and adaptation perspectives. Specifically, we investigate the trade-off between representation expressiveness (e.g. distributed versus symbolic) and integration difficulty (e.g. data-efficiency when learning the adapters). Through extensive evaluations on five video question answering datasets, we confirm that explicit integration of object-centric representation remains necessary, and the symbolic objects can be most easily integrated while being performant for question answering. We hope our findings can encourage the community to explore the explicit integration of perception modules into MLLM design. Our code and models will be publicly released.
Poster
G Thomas Hudson · Dean Slack · Thomas Winterbottom · Jamie Stirling · Chenghao Xiao · Junjie Shentu · Noura Al Moubayed
[ Exhibit Hall I ]
Abstract
Multimodal learning, which involves integrating information from various modalities such as text, images, audio, and video, is pivotal for numerous complex tasks like visual question answering, cross-modal retrieval, and caption generation. Traditional approaches rely on modality-specific encoders and late fusion techniques, which can hinder flexibility when adapting to new tasks or modalities. To address these limitations, we introduce a novel framework that extends the concept of task reformulation beyond natural language processing (NLP) to multimodal learning. We propose to reformulate diverse multimodal tasks into a unified next-frame prediction problem, allowing a single model to handle different modalities without modality-specific components. This method treats all inputs and outputs as sequential frames in a video, enabling seamless integration of modalities and effective knowledge transfer across tasks. Our approach is evaluated on a range of tasks, including text-to-text, image-to-text, video-to-text, and audio-to-text, demonstrating the model's ability to generalize across modalities with minimal adaptation. We show that task reformulation can significantly simplify multimodal model design across various tasks, laying the groundwork for more generalized multimodal foundation models.
Poster
Haoran Lou · Chunxiao Fan · Ziyan Liu · Yuexin Wu · Xinliang Wang
[ Exhibit Hall I ]
Abstract
The architecture of multimodal large language models (MLLMs) commonly connects a vision encoder, often based on CLIP-ViT, to a large language model. While CLIP-ViT works well for capturing global image features, it struggles to model local relationships between adjacent patches, leading to weaker visual representation, which in turn affects the detailed understanding ability of MLLMs. To solve this, we propose LLaVA-SP, which only adds six spatial visual tokens to the original visual tokens to enhance the visual representation. Our approach offers three key advantages: 1) We propose a novel Projector, which uses convolutional kernels to derive visual spatial tokens from ViT patch features, simulating two visual spatial ordering approaches: "from central region to global" and "from abstract to specific". Then, a cross-attention mechanism is applied to fuse fine-grained visual information, enriching the overall visual representation. 2) We present two model variants: LLaVA-SP-Cropping, which focuses on detail features through progressive cropping, and LLaVA-SP-Pooling, which captures global semantics through adaptive pooling, enabling the model to handle diverse visual understanding tasks. 3) Extensive experiments show that LLaVA-SP, fine-tuned with LoRA, achieves significant performance improvements across various multimodal benchmarks, outperforming the state-of-the-art LLaVA-1.5 model in multiple tasks with nearly identical inference latency. The code and …
Poster
Luca Barsellotti · Lorenzo Bianchi · Nicola Messina · Fabio Carrara · Marcella Cornia · Lorenzo Baraldi · Fabrizio Falchi · Rita Cucchiara
[ Exhibit Hall I ]
Abstract
Open-Vocabulary Segmentation (OVS) aims at segmenting images from free-form textual concepts without predefined training classes. While existing vision-language models such as CLIP can generate segmentation masks by leveraging coarse spatial information from Vision Transformers, they face challenges in spatial localization due to their global alignment of image and text features. Conversely, self-supervised visual models like DINO excel in fine-grained visual encoding but lack integration with language. To bridge this gap, we present Talk2DINO, a novel hybrid approach that combines the spatial accuracy of DINOv2 with the language understanding of CLIP. Our approach aligns the textual embeddings of CLIP to the patch-level features of DINOv2 through a learned mapping function without the need to fine-tune the underlying backbones. At training time, we exploit the attention maps of DINOv2 to selectively align local visual patches with textual embeddings. We show that the powerful semantic and localization abilities of Talk2DINO can enhance the segmentation process, resulting in more natural and less noisy segmentations, and that our approach can also effectively distinguish foreground objects from the background. Experimental results demonstrate that Talk2DINO achieves state-of-the-art performance across several unsupervised OVS benchmarks.
Poster
Minjoo Ki · Dae Jung Kim · Kisung Kim · Seon Joo Kim · Jinhan Lee
[ Exhibit Hall I ]
Abstract
Text-to-video retrieval serves as a powerful tool for navigating vast video databases. This is particularly useful in autonomous driving to retrieve scenes from a text query to simulate and evaluate the driving system in desired scenarios. However, traditional ranking-based retrieval methods often return partial matches that do not satisfy all query conditions. To address this, we introduce Inclusive Text-to-Video Retrieval, which retrieves only videos that meet all specified conditions, regardless of additional irrelevant elements. We propose CARIM, a framework for driving scene retrieval that employs inclusive text matching. By utilizing a Vision-Language Model (VLM) and a Large Language Model (LLM) to generate compressed captions for driving scenes, we transform text-to-video retrieval into a more efficient text-to-text retrieval problem, eliminating modality mismatches and heavy annotation costs. We introduce a novel positive and negative data curation strategy and an attention-based scoring mechanism tailored for driving scene retrieval. Experimental results on the DRAMA dataset demonstrate that CARIM outperforms state-of-the-art retrieval methods, excelling in edge cases where traditional models fail.
Poster
Yiming Zhang · Zhuokai Zhao · Zhaorun Chen · Zenghui Ding · Xianjun Yang · Yining Sun
[ Exhibit Hall I ]
Abstract
Recent advancements in multimodal large language models (MLLMs) have opened new avenues for video understanding. However, achieving high fidelity in zero-shot video tasks remains challenging. Traditional video processing methods rely heavily on fine-tuning to capture nuanced spatial-temporal details, which incurs significant data and computation costs. In contrast, training-free approaches, though efficient, often lack robustness in preserving context-rich features across complex video content. To this end, we propose DyTo, a novel dynamic token merging framework for zero-shot video understanding that adaptively optimizes token efficiency while preserving crucial scene details. DyTo integrates a hierarchical frame selection and a bipartite token merging strategy to dynamically cluster key frames and selectively compress token sequences, striking a balance between computational efficiency and semantic richness. Extensive experiments across multiple benchmarks demonstrate the effectiveness of DyTo, achieving superior performance compared to both fine-tuned and training-free methods and setting a new state-of-the-art for zero-shot video understanding.
Poster
Shaojie Zhang · Jiahui Yang · Jianqin Yin · Zhenbo Luo · Jian Luan
[ Exhibit Hall I ]
Abstract
Multimodal Large Language Models (MLLMs) have demonstrated significant success in visual understanding tasks. However, challenges persist in adapting these models for video comprehension due to the large volume of data and temporal complexity. Existing Video-LLMs using uniform frame sampling often struggle to capture the query-related crucial spatiotemporal clues of videos effectively. In this paper, we introduce Q-Frame, a novel approach for adaptive frame selection and multi-resolution scaling tailored to the video's content and the specific query. Q-Frame employs a training-free, plug-and-play strategy generated by a text-image matching network like CLIP, utilizing the Gumbel-Max trick for efficient frame selection. Q-Frame allows Video-LLMs to process more frames without exceeding computational limits, thereby preserving critical temporal and spatial information. We demonstrate Q-Frame's effectiveness through extensive experiments on benchmark datasets, including MLVU, LongVideoBench, and Video-MME, illustrating its superiority over existing methods and its applicability across various video understanding tasks.
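The Gumbel-Max trick named in the abstract above can be illustrated with a minimal sketch. The abstract does not specify how Q-Frame applies it, so the function name `gumbel_top_k`, the `temperature` parameter, and the top-k extension are assumptions for illustration only. The idea is that adding independent Gumbel(0, 1) noise to a set of scores and taking the largest perturbed value samples an index with probability proportional to the softmax of the scores; keeping the k largest perturbed values samples k distinct indices.

```python
import math
import random


def gumbel_top_k(scores, k, temperature=1.0, rng=None):
    """Pick k distinct indices by perturbing scores with Gumbel(0, 1) noise.

    With k=1 this is the Gumbel-Max trick: the argmax of
    (score / temperature + noise) follows softmax(score / temperature).
    Hypothetical helper, not taken from the Q-Frame paper.
    """
    rng = rng or random.Random()
    perturbed = []
    for index, score in enumerate(scores):
        # Clamp the uniform sample away from 0 and 1 so both logs are finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel_noise = -math.log(-math.log(u))
        perturbed.append((score / temperature + gumbel_noise, index))
    # Keep the k largest perturbed scores and return their indices in order.
    perturbed.sort(reverse=True)
    return sorted(index for _, index in perturbed[:k])


# Example: choose 2 of 4 frames given hypothetical query-frame match scores.
frame_scores = [0.1, 0.9, 0.5, 0.3]
selected = gumbel_top_k(frame_scores, k=2, rng=random.Random(0))
```

Frames with higher scores are selected more often, but lower-scoring frames still have a nonzero chance, which is the usual reason for sampling rather than taking a hard top-k.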
Poster
Jinsol Song · Jiamu Wang · Anh Nguyen · Keunho Byeon · Sangjeong Ahn · Sung Hak Lee · Jin Tae Kwak
[ Exhibit Hall I ]
Abstract
Anomaly detection aims to identify rare and scarce anomalies, which is particularly challenging in computational pathology, where disease-related data are often limited or nonexistent. Existing anomaly detection methods, primarily designed for industrial settings, face limitations in pathology due to computational constraints, diverse tissue structures, and lack of interpretability. To address these challenges, we propose Ano-NAViLa, a normal and abnormal pathology knowledge-augmented vision-language model for anomaly detection in pathology images. Ano-NAViLa utilizes a pre-trained vision-language model with a lightweight trainable MLP, facilitating computational efficiency. By incorporating both normal and abnormal pathology knowledge, Ano-NAViLa enhances accuracy and robustness to variability in pathology images and provides interpretability through image-text associations. Evaluated on two lymph node datasets from different organs, Ano-NAViLa achieves state-of-the-art performance in anomaly detection and localization, outperforming competing models.
Poster
Matthias Kümmerer · Harneet Singh Khanuja · Matthias Bethge
[ Exhibit Hall I ]
Abstract
Recent advances in image-based saliency prediction are approaching gold standard performance levels on existing benchmarks. Despite this success, we show that predicting fixations across multiple saliency datasets remains challenging due to dataset bias. We find a significant performance drop (around 40%) when models trained on one dataset are applied to another. Surprisingly, increasing dataset diversity does not resolve this *inter-dataset gap*, with close to 60% attributed to dataset-specific biases. To address this remaining *generalization gap*, we propose a novel architecture extending a mostly dataset-agnostic encoder-decoder structure with fewer than 20 dataset-specific parameters that govern interpretable mechanisms such as multi-scale structure, center bias, and fixation spread. Adapting only these parameters to new data accounts for more than 75% of the generalization gap, with a large fraction of the improvement achieved with as few as 50 samples. Our model sets a new state-of-the-art on all three datasets of the MIT/Tuebingen Saliency Benchmark (MIT300, CAT2000, and COCO-Freeview), even when purely generalizing from unrelated datasets, but with a substantial boost when adapting to the respective training datasets. The model also provides valuable insights into spatial saliency properties, revealing complex multi-scale effects that combine both absolute and relative sizes.
Poster
Bingchen Gong · Diego Gomez · Abdullah Hamdi · Abdelrahman Eldesokey · Ahmed Abdelreheem · Peter Wonka · Maks Ovsjanikov
[ Exhibit Hall I ]
Abstract
We propose a novel zero-shot approach for keypoint detection on 3D shapes. Point-level reasoning on visual data is challenging as it requires precise localization capability, posing problems even for powerful models like DINO or CLIP. Traditional methods for 3D keypoint detection rely heavily on annotated 3D datasets and extensive supervised training, limiting their scalability and applicability to new categories or domains. In contrast, our method utilizes the rich knowledge embedded within Multi-Modal Large Language Models (MLLMs). Specifically, we demonstrate, for the first time, that pixel-level annotations used to train recent MLLMs can be exploited for both extracting and naming salient keypoints on 3D models without any ground truth labels or supervision. Experimental evaluations demonstrate that our approach achieves competitive performance on standard benchmarks compared to supervised methods, despite not requiring any 3D keypoint annotations during training. Our results highlight the potential of integrating language models for localized 3D shape understanding. This work opens new avenues for cross-modal learning and underscores the effectiveness of MLLMs in contributing to 3D computer vision challenges.
Poster
Fengzhe Zhou · Humphrey Shi
[ Exhibit Hall I ]
Abstract
Recently, Mask2Former has achieved significant success as a universal image segmentation framework, with its Multi-Scale Deformable Attention (MSDeformAttn) Pixel Decoder becoming a widely adopted component in current segmentation models. However, the inefficiency of MSDeformAttn has become a performance bottleneck for segmenters. To address this, we propose the Hyper Pixel Decoder (HyPiDecoder), an improved Pixel Decoder design that replaces parts of the MSDeformAttn layers with convolution-based FPN layers, introducing explicit locality information and significantly boosting inference speed. Experimental results show that HyPiDecoder can be applied to both universal segmentation models and unified segmentation and detection models, achieving improvements in both speed and accuracy across object detection, semantic, instance, and panoptic segmentation tasks. The Mask DINO model integrated with HyPiDecoder achieves a new SOTA of 58.8 PQ on COCO panoptic segmentation with SwinL-scale backbone and no extra training data, with a 127% increase in inference speed compared to the original model. Code will be released in the future.
Poster
Yanqi Li · Jianwei Niu · Tao Ren
[ Exhibit Hall I ]
Abstract
Open-Vocabulary Object Detection (OVOD) aims to localize and recognize objects from both known and novel categories. However, existing methods rely heavily on internal knowledge from Vision-Language Models (VLMs), restricting their generalization to unseen categories due to limited contextual understanding. To address this, we propose CODet, a plug-and-play framework that enhances OVOD by integrating object co-occurrence, a form of external contextual knowledge pervasive in real-world scenes. Specifically, CODet extracts visual co-occurrence patterns from images, aligns them with textual dependencies validated by Large Language Models (LLMs), and injects contextual co-occurrence pseudo-labels as external knowledge to guide detection. Without architectural changes, CODet consistently improves five state-of-the-art VLM-based detectors across two benchmarks, achieving notable gains (up to +2.3 AP on novel categories). Analyses further confirm its ability to encode meaningful contextual guidance, advancing open-world perception by bridging visual and textual co-occurrence knowledge.
Poster
Bingqing Zhang · Zhuo Cao · Heming Du · Yang Li · Xue Li · Jiajun Liu · Sen Wang
[ Exhibit Hall I ]
Abstract
Despite recent advances, text-to-video retrieval (TVR) is still hindered by multiple inherent uncertainties, such as ambiguous textual queries, indistinct text-video mappings, and low-quality video frames. Although interactive systems have emerged to address these challenges by refining user intent through clarifying questions, current methods typically rely on heuristic or ad-hoc strategies without explicitly quantifying these uncertainties, limiting their effectiveness. Motivated by this gap, we propose UMIVR, an Uncertainty-Minimizing Interactive Text-to-Video Retrieval framework that explicitly quantifies three critical uncertainties (text ambiguity, mapping uncertainty, and frame uncertainty) via principled, training-free metrics: semantic entropy-based Text Ambiguity Score (TAS), Jensen–Shannon divergence-based Mapping Uncertainty Score (MUS), and a Temporal Quality-based Frame Sampler (TQFS). By adaptively generating targeted clarifying questions guided by these uncertainty measures, UMIVR iteratively refines user queries, significantly reducing retrieval ambiguity. Extensive experiments on multiple benchmarks validate UMIVR's effectiveness, achieving notable gains in Recall@1 (69.2% after 10 interactive rounds) on the MSR-VTT-1k dataset, thereby establishing an uncertainty-minimizing foundation for interactive TVR.
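For readers unfamiliar with the Jensen–Shannon divergence that the Mapping Uncertainty Score is based on, the following is a minimal sketch of the divergence itself. It does not reproduce how UMIVR builds its distributions or turns the divergence into a score, which the abstract does not describe; the function names and the example distributions are hypothetical.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in nats.

    Terms with p_i == 0 contribute nothing, by the usual convention.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: the average KL divergence of p and q to
    their midpoint. It is symmetric and bounded above by ln(2)."""
    midpoint = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, midpoint) + 0.5 * kl_divergence(q, midpoint)


# Hypothetical example: two similarity distributions over four candidate videos.
peaked = [0.85, 0.05, 0.05, 0.05]
uniform = [0.25, 0.25, 0.25, 0.25]
divergence = js_divergence(peaked, uniform)
```

Because the midpoint distribution is nonzero wherever either input is nonzero, the divergence stays finite even when the two distributions have disjoint support, which makes it a convenient bounded measure.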
Poster
Ji Du · Xin WANG · Fangwei Hao · Mingyang Yu · Chunyuan Chen · Jiesheng Wu · Bin Wang · Jing Xu · Ping Li
[ Exhibit Hall I ]
Abstract
At the core of Camouflaged Object Detection (COD) lies segmenting objects from their highly similar surroundings. Previous efforts navigate this challenge primarily through image-level modeling or annotation-based optimization. Despite advancing considerably, this commonplace practice hardly taps valuable dataset-level contextual information or relies on laborious annotations. In this paper, we propose RISE, a RetrIeval SElf-augmented paradigm that exploits the entire training dataset to generate pseudo-labels for single images, which can be used to train COD models. RISE begins by constructing prototype libraries for environments and camouflaged objects using training images (without ground truth), followed by K-Nearest Neighbor (KNN) retrieval to generate pseudo-masks for each image based on these libraries. It is important to recognize that using only training images without annotations poses a pronounced challenge in crafting high-quality prototype libraries. In this light, we introduce a Clustering-then-Retrieval (CR) strategy, where coarse masks are first generated through clustering, facilitating subsequent histogram-based image filtering and cross-category retrieval to produce high-confidence prototypes. In the KNN retrieval stage, to alleviate the effect of artifacts in feature maps, we propose Multi-View KNN Retrieval (MVKR), which integrates retrieval results from diverse views to produce more robust and precise pseudo-masks. Extensive experiments demonstrate that RISE significantly outperforms state-of-the-art …
Poster
Weitian Wang · Shubham rai · Cecilia De la Parra · Akash Kumar
[ Exhibit Hall I ]
Abstract
In this paper, we propose MixA-Q, a mixed-precision activation quantization framework that leverages intra-layer activation sparsity (a concept widely explored in activation pruning methods) for efficient inference of quantized window-based vision transformers. For a given uniform-bit quantization configuration, MixA-Q separates the batched window computations within Swin blocks and assigns a lower bit width to the activations of less important windows, improving the trade-off between model performance and efficiency. We introduce a Two-Branch Swin Block that processes activations separately in high- and low-bit precision, enabling seamless integration of our method with most quantization-aware training (QAT) and post-training quantization (PTQ) methods, or with simple modifications. Our experimental evaluations over the COCO dataset demonstrate that MixA-Q achieves a training-free 1.35× computational speedup without accuracy loss in PTQ configuration. With QAT, MixA-Q achieves a lossless 1.25× speedup and a 1.53× speedup with only a 1% mAP drop by incorporating activation pruning. Notably, by reducing the quantization error in important regions, our sparsity-aware quantization adaptation improves the mAP of the quantized W4A4 model (with both weights and activations in 4-bit precision) by 0.7%, reducing quantization degradation by 24%.
Poster
Wentao Xiang · Haoxian Tan · Cong Wei · Yujie Zhong · Dengjie Li · Yujiu Yang
[ Exhibit Hall I ]
Abstract
Perception is a fundamental task in the field of computer vision, encompassing a diverse set of subtasks that can be systematically categorized into four distinct groups based on two critical dimensions: prediction type and instruction type. Notably, existing research often focuses solely on a limited subset of these potential combinations, which constrains its applicability and versatility across various contexts. In response to this challenge, we present MVP, a novel and unified Visual Large Language Model (VLLM) framework designed to integrate both word-based and sentence-based perception tasks alongside box and mask predictions, all within a single framework. MVP employs an innovative multi-granularity decoder coupled with a unified prompt template, which together enable the seamless joint training of a wide array of tasks, including but not limited to panoptic segmentation, detection, grounding, and referring expression segmentation. Furthermore, we introduce a query enhancement strategy aimed at harnessing the decoding and generative capabilities inherent in large language models. Extensive experiments conducted across a range of benchmarks in both word-based and sentence-based perception tasks substantiate the efficacy of our framework.
Poster
Sofiène Boutaj · Marin Scalbert · Pierre Marza · Florent Couzinie-Devy · Maria Vakalopoulou · Stergios Christodoulidis
[ Exhibit Hall I ]
Abstract
Whole slide image (WSI) analysis in digital pathology presents unique challenges due to the gigapixel resolution of WSIs and the scarcity of dense supervision signals. While Multiple Instance Learning (MIL) is a natural fit for slide-level tasks, training robust models requires large and diverse datasets. Even though image augmentation techniques could be utilized to increase data variability and reduce overfitting, implementing them effectively is not a trivial task. Traditional patch-level augmentation is prohibitively expensive due to the large number of patches extracted from each WSI, and existing feature-level augmentation methods lack control over transformation semantics. We introduce HistAug, a fast and efficient generative model for controllable augmentations in the latent space for digital pathology. By conditioning on explicit patch-level transformations (e.g., hue, erosion), HistAug generates realistic augmented embeddings while preserving initial semantic information. Our method allows the processing of a large number of patches in a single forward pass efficiently, while at the same time consistently improving MIL model performance. Experiments across multiple slide-level tasks and diverse organs show that HistAug outperforms existing methods, particularly in low-data regimes. Ablation studies confirm the benefits of learned transformations over noise-based perturbations and highlight the importance of uniform WSI-wise augmentation.
Poster
Manahil Raza · Ayesha Azam · Talha Qaiser · Nasir Rajpoot
[ Exhibit Hall I ]
Abstract
Current multimodal fusion approaches in computational oncology primarily focus on integrating multi-gigapixel histology whole-slide images (WSIs) with genomic or transcriptomic data, demonstrating improved survival prediction. We hypothesize that incorporating pathology reports can further enhance prognostic performance. Pathology reports, as essential components of clinical workflows, offer readily available complementary information by summarizing histopathological findings and integrating expert interpretations and clinical context. However, fusing these modalities poses challenges due to their heterogeneous nature. WSIs are high-dimensional, each containing several billion pixels, whereas pathology reports consist of concise text summaries of varying lengths, leading to potential modality imbalance. To address this, we propose a prototype-based approach to generate balanced representations, which are then integrated using a Transformer-based fusion model for survival prediction that we term PS3 (Predicting Survival from Three Modalities). Specifically, we present: (1) Diagnostic prototypes from pathology reports, leveraging self-attention to extract diagnostically relevant sections and standardize text representation; (2) Histological prototypes to compactly represent key morphological patterns in WSIs; and (3) Biological pathway prototypes to encode transcriptomic expressions, accurately capturing cellular functions. PS3, the three-modal transformer model, processes the resulting prototype-based multimodal tokens and models intra-modal and cross-modal interactions across pathology reports, WSIs and transcriptomic data. The proposed model outperforms …
Poster
Chenghao Xiao · Isaac Chung · Imene Kerboua · Jamie Stirling · Xin Zhang · Márton Kardos · Roman Solomatin · Noura Al Moubayed · Kenneth Enevoldsen · Niklas Muennighoff
[ Exhibit Hall I ]
Abstract
Image representation learning and image-text alignment have advanced rapidly, becoming key components in multi-modal research. However, these advancements are often evaluated through distinct, task-specific protocols, leading to a fragmented understanding of model capabilities. For instance, it is unclear how capabilities measured by linear probing translate to retrieval and vice-versa. We introduce the Massive Image Embedding Benchmark (MIEB), a comprehensive benchmark designed to evaluate the capabilities of image embeddings across the broadest spectrum of tasks to date. MIEB spans 8 task categories, covering 130 tasks and a total of 39 languages. By benchmarking the performance of 50 models, MIEB uncovers hidden capabilities of advanced vision models beyond semantic alignment, such as their accurate visual representation of text; but also reveals their yet limited capabilities in robust compositionality and interleaved encoding. The benchmark aims to provide insights for guiding the design of universal image embeddings that encode multi-modal information. Additionally, we show that vision encoders' performance on MIEB tasks highly correlates with MLLMs' performance on downstream tasks, such as Visual STS tasks' over $99\%$ correlation with MLLMs' performance on OCRBench and TextVQA. Our findings underscore the importance of assessing vision embeddings beyond classification and retrieval tasks, highlighting their role in building multi-modal …
Poster
Takumi Kobayashi
[ Exhibit Hall I ]
Abstract
While deep models are effectively trained with a softmax cross-entropy loss, a cosine-based softmax loss also works for producing favorable feature embeddings. In the cosine-based softmax, temperature plays a crucial role in properly scaling the logits of cosine similarities, yet it is manually tuned in ad hoc ways because there is little prior knowledge about the temperature. In this paper, we address the challenging problem of adaptively estimating the temperature of the cosine-based softmax in the framework of supervised image classification. By analyzing the cosine-based softmax representation from a geometrical viewpoint regarding features and classifiers, we construct a criterion in a least-squares fashion that enables us to optimize the temperature at each sample via simple greedy search. In addition, our thorough analysis of temperature clarifies that the feature embedding produced by the cosine-based softmax loss is endowed with diverse characteristics that are controllable by the temperature in an explainable way. The experimental results demonstrate that our optimized temperature helps determine a feasible range of temperature to control the feature characteristics and produces favorable performance on various image classification tasks.
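To make the role of temperature concrete, here is a minimal sketch of a cosine-based softmax. It shows only the standard construction (cosine similarities divided by a temperature, then softmax) and not the paper's temperature estimation criterion or greedy search, which the abstract does not detail. The function names and the choice to divide by the temperature rather than multiply by a scale factor are assumptions for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_softmax(feature, class_weights, temperature):
    """Class probabilities from cosine-similarity logits scaled by 1 / temperature."""
    logits = [cosine(feature, w) / temperature for w in class_weights]
    # Subtract the largest logit before exponentiating for numerical stability.
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy example: one feature vector and two class weight vectors.
feature = [1.0, 0.0]
class_weights = [[1.0, 0.0], [0.0, 1.0]]
soft = cosine_softmax(feature, class_weights, temperature=1.0)
sharp = cosine_softmax(feature, class_weights, temperature=0.1)
```

Since cosine similarities lie in [-1, 1], the logits are bounded by 1 / temperature in magnitude: a temperature of 1.0 leaves the output fairly flat even for a perfectly aligned feature, while a smaller temperature sharpens it. This is why the choice of temperature matters so much in this setting.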
Poster
Matt De Vries · Reed Naidoo · Olga Fourkioti · Lucas Dent · Nathan Curry · Chris Dunsby · Chris Bakal
[ Exhibit Hall I ]
Abstract
Understanding 3D cell shape is crucial in biomedical research, where morphology serves as a key indicator of disease, cellular state, and drug response. However, existing 3D point cloud classification models often lack interpretability, making it difficult to extract biologically meaningful insights. To address this, we propose PointMIL, an inherently interpretable point cloud classifier using Multiple Instance Learning (MIL). Unlike other methods that rely on global interpretations, PointMIL simultaneously improves accuracy of point cloud-based classifier backbones and provides fine-grained, point-specific explanations, pinpointing the most informative regions of 3D shapes, without requiring $\textit{post-hoc}$ analysis. We demonstrate PointMIL on two publicly available datasets of biological cells showing state-of-the-art mACC (97.3%) and F1 (97.5%) on the IntrA biomedical dataset. Additionally, we introduce a novel dataset of drug-treated cancer cells (Morph3DCell), to show PointMIL's ability to reveal the morphological effects of drug treatments at a fine-grained level, with implications for drug discovery and mechanism-of-action prediction. Beyond biomedical applications, we show that PointMIL also offers quality interpretations and improves the classification accuracy on standard shape benchmarks such as ModelNet40 and ScanObjectNN, demonstrating its generalisation to broader 3D object recognition tasks.
Poster
Wenliang Zhong · Rob Barton · Weizhi An · Feng Jiang · Hehuan Ma · Yuzhi Guo · Abhishek Dan · Shioulin Sam · Karim Bouyarmane · Junzhou Huang
[ Exhibit Hall I ]
Abstract
Composed Image Retrieval (CIR) targets the retrieval of images conditioned on a reference image and a textual modification, but constructing labeled triplets (reference image, textual modification, target image) is inherently challenging. Existing Zero-Shot CIR (ZS-CIR) approaches often rely on well-aligned vision-language models (VLMs) to combine visual and textual inputs, or use large language models (LLMs) for richer modification understanding. While LLM-based methods excel in capturing textual details, they are computationally costly, slow to infer, and often restricted by proprietary constraints. In this paper, we argue that the superior performance of LLM-based ZS-CIR methods primarily stems from their capacity to follow instructions, an aspect largely missing in more efficient projection-based models built upon VLMs. To bridge this gap, we introduce DistillCIR, a dual-stream distillation framework that transfers LLMs’ instruction-following capability into compact, projection-based architectures. By synthesizing triplet data with an LLM and incorporating a novel reasoning process, DistillCIR learns both composed retrieval and instruction awareness. In addition, we train an open-source multimodal LLM on the generated data, and further distill its instruction-aware embeddings into the projection-based model. Without any reliance on LLMs at inference, DistillCIR significantly surpasses state-of-the-art ZS-CIR methods in both performance and efficiency, offering a promising direction for instruction-aware, …
Poster
Zishu Qin · Junhao Xu · Weifeng Ge
[ Exhibit Hall I ]
Abstract
Deep learning algorithms are highly data-intensive, particularly for tasks requiring pixel-level annotations, such as semantic segmentation, which makes achieving pixel-level image understanding costly. Few-shot segmentation seeks to address this challenge by enabling models to segment novel objects using only a limited number of labeled support images as references. In this paper, we argue that the traditional image-to-mask decoding framework places excessive reliance on the quality of the support sample, which is prone to errors when encountering class bias. Thus, we propose a novel image-to-mask denoising learning paradigm for few-shot segmentation, transforming mask decoding into a denoising process to reduce the support reliance problem with the help of denoising diffusion models. We formulate our image-to-mask denoising learning process in two stages: an image corruption stage and a mask denoising stage. In the first stage, we introduce an adaptive image corruption method that perturbs the image based on regional semantics, motivated by the insight of perturbing data to populate low data density regions. In the second stage, we employ an in-model denoising paradigm, designing a network to facilitate support-to-query semantic propagation and mask denoising in a single forward pass. To enhance categorical discrimination for the denoising network, we incorporate discriminative attribute learning, …
Poster
Jiawen Zhu · YEW-SOON ONG · Chunhua Shen · Guansong Pang
[ Exhibit Hall I ]
Abstract
Current zero-shot anomaly detection (ZSAD) methods show remarkable success in prompting large pre-trained vision-language models to detect anomalies in a target dataset without using any dataset-specific training or demonstration. However, these methods often focus on crafting/learning prompts that capture only coarse-grained semantics of abnormality, e.g., high-level semantics like "damaged", "imperfect", or "defective" objects. They therefore have limited capability in recognizing diverse abnormality details that deviate from these general abnormal patterns in various ways. To address this limitation, we propose FAPrompt, a novel framework designed to learn Fine-grained Abnormality Prompts for accurate ZSAD. To this end, a novel Compound Abnormality Prompt learning (CAP) module is introduced in FAPrompt to learn a set of complementary, decomposed abnormality prompts, where abnormality prompts are enforced to model diverse abnormal patterns derived from the same normality semantic. On the other hand, the fine-grained abnormality patterns can be different from one dataset to another. To enhance the cross-dataset generalization, another novel module, namely Data-dependent Abnormality Prior learning (DAP), is introduced in FAPrompt to learn a sample-wise abnormality prior from abnormal features of each test image to dynamically adapt the abnormality prompts to individual test images. Comprehensive experiments on 19 real-world datasets, covering both industrial defects and …
Poster
Yi Chen · Yuying Ge · Weiliang Tang · Yizhuo Li · Yixiao Ge · Mingyu Ding · Ying Shan · Xihui Liu
[ Exhibit Hall I ]
Abstract
Recent developments in Large Language Models (LLMs) pre-trained on extensive corpora have shown significant success in various natural language processing (NLP) tasks with minimal fine-tuning. This success offers new promise for robotics, which has long been constrained by the high cost of action-labeled data. We ask: given the abundant video data containing interaction-related knowledge available as a rich "corpus", **can a similar generative pre-training approach be effectively applied to enhance robot learning?** The key challenge is to identify an effective representation for autoregressive pre-training that benefits robot manipulation tasks. Inspired by the way humans learn new skills through observing dynamic environments, we propose that effective robotic learning should emphasize motion-related knowledge, which is closely tied to low-level actions and is hardware-agnostic, facilitating the transfer of learned motions to actual robot actions. To this end, we introduce **Moto**, which converts video content into latent **Mo**tion **To**ken sequences by a Latent Motion Tokenizer, learning a bridging "language" of motion from videos in an unsupervised manner. We pre-train Moto-GPT through motion token autoregression, enabling it to capture diverse visual motion knowledge. After pre-training, Moto-GPT demonstrates the promising ability to produce semantically interpretable motion tokens, predict plausible motion trajectories, and assess trajectory rationality through output …
Poster
Hanyi Wang · Han Fang · Shi-Lin Wang · Ee-Chien Chang
[ Exhibit Hall I ]
Abstract
Generative image watermarking enables the proactive detection and traceability of generated images. Among existing methods, inversion-based frameworks achieve highly concealed watermark embedding by injecting watermarks into the latent representation before the diffusion process. The robustness of this approach hinges on both the embedding mechanism and inversion accuracy. However, prior works have predominantly focused on optimizing the embedding process while overlooking inversion errors, which significantly affect extraction fidelity. In this paper, we address the challenge of inversion errors and propose ROAR, a dual-domain optimization-based framework designed to mitigate errors arising from two key sources: 1) latent-domain errors, which accumulate across inversion steps due to inherent approximation assumptions; and 2) pixel-domain errors, which result from channel distortions such as JPEG compression. To tackle these issues, we introduce two novel components: a **Regeneration-based Optimization (RO)** mechanism, which incorporates an optimizable starting latent to minimize latent-domain errors; and a Mixture of Experts (MoE)-based **distortion-adaptive restoration (AR)** network, which effectively recovers watermarked distributions from pixel-level distortions. Extensive experiments demonstrate that ROAR significantly reduces inversion errors and enhances watermark extraction robustness, thereby improving the reliability of generative image watermarking.
Poster
Lujun Li · Cheng Lin · Dezhi Li · You-Liang Huang · Wei Li · Tianyu Wu · Jie Zou · Wei Xue · Sirui Han · Yike Guo
[ Exhibit Hall I ]
Abstract
Low-Rank Adaptation (LoRA) has become a popular paradigm for fine-tuning large models, but it still requires a substantial number of trainable parameters. To address this issue, we first conduct comprehensive empirical studies on parameter-efficient LoRA structures. Then, we establish design guidelines that emphasize the use of serial structures, optimal placements, and nested LoRA. Based on these insights, we present NoRA, a nested parameter-efficient LoRA structure that revolutionizes the initialization and fine-tuning of projection matrices. NoRA freezes the outer-layer LoRA weights and employs a serial inner-layer design, enabling precise task-specific adaptations while keeping the number of trainable parameters compact. In addition, we propose an activation-aware Singular Value Decomposition (AwSVD) that adjusts the weight matrices based on activation distributions for initialization of the outer-layer LoRA weights. This scheme enhances decomposition accuracy and mitigates computational errors. Extensive evaluations across multiple large models demonstrate that NoRA outperforms state-of-the-art LoRA variants, achieving significant improvements in the performance-efficiency trade-off on visual few-shot tasks, visual instruction tuning, and subject-driven generation. Code is available in the supplementary materials.
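The abstract does not define AwSVD, so the following is only a minimal sketch of the general idea of an activation-aware decomposition. It assumes that each input channel of the weight matrix is scaled by its mean absolute activation before a truncated SVD, and that the scaling is undone afterwards. The function name `activation_aware_svd`, the choice of scale, and the way the singular values are split between the two factors are all hypothetical; how NoRA then freezes and nests these factors is not shown.

```python
import numpy as np

def activation_aware_svd(W, X, rank, eps=1e-6):
    """Rank-`rank` factors B, A with B @ A ~= W, weighted by activation scale.

    W: (out_features, in_features) weight matrix.
    X: (num_samples, in_features) calibration activations.
    The per-channel scale below is an assumption, not the paper's definition.
    """
    W = np.asarray(W, dtype=float)
    X = np.asarray(X, dtype=float)
    scale = np.abs(X).mean(axis=0) + eps              # one scale per input channel
    # W * scale multiplies column j by scale[j], i.e. W @ diag(scale)
    U, S, Vt = np.linalg.svd(W * scale, full_matrices=False)
    root = np.sqrt(S[:rank])
    B = U[:, :rank] * root                            # (out_features, rank)
    A = (root[:, None] * Vt[:rank]) / scale           # (rank, in_features), scaling undone
    return B, A

rng = np.random.default_rng(0)
W = rng.normal(size=(6, 8))
X = rng.normal(size=(32, 8)) * np.linspace(0.1, 3.0, 8)  # channels with unequal magnitude
B, A = activation_aware_svd(W, X, rank=2)
```

Under this assumption, the truncated factors minimize the reconstruction error of `W` after weighting each input channel by its activation scale, so channels that carry large activations are approximated more closely than a plain truncated SVD would manage at the same rank.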
Poster
Dohwan Ko · Ji Soo Lee · Minhyuk Choi · Zihang Meng · Hyunwoo Kim
[ Exhibit Hall I ]
Abstract
Text-Video Retrieval has been extensively studied to accurately retrieve the most relevant text (or video) candidate given a video (or text) query from large-scale online databases. With the advancement of multi-modal large language models (MLLMs), recent studies have proposed MLLM-based retrieval systems to enhance retrieval performance, particularly for long and complex query-candidate pairs. However, we observe that the naive application of MLLMs, $\textit{i.e.}$, retrieval based on candidate likelihood, introduces $\textit{candidate prior bias}$, wherein candidates with inherently higher prior probabilities are favored over those that are more relevant to the query. To this end, we propose a novel retrieval framework, Bidirectional Likelihood Estimation with MLLM ($\textbf{BLiM}$), which leverages query likelihood as well as candidate likelihood by training the model to generate text from a given video as well as video features from a given text. Furthermore, we introduce Candidate Prior Normalization ($\textbf{CPN}$), a simple yet effective training-free score calibration module designed to mitigate candidate prior bias in candidate likelihood. On four Text-Video Retrieval benchmarks, our BLiM equipped with CPN outperforms previous state-of-the-art models by an average margin of 6.4 in R@1, effectively alleviating candidate prior bias and emphasizing the relevance between the query and candidate. Our in-depth analysis across various multi-modal …
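The abstract describes CPN as a training-free score calibration but does not give its formula. As a rough illustration of how a candidate prior can distort likelihood-based retrieval, the sketch below assumes one common form of prior normalization: subtracting a scaled log prior of each candidate from its log likelihood given the query. The function name `calibrate_scores`, the weight `alpha`, and the toy probabilities are hypothetical.

```python
import numpy as np

def calibrate_scores(log_lik, log_prior, alpha=1.0):
    """Hypothetical calibrated score: log P(c | q) - alpha * log P(c)."""
    log_lik = np.asarray(log_lik, dtype=float)
    log_prior = np.asarray(log_prior, dtype=float)
    return log_lik - alpha * log_prior

# Toy numbers, made up for illustration: candidate 0 is generic and has a high prior.
log_lik = np.log([0.30, 0.20])     # P(candidate | query)
log_prior = np.log([0.50, 0.05])   # P(candidate) with no query given

raw_best = int(np.argmax(log_lik))
calibrated_best = int(np.argmax(calibrate_scores(log_lik, log_prior)))
```

In this toy case the raw likelihood prefers the generic candidate, while the calibrated score prefers the candidate whose likelihood rises most once the query is given.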
Poster
Haochen Wang · Qirui Chen · Cilin Yan · Jiayin Cai · Xiaolong Jiang · Yao Hu · Weidi Xie · Stratis Gavves
[ Exhibit Hall I ]
Abstract
Video Large Language Models (VideoLLMs) have recently demonstrated remarkable progress in general video understanding. However, existing models primarily focus on high-level comprehension and are limited to text-only responses, restricting the flexibility for object-centric, multi-round interactions. In this paper, we make three contributions: (i) we address these limitations by introducing a VideoLLM, termed as **RGA3**, capable of performing both object referring and grounding for video reasoning tasks in a multi-round conversational manner, i.e., allowing users to iteratively interact with videos using both textual and visual queries; (ii) we propose **STOM** (Spatial-Temporal Overlay Module), a novel approach that allows arbitrary visual prompts to be processed at any timestamp within a video; (iii) we present **VideoInfer**, a manually curated object-centric video instruction dataset featuring question-answering pairs that require reasoning. We conduct comprehensive experiments on VideoInfer and other existing benchmarks across video question answering and referring video object segmentation. The results on 12 benchmarks spanning 6 tasks show that RGA3 consistently outperforms baseline models in both video question answering and segmentation, underscoring its robustness in multimodal, object-centric video and image understanding. The code, dataset, and web demo will be publicly released.
Poster
Zijie Xin · Minquan Wang · Jingyu Liu · Quan Chen · Ye Ma · Peng Jiang · Xirong Li
[ Exhibit Hall I ]
Abstract
Adding proper background music helps complete a short video to be shared. Previous research tackles the task by video-to-music retrieval (V2MR), which aims to find the most suitable music track from a collection to match the content of a given query video. In practice, however, music tracks are typically much longer than the query video, necessitating (manual) trimming of the retrieved music to a shorter segment that matches the video duration. In order to bridge the gap between the practical need for music moment localization and V2MR, we propose a new task termed Music Grounding by Short Video (MGSV). To tackle the new task, we introduce a new benchmark, MGSV-EC, which comprises a diverse set of 53k short videos associated with 35k different music moments from 4k unique music tracks. Furthermore, we develop a new baseline method, MaDe, which performs both video-to-music matching and music moment detection within a unified end-to-end deep network. Extensive experiments on MGSV-EC not only highlight the challenging nature of MGSV but also establish MaDe as a strong baseline. Data and code will be released.
Poster
Hanyu Zhou · Gim Hee Lee
[ Exhibit Hall I ]
Abstract
Large multimodal models (LMMs) excel in scene understanding but struggle with fine-grained spatiotemporal reasoning due to weak alignment between linguistic and visual representations. Existing methods map textual positions and durations onto frame-based videos, but suffer from temporal sparsity that limits language-vision temporal coordination. To address this issue, we introduce LLaFEA (Large Language and Frame-Event Assistant) to leverage event cameras for temporally dense perception and frame-event fusion. Our approach employs a cross-attention mechanism to integrate complementary spatial and temporal features, followed by self-attention matching for global spatio-temporal associations. We further embed textual position and duration tokens into the fused visual space to enhance fine-grained alignment. This unified framework ensures robust spatio-temporal coordinate alignment, enabling LMMs to interpret scenes at any position and time. In addition, we construct a dataset of real-world frames-events with coordinate instructions and conduct extensive experiments to validate the effectiveness of our method. Our code will be made publicly available.
Poster
Mattia Soldan · Fabian Caba Heilbron · Bernard Ghanem · Josef Sivic · Bryan Russell
[ Exhibit Hall I ]
Abstract
Several video understanding tasks, such as natural language temporal video grounding, temporal activity localization, and audio description generation, require "temporally dense" reasoning over frames sampled at high temporal resolution. However, computing frame-level features for these tasks is computationally expensive given the temporal resolution requirements. In this paper, we make three contributions to reduce the cost of computing features for temporally dense tasks. First, we introduce a vision transformer (ViT) architecture, dubbed ResidualViT, that leverages the large temporal redundancy in videos to efficiently compute temporally dense frame-level features. Our architecture incorporates (i) learnable residual connections that ensure temporal consistency across consecutive frames and (ii) a token reduction module that enhances processing speed by selectively discarding temporally redundant information while reusing weights of a pretrained foundation model. Second, we propose a lightweight distillation strategy to approximate the frame-level features of the original foundation model. Finally, we evaluate our approach across four tasks and five datasets, in both zero-shot and fully supervised settings, demonstrating significant reductions in computational cost (up to 60%) and improvements in inference speed (up to 2.5$\times$ faster), all while closely approximating the accuracy of the original foundation model.
Poster
Shuchang Ye · Usman Naseem · Mingyuan Meng · jinman kim
[ Exhibit Hall I ]
Abstract
Medical language-guided segmentation, integrating textual clinical reports to enhance image segmentation, has demonstrated significant improvements over unimodal approaches. However, its inherent reliance on paired image-text input, which we refer to as textual reliance, presents two fundamental limitations: 1) many medical segmentation datasets lack paired reports, leaving a substantial portion of image-only data underutilized for training; and 2) inference is limited to retrospective analysis of cases with paired reports, restricting its applicability in most clinical scenarios, where segmentation typically precedes reporting. To address these limitations, we propose ProLearn, the first Prototype-driven Learning framework for language-guided segmentation that fundamentally alleviates textual reliance. At its core, in ProLearn, we introduce a novel Prototype-driven Semantic Approximation (PSA) module to enable approximation of semantic guidance from textual input. PSA initializes a discrete and compact prototype space by distilling segmentation-relevant semantics from textual reports. Once initialized, it supports a query-and-respond mechanism which approximates semantic guidance for images without textual input, thereby alleviating textual reliance. Extensive experiments on QaTa-COV19 and MosMedData+ demonstrate that ProLearn outperforms state-of-the-art language-guided methods when limited text is available.
Poster
Yanguang Sun · Jiawei Lian · jian Yang · lei luo
[ Exhibit Hall I ]
Abstract
Large-scale pre-trained models provide powerful feature representations for downstream object segmentation tasks. However, when they are adapted to specific tasks through full-parameter fine-tuning, the enormous number of parameters being updated often results in significant computational overhead, creating a bottleneck in training efficiency. Although existing methods attempt to fine-tune frozen models by directly embedding trainable prompts, these prompts lack inherent semantic priors, limiting the adaptability of large-scale models. In this paper, we propose a novel dynamic priors-based fine-tuning paradigm with fewer trainable parameters, dubbed Controllable-LPMoE, which adaptively modulates frozen foundation models by dynamically controlling local priors to enhance fine-grained perception for specific segmentation tasks. More specifically, we construct a lightweight dynamic mixed local priors extractor that captures diverse local priors from input images through heterogeneous convolutions while employing a gating network to dynamically output expert priors required for the subsequent fine-tuning. Furthermore, we design a bi-directional interaction adapter that employs cosine-aligned deformable attention and channel-oriented adaptive scale enhancement to interact and restructure between frozen and trainable features, achieving efficient fine-tuning. Extensive experiments validate the superiority of our Controllable-LPMoE approach, demonstrating excellent segmentation performance compared to 31 state-of-the-art methods and adaptability to multiple binary object segmentation tasks.
Poster
Xiaoran Zhang · Byung-Woo Hong · Hyoungseob Park · Daniel Pak · Anne-Marie Rickmann · Lawrence Staib · James Duncan · Alex Wong
[ Exhibit Hall I ]
Abstract
We propose a model-agnostic, progressive test-time energy adaptation approach for medical image segmentation. Maintaining model performance across diverse medical datasets is challenging, as distribution shifts arise from inconsistent imaging protocols and patient variations. Unlike domain adaptation methods that require multiple passes through target data—impractical in clinical settings—our approach adapts pretrained models progressively as they process test data. Our method leverages a shape energy model trained on source data, which assigns an energy score at the patch level to segmentation maps: low energy represents in-distribution (accurate) shapes, while high energy signals out-of-distribution (erroneous) predictions. By minimizing this energy score at test time, we refine the segmentation model to align with the target distribution. To validate the effectiveness and adaptability, we evaluated our framework on eight public MRI (bSSFP, T1- and T2-weighted) and X-ray datasets spanning cardiac, spinal cord, and lung segmentation. We consistently outperform baselines both quantitatively and qualitatively.
Poster
Mingfeng Zha · Tianyu Li · Guoqing Wang · Peng Wang · Yangyang Wu · Yang Yang · Heng Tao Shen
[ Exhibit Hall I ]
Abstract
Audio-visual segmentation (AVS) aims to segment objects in videos based on audio cues. Existing AVS methods are primarily designed to enhance interaction efficiency but pay limited attention to modality representation discrepancies and imbalances. To overcome this, we propose the implicit counterfactual framework (ICF) to achieve unbiased cross-modal understanding. Due to the lack of semantics, heterogeneous representations may lead to erroneous matches, especially in complex scenes with ambiguous visual content or interference from multiple audio sources. We introduce the multi-granularity implicit text (MIT) involving video-, segment- and frame-level as the bridge to establish the modality-shared space, reducing modality gaps and providing prior guidance. Visual content carries more information and typically dominates, thereby marginalizing audio features in the decision-making. To mitigate knowledge preference, we propose the semantic counterfactual (SC) to learn orthogonal representations in the latent space, generating diverse counterfactual samples, thus avoiding biases introduced by complex functional designs and explicit modifications of text structures or attributes. We further formulate the collaborative distribution-aware contrastive learning (CDCL), incorporating factual-counterfactual and inter-modality contrasts to align representations, promoting cohesion and decoupling. Extensive experiments on three public datasets validate that the proposed method achieves state-of-the-art performance.
Poster
Shi-Chen Zhang · Yunheng Li · Yu-Huan Wu · Qibin Hou · Ming-Ming Cheng
[ Exhibit Hall I ]
Abstract
Semantic segmentation is fundamental to vision systems requiring pixel-level scene understanding, yet deploying it on resource-constrained devices demands efficient architectures. Although existing methods achieve real-time inference through lightweight designs, we reveal their inherent limitation: misalignment between class representations and image features caused by a per-pixel classification paradigm. With experimental analysis, we find that this paradigm results in a highly challenging assumption for efficient scenarios: image pixel features should not vary for the same category in different images. To address this dilemma, we propose a coupled dual-branch offset learning paradigm that explicitly learns feature and class offsets to dynamically refine both class representations and spatial image features. Based on the proposed paradigm, we construct an efficient semantic segmentation network, OffSeg. Notably, the offset learning paradigm can be applied to existing methods with no additional architectural changes. Extensive experiments on four datasets, including ADE20K, Cityscapes, COCO-Stuff-164K, and Pascal Context, demonstrate consistent improvements with negligible parameters. For instance, on the ADE20K dataset, our proposed offset learning paradigm improves SegFormer-B0, SegNeXt-T, and Mask2Former-Tiny by 1.9%, 2.4%, and 2.6% mIoU, respectively, with only 0.1-0.2M additional parameters required.
Poster
Zhangjun Zhou · Yiping Li · Chunlin Zhong · Jianuo Huang · Jialun Pei · Hua Li · He Tang
[ Exhibit Hall I ]
Abstract
While the human visual system employs distinct mechanisms to perceive salient and camouflaged objects, existing models struggle to disentangle these tasks. Specifically, salient object detection (SOD) models frequently misclassify camouflaged objects as salient, while camouflaged object detection (COD) models conversely misinterpret salient objects as camouflaged. We hypothesize that this can be attributed to two factors: (i) the specific annotation paradigm of current SOD and COD datasets, and (ii) the lack of explicit attribute relationship modeling in current models. Prevalent SOD/COD datasets enforce a mutual exclusivity constraint, assuming scenes contain either salient or camouflaged objects, which poorly aligns with the real world. Furthermore, current SOD/COD methods are primarily designed for these highly constrained datasets and lack explicit modeling of the relationship between salient and camouflaged objects. In this paper, to promote the development of unconstrained salient and camouflaged object detection, we construct a large-scale dataset, USC12K, which features comprehensive labels and four different scenes that cover all possible logical existence scenarios of both salient and camouflaged objects. To explicitly model the relationship between salient and camouflaged objects, we propose a model called USCNet, which introduces two distinct prompt query mechanisms for modeling inter-sample and intra-sample attribute relationships. Additionally, to assess the …
Poster
Wenlun Zhang · Yunshan Zhong · Shimpei Ando · Kentaro Yoshioka
[ Exhibit Hall I ]
Abstract
The Segment Anything Model (SAM) has demonstrated strong versatility across various visual tasks. However, its large storage requirements and high computational cost pose challenges for practical deployment. Post-training quantization (PTQ) has emerged as an effective strategy for efficient deployment, but we identify two key challenges in SAM that hinder the effectiveness of existing PTQ methods: the heavy-tailed and skewed distribution of post-GELU activations, and significant inter-channel variation in linear projection activations. To address these challenges, we propose AHCPTQ, an accurate and hardware-efficient PTQ method for SAM. AHCPTQ introduces hardware-compatible Hybrid Log-Uniform Quantization (HLUQ) to manage post-GELU activations, employing log2 quantization for dense small values and uniform quantization for sparse large values to enhance quantization resolution. Additionally, AHCPTQ incorporates Channel-Aware Grouping (CAG) to mitigate inter-channel variation by progressively clustering activation channels with similar distributions, enabling them to share quantization parameters and improving hardware efficiency. The combination of HLUQ and CAG not only enhances quantization effectiveness but also ensures compatibility with efficient hardware execution. For instance, under the W4A4 configuration on the SAM-L model, AHCPTQ achieves 36.6% mAP on instance segmentation with the DINO detector, while achieving a $7.89\times$ speedup and $8.64\times$ energy efficiency over its floating-point counterpart in FPGA implementation.
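The split between log2 quantization for small values and uniform quantization for large values can be illustrated with a toy quantizer. This is a sketch under stated assumptions, not the HLUQ definition from the paper: the function name `hybrid_log_uniform_quantize`, the fixed `threshold`, the level counts, and the handling of sign are all hypothetical choices made for illustration.

```python
import numpy as np

def hybrid_log_uniform_quantize(x, threshold=1.0, log_levels=8, uniform_levels=8):
    """Toy hybrid quantizer (an assumption, not the AHCPTQ definition).

    Magnitudes below `threshold` snap to threshold * 2**k for an integer k <= 0;
    magnitudes at or above it snap to a uniform grid from `threshold` to the max.
    """
    x = np.asarray(x, dtype=float)
    sign, mag = np.sign(x), np.abs(x)
    out = np.zeros_like(mag)

    # Log2 region: dense small values get finer resolution near zero.
    small = (mag > 0) & (mag < threshold)
    exponent = np.round(np.log2(mag[small] / threshold))
    exponent = np.clip(exponent, -(log_levels - 1), 0)
    out[small] = threshold * 2.0 ** exponent

    # Uniform region: sparse large values share an evenly spaced grid.
    large = mag >= threshold
    if large.any():
        step = (mag.max() - threshold) / max(uniform_levels - 1, 1)
        if step > 0:
            out[large] = threshold + np.round((mag[large] - threshold) / step) * step
        else:
            out[large] = threshold
    return sign * out

values = [0.0, 0.3, 0.9, 1.0, 2.0, 4.0, -0.3]
quantized = hybrid_log_uniform_quantize(values, threshold=1.0, log_levels=8, uniform_levels=4)
```

The power-of-two levels below the threshold place most of the resolution where a heavy-tailed distribution has most of its mass, while the uniform grid above it avoids the very coarse steps that a pure log2 scheme would give to large values.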
Poster
Dong Zhao · Qi Zang · Shuang Wang · Nicu Sebe · Zhun Zhong
[ Exhibit Hall I ]
Abstract
Pseudo-labeling is a key technique of semi-supervised and cross-domain semantic segmentation, yet its efficacy is often hampered by the intrinsic noise of pseudo-labels. This study introduces Pseudo-SD, a novel framework that redefines the utilization of pseudo-label knowledge through Stable Diffusion (SD). Our Pseudo-SD innovatively combines pseudo-labels and their text prompts to fine-tune SD models, facilitating the generation of high-quality, diverse synthetic images that closely mimic target data characteristics. Within this framework, two novel mechanisms, i.e., partial attention manipulation and structured pseudo-labeling, are proposed to effectively spread text-to-image correspondence during the SD fine-tuning process and to ensure controllable high-quality image synthesis, respectively. Extensive results demonstrate that Pseudo-SD significantly improves performance in semi-supervised and cross-domain segmentation scenarios. Moreover, our method is versatile and model-agnostic, and can complement existing methods. By injecting our Pseudo-SD into current methods, we establish new state-of-the-art results on different datasets, offering a new way to explore effective pseudo-label utilization.
Poster
Quanfeng Lu · Wenqi Shao · Zitao Liu · Lingxiao Du · Fanqing Meng · Boxuan Li · Botong Chen · Siyuan Huang · Kaipeng Zhang · Ping Luo
[ Exhibit Hall I ]
Abstract
Autonomous Graphical User Interface (GUI) navigation agents can enhance user experience in communication, entertainment, and productivity by streamlining workflows and reducing manual intervention. However, prior GUI agents are often trained with datasets comprising tasks that can be completed within a single app, leading to poor performance in cross-app navigation. To address this problem, we present GUIOdyssey, a comprehensive dataset for cross-app mobile GUI navigation. GUIOdyssey comprises 8,334 episodes with an average of 15.3 steps per episode, covering 6 mobile devices, 212 distinct apps, and 1,357 app combinations. Each step is enriched with detailed semantic reasoning annotations, which aid the model in building cognitive processes and enhancing its reasoning abilities for complex cross-app tasks. Building on GUIOdyssey, we develop OdysseyAgent, an exploratory multimodal agent for long-step cross-app navigation equipped with a history resampler module that efficiently attends to historical screenshot tokens, balancing performance and inference speed. Extensive experiments conducted in both in-domain and out-of-domain scenarios validate the effectiveness of our approach. Moreover, we demonstrate that historical information involving actions, screenshots, and context in our dataset can significantly enhance OdysseyAgent's performance on complex cross-app tasks.
Poster
Mahesh Bhosale · Abdul Wasi · Yuanhao Zhai · Yunjie Tian · Samuel Border · Nan Xi · Pinaki Sarder · Junsong Yuan · David Doermann · Xuan Gong
[ Exhibit Hall I ]
Abstract
Diffusion-based generative models have shown promise in synthesizing histopathology images to address data scarcity caused by privacy constraints. Diagnostic text reports provide high-level semantic descriptions, and masks offer fine-grained spatial structures essential for representing distinct morphological regions. However, public datasets lack paired text and mask data for the same histopathological images, limiting their joint use in image generation. This constraint restricts the ability to fully exploit the benefits of combining both modalities for enhanced control over semantics and spatial details. To overcome this, we propose PathDiff, a diffusion framework that effectively learns from unpaired mask-text data by integrating both modalities into a unified conditioning space. PathDiff allows precise control over structural and contextual features, generating high-quality, semantically accurate images. PathDiff also improves image fidelity, text-image alignment, and faithfulness, enhancing data augmentation for downstream tasks like nuclei segmentation and classification. Extensive experiments demonstrate its superiority over existing methods. Our code and models will be open-sourced.
Poster
Yiyuan Zhang · Handong Li · Jing Liu · Xiangyu Yue
[ Exhibit Hall I ]
Abstract
High-quality image-text data is critical in enhancing Vision-Language Models (VLMs), but traditional image-based pretraining approaches face limitations. These methods are resource-intensive, relying on curated, high-quality interleaved data that is costly and challenging to collect at scale. Additionally, while such datasets improve static image-text understanding, they fail to develop the temporal and motion comprehension needed for video understanding. To address these gaps, we propose incorporating video pretraining into VLMs to improve the model’s ability to capture temporal dynamics and general visual perception, which requires reconciling spatial redundancy with strict temporal causality. Therefore, we propose Causal Hierarchical Aggregation to separate computation-heavy spatial encoding from lightweight temporal propagation and construct hierarchical receptive fields at varying granularities. As we scale video context to more than 100B tokens, our method achieves high throughput and state-of-the-art performance on both image and video understanding, as shown in Figure 1, providing a scalable solution to enhance multimodal learning in dynamic contexts.
Poster
Raphaela Kang · Yue Song · Georgia Gkioxari · Pietro Perona
[ Exhibit Hall I ]
Abstract
Contrastive Language-Image Pre-Training (CLIP) is a popular method for learning multimodal latent spaces with well-organized semantics. Despite its wide range of applications, CLIP's latent space is known to fail at handling complex visual-textual interactions. Recent works attempt to address its shortcomings with data-centric or algorithmic approaches. But what if the problem is more fundamental, and lies in the geometry of CLIP? Toward this end, we rigorously analyze CLIP's latent space properties, and prove that no CLIP-like joint embedding space exists which can correctly do any two of the following at the same time: 1. represent basic descriptions and image content, 2. represent attribute binding, 3. represent spatial location and relationships, 4. represent negation. Informed by this analysis, we propose Dense Cosine Similarity Maps (DCSMs) as a principled and interpretable scoring method for CLIP-like models, which solves the fundamental limitations of CLIP by retaining the semantic topology of the image patches and text tokens. This method improves upon the performance of classical CLIP-like joint encoder models on a wide array of benchmarks. Code will be released upon acceptance.
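To make the contrast with a single pooled similarity score concrete, the sketch below builds a dense map of cosine similarities between every image patch embedding and every text token embedding. It only illustrates the construction of such a map; how DCSMs are scored afterwards is not described in the abstract and is not shown. The function name `dense_cosine_similarity_map` and the random stand-in embeddings are assumptions.

```python
import numpy as np

def dense_cosine_similarity_map(patch_emb, token_emb, eps=1e-8):
    """Cosine similarity between every image patch and every text token.

    patch_emb: (num_patches, dim), token_emb: (num_tokens, dim).
    Returns a (num_patches, num_tokens) map rather than one scalar score.
    """
    p = np.asarray(patch_emb, dtype=float)
    t = np.asarray(token_emb, dtype=float)
    p = p / (np.linalg.norm(p, axis=1, keepdims=True) + eps)
    t = t / (np.linalg.norm(t, axis=1, keepdims=True) + eps)
    return p @ t.T

rng = np.random.default_rng(0)
patches = rng.normal(size=(4, 16))   # stand-in for patch embeddings
tokens = rng.normal(size=(3, 16))    # stand-in for token embeddings
sim_map = dense_cosine_similarity_map(patches, tokens)
```

Because each row corresponds to a patch and each column to a token, the map keeps the patch and token layout that a single pooled image-text similarity discards.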
Poster
ZHIXIANG WEI · Guangting Wang · Xiaoxiao Ma · Ke Mei · Fengyun Rao · Huaian Chen · Yi Jin
[ Exhibit Hall I ]
Abstract
Large-scale but noisy image-text pair data have paved the way for the success of Contrastive Language-Image Pretraining (CLIP). As the foundation vision encoder, CLIP in turn serves as the cornerstone for most large vision-language models (LVLMs). This interdependence naturally raises an interesting question: Can we reciprocally leverage LVLMs to enhance the quality of image-text pair data, thereby opening the possibility of a self-reinforcing cycle for continuous improvement? In this work, we take a significant step toward this vision by introducing an LVLM-driven data refinement pipeline. Our framework leverages LVLMs to process images and their raw alt-text, generating four complementary textual formulas: long positive descriptions, long negative descriptions, short positive tags, and short negative tags. Applying this pipeline to the curated DFN-Large dataset yields VLM-150M, a refined dataset enriched with multi-grained annotations. Based on this dataset, we further propose a training paradigm that extends conventional contrastive learning by incorporating negative descriptions and short tags as additional supervised signals. The resulting model, namely HQ-CLIP, demonstrates remarkable improvements across diverse benchmarks. Within a comparable training data scale, our approach achieves state-of-the-art performance in zero-shot classification, cross-modal retrieval, and fine-grained visual understanding tasks. In retrieval benchmarks, HQ-CLIP even surpasses standard CLIP models trained on …
Poster
Xuechao Zou · Yue Li · Shun Zhang · Kai Li · Shiying Wang · Pin Tao · Junliang Xing · congyan lang
[ Exhibit Hall I ]
Abstract
Remote sensing image segmentation faces persistent challenges in distinguishing morphologically similar categories and adapting to diverse scene variations. While existing methods rely on implicit representation learning paradigms, they often fail to dynamically adjust semantic embeddings according to contextual cues, leading to suboptimal performance in fine-grained scenarios such as cloud thickness differentiation. This work introduces a dynamic dictionary learning framework that explicitly models class ID embeddings through iterative refinement. The core contribution lies in a novel dictionary construction mechanism, where class-aware semantic embeddings are progressively updated via multi-stage alternating cross-attention querying between image features and dictionary embeddings. This process enables adaptive representation learning tailored to input-specific characteristics, effectively resolving ambiguities in intra-class heterogeneity and inter-class homogeneity. To further enhance discriminability, a contrastive constraint is applied to the dictionary space, ensuring compact intra-class distributions while maximizing inter-class separability. Extensive experiments across both coarse- and fine-grained datasets demonstrate consistent improvements over state-of-the-art methods, particularly in two online test benchmarks (LoveDA and UAVid). Code is available at https://anonymous.4open.science/r/D2LS-8267/.
Poster
Zesen Cheng · Kehan Li · Yian Zhao · Hang Zhang · Chang Liu · Jie Chen
[ Exhibit Hall I ]
Abstract
With the rise of applications such as embodied intelligence, developing online video instance segmentation (VIS) with strong real-time performance has become increasingly important. However, through time profiling of the components in advanced online VIS architecture (i.e., transformer-based architecture), we find that the transformer decoder significantly hampers the inference speed. Further analysis of the similarities between the outputs from adjacent frames at each transformer decoder layer reveals significant redundant computations within the transformer decoder. To address this issue, we introduce the Temporal-Aware query Routing (TAR) mechanism, which we embed before each transformer decoder layer. By fusing the optimal queries from the previous frame, the queries output by the preceding decoder layer, and their differential information, TAR predicts a binary classification score and then uses an argmax operation to determine whether the current layer should be skipped. Experimental results demonstrate that integrating TAR into the baselines achieves significant efficiency gains (24.7 → 34.6 FPS for MinVIS, 22.4 → 32.8 FPS for DVIS++) while also improving performance (e.g., on YoutubeVIS 2019, 47.4 → 48.4 AP for MinVIS, 55.5 → 55.7 AP for DVIS++). Furthermore, our analysis of the TAR mechanism shows that the number of skipped layers increases as the differences between adjacent video frames decrease, …
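The routing decision described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the fusion is a plain concatenation, the routing head is a single hypothetical linear layer (`w`, `b`), queries are mean-pooled before classification, and class index 1 is taken to mean "skip". None of these details come from the abstract.

```python
import numpy as np

def tar_should_skip(prev_queries, cur_queries, w, b):
    """Return True if the (hypothetical) routing head votes to skip a layer.

    prev_queries, cur_queries: (num_queries, dim) arrays.
    w: (3 * dim, 2) weights and b: (2,) bias of an assumed linear head.
    """
    diff = cur_queries - prev_queries                      # differential information
    fused = np.concatenate([prev_queries, cur_queries, diff], axis=1)
    logits = fused.mean(axis=0) @ w + b                    # pool queries -> 2 logits
    return int(np.argmax(logits)) == 1                     # class 1 = "skip" (assumed)

def routed_layer(prev_queries, cur_queries, layer_fn, w, b):
    """Apply layer_fn unless the routing head decides to skip it."""
    if tar_should_skip(prev_queries, cur_queries, w, b):
        return cur_queries                                 # layer skipped
    return layer_fn(cur_queries)

dim = 4
prev = np.ones((3, dim))
cur = 2.0 * np.ones((3, dim))
w = np.zeros((3 * dim, 2))                                 # zero weights: bias decides

def double(q):
    """Stand-in for a transformer decoder layer."""
    return 2.0 * q

skipped = routed_layer(prev, cur, double, w, np.array([0.0, 1.0]))
applied = routed_layer(prev, cur, double, w, np.array([1.0, 0.0]))
```

When the head votes to skip, the queries pass through unchanged, which is where the inference-time saving would come from.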
Poster
Cheonjun Park · Hyunjae Oh · Mincheol Park · Hyunchan Moon · Minsik Kim · Suhyun Kim · Myung Kuk Yoon · Won Woo Ro
[ Exhibit Hall I ]
Abstract
Recent GPUs leverage Winograd convolution and structured pruning to significantly accelerate inference. First, Winograd convolution is theoretically 2.25× faster than standard convolution. Second, structured pruning reduces inference time without additional overhead as the pruning ratio increases. However, applying conventional structured pruning alongside Winograd convolution is inefficient. Existing structured pruning methods, which do not account for how GPUs process Winograd convolution, require large pruning unit sizes, leading to significant information loss. In this paper, we propose Winograd Structured Pruning (WINS), the first approach to employ optimized structured pruning for Winograd convolution. WINS is designed based on an in-depth analysis of Winograd convolution's computational characteristics on GPUs. Additionally, we introduce two variants, WINS-B and WINS-AB, which further enhance performance. Experimental results show that WINS-AB achieves up to 2.8× practical speedup in Winograd convolution inference on GPUs while preserving the accuracy of ResNet-18 on ImageNet.
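The theoretical 2.25× figure can be traced to multiplication counts. The sketch below uses the standard one-dimensional Winograd F(2, 3) transform, which produces two outputs of a 3-tap filter with 4 multiplications instead of 6; nesting it in two dimensions gives 16 multiplications instead of 36 for a 2×2 output tile with a 3×3 filter. The abstract does not state which tile size the authors use, so F(2×2, 3×3) is assumed here, and the function names are illustrative.

```python
def direct_f23(d, g):
    """Two outputs of a 3-tap filter over 4 inputs: 6 multiplications."""
    return [d[0] * g[0] + d[1] * g[1] + d[2] * g[2],
            d[1] * g[0] + d[2] * g[1] + d[3] * g[2]]

def winograd_f23(d, g):
    """Same two outputs with 4 multiplications (m1..m4)."""
    # Filter-side terms; in inference these can be precomputed once.
    u0 = g[0]
    u1 = (g[0] + g[1] + g[2]) / 2
    u2 = (g[0] - g[1] + g[2]) / 2
    u3 = g[2]
    # The only data-dependent multiplications.
    m1 = (d[0] - d[2]) * u0
    m2 = (d[1] + d[2]) * u1
    m3 = (d[2] - d[1]) * u2
    m4 = (d[1] - d[3]) * u3
    return [m1 + m2 + m3, m2 - m3 - m4]

d = [1.0, 2.0, 3.0, 4.0]
g = [1.0, 0.0, -1.0]
y_direct = direct_f23(d, g)
y_winograd = winograd_f23(d, g)

# 2D case: 2x2 outputs with a 3x3 filter.
direct_mults_2d = (2 * 2) * (3 * 3)      # 36
winograd_mults_2d = (2 + 3 - 1) ** 2     # 16
theoretical_speedup = direct_mults_2d / winograd_mults_2d
```

The ratio counts multiplications only; the extra additions in the input and output transforms are one reason measured speedups differ from the theoretical value.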
Poster
Hai Huang · Yan Xia · Sashuai Zhou · Hanting Wang · Shulei Wang · Zhou Zhao
[ Exhibit Hall I ]
Abstract
Domain Generalization (DG) aims to enhance model robustness in unseen or distributionally shifted target domains through training exclusively on source domains. Although existing DG techniques, such as data manipulation, learning strategies, and representation learning, have demonstrated significant progress, they predominantly address single-modal data. With the emergence of numerous multi-modal datasets and increasing demand for multi-modal tasks, a key challenge in Multi-modal Domain Generalization (MMDG) has emerged: enabling models trained on multi-modal sources to generalize to unseen target distributions within the same modality set. Due to the inherent differences between modalities, directly transferring methods from single-modal DG to MMDG typically yields disappointing results. These methods often exhibit randomness during generalization due to the invisibility of target domains and fail to consider inter-modal consistency. Applying these methods independently to each modality in the MMDG setting before combining them can lead to divergent generalization directions across different modalities, resulting in degraded generalization capabilities. To address these challenges, we propose a novel approach that leverages Unified Representations to map different paired modalities together, effectively adapting DG methods to MMDG by enabling synchronized multi-modal improvements within the unified space. Additionally, we introduce a supervised disentanglement framework that separates modal-general and modal-specific information, further enhancing the alignment of unified …
Poster
KUO WANG · Quanlong Zheng · Junlin Xie · Yanhao Zhang · Jinguo Luo · Haonan Lu · Liang Lin · Fan Zhou · Guanbin Li
[ Exhibit Hall I ]
Abstract
Video Multimodal Large Language Models (Video-MLLMs) have achieved remarkable advancements in video understanding tasks. However, constrained by the context length limitation in the underlying LLMs, existing Video-MLLMs typically exhibit suboptimal performance on long video scenarios. To understand extended input frames, common solutions span token compression and streaming inference techniques, which sacrifice feature granularity or inference efficiency. In contrast, to efficiently achieve comprehensive understanding of longer frame inputs, we draw ideas from MoE and propose a training-free approach, Free-MoRef, which instantly multiplexes the context perception capabilities of Video-MLLMs within one inference pass. Specifically, Free-MoRef reconstructs the vision tokens into several short sequences as multi-references. Subsequently, we introduce MoRef-attention, which gathers clues from the multi-reference chunks in parallel to summarize unified query activations. After the shadow layers in LLMs, a reference fusion step is derived to compose a final mixed reasoning sequence with key tokens from parallel chunks, which compensates for the cross-reference vision interactions that are neglected in MoRef-attention. By splitting and fusing the long vision token sequences, Free-MoRef achieves improved performance under much lower computing costs in reasoning multiplexed context length, demonstrating strong efficiency and effectiveness. Experiments on VideoMME, MLVU, and LongVideoBench show that Free-MoRef achieves full perception of 2$\times$ to 8$\times$ longer input …
Poster
Yuan Yao · Qiushi Yang · Miaomiao Cui · Liefeng Bo
[ Exhibit Hall I ]
Abstract
The recent Segment Anything Models (SAMs) have emerged as foundational visual models for general interactive segmentation. Despite demonstrating robust generalization abilities, they still suffer from performance degradations in scenarios that demand accurate masks. Existing methods for high-precision interactive segmentation face a trade-off between perceiving intricate local details and maintaining stable prompting capability, which hinders the applicability and effectiveness of foundational segmentation models. In this paper, we present a SAM2Refiner framework built upon the SAM2 backbone. This architecture allows SAM2 to generate fine-grained segmentation masks for both images and videos while preserving its inherent strengths. Specifically, we design a localization augment module, which incorporates local contextual cues to enhance global features via a cross-attention mechanism, thereby exploiting potential detailed patterns while maintaining semantic information. Moreover, to strengthen the prompting ability toward the enhanced object embeddings, we introduce a prompt retargeting module that renews the embedding with spatially aligned prompt features. In addition, to obtain accurate high resolution segmentation masks, a mask refinement module is devised by employing a multi-scale cascaded structure to fuse mask features with hierarchical representations from the encoder. Extensive experiments demonstrate the effectiveness of our approach, revealing that the proposed method can produce highly precise masks for both …
Poster
Wei Liao · Chunyan Xu · Chenxu Wang · Zhen Cui
[ Exhibit Hall I ]
Abstract
Sparse annotation in remote sensing object detection poses significant challenges due to dense object distributions and category imbalances. Although existing Dense Pseudo-Label methods have demonstrated substantial potential in pseudo-labeling tasks, they remain constrained by selection ambiguities and inconsistencies in confidence estimation. In this paper, we introduce an LLM-assisted semantic guidance framework tailored for sparsely annotated remote sensing object detection, exploiting the advanced semantic reasoning capabilities of large language models (LLMs) to distill high-confidence pseudo-labels. By integrating LLM-generated semantic priors, we propose a Class-Aware Dense Pseudo-Label Assignment mechanism that adaptively assigns pseudo-labels for both unlabeled and sparsely labeled data, ensuring robust supervision across varying data distributions. Additionally, we develop an Adaptive Hard-Negative Reweighting Module to stabilize the supervised learning branch by mitigating the influence of confounding background information. Extensive experiments on DOTA and HRSC2016 demonstrate that the proposed method outperforms existing single-stage detector-based frameworks, significantly improving detection performance under sparse annotations.
Poster
Qin Zhou · Guoyan Liang · Xindi Li · Jingyuan CHEN · Zhe Wang · Chang Yao · Sai Wu
[ Exhibit Hall I ]
Abstract
Automated radiology report generation is essential for improving diagnostic efficiency and reducing the workload of medical professionals. However, existing methods face significant challenges, such as disease class imbalance and insufficient cross-modal fusion. To address these issues, we propose the learnable Retrieval Enhanced Visual-Text Alignment and Fusion (REVTAF) framework, which effectively tackles both class imbalance and visual-text fusion in report generation. REVTAF incorporates two core components: (1) a Learnable Retrieval Enhancer (LRE) that utilizes semantic hierarchies from hyperbolic space and intra-batch context through a ranking-based metric. LRE adaptively retrieves the most relevant reference reports, enhancing image representations, particularly for underrepresented (tail) class inputs; and (2) a fine-grained visual-text alignment and fusion strategy that ensures consistency across multi-source cross-attention maps for precise alignment. This component further employs an optimal transport-based cross-attention mechanism to dynamically integrate task-relevant textual knowledge for improved report generation. By combining adaptive retrieval with multi-source alignment and fusion, REVTAF achieves fine-grained visual-text integration under weak image-report level supervision while effectively mitigating data imbalance issues. Comprehensive experiments demonstrate that REVTAF outperforms state-of-the-art methods, achieving an average improvement of 7.4% on the MIMIC-CXR dataset and 2.9% on the IU X-Ray dataset. Comparisons with mainstream multimodal LLMs (e.g., GPT-series models) further highlight …
Poster
Zhizhong Huang · Xiaoming Liu
[ Exhibit Hall I ]
Abstract
Current object re-identification (ReID) methods train domain-specific models (e.g., for persons or vehicles), which lack generalization and demand costly labeled data for new categories. While self-supervised learning reduces annotation needs by learning instance-wise invariance, it struggles to capture identity-sensitive features critical for ReID. This paper proposes Visual In-Context Prompting (VICP), a novel framework where models trained on seen categories can directly generalize to unseen novel categories using only in-context examples as prompts, without requiring parameter adaptation. VICP synergizes LLMs and vision foundation models (VFM): LLMs infer semantic identity rules from few-shot positive/negative pairs through task-specific prompting, which then guides a VFM (e.g., DINO) to extract ID-discriminative features via dynamic visual prompts. By aligning LLM-derived semantic concepts with the VFM's pre-trained prior, VICP enables generalization to novel categories, eliminating the need for dataset-specific retraining. To support evaluation, we introduce ShopID10K, a dataset of 10K object instances from e-commerce platforms, featuring multi-view images and cross-domain testing. Experiments on ShopID10K and diverse ReID benchmarks demonstrate that VICP outperforms baselines by a clear margin on unseen categories. Code will be released upon publication.
Poster
Pooyan Rahmanzadehgervi · Hung Nguyen · Rosanne Liu · Long Mai · Anh Nguyen
[ Exhibit Hall I ]
Abstract
Multi-head self-attention (MHSA) is a key component of Transformers, a widely popular architecture in both language and vision. Multiple heads intuitively enable different parallel processes over the same input. Yet, they also obscure the attribution of each input patch to the output of a model. We propose a novel 1-head Transformer Attention Bottleneck (TAB) layer, inserted after the traditional MHSA architecture, to serve as an attention bottleneck for interpretability and intervention. Unlike standard self-attention, TAB constrains the total attention over all patches to lie in $[0, 1]$. That is, when the total attention is 0, no visual information is propagated further into the network, and the vision-language model (VLM) would default to a generic, image-independent response. To demonstrate the advantages of TAB, we train VLMs with TAB to perform image-difference captioning. Over three datasets, our models perform similarly to baseline VLMs in captioning but the bottleneck is superior in localizing changes and in identifying when no changes occur. TAB is the first architecture to enable users to debug by editing attention, which often produces expected outputs by VLMs.
Poster
Ruiyun Yu · Bingyang Guo · Haoyuan Li
[ Exhibit Hall I ]
Abstract
Anomaly detection plays a crucial role in the industrial sector, especially in ensuring the quality of integrated circuits (IC), which are critical for product reliability and performance. With increasing demands for higher quality standards, anomaly detection during the IC manufacturing process has become a significant research focus. However, the progress of IC anomaly detection is hampered by the scarcity of defective samples and the shortage of well-defined annotations. To address this challenge, this paper focuses on IC inspection, specifically on ceramic package substrates (CPS). We build an automated optical inspection (AOI) system and use it to collect large-scale 2D CPS images, forming a novel anomaly detection dataset (CPS2D-AD), which offers copious samples with precise annotations, including category, mask, and bounding box. To the best of our knowledge, CPS2D-AD is the largest dataset in the field of IC. Meanwhile, we conduct an extensive benchmark of CPS2D-AD, intending to supplement existing research by providing a baseline for the detection and localization of anomalies in high-resolution data of ceramic package substrates. In addition, we develop a novel large vision model, Segment Any Integrated Circuits (SAIC), using an embedding-based distillation mechanism on the CPS2D-AD dataset. Our CPS2D-AD …
Poster
Kaining Ying · Henghui Ding · Guangquan Jie · Yu-Gang Jiang
[ Exhibit Hall I ]
Abstract
Referring audio-visual segmentation (RAVS) has recently seen significant advancements, yet challenges remain in integrating multimodal information as well as deeply understanding and reasoning about audiovisual content. To extend the boundaries of RAVS and facilitate future research in this field, we propose **Omni**modal Referring **A**udio-**V**isual **S**egmentation (**OmniAVS**), a new dataset containing 2,098 videos and 59,458 multimodal referring expressions. OmniAVS stands out with three key innovations: (1) 8 types of multimodal expressions that flexibly combine text, speech, sound, and visual cues; (2) an emphasis on understanding audio content beyond just detecting their presence; and (3) the inclusion of complex reasoning and world knowledge in expressions. Furthermore, we introduce **O**mnimodal **I**nstructed **S**egmentation **A**ssistant (**OISA**), to address the challenges of multimodal reasoning and fine-grained understanding of audiovisual content in OmniAVS. OISA uses MLLM to comprehend complex cues and perform reasoning-based segmentation. Extensive experiments on 10 datasets show that OISA outperforms existing methods on OmniAVS and achieves competitive results on other related tasks.
Poster
Dibyadip Chatterjee · Edoardo Remelli · Yale Song · Bugra Tekin · Abhay Mittal · Bharat Bhatnagar · Necati Cihan Camgoz · Shreyas Hampali · Eric Sauser · Shugao Ma · Angela Yao · Fadime Sener
[ Exhibit Hall I ]
Abstract
We introduce ProVideLLM, an end-to-end framework for real-time streaming procedural assistance. ProVideLLM integrates a multimodal cache configured to store two types of tokens -- verbalized text tokens, which provide compressed textual summaries of long-term observations, and visual tokens, encoded with DETR-QFormer to capture fine-grained details from short-term observations. This design reduces token count by $22\times$ over existing methods in representing one hour of long-term observations while effectively encoding fine-grained representations. By interleaving these tokens in the multimodal cache, ProVideLLM ensures sub-linear scaling of memory and compute with video length, enabling per-frame streaming inference at 10 FPS and 25 FPS for streaming dialogue, with a minimal 2GB GPU memory footprint. ProVideLLM also sets new state-of-the-art results on six procedural tasks across four datasets.
Poster
Yuan Bian · Min Liu · Yunqi Yi · Xueping Wang · Shuai Jiang · Yaonan Wang
[ Exhibit Hall I ]
Abstract
Person re-identification (re-id) models are vital in security surveillance systems, and transferable adversarial attacks are needed to explore their vulnerabilities. Recently, vision-language model (VLM) based attacks have shown superior transferability by attacking generalized image and textual features of VLMs, but they lack comprehensive feature disruption due to the overemphasis on discriminative semantics in integral representation. In this paper, we introduce the Attribute-aware Prompt Attack (AP-Attack), a novel method that leverages the VLM's image-text alignment capability to explicitly disrupt fine-grained semantic features of pedestrian images by destroying attribute-specific textual embeddings. To obtain personalized textual descriptions for individual attributes, textual inversion networks are designed to map pedestrian images to pseudo tokens that represent semantic embeddings, trained in a contrastive learning manner with images and a predefined prompt template that explicitly describes the pedestrian attributes. Inverted benign and adversarial fine-grained textual semantics enable the attacker to conduct thorough disruptions, enhancing the transferability of adversarial examples. Extensive experiments show that AP-Attack achieves state-of-the-art transferability, significantly outperforming previous methods by 22.9% on mean Drop Rate in cross-model and cross-dataset attack scenarios.
Poster
Mingyang Liu · Xinyang Chen · Yang Shu · Xiucheng Li · Weili Guan · Liqiang Nie
[ Exhibit Hall I ]
Abstract
Chest X-ray classification is extensively utilized within the field of medical image analysis. However, manually labeling chest X-ray images is time-consuming and costly. Domain adaptation, which is designed to transfer knowledge from related domains, could offer a promising solution. Existing methods employ feature adaptation or self-training for knowledge transfer. Nonetheless, negative transfer is observed due to the entanglement of class imbalance and distribution shift in chest X-ray classification. In this paper, we propose the Debiased Curriculum Adaptation framework to mitigate negative transfer in two aspects: (1) Curriculum Adaptation, which is designed to transfer knowledge in an easy-to-hard way, is proposed to alleviate confirmation bias in self-training. (2) Spectral Debiasing is introduced to harmonize the feature space between the source and target domains, as well as balance the feature space of positive and negative samples. Extensive experiments on 72 transfer tasks (including 6 diseases and 4 domains) demonstrate our superiority over state-of-the-art methods. In comparison to advanced methods, our approach effectively mitigates negative transfer, ensuring safe knowledge transfer.
Poster
Linwei Chen · Lin Gu · Ying Fu
[ Exhibit Hall I ]
Abstract
Vision Transformers (ViTs) have significantly advanced computer vision, demonstrating strong performance across various tasks. However, the attention mechanism in ViTs makes each layer function as a low-pass filter, and the stacked-layer architecture in existing transformers suffers from frequency vanishing. This leads to the loss of critical details and textures. We propose a novel, circuit-theory-inspired strategy called Frequency-Dynamic Attention Modulation (FDAM), which can be easily plugged into ViTs. FDAM directly modulates the overall frequency response of ViTs and consists of two techniques: Attention Inversion (AttInv) and Frequency Dynamic Scaling (FreqScale). Since circuit theory uses low-pass filters as fundamental elements, we introduce AttInv, a method that generates complementary high-pass filtering by inverting the low-pass filter in the attention matrix, and dynamically combining the two. We further design FreqScale to weight different frequency components for fine-grained adjustments to the target response function. Through feature similarity analysis and effective rank evaluation, we demonstrate that our approach avoids representation collapse, leading to consistent performance improvements across various models, including SegFormer, DeiT, and MaskDINO. These improvements are evident in tasks such as semantic segmentation, object detection, and instance segmentation. Additionally, we apply our method to remote sensing detection, achieving state-of-the-art results in single-scale settings. The code will be publicly available …
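One way to picture the attention-inversion idea is sketched below. A softmax attention matrix is row-stochastic, so it averages tokens and acts as a low-pass filter; subtracting it from the identity gives a complementary filter that removes the constant (DC) component. The sketch assumes the inversion takes the form `I - A` and uses fixed scalar gains (`low_gain`, `high_gain`) where the abstract describes a dynamic combination; both are illustrative assumptions rather than the authors' formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_with_inversion(q, k, v, low_gain=1.0, high_gain=0.5):
    """Combine a low-pass attention output with its high-pass complement."""
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # rows sum to 1: averaging filter
    inv = np.eye(attn.shape[0]) - attn              # assumed inversion: I - A
    out = low_gain * (attn @ v) + high_gain * (inv @ v)
    return out, attn, inv

rng = np.random.default_rng(0)
q = rng.normal(size=(5, 8))
k = rng.normal(size=(5, 8))
constant_v = np.ones((5, 8))                        # a signal with only a DC component

out, attn, inv = attention_with_inversion(q, k, constant_v)
high_pass_of_constant = inv @ constant_v
```

Because every row of `attn` sums to one, the inverted filter maps a constant signal to zero, which is the defining behaviour of a high-pass filter.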
Poster
Jingyang Li · Kuangyu Ding · Kim-chuan Toh · Pan Zhou
[ Exhibit Hall I ]
Abstract
Preconditioned stochastic optimization algorithms, exemplified by Shampoo, outperform first-order optimizers by offering theoretical convergence benefits and practical gains in large-scale neural network training. However, they incur substantial memory overhead due to the storage demands of non-diagonal preconditioning matrices. To address this, we introduce 4-bit quantization for Shampoo's preconditioners. We introduce two key methods: First, we apply Cholesky decomposition followed by quantization of the Cholesky factors, reducing memory usage by leveraging their lower triangular structure while better preserving spectral properties to minimize information loss. To our knowledge, this is the first quantization approach applied to Cholesky factors of preconditioners. Second, we incorporate error feedback in the quantization process, efficiently storing Cholesky factor and error state in the lower and upper triangular parts of the same matrix. Through extensive experiments, we demonstrate that combining Cholesky quantization with error feedback enhances memory efficiency and algorithm performance in large-scale deep-learning tasks. Theoretically, we also provide convergence proofs for quantized Shampoo under both smooth and non-smooth stochastic optimization settings. The source code is included in the supplementary and will be publicly released.
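The two ideas above can be sketched together: factor the preconditioner, quantize the lower-triangular Cholesky factor, and carry the quantization error into the next step. This is a minimal sketch, not the authors' method: the quantizer is a plain uniform symmetric 16-level scheme, the error feedback is a simple carry-over, codes are kept in `int8` rather than packed two per byte, and the joint storage of factor and error state in one matrix is not shown. All function names are hypothetical.

```python
import numpy as np

def quantize_4bit(x):
    """Uniform symmetric quantization to integer codes in [-8, 7] (assumed scheme)."""
    scale = np.abs(x).max() / 7.0 if np.any(x) else 1.0
    codes = np.clip(np.round(x / scale), -8, 7).astype(np.int8)
    return codes, scale

def dequantize(codes, scale):
    return codes.astype(np.float64) * scale

def quantize_cholesky_with_feedback(precond, error):
    """Quantize the Cholesky factor of precond, adding the previous error first."""
    factor = np.linalg.cholesky(precond)           # lower-triangular factor
    compensated = factor + error                   # error feedback from the last step
    codes, scale = quantize_4bit(compensated)
    new_error = np.tril(compensated - dequantize(codes, scale))
    return codes, scale, new_error

rng = np.random.default_rng(0)
a = rng.normal(size=(4, 4))
precond = a @ a.T + 4.0 * np.eye(4)                # symmetric positive definite
error = np.zeros((4, 4))

codes, scale, error = quantize_cholesky_with_feedback(precond, error)
factor_hat = dequantize(codes, scale)
precond_hat = factor_hat @ factor_hat.T            # approximate reconstruction
```

Only the lower triangle of `codes` is non-zero, which is what leaves the other triangle of the same matrix free for additional state such as the error term.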
Poster
Zechao Hu · Zhengwei Yang · Hao Li · Zheng Wang · Yixiong Zou
[ Exhibit Hall I ]
Abstract
Sketch-based person re-identification (re-ID) enables pedestrian retrieval using sketches. While recent methods have improved modality alignment between sketches and RGB images, the challenge of subjective style variation, where sketches exhibit diverse and unpredictable appearances, remains largely unresolved. A natural solution is to train on a diverse range of pedestrian sketches, but the high cost of large-scale pedestrian sketch collection makes this impractical. In contrast, sketches of general categories (e.g., animals, objects) exhibit diverse style variations and are accessible at a low cost, making them an intuitive and scalable alternative for enhancing style generalization in sketch re-ID. To this end, we propose Adaptive Incremental Prompt-tuning (AIP), the first approach that explores cross-category subjective style generalization for sketch re-ID. Specifically, AIP incorporates a multi-stage prompt-tuning strategy that learns a broad but shareable spectrum of sketch styles from non-pedestrian data. In addition, an input-sensitive prompt generator enables the model to adapt dynamically to unseen sketch styles. Extensive experimental results demonstrate that the performance gain is not merely attributed to the inclusion of additional data but rather to the effectiveness of AIP in leveraging non-pedestrian data for subjective style generalization. Our method outperforms existing works by a significant margin, establishing new state-of-the-art results.
Poster
Tianyu Fu · Tengxuan Liu · Qinghao Han · Guohao Dai · Shengen Yan · Huazhong Yang · Xuefei Ning · Yu Wang
[ Exhibit Hall I ]
Abstract
The increasing demand to process long and high-resolution videos significantly burdens Large Vision-Language Models (LVLMs) due to the enormous number of visual tokens. Existing token reduction methods primarily prune tokens based on importance metrics, such as accumulative attention scores. However, even important tokens may exhibit high redundancy caused by similarity among adjacent video frames and repetitive visual elements. To address this limitation, we propose FrameFusion, a novel token reduction approach integrating similarity-based merging with importance-based pruning. We conduct a thorough study on token similarity characteristics, revealing three key insights: (1) spatially corresponding vision tokens between adjacent frames have higher cosine similarities compared to other token pairs; (2) high token similarities prominently decrease in deeper model layers; and (3) token similarity rankings are highly consistent across different layers. Guided by these observations, FrameFusion computes token similarities exclusively between corresponding vision tokens from adjacent frames, applies token merging at initial successive layers followed by pruning in deeper layers, and adopts a cascaded merging strategy to further enhance efficiency. We evaluate FrameFusion comprehensively across six diverse LVLMs, ranging from 2B to 72B parameters, using five video benchmarks encompassing video retrieval, question-answering, and spatial-temporal understanding tasks. Experiments show that FrameFusion reduces vision tokens by 70%, achieving 1.6 – 3.6$\times$ end-to-end …
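The similarity-based merging step can be sketched as below. The sketch compares each token only with the token at the same spatial position in the previous frame and averages runs of similar tokens, which loosely mirrors the cascaded merging described above. The threshold value, the averaging rule, and the function name are assumptions for illustration; the pruning stage and the layer schedule are not shown.

```python
import numpy as np

def merge_adjacent_frame_tokens(tokens, threshold=0.9):
    """Merge tokens that closely match the same position in the previous frame.

    tokens: (num_frames, tokens_per_frame, dim) array of vision tokens.
    Returns a (num_kept, dim) array of merged tokens.
    """
    num_frames, per_frame, _ = tokens.shape
    unit = tokens / (np.linalg.norm(tokens, axis=-1, keepdims=True) + 1e-8)
    # Cosine similarity between corresponding positions of adjacent frames.
    sims = (unit[1:] * unit[:-1]).sum(axis=-1)       # (num_frames - 1, per_frame)

    merged = []
    for pos in range(per_frame):
        group = [tokens[0, pos]]
        for t in range(1, num_frames):
            if sims[t - 1, pos] > threshold:
                group.append(tokens[t, pos])         # extend the current run
            else:
                merged.append(np.mean(group, axis=0))
                group = [tokens[t, pos]]             # start a new run
        merged.append(np.mean(group, axis=0))
    return np.stack(merged)

rng = np.random.default_rng(0)
frame = rng.normal(size=(6, 16))                     # hypothetical 6 tokens, 16-dim

static_video = np.tile(frame, (4, 1, 1))             # four identical frames
flipped_video = np.stack([frame, -frame])            # adjacent frames fully dissimilar

static_out = merge_adjacent_frame_tokens(static_video)
flipped_out = merge_adjacent_frame_tokens(flipped_video)
```

A fully static clip collapses to a single frame's worth of tokens, while a clip whose frames disagree keeps every token, so the amount of reduction tracks the temporal redundancy of the input.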
Poster
Yong Liu · Song-Li Wu · Sule Bai · Jiahao Wang · Yitong Wang · Yansong Tang
[ Exhibit Hall I ]
Abstract
Open-vocabulary segmentation aims to achieve segmentation of arbitrary categories given unlimited text inputs as guidance. To achieve this, recent works have focused on developing various technical routes to exploit the potential of large-scale pre-trained vision-language models and have made significant progress on existing benchmarks. However, we find that existing test sets are limited in measuring the models' comprehension of "open-vocabulary" concepts, as their semantic space closely resembles the training space, even with many overlapping categories. To this end, we present a new benchmark named OpenBench that differs significantly from the training semantics. It is designed to better assess the model's ability to understand and segment a wide range of real-world concepts. When testing existing methods on OpenBench, we find that their performance diverges from the conclusions drawn on existing test sets. In addition, we propose a method named OVSNet to improve the segmentation performance for diverse and open scenarios. Through elaborate fusion of heterogeneous features and cost-free expansion of the training space, OVSNet achieves state-of-the-art results on both existing datasets and our proposed OpenBench. The corresponding analysis demonstrates the soundness and effectiveness of our proposed benchmark and method.
Poster
Zelong Sun · Dong Jing · Zhiwu Lu
[ Exhibit Hall I ]
Abstract
Zero-Shot Composed Image Retrieval (ZS-CIR) aims to retrieve target images by integrating information from a composed query (reference image and modification text) without training samples. Existing methods primarily combine caption models and large language models (LLMs) to generate target captions based on composed queries but face various issues such as incompatibility, visual information loss, and insufficient reasoning. In this work, we propose CoTMR, a training-free framework crafted for ZS-CIR with novel Chain-of-thought (CoT) and Multi-scale Reasoning. Instead of relying on caption models for modality transformation, CoTMR employs the Large Vision-Language Model (LVLM) to achieve unified understanding and reasoning for composed queries. To enhance the reasoning reliability, we devise CIRCoT, which guides the LVLM through a step-by-step inference process using predefined subtasks. Considering that existing approaches focus solely on global-level reasoning, our CoTMR incorporates multi-scale reasoning to achieve more comprehensive inference via fine-grained predictions about the presence or absence of key elements at the object scale. Further, we design a Multi-Grained Scoring (MGS) mechanism, which integrates CLIP similarity scores of the above reasoning outputs with candidate images to realize precise retrieval. Extensive experiments demonstrate that our CoTMR not only drastically outperforms previous methods across four prominent benchmarks but also offers appealing …
Poster
Zhuoyan Luo · Yinghao Wu · Tianheng Cheng · Yong Liu · Yicheng Xiao · Hongfa Wang · Xiao-Ping Zhang · Yujiu Yang
[ Exhibit Hall I ]
Abstract
The newly proposed Generalized Referring Expression Segmentation (GRES) amplifies the formulation of classic RES by involving complex multiple/non-target scenarios. Recent approaches address GRES by directly extending the well-adopted RES frameworks with object-existence identification. However, these approaches tend to encode multi-granularity object information into a single representation, which makes it difficult to precisely represent comprehensive objects of different granularity. Moreover, the simple binary object-existence identification across all referent scenarios fails to specify their inherent differences, incurring ambiguity in object understanding. To tackle the above issues, we propose a **Co**unting-Aware **H**ierarchical **D**ecoding framework (CoHD) for GRES. By decoupling the intricate referring semantics into different granularity with a visual-linguistic hierarchy, and dynamically aggregating it with intra- and inter-selection, CoHD boosts multi-granularity comprehension with the reciprocal benefit of the hierarchical nature. Furthermore, we incorporate the counting ability by embodying multiple/single/non-target scenarios into count- and category-level supervision, facilitating comprehensive object perception. Experimental results on gRefCOCO, Ref-ZOM, R-RefCOCO, and RefCOCO benchmarks demonstrate the effectiveness and rationality of CoHD, which outperforms state-of-the-art GRES methods by a remarkable margin. Code will be available.
Poster
YINGXIAN Chen · Jiahui Liu · Ruidi Fan · Yanwei Li · Chirui CHANG · Shizhen Zhao · Wilton.W.T. Fok · Xiaojuan Qi · Yik WU
[ Exhibit Hall I ]
Abstract
Understanding abnormal events in videos is a vital and challenging task that has garnered significant attention in a wide range of applications. Although current video understanding Multi-modal Large Language Models (MLLMs) are capable of analyzing general videos, they often struggle to handle anomalies due to the spatial and temporal sparsity of abnormal events, where the redundant information always leads to suboptimal outcomes. To address these challenges, exploiting the representation and generalization capabilities of Vision Language Models (VLMs) and Large Language Models (LLMs), we propose VA-GPT, a novel MLLM designed for summarizing and localizing abnormal events in various videos. Our approach efficiently aligns effective tokens between visual encoders and LLMs through two key proposed modules: Spatial Effective Token Selection (SETS) and Temporal Effective Token Generation (TETG). These modules enable our model to effectively capture and analyze both spatial and temporal information associated with abnormal events, resulting in more accurate responses and interactions. Furthermore, we construct an instruction-following dataset specifically for fine-tuning video-anomaly-aware MLLMs, and introduce a cross-domain evaluation benchmark based on the XD-Violence dataset. Our proposed method outperforms existing state-of-the-art methods on various benchmarks. The code and data will be released.
Poster
Bin Yang · Yulin Zhang · Hong-Yu Zhou · Sibei Yang
[ Exhibit Hall I ]
Abstract
Detection transformers have been applied to human-object interaction (HOI) detection, enhancing the localization and recognition of human-action-object triplets in images. Despite remarkable progress, this study identifies a critical issue, the "Toxic Siblings" bias, which hinders the interaction decoder's learning, as numerous similar yet distinct HOI triplets interfere with and even compete against each other on both the input side and the output side of the interaction decoder. This bias arises from high confusion among sibling triplets/categories, where increased similarity paradoxically reduces precision, as one’s gain comes at the expense of its toxic sibling’s decline. To address this, we propose two novel debiasing learning objectives, "contrastive-then-calibration" and "merge-then-split", targeting the input and output perspectives, respectively. The former samples sibling-like incorrect HOI triplets and reconstructs them into correct ones, guided by strong positional priors. The latter first learns shared features among sibling categories to distinguish them from other groups, then explicitly refines intra-group differentiation to preserve uniqueness. Experiments show that we significantly outperform both the baseline (+9.18\% mAP on HICO-Det) and the state-of-the-art (+3.59\% mAP) across various settings. The source code will be made public.
Poster
Yuci Liang · Xinheng Lyu · Meidan Ding · Wenting Chen · Xiaohan Xing · Jipeng Zhang · Sen Yang · Xiangjian He · Song Wu · Xiyue Wang · Linlin Shen
[ Exhibit Hall I ]
Abstract
Recent advances in computational pathology have introduced whole slide image (WSI)-level multimodal large language models (MLLMs) for automated pathological analysis. However, current WSI-level MLLMs face two critical challenges: limited explainability in their decision-making process and insufficient attention to morphological features crucial for accurate diagnosis. To address these challenges, we first introduce \textbf{WSI-Bench}, a large-scale morphology-aware benchmark containing 180k VQA pairs from 9,850 WSIs across 30 cancer types, specifically designed to evaluate MLLMs' understanding of morphological characteristics crucial for accurate diagnosis. To the best of our knowledge, WSI-Bench presents the first benchmark to systematically evaluate morphological understanding capabilities in WSI analysis. To enhance the model explainability, we present \textbf{WSI-LLaVA}, an MLLM framework for gigapixel WSI understanding with a three-stage training strategy, which can provide detailed morphological findings to explain its final answer. For more precise model assessment in pathological contexts, we develop two specialized WSI metrics: \textbf{WSI-Precision} and \textbf{WSI-Relevance}. Extensive evaluation on WSI-Bench reveals both the capabilities and limitations of current WSI MLLMs in morphological analysis and various pathology tasks, while demonstrating WSI-LLaVA's superior performance across all capabilities.
Poster
Rongpei Hong · Jian Lang · Ting Zhong · Fan Zhou
[ Exhibit Hall I ]
Abstract
The rapid proliferation of online video-sharing platforms has accelerated the spread of malicious videos, creating an urgent need for robust detection methods. However, the performance and generalizability of existing detection approaches are severely limited by the scarcity of annotated video data, as manually curating large-scale malicious detection datasets is both labor-intensive and impractical. To address this challenge, we propose CRAVE, a novel CRoss-domAin retrieVal augmEntation framework that transfers knowledge from the resource-rich image-text domain to enhance malicious video detection. Specifically, CRAVE introduces a Pseudo-Pair Retriever to identify semantically relevant image-text data for high-quality cross-domain augmentation. Additionally, a Contrastive Cross-Domain Augmenter is designed to disentangle domain-shared and -unique representations, effectively bridging the domain gaps during knowledge transfer. These shared image-text representations are then leveraged to refine video representations, yielding more discriminative features for accurate malicious content detection. Experiments on four video datasets demonstrate that CRAVE largely outperforms competitive baselines in both performance and generalization, providing an innovative and strong solution to the issue of video data scarcity.
Poster
Zhongwei Qiu · Hanqing Chao · Tiancheng Lin · Wanxing Chang · Zijiang Yang · Wenpei Jiao · Yixuan Shen · Yunshuo Zhang · Yelin Yang · Wenbin Liu · Hui Jiang · Yun Bian · Ke Yan · Dakai Jin · Le Lu
[ Exhibit Hall I ]
Abstract
Histopathology plays a critical role in medical diagnostics, with whole slide images (WSIs) offering valuable insights that directly influence clinical decision-making. However, the large size and complexity of WSIs may pose significant challenges for deep learning models, in both computational efficiency and effective representation learning. In this work, we introduce Pixel-Mamba, a novel deep learning architecture designed to efficiently handle gigapixel WSIs. Pixel-Mamba leverages the Mamba module, a state-space model (SSM) with linear memory complexity, and incorporates local inductive biases through progressively expanding tokens, akin to convolutional neural networks. This enables Pixel-Mamba to hierarchically combine both local and global information while efficiently addressing computational challenges. Remarkably, Pixel-Mamba achieves or even surpasses the quantitative performance of state-of-the-art (SOTA) foundation models that were pretrained on millions of WSIs or WSI-text pairs, in a range of tumor staging and survival analysis tasks, even without requiring any pathology-specific pretraining. Extensive experiments demonstrate the efficacy of Pixel-Mamba as a powerful and efficient framework for end-to-end WSI analysis.
Poster
Maximilian Augustin · Yannic Neuhaus · Matthias Hein
[ Exhibit Hall I ]
Abstract
Vision-language models (VLMs) are prone to object hallucinations, where they erroneously indicate the presence of certain objects in an image. Existing benchmarks quantify hallucinations using relatively small, labeled datasets. However, this approach is i) insufficient to assess hallucinations that arise in open-world settings, where VLMs are widely used, and ii) inadequate for detecting systematic errors in VLMs. We propose DASH (Detection and Assessment of Systematic Hallucinations), an automatic, large-scale pipeline designed to identify systematic hallucinations of VLMs on real-world images in an open-world setting. A key component is DASH-OPT for image-based retrieval, where we optimize over the “natural image manifold” to generate images that mislead the VLM. The output of DASH consists of clusters of real and semantically similar images for which the VLM hallucinates an object. We apply DASH to PaliGemma and two LLaVA-NeXT models across 380 object classes and, in total, find more than 15k clusters with 650k images. We study the transfer of the identified systematic hallucinations to other VLMs and show that fine-tuning PaliGemma with the model-specific images obtained with DASH mitigates object hallucinations.
Poster
Jiajin Tang · Zhengxuan Wei · Yuchen Zhu · Cheng Shi · Guanbin Li · Liang Lin · Sibei Yang
[ Exhibit Hall I ]
Abstract
Temporal sentence grounding aims to identify exact moments in a video that correspond to a given textual query, typically addressed with detection transformer (DETR) solutions. However, we find that typical strategies designed to enhance DETR do not improve, and may even degrade, its performance in this task. We systematically analyze and identify the root causes of this abnormal behavior: (1) conflicts between queries from similar target moments and (2) internal query conflicts due to the tension between global semantics and local localization. Building on these insights, we propose a simple yet powerful baseline, Sim-DETR, which extends the standard DETR with two minor modifications in the decoder layers: (1) constraining self-attention between queries based on their semantic and positional overlap and (2) adding query-to-frame alignment to bridge the global and local contexts. Experiments demonstrate that Sim-DETR unlocks the full potential of DETR for temporal sentence grounding, offering a strong baseline for future research. Code will be made publicly available.
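As an illustration of the first modification, the following is a minimal sketch of a self-attention mask between queries derived from positional overlap only. It is not the authors' implementation: the function names, the use of temporal IoU, the threshold, and the choice to allow (rather than block) attention between overlapping queries are all assumptions made for this example, and the semantic-overlap term is omitted.

```python
def temporal_iou(a, b):
    """IoU of two 1D spans given as (start, end) with start < end."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def overlap_attention_mask(spans, iou_threshold):
    """Build a boolean mask where mask[i][j] is True if query i may attend to query j.

    Assumption: a query always attends to itself, and to any other query
    whose predicted span overlaps its own by at least iou_threshold.
    """
    n = len(spans)
    return [
        [i == j or temporal_iou(spans[i], spans[j]) >= iou_threshold for j in range(n)]
        for i in range(n)
    ]


# Three hypothetical query spans in seconds.
spans = [(0.0, 4.0), (2.0, 6.0), (10.0, 12.0)]
mask = overlap_attention_mask(spans, iou_threshold=0.3)
```

In this toy case the first two queries overlap with IoU 1/3 and can attend to each other, while the third is isolated.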
Poster
Onkar Susladkar · Gayatri Deshmukh · Yalcin Tur · Gorkem Durak · Ulas Bagci
[ Exhibit Hall I ]
Abstract
We introduce ViCTr (Vital Consistency Transfer), a framework for advancing medical image synthesis through a principled integration with Rectified Flow trajectories. Unlike traditional approaches, we modify the Tweedie formulation to accommodate linear trajectories within the Rectified Flow framework, enabling more accurate initial state approximation and consistent trajectory paths. ViCTr’s design allows for precise control over anatomical accuracy and pathological attributes across CT and MRI modalities via a two-stage architecture. In Stage 1, it performs anatomical learning on the ATLAS-8k dataset using Elastic Weight Consolidation (EWC) to selectively train model weights tailored for medical data. In Stage 2, an adversarial fine-tuning strategy is applied: the base model from Stage 1 remains frozen while a LoRA adapter is exclusively applied to the weights tuned in Stage 1, allowing targeted adaptation for downstream tasks while preserving the core medical data properties learned during pretraining. ViCTr achieves notable improvements by utilizing segmentation maps and textual prompts to enable refined control over CT and MRI synthesis. Extensive experiments on benchmark datasets, including BTCV, AMOS, and CirrMRI600+, demonstrate ViCTr’s superiority, showing significant enhancements in quantitative metrics and clinical detail, such as liver surface nodularity in cirrhosis synthesis. These results establish ViCTr as a major advancement in …
Poster
Rohan Sharma · Changyou Chen · Feng-Ju Chang · Seongjun Yun · Xiaohu Xie · Rui Meng · Dehong Xu · Alejandro Mottini · qingjun cui
[ Exhibit Hall I ]
Abstract
We present Multi-Modal Multi-Task Unified Embedding Model (M3T-UEM), a framework that advances vision-language matching and retrieval by leveraging a large language model (LLM) backbone. While concurrent LLM-based approaches like VLM2VEC, MM-Embed, NV-Embed, and MM-GEM have demonstrated impressive capabilities in multi-modal and multi-task scenarios, our work introduces novel mechanisms for task-adaptive learning and embedding extraction that further enhance the potential of LLM-based retrieval systems. Our key technical contribution lies in the development of a task-aware contrastive learning framework with an automated Bayesian weighing mechanism. This approach provides a principled way to balance multiple tasks during training, departing from conventional contrastive learning strategies. We further enhance the framework through a multiple-token summarization strategy and an auxiliary language modeling objective, which together significantly improve retrieval performance. Comprehensive experiments on M-BEIR and ICinW benchmarks demonstrate M3T-UEM's effectiveness, showing competitive or superior performance compared to both traditional encoder-based methods and recent LLM-based approaches. Furthermore, we demonstrate particular strengths in handling compositional conceptual changes and multilingual scenarios owing to the incorporation of an LLM backbone where the method drastically outperforms CLIP in zero-shot settings, often by orders of magnitude.
Poster
Songsong Duan · Xi Yang · Nannan Wang
[ Exhibit Hall I ]
Abstract
Recent Training-Free Open-Vocabulary Semantic Segmentation (TF-OVSS) leverages a pre-trained vision-language model to segment images from open-set visual concepts without training and fine-tuning. The key of TF-OVSS is to improve the local spatial representation of CLIP by leveraging self-correlation maps, thus preserving its zero-shot capability and achieving open understanding. However, most TF-OVSS methods utilize the Multi-Head Self-Attention (MHSA) mechanism to generate self-correlation maps, neglecting the diversity among multiple heads. In this paper, we explore the diversity of MHSA, revealing that the contributions of single-head attention to the final results are varied and redundant. To address this issue, we introduce DIH-CLIP, a training-free CLIP model for open-vocabulary semantic segmentation. Specifically, we propose a Selective Head Attention (SHA) to replace the traditional MHSA in CLIP, which contains two key designs: (1) evaluating the diversity of multi-head attention by calculating information entropy scores of each head's attention map and removing redundant attention heads with a threshold; (2) transferring the local representation of single-head attention to the global CLIP feature to enhance the local spatial representation capability of CLIP. Furthermore, we embed SHA into the middle layers of CLIP to extract plentiful details. Experiments on six benchmark datasets demonstrate the effectiveness of DIH-CLIP.
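The entropy-based head selection in design (1) can be sketched as follows. This is a minimal illustration under stated assumptions, not DIH-CLIP's code: the function names are hypothetical, the score is taken to be the mean Shannon entropy over attention rows, and heads with entropy above the threshold are treated as the redundant ones (the abstract does not state the direction of the comparison).

```python
import math


def attention_entropy(attn_map):
    """Mean Shannon entropy (in nats) over the rows of one head's attention map.

    attn_map: list of rows, each row a probability distribution over patches.
    """
    total = 0.0
    for row in attn_map:
        total += -sum(p * math.log(p) for p in row if p > 0.0)
    return total / len(attn_map)


def select_heads(head_maps, threshold):
    """Return indices of heads whose entropy score is at or below the threshold.

    Assumption: near-uniform (high-entropy) heads are considered redundant.
    """
    return [i for i, m in enumerate(head_maps) if attention_entropy(m) <= threshold]


# Toy example with two patches and three heads.
heads = [
    [[0.9, 0.1], [0.1, 0.9]],  # focused head
    [[0.5, 0.5], [0.5, 0.5]],  # uniform head, entropy ln(2)
    [[1.0, 0.0], [0.0, 1.0]],  # one-hot head, zero entropy
]
kept = select_heads(heads, threshold=0.5)
```

Here the uniform head is dropped and the other two are kept; in practice the threshold would need tuning per model and layer.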
Poster
Ryan Wong · Necati Cihan Camgoz · Richard Bowden
[ Exhibit Hall I ]
Abstract
Sign language representation learning presents unique challenges due to the complex spatio-temporal nature of signs and the scarcity of labeled datasets. Existing methods often rely either on models pre-trained on general visual tasks that lack sign-specific features, or use complex multimodal and multi-branch architectures. To bridge this gap, we introduce a scalable, self-supervised framework for sign representation learning. We leverage important inductive (sign) priors during the training of our RGB model. To do this, we leverage simple but important cues based on skeletons while pretraining a masked autoencoder. These sign-specific priors alongside feature regularization and an adversarial style-agnostic loss provide a powerful backbone. Notably, our model does not require skeletal keypoints during inference, avoiding the limitations of keypoint-based models during downstream tasks. When finetuned, we achieve state-of-the-art performance for sign recognition on the WLASL, ASL-Citizen and NMFs-CSL datasets, using a simpler architecture and with only a single modality. Beyond recognition, our frozen model excels in sign dictionary retrieval and sign translation, surpassing standard MAE pretraining and skeletal-based representations in retrieval. It also reduces computational costs for training existing sign translation models while maintaining strong performance on Phoenix2014T, CSL-Daily and How2Sign.
Poster
Zhixiang Chi · Yanan Wu · Li Gu · Huan Liu · Ziqiang Wang · Yang Zhang · Yang Wang · Konstantinos Plataniotis
[ Exhibit Hall I ]
Abstract
CLIP exhibits strong visual-textual alignment but struggles with open-vocabulary segmentation due to poor localization. Prior methods enhance spatial coherence by modifying intermediate attention. However, this coherence isn't consistently propagated to the final output due to subsequent operations such as projections. Additionally, intermediate attention lacks direct interaction with text representations, and this semantic discrepancy limits the full potential of CLIP. In this work, we propose a training-free, feedback-driven self-adaptive framework that adapts output-based patch-level correspondences back to the intermediate attention. The output predictions, being the culmination of the model's processing, encapsulate the most comprehensive visual and textual semantics about each patch. Our approach enhances semantic consistency between internal representations and final predictions by leveraging the model's outputs as a stronger spatial coherence prior. We design key modules, including attention isolation, confidence-based pruning for sparse adaptation, and adaptation ensemble, to effectively feed back the output coherence cues. Our method functions as a plug-in module, seamlessly integrating into four state-of-the-art approaches with three backbones (ViT-B, ViT-L, ViT-H). We further validate our framework across multiple attention types (Q-K, self-self, and Proxy augmented with MAE, SAM, and DINO). Our approach consistently improves their performance across eight benchmarks.
Poster
Mark Endo · Xiaohan Wang · Serena Yeung-Levy
[ Exhibit Hall I ]
Abstract
Recent works on accelerating Vision-Language Models achieve strong performance across a variety of vision-language tasks despite highly compressing visual information. In this work, we examine the popular acceleration approach of early pruning of visual tokens inside the language model. Surprisingly, we find that while strong performance is maintained across many tasks, it exhibits drastically different behavior for a subset of vision-centric tasks such as localization. Upon further investigation, we uncover a core issue with the acceleration approach where most tokens towards the top of the image are pruned away. Yet, on many benchmarks aiming to evaluate vision-centric capabilities, strong performance persists with the flawed pruning strategy, highlighting these benchmarks' limited ability to assess fine-grained visual capabilities. Based on these findings, we propose FEATHER (Fast and Effective Acceleration wiTH Ensemble cRiteria), a straightforward approach that resolves the discovered early-layer pruning issue and further enhances the preservation of relevant tokens via multistage pruning with early uniform sampling to ensure broad image coverage. With comparable computational savings, we find that FEATHER achieves more than 5x performance improvement on the vision-centric localization benchmarks compared to the original acceleration approach.
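The idea of combining criterion-based pruning with uniform sampling for coverage can be sketched as below. This is a simplified single-stage illustration, not FEATHER itself: the function name, the parameters, and the use of a plain stride for uniform sampling are assumptions, and the importance scores are taken as given rather than derived from any particular criterion.

```python
def prune_tokens(scores, keep_by_score, uniform_stride):
    """Return sorted indices of visual tokens to keep.

    scores: one importance score per token (higher means more relevant).
    keep_by_score: number of top-scoring tokens to keep.
    uniform_stride: additionally keep every uniform_stride-th token so that
        all parts of the image stay represented.
    """
    uniform = set(range(0, len(scores), uniform_stride))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top = set(ranked[:keep_by_score])
    return sorted(uniform | top)


# Hypothetical scores that favor tokens late in the sequence
# (for example, the bottom of the image in raster order).
scores = [0.1, 0.2, 0.1, 0.3, 0.2, 0.9, 0.8, 0.7]
kept = prune_tokens(scores, keep_by_score=3, uniform_stride=4)
```

With score-only selection, tokens 5 to 7 would be the only survivors; the uniform component also retains tokens 0 and 4, which keeps some coverage of the earlier positions.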
Poster
Yicong Li · Yiyang Chen · Zhenyuan Ma · Junbin Xiao · Xiang Wang · Angela Yao
[ Exhibit Hall I ]
Abstract
Language-guided Affordance Segmentation (LASO) aims to identify actionable object regions based on text instructions. At the core of its practicality is learning generalizable affordance knowledge that captures functional regions across diverse objects. However, current LASO solutions struggle to extend learned affordances to object categories that are not encountered during training. Scrutinizing these designs, we identify limited generalizability on unseen categories, stemming from (1) underutilized generalizable patterns in the intermediate layers of both 3D and text backbones, which impedes the formation of robust affordance knowledge, and (2) the inability to handle substantial variability in affordance regions across object categories due to a lack of structural knowledge of the target region. Towards this, we introduce a \textbf{G}enera\textbf{L}ized fr\textbf{A}mework on u\textbf{N}seen \textbf{C}ategori\textbf{E}s (GLANCE), incorporating two key components: a cross-modal connector that links intermediate stages of the text and 3D backbones to enrich pointwise embeddings with affordance concepts, and a VLM-guided query generator that provides affordance priors by extracting a few 3D key points based on the intra-view reliability and cross-view consistency of their multi-view segmentation masks. Extensive experiments on two benchmark datasets demonstrate that GLANCE outperforms state-of-the-art methods (SoTAs), with notable improvements in generalization to unseen categories. Our code is available at \url{https://anonymous.4open.science/r/GLANCE}.
Poster
Jiayuan Chen · Thai-Hoang Pham · Yuanlong Wang · Ping Zhang
[ Exhibit Hall I ]
Abstract
High-throughput screening techniques, such as microscopy imaging of cellular responses to genetic and chemical perturbations, play a crucial role in drug discovery and biomedical research. However, robust perturbation screening for \textit{de novo} cell lines remains challenging due to the significant morphological and biological heterogeneity across cell lines. To address this, we propose a novel framework that integrates external biological knowledge into existing pretraining strategies to enhance microscopy image profiling models. Our approach explicitly disentangles perturbation-specific and cell line-specific representations using external biological information. Specifically, we construct a knowledge graph leveraging protein interaction data from STRING and Hetionet databases to guide models toward perturbation-specific features during pretraining. Additionally, we incorporate transcriptomic features from single-cell foundation models to capture cell line-specific representations. By learning these disentangled features, our method improves the generalization of imaging models to \textit{de novo} cell lines. We evaluate our framework on the RxRx database through one-shot fine-tuning on an RxRx1 cell line and few-shot fine-tuning on cell lines from the RxRx19a dataset. Experimental results demonstrate that our method enhances microscopy image profiling for \textit{de novo} cell lines, highlighting its effectiveness in real-world phenotype-based drug discovery applications.
Poster
Yuzhang Shang · Mu Cai · Bingxin Xu · Yong Jae Lee · Yan Yan
[ Exhibit Hall I ]
Abstract
Large Multimodal Models (LMMs) have shown significant visual reasoning capabilities by connecting a visual encoder and a large language model. LMMs typically take in a fixed and large amount of visual tokens, such as the penultimate layer features in the CLIP visual encoder, as the prefix content. Recent LMMs incorporate more complex visual inputs, such as high-resolution images and videos, which further increases the number of visual tokens significantly. However, due to the inherent design of the Transformer architecture, the computational costs of these models tend to increase quadratically with the number of input tokens. To tackle this problem, we explore a token reduction mechanism that identifies significant spatial redundancy among visual tokens. In response, we propose PruMerge, a novel adaptive visual token reduction strategy that significantly reduces the number of visual tokens without compromising the performance of LMMs. Specifically, to measure the importance of each token, we exploit the sparsity observed in the visual encoder, characterized by the sparse distribution of attention scores between the class token and visual tokens. This sparsity enables us to dynamically select the most crucial visual tokens to retain. Subsequently, we cluster the selected (unpruned) tokens based on their key similarity and merge them …
Poster
Changhao Li · Xinrui Chen · Ji Wang · Kang Zhao · Jianfei Chen
[ Exhibit Hall I ]
Abstract
Quantization is a key technique to reduce network size and computational complexity by representing the network parameters with a lower precision. Traditional quantization methods rely on access to original training data, which is often restricted due to privacy concerns or security challenges. Zero-shot Quantization (ZSQ) addresses this by using synthetic data generated from pre-trained models, eliminating the need for real training data. Recently, ZSQ has been extended to object detection. However, existing methods use unlabeled task-agnostic synthetic images that lack the specific information required for object detection, leading to suboptimal performance. In this paper, we propose a novel task-specific ZSQ framework for object detection networks, which consists of two main stages. First, we introduce a bounding box and category sampling strategy to synthesize a task-specific calibration set from the pre-trained network, reconstructing object locations, sizes, and category distributions without any prior knowledge. Second, we integrate task-specific training into the knowledge distillation process to restore the performance of quantized detection networks. Extensive experiments conducted on the MS-COCO and Pascal VOC datasets demonstrate the efficiency and state-of-the-art performance of our method.
Poster
Shicai Wei · Chunbo Luo · Yang Luo
[ Exhibit Hall I ]
Abstract
Multimodal learning often encounters the under-optimized problem and may have worse performance than unimodal learning. Existing methods attribute this problem to the imbalanced learning between modalities and rebalance them through gradient modulation. However, they fail to explain why the dominant modality in multimodal models also underperforms that in unimodal learning. In this work, we reveal the optimization conflict between the modality encoder and modality fusion module in multimodal models. Specifically, we prove that the cross-modal fusion in multimodal models decreases the gradient passed back to each modality encoder compared with unimodal models. Consequently, the performance of each modality in the multimodal model is inferior to that in the unimodal model. To this end, we propose a disentangled gradient learning (DGL) framework to decouple the optimization of the modality encoder and modality fusion module in the multimodal model. DGL truncates the gradient back-propagated from the multimodal loss to the modality encoder and replaces it with the gradient from unimodal loss. Besides, DGL removes the gradient back-propagated from the unimodal loss to the modality fusion module. This helps eliminate the gradient interference between the modality encoder and modality fusion module while ensuring their respective optimization processes. Finally, extensive experiments on multiple types …
Poster
Junho Kim · Hyungjin Chung · Byung-Hoon Kim
[ Exhibit Hall I ]
Abstract
Category-agnostic pose estimation (CAPE) has traditionally relied on support images with annotated keypoints, a process that is often cumbersome and may fail to fully capture the necessary correspondences across diverse object categories. Recent efforts have explored the use of text queries, leveraging their enhanced stability and generalization capabilities. However, existing approaches often remain constrained by their reliance on support queries, their failure to fully utilize the rich priors embedded in pre-trained large language models, and the limitations imposed by their parametric distribution assumptions. To address these challenges, we introduce CapeLLM, the first multimodal large language model (MLLM) designed for CAPE. Our method employs only a query image and detailed text descriptions as input to estimate category-agnostic keypoints. Our method encompasses effective training strategies and carefully designed instructions for applying the MLLM to CAPE. Moreover, we propose an inference mechanism that further enhances the reasoning process for unseen keypoints while flexibly modeling their underlying spatial distribution and uncertainty, allowing for adaptive refinement based on contextual cues. We conducted extensive experiments to apply the MLLM to CAPE effectively, focusing not only on the model architecture and prompt design but also on ensuring robustness across input variations. Our approach sets a new state-of-the-art …
Poster
Sanjoy Chowdhury · Hanan Gani · Nishit Anand · Sayan Nag · Ruohan Gao · Mohamed Elhoseiny · Salman Khan · Dinesh Manocha
[ Exhibit Hall I ]
Abstract
Recent advancements in reasoning optimization have greatly enhanced the performance of large language models (LLMs). However, existing work fails to address the complexities of audio-visual scenarios, underscoring the need for further research. In this paper, we introduce AURELIA, a novel actor-critic based audio-visual (AV) reasoning framework that distills structured, step-by-step reasoning into AVLLMs at test time, improving their ability to process complex multi-modal inputs without additional training or fine-tuning. To further advance AVLLM reasoning skills, we present AVReasonBench, a challenging benchmark comprising 4500 audio-visual questions, each paired with detailed step-by-step reasoning. Our benchmark spans six distinct tasks, including AV-GeoIQ, which evaluates AV reasoning combined with geographical and cultural knowledge. Evaluating 18 AVLLMs on AVReasonBench reveals significant limitations in their multi-modal reasoning capabilities. Using AURELIA, we achieve up to a 100% relative improvement, demonstrating its effectiveness. This performance gain highlights the potential of reasoning-enhanced data generation for advancing AVLLMs in real-world applications. Our code and data will be publicly released.
Poster
Gueter Josmy Faure · Jia-Fong Yeh · Min-Hung Chen · Hung-Ting Su · Shang-Hong Lai · Winston Hsu
[ Exhibit Hall I ]
Abstract
Long-form video understanding presents unique challenges that extend beyond traditional short-video analysis approaches, particularly in capturing long-range dependencies, processing redundant information efficiently, and extracting high-level semantic concepts. To address these challenges, we propose a novel approach that more accurately reflects human cognition. This paper introduces HERMES: temporal-coHERent long-forM understanding with Episodes and Semantics, featuring two versatile modules that can enhance existing video-language models or operate as a standalone system. Our Episodic COmpressor (ECO) efficiently aggregates representations from micro to semi-macro levels, reducing computational overhead while preserving temporal dependencies. Our Semantics reTRiever (SeTR) enriches these representations with semantic information by focusing on broader context, dramatically reducing feature dimensionality while preserving relevant macro-level information. We demonstrate that these modules can be seamlessly integrated into existing SOTA models, consistently improving their performance while reducing inference latency by up to 43\%. As a standalone system, HERMES achieves state-of-the-art performance across multiple long-video understanding benchmarks in both zero-shot and fully-supervised settings. Our code will be made public.
Poster
Yusen Zhang · Wenliang Zheng · Aashrith Madasu · Peng Shi · Ryo Kamoi · Hao Zhou · Zhuoyang Zou · Shu Zhao · Sarkar Snigdha Sarathi Das · Vipul Gupta · Xiaoxin Lu · Nan Zhang · Ranran Zhang · Avitej Iyer · Renze Lou · Wenpeng Yin · Rui Zhang
[ Exhibit Hall I ]
Abstract
High-resolution image (HRI) understanding aims to process images with a large number of pixels such as pathological images and agricultural aerial images, both of which can exceed 1 million pixels. Vision Large Language Models (VLMs) typically handle higher-resolution images through dynamic patching. However, there is a lack of a comprehensive benchmark for VLMs to evaluate HRI understanding, leaving this domain underexplored. To address this gap, we introduce HRScene, a novel unified benchmark for HRI understanding with rich scenes. HRScene incorporates 25 real-world datasets and 2 synthetic diagnostic datasets with resolutions ranging from 1,024 $\times$ 1,024 to 35,503 $\times$ 26,627. HRScene is collected and re-annotated by 10 graduate-level annotators, covering 25 scenarios, ranging from microscopic and radiology images to street views, long-range pictures, and telescope images. It includes high-resolution images of real-world objects, scanned documents, and composite multi-image. The two diagnostic evaluation datasets are synthesized by combining the target image with the gold answer and similar distracting images in different orders. These datasets assess how well models utilize HRI by comparing performance across different image regions. We conduct extensive experiments involving 27 VLMs, including Gemini 2.0 Pro and GPT-4o. Experiments on HRScene show that current VLMs achieve an average accuracy of …
Poster
Hyojin Bahng · Caroline Chan · Fredo Durand · Phillip Isola
[ Exhibit Hall I ]
Abstract
Current metrics for image-text alignment rely on human preferences or task-oriented VQA datasets for supervision. We propose an alternative approach that leverages cycle consistency as a supervisory signal. Given an image, we generate diverse captions using image-to-text models, then map these captions back to image space with a text-to-image model. We compute a cycle consistency score by measuring perceptual similarity between the original and reconstructed image. The score is used to determine preferences over captions, i.e., more descriptive and accurate captions yield faithful reconstructions and are thus preferred over lower quality captions. Analogously, we can measure cycle consistency in the text-to-image-to-text direction by measuring textual similarity between an input caption and its reconstruction through the cycle. We explore both mapping directions, resulting in 398K image-to-text pairs and 468K text-to-image comparison pairs. Our reward model, trained on this dataset, outperforms state-of-the-art methods on detailed captioning tasks, with superior inference-time scalability when used as a verifier for Best-of-N evaluation. We will release our dataset, model, and code upon acceptance.
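The image-to-text-to-image scoring described above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: `rank_captions_by_cycle_consistency`, the `text_to_image` callable, and the toy feature vectors are hypothetical stand-ins, and cosine similarity on feature vectors is assumed in place of the perceptual similarity measure, which the abstract does not name.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two 1-D vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def rank_captions_by_cycle_consistency(image_vec, captions, text_to_image):
    """Score each caption by how well its reconstruction matches the image.

    image_vec     : feature vector of the original image (assumed stand-in
                    for a perceptual embedding).
    captions      : list of candidate caption strings.
    text_to_image : hypothetical callable mapping a caption to the feature
                    vector of the image regenerated from it.
    Returns (caption, score) pairs sorted from most to least preferred.
    """
    scored = [(c, cosine_similarity(image_vec, text_to_image(c))) for c in captions]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy stand-in for a text-to-image model followed by a feature extractor.
toy_reconstructions = {
    "a brown dog catching a red frisbee": np.array([0.9, 0.1]),
    "an animal": np.array([0.0, 1.0]),
}
ranking = rank_captions_by_cycle_consistency(
    np.array([1.0, 0.0]), list(toy_reconstructions), toy_reconstructions.__getitem__
)
```

In this toy setup the more descriptive caption reconstructs a vector closer to the original and is ranked first, which is the kind of preference pair the abstract describes collecting.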
Poster
Yufei Zhan · Shurong Zheng · Yousong Zhu · Hongyin Zhao · Fan Yang · Ming Tang · Jinqiao Wang
[ Exhibit Hall I ]
Abstract
Large Vision Language Models have achieved fine-grained object perception, but the limitation of image resolution remains a significant obstacle to surpassing the performance of task-specific experts in complex and dense scenarios. Such limitation further restricts the model's potential to achieve nuanced visual and language referring in domains such as GUI Agents, counting, \textit{etc}. To address this issue, we introduce a unified high-resolution generalist model, Griffon v2, enabling flexible object referring with visual and textual prompts. To efficiently scale up image resolution, we design a simple and lightweight down-sampling projector to overcome the input tokens constraint in Large Language Models. This design inherently preserves the complete contexts and fine details and significantly improves multimodal perception ability, especially for small objects. Building upon this, we further equip the model with visual-language co-referring capabilities through a plug-and-play visual tokenizer. It enables user-friendly interaction with flexible target images, free-form texts, and even coordinates. Experiments demonstrate that Griffon v2 can localize objects of interest with visual and textual referring, achieve state-of-the-art performance on REC and phrase grounding, and outperform expert models in object detection, object counting, and REG. Data, codes, and models will be released.
Poster
Weihan Wang · zehai he · Wenyi Hong · Yean Cheng · Xiaohan Zhang · Ji Qi · Ming Ding · Xiaotao Gu · Shiyu Huang · Bin Xu · Yuxiao Dong · Jie Tang
[ Exhibit Hall I ]
Abstract
Recent progress in multimodal large language models has markedly enhanced the understanding of short videos (typically under one minute), and several evaluation datasets have emerged accordingly. However, these advancements fall short of meeting the demands of real-world applications such as embodied intelligence for long-term decision-making, in-depth movie reviews and discussions, and live sports commentary, all of which require comprehension of long videos spanning several hours. To address this gap, we introduce LVBench, a benchmark specifically designed for long video understanding. Our dataset comprises publicly sourced videos and encompasses a diverse set of tasks aimed at long video comprehension and information extraction. LVBench is designed to challenge multimodal models to demonstrate long-term memory and extended comprehension capabilities. Our extensive evaluations reveal that current multimodal models still underperform on these demanding long video understanding tasks. Through LVBench, we aim to spur the development of more advanced models capable of tackling the complexities of long video comprehension.
Poster
Yongxin Zhu · Bocheng Li · Yifei Xin · Zhihua Xia · Linli Xu
[ Exhibit Hall I ]
Abstract
Vector Quantization (VQ) is a widely used method for converting continuous representations into discrete codes, which has become fundamental in unsupervised representation learning. However, VQ models are often hindered by the problem of representation collapse in the latent space, which leads to low codebook utilization and limits the scalability of the codebook for large-scale training. Existing methods designed to mitigate representation collapse typically design complex optimization strategies or reduce the dimensionality of latent space at the expense of model capacity, which do not fully resolve the core issue. In this study, we analyze the representation collapse in VQ models and identify its primary cause as the disjoint optimization of the codebook, where only a small subset of code vectors are updated through gradient descent. To address this issue, we propose \textbf{Sim}ple\textbf{VQ}, a novel method that reparameterizes the code vectors through a linear transformation layer based on a learnable latent basis. This transformation optimizes the \textit{entire linear space} spanned by the codebook, rather than merely updating \textit{single code vectors} selected by the nearest-neighbor search in vanilla VQ models. Although it is commonly understood that the multiplication of two linear matrices is equivalent to applying a single linear layer, our approach works …
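To make the reparameterization idea concrete, here is a small numerical sketch. It assumes the effective codebook is the product of a latent basis and a single linear layer `W`, and, purely for simplicity, it updates only `W`; the sizes, learning rate, and variable names are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
num_codes, dim = 8, 4

basis = rng.normal(size=(num_codes, dim))  # latent basis, one row per code
W = np.eye(dim)                            # linear layer shared by all codes

def codebook(basis, W):
    """Effective code vectors: every row is a basis row times W."""
    return basis @ W

def quantize(z, codes):
    """Index of the nearest code vector (nearest-neighbor search)."""
    return int(np.argmin(np.sum((codes - z) ** 2, axis=1)))

z = rng.normal(size=dim)                   # one encoder output
codes_before = codebook(basis, W)
k = quantize(z, codes_before)

# Gradient of ||z - basis[k] @ W||^2 with respect to W.
residual = z - codes_before[k]
grad_W = -2.0 * np.outer(basis[k], residual)
W_new = W - 0.1 * grad_W                   # one plain gradient-descent step

codes_after = codebook(basis, W_new)
# Which code vectors changed after a step driven by a single selected code?
moved = np.linalg.norm(codes_after - codes_before, axis=1) > 0
```

Because every code vector passes through the same `W`, a gradient step triggered by one selected code shifts the whole codebook, whereas a vanilla VQ update would touch only row `k`.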
Poster
Ylli Sadikaj · Hongkuan Zhou · Lavdim Halilaj · Stefan Schmid · Steffen Staab · Claudia Plant
[ Exhibit Hall I ]
Abstract
Precise optical inspection in industrial applications is crucial for minimizing scrap rates and reducing the associated costs. Besides merely detecting whether a product is anomalous, it is crucial to know the distinct type of defect, such as a bend, cut, or scratch. The ability to recognize the "exact" defect type enables automated treatment of the anomalies in modern production lines. Current methods are limited to detecting whether a product is defective, without providing any insight into the defect type, let alone detecting and identifying multiple defects. We propose MultiADS, a zero-shot learning approach able to perform Multi-Anomaly Detection and Segmentation. The architecture of MultiADS comprises CLIP and extra linear layers to align the visual and textual representations in a joint feature space. To the best of our knowledge, our proposal is the first approach to perform a multi-type anomaly segmentation task in zero-shot learning. In contrast to the other baselines, our approach i) generates specific anomaly masks for each distinct defect type, ii) learns to distinguish defect types, and iii) simultaneously identifies multiple defect types present in an anomalous product. Additionally, our approach outperforms zero/few-shot learning SoTA methods on image-level and pixel-level anomaly detection and segmentation tasks …
Poster
Nandish Chattopadhyay · Amira Guesmi · Muhammad Abdullah Hanif · Bassem ouni · Muhammad Shafique
[ Exhibit Hall I ]
Abstract
Adversarial attacks present a significant challenge to the dependable deployment of machine learning models, with patch-based attacks being particularly potent. These attacks introduce adversarial perturbations in localized regions of an image, deceiving even well-trained models. In this paper, we propose Outlier Detection and Dimension Reduction (ODDR), a comprehensive defense strategy engineered to counteract patch-based adversarial attacks through advanced statistical methodologies. Our approach is based on the observation that input features corresponding to adversarial patches, whether naturalistic or synthetic, deviate from the intrinsic distribution of the remaining image data and can thus be identified as outliers. ODDR operates through a robust three-stage pipeline: Fragmentation, Segregation, and Neutralization. This model-agnostic framework is versatile, offering protection across various tasks, including image classification, object detection, and depth estimation, and proves effective in both CNN-based and Transformer-based architectures. In the Fragmentation stage, image samples are divided into smaller segments, preparing them for the Segregation stage, where advanced outlier detection techniques isolate anomalous features linked to adversarial perturbations. The Neutralization stage then applies dimension reduction techniques to these outliers, effectively neutralizing the adversarial impact while preserving critical information for the machine learning task. Extensive evaluation on benchmark datasets against state-of-the-art adversarial patches underscores the efficacy of ODDR. For example, our …
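The three stages can be illustrated with a toy pipeline. The abstract does not say which outlier detector or dimension-reduction method ODDR uses, so this sketch substitutes assumptions of its own: per-tile pixel spread with a robust z-score for Segregation and a rank-1 truncated SVD for Neutralization. All function names, the tile size, and the threshold are hypothetical.

```python
import numpy as np

def fragment(image, size):
    """Fragmentation: top-left corners of non-overlapping size x size tiles."""
    h, w = image.shape
    return [(r, c) for r in range(0, h, size) for c in range(0, w, size)]

def segregate(image, corners, size, threshold=3.5):
    """Segregation: flag tiles whose pixel spread is a robust-z-score outlier."""
    spread = np.array([image[r:r + size, c:c + size].std() for r, c in corners])
    median = np.median(spread)
    mad = np.median(np.abs(spread - median)) + 1e-12
    robust_z = 0.6745 * (spread - median) / mad
    return [corners[i] for i in np.flatnonzero(robust_z > threshold)]

def neutralize(image, flagged, size, rank=1):
    """Neutralization: replace each flagged tile by its low-rank approximation."""
    out = image.copy()
    for r, c in flagged:
        u, s, vt = np.linalg.svd(out[r:r + size, c:c + size], full_matrices=False)
        out[r:r + size, c:c + size] = (u[:, :rank] * s[:rank]) @ vt[:rank]
    return out

rng = np.random.default_rng(0)
image = 0.5 + 0.01 * rng.normal(size=(32, 32))   # smooth synthetic "scene"
image[8:16, 16:24] = rng.uniform(size=(8, 8))    # noisy synthetic "patch"

corners = fragment(image, 8)
flagged = segregate(image, corners, 8)
cleaned = neutralize(image, flagged, 8)
```

The synthetic patch stands out because its pixel spread is far larger than that of the surrounding tiles; a real adversarial patch would call for richer features than a single standard deviation.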
Poster
Hao Tang · Zhiqing Guo · Liejun Wang · Chao Liu
[ Exhibit Hall I ]
Abstract
In recent years, it has been found that “grandmother cells” in the primary visual cortex (V1) of macaques can directly recognize visual input with complex shapes. This inspires us to examine the value of these cells in promoting the research of medical image segmentation. In this paper, we design a Similarity Memory Prior Network (Sim-MPNet) for medical image segmentation. Specifically, we propose a Dynamic Memory Weights–Loss Attention (DMW-LA), which matches and remembers the category features of specific lesions or organs in medical images through the similarity memory prior in the prototype memory bank, thus helping the network to learn subtle texture changes between categories. DMW-LA also dynamically updates the similarity memory prior in reverse through the Weight-Loss Dynamic (W-LD) update strategy, effectively assisting the network in directly extracting category features. In addition, we propose the Double-Similarity Global Internal Enhancement Module (DS-GIM) to deeply explore the internal differences in the feature distribution of input data through cosine similarity and Euclidean distance. Extensive experiments on four public datasets show that Sim-MPNet has better segmentation performance than other state-of-the-art methods. Our code is available at https://anonymous.4open.science/r/Sim-MPNet.
Poster
Han Ji · Yuqi Feng · Jiahao Fan · Yanan Sun
[ Exhibit Hall I ]
Abstract
Performance predictors have emerged as a promising method to accelerate the evaluation stage of neural architecture search (NAS). These predictors estimate the performance of unseen architectures by learning from the correlation between a small set of trained architectures and their performance. However, most existing predictors ignore the inherent distribution shift between limited training samples and diverse test samples. Hence, they tend to learn spurious correlations as shortcuts to predictions, leading to poor generalization. To address this, we propose a Causality-guided Architecture Representation Learning (CARL) method aiming to separate critical (causal) and redundant (non-causal) features of architectures for generalizable architecture performance prediction. Specifically, we employ a substructure extractor to split the input architecture into critical and redundant substructures in the latent space. Then, we generate multiple interventional samples by pairing critical representations with diverse redundant representations to prioritize critical features. Extensive experiments on five NAS search spaces demonstrate the state-of-the-art accuracy and superior interpretability of CARL. For instance, CARL achieves 97.67\% top-1 accuracy on CIFAR-10 using DARTS.
Poster
Peng Du · Hui Li · Han Xu · Paul Jeon · Dongwook Lee · Daehyun Ji · Ran Yang · Feng Zhu
[ Exhibit Hall I ]
Abstract
Discrete Wavelet Transform (DWT) has been widely explored to enhance the performance of image super-resolution (SR). Although some DWT-based methods improve SR by capturing fine-grained frequency signals, most existing approaches neglect the interrelations among multi-scale frequency sub-bands, resulting in inconsistencies and unnatural artifacts in the reconstructed images. To address this challenge, we propose a Diffusion Transformer model based on image Wavelet spectra for SR (DTWSR). DTWSR combines the strengths of diffusion models and transformers to capture the interrelations among multi-scale frequency sub-bands, leading to a more consistent and realistic SR image. Specifically, we use a Multi-level Discrete Wavelet Transform (MDWT) to decompose images into wavelet spectra. A pyramid tokenization method is proposed which embeds the spectra into a sequence of tokens for the transformer model, facilitating the capture of features from both the spatial and frequency domains. A dual-decoder is carefully designed to handle the distinct variances in low-frequency (LF) and high-frequency (HF) sub-bands, without omitting their alignment in image generation. Extensive experiments on multiple benchmark datasets demonstrate the effectiveness of our method, with high performance on both perception quality and fidelity.
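As a rough illustration of the multi-level decomposition step, the sketch below applies an orthonormal Haar transform recursively to the low-frequency band. The choice of the Haar wavelet, the function names, and the sub-band naming are assumptions made for this example; the abstract does not state which wavelet DTWSR uses.

```python
import numpy as np

def haar_dwt2(x):
    """One level of an orthonormal 2-D Haar transform (even height and width)."""
    s = np.sqrt(2.0)
    lo = (x[0::2, :] + x[1::2, :]) / s      # sums of adjacent row pairs (low-pass)
    hi = (x[0::2, :] - x[1::2, :]) / s      # differences of adjacent row pairs
    ll = (lo[:, 0::2] + lo[:, 1::2]) / s    # low-pass in both directions
    lh = (lo[:, 0::2] - lo[:, 1::2]) / s    # sub-band naming conventions vary
    hl = (hi[:, 0::2] + hi[:, 1::2]) / s
    hh = (hi[:, 0::2] - hi[:, 1::2]) / s
    return ll, (lh, hl, hh)

def multilevel_dwt2(x, levels):
    """Repeatedly decompose the LL band; returns final LL and per-level details."""
    details = []
    ll = np.asarray(x, dtype=float)
    for _ in range(levels):
        ll, bands = haar_dwt2(ll)
        details.append(bands)
    return ll, details

image = np.random.default_rng(0).uniform(size=(16, 16))
ll, details = multilevel_dwt2(image, levels=2)
# An orthonormal transform preserves total energy across all sub-bands.
energy = np.sum(ll ** 2) + sum(np.sum(b ** 2) for bands in details for b in bands)
```

Each level halves the spatial resolution, so the sub-bands form a pyramid of scales; a tokenizer over such spectra would see one low-frequency band plus three high-frequency bands per level.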
Poster
Seungju Yoo · Hyuk Kwon · Joong-Won Hwang · Kibok Lee
[ Exhibit Hall I ]
Abstract
Object detection is a fundamental task in computer vision that has received significant attention in recent years. Despite advances in training object detection models, evaluating their performance in real-world applications remains challenging due to the substantial costs associated with image annotation. To address this issue, we propose Prediction Consistency and Reliability (PCR) as an automated model evaluation (AutoEval) method for object detection. Our method is motivated by the observation that most existing object detection models generate many candidate predictions, which are subsequently filtered through non-maximum suppression (NMS). Specifically, we analyze 1) the consistency between the final and redundant predictions and 2) the reliability of these predictions determined by their confidence scores, and propose PCR by examining their relationships with object detection performance. Furthermore, to facilitate a more realistic assessment of AutoEval methods for object detection, we construct meta-datasets incorporating various corruptions. Experimental results demonstrate the superior performance of PCR compared to the existing AutoEval methods.
Poster
Yuxuan Luo · Jiaqi Tang · Chenyi Huang · Feiyang Hao · Zhouhui Lian
[ Exhibit Hall I ]
Abstract
Chinese calligraphy, a UNESCO Heritage, remains computationally challenging due to visual ambiguity and cultural complexity. Existing AI systems fail to contextualize its intricate scripts because of limited annotated data and poor visual-semantic alignment. We propose CalliReader, a vision-language model (VLM) that solves the Chinese Calligraphy Contextualization (CC$^2$) problem through three innovations: (1) character-wise slicing for precise character extraction and sorting, (2) CalliAlign for visual-text token compression and alignment, (3) embedding instruction tuning (e-IT) for improving alignment and addressing data scarcity. We also build CalliBench, the first benchmark for full-page calligraphic contextualization, addressing three critical issues in previous OCR and VQA approaches: fragmented context, shallow reasoning, and hallucination. Extensive experiments, including user studies, have been conducted to verify CalliReader's \textbf{superiority to other state-of-the-art methods and even human professionals in page-level calligraphy recognition and interpretation}, achieving higher accuracy while reducing hallucination. Comparisons with reasoning models highlight the importance of accurate recognition as a prerequisite for reliable comprehension. Quantitative analyses validate CalliReader's efficiency; evaluations on document and real-world benchmarks confirm its robust generalization ability.
Poster
Weiwei Cao · Jianpeng Zhang · Zhongyi Shui · Sinuo Wang · Zeli Chen · Xi Li · Le Lu · Xianghua Ye · Qi Zhang · Tingbo Liang · Ling Zhang
[ Exhibit Hall I ]
Abstract
Vision-language pre-training (VLP) has great potential for developing multifunctional and general medical diagnostic capabilities. However, aligning medical images with a low signal-to-noise ratio (SNR) to reports with a high SNR presents a semantic density gap, leading to visual alignment bias. In this paper, we propose boosting vision semantic density to improve alignment effectiveness. On the one hand, we enhance visual semantics through disease-level vision contrastive learning, which strengthens the model's ability to differentiate between normal and abnormal samples for each anatomical structure. On the other hand, we introduce an anatomical normality modeling method to model the distribution of normal samples for each anatomy, leveraging VQ-VAE for reconstructing normal vision embeddings in latent space. This process amplifies abnormal signals by leveraging distribution shifts in abnormal samples, enhancing the model's perception and discrimination of abnormal attributes. The enhanced visual representation effectively captures the diagnostic-relevant semantics, facilitating more efficient and accurate alignment with the diagnostic report. We conduct extensive experiments on two chest CT datasets, CT-RATE and Rad-ChestCT, and an abdominal CT dataset, MedVL-69K, and comprehensively evaluate the diagnosis performance across multiple tasks in the chest and abdominal CT scenarios, achieving state-of-the-art zero-shot performance. Notably, our method achieved an average AUC of 84.9\% …
Poster
Yihang Liu · Ying Wen · Longzhen Yang · Lianghua He · Heng Tao Shen
[ Exhibit Hall I ]
Abstract
Medical foundation models, pre-trained on diverse data sources, have shown significant potential for multi-domain medical imaging tasks. However, the domain shifts across different anatomical types significantly hinder their performance compared to domain-specific models. To address this challenge, we propose CoSMIC, a Continual Self-supervised learning framework for Multi-domain medIcal image analysis, with the core idea of Conditional mutual information maximization. Specifically, CoSMIC (i) acquires domain-specific knowledge sequentially, bypassing domain shifts caused by joint pre-training; (ii) enhances generalized representations by proposing a novel conditional contrastive loss to prevent catastrophic forgetting. This loss hierarchically aligns multi-view features within the current domain, maximizing their mutual information conditioned on domain-invariant representations extracted from prior domains through Anatomy-Guided Calibration. We pre-train CoSMIC across four medical domains and evaluate it on fifteen downstream datasets from five domains: Retinoscopy, Radiography, Ophthalmoscopy, Dermoscopy, and Histopathology (unseen). Experimental results show that CoSMIC (i) achieves robust feature extraction ability comparable to domain-specific models, (ii) exhibits exceptional generalization capability, significantly surpassing SOTA medical foundation models, and (iii) demonstrates superior transferability to new domains, overcoming current continual pre-training methods.
Poster
Zhaoyang Li · Yuan Wang · Guoxin Xiong · Wangkai Li · Yuwen Pan · Tianzhu Zhang
[ Exhibit Hall I ]
Abstract
Generalized few-shot point cloud segmentation (GFS-3DSeg) aims to segment objects of both base and novel classes using abundant base class samples and limited novel class samples. Existing GFS-3DSeg methods encounter bottlenecks due to the scarcity of novel class data and inter-class confusion. In this paper, we propose the LLM-Assisted Hyper-Relation Matching (LARM) framework, which leverages the wealth of prior knowledge in LLMs to enrich novel category prototypes and introduces a hyper-relation matching strategy to mitigate false matches between point features and category prototypes caused by inter-class confusion. The proposed LARM enjoys several merits. First, the vast knowledge embedded in LLMs can be an effective complement to vanilla category prototypes, enabling them to exhibit greater robustness. Second, the hyper-relation matching strategy harnesses the structure information implicit in the inter-class relationships, making it more robust than comparing prototypes individually. Extensive experiments on two benchmarks demonstrate that LARM outperforms previous state-of-the-art methods by large margins. The code will be open-sourced for further research.
Poster
Jun Li · Jinpeng Wang · Chaolei Tan · Niu Lian · Long Chen · Yaowei Wang · Min zhang · Shu-Tao Xia · Bin Chen
[ Exhibit Hall I ]
Abstract
Partially Relevant Video Retrieval (PRVR) addresses the critical challenge of matching untrimmed videos with text queries describing only partial content. Existing methods suffer from geometric distortion in Euclidean space that sometimes misrepresents the intrinsic hierarchical structure of videos and overlooks certain hierarchical semantics, ultimately leading to suboptimal temporal modeling. To address this issue, we propose the first hyperbolic modeling framework for PRVR, namely HLFormer, which leverages hyperbolic space learning to compensate for the suboptimal hierarchical modeling capabilities of Euclidean space. Specifically, HLFormer integrates the Lorentz Attention Block and Euclidean Attention Block to encode video embeddings in hybrid spaces, using the Mean-Guided Adaptive Interaction Module to dynamically fuse features. Additionally, we introduce a Partial Order Preservation Loss to enforce ``$\text{text} \prec \text{video}$'' hierarchy through Lorentzian cone constraints. This approach further enhances cross-modal matching by reinforcing partial relevance between video content and text queries. Extensive experiments show that HLFormer outperforms state-of-the-art methods. Code will be released at https://anonymous.4open.science/r/HLFormer-F8E6.
Poster
Lizhen Xu · Xiuxiu Bai · Xiaojun Jia · Jianwu Fang · Shanmin Pang
[ Exhibit Hall I ]
Abstract
Query-based methods with dense features have demonstrated remarkable success in 3D object detection tasks. However, the computational demands of these models, particularly with large image sizes and multiple transformer layers, pose significant challenges for efficient running on edge devices. Existing pruning and distillation methods either need retraining or are designed for ViT models, which are hard to migrate to 3D detectors. To address this issue, we propose a zero-shot runtime pruning method for transformer decoders in 3D object detection models. The method, termed tgGBC (trim keys gradually Guided By Classification scores), systematically trims keys in transformer modules based on their importance. We expand the classification scores and multiply them with the attention map to obtain an importance score for each key, and then prune certain keys after each transformer layer according to their importance scores. Our method achieves a 1.99x speedup in the transformer decoder of the latest ToC3D model, with only a minimal performance loss of less than 1%. Interestingly, for certain models, our method even enhances their performance. Moreover, we deploy 3D detectors with tgGBC on an edge device, further validating the effectiveness of our method. The code will be made publicly available on GitHub.
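The key-scoring step can be sketched in a few lines. This is a simplified reading of the abstract, not the released method: it assumes one confidence value per query, sums the score-weighted attention over queries, and keeps a fixed number of keys. The function name `prune_keys` and the toy numbers are hypothetical.

```python
import numpy as np

def prune_keys(attn, class_scores, keep):
    """Keep the `keep` most important keys.

    attn         : (num_queries, num_keys) attention map of one decoder layer.
    class_scores : (num_queries,) classification confidence per query.
    keep         : number of keys to retain.
    Returns the retained key indices in their original order.
    """
    # Broadcast each query's score across its attention row, then sum over queries.
    importance = (class_scores[:, None] * attn).sum(axis=0)
    top = np.argsort(importance)[::-1][:keep]
    return np.sort(top)

attn = np.array([[0.7, 0.2, 0.1, 0.0],
                 [0.1, 0.1, 0.2, 0.6]])
class_scores = np.array([0.9, 0.1])
kept = prune_keys(attn, class_scores, keep=2)   # importance = [0.64, 0.19, 0.11, 0.06]
keys = np.arange(4 * 3).reshape(4, 3)           # toy key matrix, one row per key
pruned_keys = keys[kept]                        # keys passed on to the next layer
```

Keys attended mostly by low-confidence queries receive low importance and are dropped, so later layers attend over a shorter key sequence without any retraining.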
Poster
Yangfu Li · Hongjian Zhan · Qi Liu · Li Sun · Yu-Jie Xiong · Yue Lu
[ Exhibit Hall I ]
Abstract
Most existing methods regard open-set Chinese text recognition (CTR) as a single-task problem, primarily focusing on prototype learning of linguistic components or glyphs to identify unseen characters. In contrast, humans identify characters by integrating multiple perspectives, including linguistic and visual cues. Inspired by this, we propose a multi-task framework termed MSA$^2$, which considers multi-view character representations for open-set CTR. Within MSA$^2$, we introduce two novel strategies for character representation: structure-aware component encoding (SACE) and style-adaptive glyph embedding (SAGE). SACE utilizes a binary tree with dynamic representation space to emphasize the primary linguistic components, thereby generating structure-aware and discriminative linguistic representations for each character. Meanwhile, SAGE employs a glyph-centric contrastive learning to aggregate features from diverse forms, yielding robust glyph representations for the CTR model to adapt to the style variations among various fonts. Extensive experiments demonstrate that our proposed MSA$^2$ outperforms state-of-the-art CTR methods, achieving an average improvement of 1.3% and 6.0% in accuracy under closed-set and open-set settings on the BCTR dataset, respectively. The code will be available soon.
Poster
Rui Hu · Yuxuan Zhang · Lianghui Zhu · Tianheng Cheng · Lei Liu · Heng Liu · Longjin Ran · Xiaoxin Chen · Wenyu Liu · Xinggang Wang
[ Exhibit Hall I ]
Abstract
Pixel grounding, encompassing tasks such as Referring Expression Segmentation (RES), has garnered considerable attention due to its immense potential for bridging the gap between vision and language modalities. However, advancements in this domain are currently constrained by limitations inherent in existing datasets, including limited object categories, insufficient textual diversity, and a scarcity of high-quality annotations. To mitigate these limitations, we introduce GroundingSuite, which comprises: (1) an automated data annotation framework leveraging multiple Vision-Language Model (VLM) agents; (2) a large-scale training dataset encompassing 9.56 million diverse referring expressions and their corresponding segmentations; and (3) a meticulously curated evaluation benchmark consisting of 3,800 images. The GroundingSuite training dataset facilitates substantial performance improvements, enabling models trained on it to achieve state-of-the-art results: a cIoU of 68.9 on gRefCOCO and a gIoU of 55.3 on RefCOCOm. Moreover, the GroundingSuite annotation framework demonstrates superior efficiency compared to the current leading data annotation method, i.e., it is $4.5 \times$ faster than GLaMM.
Poster
Yiping Ji · Hemanth Saratchandran · Peyman Moghadam · Simon Lucey
[ Exhibit Hall I ]
Abstract
We highlight a curious empirical result within modern Vision Transformers (ViTs). Specifically, self-attention catastrophically fails to train unless it is used in conjunction with a skip connection. This is in contrast to other elements of a ViT that continue to exhibit good performance (albeit suboptimal) when skip connections are removed. Further, we show that this critical dependence on skip connections is a relatively new phenomenon, with previous deep architectures (e.g., CNNs) exhibiting good performance in their absence. In this paper, we theoretically show that the self-attention mechanism is fundamentally ill-conditioned and is therefore uniquely dependent on skip connections for regularization. Additionally, we propose $T$oken $G$raying ($TG$) -- a simple yet effective complement (to skip connections) that further improves the conditioning of input tokens. We validate our approach in both supervised and self-supervised training methods.
Poster
Qi Chen · Lingxiao Yang · Yun Chen · Nailong Zhao · Jianhuang Lai · Jie Shao · Xiaohua Xie
[ Exhibit Hall I ]
Abstract
Fine-tuning pre-trained vision-language models has proven effective in enhancing open-vocabulary semantic segmentation (OVSS). However, given the significant resource consumption required for training on large datasets, there is growing interest in exploring training-free methods for OVSS. Current training-free methods primarily focus on modifying model architectures and generating prototypes to improve segmentation performance, often overlooking issues of category redundancy and ambiguity. In this paper, we identify two key phenomena in OVSS: class redundancy and vision-language ambiguity in class activation maps and the affinity-refined activation maps. Inspired by our observations, we propose a training-free class purification framework, FreeCP, to purify semantic categories and address errors caused by these two issues. Specifically, we first generate class activation maps along with their refined activation maps using CLIP. These activations and their refined counterparts are then organized by their associated categories to adaptively construct category relations, i.e., per-category relations and cross-category relations. We then perform redundancy purification to eliminate classes that are not present in the current image. Furthermore, we propose ambiguity purification to distinguish the correct class from its semantically similar ones. The purified classes are subsequently used to produce the final segmentation prediction. Extensive experiments across eight benchmarks demonstrate that FreeCP, …
Poster
Zhewei Dai · Shilei Zeng · Haotian Liu · Xurui Li · Feng Xue · Yu Zhou
[ Exhibit Hall I ]
Abstract
We introduce SeaS, a unified industrial generative model for automatically creating diverse anomalies, authentic normal products, and precise anomaly masks. While extensive research exists, most efforts either focus on specific tasks, i.e., anomalies or normal products only, or require separate models for each anomaly type. Consequently, prior methods either offer limited generative capability or depend on a vast array of anomaly-specific models. We demonstrate that U-Net's differentiated learning ability captures the distinct visual traits of slightly-varied normal products and diverse anomalies, enabling us to construct a unified model for all tasks. Specifically, we first introduce an Unbalanced Abnormal (UA) Text Prompt, comprising one normal token and multiple anomaly tokens. More importantly, our Decoupled Anomaly Alignment (DA) loss decouples anomaly attributes and binds them to distinct anomaly tokens of UA, enabling SeaS to create unseen anomalies by recombining these attributes. Furthermore, our Normal-image Alignment (NA) loss aligns the normal token to normal patterns, making generated normal products globally consistent and locally varied. Finally, SeaS produces accurate anomaly masks by fusing discriminative U-Net features with high-resolution VAE features. SeaS sets a new benchmark for industrial generation, significantly enhancing downstream applications, with average improvements of +8.66% pixel-level AP for synthesis-based AD approaches, +1.10% …
Poster
Lin Zhang · Xianfang Zeng · Kangcong Li · Gang YU · Tao Chen
[ Exhibit Hall I ]
Abstract
We propose SC-Captioner, a reinforcement learning framework that enables the self-correcting capability of image caption models. Our crucial technique lies in the design of the reward function to incentivize accurate caption corrections. Specifically, the predicted and reference captions are decomposed into object, attribute, and relation sets using scene-graph parsing algorithms. We calculate the set difference between sets of original and self-corrected captions to identify added and removed elements. These elements are matched against the reference sets to calculate recall bonuses for accurate corrections and hallucination punishments for wrong additions and removals, thereby forming the final reward. For image caption quality assessment, we propose a set of metrics refined from CAPTURE that alleviate its incomplete precision evaluation and inefficient relation matching problems. Experiments show that applying SC-Captioner to large visual-language models can generate better image captions across various scenarios, significantly outperforming the direct preference optimization training strategy.
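The set-difference reward described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `correction_reward`, the count-based scoring, and the `bonus` and `penalty` weights are assumptions made for illustration, and the scene-graph parsing step is replaced by hand-written sets.

```python
def correction_reward(original, corrected, reference, bonus=1.0, penalty=1.0):
    """Score one self-correction over a single element set
    (objects, attributes, or relations).

    Hypothetical scoring: +bonus per accurate edit, -penalty per wrong edit.
    """
    added = corrected - original        # elements the correction introduced
    removed = original - corrected      # elements the correction dropped

    good_add = added & reference        # added and confirmed by the reference
    bad_add = added - reference         # added but unsupported (hallucination)
    good_remove = removed - reference   # removed and indeed absent from the reference
    bad_remove = removed & reference    # removed although the reference contains it

    gains = len(good_add) + len(good_remove)
    losses = len(bad_add) + len(bad_remove)
    return bonus * gains - penalty * losses


# Toy object sets standing in for scene-graph parser output.
original = {"dog", "frisbee", "car"}
corrected = {"dog", "frisbee", "grass", "cat"}
reference = {"dog", "frisbee", "grass"}
reward = correction_reward(original, corrected, reference)
```

In this toy case the correction rightly adds "grass", rightly removes "car", and wrongly adds "cat", so two accurate edits are offset by one hallucinated addition.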
Poster
Chenhao Zheng · Jieyu Zhang · Mohammadreza Salehi · Ziqi Gao · Vishnu Iyengar · Norimasa Kobori · Quan Kong · Ranjay Krishna
[ Exhibit Hall I ]
Abstract
Effective video tokenization is critical for scaling transformer models for long videos. Current approaches tokenize videos using space-time patches, leading to excessive tokens and computational inefficiencies. The best token reduction strategies degrade performance and barely reduce the number of tokens when the camera moves. We introduce grounded video tokenization, a paradigm that organizes tokens based on panoptic sub-object trajectories rather than fixed patches. Our method aligns with fundamental perceptual principles, ensuring that tokenization reflects scene complexity rather than video duration. We propose TrajViT, a video encoder that extracts object trajectories and converts them into semantically meaningful tokens, significantly reducing redundancy while maintaining temporal coherence. Trained with contrastive learning, TrajViT significantly outperforms space-time ViT (ViT3D) across multiple video understanding benchmarks, e.g., TrajViT outperforms ViT3D by a large margin of 6% top-5 recall on average on the video-text retrieval task with 10x token reduction. We also show that TrajViT is a stronger video encoder than ViT3D for modern VideoLLMs, obtaining an average of 5.2% performance improvement across 6 VideoQA benchmarks while having 4x faster training time and 20x fewer inference FLOPs. TrajViT is the first efficient encoder to consistently outperform ViT3D across diverse video analysis tasks, making it a robust …
Poster
Handong Li · Yiyuan Zhang · Longteng Guo · Xiangyu Yue · Jing Liu
[ Exhibit Hall I ]
Abstract
Most current video-language models rely on an encoder-decoder architecture, where a vision encoder extracts visual features from video and passes them to a language model. However, this approach suffers from inefficiencies, resolution biases, and challenges in capturing fine-grained multimodal correlations, particularly when dealing with long-duration videos. To address these limitations, we propose NOVA, an encoder-free video-language model that directly integrates raw video input into a language model, eliminating the need for a separate vision encoder. NOVA leverages input-adaptive video tokenization, efficient distillation from a video-pretrained teacher, multimodal alignment using synthetic video recaption data, and hybrid-resolution inference to overcome the limitations of traditional models. Our experiments demonstrate that NOVA, with only about 10M publicly available training data, achieves performance competitive with strong encoder-based models across various benchmarks, and offers clear advantages in efficiency and scalability. This work provides a promising solution for real-time, large-scale video applications and paves the way for more flexible and resource-efficient video-language models.
Poster
Jiahui Wang · Zuyan Liu · Yongming Rao · Jiwen Lu
[ Exhibit Hall I ]
Abstract
Multimodal Large Language Models (MLLMs) are commonly derived by extending pre-trained Large Language Models (LLMs) with visual capabilities. In this work, we investigate how MLLMs process visual inputs by analyzing their attention mechanisms. We reveal a surprising sparsity phenomenon: only a small subset (less than approximately 5%) of attention heads in LLMs actively contribute to visual understanding, termed visual heads. To identify these heads efficiently, we design a training-free framework that quantifies head-level visual relevance through targeted response analysis. Building on this discovery, we introduce SparseMM, a KV-Cache optimization strategy that allocates asymmetric computation budgets to heads in LLMs based on their visual scores, leveraging the sparsity of visual heads to accelerate the inference of MLLMs. Compared with prior KV-Cache acceleration methods that ignore the particular characteristics of visual inputs, SparseMM prioritizes emphasizing and retaining visual semantics during decoding. Extensive evaluations across mainstream multimodal benchmarks demonstrate that SparseMM achieves superior accuracy-efficiency trade-offs. Notably, SparseMM delivers 1.38× real-time acceleration and 52% memory reduction during generation while maintaining performance parity on the efficiency test.
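The idea of giving heads asymmetric cache budgets according to their visual scores can be sketched as follows. This is a hedged illustration only: the abstract does not specify the allocation rule, so the proportional split, the per-head floor `min_per_head`, and the function name `allocate_kv_budget` are all assumptions.

```python
def allocate_kv_budget(visual_scores, total_budget, min_per_head=1):
    """Split a total KV-cache token budget across attention heads.

    Hypothetical rule: every head gets a small floor, and the rest is
    shared in proportion to each head's (non-negative) visual score.
    """
    n = len(visual_scores)
    if total_budget < n * min_per_head:
        raise ValueError("total_budget is too small for the per-head floor")

    remaining = total_budget - n * min_per_head
    total_score = sum(visual_scores)
    if total_score == 0:
        shares = [0] * n
    else:
        # Floor of the proportional share for each head.
        shares = [int(remaining * s / total_score) for s in visual_scores]

    budgets = [min_per_head + share for share in shares]

    # Tokens lost to rounding go to the highest-scoring heads first.
    leftover = total_budget - sum(budgets)
    order = sorted(range(n), key=lambda i: visual_scores[i], reverse=True)
    for k in range(leftover):
        budgets[order[k % n]] += 1
    return budgets


# Four heads; the first behaves like a strong "visual head".
budgets = allocate_kv_budget([6, 3, 1, 0], total_budget=104)
```

The floor keeps low-scoring heads from losing their cache entirely, which is one simple way to make the allocation asymmetric without zeroing any head.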
Poster
Jessica Bader · Leander Girrbach · Stephan Alaniz · Zeynep Akata
[ Exhibit Hall I ]
Abstract
Concept Bottleneck Models (CBMs) and other interpretable models show great promise for making AI applications more transparent, which is essential in fields like medicine. Despite their success, we demonstrate that CBMs struggle to reliably identify the correct concept values under distribution shifts. To assess the robustness of CBMs to concept variations, we introduce SUB, a fine-grained image and concept dataset containing 38,400 synthetic images based on the CUB bird dataset. To create SUB, we select a subset of 33 bird classes and 32 concepts from CUB to generate counterfactual bird images where a specific concept, such as wing color or belly pattern, is substituted. To achieve precise control over generated images, we introduce a novel Tied Diffusion Guidance (TDG) method, where noise sharing for two parallel denoising processes ensures that both the correct bird class and the correct bird concept are generated. This novel dataset enables rigorous evaluation of CBMs and similar interpretable models, contributing to the development of more robust methods. Furthermore, we show that the common practice of training CBMs using class-level concept annotations does not lead to generalized recognition of the concepts. Our code and data will be released upon acceptance.
Poster
Lin Sun · Jiale Cao · Jin Xie · Xiaoheng Jiang · Yanwei Pang
[ Exhibit Hall I ]
Abstract
Contrastive Language-Image Pre-training (CLIP) exhibits strong zero-shot classification ability on various image-level tasks, motivating research on adapting CLIP for pixel-level open-vocabulary semantic segmentation without additional training. The key is to improve the spatial representation of image-level CLIP, such as replacing the self-attention map at the last layer with a self-self attention map or a vision foundation model based attention map. In this paper, we present a novel hierarchical framework, named CLIPer, that hierarchically improves the spatial representation of CLIP. The proposed CLIPer includes an early-layer fusion module and a fine-grained compensation module. We observe that the embeddings and attention maps at early layers can preserve spatial structural information. Inspired by this, we design the early-layer fusion module to generate segmentation maps with better spatial coherence. Afterwards, we employ a fine-grained compensation module to compensate for local details using the self-attention maps of a diffusion model. We conduct experiments on eight segmentation datasets. Our CLIPer achieves state-of-the-art performance on these datasets. With ViT-L and sliding-window inference, CLIPer achieves mIoU of 72.2% and 44.7% on VOC and Object, outperforming ProxyCLIP by 11.6% and 5.5%. We will release the source code and models.
Poster
Tongkun Guan · Zining Wang · Pei Fu · Zhentao Guo · Wei Shen · Kai zhou · Tiezhu Yue · Chen Duan · Hao Sun · Qianyi Jiang · Junfeng Luo · Xiaokang Yang
[ Exhibit Hall I ]
Abstract
In recent years, general visual foundation models (VFMs) have witnessed increasing adoption, particularly as image encoders for popular multi-modal large language models (MLLMs). However, without semantically fine-grained supervision, these models still encounter fundamental prediction errors in the context of downstream text-image-related tasks, i.e., perception, understanding and reasoning with images containing small and dense texts. To bridge this gap, we develop TokenFD, the first token-level visual foundation model specifically tailored for text-image-related tasks, designed to support a variety of traditional downstream applications. To facilitate the pretraining of TokenFD, we also devise a high-quality data production pipeline that constructs the first token-level image text dataset, TokenIT, comprising 20 million images and 1.8 billion token-mask pairs. Furthermore, leveraging this foundation with exceptional image-as-text capability, we seamlessly replace previous VFMs with TokenFD to construct a token-level visual-language MLLM, TokenVL, for VQA-based document understanding tasks. Finally, extensive experiments demonstrate the effectiveness of TokenFD and TokenVL. Code, demo, datasets, and weights will be available soon.
Poster
Anurag Bagchi · Zhipeng Bao · Yu-Xiong Wang · Pavel Tokmakov · Martial Hebert
[ Exhibit Hall I ]
Abstract
We present REM, a framework for segmenting a wide range of concepts in video that can be described through natural language. Our method unlocks the universal visual-language mapping learned by video diffusion models on Internet-scale data by fine-tuning them on small-scale Referring Object Segmentation datasets. Our key insight is preserving the entirety of the generative model's architecture by shifting its objective from predicting noise to predicting mask latents. The resulting model can accurately segment and track rare and unseen objects, despite only being trained on object masks from a limited set of categories. Additionally, it can effortlessly generalize to non-object dynamic concepts, such as smoke or raindrops, as demonstrated in our newly introduced benchmark for Referring Video Process Segmentation (Ref-VPS). Our experiments show that REM performs on par with state-of-the-art approaches on in-domain datasets, like Ref-DAVIS, while outperforming them by up to 11 points in terms of region similarity out-of-domain, leveraging the power of Internet-scale pre-training.
Poster
Byung Hyun Lee · Wongi Jeong · Woojae Han · KYOUNGBUN LEE · Se Young Chun
[ Exhibit Hall I ]
Abstract
Multiple instance learning (MIL) significantly reduced annotation costs via bag-level weak labels for large-scale images, such as histopathological whole slide images (WSIs). However, its adaptability to continual tasks with minimal forgetting has been rarely explored, especially on instance classification for localization. Weakly incremental learning for semantic segmentation has been studied for continual localization, but it focused on natural images, leveraging global relationships among hundreds of small patches (e.g., $16 \times 16$) using pre-trained models. This approach seems infeasible for MIL localization due to enormous amounts ($\sim 10^5$) of large patches (e.g., $256 \times 256$) and no available global relationships such as cancer cells. To address these challenges, we propose Continual Multiple Instance Learning with Enhanced Localization (CoMEL), an MIL framework designed to improve both localization and adaptability with minimal forgetting. CoMEL consists of (1) Grouped Double Attention Transformer (GDAT) for efficient instance encoding, (2) Bag Prototypes-based Pseudo-Labeling (BPPL) for reliable instance pseudo-labeling, and (3) Orthogonal Weighted Low-Rank Adaptation (OWLoRA) to mitigate forgetting in both bag and instance classification. Extensive experiments on three public WSI datasets, CAMELYON-16, PAIP, and TCGA, demonstrate superior performance of CoMEL, outperforming the prior arts by up to $11.00\%$ in bag-level accuracy and up to $23.4\%$ in …
Poster
Yucheng Suo · Fan Ma · Linchao Zhu · Tianyi Wang · Fengyun Rao · Yi Yang
[ Exhibit Hall I ]
Abstract
Multi-modal large language models (MLLMs) show remarkable ability in video understanding. Nevertheless, understanding long videos remains challenging as the models can only process a finite number of frames in a single inference, potentially omitting crucial visual information. To address the challenge, we propose generating multiple predictions through visual context sampling, followed by a scoring mechanism to select the final prediction. Specifically, we devise a bin-wise sampling strategy that enables MLLMs to generate diverse answers based on various combinations of keyframes, thereby enriching the visual context. To determine the final prediction from the sampled answers, we employ a self-reward by linearly combining three scores: (1) a frequency score indicating the prevalence of each option, (2) a marginal confidence score reflecting the inter-intra sample certainty of MLLM predictions, and (3) a reasoning score for different question types, including clue-guided answering for global questions and temporal self-refocusing for local questions. The frequency score ensures robustness through majority correctness, the confidence-aligned score reflects prediction certainty, and the typed-reasoning score addresses cases with sparse key visual information using tailored strategies. Experiments show that this approach covers the correct answer for a high percentage of long video questions, and results on seven datasets show that our method improves …
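A linear combination of the three scores might look like the sketch below. It is illustrative only: the weights, the function name `select_answer`, and the use of mean per-option confidence as a stand-in for the marginal confidence score are assumptions, and the reasoning scores are supplied as plain numbers rather than computed by an MLLM.

```python
from collections import Counter


def select_answer(samples, reasoning_scores, weights=(1.0, 1.0, 1.0)):
    """Pick a final option from answers sampled under different visual contexts.

    samples: list of (option, confidence) pairs, one per sampled context.
    reasoning_scores: dict mapping option -> reasoning score (hypothetical input).
    weights: (w_freq, w_conf, w_reason) for the linear combination (assumed values).
    """
    w_freq, w_conf, w_reason = weights
    counts = Counter(option for option, _ in samples)
    total = len(samples)

    rewards = {}
    for option, count in counts.items():
        freq = count / total  # how prevalent the option is among samples
        # Mean confidence over samples choosing this option (a simple stand-in).
        conf = sum(c for o, c in samples if o == option) / count
        reason = reasoning_scores.get(option, 0.0)
        rewards[option] = w_freq * freq + w_conf * conf + w_reason * reason

    best = max(rewards, key=rewards.get)
    return best, rewards


samples = [("A", 0.9), ("B", 0.6), ("A", 0.7), ("C", 0.5)]
best, rewards = select_answer(samples, {"A": 0.2, "B": 0.9})
```

In this toy input, option A wins on frequency and confidence, but the higher reasoning score lets option B take the final selection, which shows how the combined reward can override a plain majority vote.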
Poster
Weijia Zhang · Yuehao Liu · Wu Ran · Chao Ma
[ Exhibit Hall I ]
Abstract
We describe a simple method for cross-architecture knowledge distillation, where the knowledge transfer is cast into a redundant information suppression formulation. Existing methods introduce sophisticated modules, architecture-tailored designs, and excessive parameters, which impair their efficiency and applicability. We propose to extract the architecture-agnostic knowledge in heterogeneous representations by reducing the redundant architecture-exclusive information. To this end, we present a simple redundancy suppression distillation (RSD) loss, which comprises cross-architecture invariance maximization and feature decorrelation objectives. To prevent the student from entirely losing its architecture-specific capabilities, we further design a lightweight module that decouples the RSD objective from the student's internal representations. Our method is devoid of the architecture-specific designs and complex operations in the pioneering method of OFA. It substantially outperforms OFA on CIFAR-100 and ImageNet-1k benchmarks with only a fraction of its parameter overhead, which highlights its potential as a simple and strong baseline for the cross-architecture distillation community. Our code and models will be made publicly available.
Poster
Tao Wang · Changxu Cheng · Lingfeng Wang · Senda Chen · Wuyue Zhao
[ Exhibit Hall I ]
Abstract
The remarkable performance of large multimodal models (LMMs) has attracted significant interest from the image segmentation community. To align with the next-token-prediction paradigm, current LMM-driven segmentation methods either use object boundary points to represent masks or introduce special segmentation tokens, whose hidden states are decoded by a segmentation model requiring the original image as input. However, these approaches often suffer from inadequate mask representation and complex architectures, limiting the potential of LMMs. In this work, we propose the Hierarchical Mask Tokenizer (HiMTok), which represents segmentation masks with up to 32 tokens and eliminates the need for the original image during mask de-tokenization. HiMTok allows for compact and coarse-to-fine mask representations, aligning well with the LLM next-token-prediction paradigm and facilitating the direct acquisition of segmentation capabilities. We develop a 3-stage training recipe for progressive learning of segmentation and visual capabilities, featuring a hierarchical mask loss for effective coarse-to-fine learning. Additionally, we enable bidirectional information flow, allowing conversion between bounding boxes and mask tokens to fully leverage multi-task training potential. Extensive experiments demonstrate that our method achieves state-of-the-art performance across various segmentation tasks, while also enhancing visual grounding and maintaining overall visual understanding. The code will be made publicly available.
Poster
Jihun Kim · Hoyong Kwon · Hyeokjun Kweon · Wooseong Jeong · Kuk-Jin Yoon
[ Exhibit Hall I ]
Abstract
Interactive segmentation (IS) allows users to iteratively refine object boundaries with minimal cues, such as positive and negative clicks. While the Segment Anything Model (SAM) has garnered attention in the IS community for its promptable segmentation capabilities, it often struggles in specialized domains or when handling complex scenarios (e.g., camouflaged or multi-part objects). To overcome these challenges, we propose DC-TTA, a novel test-time adaptation (TTA) framework that adapts SAM on a per-sample basis by leveraging user interactions as supervision. Instead of forcing a single model to incorporate all user clicks at once, DC-TTA partitions the clicks into more coherent subsets, each processed independently via TTA with a separated model. This Divide-and-Conquer strategy reduces conflicts among diverse cues and enables more localized updates. Finally, we merge the adapted models to form a unified predictor that integrates the specialized knowledge from each subset. Experimental results across various benchmarks demonstrate that DC-TTA significantly outperforms SAM’s zero-shot results and conventional TTA methods, effectively handling complex tasks such as camouflaged object segmentation with fewer interactions and improved accuracy. The code will be available soon.
Poster
YITING LI · Fayao Liu · Jingyi Liao · Sichao Tian · Chuan-Sheng Foo · Xulei Yang
[ Exhibit Hall I ]
Abstract
Multimodal anomaly detection (MAD) enhances industrial inspection by leveraging complementary 2D and 3D data. However, existing methods struggle in few-shot scenarios due to limited data and modality gaps. While current approaches either fuse multimodal features or align cross-modal representations, they often suffer from high false positive rates and fail to detect subtle defects, especially when training data is scarce. To address these challenges, we propose the first few-shot MAD method FIND, a novel dual-student framework that synergistically integrates intra-modal reverse distillation and cross-modal distillation. FIND employs modality-specific teachers and two collaborative students: an intra-modal student for fine-grained anomaly localization via reverse distillation, and a cross-modal student that captures inter-modal correspondences to detect inconsistencies. Extensive experiments on MVTec-3D-AD and Eyecandies show that FIND outperforms state-of-the-art methods in both full-shot and few-shot settings. Ablation studies validate the complementary roles of intra- and cross-modal distillation. Our work significantly advances MAD robustness in data-scarce industrial applications.
Poster
Meiqi Wang · Han Qiu
[ Exhibit Hall I ]
Abstract
In-orbit object detection is essential for Earth observation missions on satellites equipped with GPUs. A promising approach is to use pre-trained vision-language models (VLMs) to enhance open-vocabulary capability. However, adopting them on satellites poses two challenges: (1) satellite imagery differs substantially from natural images, and (2) satellites' embedded GPUs are insufficient for complex models' inference. We reveal their lack of a crucial prior: in-orbit detection involves identifying a set of known objects within a cluttered yet monotonous background. Motivated by this observation, we propose VISO, a Vision-language Instructed Satellite Object detection model that focuses on object-specific features while suppressing irrelevant regions through language-guided mask learning. After pre-training on a large-scale satellite dataset with 3.4M region-text pairs, VISO enhances object-text alignment and object-centric features to improve detection accuracy. Also, VISO suppresses irrelevant regions, enabling highly sparse inference to accelerate speed on satellites. Extensive experiments show that VISO without sparsity outperforms state-of-the-art (SOTA) VLMs in zero-shot detection, increasing AP by 34.1% and reducing FLOPs by 27$\times$, and surpasses specialist models in supervised object detection and object referring, improving AP by 2.3%. When sparsifying VISO to a comparable AP, FLOPs can be reduced by up to 8.5$\times$. Real-world tests reveal that VISO achieves a 2.8–4.8$\times$ FPS speed-up on satellites’ embedded GPUs.
Poster
Ran Ran · Jiwei Wei · Shiyuan He · Zeyu Ma · Chaoning Zhang · Ning Xie · Yang Yang
[ Exhibit Hall I ]
Abstract
Video Temporal Grounding (VTG) confronts the challenge of bridging the semantic gap between concise textual queries and the rich complexity of video content, compounded by the difficulty of capturing discriminative features without external priors. To address these challenges, we propose Knowledge Diffusion Alignment (KDA), a framework that leverages the generative prowess of diffusion models. KDA introduces a multi-layer video knowledge extraction module alongside a background residual diffusion model that progressively prunes irrelevant background information from global video features, thereby distilling query-relevant moment knowledge enriched with visual context. By a three-stage training approach that harnesses external priors, KDA guarantees that the extracted moment knowledge incorporates the discriminative features necessary for accurate localization. A knowledge prompt reasoning module facilitates the comprehensive interaction and utilization of moment knowledge and multimodal features. Moreover, we introduce a spans-enhanced decoder that selectively integrates spans from multi-modal features, capitalizing on intrinsic alignment cues. Comprehensive experiments on three datasets demonstrate performance that surpasses state-of-the-art methods, attesting to the effectiveness of the proposed framework.
Poster
YUFEI SHI · Weilong Yan · Gang Xu · Yumeng Li · Yucheng Chen · ZhenXi Li · Fei Yu · Ming Li · Si Yong Yeo
[ Exhibit Hall I ]
Abstract
Video large language models (ViLLMs) excel in general video understanding, e.g., recognizing activities like talking and eating, but struggle with identity-aware comprehension, such as "Wilson is receiving chemotherapy" or "Tom is discussing with Sarah", limiting their applicability in smart healthcare and smart home environments. To address this limitation, we propose a one-shot learning framework PVChat, the first personalized ViLLM that enables subject-aware question answering (QA) from a single video for each subject. Our approach optimizes a Mixture-of-Heads (MoH) enhanced ViLLM on a synthetically augmented video-QA dataset, leveraging a progressive image-to-video learning strategy. Specifically, we introduce an automated augmentation pipeline that synthesizes identity-preserving positive samples and retrieves hard negatives from existing video corpora, generating a diverse training dataset with four QA types: existence, appearance, action, and location inquiries. To enhance subject-specific learning, we propose a ReLU Routing MoH attention mechanism, alongside two novel objectives: (1) Smooth Proximity Regularization for progressive learning through exponential distance scaling and (2) Head Activation Enhancement for balanced attention routing. Finally, we adopt a two-stage training strategy, transitioning from image pre-training to video fine-tuning, enabling a gradual learning process from static attributes to dynamic representations. We evaluate PVChat on diverse datasets covering medical scenarios, TV series, anime, …
Poster
Wentian Cai · Weizhao Weng · Zihao Huang · Yandan Chen · Siquan Huang · Ping Gao · Victor Leung · Ying Gao
[ Exhibit Hall I ]
Abstract
The massive requirement for pixel-wise annotations in histopathological image segmentation poses a significant challenge, leading to increasing interest in Unsupervised Semantic Segmentation (USS) as a viable alternative. Pre-trained model-based methods have been widely used in USS, achieving promising segmentation performance. However, these methods are less capable on medical image USS tasks due to their limited ability to encode task-specific contextual information. In this paper, we propose a context-based Overlapping Patches Consistency Constraint (OPCC), which employs the consistency constraint between the local overlapping region’s similarity and global context similarity, achieving consistent class representation in similar environments. Additionally, we introduce an Inter-Layer Self-Attention Fusion (ILSAF) module that employs a multi-head self-attention mechanism along with Inter-Layer Importance-Weighting to generate context-aware and semantically discriminative pixel representations, improving pixel clustering accuracy. Extensive experiments on two public histopathological image segmentation datasets demonstrate that our approach outperforms state-of-the-art methods by a large margin, with mIoU surpassing previous leading work by 5.74 and 8.38 percentage points on the two datasets, respectively.
Poster
Yujian Lee · Peng Gao · Yongqi Xu · Wentao Fan
[ Exhibit Hall I ]
Abstract
Audio-visual semantic segmentation (AVSS) represents an extension of the audio-visual segmentation (AVS) task, necessitating a semantic understanding of audio-visual scenes beyond merely identifying sound-emitting objects at the visual pixel level. A previous methodology decomposes the AVSS task into two discrete subtasks, first providing a prompted segmentation mask to facilitate subsequent semantic analysis; our approach innovates on this foundational strategy. We introduce a novel collaborative framework, Stepping Stone Plus (SSP), which integrates optical flow and textual prompts to assist the segmentation process. In scenarios where sound sources frequently coexist with moving objects, our pre-mask technique leverages optical flow to capture motion dynamics, providing essential temporal context for precise segmentation. To address the challenge posed by stationary sound-emitting objects, such as alarm clocks, SSP incorporates two specific textual prompts: one identifies the category of the sound-emitting object, and the other provides a broader description of the scene. Additionally, we implement a visual-textual alignment module (VTA) to facilitate cross-modal integration, delivering more coherent and contextually relevant semantic interpretations. Our training regimen involves a post-mask technique aimed at compelling the model to learn the diagram of the optical flow. Experimental results demonstrate that SSP outperforms existing AVS methods, delivering efficient …
Poster
Harsh Agrawal · Eldon Schoop · Xinlei Pan · Ari Seff · Anuj Mahajan · Di Feng · Ruijia Cheng · Andres Romero Mier y Teran · Esteban Gomez · Abhishek Sundararajan · Forrest Huang · Amanda Swearngin · Mohana Moorthy · Jeffrey Nichols · Alexander Toshev
[ Exhibit Hall I ]
Abstract
We build a comprehensive online evaluation benchmark for language-conditioned multi-step task execution on mobile interfaces. Our benchmark strives to evaluate the multi-step planning, reasoning, and visual grounding capabilities of agents, using mobile user interfaces as a concrete testbed. To build diverse, challenging tasks that reflect real-world use cases, we propose an exhaustive taxonomy that allows us to measure progress along multiple decision-making abilities including multi-step planning, visual perception, action grounding, and using memory or external knowledge. We also highlight important factors such as statefulness, safety, and evaluation complexity that are key to designing tasks that can be reliably evaluated. Using this taxonomy, we design 116 tasks across 36 unique apps. Through an automatic framework, we stage and evaluate several natural baselines with different input representations and planning strategies. We show that the best-performing agent achieves 40% success on our benchmark. We further measure agents' abilities to plan, ground, and utilize world knowledge, highlighting areas for improvement.
Poster
Li Yi · Jie Hu · Songan Zhang · GUANNAN JIANG
[ Exhibit Hall I ]
Abstract
Foundation Segmentation Models (FSMs) show suboptimal performance on unconventional image domains like camouflaged objects. Fine-tuning is often impractical due to data preparation challenges, time limits, and optimization issues. To boost segmentation performance while keeping zero-shot features, one approach is pre-augmenting images for the segmentation model. However, existing image augmentations mainly depend on rule-based methods, restricting augmentation effectiveness. Though learning-based methods can diversify augmentation, rule-based ones are degree-describable (e.g., slight/intense brightening), while learning-based methods usually predict non-degree-describable ground truths (e.g., depth estimation), creating a heterogeneous search space when combined. To this end, we propose an "Augmenting-to-Adapt" paradigm, replacing traditional rule-based augmentation with an optimal heterogeneous augmentation policy to enhance segmentation. Our method uses 32 augmentation techniques (22 rule-based, 10 learning-based) to ease parameter misalignment, forming a robust, multi-discrete heterogeneous search space. To apply the optimal policy in real-world scenarios, we distill the augmentation process to speed up preprocessing. Extensive evaluations across diverse datasets and domains show our method significantly improves model adaptation with a domain-specific augmentation strategy. We will release our code to support further research.
Poster
Xiao-Wen Zhang · Delong Zhang · Yi-Xing Peng · Zhi Ouyang · Jingke Meng · Wei-Shi Zheng
[ Exhibit Hall I ]
Abstract
Person re-identification (ReID) aims to match person images under different camera views. Training ReID models necessitates a substantial amount of labeled real-world data, leading to high labeling costs and privacy issues. Although several ReID data synthesis methods have been proposed to address these issues, they fail to generate images with real-world camera style or new identities. In this paper, we propose a novel pedestrian generation pipeline, VIPerson, to generate camera-realistic pedestrian images with flexible Virtual Identities for the Person ReID task. VIPerson focuses on three key factors in data synthesis: (I) Virtual identity diversity: Enhancing the latent diffusion model with our proposed dropout text embedding, we flexibly generate random and hard identities. (II) Scalable cross-camera variations: VIPerson introduces scalable variations of scenes and poses within each identity. (III) Camera-realistic style: Adopting an identity-agnostic approach to transfer realistic styles, we avoid privacy exposure of real identities. Extensive experimental results across a broad range of downstream ReID tasks demonstrate the superiority of our generated dataset over existing methods. In addition, VIPerson can be adapted to the privacy-constrained ReID scenario, which widens the application of our pipeline. We will release our code and datasets.
Poster
Guobin Shen · Jindong Li · Tenglong Li · Dongcheng Zhao · Yi Zeng
[ Exhibit Hall I ]
Abstract
Spiking Neural Networks (SNNs) hold promise for energy-efficient, biologically inspired computing. We identify substantial information loss during spike transmission, linked to temporal dependencies in traditional Leaky Integrate-and-Fire (LIF) neurons—a key factor potentially limiting SNN performance. Existing SNN architectures also underutilize modern GPUs, constrained by single-bit spike storage and isolated weight-spike operations that restrict computational efficiency. We introduce SpikePack, a neuron model designed to reduce transmission loss while preserving essential features like membrane potential reset and leaky integration. SpikePack achieves constant $\mathcal{O}(1)$ time and space complexity, enabling efficient parallel processing on GPUs and also supporting serial inference on existing SNN hardware accelerators. Compatible with standard Artificial Neural Network (ANN) architectures, SpikePack facilitates near-lossless ANN-to-SNN conversion across various networks. Experimental results on tasks such as image classification, detection, and segmentation show SpikePack achieves significant gains in accuracy and efficiency for both directly trained and converted SNNs over state-of-the-art models. Tests on FPGA-based platforms further confirm cross-platform flexibility, delivering high performance and enhanced sparsity. By enhancing information flow and rethinking SNN-ANN integration, SpikePack advances efficient SNN deployment across diverse hardware platforms.
Poster
Xinwei Long · Kai Tian · Peng Xu · Guoli Jia · Jingxuan Li · Sa Yang · Yihua Shao · Kaiyan Zhang · Che Jiang · Hao Xu · Yang Liu · Jiaheng Ma · Bowen Zhou
[ Exhibit Hall I ]
Abstract
Large language models (LLMs) have taken a great step towards AGI. Meanwhile, an increasing number of domain-specific problems such as math and programming boost these general-purpose models to continuously evolve via learning deeper expertise. It is thus time to further extend the diversity of specialized applications for knowledgeable LLMs, though collecting high-quality data with unexpected and informative tasks is challenging. In this paper, we propose to use advertisement (ad) videos as a challenging test-bed to probe the ability of LLMs in perceiving beyond the objective physical content of the common visual domain. Our motivation is to take full advantage of the clue-rich and information-dense ad videos' traits, e.g., marketing logic, persuasive strategies, and audience engagement. Our contribution is three-fold: (1) To our knowledge, this is the first attempt to use ad videos with well-designed tasks to evaluate LLMs. We contribute AdsQA, a challenging ad Video QA benchmark derived from 1,544 ad videos with 10,962 clips, totaling 21.1 hours, providing 5 challenging tasks. (2) We propose ReAd-R, a Deepseek-R1 styled RL model that reflects on questions, and generates answers via reward-driven optimization. (3) We benchmark 14 top-tier LLMs on AdsQA, and our ReAd-R achieves the state-of-the-art outperforming strong competitors …
Poster
Woojung Son · Yoonki Cho · Guoyuan An · Chanmi Lee · Sung-eui Yoon
[ Exhibit Hall I ]
Abstract
Person search aims to simultaneously detect and re-identify a query person within an entire scene. While existing studies have made significant progress in achieving superior performance on clean datasets, the challenge of robustness under various corruptions remains largely unexplored. However, the lack of environments for analyzing corruption robustness presents a challenge, as extensive collection of new person images attempting to cover numerous corruption scenarios inevitably introduces privacy concerns. In this context, we construct the environments for analyzing corruption robustness using existing publicly available data, and introduce two benchmarks: CUHK-SYSU-C and PRW-C. Previous studies on corruption have been conducted independently for single tasks such as re-identification and detection. However, recent advancements in person search adopt an end-to-end multi-task learning framework that processes the entire scene as input, unlike the combination of single tasks. This raises the question of whether independent achievements can ensure corruption robustness for person search. Our findings reveal that merely combining independent, robust detection and re-identification models is not sufficient for achieving robust person search. We further investigate the vulnerability of the detection and representation stages to corruption and explore its impact on both foreground and background areas. Based on these insights, we propose a foreground-aware augmentation and regularization method to enhance the robustness of …
Poster
Chiao-An Yang · Kuan-Chuan Peng · Raymond A. Yeh
[ Exhibit Hall I ]
Abstract
Anomaly detection (AD) identifies the defect regions of a given image. Recent works have studied AD, focusing on learning AD without abnormal images, with long-tailed distributed training data, and using a unified model for all classes. In addition, online AD learning has also been explored. In this work, we expand in both directions to a realistic setting by considering the novel task of long-tailed online AD (LTOAD). We first identify that the offline state-of-the-art LTAD methods cannot be directly applied to the online setting. Specifically, LTAD is class-aware, requiring class labels that are not available in the online setting. To address this challenge, we propose a class-agnostic framework for LTAD and then adapt it to our online learning setting. Our method outperforms the SOTA baselines in most offline LTAD settings, including both the industrial manufacturing and the medical domain. In particular, we observe +4.63\% image-AUROC on MVTec even compared to methods that have access to class labels and the number of classes. In the most challenging long-tailed online setting, we achieve +0.53\% image-AUROC compared to baselines.
Poster
Fatemeh Ghezloo · Saygin Seyfioglu · Rustin Soraki · Wisdom Ikezogwo · Beibin Li · Tejoram Vivekanandan · Joann Elmore · Ranjay Krishna · Linda Shapiro
[ Exhibit Hall I ]
Abstract
Diagnosing diseases through histopathology whole slide images (WSIs) is fundamental in modern pathology but is challenged by the gigapixel scale and complexity of WSIs. Trained histopathologists overcome this challenge by navigating the WSI, looking for relevant patches, taking notes, and compiling them to produce a final holistic diagnostic. Traditional AI approaches, such as multiple instance learning and transformer-based models, fall short of such a holistic, iterative, multi-scale diagnostic procedure, limiting their adoption in the real world. We introduce PathFinder, a multi-modal, multi-agent framework that emulates the decision-making process of expert pathologists. PathFinder integrates four AI agents, the Triage Agent, Navigation Agent, Description Agent, and Diagnosis Agent, that collaboratively navigate WSIs, gather evidence, and provide comprehensive diagnoses with natural language explanations. The Triage Agent classifies the WSI as benign or risky; if risky, the Navigation and Description Agents iteratively focus on significant regions, generating importance maps and descriptive insights of sampled patches. Finally, the Diagnosis Agent synthesizes the findings to determine the patient's diagnostic classification. Our experiments show that PathFinder outperforms state-of-the-art methods in skin melanoma diagnosis by 8% while offering inherent explainability through natural language descriptions of diagnostically relevant patches. Qualitative analysis by pathologists shows that the Description Agent's outputs are …
Poster
Marcin Przewięźlikowski · Randall Balestriero · Wojciech Jasiński · Marek Śmieja · Bartosz Zieliński
[ Exhibit Hall I ]
Abstract
Masked Image Modeling (MIM) has emerged as a promising approach for Self-Supervised Learning (SSL) of visual representations. However, the out-of-the-box performance of MIMs is typically inferior to competing approaches. Most users cannot afford fine-tuning due to the need for large amounts of data, high GPU consumption, and specialized user knowledge. Therefore, the practical use of MIM representations is limited. In this paper we ask what is the reason for the poor out-of-the-box performance of MIMs. Is it due to weaker features produced by MIM models, or is it due to suboptimal usage? Through detailed analysis, we show that attention in MIMs is spread almost uniformly over many patches, leading to ineffective aggregation by the [cls] token. Based on this insight, we propose Selective aggregation to better capture the rich semantic information retained in patch tokens, which significantly improves the out-of-the-box performance of MIM.
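One way to make the "attention spread almost uniformly" observation concrete is to measure the normalized entropy of an attention distribution. The following is a minimal sketch under stated assumptions, not the authors' analysis code: the function name `normalized_attention_entropy` and the example weights are hypothetical, and the abstract does not say which statistic the paper actually uses.

```python
import math


def normalized_attention_entropy(weights):
    """Return the entropy of an attention distribution, scaled to [0, 1].

    A value near 1 means attention is spread almost uniformly over the
    patches; a value near 0 means it is concentrated on a few of them.
    `weights` are non-negative attention scores for one query (for
    example, the [cls] token) over n >= 2 patches.
    """
    n = len(weights)
    if n < 2:
        raise ValueError("need at least two patches")
    total = sum(weights)
    probs = [w / total for w in weights]
    # Zero-probability entries contribute nothing to the entropy.
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return entropy / math.log(n)


# Hypothetical attention rows: one diffuse, one selective.
diffuse = [0.25, 0.25, 0.25, 0.25]
selective = [0.85, 0.05, 0.05, 0.05]
print(normalized_attention_entropy(diffuse))
print(normalized_attention_entropy(selective))
```

Under this measure, a diffuse row scores close to 1 and a selective row scores noticeably lower, which is the contrast the abstract describes between uniform and selective aggregation.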
Poster
Weili Xu · Enxin Song · Wenhao Chai · Xuexiang Wen · Tian Ye · Gaoang Wang
[ Exhibit Hall I ]
Abstract
The challenge of long video understanding lies in its high computational complexity and prohibitive memory cost, since the memory and computation required by transformer-based LLMs scale quadratically with input sequence length. We propose AuroraLong to address this challenge by replacing the LLM component in MLLMs with RWKV, an RNN-like language model that handles input sequences of arbitrary length with constant-size hidden states. To further increase throughput and efficiency, as well as to reduce the gap between RWKV’s 4k context length and the extended token sequences typical of long videos, we combine visual token merge with linear RNN models by reordering the visual tokens by their sizes in ascending order. Despite having only 2B parameters and being trained exclusively on public data, AuroraLong achieves performance comparable to Transformer-based models of similar size trained on private datasets across multiple video benchmarks. This demonstrates the potential of efficient, linear RNNs to democratize long video understanding by lowering its computational entry barrier. To the best of our knowledge, we are the first to use an RWKV LLM backbone in a LLaVA-like model for open-ended video QA.
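The reordering step mentioned in this abstract can be sketched as a sort over merged tokens. This is only an illustration under assumptions: the `MergedToken` class is hypothetical, and "size" is taken here to mean the number of original patches merged into a token, which the abstract does not define.

```python
from dataclasses import dataclass


@dataclass
class MergedToken:
    feature: list  # hypothetical feature vector of the merged token
    size: int      # assumed meaning: number of original patches merged


def reorder_by_size(tokens):
    """Return tokens sorted by size in ascending order.

    Python's sort is stable, so tokens of equal size keep their
    original relative order.
    """
    return sorted(tokens, key=lambda t: t.size)


tokens = [
    MergedToken([0.1, 0.2], size=4),
    MergedToken([0.3, 0.4], size=1),
    MergedToken([0.5, 0.6], size=2),
]
print([t.size for t in reorder_by_size(tokens)])
```

With this ordering, tokens that summarize more patches are fed to the recurrent model last, so they sit closest to the final hidden state.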
Poster
Junghyup Lee · Jeimin Jeon · Dohyung Kim · Bumsub Ham
[ Exhibit Hall I ]
Abstract
Quantization-aware training (QAT) simulates a quantization process during training to lower the bit-precision of weights/activations. It learns quantized weights indirectly by updating latent weights, i.e., full-precision inputs to a quantizer, using gradient-based optimizers. We claim that coupling a user-defined learning rate (LR) with these optimizers is sub-optimal for QAT. Quantized weights transit discrete levels of a quantizer only if the corresponding latent weights pass transition points, where the quantizer changes discrete states. This suggests that the changes of quantized weights are affected by both the LR for latent weights and their distributions. It is thus difficult to control the degree of changes for quantized weights by scheduling the LR manually. We conjecture that the degree of parameter changes in QAT is related to the number of quantized weights transiting discrete levels. Based on this, we introduce a transition rate (TR) scheduling technique that controls the number of transitions of quantized weights explicitly. Instead of scheduling an LR for latent weights, we schedule a target TR of quantized weights, and update the latent weights with a novel transition-adaptive LR (TALR), which accounts for the degree of change in the quantized weights during QAT. Experimental results demonstrate the effectiveness of our approach on standard benchmarks.
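The notion of a transition rate can be illustrated by counting how many quantized weights change their discrete level after one update of the latent weights. This is a minimal sketch, assuming a uniform round-to-nearest quantizer with a hypothetical step size and level count; the abstract specifies neither the quantizer nor the TALR update rule, so only the measurement is shown.

```python
def quantize(w, step=0.1, levels=16):
    """Map a latent weight to a discrete level index.

    Assumed design: uniform round-to-nearest quantizer, clipped to a
    signed range of `levels` indices. Transition points lie halfway
    between neighboring levels.
    """
    half = levels // 2
    index = round(w / step)
    return max(-half, min(half - 1, index))


def transition_rate(latent_before, latent_after, step=0.1, levels=16):
    """Fraction of weights whose quantized level changed after an update."""
    changed = sum(
        quantize(b, step, levels) != quantize(a, step, levels)
        for b, a in zip(latent_before, latent_after)
    )
    return changed / len(latent_before)


# The same update size yields different transition rates depending on
# where the latent weights sit relative to the transition points.
update = 0.02
near_transition = [0.04, 0.14, 0.24, 0.34]
at_level_center = [0.0, 0.1, 0.2, 0.3]
print(transition_rate(near_transition, [w + update for w in near_transition]))
print(transition_rate(at_level_center, [w + update for w in at_level_center]))
```

In this toy setting every weight near a transition point changes level while none of the weights at level centers do, even though both sets receive an identical update. That matches the abstract's point that the LR alone does not determine how much the quantized weights change.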
Poster
Junhao Zheng · Jiahao Sun · Chenhao Lin · Zhengyu Zhao · Chen Ma · Chong Zhang · Cong Wang · Qian Wang · Chao Shen
[ Exhibit Hall I ]
Abstract
Developing reliable defenses against patch attacks for object detectors has attracted increasing interest. However, we identify that existing defense evaluations lack a unified and comprehensive framework, causing inconsistent and incomplete assessment of current methods. To address this issue, we revisit 10 representative defenses and present the first large-scale benchmark, involving 2 attack goals, 13 patch attacks, 11 object detectors, and 4 diverse metrics. This leads to the first large-scale adversarial patch dataset with 94 types of patches and 94,000 images, which can also be used to improve existing defenses. We conduct comprehensive analyses to reveal new insights: (1) The difficulty in defending against naturalistic patches lies in the data distribution, rather than the commonly believed high frequencies. In light of this, we construct a large-scale dataset with diverse patch distributions to obtain stronger defenses, with 15.09\% AP@0.5 improvement. (2) A higher patch detection accuracy does not necessarily imply better defense performance. Instead, the average precision of the attacked object shows higher consistency. (3) Existing defenses can be substantially bypassed by adaptive attacks, and defenses that integrate complex/stochastic models or patch-level features are less vulnerable. We will open-source our dataset and code as well as keep integrating new attacks/defenses.
Poster
Yuheng Shi · Minjing Dong · Chang Xu
[ Exhibit Hall I ]
Abstract
While Contrastive Language-Image Pre-training (CLIP) has advanced open-vocabulary predictions, its performance on semantic segmentation remains suboptimal. This shortfall primarily stems from its spatial-invariant semantic features and constrained resolution. While previous adaptations addressed spatial invariance by modifying the self-attention in CLIP's image encoder, the issue of limited resolution remains unexplored. Different from previous segment-then-splice methods that segment sub-images via a sliding window and splice the results, we introduce a splice-then-segment paradigm that incorporates the Segment-Anything Model (SAM) to tackle the resolution issue, since SAM excels at extracting fine-grained semantic correlations from high-resolution images. Specifically, we introduce Trident, a training-free framework that first splices features extracted by CLIP and DINO from sub-images, then leverages SAM's encoder to create a correlation matrix for global aggregation, enabling a broadened receptive field for effective segmentation. Besides, we propose a refinement strategy for CLIP's coarse segmentation outputs by transforming them into prompts for SAM, further enhancing the segmentation performance. Trident achieves a significant improvement in the mIoU across eight popular benchmarks compared with the current SOTA. Furthermore, it can also be utilized to generate visual prompts that enhance the performance of Large Vision-Language Models (LVLMs).
Poster
Zexi Jia · Chuanwei Huang · Yeshuang Zhu · Hongyan Fei · Ying Deng · Zhiqiang Yuan · Jiapei Zhang · Jinchao Zhang · Jie Zhou
[ Exhibit Hall I ]
Abstract
Vision-language models (VLMs) often struggle with compositional reasoning due to insufficient high-quality image-text data. To tackle this challenge, we propose a novel block-based diffusion approach that automatically generates counterfactual datasets without manual annotation. Our method utilizes large language models to identify entities and their spatial relationships. It then independently generates image blocks as "puzzle pieces" coherently arranged according to specified compositional rules. This process creates diverse, high-fidelity counterfactual image-text pairs with precisely controlled variations. In addition, we introduce a specialized loss function that differentiates inter-set from intra-set samples, enhancing training efficiency and reducing the need for negative samples. Experiments demonstrate that fine-tuning VLMs with our counterfactual datasets significantly improves visual reasoning performance. Our approach achieves state-of-the-art results across multiple benchmarks while using substantially less training data than existing methods.
Poster
Jiao Tang · Junjie Zhou · Bo Qian · Peng Wan · Yingli Zuo · WEI SHAO · Daoqiang Zhang
[ Exhibit Hall I ]
Abstract
Tissue segmentation in pathology images is crucial for computer-aided diagnostics of human cancers. Traditional tissue segmentation models rely heavily on large-scale labeled datasets, where every tissue type must be annotated by experts. However, due to the complexity of the tumor micro-environment, collecting annotations for all possible tissue types is challenging, which makes the traditional methods ineffective in segmenting unseen tissue types with zero training samples. With the rapid development of vision-language models (VLMs), recent studies extend their powerful zero-shot capabilities to pixel-level segmentation tasks, where the model is trained only on seen classes but can perform tissue segmentation on both seen and unseen categories in the testing phase. However, these VLM-based zero-shot segmentation models still require substantial annotation efforts on seen classes. To attain desirable segmentation performance on both seen and unseen categories with limited labeled data, we propose AcZeroTS, a novel active learning framework for zero-shot tissue segmentation in pathology images. Specifically, AcZeroTS is built on a VLM-based prototype-guided zero-shot segmentation model called ProZS. We introduce a novel active selection criterion to select the most valuable samples for annotation on seen classes, which not only considers both uncertainty and diversity of unlabeled samples, but also ensures that the generated prototypes …
Poster
Xiaowen Ma · Zhen-Liang Ni · Xinghao Chen
[ Exhibit Hall I ]
Abstract
Mamba has shown great potential for computer vision due to its linear complexity in modeling the global context with respect to the input length. However, existing lightweight Mamba-based backbones cannot demonstrate performance that matches convolution- or Transformer-based methods. We observe that simply modifying the scanning path in the image domain is not conducive to fully exploiting the potential of vision Mamba. In this paper, we first perform comprehensive spectral and quantitative analyses, and verify that the Mamba block mainly models low-frequency information under a Convolution-Mamba hybrid architecture. Based on the analyses, we introduce a novel Laplace mixer to decouple the features in terms of frequency and input only the low-frequency components into the Mamba block. In addition, considering the redundancy of the features and the different requirements for high-frequency details and low-frequency global information at different stages, we introduce a frequency ramp inception, i.e., gradually reduce the input dimensions of the high-frequency branches, so as to efficiently trade off the high-frequency and low-frequency components at different layers. By integrating mobile-friendly convolution and an efficient Laplace mixer, we build a series of tiny hybrid vision Mamba called TinyViM. The proposed TinyViM achieves impressive performance on several downstream tasks including image classification, semantic …
Poster
Renshan Zhang · Rui Shao · Gongwei Chen · Miao Zhang · Kaiwen Zhou · Weili Guan · Liqiang Nie
[ Exhibit Hall I ]
Abstract
The incorporation of high-resolution visual input equips multimodal large language models (MLLMs) with enhanced visual perception capabilities for real-world tasks. However, most existing high-resolution MLLMs rely on a cropping-based approach to process images, which leads to fragmented visual encoding and a sharp increase in redundant tokens. To tackle these issues, we propose the FALCON model. FALCON introduces a novel visual register technique to simultaneously: 1) Eliminate redundant tokens at the stage of visual encoding. To directly address the visual redundancy present in the output of the vision encoder, we propose a Register-based Representation Compacting (ReCompact) mechanism. This mechanism introduces a set of learnable visual registers designed to adaptively aggregate essential information while discarding redundancy. It enables the encoder to produce a more compact visual representation with a minimal number of output tokens, thus eliminating the need for an additional compression module. 2) Ensure continuity in visual encoding. To address the potential encoding errors caused by fragmented visual inputs, we develop a Register Interactive Attention (ReAtten) module. This module facilitates effective and efficient information exchange across sub-images by enabling interactions between visual registers. It ensures the continuity of visual semantics throughout the encoding. We conduct comprehensive experiments with FALCON on high-resolution benchmarks …
Poster
Zhentao Tan · Ben Xue · Jian Jia · Junhao Wang · Wencai Ye · Shaoyun Shi · Sun Mingjie · Wenjin Wu · Quan Chen · Peng Jiang
[ Exhibit Hall I ]
Abstract
This paper presents the $\textbf{S}$emantic-a$\textbf{W}$ar$\textbf{E}$ spatial-t$\textbf{E}$mporal $\textbf{T}$okenizer (SweetTok), a novel video tokenizer to overcome the limitations in current video tokenization methods for compacted yet effective discretization. Unlike previous approaches that process flattened local visual patches via direct discretization or adaptive query tokenization, SweetTok proposes a decoupling framework, compressing visual inputs through distinct spatial and temporal queries via $\textbf{D}$ecoupled $\textbf{Q}$uery $\textbf{A}$uto$\textbf{E}$ncoder (DQAE). This design allows SweetTok to efficiently compress video token count while achieving better fidelity by capturing essential information across spatial and temporal dimensions. Furthermore, we design a $\textbf{M}$otion-enhanced $\textbf{L}$anguage $\textbf{C}$odebook (MLC) tailored for spatial and temporal compression to address the differences in semantic representation between appearance and motion information. SweetTok significantly improves video reconstruction results by $\textbf{42.8}$\% w.r.t. rFVD on the UCF-101 dataset. With a better token compression strategy, it also boosts downstream video generation results by $\textbf{15.1}$\% w.r.t. gFVD. Additionally, the compressed decoupled tokens are imbued with semantic information, enabling few-shot recognition capabilities powered by LLMs in downstream applications.
Poster
Ping Cao · Yepeng Tang · Chunjie Zhang · Xiaolong Zheng · Chao Liang · Yunchao Wei · Yao Zhao
[ Exhibit Hall I ]
Abstract
Human-object interaction (HOI) detection fundamentally relies on capturing fine-grained visual information to distinguish complex relationships between humans and objects. While recent generative diffusion models have demonstrated remarkable capability in learning detailed visual concepts through pixel-level generation, their potential for interaction-level relationship modeling remains largely unexplored. We aim to bridge this gap by leveraging generative models’ fine-grained visual perception to enhance HOI detection through improved visual relation representation learning. In this work, we propose a Visual Relation Diffusion model (VRDiff) for HOI detection, which introduces dense visual relation conditions. Considering that diffusion models primarily focus on instance-level objects, we design an interaction-aware condition representation that learns relation features with spatial responsiveness and contextual interaction cues. Instead of relying on text conditions, VRDiff leverages learned visual relation representations as conditions for the diffusion model. Furthermore, we refine the visual relation representations through generative feedback from the text-to-image diffusion model, enhancing HOI detection performance without requiring image generation. Extensive experiments on the HICO-DET benchmark demonstrate that VRDiff achieves state-of-the-art performance under both standard and zero-shot HOI detection settings.
Poster
Yingfan MA · Bohan An · Ao Shen · Mingzhi Yuan · Minghong Duan · Manning Wang
[ Exhibit Hall I ]
Abstract
Whole Slide Image (WSI) classification has been widely used in pathological diagnosis and prognosis prediction, and it is commonly formulated as a weakly-supervised Multiple Instance Learning (MIL) problem because of the large size of WSIs and the difficulty of obtaining fine-grained annotations. In the MIL formulation, a WSI is treated as a bag and the patches cut from it are treated as its instances. Most existing methods first extract instance features and then aggregate them into a bag feature using an attention-based mechanism for bag-level prediction. These models are trained using only bag-level labels, so they often lack instance-level insights and lose detailed semantic information, which limits their bag-level classification performance and damages their ability to explore highly expressive information. In this paper, we propose Flow-MIL, which leverages a normalizing flow-based Latent Semantic Embedding Space (LSES) to enhance feature representation. By mapping patches into the simple and highly-expressive latent space LSES, Flow-MIL achieves effective slide-level aggregation while preserving critical semantic information. We also introduce Gaussian Mixture Model-based Latent Semantic Prototypes (LSP) within the LSES to capture class-specific pathological distribution for each class and refine pseudo instance labels. Extensive experiments on three public WSI datasets show that Flow-MIL outperforms recent SOTA methods in both …
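The attention-based aggregation that the abstract attributes to most existing MIL methods can be sketched as a softmax-weighted sum of instance features. This illustrates that baseline only, not Flow-MIL itself. The function names are hypothetical, and the attention scores are passed in directly here, whereas in practice they would come from a small learned network.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(instances, scores):
    """Aggregate instance features into one bag feature.

    instances: list of equal-length feature vectors, one per patch.
    scores:    one unnormalized attention score per instance (assumed
               to be given; normally produced by a learned module).
    Returns the bag feature and the attention weights.
    """
    weights = softmax(scores)
    dim = len(instances[0])
    bag = [
        sum(w * x[d] for w, x in zip(weights, instances))
        for d in range(dim)
    ]
    return bag, weights


# Hypothetical bag of three patch features with made-up scores.
patches = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
bag, weights = attention_pool(patches, scores=[2.0, 0.5, 0.5])
print(bag, weights)
```

Because only the pooled bag feature reaches the classifier, per-instance detail is compressed into a single vector, which is the loss of instance-level information the abstract points to.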
Poster
Ke Zhang · Yi Huang · Wei Liu · Yuanyuan Wang · Vishal Patel · Le Lu · Xu Han · Dakai Jin · Ke Yan
[ Exhibit Hall I ]
Abstract
Accurate segmentation of tubular structures in medical images, such as vessels and airway trees, is crucial for computer-aided diagnosis, radiotherapy, and surgical planning. However, significant challenges exist in algorithm design when faced with diverse sizes, complex topologies, and (often) incomplete data annotation of these structures. We address these difficulties by proposing a new tubular structure segmentation framework named HarmonySeg. First, we design a deep-to-shallow decoder network featuring flexible convolution blocks with varying receptive fields, which enables the model to effectively adapt to tubular structures of different scales. Second, to highlight potential anatomical regions and improve the recall of small tubular structures, we incorporate vesselness maps as auxiliary information. These maps are aligned with image features through a shallow-and-deep fusion module, which simultaneously eliminates unreasonable candidates to maintain high precision. Finally, we introduce a topology-preserving loss function that leverages contextual and shape priors to balance the growth and suppression of tubular structures, which also allows the model to handle low-quality and incomplete annotations. Extensive quantitative experiments are conducted on four public datasets. The results show that our model can accurately segment 2D and 3D tubular structures and outperform existing state-of-the-art methods. External validation on a private dataset also demonstrates good generalizability. …
Poster
Xiaohui Chen · Satya Narayan Shukla · Mahmoud Azab · Aashu Singh · Qifan Wang · David Yang · ShengYun Peng · Hanchao Yu · Shen Yan · Xuewen Zhang · Baosheng He
[ Exhibit Hall I ]
Abstract
How well can Multimodal Large Language Models (MLLMs) understand composite images? Composite images (CIs) are synthetic visuals created by merging multiple visual elements, such as charts, posters, or screenshots, rather than being captured directly by a camera. While CIs are prevalent in real-world applications, recent MLLM developments have primarily focused on interpreting natural images (NIs). Our research reveals that current MLLMs face significant challenges in accurately understanding CIs, often struggling to extract information or perform complex reasoning based on these images. We find that existing training data for CIs are mostly formatted for question-answer tasks (e.g., in datasets like ChartQA and ScienceQA), while high-quality image-caption datasets, critical for robust vision-language alignment, are only available for NIs. To bridge this gap, we introduce Composite Captions (CompCap), a flexible framework that leverages Large Language Models (LLMs) and automation tools to synthesize CIs with accurate and detailed captions. Using CompCap, we curate CompCap-118K, a dataset containing 118K image-caption pairs across six CI types. We validate the effectiveness of CompCap-118K by supervised fine-tuning MLLMs of three sizes: xGen-MM-inst.-4B and LLaVA-NeXT-Vicuna-7B/13B. Empirical results show that CompCap-118K significantly enhances MLLMs’ understanding of CIs, yielding average gains of 1.7%, 2.0%, and 2.9% across eleven benchmarks, respectively.
Poster
Ayush Gupta · Anirban Roy · Rama Chellappa · Nathaniel D. Bastian · Alvaro Velasquez · Susmit Jha
[ Exhibit Hall I ]
Abstract
We address the problem of video question answering (video QA) with temporal grounding in a weakly supervised setup, without any temporal annotations. Given a video and a question, we generate an open-ended answer grounded with the start and end time. For this task, we propose TOGA: a vision-language model for Temporally Grounded Open-Ended Video QA with Weak Supervision. We instruct-tune TOGA to jointly generate the answer and the temporal grounding. We operate in a weakly supervised setup where the temporal grounding annotations are not available. We generate pseudo labels for temporal grounding and ensure the validity of these labels by imposing a consistency constraint between the question of a grounding response and the response generated by a question referring to the same temporal segment. We notice that jointly generating the answers with the grounding improves performance on question answering as well as grounding. We evaluate TOGA on grounded QA and open-ended QA tasks. For grounded QA, we consider the NExT-GQA benchmark which is designed to evaluate weakly supervised grounded open-ended question answering. For open-ended QA, we consider the MSVD-QA and ActivityNet-QA benchmarks. We achieve state-of-the-art performance for both tasks on these benchmarks.
Poster
Trevine Oorloff · Vishwanath Sindagi · Wele Gedara Chaminda Bandara · Ali Shafahi · Amin Ghiasi · Charan Prakash · Reza Ardekani
[ Exhibit Hall I ]
Abstract
Large language models (LLMs) in natural language processing (NLP) have demonstrated great potential for in-context learning (ICL) -- the ability to leverage a small set of example prompts to adapt to various tasks without having to explicitly update model weights. ICL has recently been explored for the visual domain with promising early outcomes. These approaches involve specialized training and/or additional data which complicate the process and limit its generalizability. In this work, we show that off-the-shelf Stable Diffusion models can be re-purposed for visual in-context learning (V-ICL). Specifically, we formulate an in-place attention re-computation within the self-attention layers of the Stable Diffusion architecture that explicitly incorporates context between the query and example prompts. Without any additional fine-tuning, we show that this re-purposed Stable Diffusion model is able to adapt to six different tasks: foreground segmentation, single object detection, semantic segmentation, keypoint detection, edge detection, and colorization. For example, the proposed approach improves the mean intersection over union (mIoU) for the foreground segmentation task on the Pascal-5i dataset by 8.9% and 3.2% over recent methods such as Visual Prompting and IMProv, respectively. Additionally, we show that the proposed method is able to effectively leverage multiple prompts through ensembling to infer the task …
Poster
Haicheng Wang · Zhemeng Yu · Gabriele Spadaro · Chen Ju · Victor Quétu · Shuai Xiao · Enzo Tartaglione
[ Exhibit Hall I ]
Abstract
Recently, Multi-modal Large Language Models (MLLMs) have shown remarkable effectiveness for multi-modal tasks due to their abilities to generate and understand cross-modal data. However, processing long sequences of visual tokens extracted from visual backbones poses a challenge for deployment in real-time applications. To address this issue, we introduce FOLDER, a simple yet effective plug-and-play module designed to reduce the length of the visual token sequence, mitigating computational and memory demands during both training and inference. Through a comprehensive analysis of the token reduction process in the vision encoder, we analyze the information loss introduced by different reduction strategies and develop FOLDER to preserve key information while removing visual redundancy. We show the effectiveness of FOLDER by integrating it into the visual backbone of various MLLMs, significantly accelerating the inference phase. Furthermore, we evaluate its utility as a training accelerator or even performance booster for MLLMs. In both contexts, FOLDER achieves comparable or even better performance than the original models, while dramatically reducing complexity by removing up to 70% of visual tokens. The source code will be open-sourced upon acceptance of the article.
Poster
guangyao Li · Siping Zhuang · Yajun Jian · Yan Yan · Hanzi Wang
[ Exhibit Hall I ]
Abstract
Referring Multi-Object Tracking (RMOT) aims to detect and track specific objects based on natural language expressions. Previous methods typically rely on sentence-level vision-language alignment, often failing to exploit fine-grained linguistic cues that are crucial for distinguishing objects with similar characteristics. Notably, these cues play distinct roles at different tracking stages and should be leveraged accordingly to provide more explicit guidance. In this work, we propose DKGTrack, a novel RMOT method that enhances language comprehension for precise object tracking by decoupling language expressions into localized descriptions and motion states. To improve the accuracy of language-guided object identification, we introduce a Static Semantic Enhancement (SSE) module, which enhances region-level vision-language alignment through hierarchical cross-modal feature interaction, providing more discriminative object representations for tracking. Furthermore, we propose a Motion Perception Alignment (MPA) module that explicitly aligns object queries with motion descriptions, enabling accurate object trajectory prediction across frames. Experimental results on multiple RMOT benchmarks demonstrate the effectiveness of our method, which achieves competitive performance in challenging tracking scenarios.
Poster
Yuxiao Wang · Yu Lei · Zhenao WEI · WeiYing Xue · Xinyu Jiang · Nan Zhuang · Qi Liu
[ Exhibit Hall I ]
Abstract
The task of Human-Object conTact (HOT) detection involves identifying the specific areas of the human body that are touching objects. Nevertheless, current models are restricted to just one type of image, often leading to over-segmentation in areas with little interaction and struggling to maintain category consistency within specific regions. To tackle this issue, a HOT framework, termed **P3HOT**, is proposed, which blends **P**rompt guidance and human **P**roximal **P**erception. To begin with, we utilize a semantic-driven prompt mechanism to direct the network's attention towards the relevant regions based on the correlation between image and text. Then a human proximal perception mechanism is employed to dynamically perceive the key depth range around the human, using learnable parameters to effectively eliminate regions where interactions are not expected. Calculating depth resolves the uncertainty of the overlap between humans and objects in a 2D perspective, providing a quasi-3D viewpoint. Moreover, a Regional Joint Loss (RJLoss) is introduced as a new loss to inhibit abnormal categories in the same area. A new evaluation metric called "AD-Acc." is introduced to address the shortcomings of existing methods in handling negative samples. Comprehensive experimental results demonstrate that our approach achieves state-of-the-art performance in four metrics across two benchmark datasets. Specifically, our model achieves …
Poster
Sebastian Schmidt · Julius Koerner · Dominik Fuchsgruber · Stefano Gasperini · Federico Tombari · Stephan Günnemann
[ Exhibit Hall I ]
Abstract
In panoptic segmentation, individual instances must be separated within semantic classes. As state-of-the-art methods rely on a pre-defined set of classes, they struggle with novel categories and out-of-distribution (OOD) data. This is particularly problematic in safety-critical applications, such as autonomous driving, where reliability in unseen scenarios is essential. We address the gap between outstanding benchmark performance and reliability by proposing Prior2Former (P2F), the first approach for segmentation vision transformers rooted in evidential learning. P2F extends the mask vision transformer architecture by incorporating a Beta prior for computing model uncertainty in pixel-wise binary mask assignments. This design enables high-quality uncertainty estimation that effectively detects novel and OOD objects, enabling state-of-the-art anomaly instance segmentation and open-world panoptic segmentation. Unlike most segmentation models addressing unknown classes, P2F operates without access to OOD data samples or contrastive training on void (i.e., unlabeled) classes, making it highly applicable in real-world scenarios where such prior information is unavailable. Additionally, P2F can be flexibly applied to anomaly instance and panoptic segmentation. Through comprehensive experiments on the Cityscapes, COCO, SegmentMeIfYouCan, and OoDIS datasets, we demonstrate the state-of-the-art performance of P2F. It achieves the highest ranking in the OoDIS anomaly instance benchmark among methods not using OOD data in any way.
Poster
Peng Ren · Tian Bai · Jing Sun · Fuming Sun
[ Exhibit Hall I ]
Abstract
Open-Vocabulary Camouflaged Object Segmentation (OVCOS) aims to segment camouflaged objects of any category based on text descriptions. Although existing open-vocabulary methods exhibit strong segmentation capabilities, they still have a major limitation in camouflaged scenarios: semantic confusion, which leads to incomplete segmentation and class shift in the model. To mitigate this limitation, we propose a framework for OVCOS, named SuCLIP. Specifically, we design a context-aware prompt scheme that leverages the internal knowledge of the CLIP visual encoder to enrich the text prompt and align it with local visual features, thereby enhancing the text prompt. To better align the visual semantic space and the text semantic space, we design a class-aware feature selection module to dynamically adjust text and visual embeddings, making them better matched with camouflaged objects. Meanwhile, we introduce a semantic consistency loss to mitigate the semantic deviation between the text prompt and visual features, ensuring semantic consistency between the segmentation results and the text prompt. Finally, we design a text query decoder that precisely maps textual semantics to pixel-level segmentation results, thereby achieving semantic-spatial consistent decoding. Experimental results show that SuCLIP significantly outperforms the advanced method OVCoser on the OVCamo dataset.
Poster
rongkun Zheng · Lu Qi · Xi Chen · Yi Wang · Kun Wang · Hengshuang Zhao
[ Exhibit Hall I ]
Abstract
Recent efforts in video reasoning segmentation (VRS) integrate large language models (LLMs) with perception models to localize and track objects via textual instructions, achieving barely satisfactory results in simple scenarios. However, they struggle to discriminate and deduce the objects from user queries in more realistic scenes characterized by long durations, multiple objects, rapid motion, and heavy occlusions. In this work, we analyze the underlying causes of these limitations, and present **ViLLa**: **Vi**deo reasoning segmentation with **L**arge **La**nguage Model. Remarkably, our ViLLa manages to tackle these challenges through multiple core innovations: (1) a context synthesizer that dynamically encodes the user intent with video contexts for accurate reasoning, resolving ambiguities in complex queries, and (2) a hierarchical temporal synchronizer that disentangles multi-object interactions across complex temporal scenarios by modelling multi-object interactions at local and global temporal scales. To enable efficient processing of long videos, ViLLa incorporates (3) a key segment sampler that adaptively partitions long videos into shorter but semantically dense segments for less redundancy. Moreover, to promote research in this unexplored area, we construct a VRS benchmark, **VideoReasonSeg**, featuring different complex scenarios. Our model also exhibits impressive state-of-the-art results on VideoReasonSeg, Ref-YouTube-VOS, Ref-DAVIS17, MeViS, and ReVOS. Both quantitative and qualitative …
Poster
Xiaoyi Bao · Chen-Wei Xie · Hao Tang · Tingyu Weng · Xiaofeng Wang · Yun Zheng · Xingang Wang
[ Exhibit Hall I ]
Abstract
In recent years, the introduction of Multi-modal Large Language Models (MLLMs) into video understanding tasks has become increasingly prevalent. However, how to effectively integrate temporal information remains a critical research focus. Traditional approaches treat spatial and temporal information separately. Due to issues like motion blur, it is challenging to accurately represent the spatial information of rapidly moving objects. This can lead to temporally important regions being underemphasized during spatial feature extraction, which in turn hinders accurate spatio-temporal interaction and video understanding. To address this limitation, we propose an innovative video representation method called Dynamic-Image (DynImg). Specifically, we introduce a set of non-key frames as temporal prompts to highlight the spatial areas containing fast-moving objects. During the process of visual feature extraction, these prompts guide the model to pay additional attention to the fine-grained spatial features corresponding to these regions. Moreover, to maintain the correct sequence for DynImg, we employ a corresponding 4D video Rotary Position Embedding. This retains both the temporal and spatial adjacency of DynImg, helping MLLM understand the spatio-temporal order within this combined format. Experimental evaluations reveal that DynImg surpasses the state-of-the-art methods by approximately 2% across multiple video understanding benchmarks, proving the effectiveness of our temporal prompts …
Poster
chunlin wen · Yu Zhang · Jie Fan · Hongyuan Zhu · Xiu-Shen Wei · Yijun Wang · Zhiqiang Kou · Shuzhou Sun
[ Exhibit Hall I ]
Abstract
Few-shot semantic segmentation (FSS) aims to segment objects of novel categories in the query images given only a few annotated support samples. Existing methods primarily build the image-level correlation between the support target object and the entire query image. However, this correlation contains the hard pixel noise, i.e., irrelevant background objects, that is intractable to trace and suppress, leading to the overfitting of the background. To address the limitation of this correlation, we imitate the biological vision process to identify novel objects in the object-level information. Target identification in the general objects is more valid than in the entire image, especially in the low-data regime. Inspired by this, we design an Object-level Correlation Network (OCNet) by establishing the object-level correlation between the support target object and query general objects, which is mainly composed of the General Object Mining Module (GOMM) and Correlation Construction Module (CCM). Specifically, GOMM constructs the query general object feature by learning saliency and high-level similarity cues, where the general objects include the irrelevant background objects and the target foreground object. Then, CCM establishes the object-level correlation by allocating the target prototypes to match the general object feature. The generated object-level correlation can mine the query target …
Poster
Taimur Hassan · Anabia Sohail · Muzammal Naseer · Naoufel Werghi
[ Exhibit Hall I ]
Abstract
Retinopathy comprises a group of retinal disorders that can lead to severe visual impairment or blindness. The heterogeneous morphology of lesions poses a significant challenge in developing robust diagnostic systems. Supervised approaches rely on large labeled datasets and often struggle with generalization. To address these limitations, we propose an unsupervised vision-language neural graph featurization method. This method first segments fundus images into a set of super-pixels via Simple Linear Iterative Clustering (SLIC). The super-pixel regions are then decomposed into an undirected graph where each super-pixel serves as a node, and spatially adjacent nodes are connected by edges. A Hamiltonian path systematically traverses the graph and iteratively updates and propagates node and edge latent space embeddings throughout the graph until convergence is achieved. Then, a normalized cut separates the converged embeddings into two clusters within a latent space that represent the lesion and healthy super-pixel regions of the input scans. The lesion super-pixels are further classified into lesion categories using a prompt-based zero-shot vision-language model. The proposed method is rigorously tested on three public datasets, dubbed ODIR, BIOMISA, and IDRiD, achieving F1-scores of 0.89, 0.93, and 0.92, respectively, with significant performance gains over state-of-the-art methods.
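The graph construction step described above (super-pixels as nodes, spatial adjacency as edges) can be illustrated with a minimal sketch. It assumes the SLIC output is already available as a 2D grid of integer labels; the function name `superpixel_graph`, the 4-neighbour adjacency rule, and the toy label map are illustrative assumptions, not details taken from the paper.

```python
def superpixel_graph(labels):
    """Return (nodes, edges) for a 2D grid of superpixel labels.

    Two superpixels are connected when any of their pixels are
    4-neighbours, i.e. share a horizontal or vertical side.
    """
    nodes = set()
    edges = set()
    rows, cols = len(labels), len(labels[0])
    for r in range(rows):
        for c in range(cols):
            a = labels[r][c]
            nodes.add(a)
            # Looking right and down covers every adjacent pixel pair once.
            for dr, dc in ((0, 1), (1, 0)):
                rr, cc = r + dr, c + dc
                if rr < rows and cc < cols:
                    b = labels[rr][cc]
                    if a != b:
                        # Store undirected edges with the smaller label first.
                        edges.add((min(a, b), max(a, b)))
    return sorted(nodes), sorted(edges)


# Toy 3x4 label map with three superpixels (hypothetical data).
toy_labels = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [2, 2, 2, 2],
]
nodes, edges = superpixel_graph(toy_labels)
print(nodes, edges)
```

In this toy map every superpixel touches the other two, so the graph is a triangle. The embedding propagation and normalized cut mentioned in the abstract would operate on a graph of this form.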
Poster
Shuaiting Li · Juncan Deng · Chengxuan Wang · Kedong Xu · Rongtao Deng · Hong Gu · Haibin Shen · Kejie Huang
[ Exhibit Hall I ]
Abstract
Vector Quantization (VQ) has emerged as a prominent weight compression technique, showcasing substantially lower quantization errors than uniform quantization across diverse models, particularly in extreme compression scenarios. However, its efficacy during fine-tuning is limited by the constraint of the compression format, where weight vectors assigned to the same codeword are restricted to updates in the same direction. Consequently, many quantized weights are compelled to move in directions contrary to their local gradient information. To mitigate this issue, we introduce a novel VQ paradigm, Sign-Splitting VQ (SSVQ), which decouples the sign bit of weights from the codebook. Our approach involves extracting the sign bits of uncompressed weights and performing clustering and compression on all-positive weights. We then introduce latent variables for the sign bit and jointly optimize both the signs and the codebook. Additionally, we implement a progressive freezing strategy for the learnable sign to ensure training stability. Extensive experiments on various modern models and tasks demonstrate that SSVQ achieves a significantly superior compression-accuracy trade-off compared to conventional VQ. Furthermore, we validate our algorithm on a hardware accelerator, showing that SSVQ achieves a 3× speedup over the 8-bit compressed model by reducing memory access.
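The idea of decoupling the sign bit from the codebook can be sketched as follows. This is a simplified illustration rather than the authors' implementation: the codebook is fixed by hand instead of being learned by clustering, the signs are not learnable, and the names `split_signs`, `nearest_codeword`, and `reconstruct` are hypothetical.

```python
def split_signs(vector):
    """Separate a weight vector into per-element signs and magnitudes."""
    signs = [1 if w >= 0 else -1 for w in vector]
    magnitudes = [abs(w) for w in vector]
    return signs, magnitudes


def nearest_codeword(magnitudes, codebook):
    """Index of the codeword closest (squared L2) to the magnitudes."""
    def dist(code):
        return sum((m - c) ** 2 for m, c in zip(magnitudes, code))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def reconstruct(signs, index, codebook):
    """Rebuild a weight vector from its stored signs and codeword index."""
    return [s * c for s, c in zip(signs, codebook[index])]


# Hypothetical all-positive codebook (hand-picked, not clustered).
codebook = [[0.1, 0.1], [0.5, 0.9]]

# Two weight vectors with equal magnitudes but opposite signs.
w_a = [0.5, -0.9]
w_b = [-0.5, 0.9]

signs_a, mags_a = split_signs(w_a)
signs_b, mags_b = split_signs(w_b)
idx_a = nearest_codeword(mags_a, codebook)
idx_b = nearest_codeword(mags_b, codebook)
rec_a = reconstruct(signs_a, idx_a, codebook)
rec_b = reconstruct(signs_b, idx_b, codebook)
print(idx_a, idx_b, rec_a, rec_b)
```

Both vectors share one codeword while keeping their own signs, so a change to a stored sign moves one vector without forcing the other in the same direction. That is the freedom conventional VQ lacks according to the abstract.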
Poster
Pedro Bassi · Mehmet Yavuz · Ibrahim Ethem Hamamci · Sezgin Er · Xiaoxi Chen · Wenxuan Li · Bjoern Menze · Sergio Decherchi · Andrea Cavalli · Kang Wang · Yang Yang · Alan Yuille · Zongwei Zhou
[ Exhibit Hall I ]
Abstract
With over 85 million CT scans performed annually in the United States, creating tumor-related reports is a challenging and time-consuming task for radiologists. To address this need, we present Rad-GPT, an Anatomy-Aware Vision-Language AI Agent for generating detailed reports from CT scans. Rad-GPT first segments tumors, including benign cysts and malignant tumors, and their surrounding anatomical structures, then transforms this information into both structured reports and narrative reports. These reports provide tumor size, shape, location, attenuation, volume, and interactions with surrounding blood vessels and organs. Extensive evaluation on unseen hospitals shows that Rad-GPT can produce accurate reports, with high sensitivity/specificity for small tumor (<2 cm) detection: 80/73% for liver tumors, 92/78% for kidney tumors, and 77/77% for pancreatic tumors. For large tumors, sensitivity ranges from 89% to 97%. The results significantly surpass the state-of-the-art in abdominal CT report generation. Rad-GPT generated reports for 17 public datasets. Through radiologist review and refinement, we have ensured the reports' accuracy, and created the first publicly available image-text 3D medical dataset, comprising over 1.8 million text tokens and 2.7 million images from 9,262 CT scans, including 2,947 tumor scans/reports of 2,562 tumor instances. Our reports can: (1) localize tumors in eight liver sub-segments and …
Poster
Jiayuan Zhu · Junde Wu · Cheng Ouyang · Konstantinos Kamnitsas · Alison Noble
[ Exhibit Hall I ]
Abstract
Medical image segmentation data inherently contain uncertainty. This can stem from both imperfect image quality and variability in labeling preferences on ambiguous pixels, which depend on annotator expertise and the clinical context of the annotations. For instance, a boundary pixel might be labeled as tumor in diagnosis to avoid under-estimation of severity, but as normal tissue in radiotherapy to prevent damage to sensitive structures. As segmentation preferences vary across downstream applications, it is often desirable for an image segmentation model to offer user-adaptable predictions rather than a fixed output. While prior uncertainty-aware and interactive methods offer adaptability, they are inefficient at test time: uncertainty-aware models require users to choose from numerous similar outputs, while interactive models demand significant user input through click or box prompts to refine segmentation. To address these challenges, we propose **SPA**, a new **S**egmentation **P**reference **A**lignment framework that efficiently adapts to diverse test-time preferences with minimal human interaction. By presenting users with a select few, distinct segmentation candidates that best capture uncertainties, it reduces the user workload to reach the preferred segmentation. To accommodate user preference, we introduce a probabilistic mechanism that leverages user feedback to adapt a model's segmentation preference. The proposed framework is evaluated …
Poster
Kaixiang Yang · Xin Li · Qiang Li · Zhiwei Wang
[ Exhibit Hall I ]
Abstract
Anticipating and recognizing surgical workflows are critical for intelligent surgical assistance systems. However, existing methods rely on deterministic decision-making, struggling to generalize across the large anatomical and procedural variations inherent in real-world surgeries. In this paper, we introduce an innovative framework that incorporates stochastic modeling through a denoising diffusion probabilistic model (DDPM) into conventional deterministic learning for surgical workflow analysis. At the heart of our approach is a collaborative co-training paradigm: the DDPM branch captures procedural uncertainties to enrich feature representations, while the task branch focuses on predicting surgical phases and instrument usage. Theoretically, we demonstrate that this mutual refinement mechanism benefits both branches: the DDPM reduces prediction errors in uncertain scenarios, and the task branch directs the DDPM toward clinically meaningful representations. Notably, the DDPM branch is discarded during inference, enabling real-time predictions without sacrificing accuracy. Experiments on the Cholec80 dataset show that for the anticipation task, our method achieves a 16% reduction in eMAE compared to state-of-the-art approaches, and for phase recognition, it improves the Jaccard score by 1.0%. Additionally, on the AutoLaparo dataset, our method achieves a 1.5% improvement in the Jaccard score for phase recognition, while also exhibiting robust generalization to patient-specific variations. Our code and …
Poster
Jiaqi Liao · Yuwei Niu · Fanqing Meng · Hao Li · Changyao Tian · Yinuo Du · Yuwen Xiong · Dianqi Li · Xizhou Zhu · Li Yuan · Jifeng Dai · Yu Cheng
[ Exhibit Hall I ]
Abstract
Recent years have witnessed remarkable advances in Large Vision-Language Models (LVLMs), which have achieved human-level performance across various complex vision-language tasks. Following LLaVA's paradigm, mainstream LVLMs typically employ a shallow MLP for visual-language alignment through a two-stage training process: pretraining for cross-modal alignment followed by instruction tuning. While this approach has proven effective, the underlying mechanisms of how MLPs bridge the modality gap remain poorly understood. Although some research has explored how LLMs process transformed visual tokens, few studies have investigated the fundamental alignment mechanism. Furthermore, the MLP adapter requires retraining whenever switching LLM backbones. To address these limitations, we first investigate the working principles of MLP adapters and discover that they learn to project visual embeddings into subspaces spanned by corresponding text embeddings progressively. Based on this insight, we propose LangBridge, a novel adapter that explicitly maps visual tokens to linear combinations of LLM vocabulary embeddings. This innovative design enables pretraining-free adapter transfer across different LLMs while maintaining performance. Our experimental results demonstrate that a LangBridge adapter pre-trained on Qwen2-0.5B can be directly applied to larger models such as LLaMA3-8B or Qwen2.5-14B with minimal performance degradation. Overall, LangBridge enables interpretable vision-language alignment by grounding visual semantics in LLM language …
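A rough sketch of what "mapping visual tokens to linear combinations of LLM vocabulary embeddings" could look like is given below. It rests on several assumptions that are not stated in the abstract: the coefficients are produced by a softmax over per-word scores, both LLMs are reduced to toy embedding tables indexed by the same three-word vocabulary, and the function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def to_vocab_combination(scores, vocab_embeddings):
    """Mix vocabulary embeddings using softmax-normalised coefficients."""
    coeffs = softmax(scores)
    dim = len(vocab_embeddings[0])
    return [
        sum(c * emb[d] for c, emb in zip(coeffs, vocab_embeddings))
        for d in range(dim)
    ]


# Scores of one visual token against a 3-word vocabulary (hypothetical).
scores = [2.0, 0.0, 0.0]

# Toy embedding tables for two LLMs with different hidden sizes (hypothetical).
small_llm_vocab = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
large_llm_vocab = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

token_for_small = to_vocab_combination(scores, small_llm_vocab)
token_for_large = to_vocab_combination(scores, large_llm_vocab)
print(len(token_for_small), len(token_for_large))
```

Because the coefficients live in vocabulary space rather than in any one model's hidden space, the same scores yield a token of the right dimensionality for whichever embedding table is plugged in. Under these assumptions, that is one way to read the adapter transfer across LLMs described in the abstract.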
Poster
Joëlle Hanna · Damian Borth
[ Exhibit Hall I ]
Abstract
Weakly Supervised Semantic Segmentation (WSSS) is a challenging problem that has been extensively studied in recent years. Traditional approaches often rely on external modules like Class Activation Maps to highlight regions of interest and generate pseudo segmentation masks. In this work, we propose an end-to-end method that directly utilizes the attention maps learned by a Vision Transformer (ViT) for WSSS. We propose training a sparse ViT with multiple [CLS] tokens (one for each class), using a random masking strategy to promote [CLS] token-to-class assignment. At inference time, we aggregate the different self-attention maps of each [CLS] token corresponding to the predicted labels to generate pseudo segmentation masks. Our proposed approach enhances the interpretability of self-attention maps and ensures accurate class assignments. Extensive experiments on two standard benchmarks and three specialized datasets demonstrate that our method generates accurate pseudo-masks, outperforming related works. These pseudo-masks can be used to train a segmentation model which achieves results comparable to fully-supervised models, significantly reducing the need for fine-grained labeled data.
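The inference step, turning per-class [CLS] attention maps into a pseudo-mask, might be sketched as below. The aggregation rule is an assumption: each patch takes the predicted class whose token attends to it most, and patches below a threshold become background. The function name, the threshold value, and the toy attention values are hypothetical.

```python
def pseudo_mask(attention_maps, predicted_classes, background_threshold=0.2):
    """Label each patch with the predicted class that attends to it most.

    attention_maps maps a class id to a list of per-patch attention weights.
    Patches where no predicted class exceeds the threshold get -1 (background).
    """
    num_patches = len(next(iter(attention_maps.values())))
    mask = []
    for p in range(num_patches):
        best_class, best_score = -1, background_threshold
        for cls in predicted_classes:
            score = attention_maps[cls][p]
            if score > best_score:
                best_class, best_score = cls, score
        mask.append(best_class)
    return mask


# Toy attention of three class tokens over four patches (hypothetical values).
maps = {
    0: [0.7, 0.6, 0.1, 0.1],
    1: [0.1, 0.3, 0.8, 0.1],
    2: [0.9, 0.9, 0.9, 0.9],  # class 2 is not predicted, so it is ignored
}
predicted = [0, 1]
mask = pseudo_mask(maps, predicted)
print(mask)
```

Only the tokens of predicted classes contribute, which mirrors the abstract's statement that the maps are aggregated for the predicted labels.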
Poster
Xinye Cao · Hongcan Guo · Jiawen Qian · Guoshun Nan · Chao Wang · Yuqi Pan · Tianhao Hou · Xiaojuan Wang · Yutong Gao
[ Exhibit Hall I ]
Abstract
Understanding hour-long videos with multi-modal large language models (MM-LLMs) enriches the landscape of human-centered AI applications. However, for end-to-end video understanding with LLMs, uniformly sampling video frames results in LLMs being overwhelmed by a vast amount of irrelevant information as video length increases. Existing hierarchical key frame extraction methods improve the accuracy of video understanding but still face two critical challenges. 1) How can the interference of extensive redundant information in long videos be mitigated? 2) How can a model dynamically adapt to complex hierarchical structures while accurately identifying key frames? To address these issues, we propose VideoMiner, which iteratively segments, captions, and clusters long videos, forming a hierarchical tree structure. The proposed VideoMiner progresses from long videos to events to frames while preserving temporal coherence, effectively addressing the first challenge. To precisely locate key frames, we introduce T-GRPO, a tree-based group relative policy optimization method for reinforcement learning that guides the exploration of VideoMiner. The proposed T-GRPO is specifically designed for tree structures, integrating spatiotemporal information at the event level while being guided by the question, thus solving the second challenge. We achieve superior performance in all long-video understanding tasks and uncover several interesting insights. Our proposed T-GRPO …
Poster
Samir Khaki · Junxian Guo · Jiaming Tang · Shang Yang · Yukang Chen · Konstantinos Plataniotis · Yao Lu · Song Han · Zhijian Liu
[ Exhibit Hall I ]
Abstract
Vision language models (VLMs) have garnered increasing attention for their ability to integrate visual and textual understanding, with some capable of processing native-resolution images and long videos. While the capacity to process large visual data unlocks numerous downstream applications, it often introduces significant latency challenges, as the visual tokens dominate the resource consumption. In this work, we introduce SparseVILA, a novel method of query-aware token retrieval to dynamically accelerate the underlying LLM, by pruning tokens in the context stage, while attending to a sparse subset of visual tokens during the generation phase. By decoupling the context and generation compression, we can migrate the majority of sparsity into the generation stage, enabling query-aware support for multi-turn conversation while achieving a 1.5$\times$ speedup on image benchmarks. Further, this approach leads to significant accuracy improvements on image-centric benchmarks over previous query-aware/agnostic pruning works. Finally, SparseVILA enables efficient long-context/long-generation tasks by achieving a 6.3$\times$ and 1.7$\times$ speedup in context processing and generation respectively.
Poster
Federico Girella · Davide Talon · Ziyue Liu · Zanxi Ruan · Yiming Wang · Marco Cristani
[ Exhibit Hall I ]
Abstract
Fashion design is a complex creative process that blends visual and textual expressions. Designers convey ideas through sketches, which define spatial structure and design elements, and textual descriptions, capturing material, texture, and stylistic details. In this paper, we present LOcalized Text and Sketch for fashion image generation (LOTS), an approach for compositional sketch-text based generation of complete fashion outlooks. LOTS leverages a global description with paired localized sketch + text information for conditioning and introduces a novel step-based merging strategy for diffusion adaptation. First, a Modularized Pair-Centric representation encodes sketches and text into a shared latent space while preserving independent localized features; then, a Diffusion Pair Guidance phase integrates both local and global conditioning via attention-based guidance within the diffusion model’s multi-step denoising process. To validate our method, we build on Fashionpedia to release Sketchy, the first fashion dataset where multiple text-sketch pairs are provided per image. Quantitative results show LOTS achieves state-of-the-art image generation performance on both global and localized metrics, while qualitative examples and a human evaluation study highlight its unprecedented level of design customization.
Poster
Ziv Weiss Haddad · Oren Barkan · Yehonatan Elisha · Noam Koenigstein
[ Exhibit Hall I ]
Abstract
Completeness is a widely discussed property in explainability research, requiring that the attributions sum to the model’s response to the input. While completeness intuitively suggests that the model’s prediction is "completely explained" by the attributions, its global formulation alone is insufficient to ensure meaningful explanations. We contend that promoting completeness locally within attribution subregions, in a soft manner, can serve as a standalone guiding principle for producing high quality attributions. To this end, we introduce the concept of the completeness gap as a flexible measure of completeness and propose an optimization procedure that minimizes this gap across subregions within the attribution map. Extensive evaluations across various model architectures demonstrate that our method outperforms state-of-the-art explanation methods on multiple benchmarks.
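The gap between global and local completeness can be made concrete with a toy example. The definition used here is an assumption for illustration, not necessarily the paper's: the gap of a subregion is the absolute difference between the attributions summed over that region and the model's response when only that region is kept and everything else is set to a baseline. The model, input, and attribution values are hypothetical.

```python
def model(x):
    """Toy model with an interaction between features 0 and 1 (hypothetical)."""
    return x[0] * x[1] + x[2]


def completeness_gap(attributions, x, baseline, region):
    """|sum of attributions in region - model response to that region alone|."""
    masked = [x[i] if i in region else baseline[i] for i in range(len(x))]
    response = model(masked) - model(baseline)
    attributed = sum(attributions[i] for i in region)
    return abs(attributed - response)


x = [2, 3, 1]
baseline = [0, 0, 0]
# Attributions that split the interaction term 2 * 3 = 6 equally.
attributions = [3, 3, 1]

global_gap = completeness_gap(attributions, x, baseline, {0, 1, 2})
gap_feature_0 = completeness_gap(attributions, x, baseline, {0})
gap_pair_01 = completeness_gap(attributions, x, baseline, {0, 1})
gap_feature_2 = completeness_gap(attributions, x, baseline, {2})
print(global_gap, gap_feature_0, gap_pair_01, gap_feature_2)
```

The attributions are globally complete (they sum to the model output of 7), yet the subregion containing only feature 0 has a non-zero gap because that feature contributes nothing without its partner. This is the kind of situation in which global completeness alone says little about subregions.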
Poster
Yunheng Li · Yuxuan Li · Quan-Sheng Zeng · Wenhai Wang · Qibin Hou · Ming-Ming Cheng
[ Exhibit Hall I ]
Abstract
Pre-trained vision-language models (VLMs), such as CLIP, have demonstrated impressive zero-shot recognition capability, but still underperform in dense prediction tasks. Self-distillation has recently emerged as a promising approach for fine-tuning VLMs to better adapt to local regions without requiring extensive annotations. However, previous state-of-the-art approaches often suffer from significant 'foreground bias', where models tend to wrongly identify background regions as foreground objects. To alleviate this issue, we propose DenseVLM, a framework designed to learn unbiased region-language alignment from powerful pre-trained VLM representations. DenseVLM leverages the pre-trained VLM to retrieve categories for unlabeled regions and then decouples the interference between foreground and background features. We show that DenseVLM can directly replace the original VLM in open-vocabulary object detection and image segmentation methods, leading to notable performance improvements. Furthermore, it exhibits promising zero-shot scalability when training on more extensive and diverse datasets. Our code is available in the supplementary materials and will be publicly released.
Poster
Ronglai Zuo · Rolandos Alexandros Potamias · Evangelos Ververas · Jiankang Deng · Stefanos Zafeiriou
[ Exhibit Hall I ]
Abstract
Sign language is a visual language that encompasses all linguistic features of natural languages and serves as the primary communication method for the deaf and hard-of-hearing communities. Although many studies have successfully adapted pretrained language models (LMs) for sign language translation (sign-to-text), the reverse task—sign language generation (text-to-sign)—remains largely unexplored. In this work, we introduce a multilingual sign language model, Signs as Tokens (SOKE), which can generate 3D sign avatars autoregressively from text inputs using a pretrained LM. To align sign language with the LM, we leverage a decoupled tokenizer that discretizes continuous signs into token sequences representing various body parts. During decoding, unlike existing approaches that flatten all part-wise tokens into a single sequence and predict one token at a time, we propose a multi-head decoding method capable of predicting multiple tokens simultaneously. This approach improves inference efficiency while maintaining effective information fusion across different body parts. To further ease the generation process, we propose a retrieval-enhanced SLG approach, which incorporates external sign dictionaries to provide accurate word-level signs as auxiliary conditions, significantly improving the precision of generated signs. Extensive qualitative and quantitative evaluations demonstrate the effectiveness of SOKE. Code, models, and data will be made publicly available.
Poster
Ruyang Liu · Shangkun Sun · Haoran Tang · Wei Gao · Ge Li
[ Exhibit Hall I ]
Abstract
Long-form video understanding has always been a challenging problem due to the significant redundancy in both temporal and spatial contents. This challenge is further exacerbated by the limited context length of Multimodal Large Language Models (MLLMs). To address this issue, many previous works have attempted to extract key video information, where the "key" is typically semantic-aware and heavily dependent on the CLIP model as prior. In this paper, we propose Flow4Agent, a novel framework that is the first to incorporate motion priors from optical flow to facilitate LLM-based long video understanding. Flow4Agent mitigates the redundancy in long videos at both temporal and spatial levels through two core modules: Temporal Granularity Optimization (TGO) adaptively refines frame-level hierarchies, first leveraging coarse flow priors to group similar visual contents and then applying semantic priors to filter out highly irrelevant scene information. Motion Token Pruning (MTP) further refines the intra-frame visual representations, pruning high-redundancy video tokens using fine-grained optical flow information. Extensive experiments demonstrate that our Flow4Agent outperforms existing methods across a wide range of video MLLM benchmarks, especially for hour-level video understanding tasks, achieving 64.7% on Video-MME, 71.4% on MLVU and 60.4% on LongVideoBench.
Poster
Beomyoung Kim · Chanyong Shin · Joonhyun Jeong · Hyungsik Jung · Seyun Lee · Sewhan Chun · Dong-Hyun HWANG · Joonsang Yu
[ Exhibit Hall I ]
Abstract
The recent segmentation foundation model, Segment Anything Model (SAM), exhibits strong zero-shot segmentation capabilities, but it falls short in generating fine-grained precise masks. To address this limitation, we propose a novel zero-shot image matting model, called ZIM, with two key contributions: First, we develop a label converter that transforms segmentation labels into detailed matte labels, constructing the new SA1B-Matte dataset without costly manual annotations. Training SAM with this dataset enables it to generate precise matte masks while maintaining its zero-shot capability. Second, we design the zero-shot matting model equipped with a hierarchical pixel decoder to enhance mask representation, along with a prompt-aware masked attention mechanism to improve performance by enabling the model to focus on regions specified by visual prompts. We evaluate ZIM using the newly introduced MicroMat-3K test set, which contains high-quality micro-level matte labels. Experimental results show that ZIM outperforms existing methods in fine-grained mask generation and zero-shot generalization. Furthermore, we demonstrate the versatility of ZIM in various downstream tasks requiring precise masks, such as image inpainting and 3D segmentation. Our contributions provide a robust foundation for advancing zero-shot matting and its downstream applications across a wide range of computer vision tasks. The code will be available soon.
Poster
Tassilo Wald · Constantin Ulrich · Jonathan Suprijadi · Sebastian Ziegler · Michal Nohel · Robin Peretzke · Gregor Koehler · Klaus Maier-Hein
[ Exhibit Hall I ]
Abstract
The field of self-supervised learning (SSL) for 3D medical images lacks consistency and standardization. While many methods have been developed, it is impossible to identify the current state-of-the-art, due to i) varying and small pre-training datasets, ii) varying architectures, and iii) being evaluated on differing downstream datasets. In this paper we bring clarity to this field and lay the foundation for further method advancements through three key contributions: We a) publish the largest publicly available pre-training dataset comprising 114k brain MRI volumes, enabling all practitioners to pre-train on a large-scale dataset. We b) benchmark existing 3D self-supervised learning methods on this dataset for a state-of-the-art CNN and Transformer architecture, clarifying the state of 3D SSL pre-training. Among many findings, we show that pre-trained methods can exceed a strong from-scratch nnU-Net ResEnc-L baseline. Lastly, we c) publish the code of our pre-training and fine-tuning frameworks and provide the pre-trained models created during the benchmarking process to facilitate rapid adoption and reproduction.
Poster
Chenting Wang · Kunchang Li · Tianxiang Jiang · Xiangyu Zeng · Yi Wang · Limin Wang
[ Exhibit Hall I ]
Abstract
Popular video training methods mainly operate on a fixed number of tokens sampled from a predetermined spatiotemporal grid, resulting in sub-optimal accuracy-computation trade-offs due to inherent video redundancy. They also lack adaptability to varying computational budgets for downstream tasks, hindering applications of the most competitive model in real-world scenes. We thus propose a new test setting, Token Optimization, for maximized input information across budgets, which optimizes the size-limited set of input tokens through token selection from more suitably sampled videos. To this end, we propose a novel augmentation tool termed Flux. By making the sampling grid flexible and leveraging token selection, it is easily adopted in most popular video training frameworks, boosting model robustness with nearly no additional cost. We integrate Flux in large-scale video pre-training, and the resulting FluxViT establishes new state-of-the-art results across extensive tasks at standard costs. Notably, with only 1/4 of the tokens, it can still match the performance of previous state-of-the-art models with Token Optimization, yielding nearly 90% savings. The code and models will be publicly released to facilitate future video tasks.
Poster
Ding Zhong · Xu Zheng · Chenfei Liao · Yuanhuiyi Lyu · Jialei Chen · Shengyang Wu · Linfeng Zhang · Xuming Hu
[ Exhibit Hall I ]
Abstract
Segment Anything Model 2 (SAM2) has emerged as a strong base model in various pinhole imaging segmentation tasks. However, when applying it to the $360^\circ$ domain, the significant field-of-view (FoV) gap between pinhole ($70^\circ \times 70^\circ$) and panoramic images ($180^\circ \times 360^\circ$) poses unique challenges. Two major concerns for this application include 1) the inevitable distortion and object deformation brought by the large FoV disparity between domains; 2) the lack of pixel-level semantic understanding in the original SAM2. To address these issues, we propose a novel $\textbf{OmniSAM}$ framework, which makes the $\textbf{first}$ attempt to apply SAM2 for panoramic semantic segmentation. Specifically, to bridge the first gap, OmniSAM first divides the panorama into sequences of patches. These patches are then treated as image sequences in a manner similar to video segmentation tasks. We then leverage SAM2’s memory mechanism to extract cross-patch correspondences that embed the cross-FoV dependencies, improving feature continuity and the prediction consistency along mask boundaries. For the second gap, OmniSAM fine-tunes the pretrained image encoder and reutilizes the mask decoder for semantic prediction. An FoV-based prototypical adaptation module with a dynamic pseudo-label update mechanism is also introduced to facilitate the alignment of memory and backbone features, thereby improving model generalization …
Poster
Xinyu Yan · Meijun Sun · Ge-Peng Ji · Fahad Khan · Salman Khan · Deng-Ping Fan
[ Exhibit Hall I ]
Abstract
We present LawDIS, a language-window-based controllable dichotomous image segmentation (DIS) framework that produces high-quality object masks. Our framework recasts DIS as an image-conditioned mask generation task within a latent diffusion model, enabling seamless integration of user controls. LawDIS is enhanced with macro-to-micro control modes. Specifically, in macro mode, we introduce a language-controlled segmentation strategy (LS) to generate an initial mask based on user-provided language prompts. In micro mode, a window-controlled refinement strategy (WR) allows flexible refinement of user-defined regions (i.e., size-adjustable windows) within the initial mask. Coordinated by a mode switcher, these modes can operate independently or jointly, making the framework well-suited for high-accuracy, personalised applications. Extensive experiments on the DIS5K benchmark reveal that our LawDIS significantly outperforms 11 cutting-edge methods across all metrics. Notably, compared to the second-best model MVANet, we achieve $F_\beta^\omega$ gains of 4.6% with both the LS and WR strategies and 3.6% gains with only the LS strategy on DIS-TE. Our code will be available.
Poster
Vishwesh Ramanathan · Tony Xu · Pushpak Pati · Faruk Ahmed · Maged Goubran · Anne Martel
[ Exhibit Hall I ]
Abstract
Prediction tasks in digital pathology are challenging due to the massive size of whole-slide images (WSIs) and the weak nature of training signals. Advances in computing, data availability, and self-supervised learning (SSL) have paved the way for slide-level foundation models (SLFMs) that can improve prediction tasks in low-data regimes. However, working with these models is challenging, with issues such as catastrophic forgetting during fine-tuning and under-utilization of shared information between tasks and modalities. To overcome these two challenges, we propose ModalTune, a novel fine-tuning framework which introduces the Modal Adapter to integrate new modalities without modifying SLFM weights. Additionally, we use large language models (LLMs) to encode labels as text, capturing semantic relationships and enhancing generalization across multiple tasks and cancer types in a single training recipe. ModalTune achieves state-of-the-art (SOTA) results against both uni-modal and multi-modal models across four cancer types, jointly improving survival and cancer subtype prediction while remaining competitive in pan-cancer settings. Additionally, we show ModalTune is highly generalizable to two out-of-distribution (OOD) datasets. To our knowledge, this is the first unified fine-tuning framework for multi-modal, multi-task, and pan-cancer modeling in digital pathology. Code will be shared after blind review.
Poster
Sihan Yang · Runsen Xu · Chenhang Cui · Tai Wang · Dahua Lin · Jiangmiao Pang
[ Exhibit Hall I ]
Abstract
Large Multimodal Models (LMMs) excel in visual-language tasks by leveraging numerous visual tokens for fine-grained visual information, but this token redundancy results in significant computational costs. Previous research aimed at reducing visual tokens during inference typically leverages importance maps derived from attention scores among vision-only tokens or vision-language tokens to prune tokens across one or multiple pruning stages. Despite this progress, pruning frameworks and strategies remain simplistic and insufficiently explored, often resulting in substantial performance degradation. In this paper, we propose VFlowOpt, a token pruning framework that introduces an importance map derivation process and a progressive pruning module with a recycling mechanism. The hyperparameters of its pruning strategy are further optimized by a visual information flow-guided method. Specifically, we compute an importance map for image tokens based on their attention-derived context relevance and patch-level information entropy. We then decide which tokens to retain or prune and aggregate the pruned ones as recycled tokens to avoid potential information loss. Finally, we apply a visual information flow-guided method that regards the last token in the LMM as the most representative signal of text-visual interactions. This method minimizes the discrepancy between token representations in LMMs with and without pruning, thereby enabling superior pruning …
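As a rough illustration of the retain, prune, and recycle step described in this abstract, the sketch below keeps the highest-importance tokens and folds the pruned ones into a single recycled token. It is a minimal sketch under stated assumptions: the importance scores are taken as given (the abstract derives them from attention-based context relevance and patch entropy), and the function name `prune_with_recycling`, the importance-weighted average, and the single recycled token are all hypothetical choices, not the authors' implementation.

```python
def prune_with_recycling(tokens, importance, keep):
    """Keep the `keep` most important tokens and merge the rest into one.

    tokens:     list of feature vectors (lists of floats)
    importance: one non-negative score per token (assumed precomputed)
    Returns the kept tokens in their original order, followed by one
    recycled token that is the importance-weighted mean of the pruned ones.
    """
    order = sorted(range(len(tokens)), key=lambda i: importance[i], reverse=True)
    kept_idx = sorted(order[:keep])      # preserve the original token order
    pruned_idx = order[keep:]
    kept = [tokens[i] for i in kept_idx]
    if not pruned_idx:
        return kept

    dim = len(tokens[0])
    total = sum(importance[i] for i in pruned_idx)
    if total > 0:
        recycled = [
            sum(importance[i] * tokens[i][d] for i in pruned_idx) / total
            for d in range(dim)
        ]
    else:  # all pruned scores are zero: fall back to a plain mean
        recycled = [
            sum(tokens[i][d] for i in pruned_idx) / len(pruned_idx)
            for d in range(dim)
        ]
    return kept + [recycled]


# Toy example: four 2-D tokens, keep the two most important ones.
toy_tokens = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [4.0, 0.0]]
toy_scores = [0.5, 0.1, 0.3, 0.1]
compressed = prune_with_recycling(toy_tokens, toy_scores, keep=2)
```

In this toy case the output has three tokens: the two retained ones plus one recycled token summarizing the two that were pruned.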
Poster
Chia-Wen Kuo · Sijie Zhu · Fan Chen · Xiaohui Shen · Longyin Wen
[ Exhibit Hall I ]
Abstract
Large vision-and-language models (LVLMs) have traditionally integrated visual and textual tokens by concatenating them into a single homogeneous input for large language models (LLMs), thereby maximally preserving the pre-trained language capabilities. However, this constrained architecture for visual and textual tokens restricts the design space for processing visual tokens, potentially leading to suboptimal performance and efficiency. In this paper, we propose Decomposed Attention, a more flexible attention architecture for LVLMs, which enables modification of visual token operations without affecting textual-to-textual attention. Decomposed Attention decomposes the 1-D causal self-attention of LVLMs into visual-to-visual, textual-to-visual, and textual-to-textual attentions, and the visual and textual output tokens from the decomposed attentions are merged with a carefully derived weighting strategy, namely $\alpha$-weighting. Taking advantage of the flexibility, we are able to introduce two critical improvements in visual token processing while maintaining the capacity of pre-trained LLMs: 1) We rectify the biased positional encoding in textual-to-visual attention to boost visual understanding performance. 2) We diagonalize visual-to-visual attention to reduce computation complexity from $\mathcal{O}(|V|^2)$ to $\mathcal{O}(|V|)$ for $|V|$ visual tokens without compromising performance. Extensive experiments and analysis validate the effectiveness of Decomposed Attention, demonstrating significant improvements on multiple image benchmarks while significantly reducing computational costs (e.g., $5\times$ faster). Code, data, and models will …
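To make the idea of merging decomposed attentions concrete, the sketch below checks a standard identity: softmax attention over a concatenated key set equals a weighted sum of the attentions computed separately over each subset, with weights given by each subset's softmax normalizer. Whether this is exactly the paper's $\alpha$-weighting is an assumption; the function names `attend` and `merge` and the toy vectors are hypothetical.

```python
import math


def attend(q, keys, values):
    """Softmax attention for one query vector.

    Returns the attention output and the log-sum-exp of the logits,
    which is all that is needed to merge partial results later.
    """
    logits = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(logits)
    weights = [math.exp(l - m) for l in logits]
    z = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) / z for d in range(dim)]
    return out, math.log(z) + m


def merge(out_a, lse_a, out_b, lse_b):
    """Combine two partial attention outputs into the joint attention output."""
    m = max(lse_a, lse_b)
    wa, wb = math.exp(lse_a - m), math.exp(lse_b - m)
    alpha = wa / (wa + wb)  # share of the joint softmax mass held by subset a
    return [alpha * x + (1.0 - alpha) * y for x, y in zip(out_a, out_b)]


# Toy check: one textual query attending to visual and textual keys.
q = [1.0, 0.5]
vis_k, vis_v = [[1.0, 0.0], [0.0, 1.0]], [[1.0, 2.0], [3.0, 4.0]]
txt_k, txt_v = [[0.5, 0.5]], [[5.0, 6.0]]

joint, _ = attend(q, vis_k + txt_k, vis_v + txt_v)
out_v, lse_v = attend(q, vis_k, vis_v)
out_t, lse_t = attend(q, txt_k, txt_v)
merged = merge(out_v, lse_v, out_t, lse_t)
```

Because the merge is exact, the visual branch can in principle be computed with a different operator without touching the textual-to-textual branch, which is the flexibility the abstract describes.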
Poster
Ting Lei · Shaofeng Yin · Qingchao Chen · Yuxin Peng · Yang Liu
[ Exhibit Hall I ]
Abstract
Open Vocabulary Human-Object Interaction (HOI) detection aims to detect interactions between humans and objects while generalizing to novel interaction classes beyond the training set. Current methods often rely on Vision and Language Models (VLMs) but face challenges due to suboptimal image encoders, as image-level pre-training does not align well with the fine-grained region-level interaction detection required for HOI. Additionally, effectively encoding textual descriptions of visual appearances remains difficult, limiting the model’s ability to capture detailed HOI relationships. To address these issues, we propose Interaction-aware Prompting with Concept Calibration (INP-CC), an end-to-end open-vocabulary HOI detector that integrates interaction-aware prompts and concept calibration. Specifically, we propose an interaction-aware prompt generator that dynamically generates a compact set of prompts based on the input scene, enabling selective sharing among similar interactions. This approach directs the model’s attention to key interaction patterns rather than generic image-level semantics, enhancing HOI detection. Furthermore, we refine HOI concept representations through language model-guided calibration, which helps distinguish diverse HOI concepts by leveraging structured semantic knowledge. A negative sampling strategy is also employed to improve inter-modal similarity modeling, enabling the model to better differentiate visually similar but semantically distinct actions. Extensive experimental results demonstrate that INP-CC significantly outperforms state-of-the-art models on the SWIG-HOI and …
Poster
Kai Huang · hao zou · Bochen Wang · Xi Ye · Zhen Xie · Hao Wang
[ Exhibit Hall I ]
Abstract
Recent advancements in Large Visual Language Models (LVLMs) have gained significant attention due to their remarkable reasoning capabilities and proficiency in generalization. However, processing a large number of visual tokens and generating long-context outputs impose substantial computational overhead, leading to excessive demands for key-value (KV) cache. To address this critical bottleneck, we propose AirCache, a novel KV cache compression method aimed at accelerating LVLM inference. This work systematically investigates the correlations between visual and textual tokens within the attention mechanisms of LVLMs. Our empirical analysis reveals considerable redundancy in cached visual tokens, wherein strategically eliminating these tokens preserves model performance while significantly accelerating context generation. Inspired by these findings, we introduce an elite observation window for assessing the importance of visual components in the KV cache, focusing on stable inter-modal relevancy modeling with enhanced multi-perspective consistency. Additionally, we develop an adaptive layer-wise budget allocation strategy that capitalizes on the strength and skewness of token importance distribution, showcasing superior efficiency compared to uniform allocation. Comprehensive evaluations across multiple LVLMs and benchmarks demonstrate that our method achieves comparable performance to the full cache while retaining only 10% of the visual KV cache, thereby reducing decoding latency by 29% to 66% across various batch …
Poster
Arsha Nagrani · Sachit Menon · Ahmet Iscen · Shyamal Buch · Nilpa Jha · Ramin Mehran · Anja Hauth · Mikhail Sirotenko · Yukun Zhu · Carl Vondrick · Cordelia Schmid · Tobias Weyand
[ Exhibit Hall I ]
Abstract
Multimodal LLMs are turning their focus to video benchmarks; however, most video benchmarks only provide outcome supervision, with no intermediate or interpretable reasoning steps. This makes it challenging to assess if models are truly able to combine perceptual and temporal information to reason about videos, or simply get the correct answer by chance or by exploiting linguistic biases. To remedy this, we provide a new video reasoning dataset called MINERVA for modern multimodal models. Each question in the dataset comes with 5 answer choices, as well as detailed, hand-crafted reasoning traces. Our dataset is multimodal, diverse in terms of video domain and length, and consists of complex multi-step questions. Extensive benchmarking shows that our dataset provides a challenge for frontier open-source and proprietary models. We perform fine-grained error analysis to identify common failure modes across various models, and create a taxonomy of reasoning errors. We use this to explore both human and LLM-as-a-judge methods for scoring video reasoning traces, and find that failure modes are primarily related to temporal localization, followed by visual perception errors, as opposed to logical or completeness errors. The dataset, along with questions, answer candidates and reasoning traces will be released publicly.
Poster
Fu Rong · Meng Lan · Qian Zhang · Lefei Zhang
[ Exhibit Hall I ]
Abstract
Referring video object segmentation (RVOS) aims to segment objects in a video according to textual descriptions, which requires the integration of multimodal information and temporal dynamics perception. The Segment Anything Model 2 (SAM 2) has shown great effectiveness across various video segmentation tasks. However, its application to offline RVOS is challenged by the translation of the text into effective prompts and a lack of global context awareness. In this paper, we propose a novel RVOS framework, termed MPG-SAM 2, to address these challenges. Specifically, MPG-SAM 2 employs a multimodal encoder to jointly encode video and textual features, generating semantically aligned video and text embeddings along with multimodal class tokens. A mask prior generator is devised to utilize the video embeddings and class tokens to create pseudo masks of target objects and global context. These masks are fed into the prompt encoder as dense prompts, along with multimodal class tokens as sparse prompts to generate accurate prompts for SAM 2. To provide the online SAM 2 with a global view, we propose a hierarchical global-historical aggregator, which allows SAM 2 to aggregate global and historical information of target objects at both pixel and object levels, enhancing the target representation and temporal …
Poster
Jeongseok Hyun · Sukjun Hwang · Su Ho Han · Taeoh Kim · Inwoong Lee · Dongyoon Wee · Joon-Young Lee · Seon Joo Kim · Minho Shim
[ Exhibit Hall I ]
Abstract
Video large language models (LLMs) have achieved good video understanding performance by utilizing a large number of tokens in spatio-temporal space. However, the quadratic growth of the computational complexity associated with the number of tokens remains a critical challenge. To address this, we propose a novel spatio-temporal token merging (STTM) method designed to enhance token efficiency in video LLMs. Our key insight is to leverage inherent spatial and temporal local redundancy in video data, which has been overlooked in previous research. Specifically, we transform individual frames into multi-granular spatial tokens using a coarse-to-fine search algorithm based on the quadtree data structure. Subsequently, we perform multi-granular directed pairwise merging in the temporal dimension. This decomposed merging approach significantly reduces redundant visual tokens across the spatio-temporal dimensions. Experiments on multiple video QA benchmarks show that our approach outperforms existing token reduction methods in accuracy. Surprisingly, our approach maintains above 99% accuracy relative to models using full tokens, with only 50% of the token budget. This token reduction also translates to lower inference latency.
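The coarse-to-fine quadtree search mentioned in this abstract can be illustrated with a small sketch for a single frame: a block is kept as one coarse token when its contents are nearly uniform, and is otherwise split into four quadrants. This is only a sketch under stated assumptions. Scalar values stand in for token embeddings, and the uniformity test, the `tol` threshold, and the name `quadtree_merge` are hypothetical, not the authors' criteria. The temporal merging step is not shown.

```python
def quadtree_merge(grid, tol=0.1):
    """Cover a square grid with variable-size blocks (multi-granular tokens).

    grid: list of lists of floats whose side length is a power of two.
    Returns a list of (row, col, size, mean) tuples. A block stays whole
    when every value is within `tol` of the block mean; otherwise it is
    split into four quadrants and each quadrant is examined in turn.
    """
    def visit(r, c, size):
        vals = [grid[i][j] for i in range(r, r + size) for j in range(c, c + size)]
        mean = sum(vals) / len(vals)
        if size == 1 or max(abs(v - mean) for v in vals) <= tol:
            return [(r, c, size, mean)]
        half = size // 2
        blocks = []
        for dr in (0, half):
            for dc in (0, half):
                blocks.extend(visit(r + dr, c + dc, half))
        return blocks

    return visit(0, 0, len(grid))


# A 4x4 frame that is flat everywhere except one detailed corner.
frame = [
    [1.0, 1.0, 1.0, 1.0],
    [1.0, 1.0, 1.0, 1.0],
    [1.0, 1.0, 0.0, 5.0],
    [1.0, 1.0, 9.0, 3.0],
]
spatial_tokens = quadtree_merge(frame, tol=0.1)
```

Here the three flat quadrants each collapse into one coarse token while the detailed quadrant keeps its four fine tokens, so 16 grid cells become 7 tokens.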
Poster
Qi Chen · Xinze Zhou · Chen Liu · Hao Chen · Wenxuan Li · Zekun Jiang · Ziyan Huang · Yuxuan Zhao · Dexin Yu · Junjun He · Yefeng Zheng · Ling Shao · Alan Yuille · Zongwei Zhou
[ Exhibit Hall I ]
Abstract
AI development for tumor segmentation is challenged by the scarcity of large, annotated datasets, due to the intensive annotation effort and required medical expertise. Analyzing a proprietary dataset of 3,000 per-voxel annotated pancreatic tumor scans, we discovered that beyond 1,500 scans, AI performance plateaus despite more data. We further incorporated synthetic data, showing that AI could reach the plateau with only 500 real scans. This indicates that synthetic augmentation steepens the scaling laws, enhancing AI performance more efficiently than real data alone. Motivated by these lessons, we created CancerVerse, a dataset of 10,136 CT scans with a total of 10,260 tumor instances per-voxel manually annotated in six organs (pancreas, liver, kidney, colon, esophagus, uterus) and 5,279 control scans. This monumental effort by eight expert radiologists offers a dataset scale that surpasses existing public tumor datasets by several orders of magnitude. While we continue to expand the scale of data and annotations, we believe that the current CancerVerse can already provide a solid foundation (based on our lessons from the proprietary dataset) to enable AI to segment tumors in these six organs, offering significant improvements in both in-distribution (+7% DSC) and out-of-distribution (+16% DSC) evaluations over those trained on current public datasets. More importantly, AI …
Poster
Ziyang Luo · Nian Liu · Xuguang Yang · Salman Khan · Rao Anwer · Hisham Cholakkal · Fahad Khan · Junwei Han
[ Exhibit Hall I ]
Abstract
Audio-Visual Segmentation (AVS) faces a fundamental challenge of effectively aligning audio and visual modalities. While recent approaches leverage foundation models to address data scarcity, they often rely on single-modality knowledge or combine foundation models in an off-the-shelf manner, failing to address the cross-modal alignment challenge. In this paper, we present TAViS, a novel framework that couples the knowledge of multimodal foundation models (ImageBind) for cross-modal alignment and a segmentation foundation model (SAM2) for precise segmentation. However, effectively combining these models poses two key challenges: the difficulty in transferring the knowledge between SAM2 and ImageBind due to their different feature spaces, and the insufficiency of using only segmentation loss for supervision. To address these challenges, we introduce a text-bridged design with two key components: (1) a text-bridged hybrid prompting mechanism where pseudo text provides class prototype information while retaining modality-specific details from both audio and visual inputs, and (2) an alignment supervision strategy that leverages text as a bridge to align shared semantic concepts within audio-visual modalities. Our approach achieves superior performance on single-source, multi-source, and semantic datasets, and excels in zero-shot settings.
Poster
Young Seok Jeon · Hongfei Yang · Huazhu Fu · Young Seok Jeon
[ Exhibit Hall I ]
Abstract
Imposing key anatomical features, such as the number of organs, their shapes and relative positions, is crucial for building a robust multi-organ segmentation model. Current attempts to incorporate anatomical features include broadening the effective receptive field (ERF) size with data-intensive modules, or introducing anatomical constraints that scale poorly to multi-organ segmentation. We introduce a novel architecture called the Anatomy-Informed Cascaded Segmentation Network (AIC-Net). AIC-Net incorporates a learnable input termed "Anatomical Prior", which can be adapted to patient-specific anatomy using a differentiable spatial deformation. The deformed prior later guides decoder layers towards more anatomy-informed predictions. We repeat this process at a local patch level to enhance the representation of intricate objects, resulting in a cascaded network structure. AIC-Net is a general method that enhances any existing segmentation model to be more anatomy-aware. We have validated the performance of AIC-Net, with various backbones, on three multi-organ segmentation tasks: abdominal organs, vertebrae, and ribs. For each respective task, our benchmarks demonstrate improved Dice score and Hausdorff distance.
Poster
Jiannan Ge · Lingxi Xie · Hongtao Xie · Pandeng Li · Sun-Ao Liu · XIAOPENG ZHANG · Qi Tian · Yongdong Zhang
[ Exhibit Hall I ]
Abstract
In recent years, Open-Vocabulary Semantic Segmentation (OVSS) has been largely advanced. However, existing methods mostly rely on a pre-trained vision-language model (e.g., CLIP) and require a predefined set of classes to guide the semantic segmentation process during inference. This not only narrows the application scenario but also constrains comprehension within a finite vocabulary. To overcome this, we reformulate OVSS as a text generation task and propose the CLIP-adapted Region-to-Text Network (CRTNet) that achieves vocabulary-free OVSS by generating category names and descriptions upon segmentation masks. The training process consists of two steps to ensure an accurate and detailed interpretation of the masked regions: (i) the initial step adapts CLIP visual features to mask-level proposal features using binarized masks extracted by a trained mask extractor, and (ii) the subsequent step involves aggregating these features to become text-aware by integrating CLIP text embeddings, effectively aligning visual data with corresponding linguistic data to facilitate region-to-text learning. Furthermore, we introduce a series of parsing and filtering techniques to integrate multiple sources of training data to improve the generalization ability of our model. Experiments demonstrate that our model not only excels in OVSS but also exhibits scalability and can be adapted to various foundation models …
Poster
Danhui Chen · Ziquan Liu · Chuxi Yang · Dan Wang · Yan Yan · Yi Xu · Xiangyang Ji
[ Exhibit Hall I ]
Abstract
Pixel-level vision tasks, such as semantic segmentation, require extensive and high-quality annotated data, which is costly to obtain. Semi-supervised semantic segmentation (SSSS) has emerged as a solution to alleviate the labeling burden by leveraging both labeled and unlabeled data through self-training techniques. Meanwhile, foundational segmentation models pre-trained on massive data have shown the potential to generalize across domains effectively. This work explores whether a foundational segmentation model can address label scarcity in the pixel-level vision task as an annotator for unlabeled images. Specifically, we investigate the efficacy of using SEEM, a Segment Anything Model (SAM) variant fine-tuned for textual input, to generate predictive masks for unlabeled data. To address the shortcomings of using SEEM-generated masks as supervision, we propose ConformalSAM, a novel SSSS framework which first calibrates the foundation model using the target domain's labeled data and then filters out unreliable pixel labels of unlabeled data so that only high-confidence labels are used as supervision. By leveraging conformal prediction (CP) to adapt foundation models to target data through uncertainty calibration, ConformalSAM reliably exploits the strong capability of the foundational segmentation model, which benefits early-stage learning, while a subsequent self-reliance training strategy mitigates overfitting to SEEM-generated masks …
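The calibrate-then-filter step in this abstract can be sketched with textbook split conformal prediction: a threshold is calibrated on labeled pixels, and an unlabeled pixel keeps its pseudo-label only when its prediction set contains exactly one class. This is a minimal sketch, assuming the nonconformity score is one minus the true-class probability and that singleton prediction sets define "high confidence". The function names and toy probabilities are hypothetical, and the paper's exact procedure may differ.

```python
import math


def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Calibrate a score threshold on labeled pixels (split conformal).

    cal_probs:  per-pixel class-probability vectors from the foundation model
    cal_labels: ground-truth class index for each calibration pixel
    The score of a pixel is 1 - p(true class); the threshold is the
    ceil((n + 1) * (1 - alpha))-th smallest score, capped at the largest.
    """
    scores = sorted(1.0 - p[y] for p, y in zip(cal_probs, cal_labels))
    n = len(scores)
    k = min(n, math.ceil((n + 1) * (1 - alpha)))
    return scores[k - 1]


def confident_label(probs, qhat):
    """Return a pseudo-label only if the prediction set is a singleton."""
    pred_set = [c for c, p in enumerate(probs) if 1.0 - p <= qhat]
    return pred_set[0] if len(pred_set) == 1 else None


# Four labeled calibration pixels with two classes (toy numbers).
cal_probs = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3], [0.45, 0.55]]
cal_labels = [0, 1, 0, 0]
qhat = conformal_threshold(cal_probs, cal_labels, alpha=0.2)

kept = confident_label([0.9, 0.1], qhat)      # unambiguous pixel
dropped = confident_label([0.5, 0.5], qhat)   # ambiguous pixel
```

Pixels for which `confident_label` returns `None` would simply be ignored in the unsupervised loss, so only calibrated, unambiguous labels act as supervision.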
Poster
Zhang Li · Biao Yang · Qiang Liu · Shuo Zhang · Zhiyin Ma · Liang Yin · Deng Linger · Yabo Sun · Yuliang Liu · Xiang Bai
[ Exhibit Hall I ]
Abstract
While large multi-modal models (LMMs) demonstrate promising capabilities in segmentation and comprehension, they still struggle with two limitations: inaccurate segmentation and hallucinated comprehension. These challenges stem primarily from constraints in weak visual comprehension and a lack of fine-grained perception. To alleviate these limitations, we propose LIRA, a framework that capitalizes on the complementary relationship between visual comprehension and segmentation via two key components: (1) Semantic-Enhanced Feature Extractor (SEFE) improves object attribute inference by fusing semantic and pixel-level features, leading to more accurate segmentation; (2) Interleaved Local Visual Coupling (ILVC) autoregressively generates local descriptions after extracting local features based on segmentation masks, offering fine-grained supervision to mitigate hallucinations. Furthermore, we find that the precision of object segmentation is positively correlated with the latent related semantics of the <seg> token. To quantify this relationship and the model's potential semantic inferring ability, we introduce the Attributes Evaluation (AttrEval) dataset. Our experiments show that LIRA achieves state-of-the-art performance in both segmentation and comprehension tasks.
Poster
Sijie Li · Chen Chen · Jungong Han
[ Exhibit Hall I ]
Abstract
In this paper, we propose SimMLM, a simple yet powerful framework for multimodal learning with missing modalities. Unlike existing approaches that rely on sophisticated network architectures or complex data imputation techniques, SimMLM provides a generic and effective solution that can adapt to various missing modality scenarios with improved accuracy and robustness. Specifically, SimMLM consists of a generic Dynamic Mixture of Modality Experts (DMoME) architecture, featuring a dynamic, learnable gating mechanism that automatically adjusts each modality’s contribution in both full and partial modality settings. A key innovation of SimMLM is the proposed More vs. Fewer (MoFe) ranking loss, which ensures that task accuracy improves or remains stable as more modalities are made available. This aligns the model with an intuitive principle: removing one or more modalities should not increase accuracy. We validate SimMLM on multimodal medical image segmentation (BraTS 2018) and multimodal classification (UPMC Food-101, avMNIST) tasks, where it consistently surpasses competitive methods, demonstrating superior accuracy, interpretability, robustness, and reliability across both complete and missing modality scenarios at test time.
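The principle behind the More vs. Fewer ranking loss can be written as a hinge penalty on pairs of modality subsets. The sketch below is an assumption-laden illustration, not the paper's formulation: it uses task loss as a proxy for accuracy, penalizes a superset of modalities whenever its loss exceeds that of a subset, and uses hypothetical names (`mofe_style_loss`, the modality labels, and the `margin` parameter).

```python
def mofe_style_loss(task_loss_by_subset, margin=0.0):
    """Sum hinge penalties over every (fewer, more) pair of modality subsets.

    task_loss_by_subset maps a frozenset of modality names to the task loss
    obtained with exactly those modalities. A pair is penalized when the
    larger set performs worse (higher loss) than its proper subset.
    """
    total = 0.0
    for fewer, loss_fewer in task_loss_by_subset.items():
        for more, loss_more in task_loss_by_subset.items():
            if fewer < more:  # proper subset: `more` adds at least one modality
                total += max(0.0, loss_more - loss_fewer + margin)
    return total


# Toy losses for three nested modality sets (names are illustrative only).
losses = {
    frozenset({"t1"}): 0.50,
    frozenset({"t1", "flair"}): 0.60,        # more modalities, yet worse
    frozenset({"t1", "flair", "t2"}): 0.30,
}
penalty = mofe_style_loss(losses)
```

Only the pair that violates the ordering contributes, so the penalty here equals the 0.10 gap between the one-modality and two-modality losses.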
Poster
Zhixi Cai · Fucai Ke · Simindokht Jahangard · Maria Garcia de la Banda · Gholamreza Haffari · Peter Stuckey · Hamid Rezatofighi
[ Exhibit Hall I ]
Abstract
Visual Grounding (VG) tasks, such as referring expression detection and segmentation tasks, are important for linking visual entities to context, especially in complex reasoning tasks that require detailed query interpretation. This paper explores VG beyond basic perception, highlighting challenges for methods that require reasoning like human cognition. Recent advances in large language models (LLMs) and Vision-Language Models (VLMs) have improved abilities for visual comprehension, contextual understanding, and reasoning. These methods are mainly split into end-to-end and compositional methods, with the latter offering more flexibility. Compositional approaches that integrate LLMs and foundation models show promising performance but still struggle with complex reasoning with language-based logical representations. To address these limitations, we propose NAVER, a compositional visual grounding method that integrates explicit probabilistic logic reasoning within a finite-state automaton, equipped with a self-correcting mechanism. This design improves robustness and interpretability in inference through explicit logic reasoning. Our results show that NAVER achieves SoTA performance compared to recent end-to-end and compositional baselines.
Poster
Hui Sun · Shiyin Lu · Huanyu Wang · Qing-Guo Chen · Zhao Xu · Weihua Luo · Kaifu Zhang · Ming Li
[ Exhibit Hall I ]
Abstract
Video large language models (Video-LLMs) have made significant progress in understanding videos. However, processing multiple frames leads to lengthy visual token sequences, which presents challenges: the limited context length cannot accommodate the entire video, and the inclusion of irrelevant frames hinders visual perception. Hence, effective frame selection is crucial. This paper emphasizes that frame selection should follow three key principles: query relevance, list-wise diversity, and sequentiality. Existing methods, such as uniform frame sampling and query-frame matching, do not capture all of these principles. Thus, we propose Markov decision determinantal point process with dynamic programming (MDP$^3$) for frame selection, a training-free and model-agnostic method that can be seamlessly integrated into existing Video-LLMs. Our method first estimates frame similarities conditioned on the query using a conditional Gaussian kernel within the reproducing kernel Hilbert space (RKHS). We then apply the determinantal point process (DPP) to the similarity matrix to capture both query relevance and list-wise diversity. To incorporate sequentiality, we segment the video and apply DPP within each segment, conditioned on the preceding segment selection, modeled as a Markov decision process (MDP) for allocating selection sizes across segments. Theoretically, MDP$^3$ provides a $(1-1/e)$-approximate solution to the NP-hard list-wise frame selection problem with …
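To show how a DPP trades off query relevance against diversity, the sketch below runs plain greedy MAP selection on a kernel $L_{ij} = r_i \, S_{ij} \, r_j$, where $r$ is a per-frame relevance score and $S$ is a Gaussian similarity. It is a simplified stand-in, not MDP$^3$ itself: the conditional kernel, the per-segment conditioning, and the dynamic programming over segments are omitted, and the names `greedy_dpp`, `det`, and `sigma` are hypothetical.

```python
import math


def det(m):
    """Determinant via Gaussian elimination with partial pivoting."""
    a = [row[:] for row in m]
    n, d = len(a), 1.0
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(a[r][c]))
        if abs(a[p][c]) < 1e-12:
            return 0.0
        if p != c:
            a[c], a[p] = a[p], a[c]
            d = -d
        d *= a[c][c]
        for r in range(c + 1, n):
            f = a[r][c] / a[c][c]
            for k in range(c, n):
                a[r][k] -= f * a[c][k]
    return d


def greedy_dpp(features, relevance, k, sigma=1.0):
    """Greedily pick up to k frames that maximize det of the DPP kernel."""
    n = len(features)

    def sim(i, j):
        d2 = sum((x - y) ** 2 for x, y in zip(features[i], features[j]))
        return math.exp(-d2 / (2 * sigma ** 2))

    L = [[relevance[i] * sim(i, j) * relevance[j] for j in range(n)] for i in range(n)]
    chosen = []
    for _ in range(k):
        best, best_det = None, 0.0
        for i in range(n):
            if i in chosen:
                continue
            idx = chosen + [i]
            value = det([[L[a][b] for b in idx] for a in idx])
            if value > best_det:
                best, best_det = i, value
        if best is None:  # no remaining frame adds positive volume
            break
        chosen.append(best)
    return sorted(chosen)  # return frames in temporal order


# Frames 0 and 1 are near-duplicates; frame 2 is different but less relevant.
frame_features = [[0.0], [0.05], [3.0]]
frame_relevance = [1.0, 0.9, 0.8]
selected = greedy_dpp(frame_features, frame_relevance, k=2)
```

A pure relevance ranking would pick frames 0 and 1; the determinant penalizes their near-duplication, so the greedy DPP selects frames 0 and 2 instead.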
Poster
Hongqiu Wang · Wu Chen · Xiangde Luo · Zhaohu Xing · Lihao Liu · Jing Qin · Shaozhi Wu · Lei Zhu
[ Exhibit Hall I ]
Abstract
Fairness in AI-assisted medical image analysis is crucial for equitable healthcare, but is often neglected, especially in cross-domain scenarios (diverse patient demographics and imaging protocols) that are prevalent in medical applications. Effective and equitable deployment of AI models in these scenarios is critical, yet traditional Unsupervised Domain Adaptation (UDA) methods exhibit limited improvements. Emerging Active Domain Adaptation (ADA) approaches offer more effective enhancements, but all ignore fairness issues, exacerbating biased outcomes. Therefore, in this work, we propose the first fairness-aware ADA paradigm that simultaneously achieves both enhanced fairness and superior overall performance. Our method leverages the multimodal alignment capability of Vision-Language Models (VLMs): by jointly learning from medical images (vision) and sensitive attributes (language), the VLM inherently captures semantic correlations between visual features and protected attributes, enabling explicit attribute representation. Building on this foundation, we further devise an attribute-aware strategy (FairAP), which dynamically adapts to diverse patient demographics to promote equitable and high-quality outcomes by considering both Attribute and Polysemy. Extensive experiments on the FairDomain benchmark demonstrate that our method significantly reduces bias and maintains state-of-the-art performance in segmentation tasks, outperforming existing UDA and ADA methods. This work pioneers a VLM-driven ADA paradigm for fair cross-domain medical segmentation, offering a blueprint for …
Poster
Maoxian Wan · Kaige Li · Qichuan Geng · Weimin Shi · Zhong Zhou
[ Exhibit Hall I ]
Abstract
Existing incremental few-shot semantic segmentation (IFSS) methods often learn novel classes by fine-tuning parameters from previous stages. This inevitably reduces the distinguishability of old class features, leading to catastrophic forgetting and overfitting to limited new samples. In this paper, we propose a novel prompt-based IFSS method with a visual prompt pool to store and switch multi-granular knowledge across stages, enhancing the model's ability to learn new classes. Specifically, we introduce three levels of prompts: 1) Task-persistent prompts: capturing generalizable knowledge shared across stages, such as foreground-background distributions, to ensure consistent recognition guidance; 2) Stage-specific prompts: adapting to the unique requirements of each stage by integrating its discriminative knowledge (e.g., shape difference) with common knowledge from previous stages; and 3) Region-unique prompts: encoding category-specific structures (e.g., edges) to more accurately guide the model to retain local details. In particular, we introduce a prompt switching mechanism that adaptively allocates the knowledge required for base and new classes, which avoids interference between prompts, prevents catastrophic forgetting, and limits the growth in computation. Our method achieves new state-of-the-art performance, outperforming previous SoTA methods by 30.28% mIoU-N on VOC and 13.90% mIoU-N on COCO under the 1-shot setting.
Poster
Jiale Zhou · Wenhan Wang · Shikun Li · Xiaolei Qu · Xin Guo · Yizhong Liu · Wenzhong Tang · Xun Lin · Yefeng Zheng
[ Exhibit Hall I ]
Abstract
Tubular structure segmentation (TSS) is important for various applications, such as hemodynamic analysis and route navigation. Despite significant progress in TSS, domain shifts remain a major challenge, leading to performance degradation in unseen target domains. Unlike other segmentation tasks, TSS is more sensitive to domain shifts, as changes in topological structures can compromise segmentation integrity, and variations in local features distinguishing foreground from background (e.g., texture and contrast) may further disrupt topological continuity. To address these challenges, we propose Topology-enhanced Test-Time Adaptation (TopoTTA), the first test-time adaptation framework designed specifically for TSS. TopoTTA consists of two stages: Stage 1 adapts models to cross-domain topological discrepancies using the proposed Topological Meta Difference Convolutions (TopoMDCs), which enhance topological representation without altering pre-trained parameters; Stage 2 improves topological continuity by a novel Topology Hard sample Generation (TopoHG) strategy and prediction alignment on hard samples with pseudo-labels in the generated pseudo-break regions. Extensive experiments across four scenarios and ten datasets demonstrate TopoTTA's effectiveness in handling topological distribution shifts, achieving an average improvement of 31.81% in clDice. TopoTTA also serves as a plug-and-play TTA solution for CNN-based TSS models.
Poster
Rongchang Xie · Chen Du · Ping Song · Chang Liu
[ Exhibit Hall I ]
Abstract
We introduce MUSE-VL, a Unified Vision-Language Model through Semantic discrete Encoding for multimodal understanding and generation. Recently, the research community has begun exploring unified models for visual generation and understanding. However, existing vision tokenizers (e.g., VQGAN) only consider low-level information, which makes it difficult to align with language tokens. This results in high training complexity and necessitates a large amount of training data to achieve optimal performance. Additionally, their performance is still far from that of dedicated understanding models. This paper proposes Semantic Discrete Encoding (SDE), which effectively aligns the information of visual tokens and language tokens by adding semantic constraints to the visual tokenizer. This greatly reduces the amount of training data and improves the performance of the unified model. With the same LLM size, our method improves the understanding performance by 4.8% compared to the previous SOTA Emu3 and surpasses the dedicated understanding model LLaVA-NeXT 34B by 3.7%. For visual generation, our model achieves an FID score of 7.73 on MJHQ-30k, surpassing the existing unified models.
Poster
Yitong Jiang · Jinwei Gu · Tianfan Xue · Ka Chun Cheung · Pavlo Molchanov · Hongxu Yin · Sifei Liu
[ Exhibit Hall I ]
Abstract
Vision-Language Models (VLMs) excel at visual understanding by leveraging pretrained image encoders to generate visual tokens. However, they struggle with high-resolution images and zoomed-in regions due to the computational burden and token redundancy of uniform patch-based processing, often leading to the loss of critical details. To address these challenges, we propose Token-Efficient Vision Language Model (TEVA), a novel framework that detects key regions and applies dynamic patch sampling to efficiently capture fine-grained details while preserving global context. Our approach first identifies subject-oriented regions using an adaptive detection strategy. Then, a dynamic patch sampling mechanism selects and arranges patches at varying scales, ensuring efficient processing without increasing token count. Extensive experiments demonstrate that TEVA significantly enhances VLM performance in handling visual details, seamlessly integrating with various decoders and LLMs. Code and dataset will be released upon publication.
Poster
YASSER ABDELAZIZ DAHOU DJILALI · Ngoc Huynh · Phúc Lê Khắc · Wamiq Para · Ankit Singh · Sanath Narayan
[ Exhibit Hall I ]
Abstract
We present Saliency Benchmark (SalBench), a novel benchmark designed to assess the capability of Large Vision-Language Models (LVLMs) in detecting visually salient features that are readily apparent to humans, such as a large circle amidst a grid of smaller ones. This benchmark focuses on low-level features including color, intensity, and orientation, which are fundamental to human visual processing. SalBench consists of images that highlight rare, unusual, or unexpected elements within scenes and that naturally draw human attention. It comprises three novel tasks for evaluating the perceptual capabilities of LVLMs: Odd-One-Out Detection, Referring Odd-One-Out, and Visual Referring Odd-One-Out. We perform a comprehensive evaluation of state-of-the-art LVLMs using SalBench, and our findings reveal a surprising limitation: LVLMs struggle to identify seemingly obvious visual anomalies, with even the advanced GPT-4o achieving only 47.6% accuracy on such a simple task. SalBench will be an important step in measuring the capabilities of LVLMs that align with the subtle definition of human attention. The project is available at https://github.com/salbench/salbench.
Poster
Yuxuan Wang · Yiqi Song · Cihang Xie · Yang Liu · Zilong Zheng
[ Exhibit Hall I ]
Abstract
Recent advancements in large-scale video-language models have shown significant potential for real-time planning and detailed interactions. However, their high computational demands and the scarcity of annotated datasets limit their practicality for academic researchers. In this work, we introduce VideoLLaMB, a novel framework that utilizes temporal memory tokens within bridge layers to allow for the encoding of entire video sequences alongside historical visual data, effectively preserving semantic continuity and enhancing model performance across various tasks. This approach includes recurrent memory tokens and a SceneTilling algorithm, which segments videos into independent semantic units to preserve semantic integrity. Empirically, VideoLLaMB significantly outstrips existing video-language models, demonstrating a $4.2$-point improvement over its competitors across four VideoQA benchmarks, and $2.06$ points on egocentric planning. Remarkably, it maintains robust performance as PLLaVA even as video length increases up to $8\times$. Besides, the frame retrieval results on our specialized Needle in a Video Haystack (NIAVH) benchmark further validate VideoLLaMB's prowess in accurately identifying specific frames within lengthy videos. Our SceneTilling algorithm also enables the generation of streaming video captions directly, without necessitating additional training. In terms of efficiency, VideoLLaMB, trained on 16 frames, supports up to 320 frames on a single Nvidia A100 GPU with linear GPU memory …
Poster
Xinyu Mao · Xiaohan Xing · Fei MENG · Jianbang LIU · Fan BAI · Qiang Nie · Max Meng
[ Exhibit Hall I ]
Abstract
Polyp segmentation is vital for early colorectal cancer detection, yet traditional fully supervised methods struggle with morphological variability and domain shifts, requiring frequent retraining. Additionally, reliance on large-scale annotations is a major bottleneck due to the time-consuming and error-prone nature of polyp boundary labeling. Recently, vision foundation models like the Segment Anything Model (SAM) have demonstrated strong generalizability and fine-grained boundary detection with sparse prompts, effectively addressing key polyp segmentation challenges. However, SAM’s prompt-dependent nature limits automation in medical applications, since manually inputting prompts for each image is labor-intensive and time-consuming. We propose OP-SAM, a One-shot Polyp segmentation framework based on SAM that automatically generates prompts from a single annotated image, ensuring accurate and generalizable segmentation without additional annotation burdens. Our method introduces Correlation-based Prior Generation (CPG) for semantic label transfer and Scale-cascaded Prior Fusion (SPF) to adapt to polyp size variations as well as filter out noisy transfers. Instead of dumping all prompts at once, we devise Euclidean Prompt Evolution (EPE) for iterative prompt refinement, progressively enhancing segmentation quality. Extensive evaluations across five datasets validate OP-SAM’s effectiveness. Notably, on Kvasir, it achieves 76.93% IoU, surpassing the state-of-the-art by 11.44%. Source code will be released upon acceptance.
Poster
Shuyi Ouyang · Ziwei Niu · Hongyi Wang · Yen-wei Chen · Lanfen Lin
[ Exhibit Hall I ]
Abstract
Referring Visual Grounding (RVG) tasks revolve around utilizing vision-language interactions to incorporate object information from language expressions, thereby enabling targeted object detection or segmentation within images. Transformer-based methods have enabled effective interaction through attention mechanisms, achieving notable performance in RVG tasks. However, existing strategies for RVG, which involve direct interaction between visual and linguistic features, face three key challenges: (i) tendency to focus on a single target, (ii) insufficient control over linguistic noise, and (iii) high computational cost. To address these challenges, we propose a Region-aware Anchoring Mechanism (RaAM) that mediates vision-language interactions. In RaAM, region-aware anchors engage in alternating interactions with vision and language modalities, acting as indicators for object presence across different regions within the image. RaAM (i) directs attention to multiple target regions for better localization, (ii) reduces cross-modal redundancy by using anchors as buffers, and (iii) lowers time complexity. In addition, we design region and pixel level loss functions to enhance object presence assessment and edge precision. We evaluate our RaAM-RVG on four benchmark datasets and integrate RaAM into various models by replacing their interaction design. Results show that RaAM outperforms state-of-the-art methods with lower computational cost. Code will be released publicly.
Poster
Jinglei Zhang · Yuanfan Guo · Rolandos Alexandros Potamias · Jiankang Deng · Hang Xu · Chao Ma
[ Exhibit Hall I ]
Abstract
In recent years, video question answering based on multimodal large language models (MLLMs) has garnered considerable attention, benefiting from the substantial advancements in LLMs. However, these models have a notable deficiency in the domains of video temporal grounding and reasoning, posing challenges to the development of effective real-world video understanding systems. Inspired by how humans use video players to interact with the progress bar for video comprehension, we introduce VTimeCoT, a simple yet effective training-free framework designed for high-performance video grounding and reasoning. The proposed framework incorporates two novel visual tools built around the progress bar: a plug-and-play progress bar integration tool and a high-efficiency highlighting tool. In addition, to address the limitations of conventional text-based chain-of-thought (CoT) approaches, we introduce a visuotemporal CoT process that integrates cross-modality reasoning across both video and text. Our approach demonstrates significant performance improvements on both Qwen2VL-7B and GPT4o baselines in tasks of video temporal grounding and reasoning-based question answering. Finally, we showcase that the proposed framework achieves a compositional and interpretable reasoning process. The code will be made publicly available.
Poster
Ruyi Xu · Yen-Tzu Chiu · Tai-I Chen · Oscar Chew · Yung-Yu Chuang · Wen-Huang Cheng
[ Exhibit Hall I ]
Abstract
Anomaly generation has become essential in addressing the scarcity of defective samples in industrial anomaly inspection. However, existing training-based methods fail to handle complex anomalies and multiple defects simultaneously, especially when only a single anomaly sample is available per defect type. To address this issue, we propose TF-IDG, a novel training-free defect generation framework capable of generating diverse anomaly samples in a one-shot setting. We propose a Feature Alignment strategy that provides fine-grained appearance guidance by minimizing the distributional gap between generated and real defects with high complexity. Additionally, we introduce an Adaptive Anomaly Mask mechanism to mitigate the issue of defects with small regions being ignored during the generation process, enhancing consistency between synthetic defects and their corresponding masks. Finally, we incorporate a Texture Preservation module that extracts background information from anomaly-free images, ensuring that the visual properties of synthetic defects are seamlessly integrated into the image. Extensive experiments demonstrate the effectiveness of our method in generating accurate and diverse anomalies, further leading to superior performance in downstream anomaly inspection tasks.
Poster
Sebastian Höfer · Dorian Henning · Artemij Amiranashvili · Douglas Morrison · Mariliza Tzes · Ingmar Posner · Marc Matvienko · Alessandro Rennola · Anton Milan
[ Exhibit Hall I ]
Abstract
We present a novel large-scale dataset for defect detection in a logistics setting. Recent work on industrial anomaly detection has primarily focused on manufacturing scenarios with highly controlled poses and a limited number of object categories. Existing benchmarks like MVTec-AD (Bergmann et al., 2021) and VisA (Zou et al., 2022) have reached saturation, with state-of-the-art methods achieving up to 99.9% AUROC scores. In contrast to manufacturing, anomaly detection in retail logistics faces new challenges, particularly in the diversity and variability of viewpoints and object appearances. Leading anomaly detection methods fall short when applied to this new setting. To bridge this gap, we introduce a new benchmark that overcomes the current limitations of existing datasets. With over 230,000 images (29,000 defective instances), it is 40 times larger than MVTec and contains more than 46,000 distinct objects. To validate the difficulty of the problem, we conduct an extensive evaluation of multiple state-of-the-art anomaly detection methods, demonstrating that they achieve only 56.9% AUC on our dataset. Further qualitative analysis confirms that existing methods struggle to leverage normal samples under heavy pose and appearance variation. With our large-scale dataset, we set a new benchmark and encourage future research towards solving this challenging problem in retail …
Poster
Hui Lu · Albert Ali Salah · Ronald Poppe
[ Exhibit Hall I ]
Abstract
Video understanding requires the extraction of rich spatio-temporal representations, achieved by transformer models through self-attention. Unfortunately, self-attention poses a computational burden. In NLP, Mamba has surfaced as an efficient alternative to transformers. However, Mamba's successes do not trivially extend to vision tasks, including those in video analysis. In this paper, we theoretically analyze the differences between self-attention and Mamba. We identify two limitations in Mamba's token processing: historical decay and element contradiction. We propose VideoMambaPro (VMP), which addresses these limitations by adding masked backward computation and elemental residual connections to a VideoMamba backbone. VideoMambaPro models surpass VideoMamba by 1.6-3.0% and 1.1-1.9% top-1 on Kinetics-400 and Something-Something V2, respectively. Even without extensive pre-training, our models present an attractive and efficient alternative to current transformer models. Moreover, our two solutions are orthogonal to recent advances in Vision Mamba models, and are likely to provide further improvements in future models.
Poster
Ke Niu · Haiyang Yu · Mengyang Zhao · Teng Fu · Siyang Yi · Wei Lu · Bin Li · Xuelin Qian · Xiangyang Xue
[ Exhibit Hall I ]
Abstract
Person re-identification (Re-ID) is a crucial task in computer vision, aiming to recognize individuals across non-overlapping camera views. While recent advanced vision-language models (VLMs) excel in logical reasoning and multi-task generalization, their applications in Re-ID tasks remain limited. They either struggle to perform accurate matching based on identity-relevant features or assist image-dominated branches as auxiliary semantics. In this paper, we propose a novel framework, ChatReID, that shifts the focus towards a text-side-dominated retrieval paradigm, enabling flexible and interactive re-identification. To integrate the reasoning abilities of language models into Re-ID pipelines, we first present a large-scale instruction dataset, which contains more than 8 million prompts to promote the model fine-tuning. Next, we introduce a hierarchical progressive tuning strategy, which endows Re-ID ability through three stages of tuning, i.e., from person attribute understanding to fine-grained image retrieval and to multi-modal task reasoning. Extensive experiments across ten popular benchmarks demonstrate that ChatReID outperforms existing methods, achieving state-of-the-art performance in all Re-ID tasks. More experiments demonstrate that ChatReID not only has the ability to recognize fine-grained details but also to integrate them into a coherent reasoning process.
Poster
Fan Li · Xuanbin Wang · Xuan Wang · Zhaoxiang Zhang · yuelei xu
[ Exhibit Hall I ]
Abstract
Recently, open-vocabulary semantic segmentation has garnered growing attention. Most current methods leverage vision-language models like CLIP to recognize unseen categories through their zero-shot capabilities. However, CLIP struggles to establish potential spatial dependencies among scene objects due to its holistic pre-training objective, causing sub-optimal results. In this paper, we propose a DEnoising learning framework based on the Diffusion model for Open-vocabulary semantic Segmentation, called DEDOS, which is aimed at constructing the scene skeleton. Motivation stems from the fact that diffusion models incorporate not only the visual appearance of objects but also embed rich scene spatial priors. Our core idea is to view images as labels embedded with "noise"—non-essential details for perceptual tasks—and to disentangle the intrinsic scene prior from the diffusion feature during the denoising process of the images. Specifically, to fully harness the scene prior knowledge of the diffusion model, we introduce learnable proxy queries during the denoising process. Meanwhile, we leverage the robustness of CLIP features to texture shifts as supervision, guiding proxy queries to focus on constructing the scene skeleton and avoiding interference from texture information in the diffusion feature space. Finally, we enhance spatial understanding within CLIP features using proxy queries, which also serve as an interface …
Poster
Osman Ülger · Maksymilian Kulicki · Yuki Asano · Martin Oswald
[ Exhibit Hall I ]
Abstract
Open-Vocabulary Segmentation (OVS) methods are capable of performing semantic segmentation without relying on a fixed vocabulary, and in some cases, without training or fine-tuning. However, OVS methods typically require a human in the loop to specify the vocabulary based on the task or dataset at hand. In this paper, we introduce Auto-Vocabulary Semantic Segmentation (AVS), advancing open-ended image understanding by eliminating the necessity to predefine object categories for segmentation. Our approach, AutoSeg, presents a framework that autonomously identifies relevant class names using semantically enhanced BLIP embeddings and segments them afterwards. Given that open-ended object category predictions cannot be directly compared with a fixed ground truth, we develop a Large Language Model-based Auto-Vocabulary Evaluator (LAVE) to efficiently evaluate the automatically generated class names and their corresponding segments. With AVS, our method sets new benchmarks on datasets PASCAL VOC, Context, ADE20K, and Cityscapes, while showing competitive performance to OVS methods that require specified class names. All code will be publicly released.
Poster
Tom Nuno Wolf · Emre Kavak · Fabian Bongratz · Christian Wachinger
[ Exhibit Hall I ]
Abstract
The deployment of deep learning models in critical domains necessitates a balance between high accuracy and interpretability. We introduce SIC, an inherently interpretable neural network that provides local and global explanations of its decision-making process. Leveraging the concept of case-based reasoning, SIC extracts class-representative support vectors from training images, ensuring they capture relevant features while suppressing irrelevant ones. Classification decisions are made by calculating and aggregating similarity scores between these support vectors and the input's latent feature vector. We employ B-Cos transformations, which align model weights with inputs, to yield coherent pixel-level explanations in addition to global explanations of case-based reasoning. We evaluate SIC on three tasks: fine-grained classification on Stanford Dogs and FunnyBirds, multi-label classification on Pascal VOC, and pathology detection on the RSNA dataset. Results indicate that SIC not only achieves competitive accuracy compared to state-of-the-art black-box and inherently interpretable models but also offers insightful explanations verified through practical evaluation on the FunnyBirds benchmark. Our theoretical analysis proves that these explanations fulfill established axioms for explanations. Our findings underscore SIC's potential for applications where understanding model decisions is as critical as the decisions themselves.
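The classification rule described in this abstract (similarity scores between support vectors and the input's latent vector, aggregated per class) can be sketched as follows. Cosine similarity and mean aggregation are assumptions made for illustration; the abstract does not state which similarity or aggregation SIC uses, and the B-Cos transformations are omitted. The names `classify` and `support_vectors` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def classify(latent, support_vectors):
    """Score each class by its mean similarity to the latent vector.

    support_vectors maps a class name to a list of support vectors.
    Returns the predicted class and the per-class scores, which double
    as case-based evidence for the decision.
    """
    scores = {}
    for label, vectors in support_vectors.items():
        sims = [cosine(latent, s) for s in vectors]
        scores[label] = sum(sims) / len(sims)
    predicted = max(scores, key=scores.get)
    return predicted, scores
```

Because every score traces back to specific support vectors, the per-class scores can be inspected directly to see which stored cases drove a prediction.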
Poster
Zuhao Yang · Yingchen Yu · Yunqing Zhao · Shijian Lu · Song Bai
[ Exhibit Hall I ]
Abstract
Video Temporal Grounding (VTG) aims to precisely identify video event segments in response to textual queries. The outputs of VTG tasks manifest as sequences of events, each defined by precise timestamps, saliency scores, and textual descriptions. Despite recent advances, a fundamental limitation persists in existing Video Large Language Models (Video-LLMs): they process all task tokens through identical and static pathways, failing to recognize that temporal localization, saliency assessment, and textual generation represent fundamentally distinct tasks requiring specialized processing. To address this, we introduce TimeExpert, a Mixture-of-Experts (MoE)-based Video-LLM that effectively decomposes VTG tasks by dynamically routing task-specific tokens (e.g., timestamps, saliency scores) to specialized experts, with increased computational efficiency. Our design choices enable precise handling of each subtask, leading to improved event modeling across diverse VTG applications. Extensive experiments demonstrate that TimeExpert consistently achieves state-of-the-art performance on various VTG tasks such as Dense Video Captioning, Moment Retrieval, and Video Highlight Detection.
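The routing idea described in this abstract can be illustrated with a generic top-1 Mixture-of-Experts dispatch. This is a minimal sketch, assuming a linear gate followed by a softmax; `route_tokens`, `gate_weights`, and the toy experts are hypothetical names, and TimeExpert's actual router is not specified in the abstract.

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def route_tokens(tokens, gate_weights, experts):
    """Send each token to its highest-scoring expert (top-1 routing).

    gate_weights holds one weight row per expert; experts is a list of
    callables mapping a token vector to an output vector. Each output is
    scaled by the gate probability of the chosen expert.
    """
    outputs, assignments = [], []
    for tok in tokens:
        logits = [sum(w * x for w, x in zip(row, tok)) for row in gate_weights]
        probs = softmax(logits)
        chosen = max(range(len(probs)), key=probs.__getitem__)
        outputs.append([probs[chosen] * v for v in experts[chosen](tok)])
        assignments.append(chosen)
    return outputs, assignments
```

Only one expert runs per token, which is where the computational saving of sparse routing comes from: the cost per token stays constant as more experts are added.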
Poster
Shraman Pramanick · Effrosyni Mavroudi · Yale Song · Rama Chellappa · Lorenzo Torresani · Triantafyllos Afouras
[ Exhibit Hall I ]
Abstract
We introduce ED-VTG, a method for fine-grained video temporal grounding utilizing multimodal large language models. Our approach harnesses the capabilities of multimodal LLMs to jointly process text and video, in order to effectively localize natural language queries in videos through a two-stage process. Rather than being directly grounded, language queries are initially transformed into enriched sentences that incorporate missing details and cues to aid in grounding. In the second stage, these enriched queries are grounded using a lightweight decoder, which specializes in predicting accurate boundaries conditioned on contextualized representations of the enriched queries. To mitigate noise and reduce the impact of hallucinations, our model is trained with a multiple-instance-learning objective that dynamically selects the optimal version of the query for each training sample. We demonstrate state-of-the-art results across various benchmarks in temporal video grounding and paragraph grounding settings. Experiments reveal that our method significantly outperforms all previously proposed LLM-based temporal grounding approaches and is either superior or comparable to specialized models, while maintaining a clear advantage against them in zero-shot evaluation scenarios.
Poster
Hongliang Zhou · Yongxiang Liu · Canyu Mo · Weijie Li · Bowen Peng · Li Liu
[ Exhibit Hall I ]
Abstract
Few-shot object detection aims to detect novel classes with limited samples. Due to boundary and scale discrepancies with base classes, novel classes exhibit suboptimal performance under limited samples. Although recent methods leverage rich semantic representations of pretrained ViT to overcome limitations of model fine-tuning, thereby enhancing novel class performance, designing a ViT architecture that addresses boundary and scale issues to balance base and novel class performance remains challenging: (1) modeling feature distinctions at object boundaries at pixel level while preserving global information; and (2) applying scale-specific extraction for images containing multiscale objects, adaptively capturing local details and global contours. We therefore propose the Pixel Difference Vision Transformer (PiDiViT). Its innovations include: (1) a difference convolution fusion module (DCFM), which achieves precise object boundary localization and effective preservation of global object information by integrating direction-sensitive differential feature maps of pixel neighborhoods with original feature maps; and (2) a multiscale feature fusion module (MFFM), which adaptively fuses features extracted by convolutional kernels at five different scales using a scale attention mechanism to generate attention weights, achieving an optimal balance between local detail and global semantic information extraction. PiDiViT achieves SOTA on the COCO benchmark: surpassing few-shot detection SOTA by 2.7 nAP50 (10-shot) and 4.0 nAP50 (30-shot) for …
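The idea of fusing pixel-difference feature maps with the original features can be sketched on a single-channel image. This sketch assumes the central-difference variant, in which each 3x3 neighbor is replaced by its difference from the center pixel, and a simple additive fusion with a hypothetical weight `alpha`. The abstract does not specify which difference patterns or fusion rule DCFM uses.

```python
def central_difference_conv(image, weights):
    """3x3 central pixel-difference convolution on a 2D list.

    Computes sum_k weights[k] * (neighbor_k - center) at each interior
    pixel. Border pixels are left at 0.0 for simplicity.
    """
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            center = image[y][x]
            acc = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    acc += weights[dy + 1][dx + 1] * (image[y + dy][x + dx] - center)
            out[y][x] = acc
    return out

def fuse(image, diff_map, alpha=1.0):
    """Additive fusion of the original map and the difference map (assumed)."""
    return [[p + alpha * d for p, d in zip(row_p, row_d)]
            for row_p, row_d in zip(image, diff_map)]
```

On a flat region the difference response is zero, so the fused map equals the input; the response is nonzero only where intensity changes, which is why difference convolutions emphasize boundaries.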
Poster
Zhe Li · Lei Zhang · Zheren Fu · Kun Zhang · Zhendong Mao
[ Exhibit Hall I ]
Abstract
Zero-Shot Composed Image Retrieval (ZS-CIR) aims to retrieve the target image based on a reference image and a text describing the user's intention without training on the triplet datasets. The key to this task is to make specified changes to specific objects in the reference image based on the text. Previous works generate single or multiple pseudo words by projecting the reference image to the word embedding space. However, these methods ignore the fact that the editing objects of CIR are naturally hierarchical, and lack the ability of text adaptation, thus failing to adapt to multi-level editing needs. In this paper, we argue that hierarchical object decomposition is the key to learning pseudo words, and propose a hierarchy-aware dynamic pseudo word learning (HIT) framework equipped with HIerarchy semantic parsing and Text-adaptive filtering. The proposed HIT enjoys several merits. First, HIT dynamically decomposes the image into editing objects of different granularities, guided by a set of learnable group tokens, thus naturally forming hierarchical semantic concepts. Second, the text-adaptive filtering strategy is proposed to screen out specific objects from different levels based on the text, so as to learn hierarchical pseudo words that meet diverse …
Poster
Zeyu Xi · Haoying Sun · Yaofei Wu · Junchi Yan · Haoran Zhang · Lifang Wu · Liang Wang · Chang Wen Chen
[ Exhibit Hall I ]
Abstract
Existing sports video captioning methods often focus on the content yet overlook player identities, limiting their applicability. Although existing methods integrate extra information to generate identity-aware descriptions, player identities are sometimes incorrect because the extra information is independent of the video content. This paper introduces a player-centric multimodal prompt generation network for identity-aware sports video captioning (LLM-VC), which focuses on recognizing player identity from a visual perspective. Specifically, an identity-related information extraction module (IRIEM) is designed to extract player-related multimodal embeddings. IRIEM includes a player identification network (PIN) for extracting visual features and player names, and a bidirectional semantic interaction module (BSIM) to link player features with video content for mutual enhancement. Additionally, a visual context learning module (VCLM) is designed to capture the key video context information. Finally, the outputs of the above modules are integrated as the multimodal prompt for the large language model (LLM), which facilitates the generation of descriptions with player identities. To support this work, we construct NBA-Identity, a large identity-aware basketball video captioning dataset with 9,726 videos covering 9 event types. The experimental results on NBA-Identity and VC-NBA-2022 demonstrate that our proposed model achieves advanced performance.
Poster
Wooseong Jeong · Jegyeong Cho · Youngho Yoon · Kuk-Jin Yoon
[ Exhibit Hall I ]
Abstract
Generalizing neural networks to unseen target domains is a significant challenge in real-world deployments. Test-time training (TTT) addresses this by using an auxiliary self-supervised task to reduce the domain gap caused by distribution shifts between the source and target. However, we find that when models are required to perform multiple tasks under domain shifts, conventional TTT methods suffer from unsynchronized task behavior, where the adaptation steps needed for optimal performance in one task may not align with the requirements of other tasks. To address this, we propose a novel TTT approach called Synchronizing Tasks for Test-time Training (S4T), which enables the concurrent handling of multiple tasks. The core idea behind S4T is that predicting task relations across domain shifts is key to synchronizing tasks during test time. To validate our approach, we apply S4T to conventional multi-task benchmarks, integrating it with traditional TTT protocols. Our empirical results show that S4T outperforms state-of-the-art TTT methods across various benchmarks.
Poster
Ru Zeng · Yan Song · Yang ZHANG · yanlinghu yanlinghu · Hui Yu
[ Exhibit Hall I ]
Abstract
GLOM, an innovative departure from standard deep learning architectures, has recently attracted particular attention due to its good interpretability in representing part-whole relationships in computer vision. However, GLOM faces challenges in achieving agreement and is usually computationally demanding. First, current implementations struggle to produce identical vectors that reliably converge to represent nodes in a parse tree. Second, GLOM is computationally intensive due to the need to maintain equal resolution across all levels. To address these issues, inspired by contrastive learning, we propose a contrastive agreement enhancer (CAE), which effectively promotes agreement between positive embedding pairs while pushing apart negative pairs, thereby facilitating the formation of distinct "islands." Furthermore, we introduce a dissimilarity-focused head ($ H_d $) to reduce redundancy in the top-level embeddings, where embedding weights for downsampling are negatively correlated with similarity within a sliding window. The results of comparison experiments indicate that the proposed approach retains informative content while significantly reducing the number of parameters. Additionally, the ablation experiments and visualization results demonstrate that CAE successfully promotes islands of agreement.
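The contrastive principle mentioned in this abstract (pulling positive embedding pairs together while pushing negatives apart) can be illustrated with a generic InfoNCE-style loss. This is a minimal sketch under assumptions: the function names, the temperature value, and the use of cosine similarity are illustrative choices, not the authors' CAE.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot / (na * nb)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE loss: low when the anchor agrees with its positive
    and disagrees with every negative. The temperature is an assumed value."""
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))


# An anchor that matches its positive gives a much smaller loss than one that
# matches a negative instead.
agreeing = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
disagreeing = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
```

Minimizing such a loss over neighboring embeddings is one plausible way to encourage the "islands" of near-identical vectors the abstract describes.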
Poster
Maximilian Ulmer · Wout Boerdijk · Rudolph Triebel · Maximilian Durner
[ Exhibit Hall I ]
Abstract
This paper presents OC-DiT, a novel class of diffusion models designed for object-centric prediction, and applies it to zero-shot instance segmentation. We propose a conditional latent diffusion framework that generates instance masks by conditioning the generative process on object templates and image features within the diffusion model's latent space. This allows our model to effectively disentangle object instances through the diffusion process, which is guided by visual object descriptors and localized image cues. Specifically, we introduce two model variants: a coarse model for generating initial object instance proposals, and a refinement model that refines all proposals in parallel. We train these models on a newly created, large-scale synthetic dataset comprising thousands of high-quality object meshes. Remarkably, our model achieves state-of-the-art performance on multiple challenging real-world benchmarks, without requiring any retraining on target data. Through comprehensive ablation studies, we demonstrate the potential of diffusion models for instance segmentation tasks. Code and the synthetic dataset will be publicly released.
Poster
Tommaso Galliena · Tommaso Apicella · Stefano Rosa · Pietro Morerio · ALESSIO DEL BUE · Lorenzo Natale
[ Exhibit Hall I ]
Abstract
We present a self-supervised method to improve an agent's abilities in describing arbitrary objects while actively exploring a generic environment. This is a challenging problem, as current models struggle to obtain coherent image captions due to different camera viewpoints and clutter. We propose a three-phase framework to fine-tune existing captioning models that enhances caption accuracy and consistency across views via a consensus mechanism. First, an agent explores the environment, collecting noisy image-caption pairs. Then, a consistent pseudo-caption for each object instance is distilled via consensus using a large language model. Finally, these pseudo-captions are used to fine-tune an off-the-shelf captioning model, with the addition of contrastive learning. We analyse the performance of the combination of captioning models, exploration policies, pseudo-labeling methods, and fine-tuning strategies, on our manually labeled test set. Results show that a policy can be trained to mine samples with higher disagreement compared to classical baselines. Our pseudo-captioning method, in combination with all policies, has a higher semantic similarity compared to other existing methods, and fine-tuning improves caption accuracy and consistency by a significant margin. Code and test set annotations will be released upon paper acceptance.
Poster
Yuhui Zeng · Haoxiang Wu · Wenjie Nie · Xiawu Zheng · Guangyao Chen · Yunhang Shen · Jun Peng · Yonghong Tian · Rongrong Ji
[ Exhibit Hall I ]
Abstract
Current object detectors excel at entity localization and classification, yet exhibit inherent limitations in event recognition capabilities. This deficiency arises from their architecture's emphasis on discrete object identification rather than modeling the compositional reasoning, inter-object correlations, and contextual semantics essential for comprehensive event understanding. To address this challenge, we present a novel framework that expands the capability of standard object detectors beyond mere object recognition to complex event understanding through LLM-guided symbolic reasoning. Our key innovation lies in bridging the semantic gap between object detection and event understanding without requiring expensive task-specific training. The proposed plug-and-play framework interfaces with any open-vocabulary detector while extending their inherent capabilities across architectures. At its core, our approach combines (i) a symbolic regression mechanism exploring relationship patterns among detected entities and (ii) an LLM that strategically guides the search toward meaningful expressions. These discovered symbolic rules transform low-level visual perception into interpretable event understanding, providing a transparent reasoning path from objects to events with strong transferability across domains. We compared our training-free framework against specialized event recognition systems across diverse application domains. Experiments demonstrate that our framework enhances multiple object detector architectures to recognize complex events such as illegal fishing activities ($\textbf{75}$% AUROC, $\textbf{+8.36}$% improvement), construction …
Poster
Min Cen · Zhenfeng Zhuang · Yuzhe Zhang · Min Zeng · Baptiste Magnier · Lequan Yu · Hong Zhang · Liansheng Wang
[ Exhibit Hall I ]
Abstract
Graph-based Multiple Instance Learning (MIL) is commonly applied in survival analysis using Hematoxylin and Eosin (H&E)-stained whole slide images (WSIs) because it effectively captures topological information. However, variations in WSI preparation, such as differences in staining and scanning, can introduce semantic bias. Additionally, topological subgraphs that are not relevant to the causal relationships can create noise, resulting in biased slide-level representations. These issues can hinder both the interpretability and generalization of the analysis. To address these issues, we introduce a dual structural causal model as the theoretical foundation and further propose a novel and interpretable dual causal graph-based MIL model, named C$^2$MIL, for robust survival analysis. C$^2$MIL adopts a novel cross-scale adaptive feature disentangling module for semantic causal intervention and a new Bernoulli differentiable causal subgraph sampling method for topological causal discovery. A joint optimization strategy integrating disentangling supervision and contrastive learning is proposed to ensure simultaneous refinement of semantic and topological causalities. Experimental results reveal that C$^2$MIL outperforms existing methods in both generalization and interpretability and can serve as a causal enhancement for various MIL baselines. The code will be available later.
Poster
Shijie Ma · Yuying Ge · Teng Wang · Yuxin Guo · Yixiao Ge · Ying Shan
[ Exhibit Hall I ]
Abstract
The synergy between generative and discriminative models receives growing attention. While discriminative Contrastive Language-Image Pre-Training (CLIP) excels in high-level semantics, it struggles with perceiving fine-grained visual details. Generally, to enhance representations, generative models take CLIP's visual features as conditions for reconstruction. However, the underlying principle remains underexplored. In this work, we empirically found that **visually** perfect generations are not always optimal for representation enhancement. The essence lies in effectively extracting fine-grained knowledge from generative models while mitigating irrelevant information. To explore critical factors, we delve into three aspects: (1) Conditioning mechanisms: We found that even a small number of local tokens can drastically reduce the difficulty of reconstruction, leading to collapsed training. We thus conclude that utilizing **only** global visual tokens as conditions is the most effective strategy. (2) Denoising configurations: We observed that end-to-end training introduces extraneous information. To address this, we propose a two-stage training strategy to prioritize learning useful visual knowledge. Additionally, we demonstrate that lightweight denoisers can yield remarkable improvements. (3) Generation paradigms: We explore both continuous and discrete denoisers with desirable outcomes, validating the versatility of our method. Through our in-depth exploration, we have finally arrived at an effective method that consistently outperforms prior arts …
Poster
Junhao Xiao · Yang Wei · Jingyu Wang · Yongchao Wang · Xiuli Bi · Bin Xiao
[ Exhibit Hall I ]
Abstract
Morphological differences and dense spatial relations of organs make multi-organ segmentation challenging. Current segmentation networks, primarily based on CNNs and Transformers, represent organs by aggregating information within fixed regions. However, aggregated representations often fail to accurately describe the shape differences and spatial relationships of multi-organs, which leads to imprecise morphological modeling and ambiguous boundary representation. In response, we propose a novel multi-organ segmentation network via dynamic graph reconstruction, called DGRNet. Unlike existing approaches, DGRNet employs a graph-based paradigm to reconstruct multi-organs and leverages the topological flexibility of graphs to represent irregular organ morphology. Based on graph connectivity, the precise information interaction makes more efficient multi-organ modeling. Furthermore, DGRNet introduces a category-aware guidance mechanism that utilizes organ category-specific priors to explicitly define inter-organ boundaries, addressing the issue of ambiguous margin delineation in multi-organ regions. We conducted extensive experiments on five datasets (including CT and MRI), showing that DGRNet outperforms state-of-the-art methods and models complex multi-organ areas better, highlighting its effectiveness and robustness.
Poster
Bin Xie · Hao Tang · Bin Duan · Dawen Cai · Yan Yan · Gady Agam
[ Exhibit Hall I ]
Abstract
The Segment Anything Model (SAM), a prompt-driven foundation model for natural image segmentation, has demonstrated impressive zero-shot performance. However, SAM is not directly applicable to medical image segmentation due to its inability to predict semantic labels, reliance on additional prompts, and suboptimal performance in this domain. To address these limitations, we propose MaskSAM, a novel prompt-free SAM adaptation framework for medical image segmentation based on mask classification. MaskSAM introduces a prompt generator integrated with SAM’s image encoder to produce auxiliary classifier tokens, binary masks, and bounding boxes. Each pair of auxiliary mask and box prompts eliminates the need for user-provided prompts. Semantic label prediction is enabled by summing the auxiliary classifier tokens with learnable global classifier tokens in SAM’s mask decoder. Additionally, we design a 3D depth-convolution adapter for image embeddings and a 3D depth-MLP adapter for prompt embeddings, which are injected into each transformer block in the image encoder and mask decoder to efficiently fine-tune SAM for volumetric medical imaging. Our method achieves state-of-the-art performance, with a Dice score of 90.52% on AMOS2022, outperforming nnUNet by 2.7%. MaskSAM also surpasses nnUNet by 1.7% on ACDC and 1.0% on the Synapse dataset, demonstrating its effectiveness in medical image segmentation.
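The Dice score reported in this abstract is the standard overlap metric $2|A \cap B| / (|A| + |B|)$ between a predicted mask $A$ and a reference mask $B$. A minimal sketch for flat binary masks follows; the function name and the convention of returning 1.0 when both masks are empty are assumptions, and the benchmark's exact per-class averaging protocol is not reproduced here.

```python
def dice_score(pred, target):
    """Dice coefficient between two flat binary masks of equal length.

    Returns 1.0 when both masks are empty (an assumed convention).
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of elements")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        return 1.0
    return 2.0 * intersection / total


# One overlapping pixel, two predicted pixels, one reference pixel:
# Dice = 2 * 1 / (2 + 1) = 2/3.
example = dice_score([1, 1, 0, 0], [1, 0, 0, 0])
```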
Poster
Evangelos Kazakos · Cordelia Schmid · Josef Sivic
[ Exhibit Hall I ]
Abstract
We propose a novel approach for captioning and object grounding in video, where the objects in the caption are grounded in the video via temporally dense bounding boxes. We introduce the following contributions. First, we present a large-scale automatic annotation method that aggregates captions grounded with bounding boxes across individual frames into temporally dense and consistent bounding box annotations. We apply this approach on the HowTo100M dataset to construct a large-scale pre-training dataset, named HowToGround1M. We also introduce a Grounded Video Caption Generation model, dubbed GROVE, and pre-train the model on HowToGround1M. Second, we introduce a new dataset, called iGround, of 3500 videos with manually annotated captions and dense spatio-temporally grounded bounding boxes. This allows us to measure progress on this challenging problem, as well as to fine-tune our model on this small-scale but high-quality data. Third, we demonstrate that our approach achieves state-of-the-art results on the proposed iGround dataset compared to a number of baselines, as well as on the VidSTG and ActivityNet-Entities datasets. We perform extensive ablations that demonstrate the importance of pre-training using our automatically annotated HowToGround1M dataset followed by fine-tuning on the manually annotated iGround dataset and validate the key technical contributions of our model. Data, …
Poster
Huy Ta · Duy Anh Huynh · Yutong Xie · Yuankai Qi · Qi Chen · Phi Le Nguyen · Sen Tran · Son Lam Phung · Anton Hengel · Zhibin Liao · Minh-Son To · Johan Verjans · Vu Phan
[ Exhibit Hall I ]
Abstract
Visual grounding (VG) is the capability to identify the specific regions in an image associated with a particular text description. In medical imaging, VG enhances interpretability by highlighting relevant pathological features corresponding to textual descriptions, improving model transparency and trustworthiness for wider adoption of deep learning models in clinical practice. Current models struggle to associate textual descriptions with disease regions due to inefficient attention mechanisms and a lack of fine-grained token representations. In this paper, we empirically demonstrate two key observations. First, current VLMs assign high norms to background tokens, diverting the model's attention from regions of disease. Second, the global tokens used for cross-modal learning are not representative of local disease tokens. This hampers identifying correlations between the text and disease tokens. To address this, we introduce a simple yet effective Disease-Aware Prompting (DAP) process, which uses the explainability map of a VLM to identify the appropriate image features. This simple strategy amplifies disease-relevant regions while suppressing background interference. Without any additional pixel-level annotations, DAP improves visual grounding accuracy by 20.74% compared to state-of-the-art methods across three major chest X-ray datasets.
Poster
Xiao Li · Yiming Zhu · Yifan Huang · Wei Zhang · Yingzhe He · Jie Shi · Xiaolin Hu
[ Exhibit Hall I ]
Abstract
Object detection plays a crucial role in many security-sensitive applications, such as autonomous driving and video surveillance. However, several recent studies have shown that object detectors can be easily fooled by physically realizable attacks, e.g., adversarial patches and recent adversarial textures, which pose realistic and urgent threats. Adversarial Training (AT) has been recognized as the most effective defense against adversarial attacks. While AT has been extensively studied in the $l_\infty$-bounded attack settings on classification models, AT against physically realizable attacks on object detectors has received limited exploration. Early attempts are only performed to defend against adversarial patches, leaving AT against a wider range of physically realizable attacks under-explored. In this work, we consider defending against various physically realizable attacks with a unified AT method. We propose PBCAT, a novel Patch-Based Composite Adversarial Training strategy. PBCAT optimizes the model by incorporating the combination of small-area gradient-guided adversarial patches and imperceptible global adversarial perturbations covering the entire image. With these designs, PBCAT has the potential to defend against not only adversarial patches but also unseen physically realizable attacks such as adversarial textures. Extensive experiments in multiple settings demonstrated that PBCAT significantly improved robustness against various physically realizable attacks over state-of-the-art defense methods. …
Poster
Dongheon Lee · Seokju Yun · Youngmin Ro
[ Exhibit Hall I ]
Abstract
In this paper, we tackle the high computational cost of transformers for lightweight image super-resolution (SR). Motivated by the observations of self-attention's inter-layer repetition, we introduce a convolutionized self-attention module named Convolutional Attention (ConvAttn) that emulates self-attention's long-range modeling capability and instance-dependent weighting with a single shared large kernel and dynamic kernels. By utilizing the ConvAttn module, we significantly reduce the reliance on self-attention and its involved memory-bound operations while maintaining the representational capability of transformers. Furthermore, we overcome the challenge of integrating flash attention into the lightweight SR regime, effectively mitigating self-attention's inherent memory bottleneck. We scale up window size to 32$\times$32 with flash attention rather than proposing an intricate self-attention module, significantly improving PSNR by 0.31 dB on Urban100$\times$2 while reducing latency and memory usage by 16$\times$ and 12.2$\times$. Building on these approaches, our proposed network, termed Emulating Self-attention with Convolution (ESC), notably improves PSNR by 0.27 dB on Urban100$\times$4 compared to HiT-SRF, reducing the latency and memory usage by 3.7$\times$ and 6.2$\times$, respectively. Extensive experiments demonstrate that our ESC maintains the ability for long-range modeling, data scalability, and the representational power of transformers despite most self-attentions being replaced by the ConvAttn module.
Poster
Jianfang He · Min Cao · Silong Peng · Qiong Xie
[ Exhibit Hall I ]
Abstract
Large vision-language models such as CLIP have made significant strides in zero-shot anomaly detection through prompt engineering. However, most existing methods typically process each test image individually, ignoring the practical rarity of abnormal patches in real-world scenarios. Although some batch-based approaches exploit the rarity by processing multiple samples concurrently, they generally introduce unacceptable latency for real-time applications. To mitigate these limitations, we propose RareCLIP, a novel online zero-shot anomaly detection framework that enables sequential image processing in real-time without requiring prior knowledge of the target domain. RareCLIP capitalizes on the zero-shot capabilities of CLIP and integrates a dynamic test-time rarity estimation mechanism. A key innovation of our framework is the introduction of a prototype patch feature memory bank, which aggregates representative features from historical observations and continuously updates their corresponding rarity measures. For each incoming image patch, RareCLIP computes a rarity score by aggregating the rarity measures of its nearest neighbors within the memory bank. Moreover, we introduce a prototype sampling strategy based on dissimilarity to enhance computational efficiency, as well as a similarity calibration strategy to enhance the robustness of rarity estimation. Extensive experiments demonstrate that RareCLIP attains state-of-the-art performance with 98.2% image-level AUROC on MVTec AD and 94.5% on VisA, while achieving a latency of 59.4 …
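The memory-bank idea in this abstract (prototypes with continuously updated rarity measures, and a patch score aggregated over nearest neighbors) can be sketched in a few lines. Everything below is a hypothetical illustration: the class name `RarityMemory`, the count-based rarity measure, the merge threshold, and the similarity-weighted aggregation are assumptions, not RareCLIP's actual formulation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot / (na * nb)


class RarityMemory:
    """Hypothetical prototype memory bank with count-based rarity."""

    def __init__(self, k=1, merge_threshold=0.9):
        self.prototypes = []  # representative patch features
        self.counts = []      # how often each prototype has been observed
        self.k = k
        self.merge_threshold = merge_threshold

    def _nearest(self, feature):
        sims = sorted(
            ((cosine(feature, p), i) for i, p in enumerate(self.prototypes)),
            reverse=True,
        )
        return sims[: self.k]

    def score(self, feature):
        """Rarity in [0, 1]: high when similar prototypes were seldom observed."""
        if not self.prototypes:
            return 1.0
        total = sum(self.counts)
        nearest = self._nearest(feature)
        support = sum(max(s, 0.0) * self.counts[i] / total for s, i in nearest)
        return 1.0 - support / len(nearest)

    def observe(self, feature):
        """Update the bank: merge into the closest prototype or add a new one."""
        nearest = self._nearest(feature)
        if nearest and nearest[0][0] >= self.merge_threshold:
            self.counts[nearest[0][1]] += 1
        else:
            self.prototypes.append(list(feature))
            self.counts.append(1)


bank = RarityMemory()
for _ in range(9):
    bank.observe([1.0, 0.0])   # a frequently seen "normal" patch
bank.observe([0.0, 1.0])       # a patch seen only once
common_score = bank.score([1.0, 0.0])
rare_score = bank.score([0.0, 1.0])
```

In this toy run the bank ends up with two prototypes, and the patch seen only once receives the higher rarity score.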
Poster
Maksim Golyadkin · Rubanova Alexandrovna · Aleksandr Utkov · Dmitry Nikolotov · Ilya Makarov
[ Exhibit Hall I ]
Abstract
The recognition of ancient Egyptian hieroglyphs presents significant challenges due to the vast stylistic variations and the scarcity of labeled data. While deep learning has shown promising results, existing approaches often rely on single-source or synthetic datasets, limiting their generalization ability. To advance research in hieroglyph recognition, we introduce the Multisource Egyptian Hieroglyphs (MEH) dataset, the first multi-style dataset for hieroglyph classification. MEH comprises 10 distinct groups, each representing a unique writing style, with labels derived from professionally verified text digitizations. Using this dataset, we explore three key aspects of hieroglyph recognition: (1) analyzing how different writing styles affect model generalization, (2) evaluating synthetic data generation for expanding hieroglyph class coverage, and (3) assessing classification performance of existing models. To support future large-scale dataset creation, we propose a style-aware synthetic data generation method and introduce a hieroglyph labeling tool to simplify annotation and accelerate text digitization.
Poster
Bangxiang Lan · Ruobing Xie · Ruixiang Zhao · Xingwu Sun · Zhanhui Kang · Gang Yang · Xirong Li
[ Exhibit Hall I ]
Abstract
The Text-to-Video Retrieval (T2VR) task aims to retrieve unlabeled videos by textual queries with the same semantic meanings. Recent CLIP-based approaches have explored two frameworks: Two-Tower versus Single-Tower framework, yet the former suffers from low effectiveness, while the latter suffers from low efficiency. In this study, we explore a new Hybrid-Tower framework that can hybridize the advantages of the Two-Tower and Single-Tower framework, achieving high effectiveness and efficiency simultaneously. We propose a novel hybrid method, Fine-grained Pseudo-query Interaction and Generation for T2VR, i.e., F-Pig, which includes a new pseudo-query generator designed to generate a pseudo-query for each video. This enables the video feature and the textual features of pseudo-query to interact in a fine-grained manner, similar to the Single-Tower approaches to hold high effectiveness, even before the real textual query is received. Simultaneously, our method introduces no additional storage or computational overhead compared to the Two-Tower framework during the inference stage, thus maintaining high efficiency. Extensive experiments on five commonly used text-video retrieval benchmarks, including MSRVTT-1k, MSRVTT-3k, MSVD, VATEX and DiDeMo, demonstrate that our method achieves a significant improvement over the baseline, with an increase of $1.6\% \sim 3.9\%$ in R@1.
Poster
Ruiting Dai · Chenxi Li · Yandong Yan · Lisi Mo · Ke Qin · Tao He
[ Exhibit Hall I ]
Abstract
Previous multimodal learning models for missing modalities predominantly employ diffusion models to recover absent data conditioned on the available modalities. However, these approaches often overlook a critical issue: modality generation bias. In other words, while some modalities may be generated with high quality, others, such as video, may prove challenging to synthesize effectively. We argue that this limitation is primarily due to the inherent modality gap, ultimately resulting in imbalanced training. To overcome this challenge, we introduce a novel Multi-stage Duplex Diffusion Network (MD$^2$N) designed to achieve unbiased missing-modality recovery. The key idea of our approach is the development of a modality transfer module within the recovery process, which facilitates smooth cross-modality generation. This module is trained using duplex diffusion models, enabling the available and missing modalities to generate each other in an intersecting manner through three distinct stages: global structure generation, modality transfer, and local cross-modal refinement. During training, the generation of the available data and that of the missing data influence each other and eventually reach a balanced state. Experimental results demonstrate that our proposed method significantly outperforms current state-of-the-art techniques, achieving up to a 4% improvement over IMDer on the CMU-MOSEI dataset.
Poster
Juncan Deng · Shuaiting Li · Zeyu Wang · Kedong Xu · Hong Gu · Kejie Huang
[ Exhibit Hall I ]
Abstract
Visual Mamba networks (ViMs) extend the selective state space model (Mamba) to various vision tasks and demonstrate significant potential. Vector quantization (VQ), on the other hand, decomposes network weights into codebooks and assignments, significantly reducing memory usage and computational latency to enable ViM deployment on edge devices. Although existing VQ methods have achieved extremely low-bit quantization (e.g., 3-bit, 2-bit, and 1-bit) in convolutional neural networks and Transformer-based networks, directly applying these methods to ViMs results in unsatisfactory accuracy. We identify several key challenges: 1) The weights of Mamba-based blocks in ViMs contain numerous outliers, significantly amplifying quantization errors. 2) When applied to ViMs, the latest VQ methods suffer from excessive memory consumption, lengthy calibration procedures, and suboptimal performance in the search for optimal codewords. In this paper, we propose ViM-VQ, an efficient post-training vector quantization method tailored for ViMs. ViM-VQ consists of two innovative components: 1) a fast convex combination optimization algorithm that efficiently updates both the convex combinations and the convex hulls to search for optimal codewords, and 2) an incremental vector quantization strategy that incrementally confirms optimal codewords to mitigate truncation errors. Experimental results demonstrate that ViM-VQ achieves state-of-the-art performance in low-bit quantization across various visual tasks.
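The decomposition of weights into "codebooks and assignments" mentioned in this abstract can be illustrated with plain nearest-codeword vector quantization. This is a minimal sketch under assumptions: the function names and the fixed, hand-picked codebook are illustrative, and it does not implement ViM-VQ's convex-combination search or incremental confirmation.

```python
def quantize(weights, codebook, dim):
    """Split a flat weight list into sub-vectors of length `dim` and map each
    one to the index of its nearest codeword (squared Euclidean distance)."""
    if len(weights) % dim != 0:
        raise ValueError("weight count must be a multiple of dim")
    assignments = []
    for start in range(0, len(weights), dim):
        sub = weights[start:start + dim]
        best = min(
            range(len(codebook)),
            key=lambda j: sum((s - c) ** 2 for s, c in zip(sub, codebook[j])),
        )
        assignments.append(best)
    return assignments


def dequantize(assignments, codebook):
    """Rebuild the approximate flat weight list from codeword indices."""
    return [value for a in assignments for value in codebook[a]]


codebook = [[0.0, 0.0], [1.0, 1.0]]      # two 2-D codewords (assumed values)
weights = [0.1, 0.0, 0.9, 1.1]           # four weights -> two sub-vectors
assignments = quantize(weights, codebook, dim=2)
approx = dequantize(assignments, codebook)
```

Only the small codebook and the integer assignments need to be stored, which is where the memory saving comes from; the reconstruction error (here 0.1 per affected weight) is what low-bit methods try to minimize.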
Poster
Xi Fang · Jiankun Wang · Xiaochen Cai · Shang Chien · Shuwen Yang · Haoyi Tao · Nan wang · Lin Yao · Linfeng Zhang · Guolin Ke
[ Exhibit Hall I ]
Abstract
In recent decades, chemistry publications and patents have increased rapidly. A significant portion of key information is embedded in molecular structure figures, complicating large-scale literature searches and limiting the application of large language models in fields such as biology, chemistry, and pharmaceuticals. The automatic extraction of precise chemical structures is of critical importance. However, the presence of numerous Markush structures in real-world documents, along with variations in molecular image quality, drawing styles, and noise, significantly limits the performance of existing optical chemical structure recognition (OCSR) methods. We present MolParser, a novel end-to-end OCSR method that efficiently and accurately recognizes chemical structures from real-world documents, including difficult Markush structures. We use an extended SMILES encoding rule to annotate our training dataset. Under this rule, we build MolParser-7M, the largest annotated molecular image dataset to our knowledge. While utilizing a large amount of synthetic data, we employed active learning methods to incorporate substantial in-the-wild data, specifically samples cropped from real patents and scientific literature, into the training process. We trained an end-to-end molecular image captioning model, MolParser, using a curriculum learning approach. MolParser significantly outperforms classical and learning-based methods across most scenarios, with potential for broader downstream applications. The dataset is publicly …
Poster
Zeqi Zheng · Yanchen Huang · Yingchao Yu · Zizheng Zhu · Junfeng Tang · Zhaofei Yu · Yaochu Jin
[ Exhibit Hall I ]
Abstract
Spiking Neural Networks (SNNs) based on Transformers have garnered significant attention due to their superior performance and high energy efficiency. However, the spiking attention modules of most existing Transformer-based SNNs are adapted from those of analog Transformers, failing to fully address the issue of over-allocating attention to irrelevant contexts. To fix this fundamental yet overlooked issue, we propose a Lateral Inhibition-inspired Spiking Transformer (SpiLiFormer). It emulates the brain's lateral inhibition mechanism, guiding the model to enhance attention to relevant tokens while suppressing attention to irrelevant ones. Our model achieves state-of-the-art (SOTA) performance across multiple datasets, including CIFAR-10 (+0.45%), CIFAR-100 (+0.48%), CIFAR10-DVS (+2.70%), N-Caltech101 (+1.94%), and ImageNet-1K (+1.6%). Notably, on the ImageNet-1K dataset, SpiLiFormer (69.9M parameters, 4 time steps, 384 resolution) outperforms E-SpikeFormer (173.0M parameters, 8 time steps, 384 resolution), a SOTA spiking Transformer, by 0.46% using only 39% of the parameters and half the time steps. Our code and training checkpoints will be released upon acceptance.
Poster
Zhuqiang Lu · Zhenfei Yin · Mengwei He · Zhihui Wang · Zicheng Liu · Zhiyong Wang · Kun Hu
[ Exhibit Hall I ]
Abstract
Recently, Vision Large Language Models (VLLMs) with integrated vision encoders have shown promising performance in vision understanding. They encode visual content into sequences of visual tokens, enabling joint processing of visual and textual data. However, understanding videos, especially long videos, remains a challenge as the rapid growth of visual tokens during video encoding risks exceeding VLLMs' context window length and significantly escalates computational cost. To restrict the number of visual tokens, existing VLLMs either (1) uniformly downsample videos into a fixed number of frames or (2) reduce the number of visual tokens encoded from each frame. We argue that the former neglects temporal dynamics in videos, while the latter fails to preserve spatial details within individual frames. In this work, we propose Balanced-VLLM (B-VLLM), a novel VLLM framework designed to model task-relevant spatio-temporal cues while restricting the number of visual tokens within the VLLM's context window length. Central to our framework is a text-conditioned adaptive frame selection module that dynamically identifies task-relevant frames, which are further de-duplicated with a temporal frame token merging strategy. The visual tokens of these frames then undergo spatial token sampling and an optional spatial token merging strategy for granular control against the token budget. Experiments …
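Text-conditioned frame selection, as described in this abstract, can be approximated by ranking frames by their similarity to a text embedding and keeping the top few in temporal order. This is a minimal sketch under assumptions: the function name `select_frames`, the use of cosine similarity, and the hard top-k budget are illustrative and do not reproduce B-VLLM's learned selection module or its token merging.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot / (na * nb)


def select_frames(frame_embeddings, text_embedding, budget):
    """Return the indices of the `budget` frames most similar to the text,
    sorted back into temporal order."""
    ranked = sorted(
        range(len(frame_embeddings)),
        key=lambda i: cosine(frame_embeddings[i], text_embedding),
        reverse=True,
    )
    return sorted(ranked[:budget])


# Toy 2-D embeddings (assumed values): frames 0 and 2 align with the query.
frames = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.0, 1.0]]
query = [1.0, 0.0]
selected = select_frames(frames, query, budget=2)
```

Unlike uniform downsampling, the kept frames depend on the query, so a different question about the same video would select a different subset.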
Poster
Zhiqi Ge · Juncheng Li · Xinglei Pang · Minghe Gao · Kaihang Pan · Wang Lin · Hao Fei · Wenqiao Zhang · Siliang Tang · Yueting Zhuang
[ Exhibit Hall I ]
Abstract
Digital agents are increasingly employed to automate tasks in interactive digital environments such as web pages, software applications, and operating systems. While text-based agents built on Large Language Models (LLMs) often require frequent updates due to platform-specific APIs, visual agents leveraging Multimodal Large Language Models (MLLMs) offer enhanced adaptability by interacting directly with Graphical User Interfaces (GUIs). However, these agents face significant challenges in visual perception, particularly when handling high-resolution, visually complex digital environments. This paper introduces Iris, a foundational visual agent that addresses these challenges through two key innovations: Information-Sensitive Cropping (ISC) and Self-Refining Dual Learning (SRDL). ISC dynamically identifies and prioritizes visually dense regions using an edge detection algorithm, enabling efficient processing by allocating more computational resources to areas with higher information density. SRDL enhances the agent's ability to handle complex tasks by leveraging a dual-learning loop, where improvements in referring (describing UI elements) reinforce grounding (locating elements) and vice versa, all without requiring additional annotated data. Empirical evaluations demonstrate that Iris achieves state-of-the-art performance across multiple benchmarks with only 850K GUI annotations, outperforming methods using 10x more training data. These improvements further translate to significant gains in both web and OS agent downstream tasks. The project is …
Poster
Aashish Sharma
[ Exhibit Hall I ]
Abstract
In this paper, we address the problem of small object detection (SOD) by introducing our novel approach - Dynamically Multiplexed Expanded Features Set (DM-EFS) form. Detecting small objects is challenging as they usually suffer from inadequate feature representation. Hence, to address this, we propose the Expanded Features Set (EFS) form - a simple yet effective idea to improve the feature representation of small objects by utilizing the untapped higher resolution features from the shallower layers of the backbone module. We observe that the EFS form improves the SOD performance. However, because it processes additional features, it has a higher computational cost, which reduces inference efficiency. Hence, to address this, we propose Dynamic Feature Multiplexing (DFM) - a novel design that optimizes the usage of the EFS form during inference by dynamically multiplexing it to create our aforementioned DM-EFS form. Since our DM-EFS form is a multiplexed (or subsampled) optimal version of the EFS form, it improves the SOD performance like the EFS form but with a lower computational cost. Extensive experiments confirm the efficacy of our DM-EFS approach. Integrated with the YOLOv7 base model, our DM-EFS achieves state-of-the-art results on diverse SOD datasets, outperforming the base model and SOD …
Poster
Zonglin Di · Jing Shi · Yifei Fan · Hao Tan · Alexander Black · John Collomosse · Yang Liu
[ Exhibit Hall I ]
Abstract
The image difference captioning (IDC) task is to describe the distinctions between two images. However, existing datasets do not offer comprehensive coverage across all image-difference categories. In this work, we introduce a high-quality dataset, DiffTell, with various types of image manipulations, including global image alterations, object-level changes, and text manipulations. The data quality is controlled by careful human filtering. Additionally, to scale up the data collection without prohibitive human labor costs, we explore the possibility of automatic filtering for quality control. We demonstrate that both traditional methods and recent multimodal large language models (MLLMs) exhibit performance improvements on the IDC task after training on the DiffTell dataset. Through extensive ablation studies, we provide a detailed analysis of the performance gains attributed to DiffTell. Experiments show DiffTell significantly enhances the availability of resources for IDC research, offering a more comprehensive foundation and benchmark for future investigations.
Poster
Ao Wang · Lihao Liu · Hui Chen · Zijia Lin · Jungong Han · Guiguang Ding
[ Exhibit Hall I ]
Abstract
Object detection and segmentation are widely employed in computer vision applications, yet conventional models like the YOLO series, while efficient and accurate, are limited by predefined categories, hindering adaptability in open scenarios. Recent open-set methods leverage text prompts, visual cues, or prompt-free paradigms to overcome this, but often compromise between performance and efficiency due to high computational demands or deployment complexity. In this work, we introduce YOLOE, which integrates detection and segmentation across diverse open prompt mechanisms within a single highly efficient model, achieving real-time seeing anything. For text prompts, we propose the Re-parameterizable Region-Text Alignment (RepRTA) strategy. It refines pretrained textual embeddings via a re-parameterizable lightweight auxiliary network and enhances visual-textual alignment with zero inference and transferring overhead. For visual prompts, we present the Semantic-Activated Visual Prompt Encoder (SAVPE). It employs decoupled semantic and activation branches to bring improved visual embedding and accuracy with minimal complexity. For the prompt-free scenario, we introduce the Lazy Region-Prompt Contrast (LRPC) strategy. It utilizes a built-in large vocabulary and specialized embedding to identify all objects, avoiding costly language model dependency. Extensive experiments show YOLOE's exceptional zero-shot performance and transferability with high inference efficiency and low training cost. Notably, on LVIS, with 3$\times$ less training cost and 1.4$\times$ inference …
Poster
WU Sitong · Haoru Tan · Yukang Chen · Shaofeng Zhang · Jingyao Li · Bei Yu · Xiaojuan Qi · Jiaya Jia
[ Exhibit Hall I ]
Abstract
Evaluating the quality of image-text pair data plays a crucial role in various data processing strategies for vision-language pre-training. Currently, the most popular metrics, such as CLIP-Score, rely on off-the-shelf vision-language models to generate quality scores for paired images and text based on their feature similarity. However, we observe a prevalent phenomenon: different scoring models yield varying quality scores for the same data. This quality score disparity directly affects the result of data processing, leading to discrepancies between datasets processed using different quality scores. This dataset disparity in turn results in performance differences among models individually trained on datasets processed with distinct quality scores. Notably, no single quality score performs optimally across all evaluation tasks. Each score exhibits an inherent bias towards certain concepts or tasks, and different scores have complementary effects on model performance. This makes the choice of scoring model confusing. In this paper, we first investigate these disparity phenomena and analyze their causes. Then, we propose a simple yet effective method, named Mixture-of-Scores (MoS), to extract the essence of existing quality scores while eliminating their biases by integrating them into a more robust score based on a data-adaptive ensemble strategy. Particularly, …
Poster
Weiying Xie · Zihan Meng · Jitao Ma · Wenjin Guo · Haowei Li · Haonan Qin · Leyuan Fang · Yunsong Li
[ Exhibit Hall I ]
Abstract
Quantization-aware Training (QAT) technology helps deep models adapt to precision loss by simulating quantization operations. However, existing methods fail to reach the optimal solution due to inadequate exploration of the quantization solution space. To address the issue, we propose a novel QAT method, Allowing Oscillation Quantization (AOQ), which expands the reachable solution space through weight oscillation. Notably, unlike previous methods that suppress oscillation throughout training, in the early and middle training stages, AOQ promotes oscillation to explore a broader range of quantized configurations. In the later stage, AOQ suppresses oscillation to ensure stable convergence. Furthermore, by decoupling the quantization thresholds and levels, we encourage meaningful oscillation and improve the stability of learnable quantization parameters. Extensive experiments on various models, including ResNet, MobileNet, DeiT and Swin Transformer, demonstrate the effectiveness of our method. Specifically, with 2-bit quantization, AOQ achieves a performance improvement of $0.4\%\sim2.2\%$ on ImageNet compared to state-of-the-art methods.
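To make "simulating quantization operations" and "weight oscillation" concrete, the sketch below applies a plain uniform fake quantizer to a latent weight and counts how often its quantized level flips as the weight hovers near a rounding threshold. It does not implement AOQ or its decoupled thresholds and levels: the quantizer, the names `fake_quantize` and `count_oscillations`, and the toy weight trajectory are all assumptions for illustration.

```python
def fake_quantize(w, scale, bits=2):
    """Simulate signed uniform quantization: scale, round, clamp, rescale."""
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    q = max(qmin, min(qmax, round(w / scale)))
    return q * scale


def count_oscillations(latent_weights, scale, bits=2):
    """Count how many times consecutive quantized values differ."""
    levels = [fake_quantize(w, scale, bits) for w in latent_weights]
    return sum(1 for a, b in zip(levels, levels[1:]) if a != b)


# A latent weight drifting around the 0.5 rounding threshold (assumed values).
trajectory = [0.45, 0.55, 0.48, 0.52]
flips = count_oscillations(trajectory, scale=1.0)

# Values outside the 2-bit range are clamped to the extreme levels.
high = fake_quantize(5.0, scale=1.0)
low = fake_quantize(-5.0, scale=1.0)
```

Small updates to the latent weight move the quantized value between 0 and 1 on every step here. That flipping is the oscillation the abstract refers to, which AOQ is described as promoting early in training and suppressing later.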
Poster
Vladimir Kulikov · Matan Kleiner · Inbar Huberman-Spiegelglas · Tomer Michaeli
[ Exhibit Hall I ]
Abstract
Editing real images using a pre-trained text-to-image (T2I) diffusion/flow model often involves inverting the image into its corresponding noise map. However, inversion by itself is typically insufficient for obtaining satisfactory results, and therefore many methods additionally intervene in the sampling process. Such methods achieve improved results but are not seamlessly transferable between model architectures. Here, we introduce FlowEdit, a text-based editing method for pre-trained T2I flow models, which is inversion-free, optimization-free, and model-agnostic. Our method constructs an ODE that directly maps between the source and target distributions (corresponding to the source and target text prompts) and achieves a lower transport cost than the inversion approach. This leads to state-of-the-art results, as we illustrate with Stable Diffusion 3 and FLUX.
Poster
Yiren Song · Danze Chen · Mike Zheng Shou
[ Exhibit Hall I ]
Abstract
Generating cognitive-aligned layered SVGs remains challenging due to existing methods’ tendencies toward either oversimplified single-layer outputs or optimization-induced shape redundancies. We propose LayerTracer, a DiT-based framework that bridges this gap by learning designers’ layered SVG creation processes from a novel dataset of sequential design operations. Our approach operates in two phases: first, a text-conditioned DiT generates multi-phase rasterized construction blueprints that simulate human design workflows; second, layer-wise vectorization with path deduplication produces clean, editable SVGs. For image vectorization, we introduce a conditional diffusion mechanism that encodes reference images into latent tokens, guiding hierarchical reconstruction while preserving structural integrity. Extensive experiments show that LayerTracer surpasses optimization-based and neural baselines in generation quality and editability.