

Poster Session

Poster Session 1 & Exhibit Hall

Exhibit Hall I
Tue 21 Oct 2:45 p.m. PDT — 4:45 p.m. PDT


#1
Secure On-Device Video OOD Detection Without Backpropagation

Li Li · Peilin Cai · Yuxiao Zhou · Zhiyu Ni · Renjie Liang · QIN YOU · Yi Nian · Zhengzhong Tu · Xiyang Hu · Yue Zhao

Out-of-Distribution (OOD) detection is critical for ensuring the reliability of machine learning models in safety-critical applications such as autonomous driving and medical diagnosis. While deploying personalized OOD detection directly on edge devices is desirable, it remains challenging due to large model sizes and the computational infeasibility of on-device training. Federated learning partially addresses this but still requires gradient computation and backpropagation, exceeding the capabilities of many edge devices. To overcome these challenges, we propose SecDOOD, a secure cloud-device collaboration framework for efficient on-device OOD detection without requiring device-side backpropagation. SecDOOD utilizes cloud resources for model training while ensuring user data privacy by retaining sensitive information on-device. Central to SecDOOD is a HyperNetwork-based personalized parameter generation module, which adapts cloud-trained models to device-specific distributions by dynamically generating local weight adjustments, effectively combining central and local information without local fine-tuning. Additionally, our dynamic feature sampling and encryption strategy selectively encrypts only the most informative feature channels, largely reducing encryption overhead without compromising detection performance. Extensive experiments across multiple datasets and OOD scenarios demonstrate that SecDOOD achieves performance comparable to fully fine-tuned models, enabling secure, efficient, and personalized OOD detection on resource-limited edge devices. To enhance accessibility and reproducibility, our code is publicly available at https://anonymous.4open.science/r/SecDOOD/.


#2
Learning Counterfactually Decoupled Attention for Open-World Model Attribution

Yu Zheng · Boyang Gong · Fanye Kong · Yueqi Duan · Bingyao Yu · Wenzhao Zheng · Lei Chen · Jiwen Lu · Jie Zhou

In this paper, we propose a Counterfactually Decoupled Attention Learning (CDAL) method for open-world model attribution. Existing methods rely on handcrafted design of region partitioning or feature space, which could be confounded by the spurious statistical correlations and struggle with novel attacks in open-world scenarios. To address this, CDAL explicitly models the causal relationships between the attentional visual traces and source model attribution, and counterfactually decouples the discriminative model-specific artifacts from confounding source biases for comparison. In this way, the resulting causal effect provides a quantification on the quality of learned attention maps, thus encouraging the network to capture essential generation patterns that generalize to unseen source models by maximizing the effect. Extensive experiments on existing open-world model attribution benchmarks show that with minimal computational overhead, our method consistently improves state-of-the-art models by large margins, particularly for unseen novel attacks.


#3
Latte: Collaborative Test-Time Adaptation of Vision-Language Models in Federated Learning

Wenxuan Bao · Ruxi Deng · Ruizhong Qiu · Tianxin Wei · Hanghang Tong · Jingrui He

Test-time adaptation with pre-trained vision-language models has gained increasing attention for addressing distribution shifts during testing. Among these approaches, memory-based algorithms stand out due to their training-free nature and ability to leverage historical test data. However, existing test-time adaptation methods are typically designed for a single domain with abundant data. In decentralized settings such as federated learning, applying these methods individually to each client suffers from limited test data, while directly sharing a single global memory via the server prevents proper personalization to each client's unique distribution. To address this, we propose Latte, a novel framework where each client maintains a local memory to store embeddings from its own historical test data and an external memory to store class prototypes from other relevant clients. During communication, each client retrieves prototypes from similar clients under the server’s coordination to expand its memory. For local adaptation, Latte utilizes both embedding similarity and uncertainty to enhance model performance. Our theoretical analysis shows that Latte effectively leverages in-distribution clients while remaining robust to out-of-distribution clients. Extensive experiments on domain adaptation and corruption benchmarks validate that Latte achieves superior performance in decentralized settings, while introducing only negligible communication and computation costs.


#4
Is Less More? Exploring Token Condensation as Training-free Test-time Adaptation

Zixin Wang · Dong Gong · Sen Wang · Zi Huang · Yadan Luo

Contrastive Language-Image Pretraining (CLIP) excels at learning generalizable image representations but often falls short in zero-shot inference on certain downstream datasets. Test-time adaptation (TTA) mitigates this issue by adjusting components like normalization layers or context prompts, yet it typically requires large batch sizes and extensive augmentations, leading to high computational costs. This raises a key question: Can VLMs' performance drop in specific test cases be mitigated through efficient, training-free approaches? To explore the solution, we investigate token condensation (TC) techniques, originally designed to enhance vision transformer efficiency by refining token usage during inference. We observe that informative tokens improve visual-text alignment in VLMs like CLIP on unseen datasets. However, existing TC methods often fail to maintain in-distribution performance when reducing tokens, prompting us to ask: How can we transform TC into an effective "free-lunch" adaptation strategy for VLMs? To address this, we propose Token Condensation as Adaptation (TCA), a training-free adaptation method that takes a step beyond standard TC. Rather than passively discarding tokens, TCA condenses token representation by introducing reservoir-based domain anchor tokens for information-preserving token reduction and logit correction. TCA achieves up to a 21.4% performance improvement over the strongest baseline on cross-dataset benchmark and the CIFAR-100-Corrupted dataset while reducing GFLOPs by 12.2% to 48.9%, with minimal hyperparameter dependency on both CLIP and SigLIP series. Code is available in the supplementary material.


#5
Highlight
Mastering Collaborative Multi-modal Data Selection: A Focus on Informativeness, Uniqueness, and Representativeness

Qifan Yu · Zhebei Shen · Zhongqi Yue · Yang Wu · Bosheng Qin · Wenqiao Zhang · Yunfei Li · Juncheng Li · Siliang Tang · Yueting Zhuang

Instruction tuning fine-tunes pre-trained Multi-modal Large Language Models (MLLMs) to handle real-world tasks. However, the rapid expansion of visual instruction datasets introduces data redundancy, leading to excessive computational costs. We propose a collaborative framework, DataTailor, which leverages three key principles (informativeness, uniqueness, and representativeness) for effective data selection. We argue that a valuable sample should be informative of the task, non-redundant, and representative of the sample distribution (i.e., not an outlier). We further propose practical ways to score against each principle, which automatically adapt to a given dataset without tedious hyperparameter tuning. Comprehensive experiments on various benchmarks demonstrate that DataTailor achieves 101.3% of the performance of full-data fine-tuning with only 15% of the data, significantly reducing computational costs while maintaining superior results. This exemplifies the "Less is More" philosophy in MLLM development. The code is available at https://anonymous.4open.science/r/DataTailor-5BC3.

Real-world infrared imagery presents unique challenges for vision-language models due to the scarcity of aligned text data and domain-specific characteristics. Although existing methods have advanced the field, their reliance on synthetic infrared images generated through style transfer from visible images limits their ability to capture the unique characteristics of the infrared modality. To address this, we propose IRGPT, the first multi-modal large language model for real-world infrared images, built upon a large-scale InfraRed-Text Dataset (IR-TD) comprising over 260K authentic image-text pairs. The proposed IR-TD dataset contains real infrared images paired with meticulously handcrafted texts, where the initial drafts originated from two complementary processes: (1) LLM-generated descriptions of visible images, and (2) rule-based descriptions of annotations. Furthermore, we introduce a bi-cross-modal curriculum transfer learning strategy that systematically transfers knowledge from visible to infrared domains by considering the difficulty scores of both infrared-visible and infrared-text pairs. Evaluated on a benchmark of 9 tasks (e.g., recognition, grounding), IRGPT achieves state-of-the-art performance even compared with larger-scale models.


#7
SMoLoRA: Exploring and Defying Dual Catastrophic Forgetting in Continual Visual Instruction Tuning

Ziqi Wang · Chang Che · Qi Wang · Yangyang Li · Zenglin Shi · Meng Wang

Visual instruction tuning (VIT) enables multimodal large language models (MLLMs) to effectively handle a wide range of vision tasks by framing them as language-based instructions. Building on this, continual visual instruction tuning (CVIT) extends the capability of MLLMs to incrementally learn new tasks, accommodating evolving functionalities. While prior work has advanced CVIT through the development of new benchmarks and approaches to mitigate catastrophic forgetting, these efforts largely follow traditional continual learning paradigms, neglecting the unique challenges specific to CVIT. We identify a dual form of catastrophic forgetting in CVIT, where MLLMs not only forget previously learned visual understanding but also experience a decline in instruction following abilities as they acquire new tasks. To address this, we introduce the Separable Mixture of Low-Rank Adaptation (SMoLoRA) framework, which employs separable routing through two distinct modules—one for visual understanding and another for instruction following. This dual-routing design enables specialized adaptation in both domains, preventing forgetting while improving performance. Furthermore, we propose a new CVIT benchmark that goes beyond existing benchmarks by additionally evaluating a model's ability to generalize to unseen tasks and handle diverse instructions across various tasks. Extensive experiments demonstrate that SMoLoRA outperforms existing methods in mitigating dual forgetting, improving generalization to unseen tasks, and ensuring robustness in following diverse instructions.


#8
One Object, Multiple Lies: A Benchmark for Cross-task Adversarial Attack on Unified Vision-Language Models

Jiale Zhao · XINYANG JIANG · Junyao Gao · Yuhao Xue · Cairong Zhao

Unified vision-language models (VLMs) have recently shown remarkable progress, enabling a single model to flexibly address diverse tasks through different instructions within a shared computational architecture. This instruction-based control mechanism creates unique security challenges, as adversarial inputs must remain effective across multiple task instructions that may be unpredictably applied to process the same malicious content. In this paper, we introduce CrossVLAD, a new benchmark dataset carefully curated from MSCOCO with GPT-4-assisted annotations for systematically evaluating cross-task adversarial attacks on unified VLMs. CrossVLAD centers on the object-change objective—consistently manipulating a target object's classification across four downstream tasks—and proposes a novel success rate metric that measures simultaneous misclassification across all tasks, providing a rigorous evaluation of adversarial transferability. To tackle this challenge, we present CRAFT (Cross-task Region-based Attack Framework with Token-alignment), an efficient region-centric attack method. Extensive experiments on Florence-2 and other popular unified VLMs demonstrate that our method outperforms existing approaches in both overall cross-task attack performance and targeted object-change success rates, highlighting its effectiveness in adversarially influencing unified VLMs across diverse tasks.
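The all-tasks success criterion described above can be illustrated with a minimal sketch. It assumes each adversarial sample has one boolean per task saying whether the target object was misclassified; the task names below are placeholders, since the abstract only states that there are four downstream tasks, and this is not the benchmark's actual evaluation code.

```python
from typing import Dict, List

# Placeholder task names (assumption): the abstract only says "four downstream tasks".
TASKS = ["task_a", "task_b", "task_c", "task_d"]


def cross_task_success_rate(results: List[Dict[str, bool]]) -> float:
    """Fraction of adversarial samples that fool the model on every task at once."""
    if not results:
        return 0.0
    # A sample counts only if the object change succeeded on all tasks simultaneously.
    hits = sum(all(sample[task] for task in TASKS) for sample in results)
    return hits / len(results)


# Toy per-sample outcomes (made-up values for illustration only).
results = [
    {"task_a": True, "task_b": True, "task_c": True, "task_d": True},
    {"task_a": True, "task_b": False, "task_c": True, "task_d": True},
    {"task_a": True, "task_b": True, "task_c": True, "task_d": True},
    {"task_a": False, "task_b": False, "task_c": False, "task_d": False},
]
rate = cross_task_success_rate(results)
```

Under this criterion, the second sample counts as a failure even though three of its four tasks were fooled, which is what makes the metric stricter than averaging per-task success rates.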


#9
Generalized Tensor-based Parameter-Efficient Fine-Tuning via Lie Group Transformations

Chongjie Si · Zhiyi Shi · Xuehui Wang · Yichen Xiao · Xiaokang Yang · Wei Shen

Adapting pre-trained foundation models for diverse downstream tasks is a core practice in artificial intelligence. However, the wide range of tasks and high computational costs make full fine-tuning impractical. To overcome this, parameter-efficient fine-tuning (PEFT) methods like LoRA have emerged and are becoming a growing research focus. Despite the success of these methods, they are primarily designed for linear layers, focusing on two-dimensional matrices while largely ignoring higher-dimensional parameter spaces like convolutional kernels. Moreover, directly applying these methods to higher-dimensional parameter spaces often disrupts their structural relationships. Given the rapid advancements in matrix-based PEFT methods, rather than designing a specialized strategy, we propose a generalization that extends matrix-based PEFT methods to higher-dimensional parameter spaces without compromising their structural properties. Specifically, we treat parameters as elements of a Lie group, with updates modeled as perturbations in the corresponding Lie algebra. These perturbations are mapped back to the Lie group through the exponential map, ensuring smooth, consistent updates that preserve the inherent structure of the parameter space. Extensive experiments on computer vision and natural language processing validate the effectiveness and versatility of our approach, demonstrating clear improvements over existing methods.
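As a rough illustration of updating parameters through an exponential map, the sketch below uses the simplest Lie group that fits the description: nonzero reals under elementwise multiplication, whose Lie algebra is the reals under addition. This choice of group is an assumption made for illustration; the abstract does not say which group the method uses. The function name `exp_map_update` is hypothetical.

```python
import math
from typing import List


def exp_map_update(weights: List[float], delta: List[float]) -> List[float]:
    """Apply a Lie-algebra perturbation `delta` to `weights` via the exponential map.

    Assumed group: elementwise multiplication, so the update is w * exp(d).
    The operation is elementwise, so a flattened tensor of any shape
    (for example a convolutional kernel) can be passed in.
    """
    return [w * math.exp(d) for w, d in zip(weights, delta)]


# A flattened toy kernel (made-up values).
kernel = [0.5, -1.0, 2.0, 0.0]

# A zero perturbation maps to the group identity, leaving the weights unchanged.
unchanged = exp_map_update(kernel, [0.0, 0.0, 0.0, 0.0])

# Perturbations add in the algebra and compose multiplicatively in the group.
d1 = [0.1, -0.2, 0.0, 0.3]
d2 = [0.05, 0.1, -0.4, 0.2]
two_steps = exp_map_update(exp_map_update(kernel, d1), d2)
one_step = exp_map_update(kernel, [a + b for a, b in zip(d1, d2)])
```

In this toy setting the update never flips the sign of a weight, which is one concrete sense in which an exponential-map update can preserve structure that an additive update would not.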


#10
Highlight
Bootstrapping Grounded Chain-of-Thought in Multimodal LLMs for Data-Efficient Model Adaptation

Jiaer Xia · Bingkui Tong · Yuhang Zang · Rui Shao · Kaiyang Zhou

Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in interpreting images using natural language. However, without using large-scale datasets for retraining, these models are difficult to adapt to specialized vision tasks, e.g., chart understanding. This problem is caused by a mismatch between pre-training and downstream datasets: pre-training datasets primarily concentrate on scenes and objects but contain limited information about specialized, non-object images, such as charts and tables. In this paper, we share an interesting finding that training an MLLM with chain-of-thought (CoT) reasoning data can facilitate model adaptation in specialized vision tasks, especially under data-limited regimes. However, we identify a critical issue within CoT data distilled from pre-trained MLLMs, i.e., the data often contains multiple factual errors in the reasoning steps. To address the problem, we propose Grounded Chain-of-Thought (GCoT), a simple bootstrapping-based approach that aims to inject grounding information (i.e., bounding boxes) into CoT data, essentially making the reasoning steps more faithful to input images. We evaluate our approach on five specialized vision tasks, which cover a variety of visual formats including charts, tables, receipts, and reports. The results demonstrate that under data-limited regimes our approach significantly improves upon fine-tuning and distillation.


#11
Deciphering Cross-Modal Alignment in Large Vision-Language Models via Modality Integration Rate

Qidong Huang · Xiaoyi Dong · Pan Zhang · Yuhang Zang · Yuhang Cao · Jiaqi Wang · Weiming Zhang · Nenghai Yu

Multi-modal pre-training plays a pivotal role in aligning two modalities for Large Vision-Language Models (LVLMs), while evaluating its training quality usually requires the costly supervised fine-tuning (SFT) stage to verify the downstream benchmark scores. Loss, perplexity, and in-context evaluation results are commonly used pre-training metrics for Large Language Models (LLMs), while we observed that these metrics are less indicative when quantifying the pre-trained LVLMs. Due to the lack of proper metrics, the research of LVLMs in the critical pre-training stage is hindered greatly, including the training data choice, efficient module design, etc. In this paper, we first present Modality Integration Rate (MIR), an effective, robust, and generalized metric to indicate the multi-modal pre-training quality of LVLMs without SFT. This metric evaluates LVLM pre-training from the inter-modal distribution distance perspective, which is 1) Effective to represent the pre-training quality and show a positive relation with the benchmark performance after SFT, 2) Robust toward different training/evaluation data, and 3) Generalizable across training configurations and architecture choices. Complementing MIR, we further propose learnable Modality Calibration (MoCa), a lightweight module to narrow the modality gap at each language model layer during training. A series of experiments are conducted to explore the effectiveness of MIR and MoCa, demonstrating that MIR is highly indicative about training data selection, training strategy schedule, and model architecture design to get better pre-training results. We hope MIR could be a helpful evaluator for building capable LVLMs and inspire the following research about modality alignment in different areas.


#12
X-Fusion: Introducing New Modality to Frozen Large Language Models

Sicheng Mo · Thao Nguyen · Xun Huang · Siddharth Iyer · Yijun Li · Yuchen Liu · Abhishek Tandon · Eli Shechtman · Krishna Kumar Singh · Yong Jae Lee · Bolei Zhou · Yuheng Li

We propose X-Fusion, a framework that extends pretrained Large Language Models (LLMs) for multimodal tasks while preserving their language capabilities. X-Fusion employs a dual-tower design with modality-specific weights, keeping the LLM’s parameters frozen while integrating vision-specific information for both understanding and generation. We find that incorporating understanding-focused data improves generation quality, reducing image data noise enhances overall performance, and feature alignment accelerates convergence for smaller models but has minimal impact on larger ones. Our findings provide valuable insights into building efficient unified multimodal models.


#13
LLaVA-KD: A Framework of Distilling Multimodal Large Language Models

Yuxuan Cai · Jiangning Zhang · Haoyang He · Xinwei He · Ao Tong · Zhenye Gan · Chengjie Wang · Zhucun Xue · Yong Liu · Xiang Bai

The success of Large Language Models (LLMs) has inspired the development of Multimodal Large Language Models (MLLMs) for unified understanding of vision and language. However, the increasing model size and computational complexity of large-scale MLLMs (l-MLLMs) limit their use in resource-constrained scenarios. Although small-scale MLLMs (s-MLLMs) are designed to reduce computational costs, they typically suffer from performance degradation. To mitigate this limitation, we propose a novel LLaVA-KD framework to transfer knowledge from l-MLLMs to s-MLLMs. Specifically, we introduce Multimodal Distillation (MDist) to transfer the teacher model's robust representations across both visual and linguistic modalities, and Relation Distillation (RDist) to transfer the teacher model's ability to capture visual token relationships. Additionally, we propose a three-stage training scheme to fully exploit the potential of the proposed distillation strategy: 1) Distilled Pre-Training to strengthen the alignment between visual-linguistic representations in s-MLLMs, 2) Supervised Fine-Tuning to equip the s-MLLMs with multimodal understanding capacity, and 3) Distilled Fine-Tuning to refine the s-MLLM's knowledge. Our approach significantly improves s-MLLM performance without altering the model architecture. Extensive experiments and ablation studies validate the effectiveness of each proposed component. Code will be available.

We focus on the source-free domain adaptive object detection (SFDAOD) problem when source data is unavailable during adaptation and the model must adapt to the unlabeled target domain. The majority of approaches for the problem employ a self-supervised approach using a student-teacher (ST) framework where pseudo-labels are generated via a source-pretrained model for further fine-tuning. We observe that the performance of a student model often degrades drastically, due to the collapse of the teacher model primarily caused by high noise in pseudo-labels, resulting from domain bias, discrepancies, and a significant domain shift across domains. To obtain reliable pseudo-labels, we propose a Target-based Iterative Query-Token Adversarial Network (TITAN) which separates the target images into two subsets: those that are similar to the source (easy) and those that are dissimilar (hard). We propose a strategy to estimate variance to partition the target domain. This approach leverages the insight that higher detection variances correspond to higher recall and greater similarity to the source domain. Also, we incorporate query-token based adversarial modules into a student-teacher baseline framework to reduce the domain gaps between two feature representations. Experiments conducted on four natural imaging datasets and two challenging medical datasets have substantiated the superior performance of TITAN compared to existing state-of-the-art (SOTA) methodologies. We report an mAP improvement of +22.7, +22.2, +21.1, and +3.7 percent over the current SOTA on cf, cb, sc, and kc benchmarks respectively.


#15
Highlight
StolenLoRA: Exploring LoRA Extraction Attacks via Synthetic Data

Yixu Wang · Yan Teng · Yingchun Wang · Xingjun Ma

Parameter-Efficient Fine-Tuning (PEFT) methods like LoRA have transformed vision model adaptation, enabling the rapid deployment of customized models. However, the compactness of LoRA adaptations introduces new safety concerns, particularly their vulnerability to model extraction attacks. This paper introduces a new focus of model extraction attacks named LoRA extraction that extracts LoRA-adaptive models based on a public pre-trained model. We then propose a novel extraction method called StolenLoRA which trains a substitute model to extract the functionality of a LoRA-adapted model using synthetic data. StolenLoRA leverages a Large Language Model to craft effective prompts for data generation, and it incorporates a Disagreement-based Semi-supervised Learning (DSL) strategy to maximize information gain from limited queries. Our experiments demonstrate the effectiveness of StolenLoRA, achieving up to a 96.60% attack success rate with only 10k queries, even in cross-backbone scenarios where the attacker and victim models utilize different pre-trained backbones. These findings reveal the specific vulnerability of LoRA-adapted models to this type of extraction and underscore the urgent need for robust defense mechanisms tailored to PEFT methods. We also explore a preliminary defense strategy based on diversified LoRA deployments, highlighting its potential to mitigate such attacks.


#16
MMReason: An Open-Ended Multi-Modal Multi-Step Reasoning Benchmark for MLLMs Toward AGI

Huanjin Yao · Jiaxing Huang · Yawen Qiu · Michael K. Chen · Wenzheng Liu · Wei Zhang · wenjie zeng · Xikun ZHANG · Jingyi Zhang · YuXin Song · Wenhao Wu · Dacheng Tao

Reasoning plays a crucial role in advancing Multimodal Large Language Models (MLLMs) toward Artificial General Intelligence. However, existing MLLM benchmarks often fall short in precisely and comprehensively evaluating long-chain reasoning abilities from three key aspects: (1) lack of difficulty and diversity, (2) susceptibility to guessability and memorization, (3) inadequate assessment of intermediate reasoning steps. To fill this gap, we introduce MMReason, a new benchmark designed to precisely and comprehensively evaluate MLLM long-chain reasoning capability with diverse, open-ended, challenging questions. First, we curate challenging questions requiring multi-step reasoning from various fields (i.e., 6 disciplines) and multiple difficulty levels (i.e., from pre-university to university, and from foundational to competition tiers). Second, these questions are reformulated into an open-ended format and filtered using a multi-model voting technique to eliminate shortcut cases related to guessing and memorization, ensuring robust reasoning evaluations. Third, we annotate the questions with detailed step-by-step solutions, and design a reference-based ternary scoring mechanism to reliably assess intermediate reasoning steps. With MMReason, we benchmark popular leading MLLMs and provide an in-depth analysis of their reasoning capabilities. We hope MMReason will serve as a valuable resource for advancing MLLM reasoning research.


#17
Doodle Your Keypoints: Sketch-Based Few-Shot Keypoint Detection

Subhajit Maity · Ayan Bhunia · Subhadeep Koley · Pinaki Chowdhury · Aneeshan Sain · Yi-Zhe Song

Keypoint detection, integral to modern machine perception, faces challenges in few-shot learning, particularly when source data from the same distribution as the query is unavailable. This gap is addressed by leveraging sketches, a popular form of human expression, providing a source-free alternative. However, challenges arise in mastering cross-modal embeddings and handling user-specific sketch styles. Our proposed framework overcomes these hurdles with a prototypical setup, combined with a grid-based locator and prototypical domain adaptation. We also demonstrate success in few-shot convergence across novel keypoints and classes through extensive experiments.


#18
Highlight
Dissecting Generalized Category Discovery: Multiplex Consensus under Self-Deconstruction

Luyao Tang · Kunze Huang · Yuxuan Yuan · Chenxin Li · Xiaotong Tu · Xinghao Ding · Chaoqi Chen · Yue Huang

Human perceptual systems excel at inducing and recognizing objects across both known and novel categories, a capability far beyond current machine learning frameworks. While generalized category discovery (GCD) aims to bridge this gap, existing methods predominantly focus on optimizing objective functions. We present an orthogonal solution, inspired by the human cognitive process for novel object understanding: decomposing objects into visual primitives and establishing cross-knowledge comparisons. We propose ConGCD, which establishes primitive-oriented representations through high-level semantic reconstruction, binding intra-class shared attributes via deconstruction. Mirroring human preference diversity in visual processing, where distinct individuals leverage dominant or contextual cues, we implement dominant and contextual consensus units to capture class-discriminative patterns and inherent distributional invariants, respectively. A consensus scheduler dynamically optimizes activation pathways, with final predictions emerging through multiplex consensus integration. Extensive evaluations across coarse- and fine-grained benchmarks demonstrate ConGCD's effectiveness as a consensus-aware paradigm.


#19
LMM-Det: Make Large Multimodal Models Excel in Object Detection

Jincheng Li · Chunyu Xie · Ji Ao · Dawei Leng · Yuhui Yin

Large multimodal models (LMMs) have garnered widespread attention and interest within the artificial intelligence research and industrial communities, owing to their remarkable capability in multimodal understanding, reasoning, and in-context learning, among others. While LMMs have demonstrated promising results in tackling multimodal tasks like image captioning, visual question answering, and visual grounding, the object detection capabilities of LMMs exhibit a significant gap compared to specialist detectors. To bridge the gap, we depart from the conventional methods of integrating heavy detectors with LMMs and propose LMM-Det, a simple yet effective approach that leverages a large multimodal model for vanilla object detection without relying on specialized detection modules. Specifically, we conduct a comprehensive exploratory analysis when a large multimodal model meets with object detection, revealing that the recall rate degrades significantly compared with specialist detection models. To mitigate this, we propose to increase the recall rate by introducing data distribution adjustment and inference optimization tailored for object detection. We re-organize the instruction conversations to enhance the object detection capabilities of large multimodal models. We claim that a large multimodal model possesses detection capability without any extra modules such as a specialist detection model or a region proposal network. Extensive experiments support our claim and show the effectiveness of the versatile LMM-Det. We provide the model weights and code and hope our release will inspire and accelerate advancements in the exploration of the object detection ability of large multimodal models.


#20
Partial Forward Blocking: A Novel Data Pruning Paradigm for Lossless Training Acceleration

Dongyue Wu · Zilin Guo · Jialong Zuo · Nong Sang · Changxin Gao

The ever-growing size of training datasets enhances the generalization capability of modern machine learning models but also incurs exorbitant computational costs. Existing data pruning approaches aim to accelerate training by removing those less important samples. However, they often rely on gradients or proxy models, leading to prohibitive additional costs of gradient back-propagation and proxy model training. In this paper, we propose Partial Forward Blocking (PFB), a novel framework for lossless training acceleration. The efficiency of PFB stems from its unique pipeline: sample importance is assessed based on features extracted from the shallow layers of the target model. Less important samples are then pruned, allowing only the retained ones to proceed with the subsequent forward pass and loss back-propagation. This mechanism significantly reduces the computational overhead of deep-layer forward passes and back-propagation for pruned samples, while also eliminating the need for auxiliary backward computations and proxy model training. Moreover, PFB introduces probability density as an indicator of sample importance. Combined with an adaptive distribution estimation module, our method dynamically prioritizes relatively rare samples, aligning with the constantly evolving training state. Extensive experiments demonstrate the significant superiority of PFB in performance and speed. On ImageNet, PFB achieves a 0.5% accuracy improvement and 33% training time reduction with 40% data pruned. Our code will be publicly available.
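The density-based selection step can be sketched in a few lines. This is a toy illustration under strong assumptions: each sample's shallow-layer feature is reduced to a single scalar, and a Gaussian fitted to the current batch stands in for the adaptive distribution estimation module, whose actual form the abstract does not describe. The function names are hypothetical.

```python
import math
from statistics import mean, pstdev
from typing import List


def gaussian_density(x: float, mu: float, sigma: float) -> float:
    """Probability density of x under a normal distribution N(mu, sigma^2)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


def select_rare_samples(shallow_features: List[float], prune_ratio: float) -> List[int]:
    """Return indices of samples allowed to continue to the deep layers.

    Samples with the lowest estimated density (the rarest ones) are kept;
    the densest `prune_ratio` fraction is blocked after the shallow layers.
    """
    mu = mean(shallow_features)
    sigma = pstdev(shallow_features) or 1e-8  # guard against a constant batch
    density = [gaussian_density(x, mu, sigma) for x in shallow_features]
    n_keep = len(shallow_features) - round(len(shallow_features) * prune_ratio)
    rarest_first = sorted(range(len(shallow_features)), key=lambda i: density[i])
    return sorted(rarest_first[:n_keep])


# Toy scalar features for a batch of 10 samples (made-up values).
shallow = [0.0, 0.1, -0.1, 0.05, 3.0, -2.5, 0.02, -0.03, 1.5, -0.08]
kept = select_rare_samples(shallow, prune_ratio=0.4)
```

Only the samples in `kept` would run through the remaining forward pass and back-propagation, so the saved compute grows with both the prune ratio and the depth at which blocking happens.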


#21
ShortV: Efficient Multimodal Large Language Models by Freezing Visual Tokens in Ineffective Layers

Qianhao Yuan · Qingyu Zhang · yanjiang liu · Jiawei Chen · Yaojie Lu · Hongyu Lin · Jia Zheng · Xianpei Han · Le Sun

Multimodal Large Language Models (MLLMs) suffer from high computational costs due to their massive size and the large number of visual tokens. In this paper, we investigate layer-wise redundancy in MLLMs by introducing a novel metric, Layer Contribution (LC), which quantifies the impact of a layer's transformations on visual and text tokens, respectively. The calculation of LC involves measuring the divergence in model output that results from removing the layer's transformations on the specified tokens. Our pilot experiment reveals that many layers of MLLMs exhibit minimal contribution during the processing of visual tokens. Motivated by this observation, we propose ShortV, a training-free method that leverages LC to identify ineffective layers and freezes visual token updates in these layers. Experiments show that ShortV can freeze visual tokens in approximately 60% of the MLLM layers, thereby dramatically reducing computational costs related to updating visual tokens. For example, it achieves a 50% reduction in FLOPs on LLaVA-NeXT-13B while maintaining superior performance. The code will be publicly available.
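The layer-ranking idea behind LC can be sketched as follows. The sketch assumes KL divergence as the divergence measure (the abstract only says "divergence") and takes the model's output distributions, with and without each layer's transformation on visual tokens, as precomputed inputs rather than running a real model. All names and numbers are illustrative.

```python
import math
from typing import Dict, List


def kl_divergence(p: List[float], q: List[float]) -> float:
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def layers_to_freeze(
    full_output: List[float],
    ablated_outputs: Dict[int, List[float]],
    ratio: float,
) -> List[int]:
    """Pick the `ratio` fraction of layers with the smallest contribution.

    `ablated_outputs[layer]` is the output distribution obtained when that
    layer's transformation on visual tokens is removed.
    """
    contribution = {
        layer: kl_divergence(full_output, out) for layer, out in ablated_outputs.items()
    }
    n_freeze = round(len(contribution) * ratio)
    least_useful_first = sorted(contribution, key=contribution.get)
    return sorted(least_useful_first[:n_freeze])


# Toy next-token distributions over a 3-word vocabulary (made-up values).
full = [0.7, 0.2, 0.1]
ablated = {
    0: [0.2, 0.5, 0.3],     # removing layer 0 changes the output a lot
    1: [0.69, 0.21, 0.10],  # removing layer 1 barely matters
    2: [0.7, 0.2, 0.1],     # removing layer 2 changes nothing
    3: [0.6, 0.25, 0.15],
    4: [0.68, 0.2, 0.12],
}
frozen = layers_to_freeze(full, ablated, ratio=0.6)
```

Layers selected this way would skip updates to visual tokens at inference time, which is where the FLOPs savings reported in the abstract come from.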


#22
Achieving More with Less: Additive Prompt Tuning for Rehearsal-Free Class-Incremental Learning

Haoran Chen · Ping Wang · Zihan Zhou · Xu Zhang · Zuxuan Wu · Yu-Gang Jiang

Class-incremental learning (CIL) enables models to learn new classes progressively while preserving knowledge of previously learned ones. Recent advances in this field have shifted towards parameter-efficient fine-tuning techniques, with many approaches building upon the framework that maintains a pool of learnable prompts. Although effective, these methods introduce substantial computational overhead, primarily due to prompt pool querying and increased input sequence lengths from prompt concatenation. In this work, we present a novel prompt-based approach that addresses this limitation. Our method trains a single set of shared prompts across all tasks and, rather than concatenating prompts to the input, directly modifies the CLS token's attention computation by adding the prompts to it. This simple and lightweight design not only significantly reduces computational complexity—both in terms of inference costs and the number of trainable parameters—but also eliminates the need to optimize prompt lengths for different downstream tasks, offering a more efficient yet powerful solution for rehearsal-free class-incremental learning. Extensive experiments across a diverse range of CIL benchmarks demonstrate the effectiveness of our approach, highlighting its potential to establish a new prompt-based CIL paradigm. Furthermore, experiments on general recognition benchmarks beyond the CIL setting also show strong performance, positioning our method as a promising candidate for a general parameter-efficient fine-tuning approach.


#23
CIARD: Cyclic Iterative Adversarial Robustness Distillation

Liming Lu · Shuchao Pang · Xu Zheng · Xiang GU · Anan Du · Yunhuai Liu · Yongbin Zhou

Adversarial robustness distillation (ARD) aims to transfer both performance and robustness from a teacher model to a lightweight student model, enabling resilient performance in resource-constrained scenarios. Though existing ARD approaches enhance the student model's robustness, an inevitable by-product is degraded performance on clean examples. We summarize the causes of this problem, inherent in existing methods with a dual-teacher framework, as: ① The divergent optimization objectives of the dual-teacher models, i.e., the clean and robust teachers, impede effective knowledge transfer to the student model, and ② The iteratively generated adversarial examples during training lead to performance deterioration of the robust teacher model. To address these challenges, we propose a novel Cyclic Iterative ARD (CIARD) method with two key innovations: ① A multi-teacher framework with contrastive push-loss alignment to resolve conflicts in dual-teacher optimization objectives, and ② Continuous adversarial retraining to maintain dynamic teacher robustness against performance degradation from the varying adversarial examples. Extensive experiments on CIFAR-10, CIFAR-100, and Tiny-ImageNet demonstrate that CIARD achieves remarkable performance with an average 3.53% improvement in adversarial defense rates across various attack scenarios and a 5.87% increase in clean sample accuracy, establishing a new benchmark for balancing model robustness and generalization. Our code is available at https://github.com/CIARD2025/CIARD.


#24
Moderating the Generalization of Score-based Generative Model

Wan Jiang · He Wang · Xin Zhang · Dan Guo · Zhaoxin Fan · Yunfeng Diao · Richang Hong

Score-based Generative Models (SGMs) have demonstrated remarkable generalization capabilities, e.g., generating unseen but natural data. However, the greater the generalization power, the more likely the unintended generalization, and the more dangerous the abuse. Despite these concerns, unlearning for SGMs has not been explored. To fill this gap, we first examine the current 'gold standard' in Machine Unlearning (MU), i.e., re-training the model after removing the undesirable training data, and find that it does not work in SGMs. Further analysis of score functions reveals that the MU 'gold standard' does not alter the original score function, which explains its ineffectiveness. Building on this insight, we propose the first Moderated Score-based Generative Model (MSGM), which introduces a novel score adjustment strategy that redirects the score function away from undesirable data during the continuous-time stochastic differential equation process. Although designed for SGMs, MSGM is a general and flexible MU framework compatible with diverse diffusion architectures, training strategies and downstream tasks. The code will be shared upon acceptance.


#25
Highlight
Scaling Language-Free Visual Representation Learning

David Fan · Shengbang Tong · Jiachen Zhu · Koustuv Sinha · Zhuang Liu · Xinlei Chen · Michael Rabbat · Nicolas Ballas · Yann LeCun · Amir Bar · Saining Xie

Visual Self-Supervised Learning (SSL) currently underperforms Contrastive Language-Image Pretraining (CLIP) in multimodal settings such as Visual Question Answering (VQA). This multimodal gap is often attributed to the semantics introduced by language supervision, even though visual SSL and CLIP models are often trained on different data. In this work, we ask the question: "Do visual self-supervised approaches lag behind CLIP due to the lack of language supervision, or differences in the training data?" We study this question by training both visual SSL and CLIP models on the same MetaCLIP data, and leveraging VQA as a diverse testbed for vision encoders. In this controlled setup, visual SSL models scale better than CLIP models in terms of data and model capacity, and visual SSL performance does not saturate even after scaling up to 7B parameters. Consequently, we observe visual SSL methods achieve CLIP-level performance on a wide range of VQA and classic vision benchmarks. These findings demonstrate that pure visual SSL can match language-supervised visual pretraining at scale, opening new opportunities for vision-centric representation learning.


#26
LLM-assisted Entropy-based Adaptive Distillation for Unsupervised Fine-grained Visual Representation Learning

Jianfeng Dong · Danfeng Luo · Daizong Liu · Jie Sun · Xiaoye Qu · Xun Yang · Dongsheng Liu · Xun Wang

Unsupervised Fine-grained Visual Representation Learning (FVRL) aims to learn discriminative features to distinguish subtle differences among visually similar categories without using labeled fine-grained data. Existing works, which typically learn representations from target data, often struggle to capture subtle inter-class variations due to limited prior fine-grained knowledge. To alleviate this, this paper proposes LLM-assisted Entropy-based Adaptive Distillation (LEAD), a novel unsupervised FVRL framework that selectively distills fine-grained knowledge from a powerful teacher model built upon pre-trained models. Specifically, we first harness the powerful reasoning capabilities of Large Language Models (LLMs) to generate contextual knowledge of fine-grained category-aware descriptions, enriching semantic priors in the teacher model. These descriptions are then used to form a prototype-driven fine-grained classifier, which acts as an assistant to generate rich knowledge with a frozen vision-language model. In addition, to achieve effective knowledge transfer, we further introduce an entropy-based adaptive mechanism, which dynamically adjusts the distillation strength based on the information entropy to identify and prioritize valuable knowledge. Extensive experimental results on three fine-grained datasets demonstrate the effectiveness and efficiency of our proposed LEAD for unsupervised FVRL. Our source code is available at https://anonymous.4open.science/r/EAD-FFAB.
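
One way an entropy-based weighting could work is sketched below. This is an assumption, not the LEAD formulation: the abstract does not give the weighting rule or its direction, so the sketch simply assigns a larger distillation weight to teacher predictions with lower normalized entropy. The name `distill_weight` is hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def distill_weight(teacher_probs):
    """Weight in [0, 1]: 1 for a fully confident teacher, 0 for a uniform one.

    Assumption: confident (low-entropy) teacher outputs are treated as the
    more valuable knowledge; the paper may define this differently.
    """
    k = len(teacher_probs)
    return 1.0 - entropy(teacher_probs) / math.log(k)

w_confident = distill_weight([0.97, 0.01, 0.01, 0.01])
w_uniform = distill_weight([0.25, 0.25, 0.25, 0.25])
w_one_hot = distill_weight([1.0, 0.0, 0.0, 0.0])
```

A per-sample weight like this would multiply the distillation loss term, so that uncertain teacher outputs contribute less to the student update.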


#27
InfoBridge: Balanced Multimodal Integration through Conditional Dependency Modeling

Chenxin Li · Yifan Liu · Panwang Pan · Hengyu Liu · Xinyu Liu · Wuyang Li · Cheng Wang · Weihao Yu · Yiyang LIN · Yixuan Yuan

Developing systems that can interpret diverse real-world signals remains a fundamental challenge in multimodal learning. Current approaches to multimodal fusion face significant obstacles stemming from inherent modal heterogeneity. While existing methods attempt to enhance fusion through cross-modal alignment or interaction mechanisms, they often struggle to balance effective integration with preserving modality-specific information, and frequently neglect crucial contextual nuances unique to each modality. We introduce ModBridge, a novel framework grounded in conditional information maximization principles that addresses these limitations. Our approach reframes multimodal fusion through two key innovations: (1) we formulate fusion as a conditional mutual information optimization problem with an integrated protective margin that simultaneously encourages cross-modal information sharing while safeguarding against over-fusion that could eliminate unique modal characteristics; and (2) we enable fine-grained contextual fusion by leveraging modality-specific conditions (such as audio event detection signals) to guide the integration process. Comprehensive evaluations across multiple benchmarks demonstrate that ModBridge consistently outperforms state-of-the-art multimodal architectures, establishing a more principled and effective approach to multimodal learning that better captures complementary information across diverse input signals.


#28
A Hidden Stumbling Block in Generalized Category Discovery: Distracted Attention

Qiyu Xu · Zhanxuan Hu · Yu Duan · Ercheng Pei · Yonghang Tai

Generalized Category Discovery (GCD) aims to classify unlabeled data from both known and unknown categories by leveraging knowledge from labeled known categories. While existing methods have made notable progress, they often overlook a hidden stumbling block in GCD: distracted attention. Specifically, when processing unlabeled data, models tend to focus not only on key objects in the image but also on task-irrelevant background regions, leading to suboptimal feature extraction. To remove this stumbling block, we propose Attention Focusing (AF), an adaptive mechanism designed to sharpen the model's focus by pruning non-informative tokens. AF consists of two simple yet effective components: Token Importance Measurement (TIME) and Token Adaptive Pruning (TAP), working in a cascade. TIME quantifies token importance across multiple scales, while TAP prunes non-informative tokens by utilizing the multi-scale importance scores provided by TIME. AF is a lightweight, plug-and-play module that integrates seamlessly into existing GCD methods. When incorporated into one prominent GCD method, SimGCD, AF achieves up to a 15.4% performance improvement over the baseline with minimal computational overhead. The implementation code is provided at https://anonymous.4open.science/r/AFGCD-E652.


#29
Meta-Learning Dynamic Center Distance: Hard Sample Mining for Learning with Noisy Labels

Chenyu Mu · Yijun Qu · Jiexi Yan · Erkun Yang · Cheng Deng

The sample selection approach is a widely adopted strategy for learning with noisy labels, where examples with lower losses are effectively treated as clean during training. However, this clean set often becomes dominated by easy examples, limiting the model’s meaningful exposure to more challenging cases and reducing its expressive power. To overcome this limitation, we introduce a novel metric called Dynamic Center Distance (DCD), which can quantify sample difficulty and provide information that critically complements loss values. Unlike approaches that rely on predictions, DCD is computed in feature space as the distance between sample features and a dynamically updated center, established through a proposed meta-learning framework. Building on preliminary semi-supervised training that captures fundamental data patterns, we incorporate DCD to further refine the classification loss, down-weighting well-classified examples and strategically focusing training on a sparse set of hard instances. This strategy prevents easy examples from dominating the classifier, leading to more robust learning. Extensive experiments across multiple benchmark datasets, including synthetic and real-world noise settings, as well as natural and medical images, consistently demonstrate the effectiveness of our method.
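
To make the feature-space distance concrete, here is a minimal sketch of measuring how far a sample's features lie from a running center. The exponential moving average update is a stand-in chosen for illustration; the abstract states that the center is established through a meta-learning framework, which this sketch does not reproduce. The names `update_center` and `center_distance` are hypothetical.

```python
import math

def update_center(center, feature, momentum=0.9):
    """Move the center toward a new feature vector (EMA stand-in, an assumption)."""
    return [momentum * c + (1 - momentum) * f for c, f in zip(center, feature)]

def center_distance(feature, center):
    """Euclidean distance between a sample's features and the current center."""
    return math.dist(feature, center)

center = [0.0, 0.0]
hard_sample = [3.0, 4.0]
easy_sample = [0.3, 0.4]

d_hard = center_distance(hard_sample, center)
d_easy = center_distance(easy_sample, center)

# After the center moves toward the hard sample, its distance shrinks.
center = update_center(center, hard_sample, momentum=0.5)
d_hard_after = center_distance(hard_sample, center)
```

Samples far from the center would be treated as harder, complementing the loss value when deciding which clean examples to emphasize.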


#30
ChartPoint: Guiding MLLMs with Grounding Reflection for Chart Reasoning

Zhengzhuo Xu · Sinan Du · Yiyan Qi · Siwen Lu · Chengjin Xu · Chun Yuan · Jian Guo

Multimodal Large Language Models (MLLMs) have emerged as powerful tools for chart comprehension. However, they rely heavily on content extracted via OCR, which leads to numerical hallucinations when chart textual annotations are sparse. While existing methods focus on scaling instructions, they fail to address the fundamental challenge, i.e., reasoning with visual perception. In this paper, we identify a critical observation: MLLMs exhibit weak grounding in chart elements and proportional relationships, as evidenced by their inability to localize key positions to match their reasoning. To bridge this gap, we propose PointCoT, which integrates reflective interaction into chain-of-thought reasoning in charts. By prompting MLLMs to generate bounding boxes and re-render charts based on location annotations, we establish connections between textual reasoning steps and visual grounding regions. We further introduce an automated pipeline to construct ChartPoint-SFT-62k, a dataset featuring 19.2K high-quality chart samples with step-by-step CoT, bounding boxes, and re-rendered visualizations. Leveraging this data, we develop two instruction-tuned models, ChartPointQ2 and ChartPointQ2.5, which outperform the state of the art across several chart benchmarks, e.g., +5.04% on ChartBench.


#31
Multimodal Large Language Model-Guided ISP Hyperparameter Optimization with Dynamic Preference Learning

Xinyu Sun · Zhikun Zhao · congyan lang · Bing Li · Juan Wang

The image signal processing (ISP) pipeline is responsible for converting the RAW images collected from the sensor into high-quality RGB images. It contains a series of image processing modules and associated ISP hyperparameters. Recent learning-based approaches aim to automate ISP hyperparameter optimization using solely image data. However, their unimodal nature limits their ability to capture richer contextual information, reducing robustness and adaptability across diverse application scenarios. To address this limitation, we propose a Multimodal Large Language Model (MLLM)-guided ISP hyperparameter optimization framework, which integrates textual insights generated by MLLMs into the optimization process. By incorporating both high-level semantic cues and low-level image quality descriptors, our method enhances contextual understanding and task adaptability. Additionally, we introduce a Dynamic Pair Generation (DPG) refinement strategy based on Direct Preference Optimization (DPO), facilitating efficient preference alignment without the need for extensive human-labeled data. This novel framework not only improves the directional consistency of optimization but also significantly reduces the computational and data preparation overhead. We validate our proposed methods on both high-level and low-level vision tasks, demonstrating superior performance compared to existing methods.


#32
Creation-MMBench: Assessing Context-Aware Creative Intelligence in MLLMs

Xinyu Fang · Zhijian Chen · Kai Lan · Lixin Ma · Shengyuan Ding · Yingji Liang · Xiangyu Zhao · Farong Wen · Zicheng Zhang · Guofeng Zhang · Haodong Duan · Kai Chen · Dahua Lin

Creativity is a fundamental aspect of intelligence, involving the ability to generate novel and appropriate solutions across diverse contexts. While Large Language Models (LLMs) have been extensively evaluated for their creative capabilities, the assessment of Multimodal Large Language Models (MLLMs) in this domain remains largely unexplored. To address this gap, we introduce Creation-MMBench, a multimodal benchmark specifically designed to evaluate the creative capabilities of MLLMs in real-world, image-based tasks. The benchmark comprises 765 test cases spanning 51 fine-grained tasks. To ensure rigorous evaluation, we define instance-specific evaluation criteria for each test case, guiding the assessment of both general response quality and factual consistency with visual inputs. Experimental results reveal that current open-source MLLMs significantly underperform compared to proprietary models in creative tasks. Furthermore, our analysis demonstrates that visual fine-tuning can negatively impact the base LLM’s creative abilities. Creation-MMBench provides valuable insights for advancing MLLM creativity and establishes a foundation for future improvements in multimodal generative intelligence. Full data and evaluation code will be released soon.

Out-of-Distribution (OOD) detection is critical for safely deploying deep models in open-world environments, where inputs may lie outside the training distribution. During inference on a model trained exclusively with In-Distribution (ID) data, we observe a salient gradient phenomenon: around an ID sample, the local gradient directions for “enhancing” that sample’s predicted class remain relatively consistent, whereas OOD samples, unseen in training, exhibit disorganized or conflicting gradient directions in the same neighborhood. Motivated by this observation, we propose an inference-stage technique to short-circuit those feature coordinates that spurious gradients exploit to inflate OOD confidence, while leaving ID classification largely intact. To circumvent the expense of recomputing the logits after this gradient short-circuit, we further introduce a local first-order approximation that accurately captures the post-modification outputs without a second forward pass. Experiments on standard OOD benchmarks show our approach yields substantial improvements. Moreover, the method is lightweight and requires minimal changes to the standard inference pipeline, offering a practical path toward robust OOD detection in real-world applications.
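
The first-order approximation can be illustrated in the simplest setting, a linear classification head, where it happens to be exact. The sketch below is an assumption-laden toy, not the authors' procedure: the weights, features, and the choice of which coordinate to short-circuit are made up, and for a nonlinear head the same formula would only be approximate.

```python
def linear_logits(W, b, h):
    """Logits of a linear head: z = W h + b."""
    return [sum(w * x for w, x in zip(row, h)) + bk for row, bk in zip(W, b)]

def first_order_logits(logits, W, h, masked):
    """Estimate logits after zeroing the feature coordinates in `masked`.

    Zeroing coordinate j changes h by -h[j]; since d(logits)/dh = W for a
    linear head, the first-order update is z_k - sum_j W[k][j] * h[j].
    No second forward pass is needed.
    """
    return [z - sum(row[j] * h[j] for j in masked) for z, row in zip(logits, W)]

# Hypothetical head with 2 classes and 3 feature coordinates.
W = [[1.0, 2.0, 0.0],
     [0.0, 1.0, 3.0]]
b = [0.5, -0.5]
h = [2.0, 1.0, 4.0]
masked = [2]  # coordinate assumed to be inflating class 1's confidence

logits = linear_logits(W, b, h)
approx = first_order_logits(logits, W, h, masked)

# Reference: actually zero the coordinate and recompute the logits.
h_masked = [0.0 if j in masked else x for j, x in enumerate(h)]
exact = linear_logits(W, b, h_masked)
```

In this toy case `approx` matches `exact`, and the logit that relied on the masked coordinate drops sharply while the other is unchanged.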


#34
Boundary Probing for Input Privacy Protection When Using LMM Services

Xiaofei Hui · Haoxuan Qu · Ping Hu · Hossein Rahmani · Jun Liu

Alongside the rapid development of Large Multimodal Models (LMMs) like GPT-4V, privacy concerns are also rising. As LMMs are commonly deployed as cloud services, users are typically required to upload their personal images and videos to the cloud to access these services, raising great concerns about visual privacy leakage. In this paper, we investigate the critical but underexplored problem of maintaining good LMM performance while protecting private visual information in the input data. We tackle this problem in the practical scenario where the LMM remains a black box, i.e., we can only access its input and output without knowing the LMM's internal information. To address this challenging problem, we propose a new Privacy-Aware Boundary Probing (PABP) framework, which, from a novel perspective, converts this problem into a privacy optimization problem guided by the decision boundary between the "satisfactory" and "unsatisfactory" LMM utility states. We propose two tailored schemes, Gradually-Expanding-Probing (GEP) and Prior-Guided-Probing (PGP), to maintain satisfactory LMM performance while achieving privacy protection. We show the effectiveness of our framework on different benchmarks (code will be released).


#35
Interpretable Zero-Shot Learning with Locally-Aligned Vision-Language Model

Shiming Chen · Bowen Duan · Salman Khan · Fahad Khan

Large-scale vision-language models (VLMs), such as CLIP, have achieved remarkable success in zero-shot learning (ZSL) by leveraging large-scale visual-text pair datasets. However, these methods often lack interpretability, as they compute the similarity between an entire query image and the embedded category words, making it difficult to explain their predictions. One approach to address this issue is to develop interpretable models by integrating language, where classifiers are built using discrete attributes, similar to human perception. This introduces a new challenge: how to effectively align local visual features with corresponding attributes based on pre-trained VLMs. To tackle this, we propose LaZSL, a locally-aligned vision-language model for interpretable ZSL. LaZSL employs local visual-semantic alignment via optimal transport to perform interaction between visual regions and their associated attributes, facilitating effective alignment and providing interpretable similarity without the need for additional training. Extensive experiments demonstrate that our method offers several advantages, including enhanced interpretability, improved accuracy, and strong domain generalization.
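
To show what aligning regions with attributes via optimal transport can look like, the sketch below runs entropic-regularized Sinkhorn iterations on a small cost matrix. Sinkhorn is a common OT solver, but the abstract does not say which solver LaZSL uses; the cost values (imagined as 1 minus a region-attribute cosine similarity), uniform marginals, and parameter choices are all assumptions.

```python
import math

def sinkhorn(cost, eps=0.1, n_iters=200):
    """Entropic OT plan between uniform marginals for a given cost matrix."""
    n, m = len(cost), len(cost[0])
    a = [1.0 / n] * n  # mass on each visual region
    b = [1.0 / m] * m  # mass on each attribute
    K = [[math.exp(-c / eps) for c in row] for row in cost]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

# Hypothetical costs: rows are image regions, columns are attributes.
# Region 0 matches attribute 0 well; region 1 matches attribute 1 well.
cost = [[0.1, 0.9],
        [0.8, 0.2]]
plan = sinkhorn(cost)

# An OT-based similarity can be read off as the negative transport cost.
transport_cost = sum(
    plan[i][j] * cost[i][j] for i in range(2) for j in range(2)
)
```

The transport plan itself is what makes the score inspectable: each entry says how much of a region's mass was matched to a given attribute.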


#36
MMCR: Benchmarking Cross-Source Reasoning in Scientific Papers

Yang Tian · Zheng Lu · Mingqi Gao · Zheng Liu · Bo Zhao

Machine comprehension of scientific papers reflects a high level of Artificial General Intelligence: it requires the ability to reason across fragmented and heterogeneous sources of information, presenting a complex and practically significant challenge. While Vision-Language Models (VLMs) have made remarkable strides in various tasks, particularly those involving reasoning with evidence sourced from a single image or text page, their ability to use cross-source information for reasoning remains an open problem. This work presents MMCR, a high-difficulty benchmark designed to evaluate VLMs' capacity for reasoning with cross-source information from scientific papers. The benchmark comprises 276 high-quality questions, meticulously annotated by humans across 7 subjects and 10 task types. Experiments with 18 VLMs demonstrate that cross-source reasoning presents a substantial challenge for existing models. Notably, even the top-performing model, GPT-4o, achieved only 48.55% overall accuracy, with just 20% accuracy in multi-table comprehension tasks, while the second-best model, Qwen2.5-VL-72B, reached 39.86% overall accuracy. These results highlight the pressing need to develop VLMs capable of effectively utilizing cross-source information for reasoning.
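
As a quick consistency check on the reported figures, the sketch below finds how many correct answers out of 276 would round to each stated overall accuracy. It assumes overall accuracy is a plain per-question average over all 276 questions, which the abstract does not state explicitly; `implied_correct` is a hypothetical helper.

```python
def implied_correct(accuracy_pct, n_questions):
    """Counts k for which 100 * k / n_questions rounds to accuracy_pct (2 decimals)."""
    return [
        k for k in range(n_questions + 1)
        if abs(100 * k / n_questions - accuracy_pct) < 0.005
    ]

gpt4o = implied_correct(48.55, 276)
qwen = implied_correct(39.86, 276)
```

Under that assumption the two percentages correspond to 134 and 110 correct answers respectively, a gap of 24 questions.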


#37
ConsNoTrainLoRA: Data-driven Weight Initialization of Low-rank Adapters using Constraints

Debasmit Das · Hyoungwoo Park · Munawar Hayat · Seokeon Choi · Sungrack Yun · Fatih Porikli

Foundation models are pre-trained on large-scale datasets and subsequently fine-tuned on small-scale datasets using parameter-efficient fine-tuning (PEFT) techniques like low-rank adapters (LoRA). In most previous works, LoRA weight matrices are randomly initialized with a fixed rank across all attachment points. In this paper, we improve convergence and final performance of LoRA fine-tuning, using our proposed data-driven weight initialization method, ConsNoTrainLoRA (CNTLoRA). We express LoRA initialization as a domain shift problem where we use multiple constraints relating the pre-training and fine-tuning activations. By reformulating these constraints, we obtain a closed-form estimate of LoRA weights that depends on pre-training weights and fine-tuning activation vectors and hence requires no training during initialization. This weight estimate is decomposed to initialize the up and down matrices with proposed flexibility of variable ranks. With the proposed initialization method, we fine-tune on downstream tasks such as image generation, image classification and image understanding. Both quantitative and qualitative results demonstrate that CNTLoRA outperforms standard and data-driven weight initialization methods. Extensive analyses and ablations further elucidate the design choices of our framework, providing an optimal recipe for faster convergence and enhanced performance.


#38
UPRE: Zero-Shot Domain Adaptation for Object Detection via Unified Prompt and Representation Enhancement

Xiao Zhang · Fei Wei · Yong Wang · Wenda Zhao · Feiyi Li · Xiangxiang Chu

Zero-shot domain adaptation (ZSDA) presents substantial challenges due to the lack of images in the target domain. Previous approaches leverage Vision-Language Models (VLMs) to tackle this challenge, exploiting their zero-shot learning capabilities. However, these methods primarily address domain distribution shifts and overlook the misalignment between the detection task and VLMs, which rely on manually crafted prompts. To overcome these limitations, we propose the unified prompt and representation enhancement (UPRE) framework, which jointly optimizes both textual prompts and visual representations. Specifically, our approach introduces a multi-view domain prompt that combines linguistic domain priors with detection-specific knowledge, and a visual representation enhancement module that produces domain style variations. Furthermore, we introduce multi-level enhancement strategies, including relative domain distance and positive-negative separation, which align multi-modal representations at the image level and capture diverse visual representations at the instance level, respectively. Extensive experiments conducted on nine benchmark datasets demonstrate the superior performance of our framework in ZSDA detection scenarios.


#39
Dataset Distillation as Data Compression: A Rate-Utility Perspective

Youneng Bao · Yiping Liu · Zhuo Chen · Yongsheng Liang · Mu Li · Kede Ma

The "scale-is-everything" paradigm in machine learning has resulted in escalating computational and storage demands as datasets and models grow increasingly large. Dataset distillation addresses this challenge by compressing datasets into compact latent representations that generate synthetic data capable of matching the performance of models trained on the original data, formulated as a rate-utility optimization problem. Existing dataset distillation methods fail to achieve Pareto optimality due to their inability to jointly optimize compression rate and utility within a differentiable framework. Drawing inspiration from learned image compression (LIC), we propose a unified framework where latent representations are modeled as optimizable parameter grids (codes) and a generator (decoder) transforms codes into synthesized images. This approach subsumes nearly all existing latent representations while explicitly modeling the rate as an optimizable term through precise entropy estimation of the latent. To quantify compression efficiency, we introduce bits per class (BPC), a novel metric for distilled datasets. We optimize the uniform latent representation according to the joint rate-utility trade-off and achieve state-of-the-art results on CIFAR-10/100 and ImageNet-128. For instance, on the ImageNet-Subset dataset, our method achieves a 170× compression rate improvement over the baseline approach while maintaining comparable utility. The framework is compatible with most existing distillation algorithms and serves as a plug-in component to enhance rate-utility performance without modifications.


#40
Boosting Generative Adversarial Transferability with Self-supervised Vision Transformer Features

Shangbo Wu · Yu-an Tan · Ruinan Ma · Wencong Ma · Dehua Zhu · Yuanzhang Li

The ability of deep neural networks (DNNs) comes from extracting and interpreting features from the data provided. By exploiting intermediate features in DNNs instead of relying on hard labels, we craft adversarial perturbations that generalize more effectively, boosting black-box transferability. In previous work, these features almost always come from supervised learning. Inspired by the exceptional synergy between self-supervised learning and the Transformer architecture, this paper explores whether exploiting self-supervised Vision Transformer (ViT) representations can improve adversarial transferability. We present dSVA, a generative dual self-supervised ViT features attack that exploits both global structural features from contrastive learning (CL) and local textural features from masked image modeling (MIM), the self-supervised learning paradigm duo for ViTs. We design a novel generative training framework that incorporates a generator to create black-box adversarial examples, and strategies to train the generator by exploiting joint features and the attention mechanism of self-supervised ViTs. Our findings show that CL and MIM enable ViTs to attend to distinct feature tendencies, which, when exploited in tandem, boast great adversarial generalizability. By disrupting dual deep features distilled by self-supervised ViTs, we are rewarded with remarkable black-box transferability to models of various architectures, outperforming the state of the art.


#41
Open-set Cross Modal Generalization via Multimodal Unified Representation

Hai Huang · Yan Xia · Shulei Wang · Hanting Wang · Minghui Fang · Shengpeng Ji · Sashuai Zhou · Tao Jin · Zhou Zhao

This paper extends Cross Modal Generalization (CMG) to open-set environments by proposing the more challenging Open-set Cross Modal Generalization (OSCMG) task. This task evaluates multimodal unified representations in open-set conditions, addressing the limitations of prior closed-set cross-modal evaluations. OSCMG requires not only cross-modal knowledge transfer but also robust generalization to unseen classes within new modalities, a scenario frequently encountered in real-world applications. Existing multimodal unified representation work lacks consideration for open-set environments. To tackle this, we propose MICU, comprising two key components: Fine-Coarse Masked multimodal InfoNCE (FCMI) and Cross modal Unified Jigsaw Puzzles (CUJP). FCMI enhances multimodal alignment by applying contrastive learning at both holistic semantic and temporal levels, incorporating masking to enhance generalization. CUJP enhances feature diversity and model uncertainty by integrating modality-agnostic feature selection with self-supervised learning, thereby strengthening the model’s ability to handle unknown categories in open-set tasks. Extensive experiments on CMG and the newly proposed OSCMG validate the effectiveness of our approach. Code is available in supplementary material.


#42
Adversarial Data Augmentation for Single Domain Generalization via Lyapunov Exponent-Guided Optimization

ZUYU ZHANG · Ning Chen · Yongshan Liu · Qinghua Zhang · Xu Zhang

Single Domain Generalization (SDG) aims to develop models capable of generalizing to unseen target domains using only one source domain, a task complicated by substantial domain shifts and limited data diversity. Existing SDG approaches primarily rely on data augmentation techniques, which struggle to effectively adapt training dynamics to accommodate large domain shifts. To address this, we propose LEAwareSGD, a novel Lyapunov Exponent (LE)-guided optimization approach inspired by dynamical systems theory. By leveraging LE measurements to modulate the learning rate, LEAwareSGD encourages model training near the edge of chaos, a critical state that optimally balances stability and adaptability. This dynamic adjustment allows the model to explore a wider parameter space and capture more generalizable features, ultimately enhancing the model's generalization capability. Extensive experiments on PACS, OfficeHome, and DomainNet demonstrate that LEAwareSGD yields substantial generalization gains, achieving up to 9.47% improvement on PACS in low-data regimes. These results underscore the effectiveness of training near the edge of chaos for enhancing model generalization capability in SDG tasks.


#43
Adversarial Robust Memory-Based Continual Learner

Xiaoyue Mi · Fan Tang · Zonghan Yang · Danding Wang · Juan Cao · Peng Li · Yang Liu

Despite the remarkable advances that have been made in continual learning, the adversarial vulnerability of such methods has not been fully discussed. We delve into the adversarial robustness of memory-based continual learning algorithms and observe limited robustness improvement from directly applying adversarial training techniques. Our preliminary studies reveal the twin challenges for building adversarially robust continual learners: accelerated forgetting in continual learning and gradient obfuscation in adversarial robustness. In this study, we put forward a novel adversarially robust memory-based continual learner that adjusts data logits to mitigate the forgetting of past knowledge caused by adversarial samples. Furthermore, we devise a gradient-based data selection mechanism to overcome the gradient obfuscation caused by limited stored data. The proposed approach can widely integrate with existing memory-based continual learning and adversarial training algorithms in a plug-and-play way. Extensive experiments on Split-CIFAR10/100 and Split-Tiny-ImageNet demonstrate the effectiveness of our approach, achieving a maximum forgetting reduction of 34.17% in adversarial data for ResNet, and 20.10% for ViT.


#44
NegRefine: Refining Negative Label-Based Zero-Shot OOD Detection

Amirhossein Ansari · Ke Wang · Pulei Xiong

Recent advancements in Vision-Language Models like CLIP have enabled zero-shot OOD detection by leveraging both image and textual label information. Among these, negative label-based methods such as NegLabel and CSP have shown promising results by utilizing a lexicon of words to define negative labels for distinguishing OOD samples. However, these methods tend to misclassify in-distribution samples as OOD when the negative labels are subcategories of in-distribution labels or proper nouns. They also face limitations in handling images that match multiple in-distribution and negative labels. We propose NegRefine, a novel negative label refinement framework for zero-shot OOD detection. By introducing a filtering mechanism to exclude subcategory labels and proper nouns from the negative label set, and incorporating a multi-matching-aware scoring function that dynamically adjusts the contributions of multiple labels matching an image, NegRefine ensures a more robust separation between in-distribution and OOD samples. We evaluate NegRefine on large-scale benchmarks, including ImageNet-1K. Source code is available in the supplementary material.

Semi-supervised continual learning (SSCL) seeks to leverage both labeled and unlabeled data in a sequential learning setup, aiming to reduce annotation costs while managing continual data arrival. SSCL introduces complex challenges, including ensuring effective unlabeled learning (UL), while balancing memory stability (MS) and learning plasticity (LP). Previous SSCL efforts have typically focused on isolated aspects of the three, while this work presents USP, a divide-and-conquer framework designed to synergistically enhance these three aspects: (1) Feature Space Reservation (FSR) strategy for LP, which constructs reserved feature locations for future classes by shaping old classes into an equiangular tight frame; (2) Divide-and-Conquer Pseudo-labeling (DCP) approach for UL, which assigns reliable pseudo-labels across both high- and low-confidence unlabeled data; and (3) Class-mean-anchored Unlabeled Distillation (CUD) for MS, which reuses DCP's outputs to anchor unlabeled data to stable class means for distillation to prevent forgetting. Comprehensive evaluations show USP outperforms prior SSCL methods, with gains up to 18.26% in the last accuracy, validating its effectiveness. The source code will be made available upon acceptance of the paper.
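The equiangular tight frame mentioned for Feature Space Reservation can be illustrated with the standard simplex ETF construction. This is a sketch of that well-known geometry only, assuming the feature dimension equals the number of classes; how USP actually shapes old classes and reserves slots for future ones is not described in the abstract. The function name `simplex_etf` is hypothetical.

```python
import numpy as np

def simplex_etf(num_classes):
    """Return a (K, K) matrix whose columns are simplex-ETF prototypes.

    Each column has unit norm and every pair of distinct columns has
    the same cosine similarity, -1 / (K - 1), which is the most
    spread-out arrangement of K unit vectors.
    """
    k = num_classes
    centering = np.eye(k) - np.ones((k, k)) / k
    return np.sqrt(k / (k - 1)) * centering

prototypes = simplex_etf(5)
gram = prototypes.T @ prototypes          # pairwise inner products
off_diagonal = gram[~np.eye(5, dtype=bool)]
```

Because all prototypes are fixed in advance and equally separated, some of them can be left unassigned until new classes arrive, which is one plausible reading of "reserved feature locations".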


#46
A Unified Framework to BRIDGE Complete and Incomplete Deep Multi-View Clustering under Non-IID Missing Patterns

Xiaorui Jiang · Buyun He · Peng Yuan Zhou · Xinyue Chen · Jingcai Guo · Jie Xu · Yong Liao

Incomplete multi-view clustering (IMVC) has gained increasing attention due to its ability to analyze incomplete multi-view data. Although deep IMVC methods have achieved significant progress, they still face two challenges: (I) Their method-specific, inseparable designs limit their application. (II) Non-independent and identically distributed (Non-IID) missing patterns have not been considered, which causes performance degradation. To address these issues, we propose a novel unified framework that bridges from deep MVC to deep IMVC, while emphasizing the robustness against Non-IID missing patterns. Our framework has a two-stage process: (I) Multi-view learning on complete data, where our framework is modularly established to be compatible with different multi-view interaction objectives. (II) Transfer learning and clustering on incomplete data, where we propose a multi-view domain adversarial learning method to improve the model robustness to Non-IID missing patterns. Moreover, an intra-view and inter-view imputation strategy is introduced for more reliable clustering. Based on our unified framework, we easily construct multiple IMVC instances, and extensive experiments verify their clustering effectiveness.


#47
HumorDB: Can AI understand graphical humor?

Vedaant V Jain · Gabriel Kreiman · Felipe Feitosa

Despite significant advancements in image segmentation and object detection, understanding complex scenes remains a significant challenge. Here, we focus on graphical humor as a paradigmatic example of image interpretation that requires elucidating the interaction of different scene elements in the context of prior cognitive knowledge. This paper introduces HumorDB, a novel, controlled, and carefully curated dataset designed to evaluate and advance visual humor understanding by AI systems. The dataset comprises diverse images spanning photos, cartoons, sketches, and AI-generated content, including minimally contrastive pairs where subtle edits differentiate between humorous and non-humorous versions. We evaluate humans, state-of-the-art vision models, and large vision-language models on three tasks: binary humor classification, funniness rating prediction, and pairwise humor comparison. The results reveal a gap between current AI systems and human-level humor understanding. While pretrained vision-language models perform better than vision-only models, they still struggle with abstract sketches and subtle humor cues. Analysis of attention maps shows that even when models correctly classify humorous images, they often fail to focus on the precise regions that make the image funny. Preliminary mechanistic interpretability studies and evaluation of model explanations provide initial insights into how different architectures process humor. Our results identify promising trends and current limitations, suggesting that an effective understanding of visual humor requires sophisticated architectures capable of detecting subtle contextual features and bridging the gap between visual perception and abstract reasoning. All the code and data are available here: https://anonymous.4open.science/r/HumorDB_-049A


#48
GCAV: A Global Concept Activation Vector Framework for Cross-Layer Consistency in Interpretability

Zhenghao He · Sanchit Sinha · Guangzhi Xiong · Aidong Zhang

Concept Activation Vectors (CAVs) provide a powerful approach for interpreting deep neural networks by quantifying their sensitivity to human-defined concepts. However, when computed independently at different layers, CAVs often exhibit inconsistencies, making cross-layer comparisons unreliable. To address this issue, we propose the Global Concept Activation Vector (GCAV), a novel framework that unifies CAVs into a single, semantically consistent representation. Our method leverages contrastive learning to align concept representations across layers and employs an attention-based fusion mechanism to construct a globally integrated CAV. By doing so, our method significantly reduces the variance in TCAV scores while preserving concept relevance, ensuring more stable and reliable concept attributions. To evaluate the effectiveness of GCAV, we introduce Testing with Global Concept Activation Vectors (TGCAV) as a method to apply TCAV on GCAV-based representations. We conduct extensive experiments on multiple deep neural networks, demonstrating that our method effectively mitigates concept inconsistency across layers, enhances concept localization, and improves robustness against adversarial perturbations. By integrating cross-layer information into a coherent framework, our method offers a more comprehensive and interpretable understanding of how deep learning models encode human-defined concepts.
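For readers unfamiliar with the TCAV score whose cross-layer variance the abstract refers to, here is a minimal sketch of the standard per-layer definition: the fraction of inputs whose class-logit gradient has a positive component along the concept direction. The gradients below are made-up numbers, and the function name `tcav_score` is hypothetical; the contrastive alignment and attention-based fusion of GCAV are not shown.

```python
import numpy as np

def tcav_score(logit_grads, cav):
    """Fraction of examples whose logit increases along the concept.

    logit_grads: (n_examples, n_features) gradients of the class logit
                 with respect to one layer's activations.
    cav:         (n_features,) concept activation vector for that layer.
    """
    cav = cav / np.linalg.norm(cav)
    directional_derivatives = logit_grads @ cav
    return float(np.mean(directional_derivatives > 0))

# Toy gradients for four examples at a single layer (illustrative values).
grads = np.array([[1.0, 0.0], [0.5, 0.5], [-1.0, 0.0], [0.2, -0.1]])
score = tcav_score(grads, np.array([1.0, 0.0]))
```

Computing this score independently at several layers and comparing the results is where the inconsistency described in the abstract would show up.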

Adversarially robust knowledge distillation transfers the robustness of a large-scale teacher model to a lightweight student while preserving natural performance. However, foundation Vision-Language Models (VLMs) also demand the transfer of zero-shot inference capabilities. We find that standard robust distillation using untargeted adversarial examples fails to transfer out-of-distribution (zero-shot) robustness, as these adversaries primarily push inputs away from their original distribution, exploring a limited portion of the teacher’s decision space and missing more diverse failure modes. A natural solution is to generate multiple targeted adversaries that traverse diverse paths across decision boundaries, so that they probe a broader region of the teacher’s decision surface. However, naive targeted adversary optimization often converges to local optima within a single category’s decision region, limiting the diversity. To address this, we propose a Multi-Objective Optimization (MOO)-based adversarial distillation framework that transfers robustness from large VLMs to lightweight ones by exploiting adversaries with two main objectives: misclassification and category-level adversarial diversity. Theoretically, we show that optimizing for diversity mitigates adversarial collapse into local optima, ensuring adversaries span multiple decision regions and capture the teacher’s generalizable robust features. Extensive experiments demonstrate the superiority of our method over state-of-the-art adversarial learning across diverse scenarios.


#50
Mitigating Object Hallucinations via Sentence-Level Early Intervention

Shangpin Peng · Senqiao Yang · Li Jiang · Zhuotao Tian

Multimodal large language models (MLLMs) have revolutionized cross-modal understanding but continue to struggle with hallucinations - fabricated content contradicting visual inputs. Existing hallucination mitigation methods either incur prohibitive computational costs or introduce distribution mismatches between training data and model outputs. We identify a critical insight: hallucinations predominantly emerge at the early stages of text generation and propagate through subsequent outputs. To address this, we propose SENTINEL (Sentence-level Early iNtervention Through IN-domain prEference Learning), a framework that eliminates dependency on human annotations. Specifically, we first bootstrap high-quality in-domain preference pairs by iteratively sampling model outputs, validating object existence through cross-checking with two open-vocabulary detectors, and classifying sentences into hallucinated/non-hallucinated categories. Subsequently, we use context-coherent positive samples and hallucinated negative samples to iteratively build context-aware preference data. Finally, we train models using a context-aware preference loss (C-DPO) that emphasizes discriminative learning at the sentence level where hallucinations initially manifest. Experimental results show that SENTINEL can reduce hallucinations by 90% over the original model and outperforms the previous state-of-the-art method on both the hallucination benchmarks and general capabilities benchmarks, manifesting its superiority and generalization ability. The proposed models, datasets and code will be made publicly available.
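The abstract does not define C-DPO, so the sketch below shows only the standard direct preference optimization (DPO) loss for a single preference pair, on which context-aware variants are presumably built. The function name, the argument names, and `beta=0.1` are illustrative assumptions; the inputs are summed log-probabilities of the preferred (non-hallucinated) and rejected (hallucinated) sentences under the trained policy and a frozen reference model.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair (not the paper's C-DPO).

    loss = -log(sigmoid(beta * margin)), where the margin measures how
    much more the policy favours the chosen response over the rejected
    one, relative to the reference model.
    """
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    z = beta * margin
    # Numerically stable form of log(1 + exp(-z)).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

# No preference learned yet: policy and reference agree.
neutral = dpo_loss(-1.0, -1.0, -1.0, -1.0)
# Policy has shifted probability toward the non-hallucinated sentence.
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

A sentence-level variant would apply a loss of this shape to individual sentences given their preceding context rather than to whole responses.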


#51
Active Membership Inference Test (aMINT): Enhancing Model Auditability with Multi-Task Learning.

Daniel DeAlcala · Aythami Morales · Julian Fierrez · Gonzalo Mancera · Ruben Tolosana · Javier Ortega-Garcia

Active Membership Inference Test (aMINT) is a method designed to detect whether given data was used during the training of machine learning models. In Active MINT, we propose a novel multi-task learning process that involves simultaneously training two models: the original or Audited Model, and a secondary model, referred to as the MINT Model, responsible for identifying the data used for training the Audited Model. This novel multi-task learning approach has been designed to incorporate the auditability of the model as an optimization objective during the training process of neural networks. The proposed approach incorporates intermediate activation maps as inputs to MINT layers, which are trained to enhance the detection of the training data. We present results using a wide range of neural networks, from lighter architectures like MobileNet to more complex ones such as Vision Transformers, evaluated across 5 public benchmarks. Our proposed Active MINT achieves over 80% accuracy in detecting whether given data was used for training, significantly outperforming previous approaches in the literature. Our proposed aMINT and related methodological developments contribute to increasing transparency in AI training, thereby facilitating stronger safeguards in AI deployments in order to achieve proper security, privacy, and copyright protection (Code will be available in https://github.com/Anonymized).


#52
Unknown Text Learning for CLIP-based Few-Shot Open-set Recognition

Rui Ma · Qilong Wang · Bing Cao · Qinghua Hu · Yahong Han

Recently, vision-language models (e.g., CLIP) with prompt learning have shown great potential in few-shot learning. However, an open issue remains for the effective extension of CLIP-based models to few-shot open-set recognition (FSOR), which requires classifying known classes and detecting unknown samples using a few known samples. The core challenge is that unknown samples and their textual descriptions are unavailable. To address this, we propose an Unknown Text Learning (UTL) method for CLIP-based FSOR tasks with only known samples. UTL involves two key components, i.e., universal unknown words optimization (U$^{2}$WO) and unknown label smoothing (ULS). Specifically, U$^{2}$WO constructs the universal space of unknown words with basis vectors and characterizes unknown text based on a linear combination of those basis vectors. To efficiently learn unknown text without unknown samples, ULS is presented to perform contrastive learning between unknown text and known samples by regulating the label of unknown classes to a small constant, which flexibly empowers unknown text to be non-matching with and confused on known visual samples. In addition, our UTL incorporates an additional context for known classes to mitigate conflicts of context optimization between known and unknown classes. UTL effectively regularizes the predicted probability by integrating learnable unknown text. Experimental results on various benchmarks show that our UTL is superior to its counterparts while achieving state-of-the-art performance.

Edge computing in person re-identification (ReID) is crucial for reducing the load on central cloud servers and ensuring user privacy. Conventional methods for obtaining compact models require computations for each individual student model. When multiple models of varying sizes are needed to accommodate different resource conditions, this leads to repetitive and cumbersome calculations. To address this challenge, we propose a novel knowledge inheritance approach named OSKT (One-Shot Knowledge Transfer), which consolidates the knowledge of the teacher model into an intermediate carrier called a weight chain. When a downstream scenario demands a model that meets specific resource constraints, this weight chain can be expanded to the target model size without additional computation. OSKT significantly outperforms state-of-the-art compression methods, with the added advantage of one-time knowledge transfer that eliminates the need for frequent computations for each target model. On the Market1501 benchmark, using pre-trained ResNet50 or ViT-S as the teacher model, OSKT generates smaller student models (1/64th and 1/10th the parameters respectively) achieving accuracies of 89.4% and 87.1%, outperforming pruning (80.7%, 74.1%) and knowledge distillation (65.7%, 38.7%).


#54
ShortFT: Diffusion Model Alignment via Shortcut-based Fine-Tuning

Xiefan Guo · Miaomiao Cui · Liefeng Bo · Di Huang

Backpropagation-based approaches aim to align diffusion models with reward functions through end-to-end backpropagation of the reward gradient within the denoising chain, offering a promising perspective. However, due to the computational costs and the risk of gradient explosion associated with the lengthy denoising chain, existing approaches struggle to achieve complete gradient backpropagation, leading to suboptimal results. In this paper, we introduce Shortcut-based Fine-Tuning (ShortFT), an efficient fine-tuning strategy that utilizes the shorter denoising chain. More specifically, we employ the recently researched trajectory-preserving few-step diffusion model, which enables a shortcut over the original denoising chain, and construct a shortcut-based denoising chain of shorter length. The optimization on this chain notably enhances the efficiency and effectiveness of fine-tuning the foundational model. Our method has been rigorously tested and can be effectively applied to various reward functions, significantly improving alignment performance and surpassing state-of-the-art alternatives.


#55
PRISM: Reducing Spurious Implicit Biases in Vision-Language Models with LLM-Guided Embedding Projection

Mahdiyar Molahasani · Azadeh Motamedi · Michael Greenspan · Il-Min Kim · Ali Etemad

We introduce Projection-based Reduction of Implicit Spurious bias in vision-language Models (PRISM), a new data-free and task-agnostic solution for bias mitigation in VLMs like CLIP. VLMs often inherit and amplify biases in their training data, leading to skewed predictions. PRISM is designed to debias VLMs without relying on predefined bias categories or additional external data. It operates in two stages: first, an LLM is prompted with simple class prompts to generate scene descriptions that contain spurious correlations. Next, PRISM uses our novel contrastive-style debiasing loss to learn a projection that maps the embeddings onto a latent space that minimizes spurious correlations while preserving the alignment between image and text embeddings. Extensive experiments demonstrate that PRISM outperforms current debiasing methods on the commonly used Waterbirds and CelebA datasets. We make our code public at: https://anonymous.4open.science/r/PRISM_official.


#56
Open-Unfairness Adversarial Mitigation for Generalized Deepfake Detection

Zhaoyang Li · Zhu Teng · Baopeng Zhang · Jianping Fan

Deepfake detection methods are becoming increasingly crucial for identity security and have recently been employed to support legal proceedings. However, these methods often exhibit unfairness due to flawed logical reasoning, undermining the reliability of their predictions and raising concerns about their applicability in legal contexts. To mitigate this bias, existing approaches typically rely on predefined demographic attributes, such as race and gender. However, these assumptions are inherently limited, as different deepfake detectors exhibit substantial variations in fairness performance, often uncovering intricate and unforeseen bias patterns. To this end, we propose the Adversarial Open-Unfairness Discovery and Mitigation Network (AdvOU), a novel framework designed to mitigate unpredictable unfairness in deepfake detection. Our approach strengthens general deepfake detectors by equipping them with a lightweight Unfairness Regulator (UR), which dynamically identifies and mitigates bias. Furthermore, we propose an adversarial learning paradigm that alternates between the training of the Open-Unfairness Discovery (OUD) module and the Unfairness Adversarial Mitigation (UAM) module. The former intensifies unfairness within UR to reveal underlying bias patterns, while the latter leverages fairness in the detector by enforcing adversarial robustness against unfairness. Extensive experiments on widely used deepfake datasets validate the effectiveness of our approach, outperforming state-of-the-art methods in both fairness and generalization evaluations for cross-domain deepfake detection. The code is available at [link].


#57
MultiVerse: A Multi-Turn Conversation Benchmark for Evaluating Large Vision and Language Models

Young-Jun Lee · Byung-Kwan Lee · Jianshu Zhang · Yechan Hwang · Byungsoo Ko · Han-Gyu Kim · Dongyu Yao · Xuankun Rong · Eojin Joo · Seung-Ho Han · Bowon Ko · Ho-Jin Choi

Vision-and-Language Models (VLMs) have shown impressive capabilities on single-turn benchmarks, yet real-world applications often demand more intricate multi-turn dialogues. Existing multi-turn datasets (e.g., MMDU, ConvBench) only partially capture the breadth and depth of conversational scenarios encountered by users. In this work, we introduce MultiVerse, a novel multi-turn conversation benchmark featuring 647 dialogues, each averaging four turns, derived from a diverse set of 12 popular VLM evaluation benchmarks. With 484 tasks and 484 interaction goals, MultiVerse covers a wide range of topics, from factual knowledge and perception to advanced reasoning tasks such as mathematics and coding. To facilitate robust assessment, we propose a checklist-based evaluation method that leverages GPT-4o as the automated evaluator, measuring performance across 37 key aspects, including perceptual accuracy, linguistic clarity, and factual correctness. We evaluate 18 VLMs on MultiVerse, revealing that even the strongest models (e.g., GPT-4o) achieve only a 50% success rate in complex multi-turn conversations, highlighting the dataset's challenging nature. Notably, we find that providing full dialogue context significantly enhances performance for smaller or weaker models, emphasizing the importance of in-context learning. We believe MultiVerse provides a broad testbed for evaluating the multi-turn interaction abilities of VLMs.


#58
Spatial Preference Rewarding for MLLMs Spatial Understanding

Han Qiu · Peng Gao · Lewei Lu · Xiaoqin Zhang · Ling Shao · Shijian Lu

Multimodal large language models (MLLMs) have demonstrated promising spatial understanding capabilities, such as referencing and grounding object descriptions. Despite their successes, MLLMs still fall short in fine-grained spatial perception abilities, such as generating detailed region descriptions or accurately localizing objects. Additionally, they often fail to respond to the user's requirements for desired fine-grained spatial understanding. This issue might arise because existing approaches primarily focus on tuning MLLMs to model pre-annotated instruction data to inject spatial knowledge, without direct supervision of MLLMs' actual responses. We address this issue with Spatial Preference Rewarding (SPR), an approach that enhances MLLMs' spatial capabilities by rewarding MLLMs' detailed responses with precise object localization over vague or inaccurate responses. With randomly selected image regions and region descriptions from MLLMs, SPR introduces semantic and localization scores to comprehensively evaluate the text quality and localization quality in MLLM-generated descriptions. We also refine the MLLM descriptions with better localization accuracy and pair the best-scored refinement with the initial descriptions of the lowest score for direct preference optimization, thereby enhancing fine-grained alignment with visual input. Extensive experiments over standard referring and grounding benchmarks show that SPR improves MLLM spatial understanding capabilities effectively with minimal overhead in training. Data and code will be released.


#59
EA-KD: Entropy-based Adaptive Knowledge Distillation

Chi-Ping Su · Ching-Hsun Tseng · Bin Pu · Lei Zhao · Jiewen Yang · Zhuangzhuang Chen · Shin-Jye Lee

Knowledge distillation (KD) enables a smaller "student" model to mimic a larger "teacher" model by transferring knowledge from the teacher's output or features. However, most KD methods treat all samples uniformly, overlooking the varying learning value of each sample and thereby limiting effectiveness. In this paper, we propose Entropy-based Adaptive Knowledge Distillation (EA-KD), a simple yet effective plug-and-play KD method that prioritizes learning from valuable samples. EA-KD quantifies each sample’s learning value by strategically combining the entropy of the teacher and student output, then dynamically reweights the distillation loss to place greater emphasis on high-entropy samples. Extensive experiments across diverse KD frameworks and tasks—including image classification, object detection, and large language model (LLM) distillation—demonstrate that EA-KD consistently enhances performance, achieving state-of-the-art results with negligible computational cost. Our code will be publicly available.
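The reweighting idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the EA-KD formulation: the abstract says only that teacher and student entropies are "strategically combined", so the plain sum used here, the normalization by the batch mean, and the temperature of 4.0 are all placeholders. All function names are hypothetical.

```python
import numpy as np

EPS = 1e-12

def softmax(logits, temperature=1.0):
    z = logits / temperature
    z = z - z.max(axis=1, keepdims=True)  # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def entropy(probs):
    return -(probs * np.log(probs + EPS)).sum(axis=1)

def entropy_weighted_kd_loss(teacher_logits, student_logits, temperature=4.0):
    """Per-sample KL(teacher || student), reweighted by output entropy."""
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    kl = (p_t * (np.log(p_t + EPS) - np.log(p_s + EPS))).sum(axis=1)
    # ASSUMPTION: combine the two entropies by simple addition.
    weights = entropy(p_t) + entropy(p_s)
    weights = weights / weights.mean()    # keep the overall loss scale
    return float((weights * kl).mean()), weights

teacher = np.array([[10.0, 0.0, 0.0],    # confident teacher output
                    [0.0, 0.0, 0.0]])    # maximally uncertain teacher output
student = np.array([[2.0, 1.0, 0.0],
                    [2.0, 1.0, 0.0]])
loss, weights = entropy_weighted_kd_loss(teacher, student)
```

In this toy batch the second sample receives the larger weight, since its teacher distribution has higher entropy while the student outputs are identical.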


#60
Structured Policy Optimization: Enhance Large Vision-Language Model via Self-referenced Dialogue

Guohao Sun · Can Qin · Yihao Feng · Zeyuan Chen · Ran Xu · Sohail Dianat · MAJID RABBANI · Raghuveer Rao · Zhiqiang Tao

Preference optimization algorithms typically enhance LLM response quality by leveraging human feedback on multiple answers given a fixed instruction. However, these methods often fail to capture the dynamic nature of conversational exchanges. For large vision-language models (LVLMs), direct preference optimization (DPO) can over-emphasize linguistic nuances while overlooking visual context. To address this challenge, we introduce structured policy optimization (SPO) -- a novel preference optimization method that simultaneously aligns preference instructions, responses, and dialogue interactions to improve multi-modal understanding and reasoning capabilities. The efficacy of SPO is attributed to one key design: treating the questioning and answering as a sequential action and binding them through a trajectory reward. This reward formulation better aligns with real-world dialogue studies and eliminates the need for fixed instructions. We evaluate our models on interleaved benchmarks, including image, multi-image, and video-based understanding and reasoning tasks. Experimental results show that fine-tuning an LVLM with the proposed SPO on multi-modal preference data aligns it with human preference more efficiently than DPO.


#61
Seal Your Backdoor with Variational Defense

Ivan Sabolic · Matej Grcic · Siniša Šegvić

We propose VIBE, a model-agnostic framework that trains classifiers resilient to backdoor attacks. The key concept behind our approach is to treat malicious inputs and corrupted labels from the training dataset as observed random variables, while the actual clean labels are latent. VIBE then recovers the corresponding latent clean label posterior through variational inference. The resulting training procedure follows the expectation-maximization (EM) algorithm. The E-step infers the clean pseudolabels by solving an entropy-regularized optimal transport problem, while the M-step updates the classifier parameters via gradient descent. Being modular, VIBE can seamlessly integrate with recent advancements in self-supervised representation learning, which enhance its ability to resist backdoor attacks. We experimentally validate the method's effectiveness against contemporary backdoor attacks on standard datasets, a large-scale setup with 1k classes, and a dataset poisoned with multiple attacks. VIBE consistently outperforms previous defenses across all tested scenarios.
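An entropy-regularized optimal transport problem of the kind named for the E-step is commonly solved with Sinkhorn iterations. The sketch below shows that generic solver only; the abstract does not say which solver VIBE uses, and the cost matrix, marginals, and regularization strength here are made-up placeholders rather than the paper's quantities.

```python
import numpy as np

def sinkhorn(cost, row_marginal, col_marginal, reg=0.5, n_iters=200):
    """Entropy-regularized optimal transport via Sinkhorn scaling.

    Returns a plan P = diag(u) @ K @ diag(v), with K = exp(-cost / reg),
    whose row sums match row_marginal and column sums match col_marginal.
    """
    kernel = np.exp(-cost / reg)
    u = np.ones_like(row_marginal)
    v = np.ones_like(col_marginal)
    for _ in range(n_iters):
        u = row_marginal / (kernel @ v)      # rescale to fit the rows
        v = col_marginal / (kernel.T @ u)    # rescale to fit the columns
    return u[:, None] * kernel * v[None, :]

rng = np.random.default_rng(0)
cost = rng.random((6, 3))                    # placeholder: 6 samples, 3 classes
rows = np.full(6, 1.0 / 6.0)                 # each sample carries equal mass
cols = np.array([0.5, 0.3, 0.2])             # assumed class prior
plan = sinkhorn(cost, rows, cols)
# Normalizing each row gives a soft pseudolabel distribution per sample.
pseudolabels = plan / plan.sum(axis=1, keepdims=True)
```

The column constraint is what distinguishes this from a plain per-sample softmax: the pseudolabels are forced to respect an overall class balance.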


#62
Semi-ViM: Bidirectional State Space Model for Mitigating Label Imbalance in Semi-Supervised Learning

Hongyang He · Hongyang Xie · Haochen You · Victor Sanchez

Semi-supervised learning (SSL) is often hindered by learning biases when imbalanced datasets are used for training, which limits its effectiveness in real-world applications. In this paper, we propose Semi-ViM, a novel SSL framework based on Vision Mamba, a bidirectional state space model (SSM) that serves as a superior alternative to Transformer-based architectures for visual representation learning. Semi-ViM effectively deals with label imbalance and improves model stability through two key innovations: LyapEMA, a stability-aware parameter update mechanism inspired by Lyapunov theory, and SSMixup, a novel mixup strategy applied at the hidden state level of bidirectional SSMs. Experimental results on ImageNet-1K and ImageNet-LT demonstrate that Semi-ViM significantly outperforms state-of-the-art SSL models, achieving 85.40% accuracy with only 10% of the labeled data, surpassing Transformer-based methods such as Semi-ViT.


#63
CODE-CL: Conceptor-Based Gradient Projection for Deep Continual Learning

Marco P. Apolinario · Sakshi Choudhary · Kaushik Roy

Continual learning (CL), the ability to progressively acquire and integrate new concepts, is essential for intelligent systems to adapt to dynamic environments. However, deep neural networks struggle with catastrophic forgetting (CF) when learning tasks sequentially, as training for new tasks often overwrites previously learned knowledge. To address this, recent approaches constrain updates to orthogonal subspaces using gradient projection, effectively preserving important gradient directions for previous tasks. While effective in reducing forgetting, these approaches inadvertently hinder forward knowledge transfer (FWT), particularly when tasks are highly correlated. In this work, we propose Conceptor-based gradient projection for Deep Continual Learning (CODE-CL), a novel method that leverages conceptor matrix representations, a form of regularized reconstruction, to adaptively handle highly correlated tasks. CODE-CL mitigates CF by projecting gradients onto pseudo-orthogonal subspaces of previous task feature spaces while simultaneously promoting FWT. It achieves this by learning a linear combination of shared basis directions, allowing an efficient balance between stability and plasticity and the transfer of knowledge between overlapping input feature representations. Extensive experiments on continual learning benchmarks validate CODE-CL’s efficacy, demonstrating superior performance, reduced forgetting, and improved FWT as compared to state-of-the-art methods.
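A conceptor matrix has the standard closed form C = R (R + α⁻² I)⁻¹, where R is the feature correlation matrix and α is the aperture. The sketch below computes it for toy features and applies I − C to a gradient as a soft ("pseudo-orthogonal") projection. Using I − C in exactly this way is an assumption made for illustration; the abstract does not give CODE-CL's update rule, and the learned combination of shared basis directions is not shown. Function names and the aperture value are hypothetical.

```python
import numpy as np

def conceptor(features, aperture=4.0):
    """Conceptor C = R (R + aperture^-2 I)^-1 for features of shape (n, d)."""
    n, d = features.shape
    corr = features.T @ features / n
    return corr @ np.linalg.inv(corr + aperture ** -2 * np.eye(d))

def project_gradient(grad, c):
    """Damp the gradient along directions the conceptor marks as used."""
    return grad @ (np.eye(c.shape[0]) - c)

# Toy features from a previous task: strong along axis 0, weak along
# axis 1, absent along axis 2 (illustrative values).
old_features = np.array([[3.0, 0.0, 0.0], [-3.0, 0.0, 0.0],
                         [0.0, 0.1, 0.0], [0.0, -0.1, 0.0]])
c = conceptor(old_features)
eigenvalues = np.linalg.eigvalsh((c + c.T) / 2)
projected = project_gradient(np.array([1.0, 1.0, 1.0]), c)
```

Unlike a hard orthogonal projector, the conceptor's eigenvalues lie strictly between 0 and 1, so weakly used directions are only partially blocked, which leaves room for transfer between correlated tasks.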

Multi-task learning (MTL) enables a joint model to capture commonalities across multiple tasks, reducing computation costs and improving data efficiency. However, a major challenge in MTL optimization is task conflicts, where the task gradients differ in direction or magnitude, limiting model performance compared to single-task counterparts. Sharpness-aware minimization (SAM) minimizes task loss while simultaneously reducing the sharpness of the loss landscape. Our empirical observations show that SAM effectively mitigates task conflicts in MTL. Motivated by these findings, we explore integrating SAM into MTL but face two key challenges. On one hand, both the average loss gradient and individual task gradients--referred to as global and local information--contribute to SAM, but how to combine them remains unclear. On the other hand, directly computing each task gradient introduces significant computational and memory overheads. To address these challenges, we propose SAMO, a lightweight Sharpness-Aware Multi-task Optimization approach, that leverages a joint global-local perturbation. The local perturbations are approximated using only forward passes and are layerwise normalized to improve efficiency. Extensive experiments on a suite of multi-task benchmarks demonstrate both the effectiveness and efficiency of our method.
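For context, a single step of plain single-task SAM looks like the sketch below: the weights are first perturbed along the normalized gradient, and the descent step then uses the gradient taken at the perturbed point. This is the baseline technique only, shown on a toy quadratic loss. SAMO's joint global-local perturbation, its forward-pass approximation, and its layerwise normalization are not reproduced here, and the names and hyperparameters are illustrative assumptions.

```python
import numpy as np

def sam_step(w, grad_fn, lr=0.1, rho=0.05):
    """One sharpness-aware minimization step.

    1. Move to the approximate worst case within a ball of radius rho.
    2. Descend using the gradient evaluated at that perturbed point.
    """
    g = grad_fn(w)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)
    g_sharp = grad_fn(w + eps)
    return w - lr * g_sharp, eps

# Toy quadratic loss 0.5 * w^T A w with one sharp and one flat direction.
A = np.diag([1.0, 10.0])
loss_fn = lambda w: 0.5 * w @ A @ w
grad_fn = lambda w: A @ w

w0 = np.array([1.0, 1.0])
w1, eps = sam_step(w0, grad_fn)
```

Note that each step needs two gradient evaluations; doing this per task is the overhead the abstract describes as motivating forward-pass approximations.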


#65
Beyond the Limits: Overcoming Negative Correlation of Activation-Based Training-Free NAS

Haidong Kang · Lianbo Ma · Pengjun Chen · Guo Yu · Xingwei Wang · Min Huang

Training-free Neural Architecture Search (NAS) has emerged as an efficient way to discover high-performing lightweight models with zero-cost proxies (e.g., the activation-based proxies (AZP)). In this paper, we observe a new \textit{negative correlation phenomenon}: the correlations of the AZP dramatically decrease and become negative as the number of convolutions increases, significantly degrading the prediction performance of AZP over target architectures. No existing works focus on such negative correlation and its underlying mechanism. To address this, through deep analysis of the architectural characteristics scored by AZP, we propose a series of AZP design principles and reveal the potential reason for the above phenomenon: \textit{high non-linearity dramatically degrades the magnitude of AZP score}. Those findings show that existing AZP designs do not obey the proposed principles. Finally, grounded in these insights, we propose a simple yet efficient \underline{N}egative \underline{C}orrelations-\underline{D}efied (\textbf{NCD}) method, which utilizes stochastic activation masking (SAM) and non-linear rescaling (NIR) to effectively eliminate negative correlation of AZP and significantly improve performance. Extensive experimental results validate the effectiveness and efficiency of our method, outperforming state-of-the-art methods on 12 mainstream search spaces with 4 real-world tasks.

Class-Incremental Learning (CIL) requires a learning system to continually learn new classes without forgetting. Although Pre-trained Models (PTMs) have shown excellent performance in CIL, catastrophic forgetting still occurs as the model learns new concepts. Existing methods often freeze the pre-trained network and adapt to incremental tasks using additional lightweight modules. At inference time, the model must accurately identify the most suitable module, but errors in retrieving irrelevant modules can lead to a decline in performance. Additionally, the selected module concentrates solely on task-specific knowledge and neglects the general knowledge shared across tasks, so it is prone to erroneous predictions when it is presented with several similar classes from different tasks. To address these challenges, we propose integrating Task-Specific and Universal Adapters (TUNA). Specifically, we design an orthogonal mechanism to train task-specific adapters, so that they can capture the most crucial features relevant to their respective tasks. Furthermore, we introduce an adapter fusion strategy to construct a universal adapter, which encodes the shared general knowledge across tasks. During inference, we combine predictions from both the task-specific adapter and the universal adapter to effectively utilize both specialized and general knowledge. Extensive experiments on various benchmark datasets demonstrate the state-of-the-art performance of our approach.


#67
I Am Big, You Are Little; I Am Right, You Are Wrong

David A Kelly · Akchunya Chanchal · Nathan Blake

Machine learning for image classification is an active and rapidly developing field. With the proliferation of classifiers of different sizes and different architectures, the problem of choosing the right model becomes more and more important. While we can assess a model's classification accuracy statistically, our understanding of the way these models work is unfortunately quite limited. In order to gain insight into the decision-making process of different vision models, we propose using minimal sufficient pixel sets. These pixels capture the essence of an image through the lens of the model. By comparing the position, overlap, and size of sets of pixels, we identify that different architectures have statistically different minimal pixel sets, in both size and position. In particular, ConvNext and EVA models differ markedly from the others. We also identify that misclassified images are associated with statistically significantly larger pixel sets than correct classifications.
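To make the comparison of pixel sets concrete, here is a small sketch of two measures one could compute between the pixel sets that two models produce for the same image: overlap (Jaccard index) and relative size. The abstract does not specify which statistics the authors use, so both measures and the example sets are assumptions for illustration.

```python
def jaccard(a, b):
    # Overlap of two pixel sets: |A intersect B| / |A union B|; defined as 1.0 for two empty sets.
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def size_ratio(a, b):
    # Size of set a relative to set b (hypothetical helper for comparing set sizes).
    return len(set(a)) / len(set(b))

# Pixels are (row, col) tuples; these two sets are made-up examples.
pixels_model_a = {(0, 0), (0, 1), (1, 0), (1, 1)}
pixels_model_b = {(1, 1), (1, 2), (2, 1), (2, 2)}

overlap = jaccard(pixels_model_a, pixels_model_b)    # one shared pixel out of seven
ratio = size_ratio(pixels_model_a, pixels_model_b)   # both sets have four pixels
```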


#68
Semi-supervised Deep Transfer for Regression without Domain Alignment

Mainak Biswas · Ambedkar Dukkipati · Devarajan Sridharan

Deep learning models are seldom deployed widely for real-world applications (e.g., medicine), because source models do not generalize well to "domain-shifted" target data. Many successful domain adaptation approaches require full access to source data and reliably labeled target data. Yet, such requirements are unrealistic in scenarios where source data cannot be shared, either because of privacy concerns or because they are too large and incur prohibitive storage or computation costs. Moreover, resource constraints may limit the availability of labeled targets. We illustrate this challenge in a neuroscience setting where source data are unavailable, labeled target data are meager, and predictions involve continuous-valued outputs. We build upon Contradistinguisher (CUDA), an efficient framework that learns a shared model across the labeled source and unlabeled target samples, without intermediate alignment of representations. Yet, CUDA was designed for unsupervised DA, with full access to source data and for classification tasks. We develop CRAFT -- a CUDA-based Regularization Approach for Flexible Training -- for source-free (SF), semi-supervised transfer of pretrained models in regression tasks. We showcase the efficacy of CRAFT in two important neuroscience settings: gaze prediction with electroencephalography (EEG) data and "brain age" prediction with structural MRI data. For both datasets, CRAFT yielded up to 9% improvement in root-mean-squared error (RMSE) over finetuned models when labeled training examples were scarce. CRAFT leveraged unlabeled target data and outperformed four competing state-of-the-art source-free domain adaptation models by up to 4%. We propose CRAFT as an efficient approach for source-free, semi-supervised deep transfer for regression, a setting that is ubiquitous in biology and medicine.

Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in document understanding. However, their reasoning processes remain largely black-box, making it difficult to ensure reliability and trustworthiness, especially in high-stakes domains such as legal, financial, and medical document analysis. Existing methods use fixed Chain-of-Thought (CoT) reasoning with supervised fine-tuning (SFT) but suffer from catastrophic forgetting, poor adaptability, and limited generalization across domain tasks. In this paper, we propose DocThinker, a rule-based Reinforcement Learning (RL) framework for dynamic inference-time reasoning. Instead of relying on static CoT templates, DocThinker autonomously refines reasoning strategies via policy learning, generating interpretable intermediate results, including structured reasoning processes, rephrased questions, regions of interest (RoI) supporting the answer, and the final answer. By integrating multi-objective rule-based rewards and KL-constrained optimization, our method mitigates catastrophic forgetting and enhances both adaptability and transparency. Extensive experiments on multiple benchmarks demonstrate that DocThinker significantly improves generalization while producing more interpretable and human-understandable reasoning steps. Our findings highlight RL as a powerful alternative for enhancing explainability and adaptability in MLLM-based document understanding.


#70
CVPT: Cross Visual Prompt Tuning

Lingyun Huang · Jianxu Mao · Junfei YI · Ziming Tao · Yaonan Wang

In recent years, the rapid expansion of model sizes has introduced huge computational overhead. To address this issue, Parameter-Efficient Fine-Tuning (PEFT) methods have been introduced. These methods optimize large-scale pre-trained models for specific tasks by fine-tuning a select group of parameters. Among these PEFT methods, adapter-based and prompt-based methods are the primary techniques. Specifically, in the field of visual fine-tuning, adapters gain prominence over prompts because of the latter’s relatively weaker performance and efficiency. Under these circumstances, we conducted a detailed analysis of Visual Prompt Tuning (VPT) and attributed its shortcomings to the deployment of prompts in VPT. Consequently, we proposed Cross Visual Prompt Tuning (CVPT), which introduces cross-attention to directly capture the relationships between prompts and the original tokens, allowing the prompts to integrate visual features efficiently. This changes the original deployment of prompts, thereby decoupling the prompts from the original tokens and avoiding the distortion of self-attention. Furthermore, we introduce a weight-sharing mechanism to initialize the parameters of cross-attention, which avoids the massive number of learnable parameters that cross-attention would otherwise add and enhances its representative capability. We conduct comprehensive testing across 25 datasets, and the results indicate that CVPT significantly improves VPT’s performance and efficiency in visual tasks. For example, on the VTAB-1K benchmark, CVPT outperforms VPT by over 4% in average accuracy, rivaling the advanced adapter-based methods in performance and efficiency. Our experiments confirm that prompt-based methods can achieve exceptional results in visual fine-tuning. The code is available at https://anonymous.4open.science/r/CVPT-A873/readme.md


#71
From Easy to Hard: The MIR Benchmark for Progressive Interleaved Multi-Image Reasoning

Hang Du · Jiayang Zhang · Guoshun Nan · Wendi Deng · Zhenyan Chen · Chenyang Zhang · Wang Xiao · Shan Huang · Yuqi Pan · Tao Qi · Sicong Leng

Multi-image Interleaved Reasoning aims to improve Multimodal Large Language Models' (MLLMs) ability to jointly comprehend and reason across multiple images and their associated textual contexts, introducing unique challenges beyond single-image or non-interleaved multi-image tasks. While current multi-image benchmarks overlook interleaved textual contexts and neglect distinct relationships between individual images and their associated texts, enabling models to reason over multi-image interleaved data may significantly enhance their comprehension of complex scenes and better capture cross-modal correlations. To bridge this gap, we introduce a novel benchmark, MIR, requiring joint reasoning over multiple images accompanied by interleaved textual contexts to accurately associate image regions with corresponding texts and logically connect information across images. To enhance MLLMs' ability to comprehend multi-image interleaved data, we introduce reasoning steps for each instance within the benchmark and propose a stage-wise curriculum learning strategy. This strategy follows an "easy to hard" approach, progressively guiding models from simple to complex scenarios, thereby enhancing their ability to handle challenging tasks. Extensive experiments benchmarking multiple MLLMs demonstrate that our method significantly enhances models' reasoning performance on MIR and other established benchmarks, highlighting the challenges current MLLMs face with multi-image interleaved reasoning. We believe that MIR will encourage further research into multi-image interleaved reasoning, facilitating advancements in MLLMs' capability to handle complex inter-modal tasks.

Learning fine-grained representations from coarse labels for fine-grained visual recognition (FGVR) is a challenging yet valuable task, as it alleviates the reliance on labor-intensive fine-grained annotations. Early approaches focused primarily on minimizing intra-fine-grained-class variation but overlooked inter-fine-grained-class separability, resulting in limited FGVR performance. Subsequent studies employed a top-down paradigm to enhance separability via deep clustering, yet these methods require predefining the number of fine-grained classes, which is often impractical to obtain. Here, we introduce a bottom-up learning paradigm that constructs a hierarchical dendrogram by iteratively merging similar instances/clusters, inferring higher-level semantics from lowest-level instances without predefining class numbers. Leveraging this, we propose BuCSFR, a novel method that integrates a Bottom-up Construction (BuC) module to build the dendrogram based on a minimal information loss criterion, and a Separable Fine-grained Representation (SFR) module that treats dendrogram nodes as pseudo-labels to ensure representation separability. The synergistic interaction between these modules enables iterative enhancement, grounded theoretically in the Expectation-Maximization (EM) framework. Extensive experiments on five benchmark datasets demonstrate the superiority of our approach, showcasing its effectiveness in learning separable representations for FGVR.

Reinforcement learning (RL) has proven its potential in complex decision-making tasks. Yet, many RL systems rely on manually crafted state representations, requiring effort in feature engineering. Visual Reinforcement Learning (VRL) offers a way to address this challenge by enabling agents to learn directly from raw visual input. Nonetheless, VRL continues to face generalization issues, as models often overfit to specific domain features. To tackle this issue, we propose Diffusion Guided Adaptive Augmentation (DGA2), an augmentation method that utilizes Stable Diffusion to enhance domain diversity. We introduce an Adaptive Domain Shift strategy that dynamically adjusts the degree of domain shift according to the agent’s learning progress for effective augmentation with Stable Diffusion. Additionally, we employ saliency as the mask to preserve the semantics of data. Our experiments on the DMControl-GB, Adroit, and Procgen environments demonstrate that DGA2 improves generalization performance compared to existing data augmentation and generalization methods.


#74
Causal Disentanglement and Cross-Modal Alignment for Enhanced Few-Shot Learning

Tianjiao Jiang · Zhen Zhang · Yuhang Liu · Javen Qinfeng Shi

Few-shot learning (FSL) aims to enable models to learn effectively from limited labeled data. However, existing methods often struggle with overfitting due to the high dimensionality of feature spaces and the small sample sizes typically available. More precisely, the features used in most FSL applications can be viewed as a mixture of latent disentangled features. As a result, the learner is often required to implicitly infer the mixing procedure, which involves estimating a large number of parameters and frequently leads to overfitting. Building on recent theoretical advances in multi-modal contrastive learning, we propose the Causal CLIP Adapter (CCA), a novel approach that disentangles visual features obtained from CLIP by applying independent component analysis (ICA). While ICA effectively disentangles latent features, it may inadvertently introduce misalignment in the feature space. To address this, we leverage CLIP's inherent cross-modal alignment and enhance it both unidirectionally and bidirectionally through fine-tuning and cross-attention mechanisms. The logits from uni-modal and cross-modal classifications are then combined linearly to improve overall classification accuracy. Extensive experiments conducted across 11 benchmark datasets demonstrate that our method consistently outperforms state-of-the-art (SOTA) techniques in terms of robustness to distributional shifts and resistance to adversarial noise, all while maintaining computational efficiency. These results underscore the effectiveness of causal disentanglement and enhanced cross-modal alignment in significantly boosting FSL performance.
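The final step described above, a linear combination of uni-modal and cross-modal logits, can be sketched as follows. The mixing weight `alpha`, the convex-combination form, and the example logits are assumptions for illustration; the abstract does not state the exact weighting used in CCA.

```python
def fuse_logits(uni_logits, cross_logits, alpha=0.5):
    # Linear combination of two logit vectors; alpha is a hypothetical mixing weight.
    return [alpha * u + (1.0 - alpha) * c for u, c in zip(uni_logits, cross_logits)]

def predict(logits):
    # Index of the largest fused logit.
    return max(range(len(logits)), key=lambda i: logits[i])

# Made-up logits for a 3-class problem.
uni = [2.0, 1.0, 0.5]      # uni-modal classifier prefers class 0
cross = [0.5, 3.0, 0.2]    # cross-modal classifier prefers class 1

fused = fuse_logits(uni, cross, alpha=0.5)
label = predict(fused)
```

In this made-up case the two classifiers disagree, and the fused prediction follows the more confident cross-modal branch.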


#75
Multi-View 3D Point Tracking

Frano Rajič · Haofei Xu · Marko Mihajlovic · Siyuan Li · Irem Demir · Emircan Gündoğdu · Lei Ke · Sergey Prokudin · Marc Pollefeys · Siyu Tang

We introduce the first data-driven multi-view 3D point tracker, designed to track arbitrary points in dynamic scenes using multiple camera views. Unlike existing monocular trackers, which struggle with depth ambiguities and occlusion, or previous multi-camera methods that require over 20 cameras and tedious per-sequence optimization, our feed-forward model directly predicts 3D correspondences using a practical number of cameras (e.g., four), enabling robust and accurate online tracking. Our tracker fuses multi-view features into a unified point cloud and applies k-nearest-neighbors correlation alongside a transformer-based update to reliably estimate long-range 3D correspondences, even under occlusion. We train on 5K synthetic multi-view Kubric sequences and evaluate on two real-world benchmarks—Panoptic Studio and DexYCB—where we achieve median trajectory errors of 3.2 cm and 2.3 cm, respectively. Notably, on DexYCB, our method surpasses the strongest single-view tracker by 58.2% and a simpler multi-view triplane-based baseline by 46.5%. It also generalizes better to diverse camera setups of 1–8 cameras with varying vantage points and video lengths of 24–150 frames. By releasing our pre-trained tracker alongside training and evaluation datasets, we aim to set a new standard for multi-view 3D tracking research and provide a practical tool for a wide range of real-world applications.


#76
Removing Cost Volumes from Optical Flow Estimators

Simon Kiefhaber · Stefan Roth · Simone Schaub-Meyer

Cost volumes are used in every modern optical flow estimator, but due to their computational and space complexity, they are often a limiting factor in optical flow methods regarding both processing speed and the resolution of input frames. Motivated by our empirical observation that cost volumes lose their importance once all other network parts of, e.g., a RAFT-based pipeline have been sufficiently trained, we introduce a training strategy that allows the cost volume to be removed from optical flow estimators throughout training. This leads to significantly improved inference speed and reduced memory requirements. Using our training strategy, we create three different models covering different compute budgets. Our most accurate model reaches state-of-the-art accuracy while being $1.2\times$ faster and having a $6\times$ lower memory footprint than comparable models; our fastest model is capable of processing Full HD frames at 20 FPS using only 500 MB of memory.


#77
CA2C: A Prior-Knowledge-Free Approach for Robust Label Noise Learning via Asymmetric Co-learning and Co-training

Mengmeng Sheng · Zeren Sun · Tianfei Zhou · Xiangbo Shu · Jinshan Pan · Yazhou Yao

Label noise learning (LNL), a practical challenge in real-world applications, has recently attracted significant attention. While demonstrating promising effectiveness, existing LNL approaches typically rely on various forms of prior knowledge, such as noise rates or thresholds, to sustain performance. This dependence limits their adaptability and practicality in real-world scenarios where such priors are usually unavailable. To this end, we propose a novel LNL approach, termed CA2C (Combined Asymmetric Co-learning and Co-training), which alleviates the reliance on prior knowledge through an integration of complementary learning paradigms. Specifically, we first introduce an asymmetric co-learning strategy with paradigm deconstruction. This strategy trains two models simultaneously under distinct learning paradigms, harnessing their complementary strengths to enhance robustness against noisy labels. Then, we propose an asymmetric co-training strategy with cross-guidance label generation, wherein knowledge exchange is facilitated between the twin models to mitigate error accumulation. Moreover, we design a confidence-based re-weighting approach for label disambiguation, enhancing robustness against potential disambiguation failures. Extensive experiments on synthetic and real-world noisy datasets demonstrate the effectiveness and superiority of CA2C.


#78
Highlight
Fast Globally Optimal and Geometrically Consistent 3D Shape Matching

Paul Roetzer · Florian Bernard

Geometric consistency, i.e. the preservation of neighbourhoods, is a natural and strong prior in 3D shape matching. Geometrically consistent matchings are crucial for many downstream applications, such as texture transfer or statistical shape modelling. Yet, in practice, geometric consistency is often overlooked, or only achieved under severely limiting assumptions (e.g. a good initialisation). In this work, we propose a novel formalism for computing globally optimal and geometrically consistent matchings between 3D shapes which is scalable in practice. Our key idea is to represent the surface of the source shape as a collection of cyclic paths, which are then consistently matched to the target shape. Mathematically, we construct a hyper product graph (between source and target shape), and then cast 3D shape matching as a minimum-cost circulation flow problem in this hyper graph, which yields global geometrically consistent matchings between both shapes. We empirically show that our formalism is efficiently solvable and that it leads to high-quality results.

While foundation models (FMs) pre-trained on large-scale data exhibit good zero-shot performance in many downstream tasks, there is often scope for performance improvement via task-specific adaptation of the FM. However, the data required for this adaptation is typically spread across multiple entities (data owners) and cannot be collated at a central location due to privacy concerns. At the same time, a learning service provider (LSP) who owns the FM cannot share the model with data owners for proprietary reasons. In this work, we propose the BlindFed framework, which enables multiple data owners to collaboratively adapt an FM (owned by an LSP) for a specific downstream task while preserving the interests of both the data owners and the LSP. Specifically, data owners see neither the FM nor each other's data, and the LSP does not see sensitive task-specific data. The BlindFed framework relies on fully homomorphic encryption (FHE) and consists of three key innovations: (i) We introduce FHE-friendly architectural modifications of the given FM, leveraging existing tools such as polynomial approximations and low-rank parallel adapters. (ii) We propose a two-stage split learning process, where FHE-friendly FM blocks are learned through offline knowledge distillation and task-specific local parallel adapters are learned via online encrypted inference without backpropagation through the FM. (iii) Since local adapter learning requires the LSP to share intermediate representations with the data owners, we propose a privacy-boosting scheme based on sample permutations within a batch and stochastic block sampling to prevent data owners from learning the FM through model extraction attacks. Empirical results on four image classification datasets demonstrate the practical feasibility of the BlindFed framework, albeit at a high communication cost and large computational complexity for the LSP.


#80
Customizing Domain Adapters for Domain Generalization

Yuyang Ji · Zeyi Huang · Haohan Wang · Yong Jae Lee

In this paper, we study domain generalization, where the goal is to develop models that can effectively generalize from multiple source domains to unseen target domains. Different from traditional approaches that aim to create a single, style-invariant model, we propose a new "Customized Domain Adapters" method, named CDA. This method leverages parameter-efficient adapters to construct a model with domain-specific components, each component focusing on learning from its respective domain. We focus on integrating the unique strengths of different adapter architectures, such as ViT and CNN, to create a model adept at handling the distinct statistical properties of each domain. Our experimental results on standard domain generalization datasets demonstrate the superiority of our method over traditional approaches, showcasing its enhanced adaptability and robustness in domain generalization tasks.


#81
Highlight
Towards Scalable Spatial Intelligence via 2D-to-3D Data Lifting

Xingyu Miao · Haoran Duan · Quanhao Qian · Jiuniu Wang · Yang Long · Ling Shao · Deli Zhao · Ran Xu · Gongjie Zhang

Spatial intelligence is emerging as a transformative frontier in AI, yet it remains constrained by the scarcity of large-scale 3D datasets. Unlike the abundant 2D imagery, acquiring 3D data typically requires specialized sensors and laborious annotation. In this work, we present a scalable pipeline that converts single-view images into comprehensive, scale- and appearance-realistic 3D representations — including point clouds, camera poses, depth maps, and pseudo-RGBD — via integrated depth estimation, camera calibration, and scale calibration. Our method bridges the gap between the vast repository of imagery and the increasing demand for spatial scene understanding. By automatically generating authentic, scale-aware 3D data from images, we significantly reduce data collection costs and open new avenues for advancing spatial intelligence. We release multiple generated spatial datasets, i.e., COCO-3D and Objects365-v2-3D, and demonstrate through extensive experiments that our generated data can benefit various spatial tasks, ranging from basic perception to MLLM-based reasoning. These results validate our pipeline as an effective solution for developing AI systems capable of perceiving, understanding, and interacting with physical environments.


#82
Highlight
SRefiner: Soft-Braid Attention for Multi-Agent Trajectory Refinement

Liwen Xiao · Zhiyu Pan · Zhicheng Wang · Zhiguo Cao · Wei Li

Accurate prediction of multi-agent future trajectories is crucial for autonomous driving systems to make safe and efficient decisions. Trajectory refinement has emerged as a key strategy to enhance prediction accuracy. However, existing refinement methods often overlook the topological relationships between trajectories, which are vital for improving prediction precision. Inspired by braid theory, we propose a novel trajectory refinement approach, Soft-Braid Refiner (SRefiner), guided by the soft-braid topological structure of trajectories using Soft-Braid Attention. Soft-Braid Attention captures spatio-temporal topological relationships between trajectories by considering both spatial proximity and vehicle motion states at "soft intersection points". Additionally, we extend this approach to model interactions between trajectories and lanes, further improving the prediction accuracy. SRefiner is a multi-iteration, multi-agent framework that iteratively refines trajectories, incorporating topological information to enhance interactions within traffic scenarios. SRefiner achieves significant performance improvements over four baseline methods across two datasets, establishing a new state-of-the-art in trajectory refinement.


#83
Highlight
Underwater Visual SLAM with Depth Uncertainty and Medium Modeling

Rui Liu · Sheng Fan · Wenguan Wang · Yi Yang

Underwater visual simultaneous localization and mapping (SLAM) faces critical challenges in light attenuation and degraded geometric consistency. Despite recent advances of visual SLAM in indoor and urban scenes, these approaches typically assume a clear medium and neglect medium-light interactions, leading to performance degradation in underwater environments. To overcome these limitations, we propose DUV-SLAM, a dense underwater visual SLAM framework that integrates uncertainty-aware geometry estimation with physics-inspired neural scattering modeling. Our method introduces two core innovations: i) depth uncertainty quantification derived from differentiable bundle adjustment, which propagates geometric confidence to guide mapping optimization; and ii) a neural-Gaussian hybrid representation that combines adaptive 3D Gaussians for underwater reconstruction with a neural field capturing wavelength-dependent medium properties, optimized using a combination of photometric, geometric, and distribution losses. Experiments on synthetic and real-world datasets demonstrate that DUV-SLAM achieves high-quality monocular reconstruction while maintaining real-time efficiency and robust tracking accuracy. Our code will be released.


#84
Aligning Vision to Language: Annotation-Free Multimodal Knowledge Graph Construction for Enhanced LLMs Reasoning

Junming Liu · Siyuan Meng · Yanting Gao · Song Mao · Pinlong Cai · Guohang Yan · Yirong Chen · Zilin Bian · DING WANG · Botian Shi

Multimodal reasoning in Large Language Models (LLMs) struggles with incomplete knowledge and hallucination artifacts, challenges that textual Knowledge Graphs (KGs) only partially mitigate due to their modality isolation. While Multimodal Knowledge Graphs (MMKGs) promise enhanced cross-modal understanding, their practical construction is impeded by the semantic narrowness of manual text annotations and inherent noise in visual-semantic entity linkages. In this paper, we propose Vision-align-to-Language integrated Knowledge Graph (VaLiK), a novel approach for constructing MMKGs that enhances LLM reasoning through cross-modal information supplementation. Specifically, we cascade pre-trained Vision-Language Models (VLMs) to align image features with text, transforming them into descriptions that encapsulate image-specific information. Furthermore, we develop a cross-modal similarity verification mechanism to quantify semantic consistency, effectively filtering out noise introduced during feature alignment. Even without manually annotated image captions, the refined descriptions alone suffice to construct the MMKG. Compared to conventional MMKG construction paradigms, our approach achieves substantial storage efficiency gains while maintaining direct entity-to-image linkage capability. Experimental results on multimodal reasoning tasks demonstrate that LLMs augmented with VaLiK outperform previous state-of-the-art models.
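One simple way to realize a cross-modal similarity check of the kind described above is to keep only those generated descriptions whose text embedding is close enough to the image embedding. The sketch below uses cosine similarity with a fixed threshold; the embeddings, the threshold value, and the function names are hypothetical, and the abstract does not say that VaLiK uses this exact rule.

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors; 0.0 if either is all zeros.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def filter_descriptions(image_emb, desc_embs, threshold=0.5):
    # Keep descriptions whose embedding is at least `threshold`-similar to the image.
    return [text for text, emb in desc_embs.items() if cosine(image_emb, emb) >= threshold]

# Made-up 2-D embeddings standing in for encoder outputs.
image_embedding = [1.0, 0.0]
description_embeddings = {
    "a dog on grass": [0.9, 0.1],
    "stock market report": [0.0, 1.0],
}

kept = filter_descriptions(image_embedding, description_embeddings)
```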


#85
Highlight
AIM: Amending Inherent Interpretability via Self-Supervised Masking

Eyad Alshami · Shashank Agnihotri · Bernt Schiele · Margret Keuper

It has been observed that deep neural networks (DNNs) often use both genuine as well as spurious features. In this work, we propose "Amending Inherent Interpretability via Self-Supervised Masking" (AIM), a simple yet surprisingly effective method that promotes the network’s utilization of genuine features over spurious alternatives without requiring additional annotations. In particular, AIM uses features at multiple encoding stages to guide a self-supervised, sample-specific feature-masking process. As a result, AIM allows training well-performing and inherently interpretable models that faithfully summarize the decision process. When tested on challenging datasets designed to assess reliance on spurious features and out-of-domain generalization, AIM networks demonstrate significant dual benefits: evaluations show that AIM improves interpretability, as measured by the Energy Pointing Game (EPG) score, by roughly 6-37%, while simultaneously enhancing accuracy by roughly 10-40%. These impressive performance gains are further validated on the standard in-domain CUB-200 dataset for fine-grained classification. The results provide compelling evidence supporting our hypothesis that AIM finds genuine and meaningful features that directly contribute to its improved human interpretability.


#86
Reinforcement Learning-Guided Data Selection via Redundancy Assessment

Suorong Yang · Peijia Li · Furao Shen · Jian Zhao

Modern deep architectures often rely on large-scale datasets, but training on these datasets incurs high computational and storage overhead. Real-world datasets often contain substantial redundancies, prompting the need for more data-efficient training paradigms. Data selection has shown promise to mitigate redundancy by identifying the most representative samples, thereby reducing training costs without compromising performance. Existing methods typically rely on static scoring metrics or pretrained models, overlooking the combined effect of selected samples and their evolving dynamics during training. To address this, we introduce the concept of $\epsilon$-sample cover, which quantifies sample redundancy based on inter-sample relationships, capturing the intrinsic structure of the dataset. Based on this, we reformulate data selection as a reinforcement learning (RL) process, where a lightweight RL agent optimizes the selection policy by leveraging $\epsilon$-sample cover derived from evolving dataset distribution as a reward signal. Extensive experiments across benchmark datasets and diverse architectures demonstrate that our method consistently outperforms existing state-of-the-art baselines. Models trained with our selected datasets show enhanced generalization performance with improved training efficiency. Code will be made publicly available soon.
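The abstract does not define the $\epsilon$-sample cover, so the sketch below illustrates only the classical covering idea it appears to build on: a subset of samples such that every sample lies within distance $\epsilon$ of some selected one. The greedy rule, the Euclidean distance, and the redundancy score are assumptions for illustration, not the authors' definition or reward.

```python
import math

def greedy_epsilon_cover(points, eps):
    # Scan the samples once; a sample becomes a new center only if no
    # existing center already lies within distance eps of it.
    centers = []
    for i, p in enumerate(points):
        if all(math.dist(p, points[c]) > eps for c in centers):
            centers.append(i)
    return centers

def redundancy(points, eps):
    # Fraction of samples already "covered" by another selected sample (assumed score).
    return 1.0 - len(greedy_epsilon_cover(points, eps)) / len(points)

# Made-up 2-D samples forming three tight groups.
samples = [(0.0, 0.0), (0.1, 0.0), (1.0, 0.0), (1.05, 0.0), (2.0, 0.0)]

centers = greedy_epsilon_cover(samples, eps=0.2)
score = redundancy(samples, eps=0.2)
```

By construction every sample is within `eps` of at least one returned center, so a larger `eps` yields a smaller cover and a higher redundancy score.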


#87
Deep Incomplete Multi-view Clustering with Distribution Dual-Consistency Recovery Guidance

Jiaqi Jin · Siwei Wang · Zhibin Dong · Xihong Yang · Xinwang Liu · En Zhu · Kunlun He

Multi-view clustering leverages complementary representations from diverse sources to enhance performance. However, real-world data are often incomplete due to factors like privacy concerns and device malfunctions. A key challenge is effectively utilizing available instances to recover missing views. Existing methods frequently overlook the heterogeneity among views during recovery, leading to significant distribution discrepancies between recovered and true data. Additionally, many approaches focus on cross-view correlations, neglecting insights from intra-view reliable structure and cross-view clustering structure. To address these issues, we propose BURG, a novel method for incomplete multi-view clustering with distriBution dUal-consistency Recovery Guidance. We treat each sample as a distinct category and perform cross-view distribution transfer to predict the distribution space of missing views. To compensate for the lack of reliable category information, we design a dual-consistency guided recovery strategy that includes intra-view alignment guided by neighbor-aware consistency and cross-view alignment guided by prototypical consistency. Extensive experiments on benchmarks demonstrate the superiority of BURG in the incomplete multi-view scenario.


#88
VGGSounder: Audio-Visual Evaluations for Foundation Models

Daniil Zverev · Thaddäus Wiedemer · Ameya Prabhu · Matthias Bethge · Wieland Brendel · A. Sophia Koepke

Designing effective foundation models requires high-quality evaluation datasets. With the emergence of audio-visual foundation models, reliable assessment of their multi-modal understanding is essential. The current gold standard for evaluating audio-visual understanding is the popular classification dataset VGGSound. However, our analysis identifies several critical issues in VGGSound, including incomplete labelling, partially overlapping classes, and misaligned modalities. These flaws lead to distorted evaluations of models' true auditory and visual capabilities. To address these limitations, we introduce VGGSounder, a comprehensively re-annotated, multi-label test set extending VGGSound that is explicitly designed to accurately evaluate audio-visual foundation models. VGGSounder features detailed modality annotations, enabling precise analyses of modality-specific performance and revealing previously unnoticed model limitations. We believe VGGSounder offers a robust and reliable benchmark supporting the future development of audio-visual foundation models.


#89
EA-Vit: Efficient Adaptation for Elastic Vision Transformer

Chen Zhu · Wangbo Zhao · Huiwen Zhang · Yuhao Zhou · Weidong Tang · Shuo Wang · Zhihang Yuan · Yuzhang Shang · Xiaojiang Peng · Kai Wang · Dawei Yang

Vision Transformer (ViT) has emerged as a foundational model in computer vision, excelling in generalization and adaptation to downstream tasks. However, supporting diverse resource constraints typically requires retraining multiple, size-specific ViTs, which is both time-consuming and expensive. In this paper, we propose Efficient Elastic ViT Adaptation, a single ViT framework that encapsulates multiple submodels of varying sizes, eliminating the need for repeated adaptation. We introduce elastic configurations along four key dimensions (embedding dimension, attention heads, MLP expansion ratio, and layer depth) and a lightweight router that selects the optimal submodel under different computational budgets. Training proceeds in two stages. First, Staged Elastic Adaptation progressively introduces complexity for efficient joint training of submodels while preserving as much pre-trained knowledge as possible. Subsequently, we integrate the router to refine the model by balancing accuracy and MACs, guiding it to initially focus on a small set of promising submodels for faster convergence within the large design space. Our approach captures an exponentially large family of submodels in a single adaptation process. Extensive experiments demonstrate that, for any resource constraint, the router identifies the best submodel, delivering high performance and reduced overhead compared to previous methods.


#90
Web Artifact Attacks Disrupt Vision Language Models

Maan Qraitem · Piotr Teterwak · Kate Saenko · Bryan Plummer

Vision-language models (VLMs) (e.g., CLIP, LLaVA) are trained on large-scale, lightly curated web datasets, leading them to learn unintended correlations between semantic concepts and unrelated visual signals. These associations degrade model accuracy by causing predictions to rely on incidental patterns rather than genuine visual understanding. Prior work has weaponized these correlations as an attack vector to manipulate model predictions, such as inserting a deceiving class text onto the image in a typographic attack. These attacks succeed due to VLMs' text-heavy bias, a result of captions that echo visible words rather than describing content. However, this attack has focused solely on text that matches the target class exactly, overlooking a broader range of correlations, including non-matching text and graphical symbols, which arise from the abundance of branding content in web-scale data. To address this gap, we introduce artifact-based attacks: a novel class of manipulations that mislead models using both non-matching text and graphical elements. Unlike typographic attacks, these artifacts are not predefined, making them harder to defend against but also more challenging to find. We address this by framing artifact attacks as a search problem and demonstrate their effectiveness across five datasets, with some artifacts reinforcing each other to reach 100% attack success rates. These attacks transfer across models with up to 90% effectiveness, making it possible to attack unseen models. To defend against these attacks, we extend prior work's artifact-aware prompting to the graphical setting. We see a moderate reduction of success rates of up to 15% relative to standard prompts, suggesting a promising direction for enhancing model robustness.


#91
Tensor-aggregated LoRA in Federated Fine-tuning

Zhixuan Li · Binqian Xu · Xiangbo Shu · Jiachao Zhang · Yazhou Yao · Guo-Sen Xie · Jinhui Tang

The combination of Large Language Models (LLMs) and Federated Learning (FL) to leverage privacy-preserving data has emerged as a promising approach to further enhance the Parameter-Efficient Fine-Tuning (PEFT) capabilities of LLMs. In real-world FL settings with resource heterogeneity, the training process of Low-Rank Adaptation (LoRA), the representative PEFT method, still faces two major challenges: aggregation noise and aggregation misalignment. In this paper, we propose a novel Tensor-aggregated LoRA (Te-LoRA) in Federated Fine-tuning based on an alternating-freeze training strategy to avoid aggregation noise without additional server-side computational costs, while mitigating aggregation suboptimality caused by parameter misalignment between heterogeneous LoRAs. In particular, to address the aggregation suboptimality issue, we design the Pre-Aggregation Alignment strategy (PAA-strategy) and Tensor-to-Matrix strategy (T2M-strategy) for aligning heterogeneous LoRAs and aggregating them into a united tensor, which is then decomposed into matrices adapted for client download. Extensive experiments demonstrate the effectiveness and robustness of Te-LoRA in both homogeneous and heterogeneous settings.


#92
Feature Coding in the Era of Large Models: Dataset, Test Conditions, and Benchmark

Changsheng Gao · Yifan Ma · Qiaoxi Chen · Xu yenan · Dong Liu · Weisi Lin

Large models have achieved remarkable performance across various tasks, yet they incur significant computational costs and privacy concerns during both training and inference. Distributed deployment has emerged as a potential solution, but it necessitates the exchange of intermediate information between model segments, with feature representations serving as crucial information carriers. To optimize information exchange, feature coding is required to reduce transmission and storage overhead. Despite its importance, feature coding for large models remains an under-explored area. In this paper, we draw attention to large model feature coding and make three fundamental contributions. First, we introduce a comprehensive dataset encompassing diverse features generated by three representative types of large models. Second, we establish unified test conditions, enabling standardized evaluation pipelines and fair comparisons across future feature coding studies. Third, we introduce two baseline methods derived from widely used image coding techniques and benchmark their performance on the proposed dataset. These contributions aim to provide a foundation for future research and inspire broader engagement in this field. To support a long-term study, all source code and the dataset will be made publicly available and actively maintained.


#93
Generate, Refine, and Encode: Leveraging Synthesized Novel Samples for On-the-Fly Fine-Grained Category Discovery

Xiao Liu · Nan Pu · Haiyang Zheng · Wenjing Li · Nicu Sebe · Zhun Zhong

In this paper, we investigate a practical yet challenging task: On-the-fly Category Discovery (OCD). This task focuses on the online identification of newly arriving stream data that may belong to both known and unknown categories, utilizing the category knowledge from only labeled data. Existing OCD methods are devoted to fully mining transferable knowledge from only labeled data. However, the transferability learned by these methods is limited because the knowledge contained in known categories is often insufficient, especially when few annotated data/categories are available in fine-grained recognition. To mitigate this limitation, we propose a diffusion-based OCD framework, dubbed DiffGRE, which integrates Generation, Refinement, and Encoding in a multi-stage fashion. Specifically, we first design an attribute-composition generation method based on cross-image interpolation in the diffusion latent space to synthesize novel samples. Then, we propose a diversity-driven refinement approach to select the synthesized images that differ from known categories for subsequent OCD model training. Finally, we leverage a semi-supervised leader encoding to inject additional category knowledge contained in synthesized data into the OCD models, which can benefit the discovery of both known and unknown categories during the on-the-fly inference process. Extensive experiments demonstrate the superiority of our DiffGRE over previous methods on six fine-grained datasets.

Humans perceive the world through multimodal cues to understand and interact with the environment. Learning a scene representation for multiple modalities enhances comprehension of the physical world. However, modality conflicts, arising from inherent distinctions among different modalities, present two critical challenges: property disparity and granularity disparity. To address these challenges, we propose a general framework, MMOne, to represent multiple modalities in one scene, which can be readily extended to additional modalities. Specifically, a modality modeling module with a novel modality indicator is proposed to capture the unique properties of each modality. Additionally, we design a multimodal decomposition mechanism to separate multimodal Gaussians into single-modal Gaussians based on modality differences. We address the essential distinctions among modalities by disentangling multimodal information into shared and modality-specific components, resulting in a more compact and efficient multimodal scene representation. Extensive experiments demonstrate that our method consistently enhances the representation capability for each modality and is scalable to additional modalities. The code will be released.


#95
MM-IFEngine: Towards Multimodal Instruction Following

Shengyuan Ding · Wu Shenxi · Xiangyu Zhao · Yuhang Zang · Haodong Duan · Xiaoyi Dong · Pan Zhang · Yuhang Cao · Dahua Lin · Jiaqi Wang

The Instruction Following (IF) ability measures how well Multi-modal Large Language Models (MLLMs) understand exactly what users are telling them and carry it out correctly. Existing multimodal instruction-following training data is scarce, the benchmarks are simple with atomic instructions, and the evaluation strategies are imprecise for tasks demanding exact output constraints. To address this, we present MM-IFEngine, an effective pipeline to generate high-quality image-instruction pairs. Our MM-IFEngine pipeline yields large-scale, diverse, and high-quality training data MM-IFInstruct-23k, which is suitable for Supervised Fine-Tuning (SFT) and extended as MM-IFDPO-23k for Direct Preference Optimization (DPO). We further introduce MM-IFEval, a challenging and diverse multi-modal instruction-following benchmark that includes (1) both textual constraints for output responses and visual constraints tied to the input images, and (2) a comprehensive evaluation pipeline incorporating rule-based assessment and LLM-as-a-Judge evaluation. We conduct SFT and DPO experiments and demonstrate that fine-tuning MLLMs on MM-IFInstruct-23k and MM-IFDPO-23k achieves notable gains on various IF benchmarks, such as MM-IFEval (+11.8%), MIA (+7.7%), and IFEval (+10.5%).


#96
Knowledge Distillation with Refined Logits

Wujie Sun · Defang Chen · Siwei Lyu · Genlang Chen · Chun Chen · Can Wang

Recent research on knowledge distillation has increasingly focused on logit distillation because of its simplicity, effectiveness, and versatility in model compression. In this paper, we introduce Refined Logit Distillation (RLD) to address the limitations of current logit distillation methods. Our approach is motivated by the observation that even high-performing teacher models can make incorrect predictions, creating an exacerbated divergence between the standard distillation loss and the cross-entropy loss, which can undermine the consistency of the student model's learning objectives. Previous attempts to use labels to empirically correct teacher predictions may undermine the class correlation. In contrast, our RLD employs labeling information to dynamically refine teacher logits. In this way, our method can effectively eliminate misleading information from the teacher while preserving crucial class correlations, thus enhancing the value and efficiency of distilled knowledge. Experimental results on CIFAR-100 and ImageNet demonstrate its superiority over existing methods. The code is provided in the supplementary material.
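
The conflict the abstract describes can be seen on a toy example. The sketch below is not RLD; it only implements the standard temperature-scaled logit-distillation loss alongside cross-entropy, assuming NumPy and made-up logits (`teacher`, `follows_teacher`, and `follows_label` are hypothetical values chosen for illustration). When the teacher's top class disagrees with the label, a student that minimizes one loss is penalized by the other.

```python
import numpy as np

def softmax(z, T=1.0):
    """Row-wise softmax of logits z at temperature T."""
    z = np.asarray(z, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits, T=4.0):
    """Standard logit distillation: T^2 * KL(teacher || student) at temperature T."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return float(T * T * np.sum(p * (np.log(p) - np.log(q)), axis=-1).mean())

def ce_loss(student_logits, labels):
    """Cross-entropy of the student against integer ground-truth labels."""
    q = softmax(student_logits)
    return float(-np.log(q[np.arange(len(labels)), labels]).mean())

# Hypothetical sample: the true class is 0, but the teacher prefers class 1.
teacher = np.array([[1.0, 3.0, 0.5]])
labels = np.array([0])
follows_teacher = teacher.copy()               # student copies the teacher
follows_label = np.array([[3.0, 1.0, 0.5]])    # student prefers the true class

for name, s in [("follows teacher", follows_teacher), ("follows label", follows_label)]:
    print(name, "KD:", round(kd_loss(s, teacher), 4), "CE:", round(ce_loss(s, labels), 4))
```

The student that copies the teacher has zero distillation loss but the higher cross-entropy, and the reverse holds for the student that follows the label, which is the divergence that refining teacher logits with label information aims to remove.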


#97
Highlight
CoST: Efficient Collaborative Perception From Unified Spatiotemporal Perspective

Zongheng Tang · Yi Liu · Yifan Sun · Yulu Gao · Jinyu Chen · Runsheng Xu · Si Liu

Collaborative perception shares information among different agents and helps solve problems that individual agents may face, e.g., occlusions and small sensing range. Prior methods usually separate the multi-agent fusion and multi-time fusion into two consecutive steps. In contrast, this paper proposes an efficient collaborative perception that aggregates the observations from different agents (space) and different times into a unified spatio-temporal space simultaneously. The unified spatio-temporal space brings two benefits, i.e., efficient feature transmission and superior feature fusion. 1) Efficient feature transmission: each static object yields a single observation in the spatio-temporal space, and thus requires transmission only once (whereas prior methods re-transmit all the object features multiple times). 2) Superior feature fusion: merging the multi-agent and multi-time fusion into a unified spatio-temporal aggregation enables a more holistic perspective, thereby enhancing perception performance in challenging scenarios. Consequently, our Collaborative perception with Spatio-temporal Transformer (CoST) gains improvement in both efficiency and accuracy. Notably, CoST is not tied to any specific method and is compatible with a majority of previous methods, enhancing their accuracy while reducing the transmission bandwidth.


#98
RainbowPrompt: Diversity-Enhanced Prompt-Evolving for Continual Learning

Kiseong Hong · Gyeong-Hyeon Kim · Eunwoo Kim

Prompt-based continual learning provides a rehearsal-free solution by tuning small sets of parameters while keeping pre-trained models frozen. To meet the complex demands of sequential tasks, it is crucial to integrate task-specific knowledge within prompts effectively. However, existing works rely on either fixed learned prompts (i.e., prompts whose representations remain unchanged during new task learning) or on prompts generated from an uninformative task-shared space, limiting the representational diversity of the integrated prompt. To address this issue, we propose a novel prompt-evolving mechanism to adaptively aggregate base prompts (i.e., task-specific prompts) into a unified prompt while ensuring diversity. By transforming and aligning all base prompts, both previously learned and newly introduced, our approach continuously evolves accumulated knowledge to facilitate learning new tasks. We further introduce a learnable probabilistic gate that adaptively determines which layers to activate during the evolution process. We validate our method on image classification and video action recognition tasks in class-incremental learning, achieving average gains of 9.07% and 7.40% over existing methods across all scenarios.

Prompt learning has recently attracted much attention for adapting pre-trained vision-language models (e.g., CLIP) to downstream recognition tasks. However, most of the existing CLIP-based prompt learning methods in the literature show only a limited ability to handle fine-grained datasets. To address this issue, we propose a causality-guided text prompt learning method via visual granulation for CLIP, called CaPL, where the explored visual granulation technique constructs sets of visual granules for the text prompt to capture subtle discrepancies among different fine-grained classes through causal inference. The CaPL method contains the following two modules: (1) An attribute disentanglement module is proposed to decompose visual features into non-individualized attributes (shared by some classes) and individualized attributes (specific to single classes) using a Brownian Bridge Diffusion Model; (2) A granule learning module is proposed to construct visual granules by integrating the aforementioned attributes for recognition under two causal inference strategies. Thanks to the learned visual granules, a more discriminative text prompt is expected to be learned. Extensive experimental results on 15 datasets demonstrate that our CaPL method significantly outperforms the state-of-the-art prompt learning methods, especially on fine-grained datasets.


#100
FA: Forced Prompt Learning of Vision-Language Models for Out-of-Distribution Detection

Xinhua Lu · Runhe Lai · Yanqi Wu · Kanghao Chen · Wei-Shi Zheng · Ruixuan Wang

Pre-trained vision-language models (VLMs) have advanced out-of-distribution (OOD) detection recently. However, existing CLIP-based methods often focus on learning OOD-related knowledge to improve OOD detection, showing limited generalization or reliance on external large-scale auxiliary datasets. In this study, instead of delving into the intricate OOD-related knowledge, we propose an innovative CLIP-based framework based on Forced prompt leArning (FA), designed to make full use of the In-Distribution (ID) knowledge and ultimately boost the effectiveness of OOD detection. Our key insight is to learn a prompt (i.e., forced prompt) that contains more diversified and richer descriptions of the ID classes beyond the textual semantics of class labels. Specifically, it promotes better discernment for ID images, by forcing more notable semantic similarity between ID images and the learnable forced prompt. Moreover, we introduce a forced coefficient, encouraging the forced prompt to learn more comprehensive and nuanced descriptions of the ID classes. In this way, FA is capable of achieving notable improvements in OOD detection, even when trained without any external auxiliary datasets, while maintaining an identical number of trainable parameters as CoOp. Extensive empirical evaluations confirm our method consistently outperforms current state-of-the-art methods. The codes will be released publicly.


#101
VisionMath: Vision-Form Mathematical Problem-Solving

Zongyang Ma · Yuxin Chen · Ziqi Zhang · Zhongang Qi · Chunfeng Yuan · Shaojie Zhu · Chengxiang Zhuo · Bing Li · Ye Liu · Zang Li · Ying Shan · Weiming Hu

Mathematical problems in real-world scenarios are often presented in a purely vision-form, where the textual problem statement and accompanying math figures, e.g., geometry figures and functional graphs, are integrated into a single image. This vision-form problem-solving task requires precise comprehension and reasoning on both textual and graphical elements in the images, posing a significant challenge to current Multimodal Large Language Models (MLLMs), which process text and math figures in isolation. In this work, we propose VisionMath, the first exploration of a vision-form mathematical problem-solving model, which employs a three-stage progressive multimodal reasoning alignment strategy to systematically enhance task-specific capabilities. Building upon an LLM proficient in unimodal mathematical reasoning, VisionMath first establishes foundational OCR capabilities through captioning rendered mathematical problem images. Subsequently, the model develops a comprehensive understanding of figure structures and properties via learning from figure descriptions and mathematical educational videos. Finally, the model's reasoning capacity is activated using the carefully constructed vision-form problem-solving datasets VisionMath-IT with chain-of-thought annotations. For comprehensive evaluation, we construct multilingual benchmarks covering diverse problem types, including geometry, algebra, and function problems in both English and Chinese. Our model weights, data and code will be publicly available.


#102
Scaling Inference-Time Search with Vision Value Model for Improved Visual Comprehension

Xiyao Wang · Zhengyuan Yang · Linjie Li · Hongjin Lu · Yuancheng Xu · Chung-Ching Lin · Kevin Lin · Furong Huang · Lijuan Wang

Despite significant advancements in vision-language models (VLMs), effective approaches to enhance response quality by scaling inference-time computation are still lacking. This capability is known to be a core step towards self-improving models in recent large language model studies. In this paper, we present Vision Value Model (VisVM) that can guide VLM inference-time search to generate responses with better visual comprehension. Specifically, VisVM not only evaluates the generated sentence quality in the current search step, but also anticipates the quality of subsequent sentences that may result from the current step, thus providing a long-term value. In this way, VisVM steers VLMs away from generating sentences prone to hallucinations or insufficient detail, thereby producing higher quality responses. Experimental results demonstrate that VisVM-guided search significantly enhances VLMs' ability to generate descriptive captions with richer visual details and fewer hallucinations, compared with greedy decoding and search methods with other visual reward signals. Furthermore, we find that self-training the model with the VisVM-guided captions improves VLMs' performance across a wide range of multimodal benchmarks, indicating the potential for developing self-improving VLMs.


#103
Contrastive Flow Matching

George Stoica · Vivek Ramanujan · Xiang Fan · Ali Farhadi · Ranjay Krishna · Judy Hoffman

Unconditional flow-matching trains diffusion models to efficiently transport samples from a source distribution to samples of a target distribution by enforcing that the flows between sample pairs from the source and target distributions are unique. However, in conditional settings (e.g., class-conditioned models), this uniqueness is no longer guaranteed: flows from different conditions may overlap, leading to more ambiguous generations. We introduce Contrastive Flow Matching (CFM), an extension to the flow-matching objective that explicitly enforces uniqueness across all conditional flows, enhancing condition separation. Our approach adds a contrastive objective that maximizes dissimilarities between predicted flows from arbitrary sample pairs. We validate Contrastive Flow Matching by conducting extensive experiments across varying SiT model sizes on the popular ImageNet-1k (256x256) and (512x512) benchmarks. Notably, we find that training models with CFM (1) improves training speed by a factor of up to 2x, (2) requires up to 5x fewer de-noising steps and (3) lowers FID by up to 8.9 compared to training the same models with flow-matching. We commit to releasing our code upon publication.
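
One way to read "a contrastive objective that maximizes dissimilarities between predicted flows from arbitrary sample pairs" is as a flow-matching loss with a subtracted term for a mismatched pair. The sketch below is an assumption-laden illustration in NumPy, not the authors' implementation: the straight-line target `x1 - x0`, the weight `lam`, and the choice of negatives (each sample is paired with its neighbour's target via `np.roll`) are all hypothetical.

```python
import numpy as np

def flow_matching_loss(v_pred, x0, x1):
    """Mean squared error between predicted velocity and the straight-line target x1 - x0."""
    target = x1 - x0
    return float(np.mean(np.sum((v_pred - target) ** 2, axis=1)))

def contrastive_flow_matching_loss(v_pred, x0, x1, lam=0.05):
    """Flow-matching loss minus lam times the distance to another pair's target.

    Assumption: negatives are taken by shifting the batch by one position,
    so sample i is pushed away from the target velocity of sample i-1.
    """
    target = x1 - x0
    negative = np.roll(target, shift=1, axis=0)
    positive_term = np.mean(np.sum((v_pred - target) ** 2, axis=1))
    negative_term = np.mean(np.sum((v_pred - negative) ** 2, axis=1))
    return float(positive_term - lam * negative_term)

# Toy batch of two 2-D samples with distinct target velocities.
x0 = np.zeros((2, 2))
x1 = np.array([[1.0, 0.0], [0.0, 1.0]])
v_perfect = x1 - x0
print(flow_matching_loss(v_perfect, x0, x1))
print(contrastive_flow_matching_loss(v_perfect, x0, x1))
```

With `lam = 0` the function reduces to plain flow matching; a positive `lam` additionally rewards predictions that stay far from other pairs' flows, which is the separation effect the abstract describes.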

Deep learning models rely on large-scale labeled datasets, but collecting such data is expensive and time-consuming. Semi-supervised learning (SSL) mitigates this issue by learning from a small set of labeled samples along with a large pool of unlabeled data. However, existing SSL methods struggle with fine-grained classification when dealing with visually similar classes, as they rely solely on visual features and ignore the semantic information within label names. This paper introduces SemiVisBooster, an SSL enhancement approach that utilizes semantic information from label names to guide visual feature learning, addressing the challenges of fine-grained classification. By aligning text embeddings from label names with visual features, our method helps the model capture subtle visual distinctions that purely visual representations may overlook. To enhance robustness, we propose two key components: (1) text embedding de-similarity (TEDS) to reduce confusion caused by similar text embeddings across different class names, and (2) class-aware visual-text alignment loss to accurately define positive and negative pairs during visual-text alignment. Our method achieves state-of-the-art performance on the latest SSL benchmarks. Additionally, on the challenging Food-101 dataset, which contains many visually similar classes and uses only 404 labeled images, our approach improves performance by approximately 13.6% over the second-best method. Code is available at https://anonymous.4open.science/r/ICCV6983-SemiVisBooster


#105
Dataset Distillation via the Wasserstein Metric

Haoyang Liu · Peiran Wang · Yijiang Li · Tiancheng Xing · Vibhu Dalal · Luwei LI · Jingrui He · Haohan Wang

Dataset Distillation (DD) aims to generate a compact synthetic dataset that enables models to achieve performance comparable to training on the full large dataset, significantly reducing computational costs. Drawing from optimal transport theory, we introduce WMDD (Dataset Distillation with Wasserstein Metric-based Feature Matching), a straightforward yet powerful method that employs the Wasserstein metric to enhance distribution matching. We compute the Wasserstein barycenter of features from a pretrained classifier to capture essential characteristics of the original data distribution. By optimizing synthetic data to align with this barycenter in feature space and leveraging per-class BatchNorm statistics to preserve intra-class variations, WMDD maintains the efficiency of distribution matching approaches while achieving state-of-the-art results across various high-resolution datasets. Our extensive experiments demonstrate WMDD's effectiveness and adaptability, highlighting its potential for advancing machine learning applications at scale.
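
For intuition about what a Wasserstein barycenter is, the one-dimensional case has a closed form: for empirical distributions with the same number of equally weighted samples, the 2-Wasserstein barycenter is obtained by averaging the sorted samples. The sketch below shows only this 1-D special case in NumPy; WMDD operates on high-dimensional classifier features, where no such shortcut applies, and the function names here are illustrative.

```python
import numpy as np

def w2_distance_1d(a, b):
    """2-Wasserstein distance between two 1-D empirical distributions of equal size."""
    a, b = np.sort(np.asarray(a, float)), np.sort(np.asarray(b, float))
    return float(np.sqrt(np.mean((a - b) ** 2)))

def w2_barycenter_1d(samples, weights=None):
    """Barycenter of K equal-size 1-D sample sets: weighted average of sorted samples.

    samples: array of shape (K, n); weights: K non-negative numbers summing to 1.
    """
    sorted_samples = np.sort(np.asarray(samples, float), axis=1)
    k = sorted_samples.shape[0]
    w = np.full(k, 1.0 / k) if weights is None else np.asarray(weights, float)
    return w @ sorted_samples

# Two toy sample sets; their barycenter sits halfway between them.
sets = np.array([[2.0, 0.0, 1.0], [4.0, 6.0, 5.0]])
center = w2_barycenter_1d(sets)
print(center, w2_distance_1d(center, sets[0]), w2_distance_1d(center, sets[1]))
```

Unlike a pointwise mean of the two densities, the barycenter keeps the shape of the inputs and only shifts it, which is why it is a natural summary of a feature distribution.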


#106
Membership Inference Attacks with False Discovery Rate Control

Chenxu Zhao · Wei Qian · Aobo Chen · Mengdi Huai

Recent studies have shown that deep learning models are vulnerable to membership inference attacks (MIAs), which aim to infer whether a data record was used to train a target model or not. To analyze and study these vulnerabilities, various MIA methods have been proposed. Despite the significance and popularity of MIAs, existing works on MIAs are limited in providing guarantees on the false discovery rate (FDR), which refers to the expected proportion of false discoveries among the identified positive discoveries. However, it is very challenging to ensure the false discovery rate guarantees, because the underlying distribution is usually unknown, and the estimated non-member probabilities often exhibit interdependence. To tackle the above challenges, in this paper, we design a novel membership inference attack method, which can provide the guarantees on the false discovery rate. Additionally, we show that our method can also provide the marginal probability guarantee on labeling true non-member data as member data. Notably, our method can work as a wrapper that can be seamlessly integrated with existing MIA methods in a post-hoc manner, while also providing the FDR control. We perform the theoretical analysis for our method. Extensive experiments in various settings (e.g., the black-box setting and the lifelong learning setting) are also conducted to verify the desirable performance of our method. The source code is available in the supplementary material.
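
As background on FDR control, the classical Benjamini-Hochberg step-up procedure is sketched below in NumPy. It is not the method proposed in the abstract. The mapping to membership inference is an assumption made here for illustration: each p-value is taken to test the null hypothesis "this record is a non-member", so a rejection means labeling the record as a member. Benjamini-Hochberg is guaranteed under independence (or positive dependence) of the p-values, which is the kind of assumption the abstract says is hard to satisfy in this setting.

```python
import numpy as np

def benjamini_hochberg(p_values, alpha=0.1):
    """Return a boolean mask of rejected hypotheses at target FDR level alpha."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)                          # indices of p-values, ascending
    thresholds = alpha * np.arange(1, m + 1) / m   # i * alpha / m for rank i
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])           # largest rank meeting its threshold
        reject[order[: k + 1]] = True              # step-up: reject all smaller p-values too
    return reject

# Hypothetical p-values for eight candidate records.
p_vals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
members = benjamini_hochberg(p_vals, alpha=0.05)
print(members)
```

Because the procedure only needs one p-value per record, it can be applied after any scoring-based attack, which mirrors the post-hoc wrapper role described in the abstract.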


#107
Acknowledging Focus Ambiguity in Visual Questions

Chongyan Chen · Yu-Yun Tseng · Zhuoheng Li · Anush Venkatesh · Danna Gurari

No existing work on visual question answering explicitly acknowledges there can be ambiguity regarding where the content described in the question is located in the image. To fill this gap, we introduce VQ-FocusAmbiguity, the first VQA dataset that visually grounds each region described in the question that is necessary to arrive at the answer. We next analyze and compare our dataset to existing datasets to reveal its unique properties. Finally, we benchmark modern models for two novel tasks related to acknowledging focus ambiguity: recognizing whether a visual question has focus ambiguity and locating all plausible focus regions within the image. Results show that the dataset is challenging for modern models. To facilitate future progress on these tasks, we publicly share the dataset with an evaluation server at https://placeholder.github.io/.


#108
A Good Teacher Adapts Their Knowledge for Distillation

Chengyao Qian · Trung Le · Mehrtash Harandi

Knowledge distillation (KD) is an effective method for enhancing a small model, named student, by training it under the supervision of larger teacher models. However, existing studies indicate that a substantial capacity gap between the student and teacher can lead to poor learning for the student model. This capacity gap problem limits the applicability of KD and necessitates careful selection of the teacher's size. Despite its importance, the underlying cause of the capacity gap problem remains underexplored. In this paper, we reveal that a substantial disparity in the output distributions of teacher and student models is a key factor behind this issue. To demonstrate this, we decompose the KD loss into two components: class-wise similarity and inner-class distribution, and analyze the contribution of each term. Our analysis shows that a large distributional mismatch can lead to poor student learning. Inspired by this observation, we propose the Adapted Inner-class Distribution (AID) method, wherein the teacher model is fine-tuned to optimize its inner-class distribution to better align with the student's capacity prior to knowledge distillation. This approach effectively bridges the capacity gap between teacher and student models and consistently achieves state-of-the-art performance across a diverse range of architectures.


#109
Highlight
Evading Data Provenance in Deep Neural Networks

Hongyu Zhu · Sichu Liang · Wenwen Wang · Zhuomeng Zhang · Fangqi Li · Shi-Lin Wang

Modern over-parameterized deep models are highly data-dependent, with large-scale general-purpose and domain-specific datasets serving as the bedrock for rapid advancements. However, many datasets are proprietary or contain sensitive information, making unrestricted model training problematic. In the open world where data theft cannot be fully prevented, Dataset Ownership Verification (DOV) has emerged as a promising method to protect copyright by detecting unauthorized model training and tracing illicit activities. Due to its diversity and superior stealth, evading DOV is considered extremely challenging. However, this paper identifies that previous studies have relied on oversimplistic evasion attacks for evaluation, leading to a false sense of security. We introduce a unified evasion framework, in which a teacher model first learns from the copyright dataset and then transfers task-relevant yet identifier-independent domain knowledge to a surrogate student using an out-of-distribution (OOD) dataset as the intermediary. Leveraging Vision-Language Models and Large Language Models, we curate the most informative and reliable subsets from the OOD gallery set as the final transfer set, and propose selectively transferring task-oriented knowledge to achieve a better trade-off between generalization and evasion effectiveness. Experiments across diverse datasets covering eleven DOV methods demonstrate that our approach simultaneously eliminates all copyright identifiers and significantly outperforms nine state-of-the-art evasion attacks in both generalization and effectiveness, with moderate computational overhead. As a proof of concept, we reveal key vulnerabilities in current DOV methods, highlighting the need for long-term development to enhance practicality.


#110
Boosting Adversarial Transferability via Residual Perturbation Attack

Jinjia Peng · Zeze Tao · Huibing Wang · Meng Wang · Yang Wang

Deep neural networks are susceptible to adversarial examples, which can lead to incorrect predictions by introducing imperceptible perturbations. Transfer-based attacks create adversarial examples for surrogate models and transfer these examples to victim models deployed in black-box scenarios. Recent studies reveal that adversarial examples in flat loss landscapes can alleviate overfitting on surrogate models and exhibit superior transferability. However, these works ignore the influence of perturbation directions, resulting in limited transferability. To overcome this limitation, this paper proposes a new attack method named Residual Perturbation Attack (ResPA), which employs the residual gradient as the perturbation direction to guide the adversarial examples to search toward the flat regions of the loss function. Specifically, ResPA conducts an exponential moving average operation on the input gradients to obtain the first moment as the referenced gradient, which encompasses the direction information of historical gradients. Moreover, to avoid over-relying on the local flatness, instead of directly using the current gradient as the perturbation direction, ResPA further considers the residual between the current gradient and the referenced gradient, which can capture the changes in the global perturbation direction. Comprehensive experimental comparisons show that ResPA can remarkably enhance adversarial transferability. In addition, ResPA can be naturally combined with existing input transformations to further improve transferability.
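
The update described in this abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name, the step schedule `eps / steps`, the decay `beta`, the sign step, and the choice to form the residual after updating the moving average are all assumptions made for illustration, and the quadratic toy loss stands in for a real surrogate model.

```python
import numpy as np

def residual_perturbation_attack(x, grad_fn, eps=0.1, steps=10, beta=0.9):
    """Toy iterative attack that steps along the residual between the
    current gradient and an EMA of past gradients (assumed update rule)."""
    alpha = eps / steps               # per-step size (assumed schedule)
    x_adv = x.copy()
    ema = np.zeros_like(x)            # referenced gradient (first moment)
    for _ in range(steps):
        g = grad_fn(x_adv)            # current input gradient
        ema = beta * ema + (1.0 - beta) * g
        residual = g - ema            # change relative to gradient history
        x_adv = x_adv + alpha * np.sign(residual)
        x_adv = np.clip(x_adv, x - eps, x + eps)  # stay in the L-inf ball
    return x_adv

# Toy loss L(x) = 0.5 * ||x - target||^2, whose gradient is x - target.
target = np.array([1.0, -2.0, 0.5])
x0 = np.zeros(3)
x_adv = residual_perturbation_attack(x0, lambda x: x - target)
```

The clipping step keeps the perturbation inside the budget `eps` regardless of how the residual direction changes between iterations.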


#111
MAVias: Mitigate any Visual Bias

Ioannis Sarridis · Christos Koutlis · Symeon Papadopoulos · Christos Diou

Mitigating biases in computer vision models is an essential step towards the trustworthiness of artificial intelligence models. Existing bias mitigation methods focus on a small set of predefined biases, limiting their applicability in visual datasets where multiple, possibly unknown biases exist. To address this limitation, we introduce MAVias, an open-set bias mitigation approach leveraging foundation models to discover spurious associations between visual attributes and target classes. MAVias first captures a wide variety of visual features in natural language via a foundation image tagging model, and then leverages a large language model to select those visual features defining the target class, resulting in a set of language-coded potential visual biases. We then translate this set of potential biases into vision-language embeddings and introduce an in-processing bias mitigation approach to prevent the model from encoding information related to them. Our experiments on diverse datasets, including CelebA, Waterbirds, ImageNet, and UrbanCars, show that MAVias effectively detects and mitigates a wide range of biases in visual recognition tasks, outperforming the current state of the art.

Open-set fine-grained recognition (OSFGR) is the core exploration of building open-world intelligent systems. The challenge lies in the gradual semantic drift during the transition from coarse-grained to fine-grained categories. However, although existing methods leverage hierarchical representations to assist progressive reasoning, they neglect semantic consistency across hierarchies. To address this, we propose a multimodal progressive bidirectional reasoning framework: (1) In forward reasoning, the model progressively refines visual features to capture hierarchical structural representations, while (2) in backward reasoning, variational inference integrates multimodal information to constrain consistency in category-aware latent spaces. This mechanism mitigates semantic drift through bidirectional information flow and cross-hierarchical feature consistency constraints. Extensive experiments on the iNat2021-OSR dataset, the largest open-set fine-grained dataset with over 600K images, demonstrate that our proposed method achieves superior performance over the state-of-the-art methods.


#113
Towards Higher Effective Rank in Parameter-Efficient Fine-tuning using Khatri-Rao Product

Paul Albert · Frederic Zhang · Hemanth Saratchandran · Anton Hengel · Ehsan Abbasnejad

Parameter-efficient fine-tuning (PEFT) has become a standard for adapting large pre-trained models. While low-rank adaptation (LoRA) has achieved notable success, recent studies highlight its limitations when compared to full-rank variants, particularly when scaling to demanding tasks such as vision-language classification or common-sense reasoning. We propose to quantitatively compare full and rank-restricted PEFT methods using a spectrum-controlled matrix approximation benchmark. Our results validate LoRA's rank limitations when approximating matrices presenting highly decorrelated or high-frequency features. We further show that full-rank methods can reduce LoRA's approximation error on these matrix types for an equal parameter count. Our evaluation then extends beyond synthetic tasks, where we observe that LoRA's restricted work subspace can produce high-norm updates, leading to over-fitting and poor out-of-distribution generalization. We address these limits by introducing KRAdapter, a novel PEFT algorithm that uses properties of the Khatri-Rao matrix product to produce weight matrices of higher effective rank and lower norm than related PEFT algorithms. We show the performance improvements of KRAdapter on vision-language models up to 1B parameters and LLMs up to 8B parameters, where we report accuracy improvements of 20 to 25 points over LoRA when reasoning on commonsense tasks unseen during training. Crucially, KRAdapter maintains the favorable training speed and memory efficiency of LoRA, making it a practical and robust alternative to fine-tune billion-scale parameter models. Code for reproducing toy experiments is available in the supplementary and will be released upon acceptance.
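
As a rough illustration of why a Khatri-Rao product can reach a higher rank than a low-rank factorization with the same parameter count, the following NumPy sketch compares the two on a toy 16 x 16 update. The helper `khatri_rao`, the factor shapes, and the use of plain numerical rank (rather than the effective rank discussed in the abstract) are assumptions for illustration; this is not the KRAdapter implementation.

```python
import numpy as np

def khatri_rao(a, b):
    """Column-wise Khatri-Rao product: column k is kron(a[:, k], b[:, k])."""
    p, n = a.shape
    q, n_b = b.shape
    assert n == n_b, "both factors need the same number of columns"
    return np.einsum("ik,jk->ijk", a, b).reshape(p * q, n)

rng = np.random.default_rng(0)
n = 16                               # toy layer: 16 x 16 weight update
u = rng.standard_normal((4, n))      # 64 parameters
v = rng.standard_normal((4, n))      # 64 parameters
delta_kr = khatri_rao(u, v)          # shape (16, 16) built from 128 parameters

# Low-rank update with the same 128 parameters (rank at most 4).
b_lr = rng.standard_normal((n, 4))
a_lr = rng.standard_normal((4, n))
delta_lr = b_lr @ a_lr

rank_kr = np.linalg.matrix_rank(delta_kr)
rank_lr = np.linalg.matrix_rank(delta_lr)
print(rank_kr, rank_lr)
```

The low-rank product is capped at rank 4 by construction, whereas the columns of the Khatri-Rao product are generically linearly independent, so its rank is not bounded by the inner dimension of the factors.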


#114
Revisiting Pool-based Prompt Learning for Few-shot Class-incremental Learning

Yongwei Jiang · Yixiong Zou · Yuhua Li · Ruixuan Li

Few-Shot Class-Incremental Learning (FSCIL) faces dual challenges of data scarcity and incremental learning in real-world scenarios. While pool-based prompting methods have demonstrated success in traditional incremental learning, their effectiveness in FSCIL settings remains unexplored. This paper presents the first study of current prompt pool methods in FSCIL tasks, revealing an unanticipated performance degradation in incremental sessions. Through comprehensive analysis, we identify that this phenomenon stems from token-dimension saturation: with limited data, excessive prompts compete for task-relevant information, leading to model overfitting. Based on this finding, we propose LGSP-Prompt (Local-Global Spatial Prompting), which innovatively shifts pool-based prompt learning from the token dimension to the spatial dimension. LGSP-Prompt generates spatial prompts by synergistically combining local spatial features and global frequency-domain representations to highlight key patterns in input images. We construct two spatial prompt pools enabling dynamic prompt selection to maintain acquired knowledge while effectively learning novel sessions. Extensive experiments demonstrate that our approach achieves state-of-the-art performance across multiple FSCIL benchmarks, showing significant advantages in both base knowledge preservation and incremental learning. Our codes will be released.


#115
Federated Representation Angle Learning

Liping Yi · Han Yu · Gang Wang · xiaoguang Liu · Xiaoxiao Li

Model-heterogeneous federated learning (MHFL) is a challenging FL paradigm designed to allow FL clients to train structurally heterogeneous models under the coordination of an FL server. Existing MHFL methods face significant limitations when it comes to transferring global knowledge to clients as a result of sharing only partial homogeneous model parameters or calculating distance loss, leading to inferior model generalization. To bridge this gap, we propose a novel model-heterogeneous Federated learning method with Representation Angle Learning (FedRAL). It consists of three innovative designs: (1) We first introduce representation angle learning into MHFL. Specifically, we embed a homogeneous square matrix into the local heterogeneous model of each client, which learns the angle information of local representations. These homogeneous representation angle square matrices are aggregated on the server to fuse representation angle knowledge shared by clients for enhancing the generalization of local representations. (2) As different clients might have heterogeneous system resources, we propose an adaptive diagonal sparsification strategy to reduce the number of parameters of the representation angle square matrices uploaded to the server, improving FL communication efficiency. (3) To enable the effective fusion of sparsified homogeneous local representation angle square matrices, we design an element-wise weighted aggregation approach. Experiments on 4 benchmark datasets under 2 types of non-IID divisions over 6 state-of-the-art baselines demonstrate that FedRAL achieves the best performance. It improves test accuracy, communication efficiency and computational efficiency by up to 5.03%, 12.43× and 6.49×, respectively.
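
One way to picture designs (2) and (3) is the sketch below, in which each client uploads only a band of diagonals of its square matrix and the server averages every entry over the clients that actually sent it. The band-shaped mask, the per-client scalar weights, and both function names are hypothetical choices for illustration; the abstract does not specify how FedRAL sparsifies or weights the matrices.

```python
import numpy as np

def diagonal_band_mask(d, keep):
    """Binary mask keeping entries within `keep` diagonals of the main one."""
    idx = np.arange(d)
    return (np.abs(idx[:, None] - idx[None, :]) <= keep).astype(float)

def elementwise_weighted_average(matrices, masks, weights):
    """Average each entry over the clients that actually uploaded it."""
    num = np.zeros_like(matrices[0])
    den = np.zeros_like(matrices[0])
    for m, mask, w in zip(matrices, masks, weights):
        num += w * mask * m
        den += w * mask
    # Entries that no client uploaded stay at zero.
    return np.divide(num, den, out=np.zeros_like(num), where=den > 0)

# Client 0 uploads only its main diagonal; client 1 uploads the full matrix.
mats = [np.ones((3, 3)), 3.0 * np.ones((3, 3))]
masks = [diagonal_band_mask(3, 0), diagonal_band_mask(3, 2)]
agg = elementwise_weighted_average(mats, masks, weights=[1.0, 3.0])
```

Normalizing per entry rather than per matrix keeps a heavily sparsified client from diluting the entries it did not upload.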


#116
Federated Continual Instruction Tuning

Haiyang Guo · Fanhu Zeng · Fei Zhu · Wenzhuo Liu · Da-Han Wang · Jian Xu · Xu-Yao Zhang · Cheng-Lin Liu

A vast amount of instruction tuning data is crucial for the impressive performance of Large Multimodal Models (LMMs), but the associated computational costs and data collection demands during supervised fine-tuning make it impractical for most researchers. Federated learning (FL) has the potential to leverage all distributed data and training resources to reduce the overhead of joint training. However, most existing methods assume a fixed number of tasks, while in real-world scenarios, clients continuously encounter new knowledge and often struggle to retain old tasks due to memory constraints. In this work, we introduce the Federated Continual Instruction Tuning (FCIT) benchmark to model this real-world challenge. Our benchmark includes two realistic scenarios, encompassing four different settings and twelve carefully curated instruction tuning datasets. To address the challenges posed by FCIT, we propose dynamic knowledge organization to effectively integrate updates from different tasks during training and subspace selective activation to allocate task-specific output during inference. Extensive experimental results demonstrate that our proposed method significantly enhances model performance across varying levels of data heterogeneity and catastrophic forgetting. Our source code and dataset will be made publicly available.

This work introduces Multimodal Context (MiCo), a scalable pretraining framework designed to advance omni-modal intelligence—an AI system capable of understanding and learning from multiple modalities to achieve universal representation learning. MiCo allows for efficient scaling of both the number of modalities and the volume of data, along with model parameters, during the pretraining phase. We evaluate the pretrained models across a diverse set of tasks, including: (i) single-modality perception benchmarks covering 10 distinct modalities, (ii) 25 cross-modal tasks spanning retrieval, question-answering, and captioning, and (iii) 18 large-scale multimodal language model benchmarks. MiCo consistently delivers state-of-the-art results, setting 37 new benchmarks across these tasks. The pretrained models, along with the collected datasets and codebase, will be made publicly available to support the development of omni-modal intelligence and broader research in multimodal learning.

Multi-label learning is a challenging computer vision task that requires assigning multiple categories to each image. However, fully annotating large-scale datasets is often impractical due to high costs and effort, motivating the study of learning from partially annotated data. In the extreme case of Single Positive Multi-Label Learning (SPML), each image is provided with only one positive label, while all other labels remain unannotated. Traditional SPML methods that treat missing labels as unknown or negative tend to yield inaccuracies and false negatives, and integrating various pseudo-labeling strategies can introduce additional noise. To address these challenges, we propose the Generalized Pseudo-Label Robust Loss (GPR Loss), a novel loss function that effectively learns from diverse pseudo-labels while mitigating noise. Complementing this, we introduce a simple yet effective Dynamic Augmented Multi-focus Pseudo-labeling (DAMP) technique. Together, these contributions form the Adaptive and Efficient Vision-Language Pseudo-Labeling (AEVLP) framework. Extensive experiments on four benchmark datasets demonstrate that our framework significantly advances multi-label classification, achieving state-of-the-art results.


#119
TWIST & SCOUT: Grounding Multimodal LLM-Experts by Forget-Free Tuning

Aritra Bhowmik · Mohammad Mahdi Derakhshani · Dennis Koelma · Yuki Asano · Martin R. Oswald · Cees Snoek

Spatial awareness is key to enabling embodied multimodal AI systems. Yet, without vast amounts of spatial supervision, current Multimodal Large Language Models (MLLMs) struggle at this task. In this paper, we introduce TWIST & SCOUT, a framework that equips pre-trained MLLMs with visual grounding ability without forgetting their existing image and language understanding skills. To this end, we propose TWIST, a twin-expert stepwise tuning module that modifies the decoder of the language model using one frozen module pre-trained on image understanding tasks and another learnable one for visual grounding tasks. This allows the MLLM to retain previously learned knowledge and skills, while acquiring what is missing. To fine-tune the model effectively, we generate a high-quality synthetic dataset we call SCOUT, which mimics human reasoning in visual grounding. This dataset provides rich supervision signals, describing a step-by-step multimodal reasoning process, thereby simplifying the task of visual grounding. We evaluate our approach on several standard benchmark datasets, encompassing grounded image captioning, zero-shot localization, and visual grounding tasks. Our method consistently delivers strong performance across all tasks, while retaining the pre-trained image understanding capabilities.


#120
Generate, Transduct, Adapt: Iterative Transduction with VLMs

Oindrila Saha · Logan Lawrence · Grant Horn · Subhransu Maji

Transductive zero-shot learning with vision-language models leverages image-image similarities within the dataset to achieve better classification accuracy compared to the inductive setting. However, there is little work that explores the structure of the language space in this context. We propose GTA-CLIP, a novel technique that incorporates supervision from language models for joint transduction in language and vision spaces. Our approach is iterative and consists of three steps: (i) incrementally exploring the attribute space by querying language models, (ii) an attribute-augmented transductive inference procedure, and (iii) fine-tuning the language and vision encoders based on inferred labels within the dataset. Through experiments with CLIP encoders, we demonstrate that GTA-CLIP yields an average performance improvement of 9.5% and 4.0% across 12 datasets and 3 encoders, over CLIP and transductive CLIP respectively, in the zero-shot setting. We also observe similar improvements in a few-shot setting. We present ablation studies that demonstrate the value of each step and visualize how the vision and language spaces evolve over iterations driven by the transductive learning.


#121
BabyVLM: Data-Efficient Pretraining of VLMs Inspired by Infant Learning

Shengao Wang · Arjun Chandra · Aoming Liu · Boqing Gong · Venkatesh Saligrama

Human infants rapidly develop visual reasoning skills from minimal input, suggesting that developmentally inspired pretraining could significantly enhance the efficiency of vision-language models (VLMs). Although recent efforts have leveraged infant-inspired datasets like SAYCam, existing evaluation benchmarks remain misaligned—they are either too simplistic, narrowly scoped, or tailored for large-scale pretrained models. Additionally, training exclusively on infant data overlooks the broader, diverse input from which infants naturally learn. To address these limitations, we propose BabyVLM, a novel framework comprising comprehensive in-domain evaluation benchmarks and a synthetic training dataset created via child-directed transformations of existing datasets. We demonstrate that VLMs trained with our synthetic dataset achieve superior performance on BabyVLM tasks compared to models trained solely on SAYCam or existing general-purpose datasets. BabyVLM thus provides a robust, developmentally aligned evaluation tool and illustrates how compact models trained on carefully curated data can generalize effectively, opening pathways toward data-efficient vision-language learning paradigms.


#122
Controlling Multimodal LLMs via Reward-guided Decoding

Oscar Mañas · Pierluca D'Oro · Koustuv Sinha · Adriana Romero-Soriano · Michal Drozdzal · Aishwarya Agrawal

As Multimodal Large Language Models (MLLMs) gain widespread applicability, it is becoming increasingly desirable to adapt them for diverse user needs. In this paper, we study the adaptation of MLLMs through controlled decoding. To achieve this, we introduce the first method for reward-guided decoding of MLLMs and demonstrate its application in improving their visual grounding. Our method involves building reward models for visual grounding and using them to guide the MLLM's decoding process. Concretely, we build two separate reward models to independently control the degree of object precision and recall in the model's output. Our approach enables on-the-fly controllability of an MLLM's inference process in two ways: first, by giving control over the relative importance of each reward function during decoding, allowing a user to dynamically trade off between object precision and recall in image captioning tasks; second, by giving control over the breadth of the search during decoding, allowing the user to control the trade-off between the amount of test-time compute and the degree of visual grounding. We evaluate our method on standard object hallucination benchmarks, showing that it provides significant controllability over MLLM inference, while matching or surpassing the performance of existing hallucination mitigation methods.
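
The two control knobs in this abstract (the relative weight of the two rewards and the breadth of the search) can be sketched with a toy decoder. Everything below is hypothetical: `propose` stands in for sampling candidate continuations from an MLLM, the two toy reward functions stand in for learned reward models, and the greedy pick-the-best loop is only one simple way to use the rewards.

```python
from typing import Callable, Sequence

def reward_guided_decode(
    propose: Callable[[str, int], Sequence[str]],
    r_precision: Callable[[str], float],
    r_recall: Callable[[str], float],
    w: float = 0.5,      # relative importance of the precision reward
    k: int = 4,          # search breadth: candidates scored per step
    steps: int = 3,
) -> str:
    text = ""
    for _ in range(steps):
        candidates = propose(text, k)
        if not candidates:
            break

        def score(c: str) -> float:
            full = text + c
            return w * r_precision(full) + (1.0 - w) * r_recall(full)

        text += max(candidates, key=score)   # keep the best-scoring candidate
    return text

# Toy stand-ins: the "image" contains a dog and a ball.
IMAGE_OBJECTS = {"dog", "ball"}
VOCAB = ["dog", "ball", "cat"]

def toy_propose(prefix: str, k: int) -> Sequence[str]:
    return [" " + wd for wd in VOCAB if wd not in prefix.split()][:k]

def toy_precision(text: str) -> float:
    words = text.split()
    return sum(wd in IMAGE_OBJECTS for wd in words) / len(words) if words else 0.0

def toy_recall(text: str) -> float:
    return len(IMAGE_OBJECTS & set(text.split())) / len(IMAGE_OBJECTS)

caption = reward_guided_decode(toy_propose, toy_precision, toy_recall, w=0.5, k=3, steps=2)
```

Raising `k` spends more compute per step on scoring candidates, while moving `w` toward 1 favors mentioning only objects that are present over covering all of them.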


#123
Improving Large Vision and Language Models by Learning from a Panel of Peers

Jefferson Hernandez · Jing Shi · Simon Jenni · Vicente Ordonez · Kushal Kafle

Traditional alignment methods for Large Vision and Language Models (LVLMs) primarily rely on human-curated preference data. Human-generated preference data is costly; machine-generated preference data is limited in quality; and self-supervised preference data often introduces hallucinations. To overcome these limitations, we propose a novel Panel-of-Peers learning framework inspired by collaborative learning among humans. This approach leverages a panel of LVLMs, each evaluating and learning from their collective outputs through an iterative self-improvement process. By simulating a peer review system, our models generate, assess, and refine outputs in response to a curated set of prompts, mimicking a classroom learning environment. We demonstrate that this methodology enhances model performance without requiring extensive human-labeled datasets. Our experiments show significant improvement across multiple benchmarks, demonstrating the potential of peer evaluations as a scalable alternative to self-supervised alignment. Notably, we show that Panel-of-Peers increases the average score on fifteen benchmarks from 48% to 57%.


#124
CE-FAM: Concept-Based Explanation via Fusion of Activation Maps

Michihiro Kuroki · Toshihiko Yamasaki

Although saliency maps can highlight important regions to explain the reasoning behind image classification in artificial intelligence (AI), the meaning of these regions is left to the user's interpretation. In contrast, concept-based explanations decompose AI predictions into human-understandable concepts, clarifying their contributions. However, few methods can simultaneously reveal what concepts an image classifier learns, which regions are associated with them, and how they contribute to predictions. We propose a novel concept-based explanation method, Concept-based Explanation via Fusion of Activation Maps (CE-FAM). It employs a branched network that shares activation maps with an image classifier and learns to mimic the embeddings of a Vision and Language Model (VLM). The branch network predicts concepts in an image, and their corresponding regions are represented by a weighted sum of activation maps, with weights given by the gradients of the concept prediction scores. Their contributions are quantified based on their impact on the image classification score. Our method provides a general framework for identifying the concept regions and their contributions while leveraging VLM knowledge to handle arbitrary concepts without requiring an annotated dataset. Furthermore, we introduce a novel evaluation metric to assess the accuracy of the concept regions. Our qualitative and quantitative evaluations demonstrate our method outperforms existing approaches and excels in zero-shot inference for unseen concepts.
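
The gradient-weighted sum of activation maps mentioned in this abstract can be sketched as follows. The spatial averaging of gradients, the ReLU, and the max-normalization are Grad-CAM-style assumptions added for illustration, and the function name is hypothetical; the abstract states only that the weights come from the gradients of the concept prediction scores.

```python
import numpy as np

def concept_region_map(activations, gradients):
    """Fuse activation maps into one region map for a single concept.

    activations, gradients: arrays of shape (channels, height, width), where
    gradients holds d(concept score) / d(activations).
    """
    weights = gradients.mean(axis=(1, 2))               # one weight per map (assumed pooling)
    fused = np.tensordot(weights, activations, axes=1)  # weighted sum over channels
    fused = np.maximum(fused, 0.0)                      # keep positive evidence (assumed)
    peak = fused.max()
    return fused / peak if peak > 0 else fused

# Two 2x2 activation maps: the first supports the concept, the second opposes it.
acts = np.array([[[1.0, 0.0], [0.0, 0.0]],
                 [[0.0, 0.0], [0.0, 1.0]]])
grads = np.array([np.full((2, 2), 2.0), np.full((2, 2), -1.0)])
region = concept_region_map(acts, grads)
```

In this toy case only the top-left location remains highlighted, because the second map receives a negative weight and is removed by the ReLU.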


#125
Leveraging Spatial Invariance to Boost Adversarial Transferability

Zihan Zhou · LI LI · Yanli Ren · Chuan Qin · Guorui Feng

Adversarial examples, crafted with imperceptible perturbations, reveal a significant vulnerability of Deep Neural Networks (DNNs). More critically, the transferability of adversarial examples allows attackers to induce unreasonable predictions without requiring knowledge about the target model. DNNs exhibit spatial invariance, meaning that the position of an object does not affect the classification result. However, existing input transformation-based adversarial attacks solely focus on behavioral patterns at a singular position, failing to fully exploit the spatial invariance exhibited by DNNs across multiple positions, thus constraining the transferability of adversarial examples. To address this, we propose a multi-scale, multi-position input transformation-based attack called Spatial Invariance Diversity (SID). Specifically, SID uses hybrid spatial-spectral fusion mechanisms within localized receptive fields, followed by multi-scale spatial downsampling and positional perturbations via random transformations, thereby crafting an ensemble of inputs to activate diverse behavioral patterns for effective adversarial perturbations. Extensive experiments on the ImageNet dataset demonstrate that SID could achieve better transferability than the current state-of-the-art input transformation-based attacks. Additionally, SID can be flexibly integrated with other input transformation-based or gradient-based attacks, further enhancing the transferability of adversarial examples.


#126
Client2Vec: Improving Federated Learning by Distribution Shifts Aware Client Indexing

Yongxin Guo · Lin Wang · Xiaoying Tang · Tao Lin

Federated Learning (FL) is a privacy-preserving distributed machine learning paradigm. Nonetheless, the substantial distribution shifts among clients pose a considerable challenge to the performance of current FL algorithms. To mitigate this challenge, various methods have been proposed to enhance the FL training process. This paper endeavors to tackle the issue of data heterogeneity from another perspective: improving FL algorithms prior to the actual training stage. Specifically, we introduce the Client2Vec mechanism, which generates, for each client, a unique client index that encodes the client's distribution-shift information before the commencement of FL training. Subsequently, we leverage the generated client index to enhance the subsequent FL training process. To demonstrate the effectiveness of the proposed Client2Vec method, we conduct three case studies that assess the impact of the client index on the FL training process. These case studies encompass enhanced client sampling, model aggregation, and local training. Extensive experiments conducted on diverse datasets and model architectures show the efficacy of Client2Vec across all three case studies. Our code will be publicly available.

Long-Tailed Class-Incremental Learning (LT-CIL) faces critical challenges due to biased gradient updates arising from imbalanced data distributions and the inherent stability-plasticity trade-off, which collectively degrade tail-class performance and induce catastrophic forgetting. To address these limitations, we introduce Geometric Prototype Alignment (GPA), a model-agnostic classifier initialization method that calibrates learning dynamics through geometric feature space alignment. GPA initializes classifier weights by aligning them with frozen class prototypes onto a unit hypersphere, explicitly disentangling magnitude imbalance from directional discriminability. During incremental training, we introduce Dynamic Anchoring to adjust weights while preserving geometric consistency, thereby balancing plasticity for new classes while keeping stability for previously learned knowledge. When integrated into state-of-the-art CIL frameworks such as LUCIR and DualPrompt, GPA demonstrates significant improvements: achieving an average incremental accuracy increase of 12.3% and decreasing forgetting rates by 12.2% on CIFAR100-LT. Theoretical analysis reveals that GPA accelerates convergence by 2.7x and achieves nearly Fisher-optimal decision boundaries. Our work lays a geometric foundation for stable representation learning in LT-CIL scenarios, which addresses both catastrophic forgetting and tail-class degradation.
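
The initialization idea in this abstract (aligning classifier weights with class prototypes on a unit hypersphere) can be illustrated with a small sketch. Using the class-mean feature as the prototype and L2 normalization as the projection are assumptions, as is the function name; Dynamic Anchoring is not covered here.

```python
import numpy as np

def prototype_init(features, labels, num_classes):
    """Return unit-norm class-mean prototypes as initial classifier weights."""
    dim = features.shape[1]
    weights = np.zeros((num_classes, dim))
    for c in range(num_classes):
        members = features[labels == c]
        if len(members) == 0:
            continue                      # class without samples: leave row at zero
        proto = members.mean(axis=0)
        norm = np.linalg.norm(proto)
        weights[c] = proto / norm if norm > 0 else proto
    return weights

# A head class with large-magnitude features and a tail class with one weak sample.
feats = np.array([[3.0, 0.0], [5.0, 0.0], [4.0, 0.0], [0.0, 0.1]])
labs = np.array([0, 0, 0, 1])
w_init = prototype_init(feats, labs, num_classes=2)
```

After normalization the head and tail classes have weight vectors of equal length, so only their directions differ, which is the separation of magnitude from direction that the abstract describes.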


#128
PEFTDiff: Diffusion-Guided Transferability Estimation for Parameter-Efficient Fine-Tuning

PRAFFUL KHOBA · Zijian Wang · Chetan Arora · Mahsa Baktashmotlagh

Selecting an optimal Parameter-Efficient Fine-Tuning (PEFT) technique for a downstream task is a fundamental challenge in transfer learning. Unlike full fine-tuning, where all model parameters are updated, PEFT techniques modify only a small subset of parameters while keeping the backbone frozen, making them computationally efficient. However, this introduces a unique problem: selecting the most effective PEFT method for a given dataset. Existing transferability estimation (TE) metrics primarily focus on ranking distinct architectures and struggle to detect subtle embedding differences introduced by various PEFT methods sharing the same backbone. To address this limitation, we propose a novel diffusion-based metric explicitly designed for PEFT selection. Unlike conventional metrics, our approach models the fine-grained geometric relationships of embedding spaces through a diffusion process, effectively quantifying intra-class compactness and inter-class separability. Extensive evaluations on the VTAB-1k benchmark validate our method’s effectiveness, demonstrating a substantial 68.95% improvement over LogME, 1297.29% over $\mathcal{N}$LEEP, 149.75% over NCTI, and 140.46% over SFDA, four widely used TE methods designed for ranking pre-trained models.


#129
One Last Attention for Your Vision-Language Model

Liang Chen · Ghazi Shazan Ahmad · Tianjun Yao · Lingqiao Liu · Zhiqiang Shen

Pretrained vision-language models (VLMs), such as CLIP, achieve remarkable zero-shot performance, yet their downstream potential hinges on effective fine-tuning. Most adaptation methods focus on refining representations from separate modalities (text or vision) but neglect the critical role of their fused representation in the decision-making process, i.e., the rational matrix that drives the final prediction. To bridge the gap, we propose a simple yet effective Rational Adaptation (RAda) that explicitly exploits the final fused representation during fine-tuning. RAda employs a learned mask, obtained from a lightweight attention layer attached at the end of a VLM, to dynamically calibrate the contribution of each element in the rational matrix, enabling targeted adjustments to the final cross-modal interactions without incurring costly modifications to intermediate features. Experiments in different settings (i.e., updating or freezing pretrained encoders in adaptation, and test-time training that can only access the unlabeled test data) show that RAda serves as a versatile fine-tuning technique, improving the baseline with minimal code and performing comparably against current state-of-the-art methods in most settings.


#130
Forgetting Through Transforming: Enabling Federated Unlearning via Class-Aware Representation Transformation

Qi Guo · Zhen Tian · Minghao Yao · Saiyu Qi · Yong Qi · Bingyi Liu

Federated Unlearning (FU) should satisfy three key requirements: a guarantee of data erasure, preservation of model utility, and reduction of unlearning time. Recent studies focus on identifying and modifying original model parameters relevant to unlearning data. While they can achieve faster unlearning, they degrade the model performance on remaining data or fail to forget unlearning data due to the difficulty in isolating specific parameters of the unlearning data. By revisiting the representation distribution of the optimal unlearning models (i.e., the retrained models), we observe that unlearning data tends to cluster within semantically related categories of remaining data. This inspired us to transform the distribution of unlearning data to fuse with similar categories in the remaining data for effective FU. Based on this insight, we propose a novel framework, named FUCRT, to achieve Federated Unlearning via Class-aware Representation Transformation. FUCRT consists of two key components: (1) a transformation class identification strategy (TCI) that leverages the original model to identify appropriate transformation classes for unlearning data, and (2) a targeted transformation learning process (TTL) with cross-class fusion mechanism to ensure effective and consistent transformation of unlearning data. Extensive experiments on four datasets demonstrate that FUCRT not only achieves 100% of data erasure but also outperforms state-of-the-art methods by an average of 2.96% and 3.78% in utility preservation under IID and Non-IID settings, respectively. Moreover, it reduces unlearning time by 19.13% to 96.38%.


#131
MMAT-1M: A Large Reasoning Dataset for Multimodal Agent Tuning

Tianhong Gao · Yannian Fu · Weiqun Wu · Haixiao Yue · Shanshan Liu · Gang Zhang

Large Language Models (LLMs), enhanced through agent tuning, have demonstrated remarkable capabilities in Chain-of-Thought (CoT) and tool utilization, significantly surpassing the performance of standalone models. However, the multimodal domain still lacks a large-scale, high-quality agent tuning dataset to unlock the full potential of multimodal large language models. To bridge this gap, we introduce MMAT-1M, the first million-scale multimodal agent tuning dataset designed to support CoT, reflection, and dynamic tool usage. Our dataset is constructed through a novel four-stage data engine: 1) We first curate publicly available multimodal datasets containing question-answer pairs; 2) Then, leveraging GPT-4o, we generate rationales for the original question-answer pairs and dynamically integrate API calls and Retrieval Augmented Generation (RAG) information through a multi-turn paradigm; 3) Furthermore, we refine the rationales through reflection to ensure logical consistency and accuracy, creating a multi-turn dialogue dataset with both Rationale and Reflection (RR); 4) Finally, to enhance efficiency, we optionally compress multi-turn dialogues into a One-turn Rationale and Reflection format (ORR). By fine-tuning open-source multimodal models on the MMAT-1M, we observe significant performance gains. For instance, the InternVL2.5-8B-RR model achieves an average improvement of 2.7% across eight public benchmarks and 8.8% on the RAG benchmark Dyn-VQA, demonstrating the dataset's effectiveness in enhancing multimodal reasoning and tool-based capabilities. The code and dataset will be publicly available to support reproducibility and further research.


#132
Highlight
Geometry Distributions

Biao Zhang · Jing Ren · Peter Wonka

Neural representations of 3D data have been widely adopted across various applications, particularly in recent work leveraging coordinate-based networks to model scalar or vector fields. However, these approaches face inherent challenges, such as handling thin structures and non-watertight geometries, which limit their flexibility and accuracy. In contrast, we propose a novel geometric data representation that models geometry as distributions, a powerful representation that makes no assumptions about surface genus, connectivity, or boundary conditions. Our approach uses diffusion models with a novel network architecture to learn surface point distributions, capturing fine-grained geometric details. We evaluate our representation qualitatively and quantitatively across various object types, demonstrating its effectiveness in achieving high geometric fidelity. Additionally, we explore applications using our representation, such as textured mesh representation, neural surface compression, dynamic object modeling, and rendering, highlighting its potential to advance 3D geometric learning.


#133
FastJSMA: Accelerating Jacobian-based Saliency Map Attacks through Gradient Decoupling

Zhenghao Gao · Shengjie Xu · Zijing Li · Meixi Chen · Chaojian Yu · Yuanjie Shao · Changxin Gao

Adversarial attacks play a critical role in evaluating the robustness of deep learning models. Jacobian-based Saliency Map Attack (JSMA) is an interpretable adversarial method that offers excellent pixel-level control and provides valuable insights into model vulnerabilities. However, its quadratic computational complexity $O(M^2 \times N)$ renders it impractical for large-scale datasets, limiting its application despite its inherent value. This paper proposes FastJSMA, an efficient attack method that addresses these computational limitations. Our approach introduces a gradient decoupling mechanism that decomposes the Jacobian calculation into complementary class suppression ($g^-$) and class excitation ($g^+$) gradients, reducing complexity to $O(M\sqrt{N})$. Additionally, we implement a class probing mechanism and an adaptive saliency threshold to further optimize the process. Experimental results across multiple datasets demonstrate that FastJSMA maintains high attack success rates (98.4% relative efficiency) while dramatically reducing computation time, requiring only 1.8% of JSMA's processing time on CIFAR-100 and successfully operating on ImageNet where traditional JSMA fails due to memory constraints. This advancement enables the practical application of interpretable saliency map-based attacks on large-scale datasets, balancing effectiveness with computational efficiency.

The problem of learning from long-tailed noisy data, referred to as Long-Tailed Noisy Label Learning (LTNLL), presents significant challenges in deep learning. LTNLL datasets are typically affected by two primary issues: class imbalance and label noise. While previous methods have addressed these problems separately, the simultaneous presence of both in real-world applications remains underexplored. In this paper, we introduce a simple yet effective method, Instances Benefitting Classes (IBC). Our philosophy is to simultaneously overcome overfitting to noisy classes and transfer knowledge between semantically related classes. At the instance level, we propose selecting the top-$k$ semantically similar classes and using them to construct soft labels. Specifically, we soften noisy hard labels by reducing the probability of noisy classes and reallocating this probability to the semantically similar classes. This reduces the model's overconfidence in noisy classes while enhancing its focus on tail classes. We next propose a novel shot-specific multi-expert ensemble learning framework to make knowledge transfer more targeted, where we maintain multiple shot-specific soft labels for each instance, with each expert supervised by one of these labels. By integrating these experts, we demonstrate that IBC exhibits more separable representations, improving both overall and partition performance. Extensive experiments show that IBC outperforms existing state-of-the-art (SOTA) methods on a variety of benchmark and real-world datasets, achieving improvements ranging from 1.89% to 4.99% on the CIFAR-10 and CIFAR-100 datasets across all settings. The source code is provided in the supplementary material.
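
The label-softening step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `soften_label`, the similarity scores, the `keep` fraction, and the proportional reallocation rule are all assumptions made for illustration.

```python
def soften_label(noisy_label, similarity, k=2, keep=0.7):
    """Build a soft label from a (possibly noisy) hard label.

    similarity[c] is a hypothetical semantic similarity between class c and
    the labeled class; `keep` is the probability mass left on the hard label.
    Both are illustrative assumptions, not values from the paper.
    """
    num_classes = len(similarity)
    # Rank the other classes by similarity and keep the top-k.
    others = [c for c in range(num_classes) if c != noisy_label]
    top_k = sorted(others, key=lambda c: similarity[c], reverse=True)[:k]
    total = sum(similarity[c] for c in top_k)
    soft = [0.0] * num_classes
    soft[noisy_label] = keep
    for c in top_k:
        # Share the removed mass in proportion to similarity.
        share = similarity[c] / total if total > 0 else 1.0 / len(top_k)
        soft[c] = (1.0 - keep) * share
    return soft


# Class 0 is the given label; classes 1 and 2 are its nearest neighbours.
soft = soften_label(noisy_label=0, similarity=[1.0, 0.6, 0.2, 0.1], k=2, keep=0.7)
```

The result still sums to one, but 30% of the mass has moved from the labeled class to its two most similar classes, so a wrong hard label is penalized less harshly during training.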


#135
Pi-GPS: Enhancing Geometry Problem Solving by Unleashing the Power of Diagrammatic Information

Junbo Zhao · Ting Zhang · Jiayu Sun · Mi Tian · Hua Huang

Geometry problem solving has garnered increasing attention due to its potential applications in the field of intelligent education. Inspired by the observation that text often introduces ambiguities that diagrams can clarify, this paper presents Pi-GPS, a novel framework that unleashes the power of diagrammatic information to resolve textual ambiguities, an aspect largely overlooked in prior research. Specifically, we design a micro module comprising a rectifier and a verifier: the rectifier employs MLLMs to disambiguate text based on the diagrammatic context, while the verifier ensures that the rectified output adheres to geometric rules, mitigating model hallucinations. Additionally, we explore the impact of LLMs in the theorem predictor based on the disambiguated formal language. Empirical results demonstrate that Pi-GPS surpasses state-of-the-art models, achieving a nearly 10% improvement on Geometry3K over prior neural-symbolic approaches. We hope this work highlights the significance of resolving textual ambiguity in multimodal mathematical reasoning, a crucial factor limiting performance.


#136
VAGUE: Visual Contexts Clarify Ambiguous Expressions

Heejeong Nam · Jinwoo Ahn · Keummin Ka · Jiwan Chung · Youngjae Yu

Human communication often relies on visual cues to resolve ambiguity. While humans can intuitively integrate these cues, AI systems often find it challenging to engage in sophisticated multimodal reasoning. We introduce VAGUE, a benchmark evaluating multimodal AI systems' ability to integrate visual context for intent disambiguation. VAGUE consists of 1.6K ambiguous textual expressions, each paired with an image and multiple-choice interpretations, where the correct answer is only apparent with visual context. The dataset spans both staged, complex (Visual Commonsense Reasoning) and natural, personal (Ego4D) scenes, ensuring diversity. Our experiments reveal that existing multimodal AI models struggle to infer the speaker's true intent. While performance consistently improves from the introduction of more visual cues, the overall accuracy remains far below human performance, highlighting a critical gap in multimodal reasoning. Analysis of failure cases demonstrates that current models fail to distinguish true intent from superficial correlations in the visual scene, indicating that they perceive images but do not effectively reason with them.


#137
Diffusion-based Source-biased Model for Single Domain Generalized Object Detection

Han Jiang · Wenfei Yang · Tianzhu Zhang · Yongdong Zhang

Single domain generalized object detection aims to train an object detector on a single source domain and generalize it to any unseen domain. Although existing approaches based on data augmentation exhibit promising results, they overlook domain discrepancies across multiple augmented domains, which limits the performance of object detectors. To tackle this problem, we propose a novel diffusion-based framework, termed SDG-DiffDet, to mitigate the impact of domain gaps on object detectors. The proposed SDG-DiffDet consists of a memory-guided diffusion module and a source-guided denoising module. Specifically, in the memory-guided diffusion module, we design feature statistics memories that mine diverse style information from local parts to augment source features. The augmented features further serve as noise in the diffusion process, enabling the model to capture distribution differences between practical domain distributions. In the source-guided denoising module, we design a text-guided condition to facilitate distribution transfer from any unseen distribution to the source distribution in the denoising process. By combining these two designs, our proposed SDG-DiffDet effectively models feature augmentation and target-to-source distribution transfer within a unified diffusion framework, thereby enhancing the generalization ability of object detectors. Extensive experiments demonstrate that the proposed SDG-DiffDet achieves state-of-the-art performance across two challenging scenarios.


#138
PRO-VPT: Distribution-Adaptive Visual Prompt Tuning via Prompt Relocation

Chikai Shang · Mengke Li · Yiqun Zhang · Zhen Chen · Jinlin Wu · Fangqing Gu · Yang Lu · Yiu-ming Cheung

Visual prompt tuning (VPT) provides an efficient and effective solution for adapting pre-trained models to various downstream tasks by incorporating learnable prompts. However, most prior art indiscriminately applies a fixed prompt distribution across different tasks, neglecting that the importance of each block differs depending on the task. In this paper, we investigate adaptive distribution optimization (ADO) by addressing two key questions: (1) How to appropriately and formally define ADO, and (2) How to design an adaptive distribution strategy guided by this definition? Through in-depth analysis, we provide an affirmative answer that properly adjusting the distribution significantly improves VPT performance, and further uncover a key insight that a nested relationship exists between ADO and VPT. Based on these findings, we propose a new VPT framework, termed PRO-VPT (iterative Prompt RelOcation-based VPT), which adaptively adjusts the distribution building upon a nested optimization formulation. Specifically, we develop a prompt relocation strategy for ADO derived from this formulation, comprising two optimization steps: identifying and pruning idle prompts, followed by determining the optimal blocks for their relocation. By iteratively performing prompt relocation and VPT, our proposal adaptively learns the optimal prompt distribution, thereby unlocking the full potential of VPT. Extensive experiments demonstrate that our proposal significantly outperforms state-of-the-art VPT methods, e.g., PRO-VPT surpasses VPT by 1.6% average accuracy, leading prompt-based methods to state-of-the-art performance on the VTAB-1k benchmark. The code is available at https://anonymous.4open.science/r/PRO-VPT.


#139
BATCLIP: Bimodal Online Test-Time Adaptation for CLIP

Sarthak Kumar Maharana · Baoming Zhang · Leonid Karlinsky · Rogerio Feris · Yunhui Guo

Although open-vocabulary classification models like Contrastive Language Image Pretraining (CLIP) have demonstrated strong zero-shot learning capabilities, their robustness to common image corruptions remains poorly understood. Through extensive experiments, we show that zero-shot CLIP lacks robustness to common image corruptions during test-time, necessitating the adaptation of CLIP to unlabeled corrupted images using test-time adaptation (TTA). However, we found that existing TTA methods have severe limitations in adapting CLIP due to their unimodal nature. To address these limitations, we propose BATCLIP, a bimodal online TTA method designed to improve CLIP's robustness to common image corruptions. The key insight of our approach is not only to adapt the visual encoders for improving image features but also to strengthen the alignment between image and text features by promoting a stronger association between the image class prototype, computed using pseudo-labels, and the corresponding text feature. We evaluate our approach on benchmark image corruption datasets and achieve state-of-the-art results in online TTA for CLIP. Furthermore, we evaluate our proposed TTA approach on various domain generalization datasets to demonstrate its generalization capabilities.


#140
Mind the Cost of Scaffold! Benign Clients May Even Become Accomplices of Backdoor Attack

Xingshuo Han · Xuanye Zhang · Xiang Lan · Haozhao Wang · Shengmin Xu · Shen Ren · Jason Zeng · Ming Wu · Michael Heinrich · Tianwei Zhang

By using a control variate to calibrate the local gradient of each client, Scaffold has been widely known as a powerful solution to mitigate the impact of data heterogeneity in Federated Learning. Although Scaffold achieves significant performance improvements, we show that this superiority is at the cost of increased security vulnerabilities. Specifically, this paper presents BadSFL, the first backdoor attack targeting Scaffold, which turns benign clients into accomplices to amplify the attack effect. The core idea of BadSFL is to uniquely tamper with the control variate to subtly steer benign clients' local gradient updates towards the attacker's poisoned direction, effectively turning them into unwitting accomplices, significantly enhancing the backdoor persistence. Additionally, BadSFL leverages a GAN-enhanced poisoning strategy to enrich the attacker’s dataset, maintaining high accuracy on both benign and backdoored samples while remaining stealthy. Extensive experiments demonstrate that BadSFL achieves superior attack durability, maintaining effectiveness for over 60 global rounds—lasting up to three times longer than existing baselines even after ceasing malicious model injections.
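
For readers unfamiliar with Scaffold, the role of the control variate can be sketched as follows. The local step `w <- w - lr * (g - c_i + c)` is the standard Scaffold correction; the tampered value of the global control variate below is purely hypothetical and is not the BadSFL attack, whose details the abstract does not give.

```python
def scaffold_local_step(weights, grad, c_local, c_global, lr=0.1):
    """One Scaffold-style local step: w <- w - lr * (g - c_i + c)."""
    return [w - lr * (g - ci + c)
            for w, g, ci, c in zip(weights, grad, c_local, c_global)]


w = [1.0, 1.0]
g = [0.5, -0.5]      # a benign client's own gradient
c_i = [0.0, 0.0]     # the client's control variate

# Same client, same data, same gradient; only the global control variate differs.
honest = scaffold_local_step(w, g, c_i, c_global=[0.0, 0.0])
shifted = scaffold_local_step(w, g, c_i, c_global=[1.0, 0.0])  # hypothetical tampered value
```

Because every client adds the global control variate to its own gradient, a change to that shared term moves the benign client's update along the first coordinate even though its data and gradient are untouched.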


#141
AVTrustBench: Assessing and Enhancing Reliability and Robustness in Audio-Visual LLMs

Sanjoy Chowdhury · Sayan Nag · Subhrajyoti Dasgupta · Yaoting Wang · Mohamed Elhoseiny · Ruohan Gao · Dinesh Manocha

With the rapid advancement of Multi-modal Large Language Models (MLLMs), several diagnostic benchmarks have recently been developed to assess these models' multi-modal reasoning proficiency. However, these benchmarks are restricted to assessing primarily the visual aspect and do not examine the holistic audio-visual (AV) understanding. Moreover, currently, there are no benchmarks that investigate the capabilities of AVLLMs to calibrate their responses when presented with perturbed inputs. To this end, we introduce Audio-Visual Trustworthiness assessment Benchmark (AVTrustBench), comprising 600K samples spanning over 9 meticulously crafted tasks, evaluating the capabilities of AVLLMs across three distinct dimensions: Adversarial attack, Compositional reasoning, and Modality-specific dependency. Using our benchmark, we extensively evaluate 16 state-of-the-art AVLLMs. The findings reveal that the majority of existing models fall significantly short of achieving human-like comprehension, offering valuable insights for future research directions. To alleviate the limitations in the existing approaches, we further propose a robust, model-agnostic training strategy based on calibrated audio-visual preference optimization, CAVPref, obtaining a gain of up to 30.19% across all 9 tasks. We will publicly release our code and benchmark to facilitate future research in this direction.


#142
Verbalized Representation Learning for Interpretable Few-Shot Generalization

Cheng-Fu Yang · Da Yin · Wenbo Hu · Heng Ji · Nanyun Peng · Bolei Zhou · Kai-Wei Chang

Humans recognize objects after observing only a few examples, a remarkable capability enabled by their inherent language understanding of the real-world environment. Developing verbalized and interpretable representation can significantly improve model generalization in low-data settings. In this work, we propose Verbalized Representation Learning (VRL), a novel approach for automatically extracting human-interpretable features for object recognition using few-shot data. Our method uniquely captures inter-class differences and intra-class commonalities in the form of natural language by employing a Vision-Language Model (VLM) to identify key discriminative features between different classes and shared characteristics within the same class. These verbalized features are then mapped to numeric vectors through the VLM. The resulting feature vectors can be further utilized to train and infer with downstream classifiers. Experimental results show that, at the same model scale, VRL achieves a 24% absolute improvement over prior state-of-the-art methods while using 95% less data and a smaller model. Furthermore, compared to human-labeled attributes, the features learned by VRL exhibit a 20% absolute gain when used for downstream classification tasks.


#143
UIPro: Unleashing Superior Interaction Capability For GUI Agents

Hongxin Li · Jingran Su · Jingfan CHEN · Zheng Ju · Yuntao Chen · Li Qing · Zhaoxiang Zhang

Building autonomous agents that perceive and operate graphical user interfaces (GUIs) like humans has long been a vision in the field of artificial intelligence. Central to these agents is the capability for GUI interaction, which involves GUI understanding and planning capabilities. Existing methods have tried developing GUI agents based on the multi-modal comprehension ability of vision-language models (VLMs). However, the limited scenario, insufficient size, and heterogeneous action spaces hinder the progress of building generalist GUI agents. To resolve these issues, this paper proposes UIPro, a novel generalist GUI agent trained with extensive multi-platform and multi-task GUI interaction data, coupled with a unified action space. We first curate a comprehensive dataset encompassing 20.6 million GUI understanding tasks to pre-train UIPro, granting it a strong GUI grounding capability which is key to downstream GUI agent tasks. Subsequently, we establish a unified action space to harmonize heterogeneous GUI agent task datasets and produce a merged dataset to foster the action prediction ability of UIPro via continued fine-tuning. Experimental results demonstrate UIPro's superior performance across multiple GUI task benchmarks on various platforms, highlighting the effectiveness of our approach. We will release the data curation programs and cleaned dataset.


#144
Loss Functions for Predictor-based Neural Architecture Search

Han Ji · Yuqi Feng · Jiahao Fan · Yanan Sun

Evaluation is a critical but costly procedure in neural architecture search (NAS). Performance predictors have been widely adopted to reduce evaluation costs by directly estimating architecture performance. The effectiveness of predictors is heavily influenced by the choice of loss functions. While traditional predictors employ regression loss functions to evaluate the absolute accuracy of architectures, recent approaches have explored various ranking-based loss functions, such as pairwise and listwise ranking losses, to focus on the ranking of architecture performance. Despite their success in NAS, the effectiveness and characteristics of these loss functions have not been thoroughly investigated. In this paper, we conduct the first comprehensive study on loss functions in performance predictors, categorizing them into three main types: regression, ranking, and weighted loss functions. Specifically, we assess eight loss functions using a range of NAS-relevant metrics on 13 tasks across five search spaces. Our results reveal that specific categories of loss functions can be effectively combined to enhance predictor-based NAS. Furthermore, our findings could provide practical guidance for selecting appropriate loss functions for various tasks. We hope this work provides meaningful insights to guide the development of loss functions for predictor-based methods in the NAS community.
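
The difference between the regression and ranking categories can be seen in a small sketch. The two functions below are generic textbook forms (mean squared error and a pairwise hinge loss), not the specific eight losses studied in the paper, and the accuracy values are made up for illustration.

```python
def mse_loss(pred, target):
    """Regression loss: penalizes errors in the absolute predicted accuracy."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def pairwise_hinge_loss(pred, target, margin=0.01):
    """Ranking loss: penalizes pairs whose predicted order disagrees with the true order."""
    total, pairs = 0.0, 0
    n = len(pred)
    for i in range(n):
        for j in range(n):
            if target[i] > target[j]:
                total += max(0.0, margin - (pred[i] - pred[j]))
                pairs += 1
    return total / pairs if pairs else 0.0


true_acc = [0.90, 0.92, 0.94]      # hypothetical ground-truth accuracies
rescaled = [0.10, 0.20, 0.30]      # wrong values, correct ranking
reversed_order = [0.94, 0.92, 0.90]  # close values, wrong ranking
```

Here `rescaled` has a large regression loss but zero ranking loss, while `reversed_order` has a small regression loss but a positive ranking loss, which is why the choice of loss matters when the search only needs the relative order of architectures.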

We introduce DiMPLe (Disentangled Multi-Modal Prompt Learning), a novel approach to disentangle invariant and spurious features across vision and language modalities in multi-modal learning. Spurious correlations in visual data often hinder out-of-distribution (OOD) performance. Unlike prior methods focusing solely on image features, DiMPLe disentangles features within and across modalities while maintaining consistent alignment, enabling better generalization to novel classes and robustness to distribution shifts. Our method combines three key objectives: (1) mutual information minimization between invariant and spurious features, (2) spurious feature regularization, and (3) contrastive learning on invariant features. Extensive experiments demonstrate that DiMPLe achieves superior performance compared to CoOp-OOD when averaged across 11 diverse datasets, with absolute gains of 15.27 in base class accuracy and 44.31 in novel class accuracy. The code will be released publicly upon acceptance.


#146
GRAB: A Challenging GRaph Analysis Benchmark for Large Multimodal Models

Jonathan Roberts · Kai Han · Samuel Albanie

Large multimodal models (LMMs) have exhibited proficiencies across many visual tasks. Although numerous well-known benchmarks exist to evaluate model performance, they increasingly have insufficient headroom. As such, there is a pressing need for a new generation of benchmarks challenging enough for the next generation of LMMs. One area in which LMMs show potential is graph analysis, specifically the tasks an analyst might typically perform when interpreting figures, such as estimating the mean, intercepts, or correlations of functions and data series. In this work, we introduce GRAB, a graph analysis benchmark, fit for current and future frontier LMMs. Our benchmark is entirely synthetic, ensuring high-quality, noise-free questions. GRAB is comprised of 2170 questions, covering four tasks and 23 graph properties. We evaluate 20 LMMs on GRAB, finding it to be a challenging benchmark, with the highest performing model attaining a score of just 21.7%. Finally, we conduct various ablations to investigate where the models succeed and struggle. We release GRAB to encourage progress in this important, growing domain.

Diffusion models are renowned for their generative capabilities, yet their pretraining processes exhibit distinct phases of learning speed that have been entirely overlooked in prior post-training acceleration efforts in the community. In this study, we introduce a novel framework called MosaicDiff that aligns diffusion pretraining dynamics with post-training sampling acceleration via trajectory-aware structural pruning. Our approach leverages the observation that the middle, fast-learning stage of diffusion pretraining requires more conservative pruning to preserve critical model features, while the early and later, slow-learning stages benefit from a more aggressive pruning strategy. This adaptive pruning mechanism is the first to explicitly mirror the inherent learning speed variations of diffusion pretraining, thereby harmonizing the model's inner training dynamics with its accelerated sampling process. Extensive experiments on DiT and SDXL demonstrate that our method achieves significant speed-ups in sampling without compromising output quality, outperforming previous state-of-the-art methods by large margins, also providing a new viewpoint for more efficient and robust training-free diffusion acceleration.

Vision-language pre-training (VLP) models leverage large-scale cross-modal pre-training to align vision and text modalities, achieving impressive performance on tasks like image-text retrieval and visual grounding. However, these models are highly vulnerable to adversarial attacks, raising critical concerns about their robustness and reliability in safety-critical applications. Existing black-box attack methods are limited by insufficient data augmentation mechanisms or the disruption of global semantic structures, leading to poor adversarial transferability. To address these challenges, we propose the Global-Local Enhanced Adversarial Multimodal attack (GLEAM), a unified framework for generating transferable adversarial examples in vision-language tasks. GLEAM introduces a local feature enhancement module that achieves diverse local deformations while maintaining global semantic and geometric integrity. It also incorporates a global distribution expansion module, which expands feature space coverage through dynamic transformations. Additionally, a cross-modal feature alignment module leverages intermediate adversarial states to guide text perturbations. This enhances cross-modal consistency and adversarial text transferability. Extensive experiments on Flickr30K and MSCOCO datasets show that GLEAM outperforms state-of-the-art methods, with over 10%-30% higher attack success rates in image-text retrieval tasks and over 30% improved transferability on large models like Claude 3.5 Sonnet and GPT-4o. GLEAM provides a robust tool for exposing vulnerabilities in VLP models and offers valuable insights into designing more secure and reliable vision-language systems.


#149
Adversarial Training for Probabilistic Robustness

YI ZHANG · Yuhang Chen · Zhen Chen · Wenjie Ruan · Xiaowei Huang · Siddartha Khastgir · Xingyu Zhao

Deep learning (DL) has shown transformative potential across industries, yet its sensitivity to adversarial examples (AEs) limits its reliability and broader deployment. Research on DL robustness has developed various techniques, with adversarial training (AT) established as a leading approach to counter AEs. Traditional AT focuses on worst-case robustness (WCR), but recent work has introduced probabilistic robustness (PR), which evaluates the proportion of AEs within a local perturbation range, providing an overall assessment of the model's local robustness and acknowledging residual risks that are more practical to manage. However, existing AT methods are fundamentally designed to improve WCR, and no dedicated methods currently target PR. To bridge this gap, we reformulate a new min-max optimization as the theoretical foundation for AT focused on PR, and introduce an AT-PR training scheme with effective numerical algorithms to solve the new optimization problem. Our experiments, based on 38 DL models trained on common datasets and architectures, demonstrate that AT-PR achieves higher improvements in PR than AT-WCR methods and shows more consistent effectiveness across varying local inputs, with a smaller trade-off in model generalization. Open-source tools and all experiments are publicly accessible.
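
To make the notion of probabilistic robustness concrete, the sketch below estimates, by Monte Carlo sampling, the fraction of perturbations in a local range that do not change the prediction (one minus the proportion of AEs). It is an illustration only: the uniform sampling distribution, the L-infinity ball, and the toy one-dimensional classifier are assumptions, and this is an evaluation sketch rather than the AT-PR training scheme.

```python
import random


def estimate_pr(predict, x, label, eps=0.1, n_samples=2000, seed=0):
    """Monte Carlo estimate of the fraction of perturbations within an
    L-infinity ball of radius eps that keep the prediction equal to label."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_samples):
        x_pert = [xi + rng.uniform(-eps, eps) for xi in x]
        if predict(x_pert) == label:
            correct += 1
    return correct / n_samples


def toy_classifier(x):
    # Hypothetical 1-D threshold model, used only for illustration.
    return 1 if x[0] > 0.0 else 0


pr_far = estimate_pr(toy_classifier, [0.5], label=1)    # far from the decision boundary
pr_near = estimate_pr(toy_classifier, [0.05], label=1)  # close to the decision boundary
```

A worst-case view would call the second input non-robust because at least one adversarial example exists in its neighbourhood, whereas the probabilistic view reports that roughly three quarters of the sampled perturbations are still classified correctly.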

Visual Language Models (VLMs) have achieved remarkable success in many domains due to their ability to perform step-by-step reasoning. However, progress in the telecommunication (Telecom) domain remains limited, primarily due to the lack of high-quality datasets and domain-specific insights. In this paper, we introduce RMultiplex200K, a multimodal dataset designed to present step-wise reasoning rationales and correctness scores for real-world Telecom questions. This enables VLMs to engage in step-level reasoning and verification using multimodal information, thereby facilitating reliable problem-solving. RMultiplex200K is highly scalable as it is constructed without human annotations, relying instead on our automatic plan-based annotation (ApPA) method, which automatically synthesizes reasoning steps labeled with reward scores. With this dataset, we introduce TC-NAVIGATOR, a new mechanism for training multimodal process reward models to serve as reliable reasoning verifiers for VLMs. For instance, the Qwen-2-VL-72B and Llama-3.2-90B models, which initially achieve only 21.3% and 19.8% on practice Telecom questions, reach 48.5% and 46.1% accuracy, respectively, after training with RMultiplex200K and verifying with TC-NAVIGATOR.

Low-quality or scarce data has posed significant challenges for training deep neural networks in practice. While classical data augmentation cannot contribute very different new data, diffusion models open up a new door to build self-evolving AI by generating high-quality and diverse synthetic data through text-guided prompts. However, text-only guidance cannot control synthetic images' proximity to the original images, resulting in out-of-distribution data detrimental to the model performance. To overcome the limitation, we study image guidance to achieve a spectrum of interpolations between synthetic and real images. With stronger image guidance, the generated images are similar to the training data but hard to learn. With weaker image guidance, the synthetic images are easier for the model to learn but leave a larger distribution gap with the original data. The generated full spectrum of data enables us to build a novel "Diffusion CurricuLum (DisCL)". DisCL adjusts the image guidance level of image synthesis for each training stage: it identifies and focuses on hard samples for the model and assesses the most effective guidance level of synthetic images to improve hard data learning. We apply DisCL to two challenging tasks: long-tail (LT) classification and learning from low-quality data. It focuses on high-quality lower-guidance images to learn prototypical features as a warm-up for learning higher-guidance images that might be weak on diversity or quality. Extensive experiments showcase a gain of 2.7% and 2.1% in OOD and ID macro-accuracy when applying DisCL to the iWildCam dataset. On ImageNet-LT, DisCL improves the base model's tail-class accuracy from 4.4% to 23.64% and leads to a 4.02% improvement in all-class accuracy.


#152
Backdoor Defense via Enhanced Splitting and Trap Isolation

Hongrui Yu · Lu Qi · Wanyu Lin · Jian Chen · Hailong Sun · chengbin sun

Backdoor attacks pose a significant threat to deep neural networks (DNNs), as attackers can inject a backdoor by tampering with only a few samples. The variety of backdoor attacks makes comprehensive defense extremely challenging. Previous defenses typically assume that backdoor samples are out-of-distribution (OOD) data of benign samples. However, backdoor samples can also be in-distribution (ID) data of benign samples and hard to identify as outliers, potentially causing defenses to fail. To address this issue, we propose a two-stage backdoor defense based on Enhanced Splitting and Trap Isolation (ESTI), leveraging attackers' tampering to defend against their attacks. In the first stage, we introduce backdoored models in conjunction with a benign model to split the dataset into a reliable clean subset and a poisoned subset. In the second stage, we introduce a trap mechanism to isolate the poisoned subset into a trap class to train a trap-model. The trap-model can flip the predictions of poisoned samples from the attacker's target class to the trap class. Through extensive experiments on three benchmark datasets and five model architectures, we demonstrate that ESTI effectively defends against various backdoor attacks while maintaining model performance on benign data, proving the superiority of our approach. Our code is available in the supplementary material.


#153
GT-Loc: Unifying When and Where in Images through a Joint Embedding Space

David G. Shatwell · Ishan Rajendrakumar Dave · Swetha Sirnam · Mubarak Shah

Timestamp prediction aims to determine when an image was captured using only visual information, supporting applications such as metadata correction, retrieval, and digital forensics. In outdoor scenarios, hourly estimates rely on cues like brightness, hue, and shadow positioning, while seasonal changes and weather inform date estimation. However, these visual cues significantly depend on geographic context, closely linking timestamp prediction to geo-localization. To address this interdependence, we introduce GT-Loc, a novel retrieval-based method that jointly predicts the capture time (hour and month) and geo-location (GPS coordinates) of an image. Our approach employs separate encoders for images, time, and location, aligning their embeddings within a shared high-dimensional feature space. Recognizing the cyclical nature of time, we utilize Random Fourier Features for effective temporal representation. Instead of conventional contrastive learning with hard positives and negatives, we propose a metric-learning objective providing soft targets by modeling temporal differences over a cyclical toroidal surface. We present new benchmarks demonstrating that our joint optimization surpasses methods focused solely on time prediction and even those utilizing geo-location during inference. Additionally, our approach achieves competitive results on standard geo-localization tasks, while the unified embedding space facilitates compositional and text-based image retrieval.
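
The idea of measuring temporal differences on a cyclical toroidal surface can be sketched as below. The abstract does not give the exact formula, so the normalization, the Euclidean combination of the two cyclic offsets, and the exponential `soft_target` kernel with its `temperature` are assumptions chosen for illustration.

```python
import math


def cyclic_diff(a, b, period):
    """Shortest distance between a and b on a circle of the given period."""
    d = abs(a - b) % period
    return min(d, period - d)


def torus_distance(t1, t2):
    """Distance between (hour, month) pairs; each cyclic offset is
    normalized by its period, so each component lies in [0, 0.5]."""
    dh = cyclic_diff(t1[0], t2[0], 24.0) / 24.0
    dm = cyclic_diff(t1[1], t2[1], 12.0) / 12.0
    return math.hypot(dh, dm)


def soft_target(t1, t2, temperature=0.1):
    # Hypothetical kernel: closer times get a target nearer to 1.
    return math.exp(-torus_distance(t1, t2) / temperature)


# 23:00 in December is close to 01:00 in January once both axes wrap around.
near = torus_distance((23, 12), (1, 1))
far = torus_distance((23, 12), (12, 6))
```

Treating hour and month as two wrapped axes means a late-December night and an early-January night are neighbours, which a plain numeric difference over hours and months would miss.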

We present TrajectoryCrafter, a novel approach to redirect camera trajectories for monocular videos. By disentangling deterministic view transformations from stochastic content generation, our method achieves precise control over user-specified camera trajectories. We propose a novel dual-stream conditional video diffusion model that concurrently integrates point cloud renders and source videos as conditions, ensuring accurate view transformations and coherent 4D content generation. Instead of leveraging scarce multi-view videos, we curate a hybrid training dataset that combines web-scale monocular videos with static multi-view datasets via our innovative double-reprojection strategy, which significantly fosters robust generalization across diverse scenes. Extensive evaluations on multi-view and large-scale monocular videos demonstrate the superior performance of our method. Code and pre-trained model will be released.


#155
Benchmarking Multimodal CoT Reward Model Stepwise by Visual Program

Minghe Gao · Xuqi Liu · Zhongqi Yue · Yang Wu · Shuang Chen · Juncheng Li · Siliang Tang · Fei Wu · Tat-Seng Chua · Yueting Zhuang

Recent advancements in reward signal usage for Large Language Models (LLMs) are remarkable. However, significant challenges exist when transitioning reward signal to the multimodal domain, including labor-intensive annotations, over-reliance on one-step rewards, and inadequate evaluation. To address these issues, we propose SVIP, a novel approach to train a step-level multi-dimensional Chain-of-Thought (CoT) reward model automatically. It generates code for solving visual tasks and transforms the analysis of code blocks into the evaluation of CoT step as training samples. Then, we train SVIP-Reward model using a multi-head attention mechanism called TriAtt-CoT. The advantages of SVIP-Reward are evident throughout the entire process of MLLM. We also introduce a benchmark for CoT reward model training and testing. Experimental results demonstrate that SVIP-Reward improves MLLM performance across training and inference-time scaling, yielding better results on benchmarks while reducing hallucinations and enhancing reasoning ability.


#156
AIRA: Activation-Informed Low-Rank Adaptation for Large Models

Lujun Li · Dezhi Li · Cheng Lin · Wei Li · Wei Xue · Sirui Han · Yike Guo

Low-Rank Adaptation (LoRA) is a widely used method for efficiently fine-tuning large models by introducing low-rank matrices into weight updates. However, existing LoRA techniques fail to account for activation information, such as outliers, which significantly impact model performance. This omission leads to suboptimal adaptation and slower convergence. To address this limitation, we present Activation-Informed Low-Rank Adaptation (AIRA), a novel approach that integrates activation information into initialization, training, and rank assignment to enhance model performance. Specifically, AIRA introduces: (1) Outlier-weighted SVD decomposition to reduce approximation errors in low-rank weight initialization, (2) Outlier-driven dynamic rank assignment using offline optimization for better layer-wise adaptation, and (3) Activation-informed training to amplify updates on significant weights. This cascaded activation-informed paradigm enables faster convergence and fewer fine-tuned parameters while maintaining high performance. Extensive experiments on multiple large models demonstrate that AIRA outperforms state-of-the-art LoRA variants, achieving superior performance-efficiency trade-offs in vision-language instruction tuning, few-shot learning, and image generation. Codes are available in Appendix.


#157
Social Debiasing for Fair Multi-modal LLMs

Harry Cheng · Yangyang Guo · Qingpei Guo · Ming Yang · Tian Gan · Weili Guan · Liqiang Nie

Multi-modal Large Language Models (MLLMs) have dramatically advanced the research field recently and delivered powerful vision-language understanding capabilities. However, these models often inherit deep-rooted social biases from their training data, leading to uncomfortable responses with respect to attributes such as race and gender. This paper addresses the issue of social biases in MLLMs by i) introducing a comprehensive Counterfactual dataset with multiple social concepts (CMSC), which complements existing datasets by providing 18 diverse and balanced social concepts; and ii) proposing a Counter-Stereotype Debiasing (CSD) strategy that mitigates social biases in MLLMs by leveraging the opposites of prevalent stereotypes. CSD incorporates both a novel bias-aware data sampling method and a loss rescaling method, thereby enabling the model to more effectively reduce biases. We conduct extensive experiments with four prevalent MLLM architectures. The results demonstrate the advantage of the CMSC dataset and the edge of CSD strategy in reducing social biases compared to existing competing methods, without compromising the overall performance on general multi-modal reasoning benchmarks.


#158
Equipping Vision Foundation Model with Mixture of Experts for Out-of-Distribution Detection

Shizhen Zhao · Jiahui Liu · Xin Wen · Haoru Tan · Xiaojuan Qi

Pre-trained vision foundation models have transformed many computer vision tasks. Despite their strong ability to learn discriminative and generalizable features crucial for out-of-distribution (OOD) detection, their impact on this task remains underexplored. Motivated by this gap, we systematically investigate representative vision foundation models for OOD detection. Our findings reveal that a pre-trained DINOv2 model, even without fine-tuning on in-domain (ID) data, naturally provides a highly discriminative feature space for OOD detection, achieving performance comparable to existing state-of-the-art methods without requiring complex designs. Beyond this, we explore how fine-tuning foundation models on in-domain (ID) data can enhance OOD detection. However, we observe that the performance of vision foundation models remains unsatisfactory in scenarios with a large semantic space. This is due to the increased complexity of decision boundaries as the number of categories grows, which complicates the optimization process. To mitigate this, we propose the Mixture of Feature Experts (MoFE) module, which partitions features into subspaces, effectively capturing complex data distributions and refining decision boundaries. Further, we introduce a Dynamic-$\beta$ Mixup strategy, which samples interpolation weights from a dynamic beta distribution. This adapts to varying levels of learning difficulty across categories, improving feature learning for more challenging categories. Extensive experiments demonstrate the effectiveness of our approach, significantly outperforming baseline methods. The code will be made publicly available.
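For readers unfamiliar with Mixup, the following is a minimal sketch of drawing the interpolation weight from a Beta distribution whose parameter is supplied per call. How the parameter is adapted to category difficulty is not described in the abstract, so `beta_param` is simply passed in by the caller here; the function name and list-based tensors are illustrative assumptions.

```python
import random


def mixup(x_a, x_b, y_a, y_b, beta_param, rng):
    """Interpolate two samples and their (one-hot) labels.

    lam is drawn from Beta(beta_param, beta_param); a "dynamic" variant would
    change beta_param over training or per category (rule not shown here).
    """
    lam = rng.betavariate(beta_param, beta_param)
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x, y, lam


rng = random.Random(0)
mixed_x, mixed_y, lam = mixup([1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0], 0.5, rng)
```

Small values of `beta_param` push `lam` towards 0 or 1 (mild mixing), while large values concentrate it near 0.5 (strong mixing).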


#159
LIRA: Reasoning Reconstruction via Multimodal Large Language Models

Zhen Zhou · Tong Wang · Yunkai Ma · Xiao Tan · Fengshui Jing

Existing language instruction-guided online 3D reconstruction systems mainly rely on explicit instructions or queryable maps, showing inadequate capability to handle implicit and complex instructions. In this paper, we first introduce a reasoning reconstruction task. This task inputs an implicit instruction involving complex reasoning and an RGB-D sequence, and outputs incremental 3D reconstruction of instances that conform to the instruction. To handle this task, we propose LIRA: Language Instructed Reconstruction Assistant. It leverages a multimodal large language model to actively reason about the implicit instruction and obtain instruction-relevant 2D candidate instances and their attributes. Then, candidate instances are back-projected into the incrementally reconstructed 3D geometric map, followed by instance fusion and target instance inference. In LIRA, to achieve higher instance fusion quality, we propose TIFF, a Text-enhanced Instance Fusion module operating within Fragment bounding volume, which is learning-based and fuses multiple keyframes simultaneously. Since the evaluation system for this task is not well established, we propose a benchmark ReasonRecon comprising the largest collection of scene-instruction data samples involving implicit reasoning. Experiments demonstrate that LIRA outperforms existing methods in the reasoning reconstruction task and is capable of running in real time. Code and benchmark will be publicly available.

Federated learning (FL) enables collaborative model training across distributed clients without centralizing data. However, existing approaches like Federated Averaging ($\texttt{FedAvg}$) often perform poorly with heterogeneous data distributions, failing to achieve personalization due to their inability to capture class-specific information effectively. To overcome $\texttt{FedAvg}$'s personalization limitations, we propose Class-wise Federated Averaging ($\texttt{cwFedAvg}$), a novel personalized FL (PFL) framework that performs Federated Averaging for each class. $\texttt{cwFedAvg}$ creates class-specific global models via weighted aggregation of local models using class distributions, then combines them to generate personalized local models. To facilitate effective class-wise aggregation, we further propose Weight Distribution Regularizer ($\texttt{WDR}$), which encourages deep networks to encode class-specific information efficiently by aligning empirical and approximated class distributions derived from output layer weights. Our experiments demonstrate $\texttt{cwFedAvg}$'s superior performance over existing PFL methods through efficient personalization while maintaining $\texttt{FedAvg}$'s communication cost and avoiding additional local training and pairwise computations.
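One plausible reading of the class-wise aggregation step is sketched below: each class gets its own global model, averaged over clients in proportion to how many samples of that class they hold, and each client then mixes the class-specific models according to its own class distribution. The flat-list model representation, the use of raw class counts as weights, and the function name are assumptions; the abstract does not give the exact formulas, and the regularizer is not shown.

```python
def class_wise_fedavg(local_models, class_counts):
    """local_models[k] is client k's parameter vector (list of floats).
    class_counts[k][c] is the number of class-c samples on client k.
    Returns (class-specific global models, personalised client models).
    """
    num_clients = len(local_models)
    num_classes = len(class_counts[0])
    dim = len(local_models[0])

    # One global model per class, weighted by each client's share of that class.
    global_models = []
    for c in range(num_classes):
        total = sum(class_counts[k][c] for k in range(num_clients))
        model = [0.0] * dim
        for k in range(num_clients):
            weight = class_counts[k][c] / total if total else 0.0
            for i in range(dim):
                model[i] += weight * local_models[k][i]
        global_models.append(model)

    # Each client mixes the class models by its own class distribution
    # (every client is assumed to hold at least one sample).
    personalised = []
    for k in range(num_clients):
        n_k = sum(class_counts[k])
        model = [0.0] * dim
        for c in range(num_classes):
            share = class_counts[k][c] / n_k
            for i in range(dim):
                model[i] += share * global_models[c][i]
        personalised.append(model)
    return global_models, personalised
```

When every client holds a disjoint set of classes, each personalised model reduces to that client's own local model; with identical class distributions, all clients receive the same average.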

Modern retrieval systems often struggle with upgrading to new and more powerful models due to the incompatibility of embeddings between the old and new models. This necessitates a costly process known as backfilling, which involves re-computing the embeddings for a large number of data samples. In vision, Backward-compatible Training (BT) has been proposed to ensure that the new model aligns with the old model's embeddings. This paper extends the concept of vision-only BT to the field of cross-modal retrieval, marking the first attempt to address Cross-modal BT (XBT). Our goal is to achieve backward-compatibility between Vision-Language Pretraining (VLP) models, such as CLIP, for the cross-modal retrieval task. To address XBT challenges, we propose an efficient solution: a projection module that maps the new model's embeddings to those of the old model. This module, pretrained solely with text data, significantly reduces the number of image-text pairs required for XBT learning, and, once it is pretrained, it avoids using the old model during training. Furthermore, we utilize parameter-efficient training strategies that improve efficiency and preserve the off-the-shelf new model's knowledge by avoiding any modifications. Experimental results on cross-modal retrieval datasets demonstrate the effectiveness of XBT and its potential to enable backfill-free upgrades when a new VLP model emerges.


#162
Highlight
Region-based Cluster Discrimination for Visual Representation Learning

Yin Xie · Kaicheng Yang · Xiang An · Kun Wu · Yongle Zhao · Weimo Deng · Zimin Ran · Yumeng Wang · Ziyong Feng · Roy Miles · Ismail Elezi · Jiankang Deng

The vision towers of Multimodal Language Models (MLLM) have significantly enhanced the performance of large multimodal models. This success is primarily attributed to extensive language alignment training, which enhances human-like understanding. However, these models predominantly rely on global category representations, limiting their performance in tasks that require localized representations, such as grounding, OCR, and segmentation. To address this limitation, we propose a novel Locality-Aware Cluster Contrastive Learning strategy. Our approach leverages local feature clustering and contrastive learning to improve the model's ability to understand and represent localized information. Furthermore, our method can be easily scaled to billion-level training, ensuring its applicability to large-scale datasets and models. We demonstrate the effectiveness of our method by achieving state-of-the-art results on the Visual Question Answering (VQA) and RefCOCO benchmarks, showcasing its superior capabilities in handling tasks that require fine-grained visual understanding. Our results indicate a significant improvement in performance, validating the potential of our approach in advancing MLLM tasks. It outperforms the widely used SigLIP.

Traditional Remote Sensing Foundation models (RSFMs) are pre-trained with a data-centralized paradigm, through self-supervision on large-scale curated remote sensing data. For each institution, however, pre-training RSFMs with limited data in a standalone manner may lead to suboptimal performance, while aggregating remote sensing data from multiple institutions for centralized pre-training raises privacy concerns. Seeking for collaboration is a promising solution to resolve this dilemma, where multiple institutions can collaboratively train RSFMs without sharing private data. In this paper, we propose a novel privacy-preserved pre-training framework (FedSense), which enables multiple institutions to collaboratively train RSFMs without sharing private data. However, it is a non-trivial task hindered by a vicious cycle, which results from model drift by remote sensing data heterogeneity and high communication overhead. To break this vicious cycle, we introduce Federated Mutual-guidance Learning. Specifically, we propose a Server-to-Clients Guidance (SCG) mechanism to guide clients updates towards global-flatness optimal solutions. Additionally, we propose a Clients-to-Server Guidance (CSG) mechanism to inject local knowledge into the server by low-bit communication. Extensive experiments on four downstream tasks demonstrate the effectiveness of our FedSense in both full-precision and communication-reduced scenarios, showcasing remarkable communication efficiency and performance gains.


#164
EFTViT: Efficient Federated Training of Vision Transformers with Masked Images on Resource-Constrained Clients

meihan wu · Tao Chang · Cui Miao · Jie Zhou · Chun Li · Xiangyu Xu · Ming Li · Xiaodong Wang

Federated learning research has recently shifted from Convolutional Neural Networks (CNNs) to Vision Transformers (ViTs) due to their superior capacity. ViT training demands higher computational resources due to the lack of 2D inductive biases inherent in CNNs. However, efficient federated training of ViTs on resource-constrained clients remains unexplored in the community. In this paper, we propose EFTViT, a hierarchical federated framework that leverages masked images to enable efficient, full-parameter training on resource-constrained clients, offering substantial benefits for learning on heterogeneous data. In general, we patchify images and randomly mask a portion of the patches, observing that excluding them from training has minimal impact on performance while substantially reducing computation costs and enhancing data content privacy protection. Specifically, EFTViT comprises a series of lightweight local modules and a larger global module, updated independently on clients and the central server, respectively. The local modules are trained on unmasked image patches, while the global module is trained on intermediate patch features uploaded from the local client, balanced through a proposed median sampling strategy to erase client data distribution privacy. We analyze the computational complexity and privacy protection of EFTViT. Extensive experiments on popular benchmarks show that EFTViT reduces local training computational cost by up to $5.6\times$, cuts local training time by up to $3.1\times$, and achieves up to 2.46\% accuracy improvement compared to existing methods.
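The patchify-and-mask step can be sketched as follows. This is a toy version on a 2D list rather than an image tensor, and the patch size, mask ratio, and function names are illustrative assumptions, not the settings used in the paper.

```python
import random


def patchify(image, patch):
    """Split an H x W image (list of rows) into non-overlapping patch x patch blocks."""
    height, width = len(image), len(image[0])
    patches = []
    for r in range(0, height, patch):
        for c in range(0, width, patch):
            patches.append([row[c:c + patch] for row in image[r:r + patch]])
    return patches


def keep_unmasked(patches, mask_ratio, rng):
    """Randomly drop a fraction of the patches and return the kept indices and patches."""
    n_keep = len(patches) - int(len(patches) * mask_ratio)
    kept = sorted(rng.sample(range(len(patches)), n_keep))
    return kept, [patches[i] for i in kept]


image = [[r * 4 + c for c in range(4)] for r in range(4)]
patches = patchify(image, 2)
kept_idx, visible = keep_unmasked(patches, 0.5, random.Random(0))
```

Only the `visible` patches would be fed to the local modules, so the compute per image shrinks roughly in proportion to the mask ratio.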

Zero-shot human-object interaction (HOI) detection remains a challenging task, particularly in generalizing to unseen actions. Existing methods address this challenge by tapping Vision-Language Models (VLMs) to access knowledge beyond the training data. However, they either struggle to distinguish actions involving the same object or demonstrate limited generalization to unseen classes. In this paper, we introduce LoRD-HOI (Low-Rank Decomposed VLM Feature Adaptation for Zero-Shot HOI Detection), a novel approach that both enhances generalization to unseen classes and improves action distinction. In training, LoRD-HOI decomposes VLM text features for given HOI classes via low-rank factorization, producing class-shared basis features and adaptable weights. These features and weights form a compact HOI representation that preserves shared information across classes, enhancing generalization to unseen classes. Subsequently, we refine action distinction by adapting weights for each HOI class and introducing human-object tokens to enrich visual interaction representations. To further distinguish unseen actions, we guide the weight adaptation with LLM-derived action regularization. Experimental results show that our method sets a new state-of-the-art across zero-shot HOI settings on HICO-DET, achieving an unseen-class mAP of 27.91 in the unseen-verb setting.


#166
Diagnosing Pretrained Models for Out-of-distribution Detection

Haipeng Xiong · Kai Xu · Angela Yao

This work questions a common assumption of OOD detection, that models with higher in-distribution (ID) accuracy tend to have better OOD performance. Recent findings show this assumption doesn’t always hold. A direct observation is that the later version of torchvision models improves ID accuracy but suffers from a significant drop in OOD performance. We systematically diagnose torchvision training recipes and explain this effect by analyzing the maximal logits of ID and OOD samples. We then propose post-hoc and training-time solutions to mitigate the OOD decrease by fixing problematic augmentations in torchvision recipes. Both solutions enhance OOD detection and maintain strong ID performance. Code will be released upon acceptance.
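For context, a maximum-logit OOD score can be computed as in the sketch below. This only illustrates the quantity being analyzed; the threshold value and function names are hypothetical, and the paper's proposed fixes are not shown.

```python
def max_logit_score(logits):
    """Return the largest pre-softmax logit; higher values suggest an ID sample."""
    return max(logits)


def is_ood(logits, threshold):
    """Flag a sample as OOD when its maximum logit falls below the threshold.

    The threshold is a hypothetical value that would be tuned on held-out ID data.
    """
    return max_logit_score(logits) < threshold
```

If a training recipe compresses the maximum logits of ID samples towards those of OOD samples, a threshold like this separates the two groups less cleanly even when ID accuracy improves.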


#167
ODP-Bench: Benchmarking Out-of-Distribution Performance Prediction

Han Yu · Kehan Li · Dongbai Li · Yue He · Xingxuan Zhang · Peng Cui

Recently, there has been gradually more attention paid to Out-of-Distribution (OOD) performance prediction, whose goal is to predict the performance of trained models on unlabeled OOD test datasets, so that we could better leverage and deploy off-the-shelf trained models in risk-sensitive scenarios. Although progress has been made in this area, evaluation protocols of previous literature are not consistent, and most works cover only a limited number of real-world OOD datasets and types of distribution shifts. To provide convenient and fair comparisons for various algorithms, we propose Out-of-Distribution Performance Prediction Benchmark (ODP-Bench), a comprehensive benchmark that includes most commonly used OOD datasets and existing practical performance prediction algorithms. We will provide our trained models as a testbench for future researchers, thus guaranteeing the consistency of comparison and avoiding the burden of repeating the model training process. Furthermore, we also conduct in-depth experimental analyses to better understand their capability boundary.


#168
R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization

Jingyi Zhang · Jiaxing Huang · Huanjin Yao · Shunyu Liu · Xikun ZHANG · Shijian Lu · Dacheng Tao

Recent studies generally enhance MLLMs' reasoning capabilities via supervised fine-tuning on high-quality chain-of-thought reasoning data, which often leads models to merely imitate successful reasoning paths without understanding what the wrong reasoning paths are. In this work, we aim to enhance the MLLMs’ reasoning ability beyond passively imitating positive reasoning paths. To this end, we design Step-wise Group Relative Policy Optimization (StepGRPO), a new online reinforcement learning framework that enables MLLMs to self-improve reasoning ability via simple, effective and dense step-wise rewarding. Specifically, StepGRPO introduces two novel rule-based reasoning rewards: Step-wise Reasoning Accuracy Reward (StepRAR) and Step-wise Reasoning Validity Reward (StepRVR). StepRAR rewards the reasoning paths that contain necessary intermediate reasoning steps via a soft key-step matching technique, while StepRVR rewards reasoning paths that follow a well-structured and logically consistent reasoning process through a reasoning completeness and logic evaluation strategy. With the proposed step-wise reward mechanisms, StepGRPO effectively mitigates the sparse reward issue for MLLMs and encourages a more structured and logically consistent reasoning process. Extensive experiments over 8 benchmarks demonstrate the superiority of the proposed StepGRPO.
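The group-relative part of the method can be illustrated with the standard normalisation used in GRPO-style training: several reasoning paths are sampled for the same question, and each path's reward is standardised against the group. Summing the two step-wise rewards, using the population standard deviation, and the function names are assumptions made for this sketch.

```python
import statistics


def combine_rewards(accuracy_rewards, validity_rewards):
    """Add the two step-wise rewards for each sampled reasoning path (assumed rule)."""
    return [a + v for a, v in zip(accuracy_rewards, validity_rewards)]


def group_relative_advantages(rewards, eps=1e-8):
    """Standardise rewards within one group of sampled reasoning paths."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


rewards = combine_rewards([1.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 0.0])
advantages = group_relative_advantages(rewards)
```

Paths that score above the group mean receive positive advantages and are reinforced, while below-average paths are discouraged, without training a separate value network.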


#169
Supervised Exploratory Learning for Long-Tailed Visual Recognition

Zhongquan Jian · Yanhao Chen · Wangyancheng Wangyancheng · Junfeng Yao · Meihong Wang · Qingqiang Wu

Long-tailed data poses a significant challenge for deep learning models, which tend to prioritize accurate classification of head classes while largely neglecting tail classes. Existing techniques, such as class re-balancing, logit adjustment, and data augmentation, aim to enlarge decision regions of tail classes or achieve clear decision boundaries, leaving the robustness of decision regions under-considered. This paper proposes a simple yet effective Supervised Exploratory Learning (SEL) framework to achieve these goals simultaneously from space exploration perspectives. SEL employs the adaptive Optimal Foraging Algorithm (OFA) to generate diverse exploratory examples, integrating Class-biased Complement (CbC) for balanced class distribution and Fitness-weighted Sampling (FwS) for space exploration. Both theoretical analysis and empirical results demonstrate that SEL enhances class balance, sharpens decision boundaries, and strengthens decision regions. SEL is a plug-and-play training framework that can be seamlessly integrated into model training or classifier adjustment stages, making it highly adaptable and compatible with existing methods and facilitating further performance improvements. Extensive experiments on various long-tailed benchmarks demonstrate SEL's superiority.


#170
Synergistic Prompting for Robust Visual Recognition with Missing Modalities

Zhihui Zhang · Luanyuan Dai · Qika Lin · Yunfeng Diao · Guangyin Jin · Yufei Guo · Jing Zhang · Xiaoshuai Hao

Large-scale multi-modal models have demonstrated remarkable performance across various visual recognition tasks by leveraging extensive paired multi-modal training data. However, in real-world applications, the presence of missing or incomplete modality inputs often leads to significant performance degradation. Recent research has focused on prompt-based strategies to tackle this issue; however, existing methods are hindered by two major limitations: (1) static prompts lack the flexibility to adapt to varying missing-data conditions, and (2) basic prompt-tuning methods struggle to ensure reliable performance when critical modalities are missing. To address these challenges, we propose a novel Synergistic Prompting (SyP) framework for robust visual recognition with missing modalities. The proposed SyP introduces two key innovations: (I) a Dynamic Adapter, which computes adaptive scaling factors to dynamically generate prompts, replacing static parameters for flexible multi-modal adaptation, and (II) a Synergistic Prompting Strategy, which combines static and dynamic prompts to balance information across modalities, ensuring robust reasoning even when key modalities are missing. The proposed SyP achieves significant performance improvements over existing approaches across three widely-used visual recognition datasets, demonstrating robustness under diverse missing rates and conditions. Extensive experiments and ablation studies validate its effectiveness in handling missing modalities, highlighting its superior adaptability and reliability. The source code will be released.


#171
Hierarchical Cross-modal Prompt Learning for Vision-Language Models

Hao Zheng · Shunzhi Yang · Zhuoxin He · Jinfeng Yang · Zhenhua Huang

Pre-trained Vision-Language Models (VLMs) such as CLIP have shown excellent generalization abilities. However, adapting these large-scale models to downstream tasks while preserving their generalization capabilities remains challenging. Although prompt learning methods have shown promise, they suffer from two fundamental bottlenecks that limit generalization: (a) modality isolation, and (b) hierarchical semantic decay. To address these limitations, we propose HiCroPL, a Hierarchical Cross-modal Prompt Learning framework that establishes bidirectional knowledge flow between text and vision modalities, enabling them to refine their semantics mutually. HiCroPL routes knowledge flows by leveraging the complementary strengths of text and vision. In early layers, text prompts inject relatively clear semantics into visual prompts through a hierarchical knowledge mapper, enhancing the representation of low-level visual semantics. In later layers, visual prompts encoding specific task-relevant objects flow back to refine text prompts, enabling deeper alignment. Crucially, our hierarchical knowledge mapper allows representations at multi-scales to be fused, ensuring that deeper representations retain transferable shallow semantics thereby enhancing generalization. We further introduce a lightweight layer-specific knowledge proxy to enable efficient cross-modal interactions. Extensive evaluations across four tasks demonstrate HiCroPL's superior performance, achieving state-of-the-art results on 11 benchmarks with significant improvements. Our code will be made publicly available.

CLIP is a foundational model with transferable classification performance in the few-shot setting. Several methods have shown improved performance of CLIP using few-shot examples. However, so far all these techniques have been benchmarked using standard few-shot datasets. We argue that this mode of evaluation does not provide a true indication of the inductive generalization ability using few-shot examples. As most datasets have been seen by the CLIP model, the resultant setting can be termed as partially transductive. To solve this, we propose a pipeline that uses an unlearning technique to obtain true inductive baselines. In this new inductive setting, methods show a significant drop in performance (-55% on average among 13 baselines with multiple datasets). We validate the unlearning technique using oracle baselines. An improved few-shot classification technique is proposed that consistently obtains state-of-the-art performance over 13 other recent baseline methods on a comprehensive analysis with 5880 experiments - varying the datasets, differing number of few-shot examples, unlearning setting, and with different seeds. Thus, we identify the issue with the evaluation of CLIP-based few-shot classification, provide a solution using unlearning, propose new benchmarks, and provide an improved method. All the models, code and baselines will be released on acceptance of the work.

Detectors often suffer from performance drops due to the domain gap between training and testing data. Recent methods explore diffusion models applied to domain generalization (DG) and adaptation (DA) tasks, but still struggle with large inference costs and have not yet fully leveraged the capabilities of diffusion models. We propose to tackle these problems by extracting intermediate features from a single-step diffusion process, improving feature collection and fusion to reduce inference time by 75% while enhancing performance on source domains (i.e., Fitness). Then, we construct an object-centered auxiliary branch by applying box-masked images with class prompts to extract robust and domain-invariant features that focus on objects. We also apply consistency loss to align the auxiliary and ordinary branch, balancing fitness and generalization while preventing overfitting and improving performance on target domains (i.e., Generalization). Furthermore, within a unified framework, standard detectors are guided by diffusion detectors through feature-level and object-level alignment on source domains (for DG) and unlabeled target domains (for DA), thereby improving cross-domain detection performance (i.e., Transferability). Our method achieves competitive results on 3 DA benchmarks and 5 DG benchmarks. Additionally, experiments on COCO generalization benchmark demonstrate that our method maintains significant advantages and shows remarkable efficiency in large domain shifts and low-data scenarios. Our work shows the superiority of applying diffusion models to domain generalized and adaptive detection tasks and offers valuable insights for visual perception tasks across diverse domains.


#174
Entropy-Adaptive Diffusion Policy Optimization with Dynamic Step Alignment

Renye Yan · Jikang Cheng · Yaozhong Gan · Shikun Sun · You Wu · Yunfan Yang · Ling Liang · JinLong Lin · Yeshuang Zhu · Jie Zhou · Jinchao Zhang · Junliang Xing · Yimao Cai · Ru Huang

While fine-tuning diffusion models with reinforcement learning (RL) has demonstrated effectiveness in directly optimizing downstream objectives, existing RL frameworks are prone to overfitting the rewards, leading to outputs that deviate from the true data distribution and exhibit reduced diversity. To address this issue, we introduce entropy as a quantitative measure to enhance the exploratory capacity of diffusion models' denoising policies. We propose an adaptive mechanism that dynamically adjusts the application and magnitude of entropy and regularization, guided by real-time quality estimation of intermediate noised states. Theoretically, we prove the convergence of our entropy-enhanced policy optimization and establish two critical properties: 1) global entropy increases through training, ensuring robust exploration capabilities, and 2) entropy systematically decreases during the denoising process, enabling a phase transition from early-stage diversity promotion to late-stage distributional fidelity. Building on this foundation, we propose a plug-and-play RL module that adaptively controls entropy and optimizes denoising steps. Extensive evaluations demonstrate our method's theoretical soundness and empirical robustness, achieving state-of-the-art quality-diversity trade-offs across benchmarks. Notably, our framework significantly improves the rewards and reduces denoising steps in training by up to 40\%. The code is available in the supplementary.

Vision-Language Models (VLMs) like CLIP have shown remarkable zero-shot performance by aligning different modalities in the embedding space, enabling diverse applications from image editing to visual question answering (VQA). However, these models often inherit biases from their training data, resulting in performance disparities across specific subpopulations. Traditional debiasing methods for VLMs primarily focus on specific downstream tasks using labeled datasets, which we argue is insufficient given the broad applicability of VLMs. Specifically, these methods struggle with generalizability, transferability, and feasibility due to overfitting, limited task applicability, and regulatory constraints on the use of sensitive data, making them less practical in real-world scenarios. To address these challenges, we propose a novel task-agnostic method for learning debiased image embeddings in VLMs. Our approach does not require expensive annotated datasets or curated prompts for downstream tasks, while still preserving the inherent zero-shot capabilities of these models. Instead, we leverage easily accessible information: 1) a bias text corpus generated by a large language model, and 2) a generic unsupervised vision dataset. Our method disentangles the image embedding into bias and neutral components by applying centered kernel alignment (CKA) regularization to the text-vision representational similarity, using the bias text corpus over the generic vision dataset. Experimental results validate the effectiveness of our approach across multiple tasks, offering a practical and versatile solution to debiasing VLMs.
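As background, linear centered kernel alignment (CKA) between two sets of representations of the same examples can be computed as sketched below. This shows only the similarity measure, using plain lists and a linear kernel as simplifying assumptions; how it is applied as a regularizer to separate bias and neutral components is not specified in the abstract and is not shown.

```python
import math


def center_columns(m):
    """Subtract the per-feature mean from every row."""
    n = len(m)
    means = [sum(row[j] for row in m) / n for j in range(len(m[0]))]
    return [[v - mu for v, mu in zip(row, means)] for row in m]


def gram(m):
    """Linear-kernel Gram matrix between all pairs of rows."""
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in m] for r1 in m]


def frob_inner(a, b):
    """Frobenius inner product of two equally sized matrices."""
    return sum(x * y for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def linear_cka(x, y):
    """Linear CKA between representations x and y (rows are the same examples)."""
    k = gram(center_columns(x))
    l = gram(center_columns(y))
    return frob_inner(k, l) / math.sqrt(frob_inner(k, k) * frob_inner(l, l))
```

The score lies in [0, 1] and is invariant to isotropic scaling, so a representation and a rescaled copy of it have CKA equal to 1.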


#176
Joint Asymmetric Loss for Learning with Noisy Labels

Jialiang Wang · Xianming Liu · Xiong Zhou · Gangfeng Hu · Deming Zhai · Junjun Jiang · Xiangyang Ji

Learning with noisy labels is an important and challenging task for training accurate deep neural networks. To mitigate label noise, prior studies have proposed various robust loss functions, particularly symmetric losses. Nevertheless, symmetric losses usually suffer from underfitting due to the overly strict symmetric condition. To address this problem, the Active Passive Loss (APL) jointly optimizes an active and a passive loss to mutually enhance the overall fitting ability. Within APL, symmetric losses have been successfully extended, yielding advanced robust loss functions. Despite these advancements, emerging theoretical analyses indicate that asymmetric loss functions, a new class of robust loss functions, possess superior properties compared to symmetric losses. However, existing asymmetric losses are not compatible with advanced optimization frameworks such as APL, limiting their practical potential and applicability. Motivated by this theoretical gap and the promising properties of asymmetric losses, we extend the asymmetric loss function to the more complex passive loss scenario and propose the Asymmetric Mean Square Error (AMSE), a novel asymmetric loss function. We rigorously establish the necessary and sufficient condition under which AMSE satisfies the asymmetric condition. By substituting the traditional symmetric passive loss in APL with our proposed AMSE, we introduce a novel robust loss framework termed Joint Asymmetric Loss (JAL). Extensive experiments demonstrate the effectiveness of our method in mitigating label noise.
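For readers unfamiliar with the APL framework that JAL builds on, here is a minimal sketch of one commonly used APL pairing: normalized cross entropy as the active term and absolute error as the passive term. It is not the proposed AMSE or JAL, whose definitions are not given in the abstract; the weights `alpha` and `beta` and the toy logits are assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def nce(probs, label):
    """Normalized cross entropy: CE for the label divided by the sum of CE over all labels."""
    return math.log(probs[label]) / sum(math.log(p) for p in probs)


def absolute_error(probs, label):
    """L1 distance to the one-hot target, which equals 2 * (1 - probs[label])."""
    return sum(abs(p - (1.0 if k == label else 0.0)) for k, p in enumerate(probs))


def apl_loss(logits, label, alpha=1.0, beta=1.0):
    """Active term (NCE) plus passive term (absolute error); alpha and beta are assumed weights."""
    probs = softmax(logits)
    return alpha * nce(probs, label) + beta * absolute_error(probs, label)


logits = [2.0, 0.5, -1.0]  # made-up scores for a 3-class problem
probs = softmax(logits)
```

Both terms satisfy the symmetric condition the abstract refers to: summing either loss over every possible label gives a constant that does not depend on the prediction (1 for NCE, 2(K-1) for the absolute error with K classes).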


#177
FG-OrIU: Towards Better Forgetting via Feature-Gradient Orthogonality for Incremental Unlearning

qian feng · Jiahang Tu · Mintong Kang · Hanbin Zhao · Chao Zhang · Hui Qian

Incremental unlearning (IU) is critical for pre-trained models to comply with sequential data deletion requests, yet existing methods primarily suppress parameters or confuse knowledge without explicit constraints at both the feature and gradient levels, resulting in superficial forgetting where residual information remains recoverable. This incomplete forgetting risks security breaches and disrupts retention balance, especially in IU scenarios. We propose FG-OrIU (Feature-Gradient Orthogonality for Incremental Unlearning), the first framework unifying orthogonal constraints at both the feature and gradient levels to achieve deep forgetting, where the forgetting effect is irreversible. FG-OrIU decomposes feature spaces via Singular Value Decomposition (SVD), separating forgetting and remaining class features into distinct subspaces. It then enforces dual constraints: feature orthogonal projection on both forgetting and remaining classes, while gradient orthogonal projection prevents the reintroduction of forgotten knowledge and disruption to remaining classes during updates. Additionally, dynamic subspace adaptation merges newly forgetting subspaces and contracts remaining subspaces, ensuring a stable balance between removal and retention across sequential unlearning tasks. Extensive experiments demonstrate the effectiveness of our method.
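The gradient orthogonal projection mentioned above can be pictured with a small, dependency-free sketch: build an orthonormal basis for a protected feature subspace, then strip from the gradient any component lying in that subspace. The abstract obtains subspaces via SVD; Gram-Schmidt is swapped in here purely to avoid external libraries, and the choice of which features to protect is an assumption for illustration.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def orthonormal_basis(vectors, tol=1e-10):
    """Gram-Schmidt: return an orthonormal basis for the span of the given vectors."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > tol:  # skip vectors already inside the current span
            basis.append([wi / norm for wi in w])
    return basis


def project_out(grad, basis):
    """Remove from grad its component inside span(basis); basis must be orthonormal."""
    out = list(grad)
    for b in basis:
        c = dot(out, b)
        out = [oi - c * bi for oi, bi in zip(out, b)]
    return out


# Hypothetical protected features spanning the x-y plane of a 3-D space.
protected_features = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
basis = orthonormal_basis(protected_features)
safe_grad = project_out([3.0, 4.0, 5.0], basis)
```

An update along `safe_grad` is orthogonal to every protected feature, so to first order it leaves the model's response along those directions unchanged.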


#178
Language-Driven Multi-Label Zero-Shot Learning with Semantic Granularity

Shouwen Wang · Qian Wan · Junbin Gao · Zhigang Zeng

Recent methods learn class-unified prompt contexts from image data to adapt CLIP to zero-shot multi-label image classification, achieving impressive performance. However, simply tuning prompts is insufficient to deal with novel classes across different semantic granularity levels. This limitation arises from the sparse semantic detail in prompt class names and the hierarchical granularity competition among class names caused by CLIP’s contrastive loss. We propose a language-driven zero-shot multi-label learning framework to bridge associations among classes across multiple granularity levels through class name reconstruction. To achieve this, we first leverage a language model to generate structured text descriptions for each class, which explicitly capture (1) visual attributes, (2) hierarchical relationships, and (3) co-occurrence scenes. With the enriched descriptions, we then learn class names by extracting and aligning semantic relationships and features from them in CLIP’s shared image-text embedding space. Furthermore, we consider that similar text descriptions among different classes may introduce ambiguities. We mitigate these ambiguities by imposing a pair-based loss on learnable class names to enhance their distinctiveness. During inference, we aggregate semantic predictions from multiple image snippets to reinforce the identification of classes across different granularity levels. Comprehensive experiments demonstrate that our method surpasses state-of-the-art methods in multi-label zero-shot learning and effectively handles novel classes across different granularity levels.


#179
ViT-Split: Unleashing the Power of Vision Foundation Models via Efficient Splitting Heads

Yifan Li · Xin Li · Tianqin Li · Wenbin He · Yu Kong · Liu Ren

Vision foundation models (VFMs) have demonstrated remarkable performance across a wide range of downstream tasks. While several VFM adapters have shown promising results by leveraging the prior knowledge of VFMs, we identify two inefficiencies in these approaches. First, the interaction between the convolutional neural network (CNN) and the VFM backbone triggers early layer gradient backpropagation. Second, existing methods require tuning all components, adding complexity. Besides, these adapters alter VFM features, underutilizing the prior knowledge. To tackle these challenges, we propose a new approach called ViT-Split, based on a key observation: the layers of several VFMs, like DINOv2, can be divided into two distinct components: an extractor for learning low-level features and an adapter for learning task-specific features. Leveraging this insight, we eliminate the CNN branch and introduce two heads, a task head and a prior head, to the frozen VFM. The task head is designed to learn task-specific features, mitigating the early gradient propagation issue. The prior head is used to leverage the multi-scale prior features from the frozen VFM, reducing tuning parameters and overfitting. Extensive experiments on various tasks (e.g., segmentation, detection, depth estimation, and visual question answering) validate the effectiveness and efficiency of ViT-Split. Specifically, ViT-Split reduces training time by up to $4\times$ while achieving comparable or even better results on ADE20K, compared to other VFM adapters.


#180
Generalized Deep Multi-view Clustering via Causal Learning with Partially Aligned Cross-view Correspondence

Xihong Yang · Siwei Wang · Jiaqi Jin · Fangdi Wang · Tianrui Liu · Yueming Jin · Xinwang Liu · En Zhu · Kunlun He

Multi-view clustering (MVC) aims to explore the common clustering structure across multiple views. Many existing MVC methods heavily rely on the assumption of view consistency, where alignments for corresponding samples across different views are ordered in advance. However, real-world scenarios often present a challenge as only partial data is consistently aligned across different views, restricting the overall clustering performance. In this work, we consider the performance degradation caused by data order shift (i.e., from fully to partially aligned) as a generalized multi-view clustering problem. To tackle this problem, we design a causal multi-view clustering network, termed CauMVC. We adopt a causal modeling approach to understand the multi-view clustering procedure. To be specific, we formulate the partially aligned data as an intervention and multi-view clustering with partially aligned data as a post-intervention inference. However, obtaining invariant features directly can be challenging. Thus, we design a Variational Auto-Encoder for causal learning by incorporating an encoder from existing information to estimate the invariant features. Moreover, a decoder is designed to perform the post-intervention inference. Lastly, we design a contrastive regularizer to capture sample correlations. To the best of our knowledge, this paper is the first work to deal with generalized multi-view clustering via causal learning. Empirical experiments on both fully and partially aligned data illustrate the strong generalization and effectiveness of CauMVC.

Ensemble-based attacks have been proven to be effective in enhancing adversarial transferability by aggregating the output of models with various architectures. However, existing research primarily focuses on refining ensemble weights or optimizing the ensemble path, overlooking the exploration of ensemble models to enhance the transferability of adversarial attacks. In this study, we attempt to adversarially augment ensemble models by modifying inner modules to mitigate this gap. Moreover, observing that ensemble Vision Transformers (ViTs) gain less attention, we propose ViT-EnsembleAttack, the first ensemble-based attack method tailored for ViTs to the best of our knowledge. Our approach generates augmented models for each surrogate ViT using three strategies: Multi-head dropping, Attention score scaling and MLP feature mixing, with the associated parameters optimized by Bayesian optimization. These adversarially augmented models are ensembled to generate adversarial examples. Furthermore, we introduce an automatic reweighting module that dynamically adjusts the influence of each surrogate model in the ensemble, while also enlarging the step size in each iteration to enhance convergence. Extensive experiments demonstrate that ViT-EnsembleAttack significantly enhances the adversarial transferability of ensemble-based attacks on ViTs, outperforming existing methods by a substantial margin.


#182
Instruction-Oriented Preference Alignment for Enhancing Multi-Modal Comprehension Capability of MLLMs

Zitian Wang · Yue Liao · RONG KANG · Fengyun Rao · Yibo Yang · Si Liu

Preference alignment has emerged as an effective strategy to enhance the performance of Multimodal Large Language Models (MLLMs) following supervised fine-tuning. While existing preference alignment methods predominantly target hallucination factors, they overlook the factors essential for multi-modal comprehension capabilities, often narrowing their improvements on hallucination mitigation. To bridge this gap, we propose Instruction-oriented Preference Alignment (IPA), a scalable framework designed to automatically construct alignment preferences grounded in instruction fulfillment efficacy. Our method involves an automated preference construction coupled with a dedicated verification process that identifies instruction-oriented factors, avoiding significant variability in response representations. Additionally, IPA incorporates a progressive preference collection pipeline, further recalling challenging samples through model self-evolution and reference-guided refinement. Experiments conducted on Qwen2VL-7B demonstrate IPA's effectiveness across multiple benchmarks, including hallucination evaluation, visual question answering, and text understanding tasks, highlighting its capability to enhance general comprehension.


#183
Progressive Homeostatic and Plastic Prompt Tuning for Audio-Visual Multi-Task Incremental Learning

Jiong Yin · Liang Li · Jiehua Zhang · Yuhan Gao · Chenggang Yan · Xichun Sheng

Audio-visual multi-task incremental learning aims to continuously learn from multiple audio-visual tasks without the need for joint training on all tasks. The challenge is how to preserve old task knowledge while facilitating the learning of new tasks with previous experiences. To address these challenges, we introduce a three-stage Progressive Homeostatic and Plastic audio-visual prompt (PHP) method. In the shallow phase, we design the task-shared modality aggregating adapter to foster cross-task and cross-modal audio-visual representation learning to enhance shared understanding between tasks. In the middle phase, we propose the task-specific modality-shared dynamic generating adapter, which constructs prompts that are tailored to individual tasks while remaining general across modalities. This balances the model’s ability to retain knowledge against forgetting with its potential for versatile multi-task transferability. In the deep phase, we introduce the task-specific modality-independent prompts to further refine the understanding ability by targeting individual information for each task and modality. By incorporating these three phases, PHP retains task-specific prompts while adapting shared parameters for new tasks to effectively balance knowledge sharing and specificity. Our method achieves SOTA performance in different orders of three tasks (AVE, AVVP, and AVQA). We will release the source code on GitHub.


#184
Visual-RFT: Visual Reinforcement Fine-Tuning

Ziyu Liu · Zeyi Sun · Yuhang Zang · Xiaoyi Dong · Yuhang Cao · Haodong Duan · Dahua Lin · Jiaqi Wang

Reinforcement Fine-Tuning (RFT) in Large Reasoning Models like OpenAI o1 learns from feedback on its answers, which is especially useful in applications where fine-tuning data is scarce. Recent open-source work like DeepSeek-R1 demonstrates that reinforcement learning with verifiable reward is possibly one key direction in reproducing o1. While the R1-style model has demonstrated success in language models, its application in multi-modal domains remains under-explored. This work introduces Visual Reinforcement Fine-Tuning (Visual-RFT), which further extends the application areas of RFT on visual tasks. Specifically, Visual-RFT first uses Large Vision-Language Models (LVLMs) to generate multiple responses containing reasoning tokens and final answers for each input, and then uses our proposed visual perception verifiable reward functions to update the model via a policy optimization algorithm such as Group Relative Policy Optimization (GRPO). We design different verifiable reward functions for different perception tasks, such as the Intersection over Union (IoU) reward for object detection. Experimental results on fine-grained image classification, few-shot object detection, reasoning grounding, as well as open-vocabulary object detection benchmarks show the competitive performance and advanced generalization ability of Visual-RFT compared with Supervised Fine-tuning (SFT). For example, Visual-RFT improves accuracy by 24.3% over the baseline in one-shot fine-grained image classification with around 100 samples. In few-shot object detection, Visual-RFT also exceeds the baseline by 21.0 on COCO's 4-shot setting and 15.4 on LVIS. Our Visual-RFT represents a paradigm shift in fine-tuning LVLMs, offering a data-efficient, reward-driven approach that enhances reasoning and adaptability for domain-specific tasks.
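To make the idea of a verifiable IoU reward concrete, the sketch below scores predicted boxes against ground-truth boxes. The IoU computation is standard; the aggregation in `iou_reward` (mean of each prediction's best IoU) is a hypothetical choice for illustration and is not taken from the paper.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def iou_reward(pred_boxes, gt_boxes):
    """Hypothetical reward: mean over predictions of the best IoU with any ground-truth box."""
    if not pred_boxes or not gt_boxes:
        return 0.0
    best = [max(iou(p, g) for g in gt_boxes) for p in pred_boxes]
    return sum(best) / len(best)
```

Because the reward is computed directly from box coordinates, it can be checked without a learned reward model, which is the property the abstract calls "verifiable".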


#185
Jailbreaking Multimodal Large Language Models via Shuffle Inconsistency

Shiji Zhao · Ranjie Duan · Fengxiang Wang · Chi Chen · Caixin KANG · Shouwei Ruan · Jialing Tao · YueFeng Chen · Hui Xue · Xingxing Wei

Multimodal Large Language Models (MLLMs) have achieved impressive performance and have been put into practical use in commercial applications, but they still have potential safety mechanism vulnerabilities. Jailbreak attacks are red teaming methods that aim to bypass safety mechanisms and discover MLLMs' potential risks. Existing MLLMs' jailbreak methods often bypass the model's safety mechanism through complex optimization methods or carefully designed image and text prompts. Despite achieving some progress, they have a low attack success rate on commercial closed-source MLLMs. Unlike previous research, we empirically find that there exists a Shuffle Inconsistency between MLLMs' comprehension ability and safety ability for the shuffled harmful instruction. That is, from the perspective of comprehension ability, MLLMs can understand the shuffled harmful text-image instructions well. However, they can be easily bypassed by the shuffled harmful instructions from the perspective of safety ability, leading to harmful responses. Based on the exploration, we innovatively propose a text-image jailbreak attack named SI-Attack. Specifically, to fully utilize the Shuffle Inconsistency and overcome the shuffle randomness, we apply a query-based black-box optimization method to select the most harmful shuffled inputs based on the feedback of the toxic judge model. A series of experiments show that SI-Attack can improve the attack's performance on three benchmarks. In particular, SI-Attack can obviously improve the attack success rate for commercial closed-source MLLMs such as GPT-4o or Claude-3.5-Sonnet.


#186
Large Learning Rates Simultaneously Achieve Robustness to Spurious Correlations and Compressibility

Melih Barsbey · Lucas Prieto · Stefanos Zafeiriou · Tolga Birdal

Robustness and resource-efficiency are two highly desirable properties for modern machine learning models. However, achieving them jointly remains a challenge. In this paper, we position high learning rates as a facilitator for simultaneously achieving robustness to spurious correlations and network compressibility. We demonstrate that large learning rates also produce desirable representation properties such as invariant feature utilization, class separation, and activation sparsity. Importantly, our findings indicate that large learning rates compare favorably to other hyperparameters and regularization methods in consistently satisfying these properties in tandem. In addition to demonstrating the positive effect of large learning rates across diverse spurious correlation datasets, models, and optimizers, we also present strong evidence that the previously documented success of large learning rates in standard classification tasks is likely due to their effect on addressing hidden/rare spurious correlations in the training dataset.


#187
Dual-Rate Dynamic Teacher for Source-Free Domain Adaptive Object Detection

Qi He · Xiao Wu · Jun-Yan He · Shuai Li

Source-Free Domain Adaptive Object Detection (SF-DAOD) transfers knowledge acquired from the labeled source domain to the unlabeled target domain while preserving data privacy by restricting access to source data during adaptation. Existing approaches predominantly leverage the Mean Teacher framework for self-training in the target domain. The Exponential Moving Average (EMA) mechanism in Mean Teacher stabilizes training by averaging the student weights over training steps. However, in domain adaptation, its inherent lag in responding to emerging knowledge can hinder the student's rapid adaptation to target-domain shifts. To address this challenge, we propose the Dual-rate Dynamic Teacher (DDT) with an Asynchronous EMA (AEMA), which implements group-wise parameter updates. Unlike conventional EMA, which synchronously updates all parameters, AEMA dynamically partitions teacher parameters into two functional groups based on the contribution to capture the target domain shift. By applying a distinct smoothing coefficient to these groups, AEMA enables simultaneous fast adaptation and historical knowledge retention. Comprehensive experiments conducted on three widely used traffic benchmarks have demonstrated that the proposed DDT achieves superior performance, outperforming the state-of-the-art methods by a clear margin. The codes are available at https://anonymous.4open.science/r/Dual-Rate-Dynamic-Teacher-for-Source-Free-Domain-Adaptive-Object-Detection-17BF.
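The contrast between a conventional EMA teacher and a group-wise, two-rate update can be sketched in a few lines. Parameters are plain floats keyed by name, and the set `fast_keys` is supplied by the caller; how AEMA actually decides which parameters belong to the fast group, and the decay values used, are not given in the abstract, so both are assumptions here.

```python
def dual_rate_ema(teacher, student, fast_keys, slow_decay=0.999, fast_decay=0.9):
    """Return updated teacher weights using a smaller decay for parameters in fast_keys.

    teacher, student: dicts mapping parameter name -> float value.
    fast_keys: names of parameters that should track the student quickly (assumed given).
    """
    updated = {}
    for name, t_value in teacher.items():
        decay = fast_decay if name in fast_keys else slow_decay
        updated[name] = decay * t_value + (1.0 - decay) * student[name]
    return updated


# Toy example with two hypothetical parameters.
teacher = {"backbone.w": 0.0, "head.w": 0.0}
student = {"backbone.w": 1.0, "head.w": 1.0}
new_teacher = dual_rate_ema(teacher, student, fast_keys={"head.w"})
```

After one step the fast group has moved 10% of the way to the student while the slow group has moved 0.1%, which illustrates how a single teacher can adapt quickly in some parameters and retain history in others.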


#188
Dynamic Multi-Layer Null Space Projection for Vision-Language Continual Learning

Borui Kang · Lei Wang · Zhiping Wu · Tao Feng · Yawen Li · Yang Gao · Wenbin Li

Vision-Language Models (VLMs) have emerged as a highly promising approach for Continual Learning (CL) due to their powerful generalized features. While adapter-based VLMs can exploit both task-specific and task-agnostic features, current CL methods have largely overlooked the distinct and evolving parameter distributions in visual and language modalities, which we find crucial for effectively mitigating catastrophic forgetting. In this study, we find that the visual modality experiences a broader parameter distribution and greater variance during class increments than the textual modality, leading to higher vulnerability to forgetting. Consequently, we handle the branches of the two modalities asymmetrically. Specifically, we propose a Dynamic Multi-layer Null Space Projection (DMNSP) strategy and apply it only to the visual modality branch, while optimizing the language branch according to the original optimizer. DMNSP can restrict the update of visual parameters within the common subspace of multiple null spaces, further limiting the impact of non-zero residual terms. Simultaneously, combined with a dynamic projection coefficient, we can precisely control the magnitude of gradient projection to the null space, endowing the model with good stability and plasticity. Extensive experiments on TinyImageNet, CIFAR100 and ImageNet-R demonstrate that our method outperforms current approaches in accuracy and knowledge retention, setting a new standard for state-of-the-art performance in class incremental learning.


#189
LLaVA-CoT: Let Vision Language Models Reason Step-by-Step

Guowei Xu · Peng Jin · ZiangWu ZiangWu · Li Hao · Yibing Song · Lichao Sun · Li Yuan

Large language models have demonstrated substantial advancements in reasoning capabilities. However, current Vision-Language Models (VLMs) often struggle to perform systematic and structured reasoning, especially when handling complex visual question-answering tasks. In this work, we introduce LLaVA-CoT, a large VLM designed to conduct autonomous multistage reasoning. Unlike chain-of-thought prompting, LLaVA-CoT independently engages in sequential stages of summarization, visual interpretation, logical reasoning, and conclusion generation. This structured approach enables LLaVA-CoT to achieve marked improvements on reasoning-intensive tasks. To accomplish this, we construct the LLaVA-CoT-100k dataset, integrating samples from various visual question answering sources and providing structured reasoning annotations. Besides, we propose a test-time stage-wise retracing search method (SWIRES), which enables effective and efficient test-time scaling. Remarkably, with only 100k training samples and test-time scaling, LLaVA-CoT not only outperforms its base model by 9.4% on a wide range of multimodal reasoning benchmarks, but also surpasses the performance of larger and even closed-source models, such as Gemini-1.5-pro, GPT-4o-mini, and Llama-3.2-90B-Vision-Instruct. The code, dataset, and pre-trained weights will be made publicly available.


#190
DALIP: Distribution Alignment-based Language-Image Pre-Training for Domain-Specific Data

Junjie Wu · Jiangtao Xie · Zhaolin Zhang · Qilong Wang · Qinghua Hu · Peihua Li · Sen Xu

Recently, Contrastive Language-Image Pre-training (CLIP) has shown promising performance in domain-specific data (e.g., biology), and has attracted increasing research attention. Existing works generally focus on collecting extensive domain-specific data and directly tuning the original CLIP models. Intuitively, such a paradigm takes no full consideration of the characteristics lying in domain-specific data (e.g., fine-grained nature of biological data) and so limits model capability, while mostly losing the original ability of CLIP in the general domain. In this paper, we propose a Distribution Alignment-based Language-Image Pre-Training (DALIP) method for biological data. Specifically, DALIP optimizes CLIP models by matching the similarity between feature distribution of image-text pairs instead of the original [cls] token, which can capture rich yet effective information inherent in image-text pairs as powerful representations, and so better cope with fine-grained nature of biological data. Particularly, our DALIP efficiently approximates feature distribution via its first- and second-order statistics, while presenting a Multi-head Brownian Distance Covariance (MBDC) module to acquire second-order statistics of token features efficiently. Furthermore, we collect a new dataset for plant domain (e.g., specific data in biological domain) comprising 10M plant data with 3M general-domain data (namely PlantMix-13M) according to data mixing laws. Extensive experiments show that DALIP clearly outperforms existing CLIP counterparts in biological domain, while well generalizing to remote sensing and medical imaging domains. Besides, our PlantMix-13M dataset further boosts performance of DALIP in plant domain, while preserving model ability in general domain.


#191
Semi-supervised Concept Bottleneck Models

Lijie Hu · Tianhao Huang · Huanyi Xie · Xilin Gong · Chenyang Ren · Zhengyu Hu · Lu Yu · Ping Ma · Di Wang

Concept Bottleneck Models (CBMs) have garnered increasing attention due to their ability to provide concept-based explanations for black-box deep learning models while achieving high final prediction accuracy using human-like concepts. However, the training of current CBMs heavily relies on the accuracy and richness of annotated concepts in the dataset. These concept labels are typically provided by experts, which can be costly and require significant resources and effort. Additionally, concept saliency maps frequently misalign with input saliency maps, causing concept predictions to correspond to irrelevant input features - an issue related to annotation alignment. To address these limitations, we propose a new framework called SSCBM (Semi-supervised Concept Bottleneck Model). Our SSCBM is suitable for practical situations where annotated data is scarce. By leveraging joint training on both labeled and unlabeled data and aligning the unlabeled data at the concept level, we effectively solve these issues. We propose a strategy to generate pseudo labels and an alignment loss. Experiments demonstrate that our SSCBM is both effective and efficient. With only 10% labeled data, our model's concept and task accuracy on average across four datasets is only 2.44% and 3.93% lower, respectively, compared to the best baseline in the fully supervised learning setting.


#192
FRET: Feature Redundancy Elimination for Test Time Adaptation

Linjing You · Jiabao Lu · Xiayuan Huang · Xiangli Nie

Test-Time Adaptation (TTA) aims to enhance the generalization of deep learning models when faced with test data that exhibits distribution shifts from the training data. In this context, only a pre-trained model and unlabeled test data are available, making it particularly relevant for privacy-sensitive applications. In practice, we observe that feature redundancy in embeddings tends to increase as domain shifts intensify in TTA. However, existing TTA methods often overlook this redundancy, which can hinder the model’s adaptability to new data. To address this issue, we introduce Feature Redundancy Elimination for Test-time Adaptation (FRET), a novel perspective for TTA. A straightforward approach (S-FRET) is to directly minimize the feature redundancy score as an optimization objective to improve adaptation. Despite its simplicity and effectiveness, S-FRET struggles with label shifts, limiting its robustness in real-world scenarios. To mitigate this limitation, we further propose Graph-based FRET (G-FRET), which integrates a Graph Convolutional Network (GCN) with contrastive learning. This design not only reduces feature redundancy but also enhances feature discriminability in both the representation and prediction layers. Extensive experiments across multiple model architectures, tasks, and datasets demonstrate the effectiveness of S-FRET and show that G-FRET achieves state-of-the-art performance. Further analysis reveals that G-FRET enables the model to extract non-redundant and highly discriminative features during inference, thereby facilitating more robust test-time adaptation.


#193
Meta-Unlearning on Diffusion Models: Preventing Relearning Unlearned Concepts

Hongcheng Gao · Tianyu Pang · Chao Du · Taihang Hu · Zhijie Deng · Min Lin

With the rapid progress of diffusion models (DMs), significant efforts are being made to unlearn harmful or copyrighted concepts from pretrained DMs to prevent potential model misuse. However, it is observed that even when DMs are properly unlearned before release, malicious finetuning can compromise this process, causing DMs to relearn the unlearned concepts. This occurs partly because certain benign concepts (e.g., "skin") retained in DMs are related to the unlearned ones (e.g., "nudity"), facilitating their relearning via finetuning. To address this, we propose meta-unlearning on DMs. Intuitively, a meta-unlearned DM should behave like an unlearned DM when used as is; moreover, if the meta-unlearned DM undergoes malicious finetuning on unlearned concepts, the related benign concepts retained within it will be triggered to self-destruct, hindering the relearning of unlearned concepts. Our meta-unlearning framework is compatible with most existing unlearning methods, requiring only the addition of an easy-to-implement meta objective. We validate our approach through empirical experiments on meta-unlearning concepts from Stable Diffusion models (SD-v1-4 and SDXL), supported by extensive ablation studies.


#194
Highlight
A Unified Interpretation of Training-Time Out-of-Distribution Detection

Xu Cheng · Xin Jiang · Zechao Li

This paper explains training-time out-of-distribution (OOD) detection from a novel view, i.e., interactions between different input variables of deep neural networks (DNNs). Specifically, we provide a unified understanding of the effectiveness of current training-time OOD detection methods, i.e., DNNs trained with these methods all encode more complex interactions for inference than those trained without training-time methods, which contributes to their superior OOD detection performance. We further conduct thorough empirical analyses and verify that complex interactions play a primary role in OOD detection, by developing a simple-yet-efficient method to force the DNN to learn interactions of specific complexities and evaluate the change of OOD detection performances. Besides, we also use interactions to investigate why near-OOD samples are more difficult to distinguish from in-distribution (ID) samples than far-OOD samples, mainly because compared to far-OOD samples, the distribution of interactions in near-OOD samples is more similar to that of ID samples. Moreover, we discover that training-time OOD detection methods can effectively decrease such similarities. The code will be released when the paper is accepted.


#195
Coupling the Generator with Teacher for Effective Data-Free Knowledge Distillation

Xu Chen · Yang Li · Yahong Han · Guangquan Xu · Jialie Shen

Data-Free Knowledge Distillation (DFKD) avoids accessing the original training data during knowledge transfer from a large model to a smaller one, possessing significant potential in ensuring the widespread promotion of industry-level applications while safeguarding user privacy and data security. Unfortunately, due to the lack of precise estimation of the original data distribution, existing DFKD methods often rely on manually induced priors to constrain the generator to produce samples that comply with the rules as much as possible. In this paper, we propose a novel method dubbed CouPling Network (CPNet) that constructs a generator to explicitly approximate the inverse transformation of the teacher model. Consequently, the two components can be integrated into an autoencoder specifically tailored for label information, where the generated images are treated as latent variables. Since real labels are typically uniformly distributed and the parameters of the teacher model are fixed, this enables our generator to produce images that closely approximate the true distribution. Besides, we transform real labels into feature-level constraints through the inverse transformation of a network classifier with fixed parameters, thereby converting the classification problem of generated images into an issue of distance measurement between features. We utilize this constraint for adversarial training and enhancing the diversity of produced images. Extensive experiments on three public benchmarks demonstrate that our proposed method achieves superior or competitive performance compared to previous state-of-the-art methods, while also exhibiting faster generation speed.


#196
FedMeNF: Privacy-Preserving Federated Meta-Learning for Neural Fields

Junhyeog Yun · Minui Hong · Gunhee Kim

Neural fields provide a memory-efficient representation of data, which can effectively handle diverse modalities and large-scale data. However, learning to map neural fields often requires large amounts of training data and computations, which can be limited to resource-constrained edge devices. One approach to tackle this limitation is to leverage Federated Meta-Learning (FML), but traditional FML approaches suffer from privacy leakage. To address these issues, we introduce a novel FML approach called FedMeNF. FedMeNF utilizes a new privacy-preserving loss function that regulates privacy leakage in the local meta-optimization. This enables the local meta-learner to optimize quickly and efficiently without retaining the client's private data. Our experiments demonstrate that FedMeNF achieves fast optimization speed and robust reconstruction performance, even with few-shot or non-IID data across diverse data modalities, while preserving client data privacy.


#197
Visual Modality Prompt for Adapting Vision-Language Object Detectors

Heitor Rapela Medeiros · Atif Belal · Srikanth Muralidharan · Eric Granger · Marco Pedersoli

The zero-shot performance of object detectors degrades when tested on different modalities, such as infrared and depth. While recent work has explored image translation techniques to adapt detectors to new modalities, these methods are limited to a single modality and apply only to traditional detectors. Recently, vision-language detectors, such as YOLO-World and Grounding DINO, have shown promising zero-shot capabilities; however, they have not yet been adapted for other visual modalities. Traditional fine-tuning approaches compromise the zero-shot capabilities of the detectors. The visual prompt strategies commonly used for classification with vision-language models apply the same linear prompt translation to each image, making them less effective. To address these limitations, we propose ModPrompt, a visual prompt strategy to adapt vision-language detectors to new modalities without degrading zero-shot performance. In particular, an encoder-decoder visual prompt strategy is proposed, further enhanced by the integration of an inference-friendly modality prompt decoupled residual, facilitating a more robust adaptation. Empirical benchmarking results on two vision-language detectors, YOLO-World and Grounding DINO, and on challenging infrared (LLVIP, FLIR) and depth (NYUv2) datasets show that our method achieves performance comparable to full fine-tuning while preserving the model's zero-shot capability.

Domain Generalization aims to develop models that can generalize to novel and unseen data distributions. In this work, we study how model architectures and pre-training objectives impact feature richness and propose a method to effectively leverage them for domain generalization. Specifically, given a pre-trained feature space, we first discover latent domain structures, referred to as pseudo-domains, that capture domain-specific variations in an unsupervised manner. Next, we augment existing classifiers with these complementary pseudo-domain representations making them more amenable to diverse unseen test domains. We analyze how different pre-training feature spaces differ in the domain-specific variances they capture. Our empirical studies reveal that features from diffusion models excel at separating domains in the absence of explicit domain labels and capture nuanced domain-specific information. On 5 datasets, we show that our very simple framework improves generalization to unseen domains by a maximum test accuracy improvement of over 4% compared to the standard baseline Empirical Risk Minimization (ERM). Crucially, our method outperforms most algorithms that access domain labels during training. Code is available at: https://anonymous.4open.science/r/GUIDE-B567/README.md.


#199
Prototype Guided Backdoor Defense via Activation Space Manipulation

Venkat Adithya Amula · Sunayana Samavedam · Saurabh Saini · Avani Gupta · P J Narayanan

Deep learning models are susceptible to backdoor attacks, in which malicious attackers perturb a small subset of training data with a trigger to cause misclassifications. Various triggers have been used, including semantic triggers that are easily realizable without requiring the attacker to manipulate the image. The emergence of generative AI has eased the generation of varied poisoned samples. Robustness across types of triggers is crucial to effective defense. We propose Prototype Guided Backdoor Defense (PGBD), a robust post-hoc defense that scales across different trigger types, including previously unsolved semantic triggers. PGBD exploits displacements in the geometric spaces of activations to penalize movements towards the trigger. This is done using a novel sanitization loss in a post-hoc fine-tuning step. The geometric approach scales easily to all types of attacks. PGBD achieves better performance across all settings. We also present the first defense against a new semantic attack on celebrity face images.


#200
Analyzing Finetuning Representation Shift for Multimodal LLMs Steering

Pegah KHAYATAN · Mustafa Shukor · Jayneel Parekh · Arnaud Dapogny · Matthieu Cord

Multimodal LLMs have reached remarkable levels of proficiency in understanding multimodal inputs, driving extensive research to develop increasingly powerful models. However, far less attention has been given to understanding and explaining the underlying mechanisms of these models. Most existing explainability research examines these models only in their final states, overlooking the dynamic representational shifts that occur during training. In this work, we systematically analyze the evolution of hidden-state representations to reveal how fine-tuning alters a model’s internal structure to specialize on new multimodal tasks. Using a concept-based approach, we map hidden states to interpretable visual and textual concepts, enabling us to trace changes in encoded concepts across modalities as training progresses. We also demonstrate the use of shift vectors to capture these concept changes. These shift vectors allow us to recover fine-tuned concepts by shifting those in the original model. Finally, we explore the practical impact of our findings on model steering, showing that we can adjust multimodal LLMs' behaviors without any training, such as modifying answer types or caption styles, or biasing the model toward specific responses. Our work sheds light on how multimodal representations evolve through fine-tuning and offers a new perspective for interpreting model adaptation in multimodal tasks. The code will be made publicly available.


#201
Efficient Unsupervised Shortcut Learning Detection and Mitigation in Transformers

Lukas Kuhn · sari sadiya · Jörg Schlötterer · Florian Buettner · Christin Seifert · Gemma Roig

Shortcut learning, i.e., a model's reliance on undesired features not directly relevant to the task, is a major challenge that severely limits the applications of machine learning algorithms, particularly when deploying them to assist in making sensitive decisions, such as in medical diagnostics. In this work, we leverage recent advancements in machine learning to create an unsupervised framework that is capable of both detecting and mitigating shortcut learning in transformers. We validate our method on multiple datasets. Results demonstrate that our framework significantly improves both worst-group accuracy (samples misclassified due to shortcuts) and average accuracy, while minimizing human annotation effort. Moreover, we demonstrate that the detected shortcuts are meaningful and informative to human experts, and that our framework is computationally efficient, allowing it to be run on consumer hardware.


#202
Understanding Museum Exhibits using Vision-Language Reasoning

Ada-Astrid Balauca · Sanjana Garai · Stefan Balauca · Rasesh Shetty · Naitik Agrawal · Dhwanil Shah · Yuqian Fu · Xi Wang · Kristina Toutanova · Danda Pani Paudel · Luc Gool

Museums serve as repositories of cultural heritage and historical artifacts from diverse epochs, civilizations, and regions, preserving well-documented collections that encapsulate vast knowledge, which, when systematically structured into large-scale datasets, can train specialized models. Visitors engage with exhibits through curiosity and questions, making expert domain-specific models essential for interactive query resolution and gaining historical insights. Understanding exhibits from images requires analyzing visual features and linking them to historical knowledge to derive meaningful correlations. We facilitate such reasoning by (a) collecting and curating a large-scale dataset of 65M images and 200M question-answer pairs for exhibits from all around the world; (b) training large vision-language models (VLMs) on the collected dataset; (c) benchmarking their ability on five visual question answering tasks, specifically designed to reflect real-world inquiries and challenges observed in museum settings. The complete dataset is labeled by museum experts, ensuring the quality and the practical significance of the labels. We train two VLMs from different categories: BLIP with vision-language aligned embeddings, but lacking the expressive power of large language models, and the LLaVA model, a powerful instruction-tuned LLM enriched with vision-language reasoning capabilities. Through extensive experiments, we find that while both model types effectively answer visually grounded questions, large vision-language models excel in queries requiring deeper historical context and reasoning. We further demonstrate the necessity of fine-tuning models on large-scale domain-specific datasets by showing that our fine-tuned models significantly outperform current SOTA VLMs in answering questions related to specific attributes, highlighting their limitations in handling complex, nuanced queries. Our dataset, benchmarks, and source code will be made publicly available.


#203
Looking in the Mirror: A Faithful Counterfactual Explanation Method for Interpreting Deep Image Classification Models

Townim Chowdhury · Vu Phan · Kewen Liao · Nanyu Dong · Minh-Son To · Anton Hengel · Johan Verjans · Zhibin Liao

Counterfactual explanations (CFE) for deep image classifiers aim to reveal how minimal input changes lead to different model decisions, providing critical insights for model interpretation and improvement. However, existing CFE methods often rely on additional image encoders and generative models to create plausible images, neglecting the classifier's own feature space and decision boundaries. As such, they do not explain the intrinsic feature space and decision boundaries learned by the classifier. To address this limitation, we propose Mirror-CFE, a novel method that generates faithful counterfactual explanations by operating directly in the classifier's feature space, treating decision boundaries as mirrors that "reflect" feature representations in the mirror. Mirror-CFE learns a mapping function from feature space to image space while preserving distance relationships, enabling smooth transitions between source images and their counterfactuals. Through extensive experiments on four image datasets, we demonstrate that Mirror-CFE achieves superior performance in validity while maintaining input resemblance compared to state-of-the-art explanation methods. Finally, Mirror-CFE provides interpretable visualization of the classifier's decision process by generating step-wise transitions that reveal how features evolve as classification confidence changes.


#204
Improving Multimodal Learning via Imbalanced Learning

Shicai Wei · Chunbo Luo · Yang Luo

Multimodal learning often encounters the under-optimized problem and may perform worse than unimodal learning. Existing approaches attribute this issue to imbalanced learning across modalities and tend to address it through gradient balancing. However, this paper argues that balanced learning is not the optimal setting for multimodal learning. With bias-variance analysis, we prove that imbalanced dependency on each modality obeying the inverse ratio of their variances contributes to optimal performance. To this end, we propose the Asymmetric Representation Learning (ARL) strategy to assist multimodal learning via imbalanced optimization. ARL introduces auxiliary regularizers for each modality encoder to calculate their prediction variance. ARL then calculates coefficients via the unimodal variance to re-weight the optimization of each modality, forcing the modality dependence ratio to be inversely proportional to the modality variance ratio. Moreover, to minimize the generalization error, ARL further introduces the prediction bias of each modality and jointly optimizes them with multimodal loss. Notably, all auxiliary regularizers share parameters with the multimodal model and rely only on the modality representation. Thus the proposed ARL strategy introduces no extra parameters and is independent of the structures and fusion methods of the multimodal model. Finally, extensive experiments on various datasets validate the effectiveness and versatility of ARL.
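
The inverse-variance dependence stated in the abstract above can be illustrated with the classical inverse-variance weighting rule for independent, unbiased predictors. This is only a minimal sketch of that textbook rule, not the ARL regularizers themselves; the function names and the example variances are hypothetical.

```python
def inverse_variance_weights(variances):
    """Return weights proportional to 1/variance, normalized to sum to 1."""
    if any(v <= 0 for v in variances):
        raise ValueError("variances must be positive")
    inverse = [1.0 / v for v in variances]
    total = sum(inverse)
    return [x / total for x in inverse]


def fused_variance(weights, variances):
    """Variance of a weighted sum of independent predictors."""
    return sum(w * w * v for w, v in zip(weights, variances))


# Hypothetical per-modality prediction variances (assumed values).
variances = [1.0, 4.0]
weights = inverse_variance_weights(variances)

# The lower-variance modality receives four times the weight of the other,
# and the fused variance is lower than with equal (balanced) weights.
balanced = fused_variance([0.5, 0.5], variances)
imbalanced = fused_variance(weights, variances)
```

Under these assumptions the imbalanced weighting gives a smaller fused variance than the balanced one, which is the intuition behind preferring an asymmetric dependence on the modalities.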


#205
NAPPure: Adversarial Purification for Robust Image Classification under Non-Additive Perturbations

Junjie Nan · Jianing Li · Wei Chen · Mingkun Zhang · Xueqi Cheng

Adversarial purification has achieved great success in combating adversarial image perturbations, which are usually assumed to be additive. However, non-additive adversarial perturbations such as blur, occlusion, and distortion are also common in the real world. Under such perturbations, existing adversarial purification methods are much less effective since they are designed to fit the additive nature. In this paper, we propose an extended adversarial purification framework named NAPPure, which can further handle non-additive perturbations. Specifically, we first establish the generation process of an adversarial image, and then disentangle the underlying clean image and perturbation parameters through likelihood maximization. Experiments on GTSRB and CIFAR-10 datasets show that NAPPure significantly boosts the robustness of image classification models against non-additive perturbations.


#206
VSP: Diagnosing the Dual Challenges of Perception and Reasoning in Spatial Planning Tasks for MLLMs

Qiucheng Wu · Handong Zhao · Michael Saxon · Trung Bui · William Yang Wang · Yang Zhang · Shiyu Chang

Multimodal large language models (MLLMs) are an exciting emerging class of language models (LMs) that have merged classic LM capabilities with those of image processing systems. However, how these capabilities integrate is often not intuitive and warrants direct investigation. One understudied capability in MLLMs is visual spatial planning: the ability to comprehend the spatial arrangements of objects and devise action plans to achieve desired outcomes in visual scenes. It is unclear why MLLMs fall short on these tasks generally considered easy for humans, given their successes across other diverse scenarios. To this end, we introduce VSP, a benchmark that 1) evaluates the spatial planning capability in MLLMs in general, and 2) diagnoses this capability via finer-grained sub-tasks, including perception and reasoning, and measures the capabilities of models through these sub-tasks. Our evaluation confirms that both open-source and private MLLMs fail to generate effective plans for even simple spatial planning tasks. Evaluations on the fine-grained analytical tasks further reveal fundamental deficiencies in the models’ visual perception and bottlenecks in reasoning abilities, explaining their worse performance in the general spatial planning tasks. Our work illuminates future directions for improving MLLMs' abilities in spatial planning.


#207
Multi-Cache Enhanced Prototype Learning for Test-Time Generalization of Vision-Language Models

Xinyu Chen · Haotian Zhai · Can Zhang · XIUPENG SHI · Ruirui Li

In the zero-shot setting, test-time adaptation (TTA) adjusts pre-trained models using unlabeled data from the test phase to enhance performance on unknown test distributions. Existing cache-enhanced TTA methods rely on a low-entropy criterion to select samples for prototype construction, assuming intra-class compactness. However, low-entropy samples may be unreliable under distribution shifts, and the resulting prototypes may not ensure compact intra-class distributions. This study identifies a positive correlation between cache-enhanced performance and intra-class compactness. Based on this observation, we propose a Multi-Cache enhanced Prototype-based Test-Time Adaptation (MCP) featuring three caches: an entropy cache for initializing prototype representations with low-entropy samples, an align cache for integrating visual and textual information to achieve compact intra-class distributions, and a negative cache for prediction calibration using high-entropy samples. We further developed MCP++, a framework incorporating cross-modal prototype alignment and residual learning, introducing prototype residual fine-tuning. Comparative and ablation experiments across 15 downstream tasks demonstrate that the proposed method and framework achieve state-of-the-art generalization performance.
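
The entropy criterion mentioned in the abstract above can be sketched as follows: compute the Shannon entropy of a predicted class distribution and route low-entropy samples to an entropy cache and high-entropy samples to a negative cache. This is a rough illustration only; the thresholds, the cache names as strings, and the `route_sample` helper are hypothetical and not taken from the paper.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def route_sample(probs, low, high):
    """Decide which cache a test sample would feed, by prediction entropy.

    `low` and `high` are assumed thresholds, not values from the paper.
    """
    h = entropy(probs)
    if h <= low:
        return "entropy_cache"   # confident prediction
    if h >= high:
        return "negative_cache"  # highly uncertain prediction
    return "skip"


confident = route_sample([0.97, 0.01, 0.01, 0.01], low=0.3, high=1.2)
uncertain = route_sample([0.25, 0.25, 0.25, 0.25], low=0.3, high=1.2)
in_between = route_sample([0.6, 0.2, 0.1, 0.1], low=0.3, high=1.2)
```

The align cache described in the abstract depends on visual and textual features that are not specified here, so it is left out of the sketch.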

The advancement of Multimodal Large Language Models (MLLMs) has driven significant progress in Visual Question Answering (VQA), evolving from Single-Image VQA to Multi-Image VQA (MVQA). However, the increased number of images in MVQA inevitably introduces substantial visual redundancy that is irrelevant to question answering (QA), negatively impacting both accuracy and efficiency. Existing methods that address this issue often lack flexibility in controlling the number of compressed visual tokens and tend to produce discrete visual fragments, which hinder MLLMs' ability to comprehend images holistically. In this paper, we propose a straightforward yet universal Adaptive Visual Anchoring strategy, which can be seamlessly integrated into existing MLLMs, offering significant accuracy improvements through adaptive compression. Technically, our approach first constructs a response map that captures local relevance within an image concerning a given textual question by measuring cross-modal similarity. Next, a series of anchor boxes are generated around the gravity center of the response map, with the highest-confidence box selected and fed into MLLMs for question answering. To further enhance performance, we introduce a novel collaborative decoding mechanism that balances the answering results derived from both global and compressed images. Since compressed images effectively filter out irrelevant visual regions, they enable MLLMs to establish a more precise alignment between visual and textual content, thereby improving answer accuracy. Extensive experiments validate the effectiveness of our method, demonstrating consistent performance improvements across various MLLMs. The code will be publicly available.
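
The anchoring step described in the abstract above can be sketched as computing the response-weighted centroid ("gravity center") of a 2D response map and placing candidate boxes of several sizes around it. This is a minimal sketch under assumptions: the response map is a plain list of rows, the box sizes are hypothetical, and the confidence scoring used to pick the final box is not specified in the abstract, so it is omitted.

```python
def gravity_center(response):
    """Return the (x, y) centroid of a non-negative 2D response map."""
    total = sum(sum(row) for row in response)
    if total <= 0:
        raise ValueError("response map must have positive total mass")
    cy = sum(y * sum(row) for y, row in enumerate(response)) / total
    cx = sum(x * v for row in response for x, v in enumerate(row)) / total
    return cx, cy


def anchor_boxes(center, sizes, width, height):
    """Boxes (x0, y0, x1, y1) centered on `center`, clipped to the image."""
    cx, cy = center
    boxes = []
    for w, h in sizes:
        x0 = max(0.0, cx - w / 2)
        y0 = max(0.0, cy - h / 2)
        x1 = min(float(width), cx + w / 2)
        y1 = min(float(height), cy + h / 2)
        boxes.append((x0, y0, x1, y1))
    return boxes


# Toy 3x4 response map with most of its mass at column 2, row 1.
response = [
    [0, 0, 0, 0],
    [0, 1, 3, 0],
    [0, 0, 0, 0],
]
center = gravity_center(response)
boxes = anchor_boxes(center, sizes=[(2, 2), (6, 6)], width=4, height=3)
```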


#209
Unsupervised Visual Chain-of-Thought Reasoning via Preference Optimization

Kesen Zhao · Beier Zhu · Qianru Sun · Hanwang Zhang

Chain-of-thought (CoT) reasoning greatly improves the interpretability and problem-solving abilities of multimodal large language models (MLLMs). However, existing approaches are focused on text CoT, limiting their ability to leverage visual cues. Visual CoT remains underexplored, and the only work is based on supervised fine-tuning (SFT) that relies on extensive labeled bounding-box data and is hard to generalize to unseen cases. In this paper, we introduce Unsupervised Visual CoT (UV-CoT), a novel framework for image-level CoT reasoning via preference optimization. UV-CoT performs preference comparisons between model-generated bounding boxes (one is preferred and the other is dis-preferred), eliminating the need for bounding-box annotations. We get such preference data by introducing an automatic data generation pipeline. Given an image, our target MLLM (e.g., LLaVA-1.5-7B) generates seed bounding boxes using a template prompt and then answers the question using each bounded region as input. An evaluator MLLM (e.g., OmniLLM-12B) ranks the responses, and these rankings serve as supervision to train the target MLLM with UV-CoT by minimizing negative log-likelihood losses. By emulating human perception (identifying key regions and reasoning based on them), UV-CoT can improve visual comprehension, particularly in spatial reasoning tasks where textual descriptions alone fall short. Our experiments on six datasets demonstrate the superiority of UV-CoT, compared to the state-of-the-art textual and visual CoT methods. Our zero-shot testing on three unseen datasets shows the strong generalization of UV-CoT. The implementation code is available in the Appendix.
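
To make the idea of learning from ranked pairs concrete, the sketch below shows a generic Bradley-Terry style pairwise negative log-likelihood, in which the loss shrinks as the preferred item's score exceeds the dis-preferred item's score. The abstract above does not give the exact form of the UV-CoT objective, so this particular loss, the scalar scores, and the function name are assumptions made purely for illustration.

```python
import math


def pairwise_nll(score_preferred, score_dispreferred):
    """Negative log-likelihood -log(sigmoid(s_p - s_d)) of a preferred pair.

    Written in a numerically stable softplus form so that large score
    gaps do not overflow.
    """
    d = score_preferred - score_dispreferred
    if d >= 0:
        return math.log1p(math.exp(-d))
    return -d + math.log1p(math.exp(d))


# Hypothetical scores for two model-generated bounding boxes.
loss_tied = pairwise_nll(0.0, 0.0)       # no preference expressed by scores
loss_correct = pairwise_nll(2.0, 0.0)    # preferred box scored higher
loss_wrong = pairwise_nll(0.0, 2.0)      # preferred box scored lower
```

Minimizing such a loss pushes the model to score the preferred region above the dis-preferred one, without requiring any ground-truth bounding-box annotation.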


#210
Adversarial Robustness of Discriminative Self-Supervised Learning in Vision

Ömer Veysel Çağatan · Ömer TAL · M. Emre Gursoy

Self-supervised learning (SSL) has advanced significantly in visual representation learning, yet comprehensive evaluations of its adversarial robustness remain limited. In this study, we evaluate the adversarial robustness of seven discriminative self-supervised models and one supervised model across diverse tasks, including ImageNet classification, transfer learning, segmentation, and detection. Our findings suggest that discriminative SSL models generally exhibit better robustness to adversarial attacks compared to their supervised counterpart on ImageNet, with this advantage extending to transfer learning when using linear evaluation. However, when fine-tuning is applied, the robustness gap between SSL and supervised models narrows considerably. Similarly, this robustness advantage diminishes in segmentation and detection tasks. We also investigate how various factors might influence adversarial robustness, including architectural choices, training duration, data augmentations, and batch sizes. Our analysis contributes to the ongoing exploration of adversarial robustness in visual self-supervised representation systems.


#211
Hierarchical Variational Test-Time Prompt Generation for Zero-Shot Generalization

Zhaoyang Wu · Fang Liu · Licheng Jiao · Shuo Li · Lingling Li · Xu Liu · Puhua Chen · wenping ma

Vision-language models like CLIP have demonstrated strong zero-shot generalization, making them valuable for various downstream tasks through prompt learning. However, existing test-time prompt tuning methods, such as entropy minimization, treat both text and visual prompts as fixed learnable parameters, limiting their adaptability to unseen domains. In contrast, we propose Hierarchical Variational Test-Time Prompt Generation, a novel approach where both text and visual prompts are dynamically generated via a HyperTransformer at inference time. This enables the model to produce data-specific prompts for each modality, significantly improving generalization. To further address template sensitivity and distribution shifts, we introduce variational prompt generation, leveraging variational inference to mitigate biases introduced by different prompt templates and data augmentations. Additionally, our hierarchical variational prompt generation conditions prompts at each layer on those from previous layers, allowing the model to capture deeper contextual dependencies and refine prompt interactions for robust adaptation. Extensive experiments on domain generalization benchmarks demonstrate that our method significantly outperforms existing prompt-learning techniques, achieving state-of-the-art zero-shot accuracy while maintaining efficiency.


#212
TRNAS: A Training-Free Robust Neural Architecture Search

Yeming Yang · Qingling Zhu · Jianping Luo · Ka-Chun Wong · Qiuzhen Lin · Jianqiang Li

Deep Neural Networks (DNNs) have succeeded remarkably in various computer tasks. However, they remain vulnerable to adversarial attacks, which could lead to severe security risks. In recent years, robust neural architecture search (NAS) has gradually become an emerging direction for designing adversarially robust architectures. However, existing robust NAS methods rely on repeatedly training numerous DNNs to evaluate robustness, which makes the search process extremely expensive. In this paper, we propose a training-free robust NAS method (TRNAS) that significantly reduces search costs. First, we design a zero-cost proxy model (R-Score) that formalizes adversarial robustness evaluation by exploring the theory of DNN's linear activation capability and feature consistency. This proxy only requires initialized weights for evaluation, which avoids expensive adversarial training costs. Secondly, we introduce a multi-objective selection (MOS) strategy to save candidate architectures with robustness and compactness. Experimental results show that TRNAS only requires 0.02 GPU days to find a promising robust architecture in a vast search space including approximately $10^{20}$ networks. TRNAS surpasses other state-of-the-art robust NAS methods under both white-box and black-box attacks. Finally, we summarize a few meaningful conclusions for designing the robust architecture and promoting the development of the robust NAS field.


#213
Staining and Locking Computer Vision Models Without Retraining

Oliver Sutton · Qinghua Zhou · George Leete · Alexander Gorban · Ivan Tyukin

We introduce new methods of staining and locking computer vision models, to protect their owners' intellectual property. Staining, also known as watermarking, embeds secret behaviour into a model which can later be used to identify it, while locking aims to make a model unusable unless a secret trigger is inserted into input images. Unlike existing methods, our algorithms can be used to stain and lock pre-trained models without requiring fine-tuning or retraining, and come with provable, computable guarantees bounding their worst-case false positive rates. The stain and lock are implemented by directly modifying a small number of the model's weights and have minimal impact on the (unlocked) model's performance. Locked models are unlocked by inserting a small 'trigger patch' into the corner of the input image. We present experimental results showing the efficacy of our methods and demonstrating their practical performance on a variety of computer vision models.

Deep vision models have achieved remarkable classification performance by leveraging a hierarchical architecture in which human-interpretable concepts emerge through the composition of individual neurons across layers. Given the distributed nature of representations, pinpointing where specific concepts are encoded within a model remains a crucial yet challenging task in computer vision. In this paper, we introduce an effective circuit discovery method, called Granular Concept Circuits (GCCs), in which each circuit represents a concept relevant to a given query. Our method iteratively assesses inter-neuron connectivity, focusing on dependencies and semantic alignment, to construct each GCC. By automatically discovering multiple GCCs, each capturing specific concepts within that query, our approach offers a profound, concept-wise interpretation of models and is the first to identify circuits tied to specific visual concepts at a fine-grained level. We validate the versatility and effectiveness of GCCs across various deep image classification models. The source code will be publicly available.


#215
Federated Domain Generalization with Domain-specific Soft Prompts Generation

Jianhan Wu · Xiaoyang Qu · Zhangcheng Huang · Jianzong Wang

Prompt learning has become an efficient paradigm for adapting CLIP to downstream tasks. Compared with traditional fine-tuning, prompt learning optimizes a few parameters yet yields highly competitive results, especially appealing in federated learning for computational efficiency. In federated learning scenarios, data across different clients is often non-IID, leading to domain shift among clients, which poses a formidable challenge to the adaptation of downstream tasks. Federated domain generalization (FDG) methods typically learn fixed or residual soft prompts from training samples, replacing manually designed prompts to enhance the generalization ability of federated models. However, these learned prompts lack diversity and tend to ignore information about unknown domains. We propose a novel and effective method from a generative perspective for handling FDG tasks, namely federated domain generalization with domain-specific soft prompts generation (FedDSPG). Specifically, in the training phase, we introduce domain-specific soft prompts (DSPs) for each domain and integrate domain and content knowledge into the generative model among clients. In the inference phase, the generator is utilized to obtain DSPs for unseen target domains, thus guiding downstream tasks in unknown domains. Extensive experiments on several public datasets show that our method achieves state-of-the-art performance compared with the strong baselines in FDG.


#216
R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization

yi yang · Xiaoxuan He · Hongkun Pan · Xiyan Jiang · Yan Deng · Xingtao Yang · Haoyu Lu · Dacheng Yin · Fengyun Rao · Minfeng Zhu · Bo Zhang · Wei Chen

Large Language Models have demonstrated remarkable reasoning capability in complex textual tasks. However, multimodal reasoning, which requires integrating visual and textual information, remains a significant challenge. Existing visual-language models often struggle to effectively analyze and reason about visual content, resulting in suboptimal performance on complex reasoning tasks. Moreover, the absence of comprehensive benchmarks hinders the accurate assessment of multimodal reasoning capabilities. In this paper, we introduce R1-Onevision, a multimodal reasoning model designed to bridge the gap between visual perception and deep reasoning. To achieve this, we propose a cross-modal reasoning pipeline that transforms images into formal textual representations, enabling precise language-based reasoning. Leveraging this pipeline, we construct the R1-Onevision dataset which provides detailed, step-by-step multimodal reasoning annotations across diverse domains. We further develop the R1-Onevision model through supervised fine-tuning and reinforcement learning to cultivate advanced reasoning and robust generalization abilities. To comprehensively evaluate multimodal reasoning performance across different grades, we introduce R1-Onevision-Bench, a benchmark aligned with human educational stages, covering exams from junior high school to university and beyond. Experimental results show that R1-Onevision achieves state-of-the-art performance, outperforming models such as GPT-4o and Qwen2.5-VL on multiple challenging multimodal reasoning benchmarks.


#217
Boosting Adversarial Transferability via Negative Hessian Trace Regularization

Yunfei Long · Zilin Tian · Liguo Zhang · Huosheng Xu

Transferability makes black-box attacks practical. Recent studies demonstrate that adversarial examples situated at the flat maxima on the loss landscape tend to exhibit higher transferability and propose effective strategies to optimize adversarial examples to converge toward that region. However, these works primarily consider first-order gradient regularization and have yet to explore higher-order geometric properties of the flat loss landscape, which may lead to suboptimal results. In this work, we propose leveraging the trace of the Hessian matrix of the loss function with respect to the adversarial example as a curvature-aware regularizer. For computational efficiency, we introduce an approximation method for the trace based on stochastic estimation and finite difference. We theoretically and empirically demonstrate that the trace of Hessian matrices for adversarial examples near local loss maxima is consistently negative. Following this insight, we propose Negative Hessian Trace Regularization (NHTR), explicitly penalizing the negative Hessian trace to suppress curvature. Compared to existing first-order regularization methods, NHTR can generate adversarial examples at flatter local regions. Extensive experimental results on the ImageNet-compatible and CIFAR-10 datasets show that NHTR can significantly improve adversarial transferability compared with state-of-the-art attacks.
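
One common way to approximate a Hessian trace with "stochastic estimation and finite difference" is Hutchinson's estimator combined with a central difference of gradients: tr(H) is approximated by the average of v·(Hv) over random sign vectors v, with Hv ≈ (g(x + εv) − g(x − εv)) / (2ε). The abstract above does not state which estimator the authors use, so this choice, the toy loss, and all names below are assumptions for illustration only.

```python
import random


def hessian_trace_estimate(grad_fn, x, num_samples=2000, eps=1e-3, seed=0):
    """Hutchinson estimate of tr(H) using central differences of gradients."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        # Rademacher probe vector: each entry is +1 or -1.
        v = [rng.choice((-1.0, 1.0)) for _ in x]
        x_plus = [xi + eps * vi for xi, vi in zip(x, v)]
        x_minus = [xi - eps * vi for xi, vi in zip(x, v)]
        g_plus = grad_fn(x_plus)
        g_minus = grad_fn(x_minus)
        # Finite-difference Hessian-vector product.
        hv = [(gp - gm) / (2.0 * eps) for gp, gm in zip(g_plus, g_minus)]
        total += sum(vi * hvi for vi, hvi in zip(v, hv))
    return total / num_samples


def toy_grad(x):
    """Gradient of the toy loss L(x) = -(x0^2 + 2*x1^2 + 0.5*x0*x1).

    This loss has a local maximum at the origin, and its Hessian is
    [[-2, -0.5], [-0.5, -4]], whose exact trace is -6.
    """
    return [-(2.0 * x[0] + 0.5 * x[1]), -(4.0 * x[1] + 0.5 * x[0])]


trace = hessian_trace_estimate(toy_grad, [0.1, -0.2])
```

On this toy loss the estimate is close to the exact trace of -6, consistent with the abstract's observation that the Hessian trace is negative near a local loss maximum.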


#218
The Inter-Intra Modal Measure: A Predictive Lens on Fine-Tuning Outcomes in Vision-Language Models

Laura Niss · Kevin Vogt-Lowell · Theodoros Tsiligkaridis

The fine-tuning of large vision-language foundation models remains an underexplored area, particularly regarding its impact on learning gains and catastrophic forgetting. Inspired by the significance of modality gaps in contrastive dual-encoders, we introduce the Inter-Intra Modal Measure (IIMM)—a predictive metric that quantifies the relationship between intra-modal image embedding similarity and inter-modal misalignment. Through extensive empirical analysis across four state-of-the-art vision-language models and five fine-tuning techniques, we establish a strong linear relationship: tasks with higher IIMM scores yield greater in-domain performance improvements but suffer from more pronounced out-of-domain degradation, with some parameter-efficient fine-tuning (PEFT) methods exhibiting severe forgetting. Compared to existing transferability measures, the IIMM demonstrates significantly stronger predictive power for accuracy changes post fine-tuning in dual-encoder models. Moreover, we provide a theoretical bound, proving that changes in IIMM are limited by the Wasserstein distance between pre- and post-fine-tuning embedding distributions, ensuring its stability and robustness as a predictive measure. With only a single forward pass of the target data, practitioners can leverage this key insight to evaluate the degree to which a model can be expected to improve following fine-tuning. When combined with prior knowledge of a model’s performance across diverse tasks, the IIMM further enhances transferability predictions for novel tasks, offering a lightweight yet effective tool for guiding model adaptation strategies.


#219
Highlight
What to Distill? Fast Knowledge Distillation with Adaptive Sampling

Byungchul Chae · Seonyeong Heo

Knowledge Distillation (KD) has been established as an effective technique for reducing the resource requirements of models when tackling computer vision tasks. Prior work has studied how to distill the knowledge of a teacher model better, but it overlooks how data affects the distillation result. This work examines the impact of data in knowledge distillation from two perspectives: (i) quantity of knowledge and (ii) quality of knowledge. Our examination finds that faster knowledge distillation can be achieved by using data with a large amount of high-quality knowledge in distillation. Based on the findings, this work proposes an efficient adaptive sampling method called KDAS for faster knowledge distillation, which enhances the distillation efficiency by selecting and applying 'good' samples for the distillation. This work shows that our adaptive sampling method can effectively improve the training efficiency of a student model when combined with existing KD methods.


#220
Invisible Watermarks, Visible Gains: Steering Machine Unlearning with Bi-Level Watermarking Design

Yuhao Sun · Yihua Zhang · Gaowen Liu · Hongtao Xie · Sijia Liu

With the increasing demand for the right to be forgotten, machine unlearning (MU) has emerged as a vital tool for enhancing trust and regulatory compliance by enabling the removal of sensitive data influences from machine learning (ML) models. However, most MU algorithms primarily rely on in-training methods to adjust model weights, with limited exploration of the benefits that data-level adjustments could bring to the unlearning process. To address this gap, we propose a novel approach that leverages digital watermarking to facilitate MU by strategically modifying data content. By integrating watermarking, we establish a controlled unlearning mechanism that enables precise removal of specified data while maintaining model utility for unrelated tasks. We first examine the impact of watermarked data on MU, finding that MU effectively generalizes to watermarked data. Building on this, we introduce an unlearning-friendly watermarking framework, termed Water4MU, to enhance unlearning effectiveness. The core of Water4MU is a bi-level optimization (BLO) framework: at the upper level, the watermarking network is optimized to minimize unlearning difficulty, while at the lower level, the model itself is trained independently of watermarking. Experimental results demonstrate that Water4MU is effective in MU across both image classification and image generation tasks. Notably, it outperforms existing methods in challenging MU scenarios, known as ``challenging forgets''.


#221
Federated Continuous Category Discovery and Learning

Lixu Wang · Chenxi Liu · Junfeng Guo · Qingqing Ye · Heng Huang · Haibo Hu · Wei Dong

Federated Learning (FL) studies often assume a static data distribution, whereas real-world scenarios involve dynamic changes. To address this gap, we study Federated Continuous Category Discovery and Learning (FC^2DL)---an essential yet underexplored problem that enables FL models to evolve continuously by discovering and learning novel data categories. The key challenge in FC^2DL lies in merging and aligning categories discovered and learned by different clients, all while maintaining privacy. To tackle this, we propose the Global Prototype Alignment (GPA) framework. GPA first estimates the number of categories and constructs global prototypes by locating high-density regions in the representation space through bi-level clustering. To mitigate pseudo-label noise, GPA then employs a semantic-weighted loss to capture correlations between global prototypes and the novel data. This semantic weighting strategy is also used for contrastive loss, facilitating unsupervised novel-category learning. Besides, GPA incorporates a mixup-based mechanism for both data and models, effectively mitigating interference between known and novel categories while alleviating forgetting. Extensive experiments across multiple datasets demonstrate GPA’s superiority over state-of-the-art baseline approaches. Notably, GPA achieves absolute gains of 5.7\% to 13.1\% in novel category accuracy while preserving known category performance. Furthermore, GPA is highly adaptable, equipping various mainstream FL algorithms with category discovery and learning capabilities, underscoring its potential for real-world deployment.


#222
Highlight
Beyond Losses Reweighting: Empowering Multi-Task Learning via the Generalization Perspective

Hoang Phan · Tung Lam Tran · Quyen Tran · Ngoc Tran · Tuan Truong · Qi Lei · Nhat Ho · Dinh Phung · Trung Le

Multi-task learning (MTL) trains deep neural networks to optimize several objectives simultaneously using a shared backbone, which leads to reduced computational costs, improved data efficiency, and enhanced performance through cross-task knowledge sharing. Although recent gradient manipulation techniques seek a common descent direction to benefit all tasks, conventional empirical loss minimization still leaves models prone to overfitting and gradient conflicts. To address this, we introduce a novel MTL framework that leverages weight perturbation to regulate gradient norms and thus improve generalization. By carefully modulating weight perturbations, our approach harmonizes task-specific gradients, reducing conflicts and encouraging more robust learning across tasks. Theoretical insights reveal that controlling the gradient norm through weight perturbation directly contributes to better generalization. Extensive experiments across diverse applications demonstrate that our method significantly outperforms existing gradient-based MTL techniques in terms of task performance and overall model robustness.

Few-Shot Class-Incremental Learning (FSCIL) is challenged by limited data and expanding class spaces, leading to overfitting and catastrophic forgetting. Existing methods, which often freeze feature extractors and use Nearest Class Mean classifiers, sacrifice adaptability to new feature distributions. To address these issues, we propose Flexi-FSCIL, a semi-supervised framework that integrates three novel strategies: Adaptive Gated Residual Fusion (AGRF), Attention-Guided Dynamic Hybrid Distillation (ADHD), and Prototype Offset Equilibrium (POE). Flexi-FSCIL effectively balances stability and plasticity in FSCIL. AGRF resolves the rigidity of frozen feature extractors by integrating both frozen and trainable components, enabling adaptive feature learning while retaining old-class knowledge. ADHD tackles the imbalance between old and new tasks by dynamically aligning features using cross-attention maps and direct matching, preserving old-class knowledge while facilitating new-class learning. POE addresses the issue of prototype drift in semi-supervised settings by selecting high-quality unlabeled samples, maintaining feature space separability and preventing overfitting. Evaluated on three benchmark datasets, Flexi-FSCIL achieves state-of-the-art performance, significantly outperforming existing FSCIL methods with only a 12.97 performance drop on CUB200.


#224
FDPT: Federated Discrete Prompt Tuning for Black-Box Visual-Language Models

Jiaqi Wu · Simin Chen · Jing Tang · Yuzhe YANG · Yiming Chen · Lixu Wang · Song Lin · Zehua Wang · Wei Chen · Zijian Tian

General-purpose Vision-Language Models (VLMs) have driven major advancements in multimodal AI. Fine-tuning these models with task-specific data enhances adaptability to various downstream tasks but suffers from privacy risks. While potential solutions like federated learning can address user data privacy concerns, model protection is also essential. Other methods that rely on black-box VLM APIs usually require access to prediction logits, leaving them open to inversion attacks. Moreover, addressing the challenges of tuning complexity and data transmission efficiency in federated VLM scenarios is also crucial. To address these challenges, we propose FDPT, a federated discrete prompt tuning method utilizing black-box VLMs. During the client optimization stage, FDPT employs an agent-driven framework leveraging large language models (LLMs) with enhanced reasoning capacities to systematically optimize discrete prompt representations, and also utilizes feedback mechanisms and chain of thought to enhance prediction accuracy. Importantly, it performs optimization by relying not on the predicted logit vectors output by LLMs but on textual results, avoiding reverse attack risks. During the global aggregation stage, we mimic human electoral activities by employing evolutionary computation methods underpinned by semantic similarity computation to implement enhanced zero-order optimization for acquiring representative global tokens, thereby achieving knowledge aggregation. FDPT significantly outperforms nine state-of-the-art methods in image classification and visual question-answering, reducing communication overhead while generating highly transferable optimized prompts. Additionally, it exhibits improved robustness to data heterogeneity.


#225
PROL : Rehearsal Free Continual Learning in Streaming Data via Prompt Online Learning

Muhammad Anwar Ma'sum · Mahardhika Pratama · Savitha Ramasamy · Lin Liu · H Habibullah · Ryszard Kowalczyk

The data privacy constraint in online continual learning (OCL), where the data can be seen only once, complicates the catastrophic forgetting problem in streaming data. A common approach applied by the current SOTAs in OCL is to use a memory that saves exemplars or features from previous classes to be replayed in the current task. On the other hand, the prompt-based approach performs excellently in continual learning but at the cost of a growing number of trainable parameters. The first approach may not be applicable in practice due to data openness policy, while the second approach has the issue of throughput associated with the streaming data. In this study, we propose a novel prompt-based method for online continual learning that includes 4 main components: (1) a single light-weight prompt generator as general knowledge, (2) a trainable scaler-and-shifter as specific knowledge, (3) PTM generalization preserving, and (4) a hard-soft update mechanism. Our proposed method achieves significantly higher performance than the current SOTAs on the CIFAR100, ImageNet-R, ImageNet-A, and CUB datasets. Our complexity analysis shows that our method requires a relatively smaller number of parameters and achieves moderate training time, inference time, and throughput. For further study, the source code of our method is available at https://anonymous.4open.science/r/ICCV2025_ID15989/.


#226
Can Knowledge be Transferred from Unimodal to Multimodal? Investigating the Transitivity of Multimodal Knowledge Editing

Lingyong Fang · Xinzhong Wang · Depeng Wang · Zongru Wu · Ya Guo · Huijia Zhu · Zhuosheng Zhang · Gongshen Liu

Multimodal Large Language Models (MLLMs) contain a substantial amount of factual knowledge, which may become outdated or inaccurate over time. Consequently, various knowledge editing techniques have been proposed to update the knowledge encoded within these models. Previous approaches maintain modality consistency during both the editing and testing phases. However, in practical applications, it is desirable for knowledge to be transferable across different modalities, which can enhance the robustness of knowledge editing and potentially allow for cost-effective editing of multimodal knowledge using textual information. To address this, we introduce the concept of Transitivity of Multimodal Knowledge Editing (TMKE) and design corresponding evaluation criteria. Subsequently, we construct a corresponding TMKE Benchmark through an automated pipeline. We evaluate three MLLMs and five knowledge editing methods, uncovering limitations in the current models and methods concerning transitivity. Additionally, we analyze the intrinsic representations of the model during the editing process based on Knowledge Neurons to interpret the experimental phenomena.


#227
Scaling Laws for Native Multimodal Models

Mustafa Shukor · Enrico Fini · Victor Guilherme Turrisi da Costa · Matthieu Cord · Joshua Susskind · Alaaeldin El-Nouby

Building general-purpose models that can effectively perceive the world through multimodal signals has been a long-standing goal. Current approaches involve integrating separately pre-trained components, such as connecting vision encoders to LLMs and continuing training on multimodal data. While such approaches exhibit remarkable sample efficiency, it remains an open question whether such late-fusion architectures are inherently superior. In this work, we revisit the architectural design of native multimodal models (NMMs)—those trained from the ground up on all modalities—and conduct an extensive scaling laws study, spanning 457 trained models with different architectures and training mixtures. Our investigation reveals no inherent advantage to late-fusion architectures over early-fusion ones, which do not rely on pre-trained image encoders or tokenizers. On the contrary, early-fusion exhibits stronger performance at lower parameter count, is more efficient to train, and is easier to deploy. Motivated by the strong performance of the early-fusion architectures, we show that incorporating Mixture of Experts (MoEs) allows models to learn modality-specific weights, significantly benefiting performance.

In many robotics and VR/AR applications, fast camera motions cause a high level of motion blur, causing existing camera pose estimation methods to fail. In this work, we propose a novel framework that leverages motion blur as a rich cue for motion estimation rather than treating it as an unwanted artifact. Our approach works by predicting a dense motion flow field and a monocular depth map directly from a single motion-blurred image. We then recover the instantaneous camera velocity by solving a linear least squares problem under the small motion assumption. In essence, our method produces an IMU-like measurement that robustly captures fast and aggressive camera movements. To train our model, we construct a large-scale dataset with realistic synthetic motion blur derived from ScanNet++v2 and further refine our model by training end-to-end on real data using our fully differentiable pipeline. Extensive evaluations on real-world benchmarks demonstrate that our method achieves state-of-the-art angular and translational velocity estimates, outperforming current methods like MASt3R and COLMAP.
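
The abstract states that camera velocity is recovered by solving a linear least squares problem from a flow field and a depth map under the small motion assumption. A minimal sketch of that kind of solve is given below, using the classical instantaneous motion-field model. The assumptions are mine, not the paper's: normalized image coordinates with unit focal length, a static scene, and the sign convention that a scene point moves as `-t - w x P` in the camera frame. The function names are hypothetical.

```python
import numpy as np

def motion_field_matrix(x, y, inv_depth):
    """Build the (2N x 6) matrix A such that flow = A @ [tx, ty, tz, wx, wy, wz].

    x, y: normalized image coordinates (unit focal length), shape (N,).
    inv_depth: 1 / Z for each pixel, shape (N,).
    The first N rows predict horizontal flow, the last N rows vertical flow.
    """
    zeros = np.zeros_like(x)
    rows_u = np.stack(
        [-inv_depth, zeros, x * inv_depth, x * y, -(1.0 + x**2), y], axis=1
    )
    rows_v = np.stack(
        [zeros, -inv_depth, y * inv_depth, 1.0 + y**2, -x * y, -x], axis=1
    )
    return np.concatenate([rows_u, rows_v], axis=0)

def solve_velocity(x, y, depth, flow_u, flow_v):
    """Least-squares estimate of translational and angular camera velocity."""
    A = motion_field_matrix(x, y, 1.0 / depth)
    b = np.concatenate([flow_u, flow_v])
    solution, *_ = np.linalg.lstsq(A, b, rcond=None)
    return solution[:3], solution[3:]
```

Because depth enters only through the translational columns, a monocular depth map with an unknown global scale would leave the translation recoverable only up to that scale, while the rotational part is unaffected.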


#229
Visual-Oriented Fine-Grained Knowledge Editing for MultiModal Large Language Models

Zhen Zeng · Leijiang Gu · Xun Yang · Zhangling Duan · Zenglin Shi · Meng Wang

Existing knowledge editing works for MultiModal Large Language Models primarily focus on text-oriented, coarse-grained scenarios, where modifying textual content alone is sufficient. As a result, they fail to capture the unique challenges of multimodal editing, particularly when visual information is central to knowledge representation. In this paper, we introduce a visual-oriented, fine-grained multimodal knowledge editing task that targets precise modifications in images containing multiple interacting entities. To support this, we propose the Fine-Grained Visual Knowledge Editing (FGVEdit) benchmark, designed to evaluate the accuracy and effectiveness of multimodal editing at a granular level. To address this challenge, we present the Multimodal Scope Classifier-based Knowledge Editor (MSCKE), a new framework that leverages a multimodal scope classifier to integrate both textual and visual information. By accurately identifying and updating knowledge localized within images, MSCKE ensures precise editing while preserving unrelated content. Extensive experiments on the FGVEdit benchmark highlight the complexity of this new task and demonstrate that existing methods struggle with fine-grained multimodal editing. Our results highlight MSCKE as a scalable and promising framework for advancing multimodal knowledge editing.


#230
Dynamic Multimodal Prototype Learning in Vision-Language Models

Xingyu Zhu · Shuo Wang · Beier Zhu · Miaoge Li · Yunfan Li · Junfeng Fang · Zhicai Wang · Dongsheng Wang · Hanwang Zhang

With the increasing attention to pre-trained vision-language models (VLMs), e.g., CLIP, substantial efforts have been devoted to many downstream tasks, especially in test-time adaptation (TTA). However, previous works focus on learning prototypes only in the textual modality while overlooking the ambiguous semantics in class names. These ambiguities lead to textual prototypes that are insufficient to capture visual concepts, resulting in limited performance. To address this issue, we introduce ProtoMM, a training-free framework that constructs multimodal prototypes to adapt VLMs during the test time. By viewing the prototype as a discrete distribution over the textual descriptions and visual particles, ProtoMM has the ability to combine the multimodal features for comprehensive prototype learning. More importantly, the visual particles are dynamically updated as the testing stream flows. This allows our multimodal prototypes to continually learn from the data, enhancing their generalizability in unseen scenarios. In addition, we quantify the importance of the prototypes and test images by formulating their semantic distance as an optimal transport problem. Extensive experiments on 15 zero-shot benchmarks demonstrate the effectiveness of our method, achieving a 1.03\% average accuracy improvement over state-of-the-art methods on ImageNet and its variant datasets.


#231
Visual Intention Grounding for Egocentric Assistants

Pengzhan Sun · Junbin Xiao · Tze Ho Elden Tse · Yicong Li · Arjun Akula · Angela Yao

Visual grounding associates textual descriptions with objects in an image. Conventional methods target third-person image inputs and named object queries. In applications such as AI assistants, the perspective shifts -- inputs are egocentric, and objects may be referred to implicitly through needs and intentions. To bridge this gap, we introduce EgoIntention, the first dataset for egocentric visual intention grounding. EgoIntention challenges multimodal LLMs to 1) understand and ignore unintended contextual objects and 2) reason about uncommon object functionalities. Benchmark results show that current models misidentify context objects and lack affordance understanding in egocentric views. We also propose Reason-to-Ground (RoG) instruction tuning; it enables hybrid training with normal descriptions and egocentric intentions with a chained intention reasoning and object grounding mechanism. RoG significantly outperforms naive finetuning and hybrid training on EgoIntention, while maintaining or slightly improving naive description grounding. This advancement enables unified visual grounding for egocentric and exocentric visual inputs while handling explicit object queries and implicit human intentions.


#232
Highlight
Unleashing Vecset Diffusion Model for Fast Shape Generation

Zeqiang Lai · Zhao Yunfei · Zibo Zhao · Haolin Liu · Fu-Yun Wang · Huiwen Shi · Xianghui Yang · Qingxiang Lin · Jingwei Huang · Lliu Yuhong · Jie Jiang · Chunchao Guo · Xiangyu Yue

3D shape generation has greatly flourished through the development of so-called "native" 3D diffusion, particularly through the Vecset Diffusion Model (VDM). While recent advancements have shown promising results in generating high-resolution 3D shapes, VDM still struggles with high-speed generation. Challenges exist because of difficulties not only in accelerating diffusion sampling but also in accelerating VAE decoding in VDM, areas under-explored in previous works. To address these challenges, we present FlashVDM, a systematic framework for accelerating both VAE and DiT in VDM. For DiT, FlashVDM enables flexible diffusion sampling with as few as 5 inference steps, while maintaining comparable quality, which is made possible by stabilizing consistency distillation with our newly introduced Progressive Flow Distillation technique. For VAE, we introduce a lightning vecset decoder equipped with Adaptive KV Selection, Hierarchical Volume Decoding, and Efficient Network Design. By exploiting the locality of vecset and the sparsity of shape surface in the volume, the proposed decoder drastically lowers FLOPs, minimizing the overall decoding overhead. We apply FlashVDM to the current state-of-the-art open-source shape generation model Hunyuan3D-2, resulting in Hunyuan3D-2 Turbo. Through systematic evaluation for both generation and reconstruction, we demonstrate that our model outperforms existing fast 3D generation methods by a significant margin, achieving comparable performance to the state-of-the-art models while reducing inference time by over 45x for reconstruction and 32x for generation. Code and models will be made publicly available.


#233
INTER: Mitigating Hallucination in Large Vision-Language Models by Interaction Guidance Sampling

Xin Dong · Shichao Dong · Jin Wang · Jing Huang · Li Zhou · Zenghui Sun · Lihua Jing · Jinsong Lan · Xiaoyong Zhu · Bo Zheng

Hallucinations in large vision-language models (LVLMs) pose significant challenges for real-world applications, as LVLMs may generate responses that appear plausible yet remain inconsistent with the associated visual content. This issue rarely occurs in human cognition. We argue that this discrepancy arises from humans' ability to effectively leverage multimodal interaction information in data samples. Specifically, humans typically first gather multimodal information, analyze the interactions across modalities for understanding, and then express their understanding through language. Motivated by this observation, we conduct extensive experiments on popular LVLMs and obtain insights that surprisingly reveal human-like, though less pronounced, cognitive behavior of LVLMs on multimodal samples. Building on these findings, we further propose $\textbf{INTER}: \textbf{Inter}$action Guidance Sampling, a novel training-free algorithm that mitigates hallucinations without requiring additional data. Specifically, INTER explicitly guides LVLMs to effectively reapply their understanding of multimodal interaction information when generating responses, thereby reducing potential hallucinations. On six benchmarks including VQA and image captioning tasks, INTER achieves an average improvement of up to 3.4\% on five LVLMs compared to the state-of-the-art decoding strategy. The code will be released when the paper is accepted.


#234
VisRL: Intention-Driven Visual Perception via Reinforced Reasoning

Zhangquan Chen · Xufang Luo · Dongsheng Li

Visual understanding is inherently intention-driven: humans selectively focus on different regions of a scene based on their goals. Recent advances in large multimodal models (LMMs) enable flexible expression of such intentions through natural language, allowing queries to guide visual reasoning processes. Frameworks like Visual Chain-of-Thought have demonstrated the benefit of incorporating explicit reasoning steps, where the model predicts a focus region before answering a query. However, existing approaches rely heavily on supervised training with annotated intermediate bounding boxes, which severely limits scalability due to the combinatorial explosion of intention-region pairs. To overcome this limitation, we propose VisRL, the first framework that applies reinforcement learning (RL) to the problem of intention-driven visual perception. VisRL optimizes the entire visual reasoning process using only reward signals. By treating intermediate focus selection as an internal decision optimized through trial-and-error, our method eliminates the need for costly region annotations while aligning more closely with how humans learn to perceive the world. Extensive experiments across multiple benchmarks show that VisRL consistently outperforms strong baselines, demonstrating both its effectiveness and its strong generalization across different LMMs.


#235
Auto-Regressively Generating Multi-View Consistent Images

JiaKui Hu · Yuxiao Yang · Jialun Liu · Jinbo Wu · Chen Zhao · Yanye Lu

Generating multi-view images from human instructions is crucial for 3D content creation. The primary challenges involve maintaining consistency across multiple views and effectively synthesizing shapes and textures under diverse conditions. In this paper, we propose the Multi-View Auto-Regressive (\textbf{MV-AR}) method, which leverages an auto-regressive model to progressively generate consistent multi-view images from arbitrary prompts. Firstly, the next-token-prediction capability of the AR model significantly enhances its effectiveness in facilitating progressive multi-view synthesis. When generating widely-separated views, MV-AR can utilize all its preceding views to extract effective reference information. Subsequently, we propose a unified model that accommodates various prompts via architecture design and training strategies. To address multiple conditions, we introduce condition injection modules for text, camera pose, image, and shape. To manage multi-modal conditions simultaneously, a progressive training strategy is employed. This strategy initially adopts the text-to-multi-view (t2mv) model as a baseline to enhance the development of a comprehensive X-to-multi-view (X2mv) model by randomly dropping and combining conditions. Finally, to alleviate the overfitting problem caused by limited high-quality data, we propose the ``Shuffle View" data augmentation technique, thus significantly expanding the training data by several orders of magnitude. Experiments demonstrate the performance and versatility of our MV-AR, which reliably generates consistent multi-view images across a range of conditions and performs on par with leading diffusion-based multi-view image generation models.

Parameter-level model merging is an emerging paradigm in multi-task learning with significant promise. Previous research has explored its connections with prediction-level model ensembling (commonly viewed as the upper bound for merging) to reveal the potential of achieving performance consistency between the two. However, this observation relies on certain preconditions, such as being limited to two models, using ViT-based models, and requiring all models to be fine-tuned from the same pre-trained checkpoint. To further understand the intrinsic connections between these two paradigms, this paper explores an interesting possibility: If these restrictions are removed, can performance consistency still be achieved between merging and ensembling? To answer this question, we first theoretically establish a performance correlation between merging and ensembling. We find that even when previous restrictions are not met, there is still a way for model merging to attain a near-identical and superior performance similar to that of ensembling. To verify whether our findings are practical, we introduce a validation framework termed \underline{Neu}ral \underline{Lig}and (NeuLig). The learning process of NeuLig is meticulously designed with a specialized loss function supported by theoretical foundations. Experimental results demonstrate the robust resilience of NeuLig in terms of both model scale and the number of collaborating models. For instance, for the case involving 5 CLIP-ViT-B/32 models, parameter-level merging achieves the same performance as prediction-level ensembling (merging: 95.44\% vs. ensembling: 95.46\%).
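
To make the distinction between the two paradigms concrete, the sketch below contrasts parameter-level merging (average the weights, then predict) with prediction-level ensembling (predict with each model, then average the outputs). It is a toy illustration on random weights, not NeuLig: for purely linear models the two coincide exactly, while a single ReLU layer is already enough to break the equality. All function names are hypothetical.

```python
import numpy as np

def linear_ensemble(x, weights):
    """Average the outputs of several linear models."""
    return np.mean([x @ w for w in weights], axis=0)

def linear_merged(x, weights):
    """Average the weight matrices first, then apply the single merged model."""
    return x @ np.mean(weights, axis=0)

def mlp(x, w1, w2):
    """One-hidden-layer ReLU network without biases."""
    return np.maximum(x @ w1, 0.0) @ w2

def mlp_ensemble(x, models):
    """Average the outputs of several ReLU networks."""
    return np.mean([mlp(x, w1, w2) for w1, w2 in models], axis=0)

def mlp_merged(x, models):
    """Average each layer's weights across models, then run one network."""
    w1 = np.mean([m[0] for m in models], axis=0)
    w2 = np.mean([m[1] for m in models], axis=0)
    return mlp(x, w1, w2)

def merge_ensemble_gap(x, models):
    """Largest absolute output difference between merging and ensembling."""
    return float(np.abs(mlp_ensemble(x, models) - mlp_merged(x, models)).max())
```

The nonzero gap for nonlinear networks is what makes consistency between merging and ensembling a nontrivial goal once the models are no longer close to a shared checkpoint.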


#237
Debiased Teacher for Day-to-Night Domain Adaptive Object Detection

Yiming Cui · Liang Li · Haibing YIN · Yuhan Gao · Yaoqi Sun · Chenggang Yan

Day-to-Night Domain Adaptive Object Detection (DN-DAOD) is a significant challenge due to the low visibility and signal-to-noise ratio at night. Although recent self-training approaches achieve promising results, they fail to address three critical biases: distribution bias, training bias, and confirmation bias. Therefore, we propose a Debiased Teacher to address the above biases from three aspects: domain transforming, representation compensating, and pseudo label calibrating. Concretely, the day-to-night domain transforming module (DNDT) leverages physical priors to model some key day-night domain differences, thus transforming daytime images into night-like images. Then, the cross-domain representation compensating module (CDRC) selectively mixes objects from nighttime and night-like images to compensate for the model’s general representation of nighttime objects. Further, to correct confirmation bias caused by learning from inaccurate pseudo labels, the pseudo label confirmation calibrating module (ConCal) is designed to obtain accurate pseudo labels for better nighttime knowledge learning. Experimental results on three benchmarks demonstrate that our method outperforms current SOTA methods by a large margin. Our code is released in supplementary materials.


#238
From Easy to Hard: Progressive Active Learning Framework for Infrared Small Target Detection with Single Point Supervision

Chuang Yu · Jinmiao Zhao · Yunpeng Liu · Sicheng Zhao · Yimian Dai · Xiangyu Yue

Recently, single-frame infrared small target (SIRST) detection with single point supervision has drawn widespread attention. However, the latest label evolution with single point supervision (LESPS) framework suffers from instability, excessive label evolution, and difficulty in exerting embedded network performance. Inspired by organisms gradually adapting to their environment and continuously accumulating knowledge, we construct an innovative Progressive Active Learning (PAL) framework for single point supervision, which drives existing SIRST detection networks to progressively and actively recognize and learn more hard samples to achieve significant performance improvements. Specifically, to avoid the early low-performance model leading to the wrong selection of hard samples, we propose a model pre-start concept, which focuses on automatically selecting a portion of easy samples and helping the model have basic task-specific learning capabilities. Meanwhile, we propose a refined dual-update strategy, which can promote reasonable learning of harder samples and continuous refinement of pseudo-labels. In addition, to alleviate the risk of excessive label evolution, a decay factor is introduced, which helps to achieve a dynamic balance between the expansion and contraction of target annotations. Extensive experiments show that existing SIRST detection networks equipped with our PAL framework have achieved state-of-the-art (SOTA) results on multiple public datasets. Furthermore, our PAL framework can build an efficient and stable bridge between full supervision and single point supervision tasks. Our code will be open source.


#239
Disentangled World Models: Learning to Transfer Semantic Knowledge from Distracting Videos for Reinforcement Learning

Qi Wang · Zhipeng Zhang · Baao Xie · Xin Jin · Yunbo Wang · Shiyu Wang · Liaomo Zheng · Xiaokang Yang · Wenjun Zeng

Training visual reinforcement learning (RL) in practical scenarios presents a significant challenge, i.e., RL agents suffer from low sample efficiency in environments with variations. While various approaches have attempted to alleviate this issue through disentangled representation learning, these methods usually start learning from scratch without prior knowledge of the world. This paper, in contrast, tries to learn and understand underlying semantic variations from distracting videos via offline-to-online latent distillation and flexible disentanglement constraints. To enable effective cross-domain semantic knowledge transfer, we introduce an interpretable model-based RL framework, dubbed Disentangled World Models (DisWM). Specifically, we pretrain the action-free video prediction model offline with disentanglement regularization to extract semantic knowledge from distracting videos. The disentanglement capability of the pretrained model is then transferred to the world model through latent distillation. For finetuning in the online environment, we exploit the knowledge from the pretrained model and introduce a disentanglement constraint to the world model. During the adaptation phase, the incorporation of actions and rewards from online environment interactions enriches the diversity of the data, which in turn strengthens the disentangled representation learning. Experimental results validate the superiority of our approach on various benchmarks.


#240
Highlight
Stable-Sim2Real: Exploring Simulation of Real-Captured 3D Data with Two-Stage Depth Diffusion

Mutian Xu · Chongjie Ye · Haolin Liu · Yushuang Wu · Jiahao Chang · Xiaoguang Han

3D data simulation aims to bridge the gap between simulated and real-captured 3D data, which is a fundamental problem for real-world 3D visual tasks. Most 3D data simulation methods inject predefined physical priors but struggle to capture the full complexity of real data. An optimal approach involves learning an implicit mapping from synthetic to realistic data in a data-driven manner, but progress on this approach has stagnated in recent studies. This work explores a new solution path of data-driven 3D simulation, called Stable-Sim2Real, based on a novel two-stage depth diffusion model. The initial stage finetunes StableDiffusion to generate the residual between the real and synthetic paired depth, producing a stable but coarse depth, where some local regions may deviate from realistic patterns. To enhance this, both the synthetic and initial output depth are fed into a second-stage diffusion, where diffusion loss is adjusted to prioritize these distinct areas identified by a 3D discriminator. We provide a new benchmark scheme to evaluate 3D data simulation methods. Extensive experiments show that training the network with the 3D simulated data derived from our method significantly enhances performance in real-world 3D visual tasks. Moreover, the evaluation demonstrates the high similarity between our 3D simulated data and real-captured patterns.

Hyperparameter optimization (HPO) is a billion-dollar problem in machine learning, which significantly impacts the training efficiency and model performance. However, achieving efficient and robust HPO in deep reinforcement learning (RL) is consistently challenging due to its high non-stationarity and computational cost. To tackle this problem, existing approaches attempt to adapt common HPO techniques (e.g., population-based training or Bayesian optimization) to the RL scenario. However, they remain sample-inefficient and computationally expensive, which limits their use in a wide range of applications. In this paper, we propose ULTHO, an ultra-lightweight yet powerful framework for fast HPO in deep RL within single runs. Specifically, we formulate the HPO process as a multi-armed bandit with clustered arms (MABC) and link it directly to long-term return optimization. ULTHO also provides a quantified and statistical perspective to filter the HPs efficiently. We test ULTHO on benchmarks including ALE, Procgen, MiniGrid, and PyBullet. Extensive experiments demonstrate that ULTHO can achieve superior performance with a simple architecture, contributing to the development of advanced and automated RL systems.
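
The clustered-bandit view of HPO can be illustrated with a minimal sketch. This is not ULTHO itself: it assumes a plain two-level UCB1 rule (one bandit picks which hyperparameter cluster to act on, a second picks a value inside it), and the class `UCB1`, the function `tune`, and the reward function are hypothetical names introduced only for illustration. Rewards are assumed to lie in [0, 1].

```python
import math


class UCB1:
    """UCB1 bandit over a fixed, non-empty set of hashable options."""

    def __init__(self, options):
        self.options = list(options)
        self.counts = {o: 0 for o in self.options}
        self.means = {o: 0.0 for o in self.options}
        self.total = 0

    def select(self):
        # Play every option once before trusting the confidence bound.
        for o in self.options:
            if self.counts[o] == 0:
                return o
        return max(
            self.options,
            key=lambda o: self.means[o]
            + math.sqrt(2.0 * math.log(self.total) / self.counts[o]),
        )

    def update(self, option, reward):
        self.total += 1
        self.counts[option] += 1
        # Incremental running mean of the observed rewards.
        self.means[option] += (reward - self.means[option]) / self.counts[option]


def tune(clusters, reward_fn, steps):
    """clusters: {hp_name: [candidate values]}; reward_fn(name, value) -> float."""
    top = UCB1(clusters)  # chooses which hyperparameter cluster to act on
    inner = {name: UCB1(values) for name, values in clusters.items()}
    for _ in range(steps):
        name = top.select()
        value = inner[name].select()
        reward = reward_fn(name, value)  # stand-in for a return estimate
        inner[name].update(value, reward)
        top.update(name, reward)
    return top, inner
```

In an RL run, `reward_fn` would be replaced by some estimate of return obtained after training briefly with the chosen value; how that estimate is formed is exactly what the sketch leaves open.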


#242
Task-Aware Prompt Gradient Projection for Parameter-Efficient Tuning Federated Class-Incremental Learning

Hualong Ke · Yachao Zhang · Jiangming Shi · Fangyong Wang · Yuan Xie · Yanyun Qu

Federated Continual Learning (FCL) has recently garnered significant attention due to its ability to continuously learn new tasks while protecting user privacy. However, existing Data-Free Knowledge Transfer (DFKT) methods require training the entire model, leading to high training and communication costs, while prompt pool-based methods, which access other task-specific prompts in the pool, may pose a privacy leakage risk. To address these challenges, we propose a novel method: Task-aware Prompt gradient Projection and Replay (TPPR), which leverages visual prompts to build a parameter-efficient tuning architecture, thereby significantly reducing training and communication costs. Specifically, we propose the Task-Aware Prompt Gradient Projection (TAPGP) mechanism, from the perspective of protecting learned knowledge, to balance the learning of task-agnostic and task-specific knowledge in a pool-free manner. In practice, we make the gradient of the deep prompts orthogonal to the virtual data and prompts of preceding tasks, which prevents the erosion of old task knowledge while allowing the model to learn new information. Additionally, we introduce Dual-Level Prompt Replay (DLPR) based on exponential moving average to facilitate knowledge review at both inter-task and intra-task levels, effectively inheriting learned knowledge. Extensive experimental results demonstrate that our method effectively reduces model communication overhead and alleviates forgetting while fully protecting privacy. With only 1% of the training parameters, we achieve accuracy improvements of more than 5% over SOTA methods with the same backbone in all settings.
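
The idea of making a gradient orthogonal to directions tied to earlier tasks can be sketched in a few lines. The following is a generic orthogonal-projection example, not the TAPGP mechanism as implemented by the authors: how the protected directions are derived from virtual data and prompts is not specified here, so `old_directions` is simply assumed to be a list of vectors, and all function names are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def orthonormal_basis(vectors, eps=1e-10):
    """Gram-Schmidt; vectors that are (numerically) dependent are dropped."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            coeff = dot(w, b)
            w = [wi - coeff * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > eps:
            basis.append([wi / norm for wi in w])
    return basis


def project_out(grad, old_directions):
    """Return grad minus its component in span(old_directions)."""
    g = list(grad)
    for b in orthonormal_basis(old_directions):
        coeff = dot(g, b)
        g = [gi - coeff * bi for gi, bi in zip(g, b)]
    return g


# Assumed toy values: two protected directions in a 3-dimensional space.
old = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
update = project_out([3.0, 4.0, 5.0], old)  # only the third coordinate survives
```

An update step taken along `update` leaves inner products with every protected direction unchanged to first order, which is the sense in which old-task knowledge is shielded.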


#243
Is Visual in-Context Learning for Compositional Medical Tasks within Reach?

Simon Reiß · Zdravko Marinov · Alexander Jaus · Constantin Seibold · M. Sarfraz · Erik Rodner · Rainer Stiefelhagen

In this paper, we explore the potential of visual in-context learning to enable a single model to handle multiple tasks and adapt to new tasks during test time without re-training. Unlike previous approaches, our focus is on training in-context learners to adapt to sequences of tasks, rather than individual tasks. Our goal is to solve complex tasks that involve multiple intermediate steps using a single model, allowing users to define entire vision pipelines flexibly at test time. To achieve this, we first examine the properties and limitations of visual in-context learning architectures, with a particular focus on the role of codebooks. We then introduce a novel method for training in-context learners using a synthetic compositional task generation engine. This engine bootstraps task sequences from arbitrary segmentation datasets, enabling the training of visual in-context learners for compositional tasks. Additionally, we investigate different masking-based training objectives to gather insights into how to train models better for solving complex, compositional tasks. Our exploration not only provides important insights, especially for multi-modal medical task sequences, but also highlights challenges that need to be addressed.


#244
Effective Training Data Synthesis for Improving MLLM Chart Understanding

Yuwei Yang · Zeyu Zhang · Yunzhong Hou · Zhuowan Li · Gaowen Liu · Ali Payani · Yuan-Sen Ting · Liang Zheng

Being able to effectively read scientific plots, or chart understanding, is central to building effective agents for science. However, existing multimodal large language models (MLLMs), especially open-source ones, are still falling behind with a typical success rate of 30%-50% on challenging benchmarks. Previous studies on fine-tuning MLLMs with synthetic charts are often restricted by their inadequate similarity to real charts, which could compromise model training and performance on complex real-world charts. In this study, we show that modularizing chart generation and diversifying visual details improves chart understanding capabilities. In particular, we design a five-step data synthesis pipeline, where we separate data and function creation for single plot generation, condition the generation of later subplots on earlier ones for multi-subplot figures, visually diversify the generated figures, filter out low quality data, and finally generate the question-answer (QA) pairs with GPT-4o. This approach allows us to streamline the generation of fine-tuning datasets and introduce the Effective Chart Dataset (ECD), which contains 10k+ chart images and 300k+ QA pairs, covering 25 topics and featuring 250+ chart type combinations with high visual complexity. We show that ECD consistently improves the performance of various MLLMs on a range of real-world and synthetic test sets.

In class-imbalanced learning (CIL), post-hoc logit adjustment (LA) effectively mitigates class imbalance by adjusting biased logits according to label frequencies. Given the success of LA in CIL, recent class-imbalanced semi-supervised learning (CISSL) algorithms incorporated LA, leading to improved performance when labeled and unlabeled datasets share the same class distribution. However, a common real-world scenario involves the unknown class distribution of the unlabeled set, which may mismatch that of the labeled set. In this case, LA may result in an inappropriate degree of logit adjustments, potentially degrading classification performance due to its inability to incorporate the unknown class distribution of the unlabeled set. To address this problem, we propose a novel CISSL algorithm named learnable logit adjustment (LLA). Unlike the original LA, LLA learns the appropriate degree of logit adjustment by minimizing the class-averaged loss computed for both the labeled and unlabeled sets. Based on the learned degree, LLA refines the biased pseudo-labels of base semi-supervised learning algorithms and adjusts the biased class predictions on the test set by adjusting the logits. Experimental results on benchmark datasets demonstrate that LLA achieves state-of-the-art performance in CISSL.
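
For readers unfamiliar with post-hoc logit adjustment, a minimal sketch follows. It assumes the common form in which `tau * log(prior)` is subtracted from each class logit, with priors estimated from label counts; `tau` plays the role of the "degree" of adjustment. Here `tau` is a fixed input, whereas LLA is described as learning it, and the learning step is not shown. The function names and the toy numbers are illustrative assumptions.

```python
import math


def adjust_logits(logits, class_counts, tau=1.0):
    """Subtract tau * log(class prior) from each logit."""
    total = sum(class_counts)
    return [z - tau * math.log(n / total) for z, n in zip(logits, class_counts)]


def predict(logits):
    return max(range(len(logits)), key=lambda k: logits[k])


# Assumed toy case: class 0 is nine times more frequent than class 1.
raw = [2.0, 1.5]
counts = [900, 100]
before = predict(adjust_logits(raw, counts, tau=0.0))  # no adjustment
after = predict(adjust_logits(raw, counts, tau=1.0))   # full adjustment
```

With these numbers the unadjusted prediction is the frequent class and the adjusted one is the rare class, which shows why the degree `tau` matters when the unlabeled class distribution is unknown.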


#246
Highlight
Where, What, Why: Towards Explainable Driver Attention Prediction

Yuchen Zhou · Jiayu Tang · Xiaoyan Xiao · Yueyao Lin · Linkai Liu · Zipeng Guo · Hao Fei · Xiaobo Xia · Chao Gou

Modeling task-driven attention in driving is a fundamental challenge for both autonomous vehicles and cognitive science. Existing methods primarily predict where drivers look by generating spatial heatmaps, but fail to capture the cognitive motivations behind attention allocation in specific contexts, which limits deeper understanding of attention mechanisms. To bridge this gap, we introduce Explainable Driver Attention Prediction, a novel task paradigm that jointly predicts spatial attention regions (where), parses attended semantics (what), and provides cognitive reasoning for attention allocation (why). To support this, we present W³DA, the first large-scale explainable driver attention dataset. It enriches existing benchmarks with detailed semantic and causal annotations across diverse driving scenarios, including normal conditions, safety-critical situations, and traffic accidents. We further propose LLada, a Large Language model-driven framework for driver attention prediction, which unifies pixel modeling, semantic parsing, and cognitive reasoning within an end-to-end architecture. Extensive experiments demonstrate the effectiveness of LLada, exhibiting robust generalization across datasets and driving conditions. This work serves as a key step toward a deeper understanding of driver attention mechanisms, with significant implications for autonomous driving, intelligent driver training, and human-computer interaction. The dataset, code, and models will be released.


#247
Heuristic-Induced Multimodal Risk Distribution Jailbreak Attack for Multimodal Large Language Models

Ma Teng · Xiaojun Jia · Ranjie Duan · Xinfeng Li · Yihao Huang · Xiaoshuang Jia · Zhixuan Chu · Wenqi Ren

With the rapid advancement of multimodal large language models (MLLMs), concerns regarding their security have increasingly captured the attention of both academia and industry. Although MLLMs are vulnerable to jailbreak attacks, designing effective jailbreak attacks poses unique challenges, especially given the highly constrained adversarial capabilities in real-world deployment scenarios. Previous works concentrate risks into a single modality, resulting in limited jailbreak performance. In this paper, we propose a heuristic-induced multimodal risk distribution jailbreak attack method, called HIMRD, which is black-box and consists of two elements: multimodal risk distribution strategy and heuristic-induced search strategy. The multimodal risk distribution strategy is used to distribute harmful semantics into multiple modalities to effectively circumvent the single-modality protection mechanisms of MLLMs. The heuristic-induced search strategy identifies two types of prompts: the understanding-enhancing prompt, which helps MLLMs reconstruct the malicious prompt, and the inducing prompt, which increases the likelihood of affirmative outputs over refusals, enabling a successful jailbreak attack. HIMRD achieves an average attack success rate (ASR) of 90% across seven open-source MLLMs and an average ASR of around 68% in three closed-source MLLMs. HIMRD reveals cross-modal security vulnerabilities in current MLLMs and underscores the imperative for developing defensive strategies to mitigate such emerging risks.


#248
Hypergraph Clustering Network with Partial Attribute Imputation

Qianqian Wang · Bowen Zhao · Zhengming Ding · Wei Feng · Quanxue Gao

Existing hypergraph clustering methods typically assume that node attributes are fully available. However, in real-world scenarios, missing node attributes are common due to factors such as data privacy concerns or failures in data collection devices. While some approaches attempt to handle missing attributes in traditional graphs, they are not designed for hypergraphs, which encode higher-order relationships and introduce additional challenges. To bridge this gap, we propose Hypergraph Clustering Network with Partial Attribute Imputation (HCN-PAI). Specifically, we first leverage higher-order neighborhood propagation to impute missing node attributes by minimizing the Dirichlet energy, ensuring smooth feature propagation across the hypergraph. Next, we introduce a hypergraph smoothing preprocessing that efficiently captures structural information, replacing the hypergraph convolution operation, and significantly reducing computational costs. Finally, we design a dual-space projection contrast mechanism, which employs two independent MLPs to encode node representations into two distinct views and enforces consistency at both node and hyperedge levels. Extensive experiments on multiple benchmark datasets validate the effectiveness and superiority of our proposed method.
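
Imputation by minimizing a Dirichlet energy can be sketched as a simple propagation loop. The example below is an assumption-laden illustration, not the HCN-PAI procedure: it uses scalar attributes and a clique-expansion energy in which every pair of nodes sharing a hyperedge `e` contributes `(x_u - x_v)^2 / |e|`, and it repeatedly replaces each missing value by the weighted average of its neighbours while known values stay fixed. The function name and the toy hypergraph are hypothetical.

```python
def impute_missing(num_nodes, hyperedges, known, iters=100):
    """known: {node: value}. Returns a list with missing values filled in."""
    # Pairwise weights from the assumed clique expansion: 1/|e| per shared edge.
    weights = [dict() for _ in range(num_nodes)]
    for edge in hyperedges:
        for u in edge:
            for v in edge:
                if u != v:
                    weights[u][v] = weights[u].get(v, 0.0) + 1.0 / len(edge)

    x = [known.get(i, 0.0) for i in range(num_nodes)]
    for _ in range(iters):
        for u in range(num_nodes):
            if u in known or not weights[u]:
                continue  # keep observed values; skip isolated nodes
            total = sum(weights[u].values())
            # Setting the energy's derivative w.r.t. x[u] to zero gives
            # the weighted mean of the neighbouring values.
            x[u] = sum(w * x[v] for v, w in weights[u].items()) / total
    return x


# Toy hypergraph: nodes 0..3, attributes known only for nodes 0 and 3.
filled = impute_missing(4, [[0, 1, 2], [2, 3]], {0: 1.0, 3: 3.0})
```

For this toy case the loop converges to values that vary smoothly from node 0 to node 3; every connected component needs at least one known node for the result to be meaningful.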


#249
Highlight
VRM: Knowledge Distillation via Virtual Relation Matching

Weijia Zhang · Fei Xie · Weidong Cai · Chao Ma

Knowledge distillation (KD) aims to transfer the knowledge of a more capable yet cumbersome teacher model to a lightweight student model. In recent years, relation-based KD methods have fallen behind, as instance-matching counterparts dominate in performance. In this paper, we revive relational KD by identifying and tackling several key issues in relational KD, including its susceptibility to overfitting and spurious responses. Specifically, we transfer novelly constructed affinity graphs that compactly encapsulate a wealth of beneficial inter-sample, inter-class, and inter-view correlations by exploiting virtual views and relations as a new kind of knowledge. As a result, the student has access to rich guidance signals and stronger regularisation throughout the distillation process. To further mitigate the adverse impact of spurious responses, we prune the affinity graphs by dynamically detaching redundant and unreliable edges. Extensive experiments on CIFAR-100, ImageNet, and MS-COCO datasets demonstrate the superior performance of the proposed virtual relation matching (VRM) method over a range of tasks, architectures, and set-ups. For instance, VRM for the first time hits 74.0% accuracy for ResNet50-to-MobileNetV2 distillation on ImageNet, and improves DeiT-Ti by 14.44% on CIFAR-100 with a ResNet56 teacher. Code and models will be released.
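
To make "relation-based" distillation concrete, here is a minimal sketch of matching pairwise affinities between a student and a teacher. It is a generic relational loss, assuming cosine similarity as the affinity and a mean squared difference as the matching criterion; it does not include the virtual views, inter-class relations, or edge pruning that VRM describes. Function names and inputs are hypothetical, and embeddings are assumed to be non-zero vectors.

```python
import math


def cosine_affinity(embeddings):
    """Pairwise cosine similarities between the rows of `embeddings`."""
    normed = []
    for e in embeddings:
        norm = math.sqrt(sum(x * x for x in e))  # assumes norm > 0
        normed.append([x / norm for x in e])
    return [[sum(a * b for a, b in zip(u, v)) for v in normed] for u in normed]


def relation_loss(student, teacher):
    """Mean squared difference between student and teacher affinity graphs."""
    s = cosine_affinity(student)
    t = cosine_affinity(teacher)
    n = len(s)
    return sum((s[i][j] - t[i][j]) ** 2 for i in range(n) for j in range(n)) / (n * n)
```

Because only sample-to-sample relations are compared, the student and teacher embeddings may have different dimensions, which is one reason relational objectives are convenient across architectures.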


#250
Geminio: Language-Guided Gradient Inversion Attacks in Federated Learning

Junjie Shan · Ziqi Zhao · Jialin Lu · Rui Zhang · SM Yiu · Ka-Ho Chow

Foundation models that bridge vision and language have made significant progress. While they have inspired many life-enriching applications, their potential for abuse in creating new threats remains largely unexplored. In this paper, we reveal that vision-language models (VLMs) can be weaponized to enhance gradient inversion attacks (GIAs) in federated learning (FL), where an FL server attempts to reconstruct private data samples from gradients shared by victim clients. Despite recent advances, existing GIAs struggle to reconstruct high-resolution images when the victim has a large local data batch. One promising direction is to focus reconstruction on valuable samples rather than the entire batch, but current methods lack the flexibility to target specific data of interest. To address this gap, we propose Geminio, the first approach to transform GIAs into semantically meaningful, targeted attacks. It enables a brand new privacy attack experience: attackers can describe, in natural language, the data they consider valuable, and Geminio will prioritize reconstruction to focus on those high-value samples. This is achieved by leveraging a pretrained VLM to guide the optimization of a malicious global model that, when shared with and optimized by a victim, retains only gradients of samples that match the attacker-specified query. Geminio can be launched at any FL round and has no impact on normal training (i.e., the FL server can steal clients' data while still producing a high-utility ML model as in benign scenarios). Extensive experiments demonstrate its effectiveness in pinpointing and reconstructing targeted samples, with high success rates across complex datasets and large batch sizes with resilience against defenses.


#251
From Holistic to Localized: Local Enhanced Adapters for Efficient Visual Instruction Fine-Tuning

Pengkun Jiao · Bin Zhu · Jingjing Chen · Chong-Wah Ngo · Yu-Gang Jiang

Efficient Visual Instruction Fine-Tuning (EVIT) seeks to adapt Multimodal Large Language Models (MLLMs) to downstream tasks with minimal computational overhead. However, as task diversity and complexity increase, EVIT faces significant challenges in resolving data conflicts. To address this limitation, we propose Dual Low-Rank Adaptation (Dual-LoRA), a holistic-to-local framework that enhances the adapter's capacity to address data conflict through dual structural optimization. Specifically, we utilize two subspaces: a skill space for stable, holistic knowledge retention, and a rank-rectified task space that locally activates the holistic knowledge. Additionally, we introduce Visual Cue Enhancement (VCE), a multi-level local feature aggregation module designed to enrich the vision-language projection with local details. Our approach is both memory- and time-efficient, requiring only 1.16× the inference time of the standard LoRA method (with injection into the query and value projection layers), and just 73% of the inference time of a 4-expert LoRA-MoE. Extensive experiments on various downstream tasks and general MLLM benchmarks validate the effectiveness of our proposed methods.
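
The two timing figures can be combined into a single comparison. The small calculation below assumes both ratios were measured under the same conditions, which the abstract does not state explicitly; the function name is hypothetical.

```python
def implied_moe_vs_lora(dual_vs_lora=1.16, dual_vs_moe=0.73):
    """Inference time of the 4-expert LoRA-MoE relative to standard LoRA.

    If t_dual = 1.16 * t_lora and t_dual = 0.73 * t_moe,
    then t_moe / t_lora = 1.16 / 0.73.
    """
    return dual_vs_lora / dual_vs_moe


ratio = implied_moe_vs_lora()  # roughly 1.59
```

Under that assumption, the 4-expert LoRA-MoE would take roughly 1.59 times as long as standard LoRA, compared with 1.16 times for Dual-LoRA.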


#252
Gaze-Language Alignment for Zero-Shot Prediction of Visual Search Targets from Human Gaze Scanpaths

Sounak Mondal · Naveen Sendhilnathan · Ting Zhang · Yue Liu · Michael Proulx · Michael Iuzzolino · Chuan Qin · Tanya Jonker

Decoding human intent from eye gaze during a visual search task has become an increasingly important capability within augmented and virtual reality systems. However, gaze target prediction models used within such systems are constrained by the predefined target categories found within available gaze data, limiting their generalizability to novel categories and their usefulness within real-world, interactive systems. In this work, we present the Gaze-Language Alignment Model (GLAM), a vision-language model that can generalize gaze target predictions to novel categories of search targets lacking gaze annotation. To do so, GLAM uses a novel gaze encoder to encode foveal and peripheral information of a gaze scanpath. The resultant gaze embeddings are aligned with language embeddings of large language model-generated search descriptions for associated target categories using a novel contrastive learning strategy called Gaze-Language Alignment Decomposition (GLAD). When used to train GLAM in a zero-shot setup, GLAD surpassed naive contrastive learning strategies by nearly one-third in target prediction accuracy, even outperforming a fully supervised baseline. Moreover, in a fully supervised setup, GLAM outperformed previous methods in target prediction accuracy, regardless of the training strategy used.


#253
You Are Your Own Best Teacher: Achieving Centralized-level Performance in Federated Learning under Heterogeneous and Long-tailed Data

Shanshan Yan · Zexi Li · Chao Wu · Meng Pang · Yang Lu · Yan Yan · Hanzi Wang

Data heterogeneity, stemming from local non-IID data and global long-tailed distributions, is a major challenge in federated learning (FL), leading to significant performance gaps compared to centralized learning. Previous research found that poor representations and biased classifiers are the main problems and proposed neural-collapse-inspired synthetic simplex ETF to help representations be closer to neural collapse optima. However, we find that the neural-collapse-inspired methods are not strong enough to reach neural collapse and still leave a large gap to centralized training. In this paper, we rethink this issue from a self-distillation perspective and propose FedYoYo (You Are Your Own Best Teacher), introducing Augmented Self-bootstrap Distillation (ASD) to improve representation learning by distilling knowledge between weakly and strongly augmented local samples, without needing extra datasets or models. We further introduce Distribution-aware Logit Adjustment (DLA) to balance the self-distillation process and correct biased feature representations. FedYoYo nearly eliminates the performance gap, achieving centralized-level performance even under mixed heterogeneity. It enhances local representation learning, reducing model drift and improving convergence, with feature prototypes closer to neural collapse optimality. Extensive experiments show FedYoYo achieves state-of-the-art results, even surpassing centralized logit adjustment methods by 5.4% under global long-tailed settings. The code is available at https://anonymous.4open.science/r/FedYoYo-1F01.


#254
Enhancing Few-Shot Vision-Language Classification with Large Multimodal Model Features

Chancharik Mitra · Brandon Huang · Tianning Chai · Zhiqiu Lin · Assaf Arbelle · Rogerio Feris · Leonid Karlinsky · Trevor Darrell · Deva Ramanan · Roei Herzig

Generative Large Multimodal Models (LMMs) like LLaVA and Qwen-VL excel at a wide variety of vision-language (VL) tasks. Despite strong performance, LMMs' generative outputs are not specialized for vision-language classification tasks (i.e., tasks with vision-language inputs and discrete labels) such as image classification and multiple-choice VQA. One key challenge in utilizing LMMs for these tasks is the extraction of useful features from generative LMMs. To overcome this, we propose an approach that leverages multimodal feature extraction from the LMM's latent space. Toward this end, we present Sparse Attention Vectors (SAVs), a finetuning-free method that leverages sparse attention head activations (fewer than 5% of the heads) in LMMs as strong feature representations. With only few-shot examples, SAVs demonstrate state-of-the-art performance compared to a variety of few-shot and finetuned baselines on a collection of vision-language classification tasks. Our experiments also imply that SAVs can scale in performance with additional examples and generalize to similar tasks, establishing SAVs as both effective and robust multimodal feature representations.


#255
Safeguarding Vision-Language Models: Mitigating Vulnerabilities to Gaussian Noise in Perturbation-based Attacks

Jiawei Wang · Yushen Zuo · Yuanjun Chai · Zhendong Liu · Yicheng Fu · Yichun Feng · Kin Man Lam

Vision-Language Models (VLMs) extend the capabilities of Large Language Models (LLMs) by incorporating visual information, yet they remain vulnerable to jailbreak attacks, especially when processing noisy or corrupted images. Although existing VLMs adopt security measures during training to mitigate such attacks, vulnerabilities associated with noise-augmented visual inputs are overlooked. In this work, we identify that missing noise-augmented training causes critical security gaps: many VLMs are susceptible to even simple perturbations such as Gaussian noise. To address this challenge, we propose Robust-VLGuard, a multimodal safety dataset with aligned/misaligned image-text pairs, combined with noise-augmented fine-tuning that reduces attack success rates while preserving the functionality of the VLM. For stronger optimization-based visual perturbation attacks, we propose DiffPure-VLM, leveraging diffusion models to convert adversarial perturbations into Gaussian-like noise, which can be defended against by VLMs with noise-augmented safety fine-tuning. Experimental results demonstrate that the distribution-shifting property of diffusion models aligns well with our fine-tuned VLMs, significantly mitigating adversarial perturbations across varying intensities. The dataset and code will be open-sourced.
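
Noise augmentation of the kind referred to here can be sketched very simply. The example assumes images are nested lists of pixel intensities in [0, 1] and that noise is added independently per pixel and then clipped; the noise level `sigma`, the function name, and the data layout are illustrative assumptions rather than the settings used for Robust-VLGuard.

```python
import random


def add_gaussian_noise(image, sigma=0.1, seed=None):
    """Return a copy of `image` with zero-mean Gaussian noise, clipped to [0, 1]."""
    rng = random.Random(seed)  # local generator so results can be reproduced
    return [
        [min(1.0, max(0.0, px + rng.gauss(0.0, sigma))) for px in row]
        for row in image
    ]


# Assumed toy 2x3 grayscale image.
clean = [[0.0, 0.5, 1.0], [0.25, 0.75, 0.5]]
noisy = add_gaussian_noise(clean, sigma=0.1, seed=0)
```

During fine-tuning, such a function would typically be applied to a random subset of training images so the model sees both clean and perturbed inputs; the mixing ratio is another choice the sketch leaves open.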


#256
Towards Adversarial Robustness via Debiased High-Confidence Logit Alignment

Kejia Zhang · Juanjuan Weng · Zhiming Luo · Shaozi Li

Despite the remarkable progress of deep neural networks (DNNs) in various visual tasks, their vulnerability to adversarial examples raises significant security concerns. Recent adversarial training methods leverage inverse adversarial attacks to generate high-confidence examples, aiming to align adversarial distributions with high-confidence class regions. However, our investigation reveals that under inverse adversarial attacks, high-confidence outputs are influenced by biased feature activations, causing models to rely on background features that lack a causal relationship with the labels. This spurious correlation bias leads to overfitting irrelevant background features during adversarial training, thereby degrading the model's robust performance and generalization capabilities. To address this issue, we propose Debiased High-Confidence Adversarial Training (DHAT), a novel approach that aligns adversarial logits with debiased high-confidence logits and restores proper attention by enhancing foreground logit orthogonality. Extensive experiments demonstrate that DHAT achieves state-of-the-art robustness on both CIFAR and ImageNet-1K benchmarks, while significantly improving generalization by mitigating the feature bias inherent in inverse adversarial training approaches. Code is available at https://anonymous.4open.science/r/ICCV-7546.


#257
Learning Visual Proxy for Compositional Zero-Shot Learning

Shiyu Zhang · Cheng Yan · Yang Liu · Chenchen Jing · Lei Zhou · Wenjun Wang

Compositional Zero-Shot Learning (CZSL) aims to recognize novel attribute-object compositions by leveraging knowledge from seen compositions. Existing methods align textual prototypes with visual features through Vision-Language Models (VLMs), but they face two key limitations: (1) modality gaps hinder the discrimination of semantically similar composition pairs, and (2) single-modal textual prototypes lack fine-grained visual cues, creating bottlenecks in VLM-based CZSL. In this paper, we introduce Visual Proxy Learning, a novel approach that facilitates the learning of distinct visual distributions, effectively reducing the modality gap and improving compositional generalization performance. Specifically, we initialize visual proxies for various attributes, objects, and their compositions using text representations. By optimizing the visual space, we capture fine-grained visual cues and guide the learning of more discriminative visual representations for attributes, objects and compositions. Furthermore, we propose an effective Cross-Modal Joint Learning (CMJL) strategy that imposes cross-modal constraints between the original text-image space and the fine-grained visual space. This approach not only boosts generalization for previously unseen composition pairs but also sharpens the discrimination of similar pairs, fostering more robust and precise learning. Extensive experiments demonstrate state-of-the-art performance in closed-world scenarios and competitive open-world results across four established CZSL benchmarks, validating the effectiveness of our approach in advancing compositional generalization.

Pansharpening, a pivotal task in remote sensing for fusing high-resolution panchromatic and multispectral imagery, has garnered significant research interest. Recent advancements employing diffusion models based on stochastic differential equations (SDEs) have demonstrated state-of-the-art performance. However, the inherent multi-step sampling process of SDEs imposes substantial computational overhead, hindering practical deployment. While existing methods adopt efficient samplers, knowledge distillation, or retraining to reduce sampling steps (e.g., from 1,000 to fewer steps), such approaches often compromise fusion quality. In this work, we propose the Optimal Transport Flow Matching (OTFM) framework, which integrates the dual formulation of unbalanced optimal transport (UOT) to achieve one-step, high-quality pansharpening. Unlike conventional OT formulations that enforce rigid distribution alignment, UOT relaxes marginal constraints to enhance modeling flexibility, accommodating the intrinsic spectral and spatial disparities in remote sensing data. Furthermore, we incorporate task-specific regularization into the UOT objective, enhancing the robustness of the flow model. The OTFM framework enables simulation-free training and single-step inference while maintaining strict adherence to pansharpening constraints. Experimental evaluations across multiple datasets demonstrate that OTFM matches or exceeds the performance of previous regression-based models and leading diffusion-based methods while only needing one sampling step.


#259
Evidential Knowledge Distillation

Liangyu Xiang · Junyu Gao · Changsheng Xu

Existing logit-based knowledge distillation methods typically employ singularly deterministic categorical distributions, which eliminates the inherent uncertainty in network predictions and thereby limits the effective transfer of knowledge. To address this limitation, we introduce distribution-based probabilistic modeling as a more comprehensive representation of network knowledge. Specifically, we regard the categorical distribution as a random variable and leverage deep neural networks to predict its distribution, representing it as an evidential second-order distribution. Based on the second-order modeling, we propose Evidential Knowledge Distillation (EKD), which distills both the expectation of the teacher distribution and the distribution itself into the student. The expectation captures the macroscopic characteristics of the distribution, while the distribution itself conveys microscopic information about the classification boundaries. Additionally, we theoretically demonstrate that EKD's distillation objective provides an upper bound on the expected risk of the student when the teacher’s predictions are treated as ground truth labels. Extensive experiments on several standard benchmarks across various teacher-student network pairs highlight the effectiveness and superior performance of EKD. Our code is available in the Supplementary Material.
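
The difference between a distribution's expectation and the distribution itself can be seen in a small example. The sketch assumes the second-order distribution is a Dirichlet, as is common in evidential deep learning, with concentration obtained as evidence plus one; whether EKD uses exactly this parameterization is not stated here, and the function names are hypothetical.

```python
def alpha_from_evidence(evidence):
    """Common evidential convention (assumed here): alpha_k = evidence_k + 1."""
    return [e + 1.0 for e in evidence]


def dirichlet_summary(alpha):
    """Return (expected class probabilities, total strength, vacuity)."""
    strength = sum(alpha)
    expected = [a / strength for a in alpha]
    vacuity = len(alpha) / strength  # equals 1 when there is no evidence
    return expected, strength, vacuity


# Two Dirichlets with the same expectation but very different confidence.
mean_a, strength_a, vac_a = dirichlet_summary([3.0, 2.0, 1.0])
mean_b, strength_b, vac_b = dirichlet_summary([30.0, 20.0, 10.0])
```

Both cases yield the same expected class probabilities, so matching only the expectation could not tell them apart; the strength and vacuity differ by a factor of ten, which is the kind of extra information a second-order target carries.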


#260
Know "No" Better: A Data-Driven Approach for Enhancing Negation Awareness in CLIP

Junsung Park · Jungbeom Lee · Jongyoon Song · Sangwon Yu · Dahuin Jung · Sungroh Yoon

While CLIP has significantly advanced multimodal understanding by bridging vision and language, its inability to grasp negation, such as failing to differentiate concepts like "parking" from "no parking", poses substantial challenges. By analyzing the data used in the public CLIP model's pre-training, we posit this limitation stems from a lack of negation-inclusive data. To address this, we introduce data generation pipelines that employ a large language model (LLM) and a multimodal LLM to produce negation-inclusive captions. Fine-tuning CLIP with data generated from our pipelines, we develop NegationCLIP, which enhances negation awareness while preserving generality. Moreover, to enable a comprehensive evaluation of negation understanding, we propose NegRefCOCOg, a benchmark tailored to test VLMs' ability to interpret negation across diverse expressions and positions within a sentence. Experiments on various CLIP architectures validate the effectiveness of our data generation pipelines in enhancing CLIP's ability to perceive negation accurately. Additionally, NegationCLIP's enhanced negation awareness has practical applications across various multimodal tasks, demonstrated by performance gains in text-to-image generation and referring image segmentation.


#261
Intervening in Black Box: Concept Bottleneck Model for Enhancing Human Neural Network Mutual Understanding

Nuoye Xiong · Anqi Dong · Ning Wang · Cong Hua · Guangming Zhu · Lin Mei · peiyi shen · zhang liang

Recent advances in deep learning have led to increasingly complex models with deeper layers and more parameters, reducing interpretability and making their decisions harder to understand. While many methods explain black-box reasoning, most lack effective interventions or operate only at the sample level without modifying the model itself. To address this, we propose the Concept Bottleneck Model for Enhancing Human-Neural Network Mutual Understanding (CBM-HNMU). CBM-HNMU leverages the Concept Bottleneck Model (CBM) as an interpretable framework to approximate black-box reasoning and communicate conceptual understanding. Detrimental concepts are automatically identified and refined (removed/replaced) based on global gradient contributions. The modified CBM then distills corrected knowledge back into the black-box model, enhancing both interpretability and accuracy. We evaluate CBM-HNMU on various CNN and transformer-based models across Flower-102, CIFAR-10, CIFAR-100, FGVC-Aircraft, and CUB-200, achieving a maximum accuracy improvement of 2.64\% and a maximum increase in average accuracy of 1.03\%. Source code is available at: http://anonymous.com.


#262
VALLR: Visual ASR Language Model for Lip Reading

Marshall Thomas · Edward Fish · Richard Bowden

Lip Reading, or Visual Automatic Speech Recognition (V-ASR), is a complex task requiring the interpretation of spoken language exclusively from visual cues, primarily lip movements and facial expressions. This task is especially challenging due to the absence of auditory information and the inherent ambiguity when visually distinguishing phonemes that have overlapping visemes, where different phonemes appear identical on the lips. Current methods typically attempt to predict words or characters directly from these visual cues, but this approach frequently encounters high error rates due to coarticulation effects and viseme ambiguity. We propose a novel two-stage, phoneme-centric framework for V-ASR that addresses these longstanding challenges. First, our model predicts a compact sequence of phonemes from visual inputs using a Video Transformer with a CTC head, thereby reducing the task complexity and achieving robust speaker invariance. This phoneme output then serves as the input to a fine-tuned Large Language Model (LLM), which reconstructs coherent words and sentences by leveraging broader linguistic context. Unlike existing methods that either predict words directly (often faltering on visually similar phonemes) or rely on large-scale multimodal pre-training, our approach explicitly encodes intermediate linguistic structure while remaining highly data efficient. We demonstrate state-of-the-art performance on two challenging datasets, LRS2 and LRS3, where our method significantly reduces Word Error Rate (WER), reaching a state-of-the-art WER of 18.7 on LRS3 despite using 99.4% less labelled data than the next best approach. Code will be released following the review process.
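
The first stage described above ends in a CTC head, whose per-frame outputs must be collapsed into a compact phoneme sequence. The following is a minimal sketch of standard greedy CTC decoding (merge consecutive repeats, then drop blanks), not the authors' implementation. The blank symbol `"_"` and the phoneme labels are made up for illustration.

```python
from itertools import groupby

BLANK = "_"  # hypothetical blank symbol; real CTC heads usually reserve an index


def ctc_greedy_collapse(frame_labels, blank=BLANK):
    """Collapse per-frame labels: merge consecutive repeats, then drop blanks."""
    merged = [label for label, _ in groupby(frame_labels)]
    return [label for label in merged if label != blank]


# Per-frame argmax output for a short clip (made-up phoneme labels).
frames = ["_", "HH", "HH", "_", "AH", "AH", "AH", "_", "_", "L", "L", "OW", "_"]
print(ctc_greedy_collapse(frames))  # ['HH', 'AH', 'L', 'OW']
```

A blank between two identical labels keeps them distinct, which is how CTC represents genuinely repeated phonemes. The resulting short sequence is the kind of input the second-stage language model would receive.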

Dataset condensation aims to compress a large dataset into a smaller synthetic set while preserving the essential representations needed for effective model training. However, existing condensation methods show severe performance degradation when applied to noisy datasets. To address this, we present robust dataset condensation (RDC), an end-to-end method that mitigates noise to generate a clean and robust synthetic set, without requiring separate noise-reduction preprocessing steps. RDC refines the condensation process by integrating contrastive learning tailored for robust condensation, named golden MixUp contrast. It uses synthetic samples to sharpen class boundaries and to mitigate noisy representations, while its augmentation strategy compensates for the limited size of the synthetic set by identifying clean samples from noisy training data, enriching synthetic images with real-data diversity. We evaluate RDC against existing condensation methods and a conventional approach that first applies noise cleaning algorithms to the dataset before performing condensation. Extensive experiments show that RDC outperforms other approaches on CIFAR-10/100 across different types of noise, including asymmetric, symmetric, and real-world noise.


#264
CaliMatch: Adaptive Calibration for Improving Safe Semi-supervised Learning

Jinsoo Bae · Seoung Bum Kim · Hyungrok Do

Semi-supervised learning (SSL) uses unlabeled data to improve the performance of machine learning models when labeled data is scarce. However, its real-world applications often face the label distribution mismatch problem, in which the unlabeled dataset includes instances whose ground-truth labels are absent from the labeled training dataset. Recent studies referred to as safe SSL have addressed this issue by using both classification and out-of-distribution (OOD) detection. However, the existing methods may suffer from overconfidence in deep neural networks, leading to increased SSL errors because of high confidence in incorrect pseudo-labels or OOD detection. To address this, we propose a novel method, CaliMatch, which calibrates both the classifier and the OOD detector to foster safe SSL. CaliMatch presents adaptive label smoothing and temperature scaling, which eliminate the need to manually tune the smoothing degree for effective calibration. We give a theoretical justification for why improving the calibration of both the classifier and the OOD detector is crucial in safe SSL. Extensive evaluations on CIFAR-10, CIFAR-100, SVHN, TinyImageNet, and ImageNet demonstrate that CaliMatch outperforms the existing methods in safe SSL tasks.
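
For readers unfamiliar with the two calibration tools named above, here is a minimal sketch of plain temperature scaling and plain label smoothing with fixed, hand-picked values. It does not reproduce CaliMatch's adaptive scheme, which the abstract says removes the need for manual tuning. The logits, temperature, and smoothing degree are illustrative assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a temperature (T > 1 softens the output)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def smooth_labels(target_index, num_classes, epsilon):
    """Label smoothing: (1 - eps) on the target plus eps spread uniformly."""
    uniform = epsilon / num_classes
    return [
        (1.0 - epsilon) * (1.0 if k == target_index else 0.0) + uniform
        for k in range(num_classes)
    ]


logits = [4.0, 1.0, 0.0]               # made-up classifier outputs
print(max(softmax(logits, 1.0)))       # confidence without scaling
print(max(softmax(logits, 2.0)))       # lower confidence with T = 2
print(smooth_labels(0, 4, 0.1))        # softened one-hot target
```

Both operations reduce overconfidence: the temperature rescales predictions after the fact, while smoothing changes the training target.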


#265
Hybrid-TTA: Continual Test-time Adaptation via Dynamic Domain Shift Detection

Hyewon Park · Hyejin Park · Jueun Ko · Dongbo Min

Continual Test Time Adaptation (CTTA) has emerged as a critical approach to bridge the domain gap between controlled training environments and real-world scenarios. Since it is important to balance the trade-off between adaptation and stabilization, many studies have tried to accomplish it by either introducing a regulation to fully trainable models or updating a limited portion of the models. This paper proposes Hybrid-TTA, a holistic approach that dynamically selects the instance-wise tuning method for optimal adaptation. Our approach introduces Dynamic Domain Shift Detection (DDSD), which identifies domain shifts by leveraging temporal correlations in input sequences, and dynamically switches between Full and Efficient Tuning for effective adaptation toward varying domain shifts. To maintain model stability, Masked Image Modeling Adaptation (MIMA) leverages an auxiliary reconstruction task for enhanced generalization and robustness with minimal computational overhead. Hybrid-TTA achieves a 0.6\%p gain on the Cityscapes-to-ACDC benchmark dataset for semantic segmentation, surpassing previous state-of-the-art methods. It also delivers an approximately 20-fold increase in FPS compared to the fastest recently proposed methods, offering a robust solution for real-world continual adaptation challenges.

Multi-Task Learning (MTL) enables multiple tasks to be learned within a shared network, but differences in objectives across tasks can cause negative transfer, where the learning of one task degrades another task's performance. While pre-trained transformers significantly improve MTL performance, their fixed network capacity and rigid structure limit adaptability. Previous dynamic network architectures attempt to address this but are inefficient as they directly convert shared parameters into task-specific ones. We propose Dynamic Token Modulation and Expansion (DTME-MTL), a framework applicable to any transformer-based MTL architecture. DTME-MTL enhances adaptability and reduces overfitting by identifying gradient conflicts in token space and applying adaptive solutions based on conflict type. Unlike prior methods that mitigate negative transfer by duplicating network parameters, DTME-MTL operates entirely in token space, enabling efficient adaptation without excessive parameter growth. Extensive experiments demonstrate that DTME-MTL consistently improves multi-task performance with minimal computational overhead, offering a scalable and effective solution for enhancing transformer-based MTL models.


#267
DisCoPatch: Taming Adversarially-driven Batch Statistics for Improved Out-of-Distribution Detection

Francisco Caetano · Christiaan Viviers · Luis Zavala-Mondragón · Peter H.N. De With · Fons van der Sommen

Out-of-distribution (OOD) detection holds significant importance across many applications. While semantic and domain-shift OOD problems are well-studied, this work focuses on covariate shifts - subtle variations in the data distribution that can degrade machine learning performance. We hypothesize that detecting these subtle shifts can improve our understanding of in-distribution boundaries, ultimately improving OOD detection. In adversarial discriminators trained with Batch Normalization (BN), real and adversarial samples form distinct domains with unique batch statistics - a property we exploit for OOD detection. We introduce DisCoPatch, an unsupervised Adversarial Variational Autoencoder (VAE) framework that harnesses this mechanism. During inference, batches consist of patches from the same image, ensuring a consistent data distribution that allows the model to rely on batch statistics. DisCoPatch uses the VAE's suboptimal outputs (generated and reconstructed) as negative samples to train the discriminator, thereby improving its ability to delineate the boundary between in-distribution samples and covariate shifts. By tightening this boundary, DisCoPatch achieves state-of-the-art results in public OOD detection benchmarks. The proposed model not only excels in detecting covariate shifts, achieving 95.5% AUROC on ImageNet-1K(-C), but also outperforms all prior methods on public Near-OOD (95.0%) benchmarks. With a compact model size of $\leq$ 25MB, it achieves high OOD detection performance at notably lower latency than existing methods, making it an efficient and practical solution for real-world OOD detection applications. The code will be made publicly available.


#268
PLAN: Proactive Low-Rank Allocation for Continual Learning

XIEQUN WANG · Zhan Zhuang · Yu Zhang

Continual learning (CL) requires models to continuously adapt to new tasks without forgetting past knowledge. In this work, we propose Proactive Low-rank AllocatioN (PLAN), a framework that extends Low-Rank Adaptation (LoRA) to enable efficient and interference-aware fine-tuning of large pre-trained models in CL settings. PLAN proactively manages the allocation of task-specific subspaces by introducing orthogonal basis vectors for each task and optimizing them through a perturbation-based strategy that minimizes conflicts with previously learned parameters. Furthermore, PLAN incorporates a novel selection mechanism that identifies and assigns basis vectors with minimal sensitivity to interference, reducing the risk of degrading past knowledge while maintaining efficient adaptation to new tasks. Empirical results on standard CL benchmarks demonstrate that PLAN consistently outperforms existing methods, establishing a new state-of-the-art for continual learning with foundation models.


#269
Think Twice: Test-Time Reasoning for Robust CLIP Zero-Shot Classification

Shenyu Lu · Zhaoying Pan · Xiaoqian Wang

Contrastive Language-Image Pre-training (CLIP) models exhibit intriguing properties, particularly in their zero-shot classification capability. However, the reliability of CLIP zero-shot classification is severely undermined by spurious correlations. Existing efforts to enhance the robustness of zero-shot CLIP models often rely on prior knowledge or annotations of spurious correlations, limiting real-world applicability due to the unavailability of such information. Alternative methods attempt to detect distribution shift at test time but require training statistics whose access is often restricted or computationally expensive. To address the challenges brought by spurious correlation under zero-shot settings, we propose a novel test-time reasoning approach. Our method, inspired by human recognition, localizes the object and refines the classification accordingly. The inherent capacity of CLIP for semantic understanding allows us to isolate the object of interest without auxiliary models. Zero-shot classification is then performed exclusively on the localized objects, effectively mitigating the influence of spurious correlation. The proposed approach is interpretable and flexible as it requires no spurious annotations or prior knowledge, making it widely applicable. Substantial improvements across multiple benchmark datasets validate the effectiveness of our approach.


#270
Differential-informed Sample Selection Accelerates Multimodal Contrastive Learning

Zihua Zhao · Feng Hong · Mengxi Chen · Pengyi Chen · Benyuan Liu · Jiangchao Yao · Ya Zhang · Yanfeng Wang

The remarkable success of contrastive-learning-based multimodal models has been largely driven by training on ever-larger datasets at great computational expense. Sample selection, as an alternative efficient paradigm, is an important direction for accelerating the training process. However, recent advances in sample selection either rely on an oracle model to select a high-quality coreset offline, which limits their use in cold-start scenarios, or focus on online selection based on real-time model predictions, which does not sufficiently or efficiently account for noisy correspondence. To address this dilemma, we propose a novel Differential-Informed Sample Selection (DISSect) method, which accurately and efficiently discriminates noisy correspondence for training acceleration. Specifically, we rethink the impact of noisy correspondence on contrastive learning and propose that the differential between the predicted correlation of the current model and that of a historical model is more informative for characterizing sample quality. Based on this, we construct a robust differential-based sample selection method and analyze it theoretically. Extensive experiments on three benchmark datasets and various downstream tasks demonstrate the consistent superiority of DISSect over current state-of-the-art methods.


#271
Highlight
Dataset Distillation via Vision-Language Category Prototype

YAWEN ZOU · Guang Li · Duo Su · Zi Wang · Jun YU · Chao Zhang

Dataset distillation (DD) condenses large datasets into compact yet informative substitutes, preserving performance comparable to the original dataset while reducing storage, transmission costs, and computational consumption. However, previous DD methods mainly focus on distilling information from images, often overlooking the semantic information inherent in the data. The disregard for context hinders the model's generalization ability, particularly in tasks involving complex datasets, which may result in illogical outputs or the omission of critical objects. In this study, we integrate vision-language methods into DD by introducing text prototypes to distill language information and collaboratively synthesize data with image prototypes, thereby enhancing dataset distillation performance. Notably, the text prototypes utilized in this study are derived from descriptive text information generated by an open-source large language model. This framework demonstrates broad applicability across datasets without pre-existing text descriptions, expanding the potential of dataset distillation beyond traditional image-based approaches. Compared to other methods, the proposed approach generates logically coherent images containing target objects, achieving state-of-the-art validation performance and demonstrating robust generalization. Source code and generated data are available at https://anonymous.4open.science/r/10575/.


#272
Scaling and Taming Adversarial Training with Synthetic Data

Juntao Wu · Xianting Huang · Yu Chen · Shuai Pang · Ke Wang

Despite the success of adversarial training on small datasets, applying it to large-scale datasets like ImageNet remains challenging. Previous attempts using synthetic data show limited improvements. This work investigates the impact of synthetic data scaling, model scaling, and training strategies on adversarial training with ImageNet, providing deeper insights into large-scale robustness. During the process, we observe a notable phenomenon of loss oscillation, leading to adversarial overfitting, and propose strategies to mitigate it. Experimental results show that, under AutoAttack on ImageNet-1K, our method achieves a robust accuracy of 71.54\%. Our findings highlight the crucial role of synthetic data and model scaling in enhancing adversarial robustness on large-scale benchmarks and provide a new direction for training robust visual representations at scale.


#273
Highlight
A Constrained Optimization Approach for Gaussian Splatting from Coarsely-posed Images and Noisy Lidar Point Clouds

Jizong Peng · Tze Ho Elden Tse · Kai Xu · Wenchao Gao · Angela Yao

3D Gaussian Splatting (3DGS) is a powerful reconstruction technique, but it needs to be initialized from accurate camera poses and high-fidelity point clouds. Typically, the initialization is taken from Structure-from-Motion (SfM) algorithms; however, SfM is time-consuming and restricts the application of 3DGS in real-world scenarios and large-scale scene reconstruction. We introduce a constrained optimization method for simultaneous camera pose estimation and 3D reconstruction that does not require SfM support. Core to our approach is decomposing a camera pose into a sequence of camera-to-(device-)center and (device-)center-to-world optimizations. To facilitate this, we propose two optimization constraints that are conditioned on the sensitivity of each parameter group and restrict each parameter’s search space. In addition, as we learn the scene geometry directly from the noisy point clouds, we propose geometric constraints to improve the reconstruction quality. Experiments demonstrate that the proposed method significantly outperforms the existing (multi-modal) 3DGS baseline and methods supplemented by COLMAP on both our collected dataset and two public benchmarks.


#274
Soft Separation and Distillation: Toward Global Uniformity in Federated Unsupervised Learning

Hung-Chieh Fang · Hsuan-Tien Lin · Irwin King · Yifei Zhang

Federated Unsupervised Learning (FUL) aims to learn expressive representations in federated and self-supervised settings. The quality of representations learned in FUL is usually determined by uniformity, a measure of how uniformly representations are distributed in the embedding space. However, existing solutions perform well in achieving intra-client (local) uniformity for local models while failing to achieve inter-client (global) uniformity after aggregation due to non-IID data distributions and the decentralized nature of FUL. To address this issue, we propose Soft Separation and Distillation (SSD), a novel approach that preserves inter-client uniformity by encouraging client representations to spread toward different directions. This design reduces interference during client model aggregation, thereby improving global uniformity while preserving local representation expressiveness. We further enhance this effect by introducing a projector distillation module to address the discrepancy between loss optimization and representation quality. We evaluate SSD in both cross-silo and cross-device federated settings, demonstrating consistent improvements in representation quality and task performance across various training scenarios. Our results highlight the importance of inter-client uniformity in FUL and establish SSD as an effective solution to this challenge.
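
The abstract describes uniformity only in words. One widely used way to quantify it is the log of the average Gaussian potential between pairs of normalized embeddings, where lower values mean the points are spread more evenly. The sketch below assumes that definition (with temperature `t = 2`), which may differ from the measure used in the paper. The embeddings are toy 2-D points.

```python
import math
from itertools import combinations


def normalize(vector):
    """Project a vector onto the unit sphere."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector]


def uniformity(embeddings, t=2.0):
    """Log of the mean pairwise Gaussian potential; lower means more uniform."""
    points = [normalize(e) for e in embeddings]
    potentials = [
        math.exp(-t * sum((a - b) ** 2 for a, b in zip(p, q)))
        for p, q in combinations(points, 2)
    ]
    return math.log(sum(potentials) / len(potentials))


# Toy data: four well-spread directions versus four nearly identical ones.
spread = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
clustered = [[1.0, 0.0], [1.0, 0.01], [1.0, -0.01], [1.0, 0.02]]
print(uniformity(spread), uniformity(clustered))
```

Under this measure, aggregating clients whose representations collapse onto the same directions would yield a value close to zero, while representations spread across the sphere give a clearly negative value.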


#275
Competitive Distillation: A Simple Learning Strategy for Improving Visual Classification

Daqian Shi · Xiaolei Diao · Xu Chen · Cedric John

Deep Neural Networks (DNNs) have significantly advanced the field of computer vision. To improve the DNN training process, knowledge distillation methods have demonstrated their effectiveness in accelerating network training by introducing a fixed learning direction from the teacher network to student networks. In this context, several distillation-based optimization strategies have been proposed, e.g., deep mutual learning and self-distillation, as an attempt to achieve generic training performance enhancement through the cooperative training of multiple networks. However, such strategies achieve limited improvements due to a poor understanding of the impact of learning directions among networks across different iterations. In this paper, we propose a novel competitive distillation strategy that allows each network in a group to potentially act as a teacher based on its performance, enhancing the overall learning performance. Competitive distillation organizes a group of networks to perform a shared task and engage in competition, where competitive optimization is proposed to improve the parameter updating process. We further introduce stochastic perturbation in competitive distillation, aiming to motivate networks to induce mutations to achieve better visual representations and global optimum. The experimental results show that competitive distillation achieves promising performance in diverse tasks and datasets.

Fast Adversarial Training (FAT) employs the single-step Fast Gradient Sign Method (FGSM) to generate adversarial examples, reducing the computational costs of traditional adversarial training. However, FAT suffers from Catastrophic Overfitting (CO), where models' robust accuracy against multi-step attacks plummets to zero during training. Recent studies indicate that CO occurs because single-step adversarial perturbations contain label information that models exploit for prediction, leading to overfitting and diminished robustness against more complex attacks. In this paper, we discover that after CO occurs, the label information of certain samples can transfer across different samples, significantly increasing the likelihood of modified images being classified as the intended label. This discovery offers a new perspective on why various adversarial initialization strategies are effective. To address this issue, we introduce an innovative FAT strategy that leverages special samples to capture transferable label information and proactively removes potential label information during training, complemented by a non-uniform label smoothing technique to further eliminate label information. Experimental results across three datasets demonstrate that our method maintains competitive robustness against several attacks compared to other FAT approaches, with ablation studies confirming the effectiveness of our methodology.
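
As background for the single-step attack mentioned above, the sketch below applies FGSM to a tiny logistic-regression model in pure Python: the input is moved by `eps` in the direction of the sign of the loss gradient. The model weights, input, and `eps` are made-up values, and this illustrates only the generic FGSM step, not the proposed training strategy.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sign(value):
    """Return -1, 0, or 1 depending on the sign of value."""
    return (value > 0) - (value < 0)


def fgsm(x, y, weights, bias, eps):
    """One FGSM step against logistic regression with cross-entropy loss.

    For this model the gradient of the loss w.r.t. the input is (p - y) * w.
    """
    p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
    grad = [(p - y) * w for w in weights]
    return [xi + eps * sign(g) for xi, g in zip(x, grad)]


# Hypothetical model and a correctly classified positive example.
weights, bias = [2.0, -1.0], 0.0
x, y = [1.0, -1.0], 1
x_adv = fgsm(x, y, weights, bias, eps=0.1)
print(x_adv)
```

Because the perturbation depends on the label `y` through the gradient, it carries label information, which is the property the abstract links to catastrophic overfitting.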


#277
Fewer Denoising Steps or Cheaper Per-Step Inference: Towards Compute-Optimal Diffusion Model Deployment

Zhenbang Du · Yonggan Fu · Lifu Wang · Jiayi Qian · Xiao Luo · Yingyan Celine Lin

Diffusion models have shown remarkable success across generative tasks, yet their high computational demands challenge deployment on resource-limited platforms. This paper investigates a critical question for compute-optimal diffusion model deployment: Under a post-training setting without fine-tuning, is it more effective to reduce the number of denoising steps or to use a cheaper per-step inference? Intuitively, reducing the denoising steps increases the variability of the characteristics between the steps, making the model more sensitive to compression. In contrast, keeping more denoising steps makes the differences smaller, preserving redundancy, and making post-training compression more feasible. To systematically examine this, we propose PostDiff, a training-free framework for accelerating pre-trained diffusion models by reducing redundancy at both the input level and module level in a post-training manner. At the input level, we propose a mixed-resolution denoising scheme based on the insight that reducing generation resolution in early denoising steps can enhance low-frequency components and improve final generation fidelity. At the module level, we employ a hybrid module caching strategy to reuse computations across denoising steps. Extensive experiments and ablation studies demonstrate that (1) PostDiff can significantly improve the fidelity-efficiency trade-off of state-of-the-art diffusion models, and (2) to boost efficiency while maintaining decent generation fidelity, reducing per-step inference cost is often more effective than reducing the number of denoising steps. All codes and models will be released upon acceptance.


#278
Chimera: Improving Generalist Model with Domain-Specific Experts

Tianshuo Peng · Mingsheng Li · Jiakang Yuan · Hongbin Zhou · Renqiu Xia · Renrui Zhang · LEI BAI · Song Mao · Bin Wang · Aojun Zhou · Botian Shi · Tao Chen · Bo Zhang · Xiangyu Yue

Large Multi-modal Models (LMMs), trained on web-scale datasets predominantly composed of natural images, have demonstrated remarkable performance on general tasks. However, these models often exhibit limited specialized capabilities for domain-specific tasks that require extensive domain prior knowledge. An intuitive solution is to post-train LMMs on a specific domain, but this often suffers from a labor-intensive annotation process and the inaccessibility of private training data. Directly integrating expert models tailored for those tasks is also challenging due to representational gaps and imbalanced optimization. To address these challenges, we introduce \textbf{Chimera}, a scalable and low-cost multi-modal pipeline designed to boost the ability of existing LMMs with domain-specific experts. Specifically, we design a progressive training strategy to integrate features from expert models into the input of a generalist LMM. To address the imbalanced optimization caused by the well-aligned general visual encoder, we introduce a novel Generalist-Specialist Collaboration Masking (GSCM) mechanism. This results in a versatile model that excels across the chart, table, math, and document domains, achieving state-of-the-art performance on multi-modal reasoning and visual content extraction tasks, both of which are challenging tasks for assessing existing LMMs. We will release model weights, along with the data used for training and evaluation, to facilitate future research on LMMs.

Unlearning methods for vision-language models (VLMs) have primarily adapted techniques from large language models (LLMs), relying on weight updates that demand extensive annotated forget sets. Moreover, these methods perform unlearning at a coarse granularity, often leading to excessive forgetting and reduced model utility. To address this issue, we introduce SAUCE, a novel method that leverages sparse autoencoders (SAEs) for fine-grained and selective concept unlearning in VLMs. Briefly, SAUCE first trains SAEs to capture high-dimensional, semantically rich sparse features. It then identifies the features most relevant to the target concept for unlearning. During inference, it selectively modifies these features to suppress specific concepts while preserving unrelated information. We evaluate SAUCE on two distinct VLMs, LLaVA-v1.5-7B and LLaMA-3.2-11B-Vision-Instruct, across two types of tasks: concrete concept unlearning (objects and sports scenes) and abstract concept unlearning (emotions, colors, and materials), encompassing a total of 60 concepts. Extensive experiments demonstrate that SAUCE outperforms state-of-the-art methods by 18.04\% in unlearning quality while maintaining comparable model utility. Furthermore, we investigate SAUCE's robustness against widely used adversarial attacks, its transferability across models, and its scalability in handling multiple simultaneous unlearning requests. Our findings establish SAUCE as an effective and scalable solution for selective concept unlearning in VLMs.


#280
Highlight
Corvid: Improving Multimodal Large Language Models Towards Chain-of-Thought Reasoning

Jingjing Jiang · Chao Ma · Xurui Song · Hanwang Zhang · Jun Luo

Recent advancements in multimodal large language models (MLLMs) have demonstrated exceptional performance in multimodal perception and understanding. However, leading open-source MLLMs exhibit significant limitations in complex and structured reasoning, particularly in tasks requiring deep reasoning for decision-making and problem-solving. In this work, we present Corvid, an MLLM with enhanced chain-of-thought (CoT) reasoning capabilities. Architecturally, Corvid incorporates a hybrid vision encoder for informative visual representation and a meticulously designed connector (GateMixer) to facilitate cross-modal alignment. To enhance Corvid's CoT reasoning capabilities, we introduce MCoT-Instruct-287K, a high-quality multimodal CoT instruction-following dataset, refined and standardized from diverse public reasoning sources. Leveraging this dataset, we fine-tune Corvid with a two-stage CoT-formatted training approach to progressively enhance its step-by-step reasoning abilities. Furthermore, we propose an effective inference-time scaling strategy that enables Corvid to mitigate over-reasoning and under-reasoning through self-verification. Extensive experiments demonstrate that Corvid outperforms existing o1-like MLLMs and state-of-the-art MLLMs with similar parameter scales, with notable strengths in mathematical reasoning and science problem-solving.


#281
Any-SSR: How Recursive Least Squares Works in Continual Learning of Large Language Model

Kai Tong · Kang Pan · Xiao Zhang · Erli Meng · Run He · Yawen Cui · Nuoyan Guo · Huiping Zhuang

Large Language Models (LLMs) possess broad capabilities that let them handle diverse language-related tasks. However, finetuning LLMs diminishes these general skills, and continual finetuning further causes severe degradation of accumulated knowledge. Recently, Continual Learning (CL) for LLMs has emerged, which aims to continually adapt LLMs to new tasks while maintaining previously learned knowledge and inheriting general skills. Existing techniques either replay previous data, leading to extra computational costs, or utilize a single parameter-efficient module to learn the downstream task, constraining new knowledge absorption with interference between different tasks. To address these issues, this paper proposes Analytic Subspace Routing (ASR). For each task, we isolate the learning within a subspace of deep layers' features via low-rank adaptation, eliminating knowledge interference between different tasks. Additionally, we propose an analytic routing mechanism to properly utilize knowledge learned in different subspaces. Our approach employs Recursive Least Squares to train a multi-task router model, allowing the router to dynamically adapt to incoming data without requiring access to historical data. Also, the router effectively assigns the current task to an appropriate subspace and has a non-forgetting property with respect to previously learned tasks, with a solid theoretical guarantee. Experimental results demonstrate that our method achieves near-perfect retention of prior knowledge while seamlessly integrating new information, effectively overcoming the core limitations of existing methods. Our code will be released after acceptance.
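
The reason Recursive Least Squares needs no historical data is that each update uses only the current sample plus a running inverse-covariance matrix. The sketch below is the textbook RLS update for a scalar-output linear model in pure Python, without a forgetting factor. It is an illustration under those assumptions, not the paper's router, and the toy data and initial matrix are made up.

```python
def rls_update(w, P, x, y):
    """One recursive least squares step for a linear model y ~ w . x."""
    n = len(x)
    Px = [sum(P[i][j] * x[j] for j in range(n)) for i in range(n)]
    denom = 1.0 + sum(x[i] * Px[i] for i in range(n))
    gain = [v / denom for v in Px]
    error = y - sum(wi * xi for wi, xi in zip(w, x))
    w_new = [wi + g * error for wi, g in zip(w, gain)]
    # P stays symmetric, so x^T P equals (P x)^T.
    P_new = [[P[i][j] - gain[i] * Px[j] for j in range(n)] for i in range(n)]
    return w_new, P_new


# Stream samples generated by y = 2*x1 - 1*x2, one at a time.
w = [0.0, 0.0]
P = [[1e6, 0.0], [0.0, 1e6]]  # large initial P means weak regularization
stream = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 1.0), ([2.0, 1.0], 3.0)]
for features, target in stream:
    w, P = rls_update(w, P, features, target)
print(w)
```

After the stream is consumed, the weights match the batch least-squares solution up to the small regularization implied by the initial `P`, even though no earlier sample was ever revisited.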


#282
Reminiscence Attack on Residuals: Exploiting Approximate Machine Unlearning for Privacy

Yaxin Xiao · Qingqing Ye · Li Hu · Huadi Zheng · Haibo Hu · Zi Liang · Haoyang LI · JIAOYIJIE JIAOYIJIE

Machine unlearning enables the removal of specific data from ML models to uphold the right to be forgotten. While approximate unlearning algorithms offer efficient alternatives to full retraining, this work reveals that they fail to adequately protect the privacy of unlearned data. In particular, these algorithms introduce implicit residuals which facilitate privacy attacks targeting unlearned data. We observe that these residuals persist regardless of model architectures, parameters, and unlearning algorithms, exposing a new attack surface beyond conventional output-based leakage. Based on this insight, we propose the Reminiscence Attack (ReA), which amplifies the correlation between residuals and membership privacy through targeted fine-tuning processes. ReA achieves up to 1.90x and 1.12x higher accuracy than prior attacks when inferring class-wise and sample-wise membership, respectively. To mitigate such residual-induced privacy risk, we develop a dual-phase approximate unlearning framework that first eliminates deep-layer unlearned data traces and then enforces convergence stability to prevent models from "pseudo-convergence", where their outputs are similar to retrained models but still preserve unlearned residuals. Our framework works for both classification and generation tasks. Experimental evaluations confirm that our approach maintains high unlearning efficacy, while reducing the adaptive privacy attack accuracy to nearly random guess, at the computational cost of 2-12% of full retraining.


#283
STEP-DETR: Advancing DETR-based Semi-Supervised Object Detection with Super Teacher and Pseudo-Label Guided Text Queries

Tahira Shehzadi · Khurram Azeem Hashmi · Shalini Sarode · Didier Stricker · Muhammad Zeshan Afzal

This paper addresses key limitations in current Semi-Supervised Object Detection (SSOD) frameworks, focusing on issues related to pseudo-label quality, confidence bias, and inefficient query generation. Traditional methods, including CNN-based and DETR-based architectures, often suffer from noisy pseudo-labels and overfitting to common object categories, and consequently have difficulty detecting rare objects. Specifically, recent DETR-based SSOD approaches struggle with the one-to-many assignment strategy, which produces noisy pseudo-labels and overlapping predictions, resulting in suboptimal performance. To address these challenges, we propose STEP-DETR, a transformer-based SSOD framework. STEP-DETR introduces Super Teacher to generate higher-quality pseudo-labels and improve the student’s learning process. Furthermore, STEP-DETR proposes Pseudo-Label Text Queries, which incorporate text embeddings from Super Teacher, balancing the student’s confidence across common and rare categories, thereby mitigating confidence bias and enhancing generalization. Moreover, Denoising Text Guided Object Queries synthesizes query-label pairs for foreground and background using contrastive learning, enabling the model to better distinguish objects from background noise. To further boost performance and training efficiency, a Query Refinement Module is incorporated to filter out redundant denoising queries. On MS-COCO and Pascal VOC benchmarks, STEP-DETR outperforms state-of-the-art methods, demonstrating its effectiveness in improving semi-supervised object detection. Notably, with just 10% labeled data, it achieves 45.4 mAP, surpassing the baseline Semi-DETR by 1.9 mAP.

Existing fine-grained image retrieval (FGIR) methods predominantly rely on supervision from predefined categories to learn discriminative representations for retrieving fine-grained objects. However, they inadvertently introduce category-specific semantics into the retrieval representation, creating semantic dependencies on predefined classes that critically hinder generalization to unseen categories. To tackle this, we propose AdvRF, a novel adversarial reconstruction feedback framework aimed at learning category-agnostic discrepancy representations. Specifically, AdvRF reformulates FGIR as a visual discrepancy reconstruction task via synergizing category-aware discrepancy localization from retrieval models with category-agnostic feature learning from reconstruction models. The reconstruction model exposes residual discrepancies overlooked by the retrieval model, forcing it to improve localization accuracy, while the refined signals from the retrieval model guide the reconstruction model to improve its reconstruction ability. Consequently, the retrieval model localizes visual differences, while the reconstruction model encodes these differences into category-agnostic representations. This representation is then transferred to the retrieval model through knowledge distillation for efficient deployment. Quantitative and qualitative evaluations demonstrate that our AdvRF achieves impressive performance on both widely-used fine-grained and coarse-grained datasets.


#285
Not Only Vision: Evolve Visual Speech Recognition via Peripheral Information

Zhaoxin Yuan · Shuang Yang · Shiguang Shan · Xilin Chen

Visual Speech Recognition (VSR) aims to infer spoken content by analyzing the speaker’s facial dynamics. While this technology has shown promise, a question naturally arises: Is it sufficient to rely solely on such visual information in complex real-world scenarios? Humans, on the other hand, excel at lip-reading by leveraging information beyond lip movements, such as speech-related background and prior knowledge about the task. Despite this well-recognized human capability, existing approaches have not explored incorporating such \textbf{Peripheral Information} into automatic frameworks. We categorize peripheral information into a hierarchical structure based on its relevance to the spoken content: (1) Content Anchors (e.g., speech topic or description), (2) Task Expertise (task-related background, e.g., human prior lip-reading experiences), and (3) Linguistic Perturbation (irrelevant information that VSR systems should process alongside meaningful signals). To unlock the valuable clues embedded in peripheral information, we propose a novel multi-modal framework that utilizes a large language model (LLM) to decode spoken content while seamlessly integrating peripheral information. Central to our framework is a new adaptation method, Synergy LoRA, which enables a coordinated adaptation of visual and textual inputs. Visual features are processed with an independent module while being guided by semantic cues from peripheral information through a MoE textual adaptation module. It preserves the fine-grained spatiotemporal details of the visual modality and incorporates peripheral information to enhance recognition. On the widely-used LRS3 dataset, with readily available peripheral information, our model achieves a Word Error Rate (WER) of 22.0\%, surpassing recent approaches. Further experiments on the challenging AVSpeech dataset also show promising results in handling complex real-world scenarios.


#286
KOEnsAttack: Towards Efficient Data-Free Black-Box Adversarial Attacks via Knowledge-Orthogonalized Substitute Ensembles

Chaoyong Yang · Jia-Li Yin · Bin Chen · Zhaozhe Hu · Xiaolei Liu · Wei Lin

Data-free black-box attacks aim to attack a model without access to either the model parameters or training data. Existing methods use a generator to synthesize training samples and then train a substitute model to imitate the victim model. The adversarial examples (AEs) are finally generated using the substitute model to transfer to the victim model. The core challenges are therefore how to generate diverse training samples for substitute model training and how to improve the transferability of AEs from the substitute model to the victim model. In this paper, we propose a Knowledge-Orthogonalized Ensemble Attack, dubbed KOEnsAttack, to accomplish these two goals. We first use dual networks as the ensemble substitute model, and then propose a sample hardness enhancement to transform the samples from the generator into hard samples that exist in the controversial regions of the dual models, promoting sample diversity. Next, during the substitute model training, we design a knowledge orthogonalization module to guide the dual networks in learning complementary and useful information from the black-box, thereby enhancing the transferability of adversarial samples generated on the final ensemble model. Extensive experiments on several datasets are conducted to evaluate the effectiveness of our method. The results show that the proposed method can achieve superior performance compared with the state-of-the-art competitors.


#287
FedPall: Prototype-based Adversarial and Collaborative Learning for Federated Learning with Feature Drift

yong zhang · Feng Liang · Guanghu Yuan · Min Yang · Chengming Li · Xiping Hu

Federated learning (FL) enables collaborative training of a global model in the centralized server with data from multiple parties while preserving privacy. However, data heterogeneity can significantly degrade the performance of the global model when each party uses datasets from different sources to train a local model. Among various cases of data heterogeneity, feature drift, i.e., differences in feature space among parties, is prevalent in real-life data but remains largely unexplored. Feature drift can distract feature extraction learning in clients and thus lead to poor feature extraction and classification performance. To tackle the problem of feature drift in FL, we propose FedPall, an FL framework that utilizes prototype-based adversarial learning to unify feature spaces and collaborative learning to reinforce class information within the features. Moreover, FedPall leverages mixed features generated from global prototypes and local features to enhance the global classifier with classification-relevant information from a global perspective. Evaluation results on three representative feature-drifted datasets demonstrate FedPall's consistently superior performance in classification with feature-drifted data in the FL scenario.


#288
DuET: Dual Incremental Object Detection via Exemplar-Free Task Arithmetic

Munish Monga · Vishal Chudasama · Pankaj Wasnik · Biplab Banerjee

Real-world object detection systems, such as those in autonomous driving and surveillance, must continuously learn new object categories and simultaneously adapt to changing environmental conditions. Existing approaches, Class Incremental Object Detection (CIOD) and Domain Incremental Object Detection (DIOD), only address one aspect of this challenge. CIOD struggles in unseen domains, while DIOD suffers from catastrophic forgetting when learning new classes, limiting their real-world applicability. To overcome these limitations, we introduce Dual Incremental Object Detection (DuIOD), a more practical setting that simultaneously handles class and domain shifts in an exemplar-free manner. We propose DuET, a Task Arithmetic-based model merging framework that enables stable incremental learning while mitigating sign conflicts through a novel Directional Consistency Loss. Unlike prior methods, DuET is detector-agnostic, allowing models like YOLO11 and RT-DETR to function as real-time incremental object detectors. To comprehensively evaluate both retention and adaptation, we introduce the Retention-Adaptability Index (RAI), which combines the Average Retention Index (Avg RI) for catastrophic forgetting and the Average Generalization Index for domain adaptability into a common ground. Extensive experiments on the Pascal Series and Diverse Weather Series demonstrate DuET’s effectiveness, achieving a +13.12\% RAI improvement while preserving 89.3\% Avg RI on the Pascal Series (4 tasks), as well as a +11.39\% RAI improvement with 88.57\% Avg RI on the Diverse Weather Series (3 tasks), outperforming existing methods.
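For readers unfamiliar with task arithmetic, the generic recipe behind such model merging can be sketched as follows: a task vector is the difference between fine-tuned and base weights, and merging adds scaled task vectors back to the base. This sketch shows only the standard technique and a simple way to measure sign conflicts; it does not implement DuET or its Directional Consistency Loss, and the function names and toy weights are assumptions for illustration.

```python
def task_vector(base, finetuned):
    """Per-parameter difference between fine-tuned and base weights."""
    return {name: [f - b for f, b in zip(finetuned[name], base[name])]
            for name in base}

def merge(base, task_vectors, scale=1.0):
    """Add the scaled sum of all task vectors to the base weights."""
    merged = {}
    for name, weights in base.items():
        total = [sum(tv[name][i] for tv in task_vectors)
                 for i in range(len(weights))]
        merged[name] = [w + scale * t for w, t in zip(weights, total)]
    return merged

def sign_conflict_rate(tv_a, tv_b):
    """Fraction of parameters that two task vectors push in opposite directions."""
    pairs = [(a, b) for name in tv_a for a, b in zip(tv_a[name], tv_b[name])]
    conflicts = sum(1 for a, b in pairs if a * b < 0)
    return conflicts / len(pairs)

# Toy weights: one base model and two models fine-tuned on different tasks.
base = {"w": [1.0, 2.0, 3.0]}
ft_a = {"w": [1.5, 2.0, 2.0]}
ft_b = {"w": [0.5, 3.0, 2.5]}

tv_a = task_vector(base, ft_a)
tv_b = task_vector(base, ft_b)
merged = merge(base, [tv_a, tv_b])
conflict = sign_conflict_rate(tv_a, tv_b)
```

In this toy case the first parameter receives opposite updates from the two tasks, so the contributions cancel in the merged model; this is the kind of sign conflict the abstract refers to.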


#289
Dataset Ownership Verification for Pre-trained Masked Models

Yuechen Xie · Jie Song · Yicheng Shan · Xiaoyan Zhang · Yuanyu Wan · Shengxuming Zhang · Jiarui Duan · Mingli Song

High-quality open-source datasets have emerged as a pivotal catalyst driving the swift advancement of deep learning, while facing the looming threat of potential exploitation. Protecting these datasets is of paramount importance for the interests of their owners. The verification of dataset ownership has evolved into a crucial approach in this domain; however, existing verification techniques are predominantly tailored to supervised models and contrastive pre-trained models, rendering them ill-suited for direct application to the increasingly prevalent masked models. In this work, we introduce the inaugural methodology addressing this critical, yet unresolved challenge, termed Dataset Ownership Verification for Masked Modeling (DOV4MM). The central objective is to ascertain whether a suspicious black-box model has been pre-trained on a particular unlabeled dataset, thereby assisting dataset owners in safeguarding their rights. DOV4MM is grounded in our empirical observation that when a model is pre-trained on the target dataset, the difficulty of reconstructing masked information within the embedding space exhibits a marked contrast to models not pre-trained on that dataset. We validated the efficacy of DOV4MM through ten masked image models on ImageNet-1K and four masked language models on WikiText-103. The results demonstrate that DOV4MM rejects the null hypothesis, with a $p$-value considerably below 0.05, surpassing all prior approaches.


#290
Sparsity Outperforms Low-Rank Projections in Few-Shot Adaptation

Nairouz Mrabah · Nicolas Richet · Ismail Ayed · Eric Granger

Adapting Vision-Language Models (VLMs) to new domains with few labeled samples remains a significant challenge due to severe overfitting and computational constraints. State-of-the-art solutions, such as low-rank reparameterization, mitigate these issues but often struggle with generalization and require extensive hyperparameter tuning. In this paper, a novel Sparse Optimization (SO) framework is proposed. Unlike low-rank approaches that typically constrain updates to a fixed subspace, our SO method leverages high sparsity to dynamically adjust very few parameters. We introduce two key paradigms. First, we advocate for \textit{local sparsity and global density}, which updates a minimal subset of parameters per iteration while maintaining overall model expressiveness. As a second paradigm, we advocate for \textit{local randomness and global importance}, which sparsifies the gradient using random selection while pruning the first moment based on importance. This combination significantly mitigates overfitting and ensures stable adaptation in low-data regimes. Extensive experiments on 11 diverse datasets show that SO achieves state-of-the-art few-shot adaptation performance while reducing memory overhead.
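The two paradigms can be pictured with a heavily simplified, momentum-only update step. This is a sketch under stated assumptions, not the paper's optimizer: the abstract does not specify the importance criterion, so magnitude is assumed here, and `sparse_step` together with its parameters (`keep_grad`, `keep_moment`, `beta`, `lr`) is hypothetical.

```python
import random

def sparse_step(params, grads, moment, lr=0.1, beta=0.9,
                keep_grad=2, keep_moment=2, rng=random):
    """One simplified update: random gradient mask, then importance-pruned moment."""
    n = len(params)
    # Local randomness: keep a random subset of gradient entries this iteration.
    kept = set(rng.sample(range(n), keep_grad))
    sparse_grad = [g if i in kept else 0.0 for i, g in enumerate(grads)]
    # First-moment (momentum) update computed on the sparsified gradient.
    moment = [beta * m + (1 - beta) * g for m, g in zip(moment, sparse_grad)]
    # Global importance (ASSUMED: largest magnitude): zero out all other entries.
    top = set(sorted(range(n), key=lambda i: abs(moment[i]), reverse=True)[:keep_moment])
    moment = [m if i in top else 0.0 for i, m in enumerate(moment)]
    # Only parameters with a surviving moment entry actually move.
    params = [p - lr * m for p, m in zip(params, moment)]
    return params, moment

rng = random.Random(0)
params = [1.0] * 6
grads = [0.5, -1.0, 2.0, 0.1, -0.3, 0.7]
new_params, new_moment = sparse_step(params, grads, [0.0] * 6, rng=rng)
```

Each call touches at most `keep_moment` parameters, while different random masks across iterations let many parameters be updated over the course of training, which is one way to read "local sparsity and global density".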

Few-Shot Class-Incremental Learning (FSCIL) focuses on incrementally learning novel classes using only a limited number of samples from novel classes, which faces dual challenges: catastrophic forgetting of previously learned classes and over-fitting to novel classes with few available samples. Recent advances in large pre-trained vision-language models (VLMs), such as CLIP, provide rich feature representations that generalize well across diverse classes. Therefore, freezing the pre-trained backbone and aggregating class features as prototypes becomes an intuitive and effective way to mitigate catastrophic forgetting. However, this strategy fails to address the overfitting challenge, and the prototypes of novel classes exhibit semantic bias due to the few samples per class. To address these limitations, we propose a semantic $\textbf{Feature Decomposition-Recomposition (FDR)}$ method based on VLMs. Firstly, we decompose the CLIP features into semantically distinct segments guided by text keywords from base classes. Then, these segments are adaptively recomposed at the attribute level given text descriptions, forming calibrated prototypes for novel classes. The recomposition process operates linearly at the attribute level but induces nonlinear adjustments across the entire prototype. This fine-grained and non-linear recomposition inherits the generalization capabilities of VLMs and the adaptive recomposition ability of base classes, leading to enhanced performance in FSCIL. Extensive experiments demonstrate our method's effectiveness, particularly in 1-shot scenarios where it achieves improvements between 6.70\% and 19.66\% for novel classes over state-of-the-art baselines on CUB200. Code will be made publicly available.


#292
VLRMBench: A Comprehensive and Challenging Benchmark for Vision-Language Reward Models

JIACHENG RUAN · Wenzhen Yuan · Xian Gao · Ye Guo · Daoxin Zhang · Zhe Xu · Yao Hu · Ting Liu · yuzhuo fu

Although large visual-language models (LVLMs) have demonstrated strong performance in multimodal tasks, errors may occasionally arise due to biases during the reasoning process. Recently, reward models (RMs) have become increasingly pivotal in the reasoning process. Specifically, process RMs evaluate each reasoning step, outcome RMs focus on the assessment of reasoning results, and critique RMs perform error analysis on the entire reasoning process, followed by corrections. However, existing benchmarks for vision-language RMs (VLRMs) typically assess only a single aspect of their capabilities (e.g., distinguishing between two answers), thus limiting the all-round evaluation and restricting the development of RMs in the visual-language domain. To address this gap, we propose a comprehensive and challenging benchmark, dubbed VLRMBench, encompassing 12,634 questions. VLRMBench is constructed based on three distinct types of datasets, covering mathematical reasoning, hallucination understanding, and multi-image understanding. We design 12 tasks across three major categories, focusing on evaluating VLRMs in the aspects of process understanding, outcome judgment, and critique generation. Extensive experiments are conducted on 21 open-source models and 5 advanced closed-source models, highlighting the challenges posed by VLRMBench. For instance, in 'Forecasting Future', a binary classification task, the advanced GPT-4o achieves only 76.0\% accuracy. Additionally, we perform comprehensive analytical studies, offering valuable insights for the future development of VLRMs. We anticipate that VLRMBench will serve as a pivotal benchmark in advancing VLRMs.


#293
ConstStyle: Robust Domain Generalization with Unified Style Transformation

Nam Duong Tran · Nam Nguyen Phuong · Hieu Pham · Phi Le Nguyen · My Thai

Deep neural networks often suffer performance drops when the test data distribution differs from the training data. Domain Generalization (DG) aims to address this by focusing on domain-invariant features or augmenting data for greater diversity. However, these methods often struggle with limited training domains or significant gaps between seen (training) and unseen (test) domains. To enhance DG robustness, we hypothesize that it is essential for the model to be trained on data from domains that closely resemble unseen test domains, an inherently difficult task due to the absence of prior knowledge about the unseen domains. Accordingly, we propose ConstStyle, a novel approach that leverages a unified domain to capture domain-invariant features and bridge the domain gap with theoretical analysis. During training, all samples are mapped onto this unified domain, optimized for seen domains. During testing, unseen domain samples are projected similarly before predictions. By aligning both training and testing data within this unified domain, ConstStyle effectively reduces the impact of domain shifts, even with large domain gaps or few seen domains. Extensive experiments demonstrate that ConstStyle consistently outperforms existing methods across diverse scenarios. Notably, when only a limited number of seen domains are available, ConstStyle can boost accuracy by up to 19.82\% compared to the next best approach.


#294
Adversarial Attention Perturbations for Large Object Detection Transformers

Zachary Yahn · Selim Tekin · Fatih Ilhan · Sihao Hu · Tiansheng Huang · Yichang Xu · Margaret Loper · Ling Liu

Adversarial perturbations are useful tools for exposing vulnerabilities in neural networks. Existing adversarial perturbation methods for object detection are either limited to attacking regression-based detectors or weak against transformer-based detectors. This paper presents an Attention-Focused Offensive Gradient (AFOG) attack against object detection transformers. By design, AFOG is neural-architecture agnostic and effective for attacking both large transformer-based object detectors and conventional regression-based detectors with a unified adversarial attention framework. This paper makes three original contributions. First, AFOG utilizes a learnable attention mechanism that focuses perturbations on vulnerable image regions in multi-box detection tasks, increasing performance over non-attention baselines by up to 30.6%. Second, AFOG's attack loss is formulated by integrating two types of feature loss through learnable attention updates with iterative injection of adversarial perturbations. Finally, AFOG is an efficient and stealthy adversarial perturbation method. It probes the weak spots of detection transformers by adding strategically generated and visually imperceptible perturbations which can cause well-trained object detection models to fail. Extensive experiments conducted with twelve large detection transformers on COCO demonstrate the efficacy of AFOG. Our empirical results also show that AFOG outperforms existing attacks on transformer-based and regression-based object detectors by up to 83% with superior speed and imperceptibility. Code is available at: https://anonymous.4open.science/r/AFOG-5EC3/README.md.


#295
DISTIL: Data-Free Inversion of Suspicious Trojan Inputs via Latent Diffusion

Hossein Mirzaei · Zeinab Taghavi · Sepehr Rezaee · Masoud Hadi · Moein Madadi · Mackenzie Mathis

Deep neural networks have demonstrated remarkable success across numerous tasks, yet they remain vulnerable to trojan (backdoor) attacks, raising serious concerns about their safety in real-world mission-critical applications. A common countermeasure is trigger inversion -- reconstructing malicious "shortcut" patterns (triggers) inserted by an adversary during training. Current trigger-inversion methods typically search the full pixel space under specific assumptions but offer no assurances that the estimated trigger is more than an adversarial perturbation that flips the model output. Here, we propose a data-free, zero-shot trigger-inversion strategy that restricts the search space while avoiding strong assumptions on trigger appearance. Specifically, we incorporate a diffusion-based generator guided by the target classifier; through iterative generation, we produce candidate triggers that align with the internal representations the model relies on for malicious behavior. Empirical evaluations, both quantitative and qualitative, show that our approach reconstructs triggers that effectively distinguish clean versus trojaned models. DISTIL surpasses alternative methods by high margins, achieving up to 7.1% higher accuracy on the BackdoorBench dataset and a 9.4% improvement on trojaned object detection model scanning, offering a promising new direction for reliable backdoor defense without reliance on extensive data or strong prior assumptions about triggers.

Articulated objects are an important type of interactable objects in everyday environments. In this paper, we propose a novel diffusion model-based approach for generating articulated objects that aligns them with partial point clouds and improves their physical plausibility. The model represents part shapes by signed distance functions (SDFs). We guide the reverse diffusion process using a point cloud alignment loss computed using the predicted SDFs. Additionally, we impose non-penetration and mobility constraints based on the part SDFs for guiding the model to generate more physically plausible objects. We also make our diffusion approach category-aware to further improve point cloud alignment if category information is available. We evaluate the generative ability and constraint consistency of samples generated with our approach using the PartNet-Mobility dataset. We also compare our approach with an unguided baseline diffusion model and demonstrate that our method can improve constraint consistency and provides a tradeoff with generative ability.


#297
MissRAG: Addressing the Missing Modality Challenge in Multimodal Large Language Models

Vittorio Pipoli · Alessia Saporita · Federico Bolelli · Marcella Cornia · Lorenzo Baraldi · Costantino Grana · Rita Cucchiara · Elisa Ficarra

Recently, Multimodal Large Language Models (MLLMs) have emerged as a leading framework for enhancing the ability of Large Language Models (LLMs) to interpret non-linguistic modalities. Despite their impressive capabilities, the robustness of MLLMs under conditions where one or more modalities are missing remains largely unexplored. In this paper, we investigate the extent to which MLLMs can maintain performance when faced with missing modality inputs. Moreover, we propose a novel framework to mitigate this issue, called Retrieval-Augmented Generation for missing modalities (MissRAG). It consists of a novel multimodal RAG technique alongside a tailored prompt engineering strategy designed to enhance model robustness by mitigating the impact of absent modalities while avoiding the burden of additional instruction tuning. To demonstrate the effectiveness of our techniques, we conducted comprehensive evaluations across five diverse datasets, covering tasks such as audio-visual question answering, audio-visual captioning, and multimodal sentiment analysis. Our source code is available at https://anonymous.4open.science/r/MM_MLLM-1536


#298
ONLY: One-Layer Intervention Sufficiently Mitigates Hallucinations in Large Vision-Language Models

Zifu Wan · Ce Zhang · Silong Yong · Martin Ma · Simon Stepputtis · Louis-Philippe Morency · Deva Ramanan · Katia Sycara · Yaqi Xie

Recent Large Vision-Language Models (LVLMs) have introduced a new paradigm for understanding and reasoning about image input through textual responses. Although they have achieved remarkable performance across a range of multi-modal tasks, they face the persistent challenge of hallucination, which introduces practical weaknesses and raises concerns about their reliable deployment in real-world applications. Existing work has explored contrastive decoding approaches to mitigate this issue, where the output of the original LVLM is compared and contrasted with that of a perturbed version. However, these methods require two or more queries that slow down LVLM response generation, making them less suitable for real-time applications. To overcome this limitation, we propose ONLY, a training-free decoding approach that requires only a single query and a one-layer intervention during decoding, enabling efficient real-time deployment. Specifically, we enhance textual outputs by selectively amplifying crucial textual information using a text-to-visual entropy ratio for each token. Extensive experimental results demonstrate that our ONLY approach consistently outperforms state-of-the-art methods across various benchmarks while requiring minimal implementation effort and computational cost.

Out-of-distribution (OOD) detection aims to distinguish whether detected objects belong to known categories or not. Existing methods extract OOD samples from in-distribution (ID) data to regularize the model’s decision boundaries. However, the decision boundaries are not adequately regularized due to the model's lack of knowledge about the distribution of OOD data. To address this issue, we propose an Adaptive Prompt Learning framework via Gaussian Outlier Synthesis (APLGOS) for OOD detection. Specifically, we leverage the Vision-Language Model (VLM) to initialize learnable ID prompts by sampling standardized results from pre-defined Q\&A pairs. Region-level prompts are synthesized in low-likelihood regions of class-conditional Gaussian distributions. These prompts are then utilized to initialize learnable OOD prompts and optimized with adaptive prompt learning. In addition, OOD pseudo-samples are synthesized via Gaussian outlier synthesis. The similarity score between prompts and images is used to calculate the contrastive learning loss in a high-dimensional hidden space. This methodology regularizes the model to learn more compact decision boundaries for ID and OOD categories. Extensive experiments show that our proposed method achieves state-of-the-art performance with less ID data on four mainstream datasets.


#300
FinMMR: Make Financial Numerical Reasoning More Multimodal, Comprehensive, and Challenging

Zichen Tang · Haihong E · Jiacheng Liu · Zhongjun Yang · Rongjin Li · Zihua Rong · Haoyang He · Zhuodi Hao · Xinyang Hu · Kun Ji · Ziyan Ma · Mengyuan Ji · Jun Zhang · Chenghao Ma · Qianhe Zheng · Yang Liu · Yiling Huang · Xinyi Hu · Qing Huang · Zijian Xie · Shiyao Peng

We present FinMMR, a novel bilingual multimodal benchmark tailored to evaluate the reasoning capabilities of multimodal large language models (MLLMs) in financial numerical reasoning tasks. Compared to existing benchmarks, our work introduces three significant advancements. (1) Multimodality: We meticulously transform existing financial reasoning datasets, and construct novel questions from the latest Chinese financial research reports. The dataset comprises 4.3K questions and 8.7K images spanning 14 categories, including tables, bar charts, and ownership structure charts. (2) Comprehensiveness: FinMMR encompasses 14 financial subdomains, including corporate finance, banking, and industry analysis, significantly exceeding existing benchmarks in financial domain knowledge breadth. (3) Challenge: Models are required to perform multi-step precise numerical reasoning by integrating financial knowledge with the understanding of complex financial images and text. The best-performing MLLM achieves only 51.4\% accuracy on Hard problems. We believe that FinMMR will drive advancements in enhancing the reasoning capabilities of MLLMs in real-world scenarios.


#301
Trust but Verify: Programmatic VLM Evaluation in the Wild

Viraj Prabhu · Senthil Purushwalkam · An Yan · Caiming Xiong · Ran Xu

Vision-Language Models (VLMs) frequently hallucinate responses to visual queries, undermining their reliability for critical applications. However, quantifying the effect of such hallucinations in free-form responses to open-ended queries requires visually verifying each claim within the response, which is highly challenging. We propose Programmatic VLM Evaluation (PROVE), a new benchmarking paradigm for evaluating VLM responses to open-ended queries. To construct PROVE, we provide a large language model with a high-fidelity scene-graph representation constructed from a detailed image caption, and prompt it to generate i) diverse and challenging question-answer (QA) pairs that test a range of image understanding capabilities, and ii) programs that can be executed over the scene graph object to verify each QA pair. We thus construct a benchmark of 10.6k challenging but grounded visual QA pairs. Next, we propose a scene graph-based evaluation framework to programmatically measure both the helpfulness and truthfulness of a free-form model response without relying on subjective LLM judgments. We extensively benchmark a range of VLMs on PROVE, and uncover a concerning tradeoff where models that provide more helpful responses often hallucinate more, whereas truthful models tend to be less informative. PROVE serves as a foundation for developing next-generation VLMs that balance helpfulness with truthfulness. A snapshot of our dataset is available at \url{https://prove-explorer-anon.netlify.app/}.

Talking head generation is gaining significant importance across various domains, with a growing demand for high-quality rendering. However, existing methods often suffer from identity leakage (IL) and rendering artifacts (RA), particularly in extreme cases. Through an in-depth analysis of previous approaches, we identify two key insights: (1) IL arises from identity information embedded within motion features, and (2) this identity information can be leveraged to address RA. Building on these findings, this paper introduces FixTalk, a novel framework designed to simultaneously resolve both issues for high-quality talking head generation. Firstly, we propose an Enhanced Motion Indicator (EMI) to effectively decouple identity information from motion features, mitigating the impact of IL on generated talking heads. To address RA, we introduce an Enhanced Detail Indicator (EDI), which utilizes the leaked identity information to supplement missing details, thus fixing the artifacts. Extensive experiments demonstrate that FixTalk effectively mitigates IL and RA, achieving superior performance compared to state-of-the-art methods.


#303
Uncalibrated Structure from Motion on a Sphere

Jonathan Ventura · Viktor Larsson · Fredrik Kahl

Spherical motion is a special case of camera motion where the camera moves on the imaginary surface of a sphere with the optical axis normal to the surface. Common sources of spherical motion are a person capturing a stereo panorama with a phone held in an outstretched hand, or a hemi-spherical camera rig used for multi-view scene capture. However, traditional structure-from-motion pipelines tend to fail on spherical camera motion sequences, especially when the camera is facing outward. Building upon prior work addressing the calibrated case, we explore uncalibrated reconstruction from spherical motion, assuming a fixed but unknown focal length parameter. We show that, although two-view spherical motion is always a critical case, self-calibration is possible from three or more views. Through analysis of the relationship between focal length and spherical relative pose, we devise a global structure-from-motion approach for uncalibrated reconstruction. We demonstrate the effectiveness of our approach on real-world captures in various settings, even when the camera motion deviates from perfect spherical motion.


#304
FreeDNA: Endowing Domain Adaptation of Diffusion-Based Dense Prediction with Training-Free Domain Noise Alignment

Hang Xu · Jie Huang · Linjiang Huang · Dong Li · Yidi Liu · Feng Zhao

Domain Adaptation (DA) for dense prediction tasks is an important topic, which enhances a dense prediction model's performance when tested on unseen domains. Recently, with the development of Diffusion-based Dense Prediction (DDP) models, DA designs tailored to this framework are worth exploring, since the diffusion model is effective in modeling the distribution transformation that comprises domain information. In this work, we propose a training-free mechanism for DDP frameworks, endowing them with DA capabilities. Our motivation arises from the observation that the exposure bias (e.g., noise statistics bias) in diffusion brings domain shift, and different domains in conditions of DDP models can also be effectively captured by the noise prediction statistics. Based on this, we propose a training-free Domain Noise Alignment (DNA) approach, which alleviates the variations of noise statistics under domain changes during the diffusion sampling process, thereby achieving domain adaptation. Specifically, when the source domain is available, we directly adopt the DNA method to achieve domain adaptation by aligning the noise statistics of the target domain with those of the source domain. For the more challenging source-free DA, inspired by the observation that regions closer to the source domain exhibit higher confidence under variations of sampling noise, we utilize the statistics from the high-confidence regions progressively to guide the noise statistic adjustment during the sampling process. Notably, our method demonstrates its effectiveness in enhancing the DA capability of DDP models across four common dense prediction tasks.

Large-scale pre-trained Vision-Language Models (VLMs) like CLIP have demonstrated promising zero-shot transfer capabilities to downstream tasks. However, their performance deteriorates when facing significant domain shifts. In this paper, we focus on cost-effective adaptation of large-scale pre-trained VLMs to unlabeled target domains. In this context, two prevalent paradigms show inherent limitations: Unsupervised Fine-Tuning (UFT) struggles with poor initial model performance, while Unsupervised Domain Adaptation (UDA) may suffer from the adverse effects of an inappropriate auxiliary source domain. To alleviate these limitations, we propose to adaptively construct more suitable auxiliary data from large-scale image-text pairs to facilitate unsupervised adaptation without any human annotations. Specifically, we introduce Progressive Distribution Bridging (PDB), which decomposes the challenging adaptation task into multiple simple steps through the construction of auxiliary data. To obtain such data, we design an efficient and controllable retrieval algorithm incorporating cascaded semantic filters and a style controller to regulate the semantic category and domain style of retrieved data, respectively. Experimental results across 11 different domains from three standard UDA benchmarks demonstrate the effectiveness of our auxiliary data. Notably, on Office-Home, our method outperforms state-of-the-art UDA methods that rely on labeled source domains. The proposed method offers a more universal and cost-effective solution for adapting VLMs to unlabeled downstream tasks.


#306
AllGCD: Leveraging All Unlabeled Data for Generalized Category Discovery

Xinzi Cao · Ke Chen · Feidiao Yang · Xiawu Zheng · Yutong Lu · Yonghong Tian

Generalized Category Discovery (GCD) aims to identify both known and novel categories in unlabeled data by leveraging knowledge from labeled datasets. Current methods employ contrastive learning on labeled data to capture known category structures but neglect unlabeled data, limiting their effectiveness in classifying novel classes, especially in fine-grained open-set detection where subtle class differences are crucial. To address this issue, we propose a novel learning approach, AllGCD, which seamlessly integrates all unlabeled data into contrastive learning to enhance the discrimination of novel classes. Specifically, we introduce two key techniques: Intra-class Contrast in Labeled Data (Intra-CL) and Inter-class Contrast in Unlabeled Data (Inter-CU). Intra-CL first refines intra-class compactness within known categories by integrating potential known samples into labeled data. This process refines the decision boundaries of known categories, reducing ambiguity when distinguishing novel categories. Building on this, Inter-CU further strengthens inter-class separation between known and novel categories by applying global contrastive learning to the class distribution in the unlabeled data. By jointly leveraging Intra-CL and Inter-CU, AllGCD improves both intra-class compactness and inter-class separation, enhancing the discriminability between known and novel classes. Experiments demonstrate that AllGCD significantly improves novel-class accuracy, e.g., achieving increases of 7.4% on CUB and 7.5% on Stanford Cars. Our code is available at https://anonymous.4open.science/r/AllGCD-1D41.


#307
Highlight
RALoc: Enhancing Outdoor LiDAR Localization via Rotation Awareness

Yuyang Yang · Wen Li · Sheng Ao · Qingshan Xu · Shangshu Yu · guo yu · Yin Zhou · Siqi Shen · Cheng Wang

LiDAR localization is a fundamental task in autonomous driving and robotics. Scene Coordinate Regression (SCR) exhibits leading pose accuracy, achieving impressive results in learning-based localization. We observe that real-world LiDAR scans captured from different viewpoints usually result in the catastrophic collapse of SCR. However, existing LiDAR localization methods have largely overlooked the issue of rotation sensitivity in SCR. In this paper, we present RALoc, an outdoor LiDAR localization method with rotation awareness to achieve accurate localization. The key to our approach is a Point Cloud Canonicalization module, which leverages a powerful equivariant key feature aggregation to transform the input LiDAR scan towards a consistent orientation, effectively eliminating the adverse effects of rotation. This proposed module has promising scalability and can be seamlessly integrated with existing LiDAR localization networks. Moreover, we propose the Bidirectional LiDAR Localization (BiLiLo) dataset as a benchmark to evaluate the performance of various methods in large outdoor scenes with significant rotation changes. Extensive experiments show that RALoc significantly improves localization performance in scenarios with large rotation changes, and also achieves competitive performance on the Oxford Radar RobotCar dataset. Our code and dataset will be released upon acceptance.


#308
External Knowledge Injection for CLIP-Based Class-Incremental Learning

Da-Wei Zhou · Kai-Wen Li · Jingyi Ning · Han-Jia Ye · Lijun Zhang · De-Chuan Zhan

Class-Incremental Learning (CIL) enables learning systems to continuously adapt to evolving data streams. With the advancement of pre-training, leveraging pre-trained vision-language models (e.g., CLIP) offers a promising starting point for CIL. However, CLIP makes decisions by matching visual embeddings to class names, overlooking the rich contextual information conveyed through language. For instance, the concept of "cat" can be decomposed into features like tail, fur, and face for recognition. Besides, since the model is continually updated, these detailed features are overwritten in CIL, requiring external knowledge for compensation. In this paper, we introduce ExterNal knowledGe INjEction (ENGINE) for CLIP-based CIL. To enhance knowledge transfer from outside the dataset, we propose a dual-branch injection tuning framework that encodes informative knowledge from both visual and textual modalities. The visual branch is enhanced with data augmentation to enrich the visual features, while the textual branch leverages GPT-4 to rewrite discriminative descriptors. In addition to this on-the-fly knowledge injection, we also implement post-tuning knowledge by re-ranking the prediction results during inference. With the injected knowledge, the model can better capture informative features for downstream tasks as data evolves. Extensive experiments demonstrate the state-of-the-art performance of ENGINE.


#309
Cooperative Pseudo Labeling for Unsupervised Federated Classification

Kuangpu Guo · Lijun Sheng · Yongcan Yu · Jian Liang · Zilei Wang · Ran He

Unsupervised federated learning (UFL) aims to collaboratively train a global model across distributed clients without data sharing and label information. Previous UFL works have predominantly focused on representation learning and clustering tasks. Recently, vision language models (e.g., CLIP) have gained significant attention for their attractive zero-shot prediction capabilities. Leveraging this advancement, classification problems that were previously infeasible under the UFL paradigm now present new opportunities but remain largely unexplored. In this paper, we extend UFL to the classification problem with CLIP for the first time and propose a novel method, Federated Cooperative Pseudo Labeling (FedCoPL). Specifically, clients estimate and upload their pseudo label distribution, and the server adjusts and redistributes them to avoid global imbalance among categories. Moreover, we introduce a partial prompt aggregation protocol for effective collaboration and personalization. In particular, visual prompts containing general image features are aggregated at the server, while text prompts encoding personalized knowledge are retained locally. Extensive experiments on six datasets demonstrate the superior performance of our FedCoPL compared to baseline methods. Our code is available in the supplementary materials.

Low-Rank Adaptation (LoRA) has proven effective in reducing computational costs while maintaining performance comparable to fully fine-tuned foundation models across various tasks. However, its fixed low-rank structure restricts its adaptability in scenarios with substantial domain gaps, where higher ranks are often required to capture domain-specific complexities. Current adaptive LoRA methods attempt to overcome this limitation by dynamically expanding or selectively allocating ranks, but these approaches frequently depend on computationally intensive techniques such as iterative pruning, rank searches, or additional regularization. To address these challenges, we introduce Stable Rank-Guided Low-Rank Adaptation (SR-LoRA), a novel framework that utilizes the stable rank of pre-trained weight matrices as a natural prior for layer-wise rank allocation. By leveraging the stable rank, which reflects the intrinsic dimensionality of the weights, SR-LoRA enables a principled and efficient redistribution of ranks across layers, enhancing adaptability without incurring additional search costs. Empirical evaluations on few-shot tasks with significant domain gaps show that SR-LoRA consistently outperforms recent adaptive LoRA variants, achieving a superior trade-off between performance and efficiency. Our code is available at https://anonymous.4open.science/r/SR-LoRA-A18F.
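As a rough illustration of the quantity this abstract relies on, the sketch below computes the stable rank of a weight matrix, using the standard definition (squared Frobenius norm divided by squared spectral norm), and then splits a rank budget across layers. The proportional allocation rule, the function names, and the `total_budget` parameter are assumptions made for illustration; the abstract does not state how SR-LoRA maps stable ranks to LoRA ranks.

```python
import math
import random


def _matvec(A, x):
    """Multiply matrix A (list of rows) by vector x."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def spectral_norm_sq(W, iters=200, seed=0):
    """Estimate the squared largest singular value of W by power iteration on W^T W."""
    Wt = [list(col) for col in zip(*W)]
    rng = random.Random(seed)
    # Random start vector, so it is unlikely to be orthogonal to the top singular vector.
    v = [rng.gauss(0.0, 1.0) for _ in range(len(W[0]))]
    for _ in range(iters):
        u = _matvec(Wt, _matvec(W, v))
        norm = math.sqrt(sum(x * x for x in u))
        if norm == 0.0:
            return 0.0
        v = [x / norm for x in u]
    Wv = _matvec(W, v)
    return sum(x * x for x in Wv)


def stable_rank(W):
    """Stable rank: ||W||_F^2 / ||W||_2^2 (always between 1 and rank(W) for nonzero W)."""
    frob_sq = sum(x * x for row in W for x in row)
    spec_sq = spectral_norm_sq(W)
    return frob_sq / spec_sq if spec_sq > 0.0 else 0.0


def allocate_ranks(stable_ranks, total_budget):
    """Hypothetical rule: give each layer a share of the budget proportional to its stable rank.

    Rounding means the shares may not sum exactly to total_budget.
    """
    total = sum(stable_ranks)
    return [max(1, round(total_budget * s / total)) for s in stable_ranks]


layers = {
    "identity_like": [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
    "rank_one": [[1.0, 2.0], [2.0, 4.0]],
}
srs = [stable_rank(W) for W in layers.values()]
print(dict(zip(layers, allocate_ranks(srs, total_budget=8))))
```

Because the stable rank is computed once from the pre-trained weights, a rule of this kind needs no iterative pruning or rank search, which matches the efficiency argument made in the abstract.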


#311
SVIP: Semantically Contextualized Visual Patches for Zero-Shot Learning

Zhi Chen · Zecheng Zhao · Jingcai Guo · Jingjing Li · Zi Huang

Zero-shot learning (ZSL) aims to recognize unseen classes without labeled training examples by leveraging class-level semantic descriptors such as attributes. A fundamental challenge in ZSL is semantic misalignment, where semantic-unrelated information in visual features introduces ambiguity to visual-semantic interaction. Unlike existing methods that suppress semantic-unrelated information post hoc either in the feature space or the model space, we propose addressing this issue at the input stage, preventing semantic-unrelated patches from propagating through the network. To this end, we introduce Semantically contextualized VIsual Patches (SVIP) for ZSL, a transformer-based framework designed to enhance visual-semantic alignment. Specifically, we propose a self-supervised patch selection mechanism that preemptively learns to identify semantic-unrelated patches in the input space. This is trained with supervision from aggregated attention scores across all transformer layers, which estimate each patch’s semantic score. As removing semantic-unrelated patches from the input sequence may disrupt object structure, we replace them with learnable patch embeddings. With initialization from word embeddings, we can ensure they remain semantically meaningful throughout feature extraction. Extensive experiments on ZSL benchmarks demonstrate that SVIP achieves state-of-the-art performance while providing more interpretable and semantically rich feature representations.


#312
Highlight
DreamLayer: Simultaneous Multi-Layer Generation via Diffusion Model

Junjia Huang · Pengxiang Yan · Jinhang Cai · Jiyang Liu · Zhao Wang · Yitong Wang · Xinglong Wu · Guanbin Li

Text-driven image generation using diffusion models has recently gained significant attention. To enable more flexible image manipulation and editing, recent research has expanded from single image generation to transparent layer generation and multi-layer compositions. However, existing approaches often fail to provide a thorough exploration of multi-layer structures, leading to inconsistent inter-layer interactions, such as occlusion relationships, spatial layout, and shadowing. In this paper, we introduce DreamLayer, a novel framework that enables coherent text-driven generation of multiple image layers, by explicitly modeling the relationship between transparent foreground and background layers. DreamLayer incorporates three key components, i.e., Context-Aware Cross-Attention (CACA) for global-local information exchange, Layer-Shared Self-Attention (LSSA) for establishing robust inter-layer connections, and Information Retained Harmonization (IRH) for refining fusion details at the latent level. By leveraging a coherent full-image context, DreamLayer builds inter-layer connections through attention mechanisms and applies a harmonization step to achieve seamless layer fusion. To facilitate research in multi-layer generation, we construct a high-quality, diverse multi-layer dataset including 400k samples. Extensive experiments and user studies demonstrate that DreamLayer generates more coherent and well-aligned layers, with broad applicability, including latent-space image editing and image-to-layer decomposition.


#313
Highlight
AutoOcc: Automatic Open-Ended Semantic Occupancy Annotation via Vision-Language Guided Gaussian Splatting

Xiaoyu Zhou · Jingqi Wang · Yongtao Wang · Yufei Wei · Nan Dong · Ming-Hsuan Yang

Obtaining high-quality 3D semantic occupancy from raw sensor data remains an essential yet challenging task, often requiring extensive manual labeling. In this work, we propose AutoOcc, a vision-centric automated pipeline for open-ended semantic occupancy annotation that integrates differentiable Gaussian splatting guided by vision-language models. We formulate the open-ended semantic occupancy reconstruction task to automatically generate scene occupancy by combining attention maps from vision-language models and foundation vision models. We devise semantic-aware Gaussians as intermediate geometric descriptors and propose a cumulative Gaussian-to-voxel splatting algorithm that enables effective and efficient occupancy annotation. Our framework outperforms existing automated occupancy annotation methods without human labels. AutoOcc also enables open-ended semantic occupancy auto-labeling, achieving robust performance in both static and dynamically complex scenarios. All the source code and trained models will be released.


#314
DWIM: Towards Tool-aware Visual Reasoning via Discrepancy-aware Workflow Generation & Instruct-Masking Tuning

Fucai Ke · Vijay Kumar b g · Xingjian Leng · Zhixi Cai · Zaid Khan · Weiqing Wang · Pari Delir Haghighi · Hamid Rezatofighi · Manmohan Chandraker

Visual reasoning (VR), which is crucial in many fields for enabling human-like visual understanding, remains highly challenging. Recently, compositional visual reasoning approaches, which leverage the reasoning abilities of large language models (LLMs) with integrated tools to solve problems, have shown promise as more effective strategies than end-to-end VR methods. However, these approaches face limitations, as frozen LLMs lack tool awareness in VR, leading to performance bottlenecks. While leveraging LLMs for reasoning is common in other domains, such approaches are not directly applicable to VR due to limited training data, imperfect tools that introduce errors and reduce data collection efficiency in VR, and the difficulty of fine-tuning on noisy workflows. To address these challenges, we propose DWIM: i) Discrepancy-aware training Workflow generation, which assesses tool usage and extracts more viable workflows for training; and ii) Instruct-Masking fine-tuning, which guides the model to only clone effective actions, enabling the generation of more practical solutions. Our experiments demonstrate that DWIM achieves state-of-the-art performance across various VR tasks, exhibiting strong generalization on multiple widely-used datasets.


#315
FW-Merging: Scaling Model Merging with Frank-Wolfe Optimization

Hao Chen · Shell Xu Hu · Wayne Luk · Timothy Hospedales · Hongxiang Fan

Model merging has emerged as a promising approach for multi-task learning (MTL) in large language models (LLMs), providing a training- and data-efficient alternative to conventional fine-tuning. However, with the rapid development of the open-source AI ecosystem and the increasing availability of fine-tuned LLMs, existing model merging methods face two key limitations: (i) they are primarily designed for in-house fine-tuned models, making them less adaptable to diverse model sources with partially unknown model and task information, and (ii) they struggle to scale effectively when merging numerous model checkpoints. To address these challenges, we formulate model merging as a constrained optimization problem and introduce a novel approach: Frank-Wolfe Merging (FW-Merging). Inspired by Frank-Wolfe optimization, our approach iteratively selects the most relevant model parameters to minimize a linear approximation of the objective function, merging them through a predefined merging function. The objective function is designed to capture the desired behavior of the target merged model, while the fine-tuned candidate models define the constraint set. More importantly, FW-Merging serves as an orthogonal technique to existing merging methods, seamlessly integrating with them to further enhance performance. Our experiments show that FW-Merging scales across diverse model sources, remaining stable with 16 irrelevant models and improving by 15.3% with 16 relevant models on 20 CV tasks, all while maintaining constant memory overhead, unlike the linear overhead of data-informed methods. Compared with the state-of-the-art methods, FW-Merging surpasses the data-free merging method by 32.8% and outperforms the data-informed Adamerging by 8.39% when merging 20 ViT models. Our code is attached with this submission.
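The iteration described above can be sketched with the classical Frank-Wolfe algorithm over the convex hull of candidate parameter vectors. This is a minimal toy under stated assumptions, not the authors' implementation: the "merging function" here is plain convex interpolation, the objective is a simple quadratic, and `frank_wolfe_merge`, `grad_fn`, and the toy candidates are hypothetical names and data.

```python
def frank_wolfe_merge(candidates, grad_fn, steps=200):
    """Classical Frank-Wolfe over the convex hull of candidate parameter vectors.

    candidates: list of parameter vectors (the constraint set's vertices).
    grad_fn:    returns the gradient of the objective at the current parameters.
    """
    theta = list(candidates[0])  # start from one of the candidate models
    for k in range(steps):
        g = grad_fn(theta)
        # Linear minimization oracle: the candidate minimizing <g, s> minimizes
        # the linear approximation of the objective over the hull.
        s = min(candidates, key=lambda c: sum(gi * ci for gi, ci in zip(g, c)))
        gamma = 2.0 / (k + 2.0)  # standard diminishing step size
        # Assumed merging function: convex interpolation toward the selected candidate.
        theta = [(1.0 - gamma) * t + gamma * si for t, si in zip(theta, s)]
    return theta


# Toy objective: squared distance to a target that lies inside the hull.
target = [0.25, 0.25]


def loss(theta):
    return 0.5 * sum((t - y) ** 2 for t, y in zip(theta, target))


def grad(theta):
    return [t - y for t, y in zip(theta, target)]


candidates = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
merged = frank_wolfe_merge(candidates, grad)
print(merged, loss(merged))
```

Only the current parameters and one selected candidate are combined at each step, which is consistent with the constant memory overhead the abstract reports.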


#316
Augmenting Moment Retrieval: Zero-Dependency Two-Stage Learning

Zhengxuan Wei · Jiajin Tang · Sibei Yang

Existing Moment Retrieval methods face three critical bottlenecks: (1) data scarcity forces models into shallow keyword-feature associations; (2) boundary ambiguity in transition regions between adjacent events; (3) insufficient discrimination of fine-grained semantics (e.g., distinguishing "kicking" vs. "throwing" a ball). In this paper, we propose a zero-external-dependency Augmented Moment Retrieval framework, AMR, designed to overcome local optima caused by insufficient data annotations and the lack of robust boundary and semantic discrimination capabilities. AMR is built upon two key insights: (1) it resolves ambiguous boundary information and semantic confusion in existing annotations without additional data (avoiding costly manual labeling), and (2) it preserves boundary and semantic discriminative capabilities enhanced by training while generalizing to real-world scenarios, significantly improving performance. Furthermore, we propose a two-stage training framework with cold-start and distillation adaptation. The cold-start stage employs curriculum learning on augmented data to build foundational boundary/semantic awareness. The distillation stage introduces dual query sets: Original Queries maintain DETR-based localization using frozen Base Queries from the cold-start model, while Active Queries dynamically adapt to real-data distributions. A cross-stage distillation loss enforces consistency between Original and Base Queries, preventing knowledge forgetting while enabling real-world generalization. Experiments on multiple benchmarks show that AMR achieves improved performance over prior state-of-the-art approaches.

Despite the promise of Multi-Task Learning (MTL) in leveraging complementary knowledge across tasks, existing multi-task optimization (MTO) techniques remain fixated on resolving conflicts through optimizer-centric loss scaling and gradient manipulation, yet fail to deliver consistent gains. In this paper, we argue that the shared representation space, where task interactions naturally occur, offers rich information and potential for operations complementary to existing optimizer designs, especially for facilitating inter-task complementarity, which is rarely explored in MTO. This intuition leads to Rep-MTL, which exploits representation-level task saliency to quantify interactions between task-specific optimization and shared representation learning. By steering these saliencies through entropy-based penalization and sample-wise cross-task alignment, Rep-MTL aims to mitigate negative transfer by maintaining the effective training of individual tasks instead of pure conflict-solving, while explicitly promoting complementary information sharing. Experiments are conducted on four challenging MTL benchmarks covering both task-shift and domain-shift scenarios. The results show that Rep-MTL, even paired with the basic equal weighting (EW) policy, achieves competitive performance gains with favorable efficiency. Beyond standard performance metrics, Power Law (PL) exponent analysis demonstrates Rep-MTL’s efficacy in balancing task-specific learning and cross-task sharing.


#318
Enhancing Numerical Prediction of MLLMs with Soft Labeling

Pei Wang · Zhaowei Cai · Hao Yang · Davide Modolo · Ashwin Swaminathan

The optimality of using the de facto cross-entropy loss with a one-hot target distribution (hard labeling) is questioned when training (Multimodal) Large Language Models (LLMs/MLLMs). Although it is reasonable for language token prediction, which is a typical multi-class classification problem in discrete space, it is suboptimal for tasks like numerical prediction, which is a typical regression problem in continuous space. However, enabling regression in LLMs/MLLMs would complicate the training and the next-token prediction paradigm at inference. Instead, to address this challenge, we propose a novel loss design, called soft labeling, which smooths the target probability distribution, enabling predictions to be penalized according to their distance to the target. This is similar to a regression loss, which penalizes predictions more the further they are from the target in continuous space, but it does not change the model architecture or the next-token prediction paradigm of LLMs/MLLMs. We demonstrate the efficacy of soft labeling through extensive experiments on visual grounding, object counting, and chart understanding, achieving state-of-the-art performance on multiple benchmarks without bells and whistles. Soft labeling can be applied in any LLM/MLLM.
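To make the contrast between hard and soft labeling concrete, here is a minimal sketch over ten ordered numeric bins. The Gaussian-shaped smoothing kernel, the `sigma` parameter, and all function names are assumptions chosen for illustration; the abstract does not specify the exact smoothing used in the paper.

```python
import math


def one_hot(target_idx, num_bins):
    """Hard label: all probability mass on the target bin."""
    return [1.0 if i == target_idx else 0.0 for i in range(num_bins)]


def soft_targets(target_idx, num_bins, sigma=1.0):
    """Assumed smoothing: mass decays with squared distance from the target bin."""
    weights = [math.exp(-((i - target_idx) ** 2) / (2.0 * sigma ** 2))
               for i in range(num_bins)]
    total = sum(weights)
    return [w / total for w in weights]


def cross_entropy(logits, target_dist):
    """Cross-entropy between a target distribution and softmax(logits)."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return -sum(t * (x - lse) for t, x in zip(target_dist, logits))


def peaked_logits(peak_idx, num_bins, height=5.0):
    """A toy prediction that is confident about a single bin."""
    return [height if i == peak_idx else 0.0 for i in range(num_bins)]


num_bins, target = 10, 5
near = peaked_logits(6, num_bins)  # off by one bin
far = peaked_logits(9, num_bins)   # off by four bins

# Hard labeling cannot tell the two wrong predictions apart.
print(cross_entropy(near, one_hot(target, num_bins)),
      cross_entropy(far, one_hot(target, num_bins)))
# Soft labeling penalizes the distant prediction more.
print(cross_entropy(near, soft_targets(target, num_bins)),
      cross_entropy(far, soft_targets(target, num_bins)))
```

The model still outputs a distribution over discrete tokens, so the architecture and decoding are unchanged; only the training target differs.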


#319
Transparent Vision: A Theory of Hierarchical Invariant Representations

Shuren Qi · Yushu Zhang · CHAO WANG · Zhihua Xia · Xiaochun Cao · FENGLEI FAN

Developing robust and interpretable vision systems is a crucial step towards trustworthy artificial intelligence. One promising paradigm is to design transparent structures, e.g., geometric invariance, for fundamental representations. However, such invariants exhibit limited discriminability, limiting their applications in larger-scale tasks. For this open problem, we conduct a systematic investigation of hierarchical invariance, exploring this topic from theoretical, practical, and application perspectives. At the theoretical level, we show how to construct discriminative invariants with a Convolutional Neural Network (CNN)-like hierarchical architecture, yet in a fully transparent manner. The general blueprint, specific definitions, invariant properties, and numerical implementations are provided. At the practical level, we discuss how to customize this transparent framework into a given task. With the over-completeness, discriminative features w.r.t. the task can be adaptively formed in a Neural Architecture Search (NAS)-like manner. We demonstrate the above arguments with accuracy, invariance, and efficiency results on laboratory-style classification experiments. Furthermore, at the application level, our representations are explored in real-world forensic tasks on adversarial perturbations and generated content. Such applications reveal that our invariants exhibit competitive discriminability even in the era of deep learning. For robust and interpretable vision tasks at larger scales, hierarchical invariant representations can be considered as an effective alternative to traditional CNNs and invariants.


#320
Fuse Before Transfer: Knowledge Fusion for Heterogeneous Distillation

Guopeng Li · Qiang Wang · Ke Yan · Shouhong Ding · Yuan Gao · Gui-Song Xia

Most knowledge distillation (KD) methods focus on teacher-student pairs with similar architectures, such as both being CNN models. The potential and flexibility of KD can be greatly improved by expanding it to Cross-Architecture KD (CAKD), where the knowledge of homogeneous and heterogeneous teachers can be transferred selectively to given students. However, CAKD is extremely challenging because of substantial feature gaps between heterogeneous models (e.g., a ViT teacher and a CNN student), originating from the differences in their inherent inductive biases and module functions. To this end, we fuse heterogeneous knowledge before transferring it from teacher to student. This fusion combines the advantages of cross-architecture inductive biases and module functions by merging directly from different combinations of convolution, attention, and MLP modules derived from both student and teacher module functions. Furthermore, we observe that heterogeneous features exhibit diverse spatial distributions, hindering the effectiveness of the conventional pixel-wise MSE loss. Therefore, we leverage a spatial-agnostic InfoNCE loss to align features after spatial smoothing. Our method is evaluated across various homogeneous models and arbitrary heterogeneous combinations of CNNs, ViTs, and MLPs, yielding promising performance for distilled models with a maximum gain of 11.47% on CIFAR-100 and 3.67% on ImageNet-1K. Our code will be released.
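The feature-alignment step mentioned above can be illustrated with a generic InfoNCE loss. In this sketch, global average pooling stands in for "spatial smoothing", and student and teacher features of the same sample form the positive pair; both choices, along with the function names and the temperature value, are assumptions for illustration rather than details confirmed by the abstract.

```python
import math


def avg_pool(feature_map):
    """Assumed spatial smoothing: average a list of per-location vectors into one vector."""
    n = len(feature_map)
    return [sum(col) / n for col in zip(*feature_map)]


def cosine(a, b):
    """Cosine similarity, returning 0.0 if either vector is all zeros."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def info_nce(student, teacher, temperature=0.1):
    """InfoNCE over a batch: student[i] and teacher[i] are positives, the rest negatives."""
    losses = []
    for i, q in enumerate(student):
        logits = [cosine(q, k) / temperature for k in teacher]
        m = max(logits)
        lse = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(lse - logits[i])  # -log softmax at the positive index
    return sum(losses) / len(losses)


# Two samples, each with a 2-location feature map of 2-dimensional features.
student_maps = [[[1.0, 0.0], [0.8, 0.2]], [[0.0, 1.0], [0.2, 0.8]]]
teacher_maps = [[[0.9, 0.1], [1.0, 0.0]], [[0.1, 0.9], [0.0, 1.0]]]
student = [avg_pool(m) for m in student_maps]
teacher = [avg_pool(m) for m in teacher_maps]
print(info_nce(student, teacher))
```

Unlike a pixel-wise MSE, this loss compares pooled vectors only, so it does not require student and teacher activations to coincide at each spatial location.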


#321
What You Have is What You Track: Adaptive and Robust Multimodal Tracking

Yuedong Tan · Jiawei Shao · Eduard Zamfir · Ruanjun Li · Zhaochong An · Chao Ma · Danda Pani Paudel · Luc Gool · Radu Timofte · Zongwei Wu

Multimodal data is known to be helpful for visual tracking by improving robustness to appearance variations. However, sensor synchronization challenges often compromise data availability, particularly in video settings where shortages can be temporal. Despite its importance, this area remains underexplored. In this paper, we present the first comprehensive study on tracker performance with temporally incomplete multimodal data. Unsurprisingly, under such circumstances, existing trackers exhibit significant performance degradation, as their rigid architectures lack the adaptability needed to effectively handle missing modalities. To address these limitations, we propose a flexible framework for robust multimodal tracking. We venture that a tracker should dynamically activate computational units based on missing data rates. This is achieved through a novel Heterogeneous Mixture-of-Experts fusion mechanism with adaptive complexity, coupled with a video-level masking strategy that ensures both temporal consistency and spatial completeness, which are critical for effective video tracking. Surprisingly, our model not only adapts to varying missing rates but also adjusts to scene complexity. Extensive experiments show that our model achieves SOTA performance across 9 benchmarks, excelling in both conventional complete and missing modality settings. The code and benchmark will be made publicly available.

Continual learning enables pre-trained generative vision-language models (VLMs) to incorporate knowledge from new tasks without retraining on data from previous ones. Recent methods update a visual projector to translate visual information for new tasks, connecting pre-trained vision encoders with large language models. However, such adjustments may cause the models to prioritize visual inputs over language instructions, particularly when learning tasks with repetitive types of textual instructions. To address the neglect of language instructions, we propose a novel framework that grounds the translation of visual information on instructions for language models. We introduce a mixture of visual projectors, each serving as a specialized visual-to-language translation expert based on the given instruction context to adapt to new tasks. To avoid using experts for irrelevant instruction contexts, we propose an expert recommendation strategy that reuses experts for tasks similar to those previously learned. Additionally, we introduce expert pruning to alleviate interference from the use of experts that were cumulatively activated in previous tasks. Extensive experiments on diverse vision-language tasks demonstrate that our method outperforms existing continual learning approaches by generating instruction-following responses.


#323
OrderChain: Towards General Instruct-Tuning for Stimulating the Ordinal Understanding Ability of MLLM

Jinhong Wang · Shuo Tong · Jintai CHEN · Jian liu · Dongqi Tang · Weiqiang Wang · Wentong Li · Hongxia Xu · Danny Chen · Jian Wu

Despite the remarkable progress of multimodal large language models (MLLMs), they continue to face challenges in achieving competitive performance on ordinal regression (OR; a.k.a. ordinal classification). To address this issue, this paper presents OrderChain, a novel and general prompting paradigm that augments the ordinal understanding ability of MLLMs by specificity and commonality modeling. Specifically, our OrderChain consists of a set of task-aware prompts to facilitate the specificity modeling of diverse OR tasks and a new range optimization Chain-of-Thought (RO-CoT), which learns a commonality way of thinking about OR tasks by uniformly decomposing them into multiple small-range optimization subtasks. Further, we propose a category recursive division (CRD) method to generate instruction candidate category prompts to support RO-CoT automatic optimization. Comprehensive experiments show that a Large Language and Vision Assistant (LLaVA) model with our OrderChain improves baseline LLaVA significantly on diverse OR datasets, e.g., from 47.5% to 93.2% accuracy on the Adience dataset for age estimation, and from 30.0% to 85.7% accuracy on the Diabetic Retinopathy dataset. Notably, LLaVA with our OrderChain also remarkably outperforms state-of-the-art methods by 27% on accuracy and 0.24 on MAE on the Adience dataset. To our best knowledge, our OrderChain is the first work that augments MLLMs for OR tasks, and the effectiveness is witnessed across a spectrum of OR datasets.

Compared to 2D data, the scale of point cloud data available for training in different domains is quite limited. Researchers have been trying to combine data from different domains for masked autoencoder (MAE) pre-training to alleviate this data scarcity issue. However, the prior knowledge learned from mixed domains may not align well with the downstream 3D point cloud analysis tasks, leading to degraded performance. To address this issue, we propose the Domain-Adaptive Point Cloud Masked Autoencoder (DAP-MAE), an MAE pre-training method, to adaptively integrate the knowledge of cross-domain datasets for general point cloud analysis. In DAP-MAE, we design a heterogeneous domain adapter that utilizes an adaptation mode during pre-training, enabling the model to comprehensively learn information from point clouds across different domains, while employing a fusion mode during fine-tuning to enhance point cloud features. Meanwhile, DAP-MAE incorporates a domain feature generator to guide the adaptation of point cloud features to various downstream tasks. With only one pre-training, DAP-MAE achieves excellent performance across four different point cloud analysis tasks, reaching 95.18% in object classification on ScanObjectNN and 88.45% in facial expression recognition on Bosphorus.


#325
SFUOD: Source-Free Unknown Object Detection

Keon-Hee Park · Seun-An Choe · Gyeong-Moon Park

Source-free object detection adapts a detector pre-trained on a source domain to an unlabeled target domain without requiring access to labeled source data. While this setting is practical as it eliminates the need for the source dataset during domain adaptation, it operates under the restrictive assumption that only pre-defined objects from the source domain exist in the target domain. This closed-set setting prevents the detector from detecting undefined objects. To ease this assumption, we propose Source-Free Unknown Object Detection (SFUOD), a novel scenario which enables the detector to not only recognize known objects but also detect undefined objects as unknown objects. To this end, we propose CollaPAUL (Collaborative tuning and Principal Axis-based Unknown Labeling), a novel framework for SFUOD. Collaborative tuning enhances knowledge adaptation by integrating target-dependent knowledge from the auxiliary encoder with source-dependent knowledge from the pre-trained detector through a cross-domain attention mechanism. Additionally, principal axis-based unknown labeling assigns pseudo-labels to unknown objects by estimating objectness via principal axes projection and confidence scores from model predictions. The proposed CollaPAUL achieves state-of-the-art performance on SFUOD benchmarks, and extensive experiments validate its effectiveness. The code will be released after the review.


#326
Activation Subspaces for Out-of-Distribution Detection

Barış Zöngür · Robin Hesse · Stefan Roth

To ensure the reliability of deep models in real-world applications, out-of-distribution (OOD) detection methods aim to distinguish samples close to the training distribution (in-distribution, ID) from those farther away (OOD). In this work, we propose a novel OOD detection method that utilizes singular value decomposition of the weight matrix of the classification head to decompose the model's feature activations into decisive and insignificant components, which contribute maximally, respectively minimally, to the final classifier output. We find that the subspace of insignificant components more effectively distinguishes ID from OOD data than raw activations. This occurs because the classification objective leaves the indecisive subspace largely unaffected, yielding features that are "untainted" by the target classification task. Conversely, we find that activation shaping methods profit from only considering the decisive subspace, as the insignificant component can cause interference in the activation space. By combining these two findings into a single method, we achieve state-of-the-art results in various standard OOD benchmarks.
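
The decomposition described above can be illustrated with a minimal sketch. It assumes a linear classification head with weight matrix `W` of shape (classes, features) and treats the span of the top-`k` right singular vectors as the decisive subspace. The function name `split_activations`, the cutoff `k`, and the random data are hypothetical choices for illustration; the abstract does not specify how the split or the final OOD score is computed.

```python
import numpy as np

def split_activations(W, h, k):
    """Split feature vectors h (N, D) into decisive and insignificant parts.

    W is the (C, D) weight matrix of a linear classification head. k is the
    number of leading right singular vectors treated as "decisive"
    (a hypothetical hyperparameter for this sketch).
    """
    # Rows of Vt are right singular vectors, sorted by decreasing singular value.
    _, _, Vt = np.linalg.svd(W, full_matrices=True)
    V_dec = Vt[:k].T                 # (D, k) orthonormal basis of the decisive subspace
    h_dec = h @ V_dec @ V_dec.T      # orthogonal projection onto that subspace
    h_ins = h - h_dec                # remainder lies in the orthogonal complement
    return h_dec, h_ins

rng = np.random.default_rng(0)
W = rng.normal(size=(10, 64))        # 10 classes, 64-dimensional features
h = rng.normal(size=(5, 64))         # 5 feature vectors
h_dec, h_ins = split_activations(W, h, k=10)

logits_full = h @ W.T
logits_dec = h_dec @ W.T             # decisive part reproduces the logits when k = rank(W)
logits_ins = h_ins @ W.T             # insignificant part contributes (numerically) nothing
```

With `k` equal to the rank of `W`, the insignificant component is invisible to the classifier, which matches the abstract's intuition that this subspace is left largely unaffected by the classification objective.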


#327
A Recipe for Generating 3D Worlds from a Single Image

Katja Schwarz · Denis Rozumny · Samuel Rota Bulò · Lorenzo Porzi · Peter Kontschieder

We introduce a recipe for generating immersive 3D worlds from a single image by framing the task as an in-context learning problem for 2D inpainting models. This approach requires minimal training and uses existing generative models. Our process involves two steps: generating coherent panoramas using a pre-trained diffusion model and lifting these into 3D with a metric depth estimator. We then fill unobserved regions by conditioning the inpainting model on rendered point clouds, requiring minimal fine-tuning. Tested on both synthetic and real images, our method produces high-quality 3D environments suitable for VR display. By explicitly modeling the 3D structure of the generated environment from the start, our approach consistently outperforms state-of-the-art, video synthesis-based methods along multiple quantitative image quality metrics.


#328
On the Complexity-Faithfulness Trade-off of Gradient-Based Explanations

Amir Mehrpanah · Matteo Gamba · Kevin Smith · Hossein Azizpour

ReLU networks, while prevalent for visual data, have sharp transitions, sometimes relying on individual pixels for predictions, making vanilla gradient-based explanations noisy and difficult to interpret. Existing methods, such as GradCAM, smooth these explanations by producing surrogate models at the cost of faithfulness. We introduce a unifying spectral framework to systematically analyze and quantify smoothness, faithfulness, and their trade-off in explanations. Using this framework, we quantify and reduce the contribution of ReLU networks to high-frequency information, providing a principled approach to identifying this trade-off. Our analysis characterizes how surrogate-based smoothing distorts explanations, leading to an "explanation gap" that we formally define and measure for different post-hoc methods. Finally, we validate our theoretical findings across different design choices, datasets, and ablations.


#329
Oasis: One Image is All You Need for Multimodal Instruction Data Synthesis

Letian Zhang · Quan Cui · Bingchen Zhao · Cheng Yang

The success of multi-modal large language models (MLLMs) has been largely attributed to the large-scale training data. However, the training data of many MLLMs is unavailable due to privacy concerns. The expensive and labor-intensive process of collecting multi-modal data further exacerbates the problem. Is it possible to synthesize multi-modal training data automatically without compromising diversity and quality? In this paper, we propose a new method, Oasis, to synthesize high-quality multi-modal data with only images. Oasis breaks through traditional methods by prompting only images to the MLLMs, thus extending the data diversity by a large margin. Our method features a delicate quality control method which ensures the data quality. We collected over 500k data and conducted incremental experiments on LLaVA-NeXT. Extensive experiments demonstrate that our method can significantly improve the performance of MLLMs. The image-based synthesis also allows us to focus on the specific-domain ability of MLLMs. Code and data will be publicly available.


#330
Learning to Inference Adaptively for Multimodal Large Language Models

Zhuoyan Xu · Khoi Nguyen · Preeti Mukherjee · Saurabh Bagchi · Somali Chaterji · Yingyu Liang · Yin Li

Multimodal Large Language Models (MLLMs) have shown impressive capabilities in reasoning, yet come with substantial computational cost, limiting their deployment in resource-constrained settings. Despite recent efforts on improving the efficiency of MLLMs, prior solutions fall short in responding to varying runtime conditions, in particular changing resource availability (e.g., contention due to the execution of other programs on the device). To bridge this gap, we introduce AdaLLaVA, an adaptive inference framework that learns to dynamically reconfigure operations in an MLLM during inference, accounting for the input data and a latency budget. We conduct extensive experiments across benchmarks involving question-answering, reasoning, and hallucination. Our results show that AdaLLaVA effectively adheres to the input latency budget, achieving varying accuracy and latency tradeoffs at runtime. Further, we demonstrate that AdaLLaVA adapts to both input latency and content, can be integrated with token selection for enhanced efficiency, and generalizes across MLLMs.

Current lifelong person re-identification (LReID) methods predominantly rely on fully labeled data streams. However, in real-world scenarios where annotation resources are limited, a vast amount of unlabeled data coexists with scarce labeled samples, leading to the Semi-Supervised LReID (Semi-LReID) problem and making LReID methods suffer severe performance degradation. Despite the practical significance of Semi-LReID, it remains unexplored due to its inherent challenges. Existing LReID methods, even when combined with semi-supervised strategies, suffer limited long-term adaptation performance due to struggling with the noisy knowledge occurring during unlabeled data utilization, which hinders new knowledge acquisition and exacerbates catastrophic forgetting. In this paper, we pioneer the investigation of Semi-LReID, introducing a novel Self-Reinforcing PRototype Evolution with Dual-Knowledge Cooperation framework (SPRED). Our key innovation lies in establishing a self-reinforcing cycle between dynamic prototype-guided pseudo-label generation and new-old knowledge collaborative purification to enhance the utilization of unlabeled data. Specifically, learnable identity prototypes are introduced to dynamically capture the identity distributions as the pseudo-label evolves, then generate high-quality pseudo-labels, while dual-knowledge cooperation, which integrates current model specialization and historical model generalization, refines pseudo-labels by filtering out noisy information. Through this cyclic design, reliable pseudo-labels are progressively mined to improve current-stage learning and ensure positive knowledge propagation over long-term learning. Besides, a prototype structure-based knowledge distillation loss is developed to mitigate catastrophic forgetting, further boosting the long-term knowledge consolidation capacity. Extensive experiments on established Semi-LReID benchmarks demonstrate that our SPRED achieves state-of-the-art performance. 
Our code will be publicly available.

Existing adaptation methods of pre-trained vision-language models like CLIP often rely on base-class samples during fine-tuning, introducing systematic biases that distort decision boundaries and degrade performance on novel classes. In this work, we break new ground by proposing a hierarchical divide-and-conquer framework that addresses classification bias at its root. Our method first segregates the label space into base and novel subspaces, ensuring domain separation. Subsequently, it employs text-embedding clustering within each subspace to decompose ambiguous intra-domain classes into disentangled, fine-grained clusters. This two-stage grouping strategy not only alleviates class confusion but also enables domain-specific model training in isolated subspaces, fostering specialized learning without overfitting base categories. Experiments on three classification benchmarks reveal that our approach achieves state-of-the-art performance, surpassing the second-best competitor by 10% average accuracy.


#333
Knowledge Transfer from Interaction Learning

Yilin Gao · Kangyi Chen · Zhongxing Peng · Hengjie Lu · Shugong Xu

Current visual foundation models (VFMs) face a fundamental limitation in transferring knowledge from vision language models (VLMs): while VLMs excel at modeling cross-modal interactions through unified representation spaces, existing VFMs predominantly adopt result-oriented paradigms that neglect the underlying interaction processes. This representational discrepancy leads to suboptimal knowledge transfer and limited generalization capabilities across vision tasks. We propose Learning from Interactions, a cognitive-inspired framework that bridges this gap by explicitly modeling interactions during visual understanding. Our key insight is that preserving the interaction dynamics captured by VLMs, rather than just their final representations, enables more effective knowledge transfer to downstream VFMs. The technical core involves two innovations: (1) Interaction Queries that maintain persistent relationships across network layers, and (2) interaction-based supervision derived from pre-trained VLMs' cross-modal attention patterns. Comprehensive experiments demonstrate consistent improvements across multiple benchmarks: achieving ~3.3% and +1.6 mAP/+2.4 AP^mask absolute gains on TinyImageNet classification and COCO detection/segmentation respectively, with minimal parameter overhead and faster convergence (7× speedup). The framework particularly excels in cross-domain scenarios, delivering ~2.4% and ~9.3% zero-shot improvements on PACS and VLCS. Human evaluations confirm our approach's cognitive alignment, outperforming result-oriented methods by 2.7× in semantic consistency metrics.


#334
DOGR: Towards Versatile Visual Document Grounding and Referring

Yinan Zhou · Yuxin Chen · Haokun Lin · Yichen Wu · Shuyu Yang · Zhongang Qi · Chen Ma · Li Zhu

With recent advances in Multimodal Large Language Models (MLLMs), grounding and referring capabilities have gained increasing attention for achieving detailed understanding and flexible user interaction. However, these capabilities remain underdeveloped in visual document understanding due to the scarcity of fine-grained datasets and comprehensive benchmarks. To fill this gap, we propose the DOcument Grounding and rEferring data engine (DOGE-Engine), which generates two types of high-quality fine-grained document data: (1) multi-granular parsing data to improve text localization and recognition, and (2) instruction-tuning data to activate MLLMs' grounding and referring capabilities in dialogue and reasoning. Using the DOGE-Engine, we construct DOGE-Bench, a benchmark covering seven grounding and referring tasks across three document types (chart, poster, and PDF document), offering a comprehensive evaluation of fine-grained document understanding. Leveraging the generated data, we further develop DOGE, a strong baseline model that excels in text localization and recognition while precisely grounding and referring to key textual information during conversation and reasoning, thereby advancing document understanding to a finer granularity and enabling flexible interaction paradigms. Our code, data, and model will be open-sourced to support community development.


#335
Lark: Low-Rank Updates After Knowledge Localization for Few-shot Class-Incremental Learning

Jinxin Shi · Jiabao Zhao · Yifan Yang · Xingjiao Wu · Jiawen Li · Liang He

For Few-Shot Class-Incremental Learning (FSCIL), direct fine-tuning causes significant parameter shifts, resulting in catastrophic forgetting and increased resource consumption. Meanwhile, freezing the pre-trained backbone exacerbates the inconsistency between the backbone and the evolving classifier. To overcome these challenges, we introduce a method called Low-Rank updates after knowledge localization (Lark). In the knowledge localization phase, the Fisher Information Matrix is calculated to measure the sensitivity of parameters in different layers to previously acquired knowledge. This phase ultimately identifies the parameters within the model that are most suitable for learning new knowledge. In the subsequent incremental editing phase, a low-rank incremental update strategy is applied. This strategy ensures that the model parameter updates adhere to a rank-one matrix structure. By doing so, it minimizes alterations to the original parameters, thereby enabling the model to integrate new knowledge while retaining as much of the previous knowledge as possible. Extensive experimental results demonstrate that the Lark method achieves significant performance improvements on the CIFAR100, mini-ImageNet, and CUB200 datasets, surpassing current state-of-the-art methods.
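
The two phases described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: sensitivity is approximated by an empirical diagonal Fisher estimate (mean squared per-sample gradient), the least sensitive layer is assumed to be the one "most suitable" for new knowledge, and the vectors `u` and `v` stand in for whatever the editing phase would actually learn. All function and variable names are hypothetical.

```python
import numpy as np

def layer_sensitivity(per_sample_grads):
    """Empirical diagonal-Fisher proxy: mean squared gradient over samples.

    per_sample_grads has shape (N, P): gradients of the loss on old-task
    data with respect to the P parameters of one layer.
    """
    return float(np.mean(np.square(per_sample_grads)))

def rank_one_update(W, u, v):
    """Return W + u v^T, so the parameter change has rank one."""
    return W + np.outer(u, v)

rng = np.random.default_rng(0)

# Knowledge localization: synthetic gradients for two hypothetical layers.
grads = {
    "layer1": rng.normal(scale=1.0, size=(8, 12)),   # highly sensitive to old knowledge
    "layer2": rng.normal(scale=0.1, size=(8, 12)),   # weakly sensitive
}
scores = {name: layer_sensitivity(g) for name, g in grads.items()}
target = min(scores, key=scores.get)                 # assumption: edit the least sensitive layer

# Incremental editing: apply a rank-one change to that layer's weight matrix.
W = rng.normal(size=(4, 3))
u = rng.normal(size=4)
v = rng.normal(size=3)
W_new = rank_one_update(W, u, v)
delta_rank = np.linalg.matrix_rank(W_new - W, tol=1e-8)
```

Because the update is an outer product, it adds only 4 + 3 free values to a 4 × 3 matrix, which is the sense in which a rank-one structure limits alterations to the original parameters.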


#336
Advancing Textual Prompt Learning with Anchored Attributes

Zheng Li · Yibing Song · Ming-Ming Cheng · Xiang Li · jian Yang

Textual-based prompt learning methods primarily employ multiple learnable soft prompts and hard class tokens in a cascading manner as text inputs, aiming to align image and text (category) spaces for downstream tasks. However, current training is restricted to aligning images with predefined known categories and cannot be associated with unknown categories. In this work, we propose utilizing universal attributes as a bridge to enhance the alignment between images and unknown categories. Specifically, we introduce an Attribute-anchored Textual Prompt learning method for vision-language models, named ATPrompt. This approach expands the learning space of soft prompts from the original one-dimensional category level into the multi-dimensional attribute level by incorporating multiple attribute tokens into the learnable soft prompts. Through this modification, we transform the text prompt from a category-centric form to an attribute-category hybrid form. Additionally, we introduce a straightforward differentiable attribute search method to identify representative and suitable attributes for downstream tasks. As an easy-to-use plug-in technique, ATPrompt can seamlessly replace the existing prompt format in textual-based methods, providing general improvements at a negligible computational cost. Extensive experiments across 11 datasets validate the effectiveness of our method.


#337
Decoupled Multi-Predictor Optimization for Inference-Efficient Model Tuning

Liwei Luo · Shuaitengyuan Li · Dongwei Ren · Qilong Wang · Pengfei Zhu · Qinghua Hu

Recently, remarkable progress has been made in large-scale pre-trained model tuning, and inference efficiency is becoming more crucial for practical deployment. Early exiting in conjunction with multi-stage predictors, when combined with a parameter-efficient fine-tuning strategy, offers a straightforward way to achieve an inference-efficient model. However, a key challenge remains unresolved: How can early stages provide low-level fundamental features to deep stages while simultaneously supplying high-level discriminative features to early-stage predictors? To address this problem, we propose a Decoupled Multi-Predictor Optimization (DMPO) method to effectively decouple the low-level representative ability and high-level discriminative ability in early stages. First, in terms of architecture, we introduce a lightweight bypass module into multi-stage predictors for functional decomposition of shallow features from early stages, while a high-order statistics-based predictor is developed for early stages to effectively enhance their discriminative ability. To reasonably train our multi-predictor architecture, a decoupled optimization is proposed to allocate two-phase loss weights for multi-stage predictors during model tuning, where the initial training phase enables the model to prioritize the acquisition of discriminative ability of deep stages via emphasizing representative ability of early stages, and the latter training phase drives discriminative ability towards earlier stages as much as possible. As such, our DMPO can effectively decouple representative and discriminative abilities in early stages in terms of architecture design and model optimization. Experiments across various datasets and pre-trained backbones demonstrate that DMPO clearly outperforms its counterparts when reducing computational cost. In particular, DMPO with 30% FLOPs is comparable with or even surpasses counterparts with 70% FLOPs.


#338
SHIFT: Smoothing Hallucinations by Information Flow Tuning for Multimodal Large Language Models

Sudong Wang · Yunjian Zhang · Yao Zhu · Enci Liu · Jianing Li · Yanwei Liu · Xiangyang Ji

Despite the remarkable progress of Multimodal Large Language Models (MLLMs) in recent years, the persistent challenge of "hallucination" has surfaced as a major barrier, sharply constraining their practical applicability and reliability in real-world systems. In this paper, we provide a novel perspective on the causes and mitigation of hallucinations by tracking the information flow within MLLMs. We find that information in MLLMs does not flow in a strictly continuous manner; instead, it may mutate abruptly in deep layers. The mutated information does not originate from shallow layers; on the contrary, it is directly injected into the model, which may cause the model's outputs to deviate from the input, leading to hallucinations. Inspired by this observation, we propose a hallucination mitigation method that directly operates on the mutated information, named Smoothing Hallucinations by Information Flow Tuning (SHIFT). In this method, the differences of feature encodings between adjacent layers are monitored, and once the mutated information is detected, the knowledge from shallow layers is used to tune it. This process filters out hallucinated knowledge, aligning features more faithfully with the input and effectively reducing hallucinations. Extensive experiments on multiple benchmarks demonstrate the superior accuracy and efficiency of SHIFT in mitigating hallucinations compared with baselines.


#339
Large Multi-modal Models Can Interpret Features in Large Multi-modal Models

Kaichen Zhang · Yifei Shen · Bo Li · Ziwei Liu

Recent advances in Large Multimodal Models (LMMs) have led to significant breakthroughs in both academia and industry. One question that arises is how we, as humans, can understand their internal neural representations. This paper takes an initial step towards addressing this question by presenting a versatile framework to identify and interpret the semantics within LMMs. Specifically, 1) we first apply a Sparse Autoencoder (SAE) to disentangle the representations into human-understandable features. 2) We then present an automatic interpretation framework to interpret the open-semantic features learned in the SAE by the LMMs themselves. We employ this framework to analyze the LLaVA-NeXT-8B model using the LLaVA-OV-72B model, demonstrating that these features can effectively steer the model's behavior. Our results contribute to a deeper understanding of why LMMs excel in specific tasks, including EQ tests, and illuminate the nature of their mistakes along with potential strategies for their rectification. These findings offer new insights into the internal mechanisms of LMMs and suggest parallels with the cognitive processes of the human brain.

While contrastive pre-training is widely employed, its data efficiency problem has remained relatively under-explored thus far. Existing methods often rely on static coreset selection algorithms to pre-identify important data for training. However, this static nature renders them unable to dynamically track the data usefulness throughout pre-training, leading to subpar pre-trained models. To address this challenge, our paper introduces a novel dynamic bootstrapping dataset pruning method. It involves pruning data preparation followed by dataset mutation operations, both of which undergo iterative and dynamic updates. We apply this method to two prevalent contrastive pre-training frameworks: CLIP and MoCo, representing vision-language and vision-centric domains, respectively. In particular, we individually pre-train seven CLIP models on two large-scale image-text pair datasets, and two MoCo models on the ImageNet dataset, resulting in a total of 16 pre-trained models. With a data pruning rate of 30-35% across all 16 models, our method exhibits only marginal performance degradation (less than 1% on average) compared to corresponding models trained on the full dataset counterparts across various downstream datasets, and also surpasses several baselines with a large performance margin. Additionally, the byproduct from our method, i.e., coresets derived from the original datasets after pre-training, also demonstrates significant superiority in terms of downstream performance over other static coreset selection approaches. We include the code in the supplementary material to facilitate the reproduction of our results.


#341
A Conditional Probability Framework for Compositional Zero-shot Learning

Peng Wu · Qiuxia Lai · Hao Fang · Guo-Sen Xie · Yilong Yin · Xiankai Lu · Wenguan Wang

Compositional Zero-Shot Learning (CZSL) aims to recognize unseen combinations of known objects and attributes by leveraging knowledge from previously seen compositions. Traditional approaches primarily focus on disentangling attributes and objects, treating them as independent entities during learning. However, this assumption overlooks the semantic constraints and contextual dependencies inside a composition. For example, certain attributes naturally pair with specific objects (e.g., "striped" applies to "zebra" or "shirts" but not "sky" or "water"), while the same attribute can manifest differently depending on context (e.g., "young" in "young tree" vs "young dog"). Thus, capturing attribute-object interdependence remains a fundamental yet long-ignored challenge in CZSL. In this paper, we adopt a Conditional Probability Framework (CPF) to explicitly model attribute-object dependencies. We decompose the probability of a composition into two components: the likelihood of an object and the conditional likelihood of its attribute. To enhance object feature learning, we incorporate textual descriptors to highlight semantically relevant image regions. These enhanced object features then guide attribute learning through a cross-attention mechanism, ensuring better contextual alignment. By jointly optimizing object likelihood and conditional attribute likelihood, our method effectively captures compositional dependencies and generalizes well to unseen compositions. Extensive experiments on multiple CZSL benchmarks demonstrate the superiority of our approach. The source code will be released.
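
The decomposition described above corresponds to the chain rule p(attribute, object | image) = p(object | image) · p(attribute | object, image). The following minimal sketch shows how composition scores could be assembled from the two factors. The logits are made-up numbers, and the function names are hypothetical; how the paper actually produces these scores (textual descriptors, cross-attention) is not modeled here.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def composition_probs(object_logits, attr_logits_given_object):
    """Return p(attribute, object | image) as a nested list joint[object][attribute].

    object_logits: one score per object.
    attr_logits_given_object: for each object, one score per attribute,
    i.e. the attribute prediction is conditioned on the object.
    """
    p_obj = softmax(object_logits)
    joint = []
    for o, logits in enumerate(attr_logits_given_object):
        p_attr = softmax(logits)                    # p(attribute | object, image)
        joint.append([p_obj[o] * p for p in p_attr])
    return joint

objects = ["zebra", "sky"]
attributes = ["striped", "cloudy"]
object_logits = [2.0, 0.5]                           # illustrative values only
attr_logits_given_object = [[3.0, -1.0],             # attribute scores given "zebra"
                            [-2.0, 2.5]]             # attribute scores given "sky"

joint = composition_probs(object_logits, attr_logits_given_object)
pairs = [(o, a) for o in range(len(objects)) for a in range(len(attributes))]
best_o, best_a = max(pairs, key=lambda oa: joint[oa[0]][oa[1]])
best_pair = (attributes[best_a], objects[best_o])
```

Because the attribute distribution is computed separately for each object, "striped" can receive a high score for "zebra" and a low one for "sky", which an independent attribute classifier could not express.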


#342
Backdooring Self-Supervised Contrastive Learning by Noisy Alignment

Tuo Chen · Jie Gui · Minjing Dong · Ju Jia · Lanting Fang · Jian liu

Self-supervised contrastive learning (CL) effectively learns transferable representations from unlabeled data containing images or image-text pairs but suffers vulnerability to data poisoning backdoor attacks (DPCLs). An adversary can inject poisoned images into pretraining datasets, causing compromised CL encoders to exhibit targeted misbehavior in downstream tasks. Existing DPCLs, however, achieve limited efficacy due to their dependence on fragile implicit co-occurrence between backdoor and target object and inadequate suppression of discriminative features in backdoored images. We propose Noisy Alignment (NA), a DPCL method that explicitly suppresses noise components in poisoned images. Inspired by powerful training-controllable CL attacks, we identify and extract the critical objective of noisy alignment, adapting it effectively into data-poisoning scenarios. Our method implements noisy alignment by strategically manipulating contrastive learning's random cropping mechanism, formulating this process as an image layout optimization problem with theoretically derived optimal parameters. The resulting method is simple yet effective, achieving state-of-the-art performance with +45.9% attack success rate improvement over existing DPCLs on ImageNet-100 while maintaining clean-data accuracy. Furthermore, Noisy Alignment demonstrates robustness against common backdoor defenses.


#343
Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition

Zhisheng Zhong · Chengyao Wang · Yuqi Liu · Senqiao Yang · Longxiang Tang · Yuechen Zhang · Jingyao Li · Tianyuan Qu · Yanwei Li · Yukang Chen · Shaozuo Yu · WU Sitong · Eric Lo · Shu Liu · Jiaya Jia

As Multi-modal Large Language Models (MLLMs) evolve, expanding beyond single-domain capabilities is essential to meet the demands for more versatile and efficient AI. However, previous omni-models have insufficiently explored speech, neglecting its integration with multi-modality. We introduce Lyra, an efficient MLLM that enhances multi-modal abilities, including advanced long speech comprehension, sound understanding, cross-modality efficiency, and seamless speech interaction. To achieve efficiency and speech-centric capabilities, Lyra employs three strategies: (1) leveraging existing open-source large models and a proposed multi-modality LoRA to reduce training costs and data requirements; (2) using a latent multi-modality regularizer and extractor to strengthen the relationship between speech and other modalities, thereby enhancing model performance; and (3) constructing a high-quality, extensive dataset that includes 1.5M multi-modal (language, vision, audio) data samples and 12K long speech samples, enabling Lyra to handle complex long speech inputs and achieve more robust omni-cognition. Compared to other omni-methods, Lyra achieves state-of-the-art performance on various vision-language, vision-speech, and speech-language benchmarks, while also using fewer computational resources and less training data. All code, data, and models will be available to the public.


#344
p-MoD: Building Mixture-of-Depths MLLMs via Progressive Ratio Decay

Jun Zhang · Desen Meng · Zhengming Zhang · Zhenpeng Huang · Tao Wu · Limin Wang

Despite the remarkable performance of multimodal large language models (MLLMs) across diverse tasks, the substantial training and inference costs impede their advancement. In this paper, we propose p-MoD, an efficient MLLM architecture that significantly reduces training and inference costs while maintaining model performance. The majority of computation in MLLMs stems from the overwhelming volume of vision tokens processed by the transformer-based LLM. Accordingly, we leverage the Mixture-of-Depths (MoD) mechanism, where each LLM layer selects essential vision tokens to process while skipping redundant ones. However, integrating MoD into MLLMs is non-trivial. To address the challenges of training and inference stability as well as limited training data, we adapt the MoD module with two novel designs: tanh-gated weight normalization (TanhNorm) and symmetric token reweighting (STRing). Moreover, we observe that vision tokens exhibit higher redundancy in deeper layers and thus design a progressive ratio decay (PRD) strategy, which gradually reduces the token retention ratio layer by layer, employing a shifted cosine schedule. This crucial design fully unleashes the potential of MoD, significantly boosting the efficiency and performance of our models. Extensive experiments on two baseline models across 15 benchmarks show that our model matches or even surpasses the performance of corresponding baselines, while requiring only 55.6% TFLOPs and 53.7% KV cache storage during inference, and 77.7% GPU hours during training.
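
To make the progressive ratio decay idea concrete, here is a small sketch of a layer-wise token retention schedule. The abstract only states that the ratio decreases layer by layer following a shifted cosine schedule; the exact formula, the endpoints `r_start` and `r_end`, and the function names below are assumptions chosen for illustration (a half-cosine shifted upward so that it ends at a nonzero ratio).

```python
import math

def retention_ratios(num_layers, r_start=1.0, r_end=0.1):
    """Hypothetical shifted cosine schedule for the token retention ratio.

    Layer 0 keeps a fraction r_start of the vision tokens and the last layer
    keeps r_end; in between the ratio follows half a cosine period.
    """
    if num_layers < 2:
        raise ValueError("need at least two layers to define a decay")
    ratios = []
    for layer in range(num_layers):
        t = layer / (num_layers - 1)                       # progress in [0, 1]
        cos_term = 0.5 * (1.0 + math.cos(math.pi * t))     # decays from 1 to 0
        ratios.append(r_end + (r_start - r_end) * cos_term)
    return ratios

def tokens_kept(num_tokens, ratios):
    """Number of vision tokens each layer would process (at least one)."""
    return [max(1, round(num_tokens * r)) for r in ratios]

ratios = retention_ratios(num_layers=32)
kept = tokens_kept(576, ratios)                            # 576 vision tokens is an example value
```

The schedule decays slowly in the first layers and faster in the middle of the network, in line with the observation that vision tokens are more redundant in deeper layers.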

In the field of AI security, the vulnerability of deep neural networks (DNNs) has garnered widespread attention. Specifically, the sensitivity of DNNs to adversarial examples (AEs) can lead to severe consequences: even small perturbations in input data can result in incorrect predictions. AEs demonstrate transferability across models; however, targeted attack success rates (TASRs) remain low due to significant differences in feature dimensions and decision boundaries. To enhance the transferability of targeted AEs, we propose a novel approach by introducing Inverse Target Gradient Competition (ITC) and Spatial Distance Stretching (SDS) in the optimization process. Specifically, we utilize a twin-network-like framework to generate both non-targeted and targeted AEs, introducing a new competition mechanism, ITC, in which non-targeted adversarial gradients are applied each epoch to hinder the optimization of targeted adversarial perturbations, thus enhancing robustness in targeted attacks. Additionally, a top-k SDS strategy is employed, guiding AEs to penetrate target class regions in the latent multi-dimensional space while globally distancing from multiple closest non-targeted regions, ultimately achieving optimal adversarial transferability. Compared with state-of-the-art competition-based attacks, our method demonstrates significant transferability advantages, with average transferable TASRs improved by 16.1% and 21.4% on mainstream CNNs and ViTs, respectively, while also achieving an unmatched breaking-through defense capability.


#346
FedDifRC: Unlocking the Potential of Text-to-Image Diffusion Models in Heterogeneous Federated Learning

Huan Wang · Haoran Li · Huaming Chen · Jun Yan · Jiahua Shi · Jun Shen

Federated learning aims at training models collaboratively across participants while protecting privacy. However, one major challenge for this paradigm is data heterogeneity, where biased data preferences across multiple clients harm the model's convergence and performance. In this paper, we first introduce powerful diffusion models into the federated learning paradigm and show that diffusion representations are effective steers during federated training. To explore the possibility of using diffusion representations in handling data heterogeneity, we propose a novel diffusion-inspired Federated paradigm with Diffusion Representation Collaboration, termed FedDifRC, leveraging meaningful guidance of diffusion models to mitigate data heterogeneity. The key idea is to construct text-driven diffusion contrasting and noise-driven diffusion regularization, aiming to provide abundant class-related semantic information and consistent convergence signals. On the one hand, we exploit the conditional feedback from the diffusion model for different text prompts to build a text-driven contrastive learning strategy. On the other hand, we introduce a noise-driven consistency regularization to align local instances with diffusion denoising representations, constraining the optimization region in the feature space. In addition, FedDifRC can be extended to a self-supervised scheme without relying on any labeled data. We also provide a theoretical analysis for FedDifRC to ensure convergence under non-convex objectives. The experiments on different scenarios validate the effectiveness of FedDifRC and the efficiency of crucial components.


#347
LoRA-FAIR: Federated LoRA Fine-Tuning with Aggregation and Initialization Refinement

Jieming Bian · Lei Wang · Letian Zhang · Jie Xu

Foundation models (FMs) achieve strong performance across diverse tasks with task-specific fine-tuning, yet full parameter fine-tuning is often computationally prohibitive for large models. Parameter-efficient fine-tuning (PEFT) methods like Low-Rank Adaptation (LoRA) reduce this cost by introducing low-rank matrices for tuning fewer parameters. While LoRA allows for efficient fine-tuning, it requires significant data for adaptation, making Federated Learning (FL) an appealing solution due to its privacy-preserving collaborative framework. However, combining LoRA with FL introduces two key challenges: the Server-Side Aggregation Bias, where server-side averaging of LoRA matrices diverges from the ideal global update, and the Client-Side Initialization Lag, emphasizing the need for consistent initialization across rounds. Existing approaches address these challenges individually, limiting their effectiveness. We propose LoRA-FAIR, a novel method that tackles both issues by introducing a correction term on the server, enhancing aggregation efficiency and accuracy. LoRA-FAIR maintains computational and communication efficiency, yielding superior performance over state-of-the-art methods. Experimental results on ViT and MLP-Mixer models across large-scale datasets demonstrate that LoRA-FAIR consistently achieves performance improvements in FL settings.
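
The server-side aggregation bias can be seen in a toy case: averaging the LoRA factors separately and then multiplying them is generally not the same as averaging the clients' full low-rank updates. The sketch below uses two hypothetical clients with rank-1 factors; the matrices and helper names are invented for illustration, and the correction term that LoRA-FAIR adds is not shown.

```python
def matmul(a, b):
    # Plain-Python matrix product for small nested-list matrices.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def mean_matrices(mats):
    # Element-wise mean of equally shaped matrices.
    n = len(mats)
    rows, cols = len(mats[0]), len(mats[0][0])
    return [[sum(m[i][j] for m in mats) / n for j in range(cols)] for i in range(rows)]

# Two hypothetical clients, each holding rank-1 LoRA factors (update_k = B_k @ A_k).
B = [[[1.0], [0.0]], [[0.0], [1.0]]]   # each B_k is 2x1
A = [[[1.0, 0.0]], [[0.0, 1.0]]]       # each A_k is 1x2

# Ideal global update: the average of the clients' full updates.
ideal = mean_matrices([matmul(b, a) for b, a in zip(B, A)])

# Naive server aggregation: average B and A separately, then multiply.
naive = matmul(mean_matrices(B), mean_matrices(A))

print("ideal:", ideal)
print("naive:", naive)
```

Here the ideal update is diagonal while the naive product spreads mass over every entry, so the two disagree even in this two-client case.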


#348
Diversity-Enhanced Distribution Alignment for Dataset Distillation

Hongcheng Li · Yucan Zhou · Xiaoyan Gu · Bo Li · Weiping Wang

Dataset distillation, which compresses large-scale datasets into compact synthetic representations (i.e., distilled datasets), has become crucial for the efficient training of modern deep learning architectures. While existing large-scale dataset distillation methods leverage a pre-trained model through batch normalization statistics alignment, they neglect the essential role of covariance matrices in preserving inter-feature correlations, resulting in reduced diversity in the distilled datasets. In this paper, we propose a simple yet effective approach, Diversity-Enhanced Distribution Alignment (DEDA), which enhances the diversity of distilled data by leveraging inter-feature relationships. Our method first establishes Gaussian distribution alignment by matching the means and covariances of each class in the original dataset with those of the distilled dataset in the feature space of a pre-trained model. Since features within the last layer of a pre-trained model are often highly similar within each class, aligning distributions in this layer cannot obtain diversified distilled data, resulting in gradient starvation during downstream training tasks. To overcome this limitation, we introduce a regularizer that constrains the covariance matrix of the distilled data in the last layer to maximize diagonal elements while minimizing non-diagonal elements. Extensive evaluations across CIFAR-10/100, Tiny-ImageNet, and ImageNet-1K demonstrate state-of-the-art performance without additional computational overhead.
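
One way to picture the covariance regularizer is as a penalty that rewards large per-feature variances and penalizes correlations between features. The sketch below is an assumed form (negative trace plus squared off-diagonal terms) chosen for illustration; the abstract does not state the exact loss, and the function names and `off_weight` are hypothetical.

```python
def covariance(features):
    # Sample covariance matrix of row-wise feature vectors (n samples, d dims).
    n, d = len(features), len(features[0])
    mean = [sum(row[j] for row in features) / n for j in range(d)]
    centered = [[row[j] - mean[j] for j in range(d)] for row in features]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def diversity_regularizer(features, off_weight=1.0):
    """Assumed regularizer: lower is better.

    Rewards large diagonal entries (variance) and penalizes squared
    off-diagonal entries (inter-feature correlation).
    """
    cov = covariance(features)
    d = len(cov)
    diag = sum(cov[i][i] for i in range(d))
    off = sum(cov[i][j] ** 2 for i in range(d) for j in range(d) if i != j)
    return -diag + off_weight * off

# Features that move together versus features that vary independently.
correlated = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
decorrelated = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]

reg_correlated = diversity_regularizer(correlated)
reg_decorrelated = diversity_regularizer(decorrelated)
```

Under this assumed form, the decorrelated set scores lower than the perfectly correlated one, which is the direction the regularizer is meant to push the distilled features.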

Since real-world multi-label data often exhibit significant label imbalance, long-tailed multi-label image classification has emerged as a prominent research area in computer vision. Traditionally, it is considered that deep neural networks' classifiers are vulnerable to long-tailed distributions, whereas the feature extraction backbone remains relatively robust. However, our analysis from the feature learning perspective reveals that the backbone struggles to maintain high sensitivity to sample-scarce categories but retains the ability to localize specific areas effectively. Based on this observation, we propose a new model for long-tailed multi-label image classification named category-specific selective feature enhancement (CSSFE). First, it utilizes the retained localization capability of the backbone to capture label-dependent class activation maps. Then, a progressive attention enhancement mechanism, updating from head to medium to tail categories, is introduced to address the low-confidence issue in medium and tail categories. Finally, visual features are extracted according to the optimized class activation maps and combined with semantic information to perform the classification task. Extensive experiments on two benchmark datasets highlight our findings' generalizability and the proposed CSSFE's superior performance.


#350
Highlight
Registration beyond Points: General Affine Subspace Alignment via Geodesic Distance on Grassmann Manifold

Jaeho Shin · Hyeonjae Gil · Junwoo Jang · Maani Ghaffari · Ayoung Kim

Affine Grassmannian has been favored for expressing proximity between lines and planes due to its theoretical exactness in measuring distances among features. Despite this advantage, the existing method can only measure the proximity without yielding the distance as an explicit function of rigid body transformation. Thus, an optimizable distance function on the manifold has remained underdeveloped, stifling its application in registration problems. This paper is the first to explicitly derive an optimizable cost function between two Grassmannian features with respect to rigid body transformation ($\mathbf{R}$ and $\mathbf{t}$). Specifically, we present a rigorous mathematical proof demonstrating that the bases of high-dimensional linear subspaces can serve as an explicit representation of the cost. Finally, we propose an optimizable cost function based on the transformed bases that can be applied to the registration problem of any affine subspace. Compared to vector parameter-based approaches, our method is able to find a globally optimal solution by directly minimizing the geodesic distance which is agnostic to representation ambiguity. The resulting cost function and its extension to the inlier-set maximizing Branch-and-Bound (BnB) solver have been demonstrated to improve the convergence of existing solutions or outperform them in various computer vision tasks. The code will be made available after the review process.


#351
Highlight
Mind the Gap: Preserving and Compensating for the Modality Gap in CLIP-Based Continual Learning

Linlan Huang · Xusheng Cao · Haori Lu · Yifan Meng · Fei Yang · Xialei Liu

Continual learning aims to enable models to learn sequentially from continuously incoming data while retaining performance on previously learned tasks. With the Contrastive Language-Image Pre-trained model (CLIP) exhibiting strong capabilities across various downstream tasks, there has been growing interest in leveraging CLIP for continual learning in such scenarios. Most existing works overlook the inherent modality gap in CLIP, a key factor in its generalization and adaptability. In this paper, we analyze the variations in the modality gap during the fine-tuning of vision-language pre-trained models. Our observations reveal that the modality gap effectively reflects the extent to which pre-trained knowledge is preserved. Based on these insights, we propose a simple yet effective method that improves CLIP’s performance in class-incremental learning. Our approach leverages modality gap preservation to mitigate forgetting and modality gap compensation to enhance the capacity for new data, introducing a novel modality-gap-based perspective for continual learning. Extensive experiments on multiple benchmarks demonstrate that our method outperforms existing approaches without requiring additional replay data.


#352
Sibai: A Few-Shot Meta-Classifier for Poisoning Detection in Federated Learning

Melanie Götz · Torsten Krauß · Alexandra Dmitrienko

Federated Learning (FL) enables collaborative machine learning across decentralized clients without sharing raw data, which offers enhanced privacy and improved performance. However, FL is vulnerable to poisoning attacks, which compromise model integrity through both untargeted performance degradation and targeted backdoor attacks. Detecting backdoors in FL is challenging due to their stealthy nature and variability in local datasets. Existing defenses struggle against adaptive adversaries and with distinguishing between poisoning and genuine dataset anomalies. This paper introduces the Siamese Backdoor Inspector (Sibai), a novel meta-classifier-based poisoning defense for FL. Leveraging the staple few-shot learning technique of Siamese networks, Sibai effectively detects malicious contributions in various scenarios, including settings with strong variations between clients' datasets and encounters with adaptive adversaries. Sibai achieves high detection rates, prevents backdoors, minimizes performance impact, and outperforms eight recent defenses regarding F1 score, poisoning prevention, and consistency across complex scenarios.


#353
Generalization-Preserved Learning: Closing the Backdoor to Catastrophic Forgetting in Continual Deepfake Detection

Xueyi Zhang · Peiyin Zhu · Chengwei Zhang · Zhiyuan Yan · Jikang Cheng · Mingrui Lao · Siqi Cai · Yanming Guo

Existing continual deepfake detection methods typically treat stability (retaining previously learned forgery knowledge) and plasticity (adapting to novel forgeries) as conflicting properties, emphasizing an inherent trade-off between them, while regarding generalization to unseen forgeries as secondary. In contrast, we reframe the problem: stability and plasticity can coexist and be jointly improved through the model’s inherent generalization. Specifically, we propose Generalization-Preserved Learning (GPL), a novel framework consisting of two key components: (1) Hyperbolic Visual Alignment, which introduces learnable watermarks to align incremental data with the base set in hyperbolic space, alleviating inter-task distribution shifts; (2) Generalized Gradient Projection, which prevents parameter updates that conflict with generalization constraints, ensuring new knowledge learning does not interfere with previously acquired knowledge. Notably, GPL requires neither backbone retraining nor historical data storage. Experiments conducted on four mainstream datasets (FF++, Celeb-DF v2, DFD, and DFDCP) demonstrate that GPL achieves an accuracy of 92.14%, outperforming replay-based state-of-the-art methods by 2.15%, while reducing forgetting by 2.66%. Moreover, GPL achieves an 18.38% improvement on unseen forgeries using only 1% of baseline parameters, thus presenting an efficient adaptation to continuously evolving forgery techniques.

Domain incremental object detection in remote sensing addresses the challenge of adapting to continuously emerging domains with distinct characteristics. Unlike natural images, remote sensing data vary significantly due to differences in sensors, altitudes, and geographic locations, leading to data distribution shifts and feature misalignments. These challenges make it difficult for models to generalize across domains while retaining knowledge from previous tasks, requiring effective adaptation strategies to mitigate catastrophic forgetting. To address these challenges, we propose the Dual Domain Control via Active Learning (Active-DDC) method, which integrates active learning strategies to handle data distribution and model feature shifts. The first component, the Data-based Active Learning Example Replay (ALER) module, combines a high-information sample selection strategy from active learning with the characteristic extreme foreground-background ratio in remote sensing images, enabling the selection of highly representative samples for storage in a memory bank. The second component, the Query-based Active Domain Shift Control (ADSC) module, leverages the query vector, a key element for DETR-based detectors, to implement query active preselection and optimal transport matching, thus facilitating effective cross-domain knowledge transfer. Our method achieves optimal performance in domain incremental tasks across four remote sensing datasets, and ablation studies further validate the effectiveness of both components.


#355
Gradient Extrapolation for Debiased Representation Learning

Ihab Asaad · Maha Shadaydeh · Joachim Denzler

Machine learning classification models trained with empirical risk minimization (ERM) often inadvertently rely on spurious correlations. When absent in the test data, these unintended associations between non-target attributes and target labels lead to poor generalization. This paper addresses this problem from a model optimization perspective and proposes a novel method, Gradient Extrapolation for Debiased Representation Learning (GERNE), designed to learn debiased representations in both known and unknown attribute training cases. GERNE uses two distinct batches with different amounts of spurious correlations to define the target gradient as the linear extrapolation of two gradients computed from each batch’s loss. It is demonstrated that the extrapolated gradient, if directed toward the gradient of the batch with the smaller amount of spurious correlation, can guide the training process toward learning a debiased model. GERNE can serve as a general framework for debiasing, with methods such as ERM, reweighting, and resampling shown to be special cases. The theoretical upper and lower bounds of the extrapolation factor are derived to ensure convergence. By adjusting this factor, GERNE can be adapted to maximize the Group-Balanced Accuracy (GBA) or the Worst-Group Accuracy. The proposed approach is validated on five vision benchmarks and one NLP benchmark, demonstrating competitive and often superior performance compared to state-of-the-art baseline methods.
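
The linear extrapolation of two batch gradients can be sketched in a few lines. The formula below, which steps from the biased-batch gradient through and past the less-biased one, is an assumed parameterization for illustration; the name `beta` and the exact form are not taken from the paper, and the bounds on the extrapolation factor mentioned in the abstract are not modeled.

```python
def extrapolate(g_biased, g_less_biased, beta):
    """Assumed target gradient: g_lb + beta * (g_lb - g_b).

    beta = -1 recovers the biased-batch gradient, beta = 0 the
    less-biased one, and beta > 0 extrapolates beyond it.
    """
    return [gl + beta * (gl - gb) for gb, gl in zip(g_biased, g_less_biased)]

# Toy two-parameter gradients from a more biased and a less biased batch.
g_b = [1.0, 0.0]
g_lb = [0.0, 1.0]

same_as_biased = extrapolate(g_b, g_lb, beta=-1.0)
same_as_less_biased = extrapolate(g_b, g_lb, beta=0.0)
extrapolated = extrapolate(g_b, g_lb, beta=1.0)
```

In this toy setup a positive factor pushes the update further in the direction that separates the less biased batch from the more biased one, which is the behavior the abstract describes as guiding training toward a debiased model.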


#356
VisNumBench: Evaluating Number Sense of Multimodal Large Language Models

Tengjin Weng · Jingyi Wang · Wenhao Jiang · Zhong Ming

Can Multimodal Large Language Models (MLLMs) develop an intuitive number sense similar to humans? Targeting this problem, we introduce Visual Number Benchmark (VisNumBench) to evaluate the number sense abilities of MLLMs across a wide range of visual numerical tasks. VisNumBench consists of about 1,900 multiple-choice question-answer pairs derived from both synthetic and real-world visual data, covering seven visual numerical attributes and four types of visual numerical estimation tasks. Our experiments on VisNumBench led to the following key findings: (i) The 17 MLLMs we tested, including open-source models such as Qwen2.5-VL and InternVL2.5 as well as proprietary models like GPT-4o and Gemini 2.0 Flash, perform significantly below human levels in number sense-related tasks. (ii) Multimodal mathematical models and multimodal chain-of-thought (CoT) models did not exhibit significant improvements in number sense abilities. (iii) Stronger MLLMs with larger parameter sizes and broader general abilities demonstrate modest gains in number sense abilities. We believe VisNumBench will serve as a valuable resource for the research community, encouraging further advancements in enhancing LVLMs' number sense abilities. All benchmark resources, including code and datasets, will be publicly released upon the paper’s acceptance.


#357
FedAGC: Federated Continual Learning with Asymmetric Gradient Correction

Chengchao Zhang · Fanhua Shang · Hongying Liu · Liang Wan · Wei Feng

Federated Continual Learning (FCL) has emerged as a prominent distributed learning paradigm and aims at addressing model learning challenges in both federated and continual learning settings. Efficient personalization in FCL remains a major challenge, as it must handle not only conflicts between old and new knowledge within parallel task streams but also heterogeneous knowledge conflicts from different clients. Recent approaches attempt to mitigate these issues through gradient correction. However, they often overlook the combined impact of gradient magnitude and direction, leading to unsatisfactory gradient solutions. To address these issues, we propose a novel federated continual learning method (called FedAGC) with asymmetric gradient correction, which performs memory rehearsal using representative samples selected via a centroid-based approach from historical tasks. By formulating the problem as a multi-objective optimization paradigm, FedAGC derives more effective gradients while incorporating group-level personalization to facilitate useful knowledge integration and irrelevant knowledge isolation, effectively mitigating both temporal and spatial catastrophic forgetting. Extensive experiments confirm the effectiveness of FedAGC.


#358
Highlight
BUFFER-X: Towards Zero-Shot Point Cloud Registration in Diverse Scenes

Minkyun Seo · Hyungtae Lim · Kanghee Lee · Luca Carlone · Jaesik Park

Recent advances in deep learning-based point cloud registration have improved generalization, yet most methods still require retraining or manual parameter tuning for each new environment. In this paper, we identify three key factors limiting generalization: (a) reliance on environment-specific voxel size and search radius, (b) poor out-of-domain robustness of learning-based keypoint detectors, and (c) raw coordinate usage, which exacerbates scale discrepancies. To address these issues, we present a zero-shot registration pipeline called BUFFER-X by (a) adaptively determining voxel size/search radii, (b) using farthest point sampling to bypass learned detectors, and (c) leveraging patch-wise scale normalization for consistent coordinate bounds. In particular, we present a multi-scale patch-based descriptor generation and a hierarchical inlier search across scales to improve robustness in diverse scenes. We also propose a novel generalizability benchmark using 11 datasets that cover various indoor/outdoor scenarios and sensor modalities, demonstrating that BUFFER-X achieves substantial generalization without prior information or manual parameter tuning for the test datasets. Our code will be made publicly available.
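
Farthest point sampling, which the pipeline uses in place of a learned keypoint detector, is a standard greedy procedure: repeatedly pick the point farthest from everything already selected. The sketch below is a generic textbook version on a toy point list, not the BUFFER-X implementation; the function name and arguments are illustrative.

```python
import math

def farthest_point_sampling(points, k, start=0):
    """Return indices of up to k points chosen by greedy farthest point sampling."""
    k = min(k, len(points))
    selected = [start]
    # Distance from every point to its nearest already-selected point.
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(selected) < k:
        # Pick the point that is farthest from the current selection.
        nxt = max(range(len(points)), key=lambda i: min_dist[i])
        selected.append(nxt)
        # Update nearest-selected distances with the newly added point.
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[nxt]))
    return selected

# Toy 3D points lying on a line, so the result is easy to check by hand.
cloud = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (10.0, 0.0, 0.0),
         (0.5, 0.0, 0.0), (5.0, 0.0, 0.0)]
keypoints = farthest_point_sampling(cloud, k=3)
```

Because the selection depends only on geometry, the same routine applies unchanged to any sensor or scene, which is consistent with the zero-shot goal stated above.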

With the rapid growth of deep learning, there is an increasing availability of open-source models for various tasks. However, single fine-tuned models often fall short of meeting the diverse needs of users. Model merging has thus emerged as an efficient method to integrate the capabilities of existing models into a unified model. Nevertheless, existing model merging methods face challenging trade-offs between performance and deployment costs, primarily due to task interference. For the first time, we reveal that task interference is evident in the frequency domain of model parameters, yet current efforts only focus on spatial domain solutions, which are largely ineffective in addressing frequency domain interference. To mitigate the impact of frequency domain interference, we propose FR-Merging, an innovative method that effectively filters harmful frequency domain interference on the backbone with minimal computational overhead. Since performance loss is inevitable with cost-free methods, we propose a lightweight task-specific expert module that dynamically compensates for information loss during merging. This proposed framework, FREE-Merging (FR-Merging with experts), strikes a balanced trade-off between training cost, inference latency, storage requirements, and performance. We demonstrate the effectiveness of both FR-Merging and FREE-Merging on multiple tasks across CV, NLP, and Multi-Modal domains and show that they can be flexibly adapted to specific needs.


#360
RANKCLIP: Ranking-Consistent Language-Image Pretraining

Yiming Zhang · Zhuokai Zhao · Zhaorun Chen · Zhili Feng · Zenghui Ding · Yining Sun

Self-supervised contrastive learning models, such as CLIP, have set new benchmarks for vision-language models in many downstream tasks. However, their dependency on rigid one-to-one mappings overlooks the complex and often multifaceted relationships between and within texts and images. To this end, we introduce RankCLIP, a novel pretraining method that extends beyond the rigid one-to-one matching framework of CLIP and its variants. By extending the traditional pair-wise loss to list-wise, and leveraging both in-modal and cross-modal ranking consistency, RankCLIP improves the alignment process, enabling it to capture the nuanced many-to-many relationships between and within each modality. Through comprehensive experiments, we demonstrate the effectiveness of RankCLIP in various downstream tasks, notably achieving significant gains in zero-shot classifications over state-of-the-art methods, underscoring the importance of this enhanced learning process.


#361
Enhancing Adversarial Transferability by Balancing Exploration and Exploitation with Gradient-Guided Sampling

Zenghao Niu · Weicheng Xie · Siyang Song · Zitong YU · Feng Liu · Linlin Shen

Adversarial attacks present a critical challenge to deep neural networks' robustness, particularly in transfer scenarios across different model architectures. However, the transferability of adversarial attacks faces a fundamental dilemma between Exploitation (maximizing attack potency) and Exploration (enhancing cross-model generalization). Traditional momentum-based methods over-prioritize Exploitation, i.e., higher loss maxima for attack potency but weakened generalization (narrow loss surface). Conversely, recent methods with inner-iteration sampling over-prioritize Exploration, i.e., flatter loss surfaces for cross-model generalization but weakened attack potency (suboptimal local maxima). To resolve this dilemma, we propose a simple yet effective Gradient-Guided Sampling (GGS), which harmonizes both objectives through guiding sampling along the gradient ascent direction to improve both sampling efficiency and stability. Specifically, based on MI-FGSM, GGS introduces inner-iteration random sampling and guides the sampling direction using the gradient from the previous inner-iteration (the sampling's magnitude is determined by a random distribution). This mechanism encourages adversarial examples to reside in balanced regions with both flatness for cross-model generalization and higher local maxima for strong attack potency. Comprehensive experiments across multiple DNN architectures and multimodal large language models (MLLMs) demonstrate the superiority of our method over state-of-the-art transfer attacks. Code will be made publicly available.


#362
An Efficient Post-hoc Framework for Reducing Task Discrepancy of Text Encoders for Composed Image Retrieval

Jaeseok Byun · Seokhyeon Jeong · Wonjae Kim · Sanghyuk Chun · Taesup Moon

Composed Image Retrieval (CIR) aims to retrieve a target image based on a reference image and conditioning text, enabling controllable image searches. The mainstream Zero-Shot (ZS) CIR methods bypass the need for expensive training CIR triplets by projecting image embeddings into the text token embedding space, forming a composed query for retrieval. However, we highlight an inherent limitation in these projection-based CIR methods: a task discrepancy of text encoders between the original pre-training task of the encoders (text $\leftrightarrow$ image) and the target CIR task (image + text $\leftrightarrow$ image), which potentially negatively impacts CIR performance. To reduce such a discrepancy, a naive solution would be to train both image and text encoders with CIR triplets in a supervised manner. Instead, we introduce Reducing Task Discrepancy of Text Encoders (RTD), an efficient text-only post-hoc framework that complements projection-based CIR methods. We devise a novel target-anchored text contrastive learning designed to enhance the capability of the text encoder for CIR. We also propose two key enhancements: (1) a hard negative-based refined batch sampling strategy and (2) a refined concatenation scheme to further mitigate training-inference discrepancy. Integrating RTD into state-of-the-art projection-based methods achieves performance comparable to, or even surpassing, resource-intensive state-of-the-art synthetic CIR triplet-based approaches with only 23 minutes of additional training on 4 A100 GPUs, up to $100\times$ faster in training. Our code will be available upon acceptance.


#363
Robin3D: Improving 3D Large Language Model via Robust Instruction Tuning

Weitai Kang · Haifeng Huang · Yuzhang Shang · Mubarak Shah · Yan Yan

Recent advancements in 3D Large Language Models (3DLLMs) show their potential to build general-purpose agents in the 3D real world, yet challenges remain due to the lack of high-quality robust instruction-following data, leading to limited discriminative power and generalization of 3DLLMs. In this paper, we introduce Robin3D, a powerful 3DLLM trained on large-scale instruction-following data generated by our novel data engine, the Robust Instruction Generation (RIG) engine. RIG generates two key types of instruction data: (1) Adversarial Instruction-following data, which features mixed negative and positive samples to enhance the model's discriminative understanding, and (2) Diverse Instruction-following data, which contains various instruction styles to enhance the model's generalization. As a result, we construct 1 million instruction-following samples, consisting of 344K Adversarial samples, 508K Diverse samples, and 165K benchmark training set samples. To better handle these complex instructions, Robin3D further integrates an improved vision projector and enhanced sequence organization. Notably, we achieve a 7.8% improvement in the grounding task (Multi3DRefer) and a 6.9% improvement in the captioning task (Scan2Cap).

Federated Learning (FL) enables collaborative global model training without data sharing but faces critical challenges from privacy leakage and Byzantine attacks. Existing privacy-preserving robust FL frameworks suffer from three main limitations: high computational costs, restricted RAR usage, and inadequate handling of data heterogeneity. To address these limitations, we propose the FLSeg framework, which leverages Segment Exchange and Segment Aggregation to avoid excessive encryption computations while allowing unrestricted use of any RAR. Additionally, a regularization term in local training balances personalization with global model performance, effectively adapting to heterogeneous data. Our theoretical and experimental analyses demonstrate FLSeg’s semi-honest security and computational efficiency. FLSeg achieves client and server time complexities of $O(\ell)$ and $O(n\ell)$, with empirical results showing significantly reduced computational times, e.g., 233 ms for clients and 78 ms per client on the server, compared with 1696 ms and 181 ms for ACORN (USENIX 23). Extensive experiments confirm FLSeg’s robustness across diverse heterogeneous and adversarial scenarios, e.g., achieving 64.59% accuracy on non-IID CIFAR-10 with 20% Min-Max attackers, compared with 48.21% for ACORN.


#365
ZIUM: Zero-Shot Intent-Aware Adversarial Attack on Unlearned Models

Hyun Jun Yook · Ga San Jhun · Cho Hyun · Min Jeon · Donghyun Kim · Tae Hyung Kim · Youn Lee

Machine unlearning (MU) removes specific data points or concepts from deep learning models to enhance privacy and prevent sensitive content generation. Adversarial prompts can exploit unlearned models to generate content containing removed concepts, posing a significant security risk. However, existing adversarial attack methods still face challenges in generating content that aligns with an attacker’s intent while incurring high computational costs to identify successful prompts. To address these challenges, we propose ZIUM, a Zero-shot Intent-aware adversarial attack on Unlearned Models, which enables the flexible customization of target attack images to reflect an attacker’s intent. Additionally, ZIUM supports zero-shot adversarial attacks without requiring further optimization for previously attacked unlearned concepts. The evaluation across various MU scenarios demonstrated ZIUM's effectiveness in successfully customizing content based on user-intent prompts while achieving a superior attack success rate compared to existing methods. Moreover, its zero-shot adversarial attack significantly reduces the attack time for previously attacked unlearned concepts.


#366
Federated Prompt-Tuning with Heterogeneous and Incomplete Multimodal Client Data

Hang Phung · Manh Nguyen · Thanh Huynh · Quoc Viet Hung Nguyen · Trong Nghia Hoang · Phi Le Nguyen

This paper develops a generalized federated prompt-tuning framework under practical scenarios where local datasets are multi-modal and have different distributional patterns of missing features at the input level. The proposed framework helps bridge the gap between federated learning and multi-modal prompt-tuning, which have previously focused on either uni-modal or centralized data. A key challenge in bridging this gap is the inherent lack of semantic alignment between prompt instructions that encode the same distributional patterns of missing data across different clients. To address this challenge, our proposed framework introduces specific client-tuning and server-aggregation designs that learn to simultaneously optimize, align, and aggregate prompt-tuning instructions across clients and data modalities, enabling them to complement one another and be combined effectively. A thorough evaluation of our framework on a variety of multimodal benchmark datasets demonstrates consistent and significant performance improvement over existing state-of-the-art (SOTA) baselines.


#367
Learning Interpretable Queries for Explainable Image Classification with Information Pursuit

Stefan Kolek · Aditya Chattopadhyay · Kwan Ho Ryan Chan · Hector Andrade Loarca · Gitta Kutyniok · Rene Vidal

Building image classification models that are both highly accurate and interpretable remains a challenge in computer vision. Information Pursuit (IP) is an information-theoretic framework for interpretable-by-design sequential prediction. Given a set of task-relevant and semantic data queries, IP selects a sequence of queries in order of information gain and updates the posterior at each step based on the gathered query-answer pairs. To carry out IP, previous methods construct hand-crafted dictionaries of potential data queries, curated either by a domain expert or by prompting large language models. However, in practice, such hand-crafted dictionaries are limited by the expertise of the curator and the heuristics of prompt engineering, resulting in a gap between the predictive performance of IP versus non-interpretable black-box predictors. In this work, we propose to parameterize the IP queries as a learnable dictionary defined in the latent space of vision-language models such as CLIP. Drawing inspiration from sparse dictionary learning, we propose an alternating optimization algorithm that iterates between solving IP's optimization problem for a fixed query dictionary and optimizing the dictionary to maximize classification accuracy. Empirically, our experiments show that our method learns a query dictionary that reduces the accuracy gap between explainable image classification with IP and black-box methods, while preserving interpretability.
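The selection loop described in this abstract (pick the query with the highest information gain, observe its answer, update the posterior) can be sketched on a toy discrete problem. This is only an illustration under stated assumptions: the likelihood table, the binary answers, and all function names are hypothetical, and the sketch does not attempt the learned CLIP-space dictionary that the paper proposes.

```python
import math

def entropy(p):
    """Shannon entropy (in bits) of a discrete distribution."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def posterior_update(prior, likelihood_row, answer):
    """Bayes update of the class posterior after one binary answer.

    likelihood_row[c] is the assumed P(answer = 1 | class c) for this query.
    """
    unnorm = [
        p * (l if answer == 1 else 1.0 - l)
        for p, l in zip(prior, likelihood_row)
    ]
    z = sum(unnorm)
    return [u / z for u in unnorm]

def information_gain(prior, likelihood_row):
    """Expected entropy reduction from asking one binary query."""
    gain = entropy(prior)
    for answer in (0, 1):
        p_answer = sum(
            p * (l if answer == 1 else 1.0 - l)
            for p, l in zip(prior, likelihood_row)
        )
        if p_answer > 0:
            post = posterior_update(prior, likelihood_row, answer)
            gain -= p_answer * entropy(post)
    return gain

def information_pursuit(prior, likelihoods, answers, max_queries):
    """Greedily ask the most informative unasked query, then update."""
    posterior, asked = list(prior), []
    for _ in range(max_queries):
        remaining = [q for q in range(len(likelihoods)) if q not in asked]
        if not remaining:
            break
        best = max(remaining,
                   key=lambda q: information_gain(posterior, likelihoods[q]))
        posterior = posterior_update(posterior, likelihoods[best], answers[best])
        asked.append(best)
    return asked, posterior

# Toy setup (hypothetical numbers): 3 classes, 3 binary queries.
likelihoods = [
    [0.9, 0.1, 0.1],  # query 0 separates class 0 from the rest
    [0.5, 0.5, 0.5],  # query 1 carries no information
    [0.2, 0.8, 0.2],  # query 2 separates class 1, less sharply
]
prior = [1 / 3, 1 / 3, 1 / 3]
answers = [1, 0, 0]  # observed answers for one test sample
asked, posterior = information_pursuit(prior, likelihoods, answers, max_queries=2)
```

The sequence in `asked` doubles as the explanation: it lists which queries were consulted, in order, before the prediction was made.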

Dense visual prediction tasks, such as detection and segmentation, are crucial for time-critical applications (e.g., autonomous driving and video surveillance). While deep models achieve strong performance, their efficiency remains a challenge. Knowledge distillation (KD) is an effective model compression technique, but existing feature-based KD methods rely on static, teacher-driven feature selection, failing to adapt to the student's evolving learning state or leverage dynamic student-teacher interactions. To address these limitations, we propose Adaptive student-teacher Cooperative Attention Masking for Knowledge Distillation (ACAM-KD), which introduces two key components: (1) Student-Teacher Cross-Attention Feature Fusion (STCA-FF), which adaptively integrates features from both models for a more interactive distillation process, and (2) Adaptive Spatial-Channel Masking (ASCM), which dynamically generates importance masks to enhance both spatial and channel-wise feature selection. Unlike conventional KD methods, ACAM-KD adapts to the student's evolving needs throughout the entire distillation process. Extensive experiments on multiple benchmarks validate its effectiveness. For instance, on COCO2017, ACAM-KD improves object detection performance by up to 1.4 mAP over the state-of-the-art when distilling a ResNet-50 student from a ResNet-101 teacher. For semantic segmentation on Cityscapes, it boosts mIoU by 3.09 over the baseline with DeepLabV3-MobileNetV2 as the student model.


#369
Find a Scapegoat: Poisoning Membership Inference Attack and Defense to Federated Learning

Wenjin Mo · Zhiyuan Li · Minghong Fang · Mingwei Fang

Federated learning (FL) allows multiple clients to collaboratively train a global machine learning model with coordination from a central server, without needing to share their raw data. This approach is particularly appealing in the era of privacy regulations like the GDPR, leading many prominent companies to adopt it. However, FL's distributed nature makes it susceptible to poisoning attacks, where malicious clients, controlled by an attacker, send harmful data to compromise the model. Most existing poisoning attacks in FL aim to degrade the model’s integrity, such as reducing its accuracy, with limited attention to privacy concerns from these attacks. In this study, we introduce FedPoisonMIA, a novel poisoning membership inference attack targeting FL. FedPoisonMIA involves malicious clients crafting local model updates to infer membership information. Additionally, we propose a robust defense mechanism to mitigate the impact of FedPoisonMIA attacks. Extensive experiments across various datasets demonstrate the attack's effectiveness, while our defense approach reduces its impact to a degree.
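For readers unfamiliar with the setting, the server-side aggregation that poisoning attacks target can be sketched with federated averaging (FedAvg), a standard aggregation rule. The abstract does not state which rule FedPoisonMIA assumes, so FedAvg and every value below are assumptions for illustration only.

```python
def fedavg(client_updates, client_sizes):
    """Average client weight vectors, weighted by local dataset size."""
    total = sum(client_sizes)
    dim = len(client_updates[0])
    return [
        sum(u[d] * n for u, n in zip(client_updates, client_sizes)) / total
        for d in range(dim)
    ]

# Three honest clients with similar local models (hypothetical values).
honest = [[1.0, 1.0], [1.2, 0.8], [0.8, 1.2]]
sizes = [100, 100, 100]
clean_global = fedavg(honest, sizes)

# Replacing one client with a crafted update shifts the global model,
# even though the server never sees any raw data.
poisoned_global = fedavg(honest[:2] + [[10.0, -10.0]], sizes)
```

Because the server only averages what it receives, a single malicious client can steer the aggregate, which is the lever that poisoning attacks rely on.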

OpenAI's CLIP models, released in early 2021, have long been the only viable choice for the research community in building multimodal foundation models. This dominance has only recently been challenged by a few alternatives like SigLIP. However, to the best of our knowledge, all these solutions are still not fully open, e.g., their training data remains proprietary and/or their training frameworks are unreleased. In this paper, we address this challenge by introducing a family of fully open vision encoders that are as competitive as, or even surpass, OpenAI's CLIP in building multimodal foundation models like LLaVA. Moreover, due to their fully open nature, we offer these vision encoders in a wide range of sizes, from as few as 5.9 million parameters to 632.1 million parameters. We further demonstrate that these variable-sized vision encoders provide significant flexibility: larger models deliver enhanced multimodal performance, while smaller models enable efficient and portable multimodal foundation models suitable for edge device deployment. The training data, code and trained models will be released soon.


#371
Integrating Visual Interpretation and Linguistic Reasoning for Geometric Problem Solving

Zixian Guo · Ming Liu · Qilong Wang · Zhilong Ji · Jinfeng Bai · Lei Zhang · Wangmeng Zuo

In addressing geometric problems, the reasoning capabilities demonstrated by existing large vision-language models (LVLMs) are significantly inferior to those of their corresponding large language model (LLM) backbones. We attribute this issue to the inadequate alignment and joint comprehension of visual and linguistic features. Furthermore, the imprecise information extracted from images by LVLMs further impairs their reasoning abilities. To this end, we propose a dual-mind architecture that can capture detailed visual information from images and facilitate effective linguistic reasoning through joint optimization. Different from the existing supervised fine-tuning pipeline, which makes LVLMs conduct problem-solving directly, we let the LVLMs interpret the visual content first. LVLMs extract key elements like precise geometric primitives and spatial relationships as natural language conditions. Then, an LLM serves as a linguistic reasoner for deriving the answer through step-by-step reasoning. The visual interpreting module and the linguistic reasoning module can effectively collaborate through an outcome-rewarded joint tuning strategy. By solving the multimodal question using the dual-mind of LVLM and LLM, we achieve significant improvements in visually intensive geometric math problems. This work advances multimodal reasoning with a new coupled architecture with explicit visual perception and linguistic reasoning, which can overcome the limitations of current LVLMs. The code will be made publicly available.


#372
SAFER: Sharpness Aware layer-selective Finetuning for Enhanced Robustness in vision transformers

Bhavna Gopal · Huanrui Yang · Mark Horton · Yiran Chen

Vision transformers (ViTs) have become essential backbones in advanced computer vision applications and multi-modal foundation models. Despite their strengths, ViTs remain vulnerable to adversarial perturbations, comparable to or even exceeding the vulnerability of convolutional neural networks (CNNs). Furthermore, the large parameter count and complex architecture of ViTs make them particularly prone to adversarial overfitting, often compromising both clean and adversarial accuracy. This paper mitigates adversarial overfitting in ViTs through a novel, layer-selective fine-tuning approach: SAFER. Instead of optimizing the entire model, we identify and selectively fine-tune a small subset of layers most susceptible to overfitting, applying sharpness-aware minimization to these layers while freezing the rest of the model. Our method consistently enhances both clean and adversarial accuracy over baseline approaches. Typical improvements are around 5%, with some cases achieving gains as high as 20% across various ViT architectures and datasets.
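The core mechanic named in this abstract, a sharpness-aware minimization (SAM) step applied to a chosen subset of layers while the rest stay frozen, can be sketched as follows. This is a toy under stated assumptions: the quadratic loss, the layer names, and the hand-picked `selected` set are hypothetical, and the abstract does not specify how SAFER scores layers for selection.

```python
import math

def loss_and_grads(params, targets):
    """Toy loss 0.5 * ||w - t||^2 summed over layers, with exact gradients."""
    loss, grads = 0.0, {}
    for name, weights in params.items():
        diffs = [w - t for w, t in zip(weights, targets[name])]
        loss += 0.5 * sum(d * d for d in diffs)
        grads[name] = diffs
    return loss, grads

def sam_step_selected(params, targets, selected, lr=0.1, rho=0.05):
    """One SAM update on the layers in `selected`; other layers are frozen."""
    _, grads = loss_and_grads(params, targets)
    # Gradient norm over the trainable (selected) layers only.
    norm = math.sqrt(sum(g * g for n in selected for g in grads[n])) or 1.0
    # Ascent step: perturb selected layers toward higher loss within radius rho.
    perturbed = {
        n: ([w + rho * g / norm for w, g in zip(ws, grads[n])]
            if n in selected else list(ws))
        for n, ws in params.items()
    }
    _, sharp_grads = loss_and_grads(perturbed, targets)
    # Descent step from the original weights, using the perturbed-point gradient.
    return {
        n: ([w - lr * g for w, g in zip(ws, sharp_grads[n])]
            if n in selected else list(ws))
        for n, ws in params.items()
    }

# Two hypothetical layers; only "block2" is fine-tuned.
params = {"block1": [1.0, -2.0], "block2": [0.5, 0.5]}
targets = {"block1": [0.0, 0.0], "block2": [0.0, 0.0]}
updated = sam_step_selected(params, targets, selected={"block2"})
```

Restricting both the perturbation and the update to `selected` keeps the frozen layers bit-for-bit unchanged, which is what makes the fine-tuning layer-selective.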


#373
Adding Additional Control to One-Step Diffusion with Joint Distribution Matching

Yihong Luo · Tianyang Hu · Yifan Song · Jiacheng Sun · Zhenguo Li · Jing Tang

While diffusion distillation has enabled one-step generation through methods like Variational Score Distillation, adapting distilled models to new controls -- such as novel structural constraints or the latest user preferences -- remains challenging. Conventional approaches typically require modifying the base diffusion model and redistilling it -- a process that is both computationally intensive and time-consuming. To address these challenges, we introduce Joint Distribution Matching (JDM), a novel approach that minimizes the reverse KL divergence between image-condition joint distributions. By deriving a tractable upper bound, JDM decouples fidelity learning from condition learning. This asymmetric distillation scheme enables our one-step student to handle controls unknown to the teacher model and facilitates improved classifier-free guidance (CFG) usage and seamless integration of human feedback learning (HFL). Experimental results demonstrate that JDM surpasses baseline methods such as multi-step ControlNet with only one step in most cases, while achieving state-of-the-art performance in one-step text-to-image synthesis through improved usage of CFG or HFL integration.

The Vision Transformer (ViT) has gained prominence for its superior relational modeling prowess. However, its global attention mechanism's quadratic complexity poses substantial computational burdens. A common remedy spatially groups tokens for self-attention, reducing computational requirements. Nonetheless, this strategy neglects semantic information in tokens, possibly scattering semantically-linked tokens across distinct groups, thus compromising the efficacy of self-attention intended for modeling inter-token dependencies. Motivated by these insights, we introduce a fast and balanced clustering method, named Semantic Equitable Clustering (SEC). SEC clusters tokens based on their global semantic relevance in an efficient, straightforward manner. In contrast to traditional clustering methods requiring multiple iterations, our method achieves token clustering in a single pass. Additionally, SEC regulates the number of tokens per cluster, ensuring a balanced distribution for effective parallel processing on current computational platforms without necessitating further optimization. Capitalizing on SEC, we propose a versatile vision backbone, SECViT. Comprehensive experiments in image classification, object detection, instance segmentation, and semantic segmentation validate the effectiveness of SECViT. Remarkably, SECViT attains an impressive 84.3% image classification accuracy with only 27M parameters and 4.6G FLOPs, without the need for additional supervision or data. Moreover, SEC can be conveniently and swiftly applied to multimodal large language models (MLLM), such as LLaVA, to serve as a vision language connector, effectively improving the model’s efficiency while maintaining unchanged or better performance.
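One simple way to obtain equal-size clusters in a single pass is to score each token against a global token, sort once, and cut the sorted order into equal chunks. The abstract does not spell out SEC's actual procedure, so the sketch below is an assumption: the mean-pooled global token, the cosine score, and the function names are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def balanced_single_pass_clusters(tokens, num_clusters):
    """Group token indices into equal-size clusters by global similarity."""
    if len(tokens) % num_clusters != 0:
        raise ValueError("token count must be divisible by num_clusters")
    dim = len(tokens[0])
    # Assumed global token: the mean of all token embeddings.
    global_token = [sum(t[d] for t in tokens) / len(tokens) for d in range(dim)]
    # Single pass: score every token once, sort, and cut into equal chunks.
    order = sorted(
        range(len(tokens)),
        key=lambda i: cosine(tokens[i], global_token),
        reverse=True,
    )
    size = len(tokens) // num_clusters
    return [order[k * size:(k + 1) * size] for k in range(num_clusters)]

# Four toy 2-D tokens: indices 0 and 2 point one way, 1 and 3 another.
tokens = [[2.0, 0.0], [0.0, 1.0], [1.8, 0.2], [0.1, 0.9]]
clusters = balanced_single_pass_clusters(tokens, num_clusters=2)
```

Equal chunk sizes mean attention within each cluster can be batched as one dense tensor, with no padding or iterative reassignment.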


#375
SPD: Shallow Backdoor Protecting Deep Backdoor Against Backdoor Detection

Shunjie Yuan · Xinghua Li · Xuelin Cao · Haiyan Zhang · Mengyao Zhu · Robert Deng

Backdoor attacks have revealed the vulnerability of deep neural networks (DNNs), which motivates the development of secure deep learning systems. However, existing backdoor attacks often fail to bypass backdoor detection and human visual inspection, resulting in the exposure of the backdoor implanted in DNNs, which can subsequently be significantly mitigated through pruning or fine-tuning on benign data. To address this issue, in this paper, we propose a novel backdoor attack called SPD (Shallow Protecting Deep), which consists of a deep backdoor in the frequency domain and a shallow backdoor in the pixel domain, where the shallow backdoor acts as a firewall to protect the deep backdoor from being detected. Specifically, the deep backdoor in SPD samples from a specific Gaussian distribution, and encodes the sampled results into the intensity of the image's amplitude component in the frequency domain using an autoencoder, which serves as the backdoor trigger, thereby ensuring the invisibility of the backdoor attack. The shallow backdoor leverages traditional patch-based triggers, which covers all classes and attracts the defender's attention, thereby preserving the deep backdoor's resistance to existing backdoor detection techniques. Experimental results demonstrate that SPD not only can resist existing backdoor detection techniques, but also, due to the minimal disturbance caused by the backdoor trigger on benign samples, remains invisible, allowing the backdoor samples to pass through the human visual inspection.

Active learning (AL) seeks to reduce annotation costs by selecting the most informative samples for labeling, making it particularly valuable in resource-constrained settings. However, traditional evaluation methods, which focus solely on final accuracy, fail to capture the full dynamics of the learning process. To address this gap, we propose PALM (Performance Analysis of Active Learning Models), a unified and interpretable mathematical model that characterizes AL trajectories through four key parameters: achievable accuracy, coverage efficiency, early-stage performance, and scalability. PALM provides a predictive description of AL behavior from partial observations, enabling the estimation of future performance and facilitating principled comparisons across different strategies. We validate PALM through extensive experiments on CIFAR-10/100 and ImageNet-50/100/200, covering a wide range of AL methods and self-supervised embeddings. Our results demonstrate that PALM generalizes effectively across datasets, budgets, and strategies, accurately predicting full learning curves from limited labeled data. Importantly, PALM reveals crucial insights into learning efficiency, data space coverage, and the scalability of AL methods. By enabling the selection of cost-effective strategies and predicting performance under tight budget constraints, PALM lays the basis for more systematic, reproducible, and data-efficient evaluation of AL in both research and real-world applications.


#377
Inference-Time Diffusion Model Distillation

Geon Yeong Park · Sang Wan Lee · Jong Ye

Diffusion distillation models effectively accelerate reverse sampling by compressing the process into fewer steps. However, these models still exhibit a performance gap compared to their pre-trained diffusion model counterparts, exacerbated by distribution shifts and accumulated errors during multi-step sampling. To address this, we introduce Distillation++, a novel inference-time distillation framework that reduces this gap by incorporating teacher-guided refinement during sampling. Inspired by recent advances in conditional sampling, our approach recasts student model sampling as a proximal optimization problem with a score distillation sampling loss (SDS). To this end, we integrate distillation optimization during reverse sampling, which can be viewed as teacher guidance that drives student sampling trajectory towards the clean manifold using pre-trained diffusion models. Thus, Distillation++ improves the denoising process in real-time without additional source data or fine-tuning. Distillation++ demonstrates substantial improvements over state-of-the-art distillation baselines, particularly in early sampling stages, positioning itself as a robust guided sampling process crafted for diffusion distillation models.


#378
G2D: Boosting Multimodal Learning with Gradient-Guided Distillation

Mohammed Rakib · Arunkumar Bagavathi

Multimodal learning aims to leverage information from diverse data modalities to achieve more comprehensive performance. However, conventional multimodal models often suffer from modality imbalance, where one or a few modalities dominate model optimization, leading to suboptimal feature representation and underutilization of weak modalities. To address this challenge, we introduce Gradient-Guided Distillation ($G^{2}D$), a knowledge distillation framework that optimizes the multimodal model with a custom-built loss function that fuses both unimodal and multimodal objectives. $G^{2}D$ further incorporates a dynamic sequential modality prioritization (SMP) technique in the learning process to ensure each modality leads the learning process, avoiding the pitfall of stronger modalities overshadowing weaker ones. We validate $G^{2}D$ on multiple real-world datasets and show that $G^{2}D$ amplifies the significance of weak modalities while training and outperforms state-of-the-art methods in classification and regression tasks. The source code is available with the supplementary materials.


#379
Personalized Federated Learning under Local Supervision

Qiqi Liu · Jiaqiang Li · Yuchen Liu · Yaochu Jin · Lingjuan Lyu · Xiaohu Wu · Han Yu

A crucial issue in federated learning is the heterogeneity of data across clients, which may lead to model divergence, eventually deteriorating the model performance. Personalized federated learning (pFL) has been shown to be an effective approach to addressing data heterogeneity in federated learning. However, many existing pFL studies rely on directly using the global model for local training without fully assessing its impact on the performance of the local model, resulting in a potential conflict between personalization and generalization. To address this issue, we propose a parallel structure of a local supervisor and an inter-learning model for the local model and introduce a novel pFL method called federated learning by considering data similarity across clients assisted by a local supervisor (FedSimSup). Specifically, FedSimSup maintains an inter-learning model for each client and refines the inter-learning model using a local supervisor for each client. The local supervisor monitors the aggregated global information and ensures that the inter-learning model aligns with the local heterogeneous data to enhance local model performance. Additionally, the similarity between clients is measured based on differences in local data distributions, and this similarity is used to adjust the weights of the inter-learning models. Experimental results show that FedSimSup outperforms eight state-of-the-art federated learning methods in handling heterogeneous data. Additionally, it supports different model architectures across clients, providing greater flexibility when computational resources vary among them.


#380
Token Activation Map to Visually Explain Multimodal LLMs

Yi Li · Hualiang Wang · Xinpeng Ding · Haonan Wang · Xiaomeng Li

Multimodal large language models (MLLMs) are broadly empowering various fields. Despite their advancements, the explainability of MLLMs remains less explored, hindering deeper understanding, model credibility, and effective visualization. Unlike conventional vision models (e.g., CNNs, ViTs, CLIP) that produce a single output, MLLMs generate sequences of tokens progressively, where each generated token depends on the previous context. Therefore, earlier context tokens can introduce redundant activations that interfere with the explanation of later tokens beyond their original information. Existing studies often overlook this issue, but our observations reveal that these redundant correlations can significantly hurt the reliability of explanations. To address this, we propose an estimated causal inference method to mitigate the interference of context to achieve high-quality MLLM explanation, with a novel rank Gaussian filter to further reduce activation noises. We term this method Token Activation Map (TAM) to highlight the consideration of interactions between tokens. TAM also indicates that it excels at explaining multiple tokens of MLLM, which is different from the Class Activation Map (CAM) for a single prediction. Our TAM method significantly outperforms existing SoTA methods, showcasing high-quality visualization results that can be utilized for various scenarios, such as object localization, failure case analysis, video visualization, MLLMs visual comparison, and model understanding (e.g., color, shape, action, location, visual reasoning, multi-turn conversation, etc.). The code will be released upon acceptance.

An immersive acoustic experience enabled by spatial audio is just as crucial as the visual aspect in creating realistic virtual environments. However, existing methods for room impulse response estimation rely either on data-demanding learning-based models or computationally expensive physics-based modeling. In this work, we introduce Audio-Visual Differentiable Room Acoustic Rendering (AV-DAR), a framework that leverages visual cues extracted from multi-view images and acoustic beam tracing for physics-based room acoustic rendering. Experiments across six real-world environments from two datasets demonstrate that our multimodal, physics-based approach is efficient, interpretable, and accurate, significantly outperforming a series of prior methods. Notably, on the Real Acoustic Field dataset, AV-DAR achieves comparable performance to models trained on 10 times more data while delivering relative gains ranging from 16.6% to 50.9% when trained at the same scale.


#382
Highlight
Noise-Modeled Diffusion Models for Low-Light Spike Image Restoration

Ruonan Liu · Lin Zhu · Xijie Xiang · Lizhi Wang · Hua Huang

Spike-based imaging, inspired by the human visual system, offers several advantages, including high temporal resolution and low power consumption, but suffers from significant image degradation in low-light conditions due to noise interference. Restoring spike images under such conditions poses a significant challenge, as traditional frame-based or spike-based techniques are ill-suited to handle such severe noise and unique noise characteristics. This paper proposes a novel approach for restoring low-light spike images using noise-modeled diffusion models. By establishing a noise-embedded spike imaging model under low light, we model the forward diffusion process as the degradation of spike images with proportional and residual terms and incorporate deterministic and non-deterministic components with reverse shifting, enabling the model to capture the distinctive spike noise structure. Additionally, we utilize a region mask image, a dark current map, and a spike density value as conditions to further guide the restoration process by providing prompts for degradation regions, deterministic parameters and noise intensity. Experimental results demonstrate that our method significantly outperforms existing spike-based reconstruction and diffusion-based image restoration methods in both quantitative performance and visual quality. This work opens new possibilities for spike-based imaging systems, particularly in low-light environments, and lays the groundwork for future developments in spike image restoration using advanced diffusion models.


#383
One Perturbation is Enough: On Generating Universal Adversarial Perturbations against Vision-Language Pre-training Models

Hao Fang · Jiawei Kong · Wenbo Yu · Bin Chen · Jiawei Li · Hao Wu · Shu-Tao Xia · Ke Xu

Vision-Language Pre-training (VLP) models have exhibited unprecedented capability in many applications by taking full advantage of the learned multimodal alignment. However, previous studies have shown they are vulnerable to maliciously crafted adversarial samples. Despite recent success, these attacks are generally instance-specific and require generating perturbations for each input sample. In this paper, we reveal that VLP models are also susceptible to the instance-agnostic universal adversarial perturbation (UAP). Specifically, we design a novel Contrastive-training Perturbation Generator with Cross-modal conditions (C-PGC). Given that the pivotal multimodal alignment in VLP models is achieved via contrastive learning, we turn this powerful weapon against VLP models themselves. That is, we employ a malicious version of contrastive learning to train the proposed generator using our carefully crafted positive and negative image-text pairs. Once training is complete, the generator is able to produce universal perturbations that can essentially destroy the established alignment relationship in VLP models. In addition, C-PGC fully utilizes the characteristics of Vision-and-Language (V+L) scenarios by incorporating both unimodal and cross-modal information as effective guidance. Extensive experiments show that C-PGC successfully forces adversarial samples to move away from their original area in the VLP model's feature space, thus fundamentally enhancing attack performance across various victim models and V+L tasks.


#384
Why LVLMs Are More Prone to Hallucinations in Longer Responses: The Role of Context

Ge Zheng · Jiaye Qian · Jiajin Tang · Sibei Yang

Large Vision-Language Models (LVLMs) have made significant progress in recent years but are also prone to hallucination issues. They exhibit more hallucinations in longer, free-form responses, often attributed to accumulated uncertainties. In this paper, we ask: Does increased hallucination result solely from length-induced errors, or is there a deeper underlying mechanism? After a series of preliminary experiments and findings, we suggest that the risk of hallucinations is not caused by length itself but by the increased reliance on context for coherence and completeness in longer responses. Building on these insights, we propose a novel "induce-detect-suppress" framework that actively induces hallucinations through deliberately designed contexts, leverages induced instances for early detection of high-risk cases, and ultimately suppresses potential hallucinations during actual decoding. Our approach achieves consistent, significant improvements across all benchmarks, demonstrating its efficacy. The strong detection and improved hallucination mitigation not only validate our framework but, more importantly, re-validate our hypothesis on context. Rather than solely pursuing performance gains, this study aims to provide new insights and serves as a first step toward a deeper exploration of hallucinations in LVLMs' longer responses. Code will be released.


#385
Intra-view and Inter-view Correlation Guided Multi-view Novel Class Discovery

Xinhang Wan · Jiyuan Liu · Qian Qu · Suyuan Liu · Chuyu Zhang · Fangdi Wang · Xinwang Liu · En Zhu · Kunlun He

In this paper, we address the problem of novel class discovery (NCD), which aims to cluster novel classes by leveraging knowledge from disjoint known classes. While recent advances have made significant progress in this area, existing NCD methods face two major limitations. First, they primarily focus on single-view data (e.g., images), overlooking the increasingly common multi-view data, such as multi-omics datasets used in disease diagnosis. Second, their reliance on pseudo-labels to supervise novel class clustering often results in unstable performance, as pseudo-label quality is highly sensitive to factors such as data noise and feature dimensionality. To address these challenges, we propose a novel framework named Intra-view and Inter-view Correlation Guided Multi-view Novel Class Discovery (IICMVNCD), which is the first attempt to explore NCD in the multi-view setting. Specifically, at the intra-view level, leveraging the distributional similarity between known and novel classes, we employ matrix factorization to decompose features into view-specific shared base matrices and factor matrices. The base matrices capture distributional consistency between the two datasets, while the factor matrices model pairwise relationships between samples. At the inter-view level, we utilize view relationships among known classes to guide the clustering of novel classes. This includes generating predicted labels through the weighted fusion of factor matrices and dynamically adjusting view weights of known classes based on the supervision loss, which are then transferred to novel class learning. Experimental results validate the effectiveness of our proposed approach.

In this paper, we address the problem of Self-Supervised Learning (SSL) for fine-grained representation learning, aimed at distinguishing subtle differences within highly similar subordinate categories. Our preliminary analysis shows that SSL, especially the multi-stage alignment strategy, performs well on generic categories but struggles with fine-grained distinctions. To overcome this limitation, we propose a prototype-based contrastive learning module with stage-wise progressive augmentation. Unlike previous methods, our stage-wise progressive augmentation adapts data augmentation across stages to better suit SSL on fine-grained datasets. The prototype-based contrastive learning module captures both holistic and partial patterns, extracting global and local image representations to enhance feature discriminability. Experiments on popular fine-grained benchmarks for classification and retrieval tasks demonstrate the effectiveness of our method, and extensive ablation studies confirm the superiority of our proposals.


#387
Highlight
Radiant Foam: Real-Time Differentiable Ray Tracing

Shrisudhan Govindarajan · Daniel Rebain · Kwang Moo Yi · Andrea Tagliasacchi

Research on differentiable scene representations is consistently moving towards more efficient, real-time models. Recently, this has led to the popularization of splatting methods, which eschew the traditional ray-based rendering of radiance fields in favor of rasterization. This has yielded a significant improvement in rendering speeds due to the efficiency of rasterization algorithms and hardware, but has come at a cost: the approximations that make rasterization efficient also make implementation of light transport phenomena like reflection and refraction much more difficult. We propose a novel scene representation which avoids these approximations, but keeps the efficiency and reconstruction quality of splatting by leveraging a decades-old efficient volumetric mesh ray tracing algorithm which has been largely overlooked in recent computer vision research. The resulting model, which we name Radiant Foam, achieves rendering speed and quality comparable to Gaussian Splatting, without the constraints of rasterization. Unlike ray traced Gaussian models that use hardware ray tracing acceleration, our method requires no special hardware or APIs beyond the standard features of a programmable GPU.


#388
COSTARR: Consolidated Open Set Technique with Attenuation for Robust Recognition

Ryan Rabinowitz · Steve Cruz · Walter Scheirer · Terrance Boult

Handling novelty is a common challenge in visual recognition systems. Existing open-set methods rely on the familiarity hypothesis, detecting novelty by the absence of familiar features. We introduce a novel attenuation hypothesis, arguing that small weights learned during training, which attenuate features, play a dual role: they differentiate known classes but also discard information valuable for distinguishing known and unknown classes. How to effectively leverage this attenuation information to enhance open-set recognition remains unclear, so we present COSTARR, a novel approach that combines the requirement of familiar features and the lack of unfamiliar ones. We provide a probabilistic interpretation of the COSTARR score, linking it to the likelihood of correct classification and belonging in a known class. To determine the individual contributions of the pre- and post-attenuated features to COSTARR's performance, we conduct ablation studies that demonstrate both pre-attenuated deep features and the underutilized post-attenuated Hadamard product features are essential for improving open-set recognition. Also, to validate generalizability and efficacy across diverse architectures and datasets, we evaluate COSTARR on a large-scale setting, using ImageNet2012-1K as known data and NINCO, iNaturalist, OpenImage-O and other datasets as unknowns, across multiple modern pre-trained architectures (ViTs, ConvNeXts, and ResNet). The experiments demonstrate that COSTARR generalizes effectively across various architectures and significantly outperforms prior state-of-the-art methods by incorporating previously discarded attenuation information, thus advancing open-set recognition capabilities. Code available upon publication.


#389
ALOcc: Adaptive Lifting-based 3D Semantic Occupancy and Cost Volume-based Flow Predictions

Dubing Chen · Jin Fang · Wencheng Han · Xinjing Cheng · Junbo Yin · Cheng-zhong Xu · Fahad Khan · Jianbing Shen

Vision-based semantic occupancy and flow prediction provide critical spatiotemporal cues for real-world tasks like autonomous driving and robotics. In this work, we strive to improve performance by introducing a series of targeted improvements for 3D semantic occupancy prediction and flow estimation. First, we propose an occlusion-aware adaptive lifting mechanism with depth denoising to improve the robustness of 2D-to-3D feature transformation and reduce reliance on depth priors. Second, we enhance semantic consistency between 3D and 2D features using shared semantic prototypes to jointly constrain both modalities. This is supported by confidence- and category-based sampling to tackle long-tail challenges in 3D space. Third, to ease the feature encoding burden in joint semantics and flow prediction, we introduce a BEV cost volume-based method. It connects flow and semantic features via the cost volume and applies a classification-regression supervision scheme to manage varying flow scales in dynamic scenes. Our purely convolutional framework achieves SOTA results across multiple benchmarks for 3D semantic occupancy prediction and joint semantic occupancy-flow prediction. It is also the 2nd solution for the Occupancy and Flow in Autonomous Driving Challenge. We provide multiple model variants that optimally balance efficiency and performance. Our real-time version exceeds all existing real-time methods in speed and accuracy, showcasing unmatched deployability. Code and models will be publicly released.


#390
Information Density Principle for MLLM Benchmarks

Chunyi Li · Xiaozhe Li · Zicheng Zhang · Yuan Tian · Ziheng Jia · Xiaohong Liu · Xiongkuo Min · Jia Wang · Haodong Duan · Kai Chen · Guangtao Zhai

With the emergence of Multimodal Large Language Models (MLLMs), hundreds of benchmarks have been developed to ensure the reliability of MLLMs in downstream tasks. However, the evaluation mechanism itself may not be reliable. For developers of MLLMs, questions remain about which benchmark to use and whether the test results meet their requirements. Therefore, we propose a critical principle of Information Density, which examines how much insight a benchmark can provide for the development of MLLMs. We characterize it from four key dimensions: (1) Fallacy, (2) Difficulty, (3) Redundancy, (4) Diversity. Through a comprehensive analysis of more than 10,000 samples, we measured the information density of 19 MLLM benchmarks. Experiments show that using the latest benchmarks in testing can provide more insight compared to previous ones, but there is still room for improvement in their information density. We hope this principle can promote the development and application of future MLLM benchmarks.


#391
Perspective-Aware Teaching: Adapting Knowledge for Heterogeneous Distillation

Jhe-Hao Lin · Yi Yao · Chan-Feng Hsu · Hongxia Xie · Hong-Han Shuai · Wen-Huang Cheng

Knowledge distillation (KD) involves transferring knowledge from a pre-trained heavy teacher model to a lighter student model, thereby reducing the inference cost while maintaining comparable effectiveness. Prior KD techniques typically assume homogeneity between the teacher and student models. However, as technology advances, a wide variety of architectures have emerged, ranging from early Convolutional Neural Networks (CNNs) to Vision Transformers (ViTs) and Multi-Layer Perceptrons (MLPs). Consequently, developing a universal KD framework compatible with any architecture has become an important research topic. In this paper, we introduce a perspective-aware teaching (PAT) KD framework to enable feature distillation across diverse architectures. Our framework comprises two key components. First, we design prompt tuning blocks that incorporate student feedback, allowing teacher features to adapt to the student model's learning process. Second, we propose region-aware attention to mitigate the view mismatch problem between heterogeneous architectures. By leveraging these two modules, effective distillation of intermediate features can be achieved across heterogeneous architectures. Extensive experiments on CIFAR, ImageNet, and COCO demonstrate the superiority of the proposed method.
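
As a point of reference for the teacher-to-student transfer described above, the following is a minimal sketch of the classic logit-based distillation loss (temperature-softened KL divergence). It is not the PAT framework, which distills intermediate features; the function names, the temperature of 4.0, and the toy logits are all illustrative assumptions.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits, temperature=4.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s) for t, s in zip(p_teacher, p_student) if t > 0)
    return temperature ** 2 * kl

# Toy logits (hypothetical): the loss shrinks as the student matches the teacher.
teacher = [5.0, 1.0, -2.0]
far_student = [0.0, 3.0, 1.0]
close_student = [4.5, 1.2, -1.8]
print(kd_loss(far_student, teacher) > kd_loss(close_student, teacher))
```

The T^2 factor is the usual convention for keeping gradient magnitudes comparable across temperatures; a feature-level method would replace the logits with intermediate activations and a suitable alignment module.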


#392
Is Meta-Learning Out? Rethinking Unsupervised Few-Shot Classification with Limited Entropy

Yunchuan Guan · Yu Liu · Ke Zhou · Zhiqi Shen · Jenq-Newng Hwang · Serge Belongie · Lei Li

Meta-learning is a powerful paradigm for tackling few-shot tasks. However, recent studies indicate that models trained with the whole-class training strategy can achieve comparable performance to those trained with meta-learning in few-shot classification tasks. To demonstrate the value of meta-learning, we establish an entropy-limited supervised setting for fair comparisons. Through both theoretical analysis and experimental validation, we establish that meta-learning has a tighter generalization bound compared to whole-class training. We find that meta-learning is more efficient with limited entropy and is more robust to label noise and heterogeneous tasks, making it well-suited for unsupervised tasks. Based on these insights, we propose MINO, a meta-learning framework designed to enhance unsupervised performance. MINO utilizes the adaptive clustering algorithm DBSCAN with a dynamic head for unsupervised task construction and a stability-based meta-scaler for robustness against label noise. Extensive experiments confirm its effectiveness in multiple unsupervised few-shot and zero-shot tasks.


#393
GenieBlue: Integrating both Linguistic and Multimodal Capabilities for Large Language Models on Mobile Devices

Xudong LU · Yinghao Chen · Renshou Wu · Haohao Gao · Xi Chen · Xue Yang · Xiangyu Zhao · Aojun Zhou · Fangyuan Li · Yafei Wen · Xiaoxin Chen · shuai ren · Hongsheng Li

Recent advancements in Multimodal Large Language Models (MLLMs) have enabled their deployment on mobile devices. However, challenges persist in maintaining strong language capabilities and ensuring hardware compatibility, both of which are crucial for user experience and practical deployment efficiency. In our deployment process, we observe that existing MLLMs often face performance degradation on pure language tasks, and the current NPU platforms on smartphones do not support the MoE architecture, which is commonly used to preserve pure language capabilities during multimodal training. To address these issues, we systematically analyze methods to maintain pure language capabilities during the training of MLLMs, focusing on both training data and model architecture aspects. Based on these analyses, we propose GenieBlue, an efficient MLLM structural design that integrates both linguistic and multimodal capabilities for LLMs on mobile devices. GenieBlue freezes the original LLM parameters during MLLM training to maintain pure language capabilities. It acquires multimodal capabilities by duplicating specific transformer blocks for full fine-tuning and integrating lightweight LoRA modules. This approach preserves language capabilities while achieving comparable multimodal performance through extensive training. Deployed on smartphone NPUs, GenieBlue demonstrates efficiency and practicality for applications on mobile devices.

Machine Unlearning has recently garnered significant attention, aiming to selectively remove knowledge associated with specific data while preserving the model’s performance on the remaining data. A fundamental challenge in this process is balancing effective unlearning with knowledge retention, as naive optimization of these competing objectives can lead to conflicting gradients, hindering convergence and degrading overall performance. To address this issue, we propose Learning to Unlearn while Retaining, which aims to mitigate gradient conflicts between unlearning and retention objectives. Our approach strategically avoids conflicts through an implicit gradient regularization mechanism that emerges naturally within the proposed framework. This prevents conflicting gradients between unlearning and retention, leading to effective unlearning while preserving the model’s utility. We validate our approach across both discriminative and generative tasks, demonstrating its effectiveness in achieving unlearning without compromising performance on remaining data. Our results highlight the advantages of avoiding such gradient conflicts, outperforming existing methods that fail to account for these interactions.


#395
Neural Architecture Search Driven by Locally Guided Diffusion for Personalized Federated Learning

PENG LIAO · Xilu Wang · Yaochu Jin · WenLi Du · Han Hu

Neural Architecture Search (NAS) has gained significant attention in personalized federated learning (PFL) due to its ability to automatically design tailored models for individual clients. While most existing NAS approaches for PFL perform architecture searches on the server side, client-side NAS, where architectures are optimized locally on clients, offers stronger privacy protection by eliminating the need to transmit sensitive model information. However, this paradigm remains underexplored and often suffers from suboptimal average client performance, primarily due to two limitations: (1) inefficient client-side search strategies caused by data isolation and restricted access to local architectures across clients, and (2) slow supernet convergence arising from server aggregation and local supernet training. To address these challenges, we propose a Personalized Federated Stochastic Differential Equation-based NAS (PerFedSDE-NAS). To achieve effective local search, each client employs a guided diffusion model to generate promising personalized architectures tailored to local data characteristics, while a performance predictor based on radial basis functions is used to select only the most promising subset of architectures for evaluation. To accelerate supernet convergence, each client maintains a supernet with an archive-driven training mechanism, and a novel model aggregation strategy is proposed to further enhance weight convergence during FL rounds. We validate PerFedSDE-NAS across three NAS search spaces, including convolutional neural networks and transformers, demonstrating broad applicability. Compared to traditional fixed-model and NAS-based PFL approaches, our method achieves state-of-the-art performance.

Recently, multi-view learning (MVL) has garnered significant attention due to its ability to fuse discriminative information from multiple views. However, real-world multi-view datasets are often heterogeneous and imperfect, which usually makes MVL methods designed for specific combinations of views lack application potential and limits their effectiveness. To address this issue, we propose a novel robust MVL method (namely RML) with simultaneous representation fusion and alignment. Specifically, we introduce a simple yet effective multi-view transformer fusion network where we transform heterogeneous multi-view data into homogeneous word embeddings, and then integrate multiple views by the sample-level attention mechanism to obtain a fused representation. Furthermore, we propose a simulated perturbation based multi-view contrastive learning framework that dynamically generates the noise and unusable perturbations for simulating imperfect data conditions. The simulated noisy and unusable data obtain two distinct fused representations, and we utilize contrastive learning to align them for learning discriminative and robust representations. Our RML is self-supervised and can also be applied for downstream tasks as a regularization. In experiments, we employ it in unsupervised multi-view clustering, noise-label classification, and as a plug-and-play module for cross-modal hashing retrieval. Extensive comparison experiments and ablation studies validate the effectiveness of RML.


#397
PAN-Crafter: Learning Modality-Consistent Alignment for PAN-Sharpening

Jeonghyeok Do · Sungpyo Kim · Geunhyuk Youk · Jaehyup Lee · Munchurl Kim

PAN-sharpening aims to fuse high-resolution panchromatic (PAN) images with low-resolution multi-spectral (MS) images to generate high-resolution multi-spectral (HRMS) outputs. However, cross-modality misalignment---caused by sensor placement, acquisition timing, and resolution disparity---induces a fundamental challenge. Conventional deep learning methods assume perfect pixel-wise alignment and rely on per-pixel reconstruction losses, leading to spectral distortion, double edges, and blurring when misalignment is present. To address this, we propose PAN-Crafter, a modality-consistent alignment framework that explicitly mitigates the misalignment gap between PAN and MS modalities. At its core, Modality-Adaptive Reconstruction (MARs) enables a single network to jointly reconstruct HRMS and PAN images, leveraging PAN’s high-frequency details as auxiliary self-supervision. Additionally, we introduce Cross-Modality Alignment-Aware Attention (CM3A), a novel mechanism that bidirectionally aligns MS texture to PAN structure and vice versa, enabling adaptive feature refinement across modalities. Extensive experiments on multiple benchmark datasets demonstrate that our PAN-Crafter outperforms the most recent state-of-the-art method in all metrics, even with 50.11$\times$ faster inference time and 0.63$\times$ the memory size. Furthermore, it demonstrates strong generalization performance on unseen satellite datasets, showing its robustness across different conditions.


#398
Attention to the Burstiness in Visual Prompt Tuning!

Yuzhu Wang · Manni Duan · Shu Kong

Visual Prompt Tuning (VPT) is a parameter-efficient finetuning technique that adapts a pre-trained vision Transformer (ViT) by learning a small set of parameters in the input space, known as prompts. In VPT, we uncover "burstiness" in the values arising from the interaction of image patch embeddings, and the key and query projectors within Transformer's self-attention module. Interestingly, the values of patch embeddings and the key and query projectors exhibit Laplacian and hyper-Laplacian distribution, respectively. Intuitively, these non-Gaussian distributions pose challenges for learning prompts. To address this, we propose whitening these data, de-correlating them and equalizing their variance towards more Gaussian before learning prompts. We derive the whitening matrix over random image patch embeddings and ViT's key and query projectors, and multiply it with the prompt to be learned in a bilinear manner. Surprisingly, this method significantly accelerates prompt tuning and boosts accuracy, e.g., $>$25 points on the CUB dataset; interestingly, it learns "bursty prompts". As bilinear models are known to introduce burstiness, we present a compact method by learning two small sets of parameters whose multiplication yields the final prompts. We call the proposed methods Bilinear Prompt Tuning (BPT). Extensive experiments demonstrate that BPT methods not only outperform various VPT methods across multiple benchmark datasets but also reduce parameter count and computation overhead.
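
The compact variant described above, two small parameter sets whose product yields the prompts, can be illustrated with a plain low-rank factorization. This is only a sketch under assumed sizes (50 prompt tokens, embedding width 768, inner rank 8) and an assumed Gaussian initialization; the abstract does not state these values, and the whitening step is omitted.

```python
import random

def matmul(a, b):
    """Multiply an (n x r) matrix by an (r x d) matrix, both lists of rows."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(rows)]

def init_factor(rows, cols, rng, scale=0.02):
    """Small random initialization for one learnable factor (assumed scheme)."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

# Hypothetical sizes, chosen only for illustration.
NUM_PROMPTS, EMBED_DIM, RANK = 50, 768, 8

rng = random.Random(0)
factor_a = init_factor(NUM_PROMPTS, RANK, rng)  # learnable, 50 x 8
factor_b = init_factor(RANK, EMBED_DIM, rng)    # learnable, 8 x 768

# The prompts fed to the ViT are the product of the two factors.
prompts = matmul(factor_a, factor_b)            # 50 x 768

full_params = NUM_PROMPTS * EMBED_DIM                    # direct prompts
factored_params = NUM_PROMPTS * RANK + RANK * EMBED_DIM  # two small factors
print(full_params, factored_params)  # 38400 6544
```

Under these assumed sizes the factored form stores roughly a sixth of the parameters of directly learned prompts, which is consistent with the parameter reduction the abstract reports in general terms.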


#399
DIP: Unsupervised Dense In-Context Post-training of Visual Representations

Sophia Sirko-Galouchenko · Spyros Gidaris · Antonin Vobecky · Andrei Bursuc · Nicolas THOME

We introduce DIP, a novel unsupervised post-training method designed to enhance dense image representations in large-scale pretrained vision encoders for in-context scene understanding. Unlike prior approaches that rely on complex self-distillation architectures, our method trains the vision encoder using pseudo-tasks that explicitly simulate downstream in-context scenarios, inspired by meta-learning principles. To enable post-training on unlabeled data, we propose an automatic mechanism for generating in-context tasks that combines a pretrained diffusion model and the vision encoder itself. DIP is simple, unsupervised, and computationally efficient, requiring less than 9 hours on a single A100 GPU. By learning dense representations through pseudo in-context tasks, it achieves strong performance across a wide variety of downstream real-world in-context scene understanding tasks. It outperforms both the initial vision encoder and prior methods, offering a practical and effective solution for improving dense representations.


#400
Zero-Shot Vision Encoder Grafting via LLM Surrogates

Kaiyu Yue · Vasu Singla · Menglin Jia · John Kirchenbauer · Rifaa Qadri · Zikui Cai · Abhinav Bhatele · Furong Huang · Tom Goldstein

Vision language models (VLMs) typically pair a modestly sized vision encoder with a large language model (LLM), e.g., Llama-70B, making the decoder the primary computational burden during training. To reduce costs, a promising strategy is to first train the vision encoder using a small language model before transferring it to the large one. We construct small "surrogate models" that share the same embedding space and representation language as the large target LLM by directly inheriting its shallow layers. Vision encoders trained on the surrogate can then be directly transferred to the larger model, a process we call zero-shot grafting -- when plugged directly into the full-size target LLM, the grafted pair surpasses the encoder-surrogate pair and, on some benchmarks, even performs on par with full decoder training with the target LLM. Furthermore, our surrogate training approach reduces overall VLM training costs by $\sim$45\% when using Llama-70B as the decoder.


#401
Long-Tailed Classification with Multi-Granularity Semantics

Yuting Liu · Liu Yang · Yu Wang

Real-world data often exhibit long-tailed distributions, which degrade data quality and pose challenges for deep learning. To address this issue, knowledge transfer from head classes to tail classes has been shown to effectively mitigate feature sparsity. However, existing methods often overlook class differences, leading to suboptimal knowledge transfer. While the class space exhibits a label hierarchy, similarity relationships beyond hierarchically related categories remain underexplored. Considering the human ability to process visual perception problems in a multi-granularity manner guided by semantics, this paper presents a novel semantic knowledge-driven contrastive learning method. Inspired by the implicit knowledge embedded in large language models, the proposed LLM-based label semantic generation method overcomes the limitations of the label hierarchy. Additionally, a semantic knowledge graph is constructed based on the extended label information to guide representation learning. This enables the model to dynamically identify relevant classes for learning and facilitates multi-granularity knowledge transfer between similar categories. Experiments on long-tail benchmark datasets, including CIFAR-10-LT, CIFAR-100-LT, and ImageNet-LT, demonstrate that the proposed method significantly improves the accuracy of tail classes and enhances overall performance without compromising the accuracy of head classes.


#402
LLaVA-3D: A Simple yet Effective Pathway to Empowering LMMs with 3D Capabilities

CHENMING ZHU · Tai Wang · Wenwei Zhang · Jiangmiao Pang · Xihui Liu

Recent advancements in Large Multimodal Models (LMMs) have greatly enhanced their proficiency in 2D visual understanding tasks, enabling them to effectively process and understand images and videos. However, the development of LMMs with 3D scene understanding capabilities has been hindered by the lack of large-scale 3D vision-language datasets and powerful 3D encoders. In this paper, we introduce a simple yet effective framework called LLaVA-3D. Leveraging the strong 2D visual understanding priors from LLaVA, our LLaVA-3D efficiently adapts LLaVA for 3D scene understanding without compromising 2D understanding capabilities. To achieve this, we utilize the 3D position embeddings to enhance the 2D CLIP Patches with 3D spatial context information and construct 3D patches. By integrating the 3D position embeddings into 2D LMMs and employing joint 2D and 3D vision-language instruction tuning, we establish a unified architecture for both 2D visual understanding and 3D scene understanding. In contrast to previous 3D LMMs, LLaVA-3D supports decoding accurate 3D spatial perception outputs, e.g., 3D bounding boxes, directly from these 3D patches, without relying on the time-consuming off-the-shelf 3D segmentors. Experimental results show that LLaVA-3D converges 3.5x faster than existing 3D LMMs when trained on 3D vision-language datasets. Moreover, LLaVA-3D not only achieves state-of-the-art performance across various 3D tasks but also maintains comparable 2D visual understanding and vision-language conversation capabilities with LLaVA.


#404
Highlight
ReTracker: Exploring Image Matching for Robust Online Any Point Tracking

Dongli Tan · Xingyi He · Sida Peng · Yiqing Gong · Xing Zhu · Jiaming Sun · Ruizhen Hu · Yujun Shen · Hujun Bao · Xiaowei Zhou

This paper aims to establish correspondences for a set of 2D query points across a video sequence in an online manner. Recent methods leverage future frames to achieve smooth point tracking at the current frame, but they still struggle to find points with significant viewpoint changes after long-term occlusions and inherently cannot achieve online tracking. To overcome these challenges, we develop a novel online tracking framework, named ReTracker, that integrates two advances in image matching with tracking-specific designs. First, a decoder network with a global receptive field is incorporated with a temporal attention module to robustly track points undergoing large location changes. Second, the decoder network is adapted to pretrain on large-scale two-view matching data, which offers significantly greater diversity and volume than tracking data, to learn general matching priors. This pretraining strategy effectively enhances our tracker's ability to handle viewpoint and appearance variations after long-term occlusions. Experiments demonstrate that our method outperforms recent online trackers across multiple benchmarks and achieves competitive or superior performance compared to offline methods. Furthermore, we collect an ego-centric, occlusion-heavy dataset to illustrate the retracking capabilities of our approach. The code and dataset will be released for reproducibility.


#405
HIS-GPT: Towards 3D Human-In-Scene Multimodal Understanding

JIAHE ZHAO · RuiBing Hou · zejie tian · Hong Chang · Shiguang Shan

We propose a new task to benchmark human-in-scene understanding for embodied agents: Human-In-Scene Question Answering (HIS-QA). Given a human motion within a 3D scene, HIS-QA requires the agent to comprehend human states and behaviors, reason about its surrounding environment, and answer human-related questions within the scene. To support this new task, we present HIS-Bench, a multimodal benchmark that systematically evaluates HIS understanding across a broad spectrum, from basic perception to commonsense reasoning and planning. Our evaluation of various vision-language models on HIS-Bench reveals significant limitations in their ability to handle HIS-QA tasks. To this end, we propose HIS-GPT, the first foundation model for HIS understanding. HIS-GPT integrates 3D scene context and human motion dynamics into large language models while incorporating specialized mechanisms to capture human-scene interactions. Extensive experiments demonstrate that HIS-GPT sets a new state-of-the-art on HIS-QA tasks. We hope this work inspires future research of human behavior analysis in 3D scenes, advancing embodied AI and world models.


#406
Backdoor Attacks on Neural Networks via One-Bit Flip

Xiang Li · Lannan Luo · Qiang Zeng

Conventional backdoor attacks on deep neural networks (DNNs) typically assume that an attacker can manipulate the training data or process. However, recent research introduces a more practical threat model by injecting backdoors at the inference stage. These approaches leverage bit flip attacks to modify model weights using memory fault injection techniques such as Rowhammer. Despite their effectiveness, they suffer from a significant limitation---the need to flip a relatively large number of bits simultaneously, which is highly challenging in practice. To overcome this constraint, we propose SOLEFLIP, the first one-bit-flip backdoor attack. Unlike prior methods that rely on optimization-based bit searches and require flipping multiple bits, our algorithm identifies a promising weight for the attack and flips a single bit to insert a backdoor. We evaluate SOLEFLIP on CIFAR-10, SVHN, and ImageNet across various DNN architectures, including a vision transformer. The results show that SOLEFLIP achieves high attack success rates (up to 99.9\%, with an average of 98.9\%) while causing minimal degradation to benign accuracy (0.0\% on average). Furthermore, SOLEFLIP is resilient to backdoor defenses. Our findings reveal a critical threat to DNNs: flipping just one bit is sufficient to execute a successful backdoor attack.
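
To make concrete why a single flipped bit can matter, the sketch below inverts one bit of a weight stored as an IEEE-754 float32. It only illustrates the storage-level effect; it is not the SOLEFLIP algorithm, and the helper name `flip_bit` and the example weight 0.5 are assumptions made for illustration.

```python
import struct

def flip_bit(weight: float, bit: int) -> float:
    """Return `weight`, stored as IEEE-754 float32, with one bit inverted.

    Bit 31 is the sign, bits 30..23 the exponent, bits 22..0 the mantissa.
    """
    if not 0 <= bit <= 31:
        raise ValueError("bit index must be in [0, 31]")
    # Reinterpret the float32 bytes as an unsigned 32-bit integer.
    (raw,) = struct.unpack("<I", struct.pack("<f", weight))
    # XOR toggles exactly one bit; reinterpret the result as float32 again.
    (flipped,) = struct.unpack("<f", struct.pack("<I", raw ^ (1 << bit)))
    return flipped

original = 0.5
print(flip_bit(original, 30))  # top exponent bit: 0.5 becomes 2**127
print(flip_bit(original, 0))   # lowest mantissa bit: 0.5 + 2**-24
print(flip_bit(original, 31))  # sign bit: -0.5
```

The contrast between the exponent and mantissa cases shows that which bit is flipped, and in which weight, determines the effect; choosing that weight and bit is the part the paper's method addresses.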


#407
Highlight
A Linear N-Point Solver for Structure and Motion from Asynchronous Tracks

Hang Su · Yunlong Feng · Daniel Gehrig · Panfeng Jiang · Ling Gao · Xavier Lagorce · Laurent Kneip

Structure and continuous motion estimation from point correspondences is a fundamental problem in computer vision that has been powered by well-known algorithms such as the familiar 5-point or 8-point algorithm. However, despite their acclaim, these algorithms are limited to processing point correspondences originating from a pair of views each one representing an instantaneous capture of the scene. Yet, in the case of rolling shutter cameras, or more recently, event cameras, this synchronization breaks down. In this work, we present a unified approach for structure and linear motion estimation from 2D point correspondences with arbitrary timestamps, from an arbitrary set of views. By formulating the problem in terms of first-order dynamics and leveraging a constant velocity motion model, we derive a novel, linear point incidence relation allowing for the efficient recovery of both linear velocity and 3D points with predictable degeneracies and solution multiplicities. Owing to its general formulation, it can handle correspondences from a wide range of sensing modalities such as global shutter, rolling shutter, and event cameras, and can even combine correspondences from different collocated sensors. We validate the effectiveness of our solver on both simulated and real-world data, where we show consistent improvement across all modalities when compared to recent approaches. We believe our work opens the door to efficient structure and motion estimation from asynchronous data.


#408
Highlight
Long-LRM: Long-sequence Large Reconstruction Model for Wide-coverage Gaussian Splats

Chen Ziwen · Hao Tan · Kai Zhang · Sai Bi · Fujun Luan · Yicong Hong · Li Fuxin · Zexiang Xu

We propose Long-LRM, a feed-forward 3D Gaussian reconstruction model for instant, high-resolution, 360$^\circ$ wide-coverage, scene-level reconstruction. Specifically, it takes in 32 input images at a resolution of $960\times 540$ and produces the Gaussian reconstruction in just 1 second on a single A100 GPU. To handle the long sequence of 250K tokens brought by the large input size, Long-LRM features a mixture of the recent Mamba2 blocks and the classical transformer blocks, enhanced by a light-weight token merging module and Gaussian pruning steps that balance between quality and efficiency. We evaluate Long-LRM on the large-scale DL3DV benchmark and Tanks&Temples, demonstrating reconstruction quality comparable to the optimization-based methods while achieving an 800$\times$ speedup w.r.t. the optimization-based approaches and an input size at least 60$\times$ larger than the previous feed-forward approaches. We conduct extensive ablation studies on our model design choices for both rendering quality and computation efficiency. We also explore Long-LRM's compatibility with other Gaussian variants such as 2D GS, which enhances Long-LRM's ability in geometry reconstruction. Project page: https://longgggglrm.github.io


#409
TR-PTS: Task-Relevant Parameter and Token Selection for Efficient Tuning

Siqi Luo · Haoran Yang · Yi Xin · Mingyang Yi · Guangyang Wu · Guangtao Zhai · Xiaohong Liu

Large pre-trained models achieve remarkable performance in vision tasks but are impractical for fine-tuning due to high computational and storage costs. Parameter-Efficient Fine-Tuning (PEFT) methods mitigate this issue by updating only a subset of parameters; however, most existing approaches are task-agnostic, failing to fully exploit task-specific adaptations, which leads to suboptimal efficiency and performance. To address this limitation, we propose Task-Relevant Parameter and Token Selection (TR-PTS), a task-driven framework that enhances both computational efficiency and accuracy. Specifically, we introduce Task-Relevant Parameter Selection, which utilizes the Fisher Information Matrix (FIM) to identify and fine-tune only the most informative parameters in a layer-wise manner, while keeping the remaining parameters frozen. Simultaneously, Task-Relevant Token Selection dynamically preserves the most informative tokens and merges redundant ones, reducing computational overhead. By jointly optimizing parameters and tokens, TR-PTS enables the model to concentrate on task-discriminative information. We evaluate TR-PTS on benchmark datasets, including FGVC and VTAB-1k, where it achieves state-of-the-art performance, surpassing full fine-tuning by 3.40% and 10.35%, respectively.
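
The layer-wise, Fisher-based parameter selection described above can be sketched as follows. The sketch assumes a diagonal empirical Fisher (mean of squared per-sample gradients) and a fixed top-k budget per layer; both are illustrative assumptions, as are the layer names and toy gradients, and the abstract does not specify the exact estimator or budget rule.

```python
def diagonal_fisher(per_sample_grads):
    """Empirical diagonal Fisher: mean of squared per-sample gradients."""
    n = len(per_sample_grads)
    dim = len(per_sample_grads[0])
    return [sum(g[i] ** 2 for g in per_sample_grads) / n for i in range(dim)]

def select_top_k(scores, k):
    """Boolean mask that is True for the k highest-scoring parameters."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = set(ranked[:k])
    return [i in keep for i in range(len(scores))]

# Toy per-sample gradients for two hypothetical layers (3 samples each).
grads = {
    "layer1": [[0.1, -2.0, 0.3], [0.2, -1.5, 0.1], [0.0, -1.8, 0.2]],
    "layer2": [[1.0, 0.0], [0.9, 0.1], [1.1, -0.1]],
}

# Selection is done per layer; True means "fine-tune", False means "freeze".
masks = {name: select_top_k(diagonal_fisher(g), k=1) for name, g in grads.items()}
print(masks)  # {'layer1': [False, True, False], 'layer2': [True, False]}
```

In training, such a mask would typically be applied by zeroing the gradients of frozen entries before each optimizer step.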

The intricacy of brain signals drives research that leverages multimodal AI to align brain modalities with visual and textual data for explainable descriptions. However, most existing studies are limited to coarse interpretations, lacking essential details on object descriptions, locations, attributes, and their relationships. This leads to imprecise and ambiguous reconstructions when using such cues for visual decoding. To address this, we analyze different choices of vision feature spaces from pre-trained visual components within Multimodal Large Language Models (MLLMs) and introduce a zero-shot multimodal brain decoding method that interacts with these models to decode across multiple levels of granularity. To assess a model's ability to decode fine details from brain signals, we propose the Multi-Granularity Brain Detail Understanding Benchmark (MG-BrainDub). This benchmark includes two key tasks: detailed descriptions and salient question-answering, with metrics highlighting key visual elements like objects, attributes, and relationships. Our approach enhances neural decoding precision and supports more accurate neuro-decoding applications.


#411
Joint Diffusion Models in Continual Learning

Paweł Skierś · Kamil Deja

In this work, we introduce JDCL, a new method for continual learning with generative rehearsal based on joint diffusion models. Neural networks suffer from catastrophic forgetting, defined as an abrupt loss in the model's performance when retrained with additional data coming from a different distribution. Generative-replay-based continual learning methods try to mitigate this issue by retraining a model with a combination of new and rehearsal data sampled from a generative model. In this work, we propose to extend this idea by combining a continually trained classifier with a diffusion-based generative model into a single, jointly optimized neural network. We show that such shared parametrization, combined with the knowledge distillation technique, allows for stable adaptation to new tasks without catastrophic forgetting. We evaluate our approach on several benchmarks, where it outperforms recent state-of-the-art generative replay techniques. Additionally, we extend our method to the semi-supervised continual learning setup, where it outperforms competing buffer-based replay techniques, and evaluate, in a self-supervised manner, the quality of trained representations.


#412
TurboTrain: Towards Efficient and Balanced Multi-Task Learning for Multi-Agent Perception and Prediction

Zewei Zhou · Zhihao Zhao · Tianhui Cai · Zhiyu Huang · Bolei Zhou · Jiaqi Ma

End-to-end training of multi-agent systems offers significant advantages in improving multi-task performance. However, training such models remains challenging and requires extensive manual design and monitoring. In this work, we introduce TurboTrain, a novel and efficient training framework for multi-agent perception and prediction. TurboTrain comprises two key components: a multi-agent spatiotemporal pretraining scheme based on masked reconstruction learning and a balanced multi-task learning strategy based on gradient conflict suppression. By streamlining the training process, our framework eliminates the need for manually designing and tuning complex multi-stage training pipelines, substantially reducing training time and improving performance. We evaluate TurboTrain on a real-world cooperative driving dataset and demonstrate that it further improves the performance of state-of-the-art multi-agent perception and prediction models by nearly 9%. Our results highlight that pretraining effectively captures spatiotemporal multi-agent features and significantly benefits downstream tasks. Moreover, the proposed balanced multi-task learning strategy enhances cooperative detection and prediction. The codebase will be released to facilitate future multi-agent multi-task research.

Neural networks are widely adopted to solve complex and challenging tasks. Especially in high-stakes decision-making, understanding their reasoning process is crucial, yet proves challenging for modern deep networks. Feature visualization (FV) is a powerful tool to decode what information neurons are responding to and hence to better understand the reasoning behind such networks. In particular, in FV we generate human-understandable images that reflect the information detected by neurons of interest. However, current methods often yield unrecognizable visualizations, exhibiting repetitive patterns and visual artifacts that are hard to understand for a human. To address these problems, we propose to guide FV through statistics of real image features combined with measures of relevant network flow to generate prototypical images. Our approach yields human-understandable visualizations that both qualitatively and quantitatively improve over state-of-the-art FVs across various architectures. As such, it can be used to decode which information is used by the network's reasoning process, complementing the methodology of mechanistic circuits that identify where relevant information is encoded.


#414
Overcoming Dual Drift for Continual Long-Tailed Visual Question Answering

Feifei Zhang · Zhihao Wang · Xi Zhang · Changsheng Xu

Visual Question Answering (VQA) is a widely explored multimodal task aimed at answering questions based on images. Recently, a few studies have started to investigate continual learning in VQA to cope with evolving multimodal data streams. However, these studies fall short of tackling another critical issue in real-world VQA applications: the long-tailed distribution of data. In this paper, we introduce Continual Long-Tailed Visual Question Answering (CLT-VQA) and identify two critical challenges: inner-task prototype drift, where classifier prototypes become biased toward majority classes due to imbalanced data, and inter-task feature drift, where learned features shift over time, causing forgetting of previously learned knowledge. To address these challenges, we propose a unified dual-balance approach that integrates a Balanced Classifier Prototype (BCP) learning module and a Multi-modal Feature Alignment (MFA) module. The BCP optimizes classifier prototypes to achieve balanced class representation, while the MFA aligns features consistently across tasks, preventing catastrophic forgetting. Extensive experimental results demonstrate that our method outperforms existing models, validating the effectiveness of the proposed approach. Code is available in the supplementary materials.

Integrating external tools into Large Foundation Models (LFMs) has emerged as a promising approach to enhance their problem-solving capabilities. While existing studies have demonstrated strong performance in tool-augmented Visual Question Answering (VQA), recent benchmarks reveal significant gaps in real-world tool-use proficiency, particularly in functionally diverse multimodal settings requiring multi-step reasoning. In this work, we introduce ToolVQA, a large-scale multimodal dataset comprising 23K instances, designed to bridge this gap. Unlike previous datasets that rely on synthetic scenarios and simplified queries, ToolVQA features real-world visual contexts and challenging implicit multi-step reasoning tasks, better aligning with real user interactions. To construct this dataset, we propose ToolEngine, a novel data generation pipeline that employs Depth-First Search (DFS) with a dynamic in-context example matching mechanism to simulate human-like tool-use reasoning. ToolVQA encompasses 10 multimodal tools across 7 diverse task domains, with an average inference length of 2.78 reasoning steps per instance. 7B LFMs fine-tuned on ToolVQA not only achieve impressive performance on our test set but also surpass the large closed-source model GPT-3.5-turbo on various out-of-distribution (OOD) datasets, demonstrating strong generalizability to real-world tool-use scenarios.

In practice, environments constantly change over time and space, posing significant challenges for object detectors trained based on a closed-set assumption, i.e., training and test data share the same distribution. To this end, continual test-time adaptation has attracted much attention, aiming to improve detectors' generalization by fine-tuning a few specific parameters, e.g., BatchNorm layers. However, based on a small number of test images, fine-tuning certain parameters may affect the representation ability of other fixed parameters, leading to performance degradation. Instead, we explore a new mechanism, i.e., converting the fine-tuning process to a specific-parameter generation. Particularly, we first design a dual-path LoRA-based domain-aware adapter that disentangles features into domain-invariant and domain-specific components, enabling efficient adaptation. Additionally, a conditional diffusion-based parameter generation mechanism is presented to synthesize the adapter’s parameters based on the current environment, preventing the optimization from getting stuck in local optima. Finally, we propose a class-centered optimal transport alignment method to mitigate catastrophic forgetting. Extensive experiments conducted on various continuous domain adaptive object detection tasks demonstrate the effectiveness. Meanwhile, visualization results show that representation extracted by the generated parameters can capture more object-related information and strengthen the generalization ability.

Transfer-based attacks pose a significant security threat to deep neural networks (DNNs), due to their strong performance on unseen models in real-world black-box scenarios. Building on this, feature importance-based attacks further improve the transferability of adversarial examples by effectively suppressing model-specific feature patterns. However, existing methods primarily focus on single-granularity patch and single-stage training, leading to suboptimal solutions. To address these limitations, we propose a general multi-stage optimization framework based on Semantics-aware Multi-granularity Patchout, dubbed SMP-Attack. Compared to the non-deformable/regular patch definition, we incorporate multi-granularity into the generation process of deformable/irregular patches, thereby enhancing the quality of the computed aggregate gradient. In contrast to conventional joint optimization of multi-layer losses, we introduce an effective multi-stage training strategy that systematically explores significant model-agnostic features from shallow to intermediate layers. Employing the ImageNet dataset, we conduct extensive experiments on undefended/defended CNNs and ViTs, which unequivocally demonstrate the superior performance of our proposed SMP-Attack over current state-of-the-art methods in black-box scenarios. Furthermore, we assess the compatibility of our multi-stage optimization, which supersedes single-stage training employed in existing feature-based methods, culminating in substantial performance improvement.


#418
Towards Comprehensive Lecture Slides Understanding: Large-scale Dataset and Effective Method

Enming Zhang · Yuzhe Li · Yuliang Liu · Yingying Zhu · Xiang Bai

Online education has become widespread in universities and educational institutions worldwide. Lecture slides, a fundamental component of online education, contain a wealth of information, playing a crucial role in learning. However, previous works have not yet paid sufficient attention to understanding lecture slides, as shown by the absence of a large-scale dataset and comprehensive understanding tasks. To facilitate research on lecture slides understanding, we establish LecSlides-370K, which consists of 25,542 lectures with 370,078 slides across 15 areas. We also introduce two comprehensive tasks, Lecture Summary and Lecture Question Answering (QA), for providing different perspectives of slides understanding. Furthermore, complex and flexible text relations can hinder the understanding of the internal logic of slides. To address this challenge, we propose a novel method, named SlideParser, which includes an auxiliary branch to predict text relations within slides and enhance attention between related texts, thereby improving slides understanding. With extensive experiments, we show the superiority of our proposed method on both LecSlides-370K and SlideVQA. Dataset and code will be released soon.


#419
Highlight
Backdoor Mitigation by Distance-Driven Detoxification

Shaokui Wei · Jiayin Liu · Hongyuan Zha

Backdoor attacks undermine the integrity of machine learning models by allowing attackers to manipulate predictions using poisoned training data. Such attacks lead to targeted misclassification when specific triggers are present, while the model behaves normally under other conditions. This paper considers a post-training backdoor defense task, aiming to detoxify the backdoors in pre-trained models. We begin by analyzing the underlying issues of vanilla fine-tuning and observe that it is often trapped in regions with low loss for both clean and poisoned samples. Motivated by such observations, we propose Distance-Driven Detoxification (D3), an innovative approach that reformulates backdoor defense as a constrained optimization problem. Specifically, D3 promotes the model's departure from the vicinity of its initial weights, effectively reducing the influence of backdoors. Extensive experiments on state-of-the-art (SOTA) backdoor attacks across various model architectures and datasets demonstrate that D3 not only matches but often surpasses the performance of existing SOTA post-training defense techniques.


#420
Zeroth-Order Fine-Tuning of LLMs in Random Subspaces

Ziming Yu · Pan Zhou · Sike Wang · Jia Li · Mi Tian · Hua Huang

Fine-tuning Large Language Models (LLMs) has proven effective for a variety of downstream tasks. However, as LLMs grow in size, the memory demands for backpropagation become increasingly prohibitive. Zeroth-order (ZO) optimization methods offer a memory-efficient alternative by using forward passes to estimate gradients, but the variance of gradient estimates typically scales linearly with the model's parameter dimension—a significant issue for LLMs. In this paper, we propose the random Subspace Zeroth-order (SubZero) optimization to address the challenges posed by LLMs' high dimensionality. We introduce a low-rank perturbation tailored for LLMs that significantly reduces memory consumption while improving performance. Additionally, we prove that our gradient estimation closely approximates the backpropagation gradient, exhibits lower variance than traditional ZO methods, and ensures convergence when combined with SGD. Experimental results show that SubZero enhances fine-tuning performance and achieves faster convergence compared to standard ZO approaches like MeZO across various language modeling tasks. The source code is in the supplementary and will be publicly released.
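
As a rough illustration of the forward-pass-only gradient estimation mentioned above, the following is a minimal sketch of the generic two-point zeroth-order estimator with a full-dimensional Gaussian perturbation. It is not SubZero's low-rank subspace perturbation, which the abstract does not specify in detail. The names `zo_gradient`, `quadratic`, and `average_estimate` are hypothetical, and the quadratic loss stands in for a model loss.

```python
import random


def zo_gradient(loss, theta, eps=1e-3, seed=0):
    """Two-point zeroth-order gradient estimate along one random direction.

    Uses only two evaluations of `loss` (two forward passes), no backpropagation.
    """
    rng = random.Random(seed)
    z = [rng.gauss(0.0, 1.0) for _ in theta]           # random direction
    plus = [t + eps * zi for t, zi in zip(theta, z)]   # theta + eps * z
    minus = [t - eps * zi for t, zi in zip(theta, z)]  # theta - eps * z
    scale = (loss(plus) - loss(minus)) / (2.0 * eps)   # directional derivative estimate
    return [scale * zi for zi in z]


def quadratic(theta):
    """Toy loss (assumption for illustration); its true gradient is 2 * theta."""
    return sum(t * t for t in theta)


def average_estimate(theta, n_samples):
    """Average many single-direction estimates to reduce variance."""
    total = [0.0] * len(theta)
    for s in range(n_samples):
        g = zo_gradient(quadratic, theta, seed=s)
        total = [a + b for a, b in zip(total, g)]
    return [a / n_samples for a in total]
```

A single estimate is noisy, and its variance grows with the number of parameters, which is the scaling issue the abstract refers to. Averaging over directions, or restricting perturbations to a lower-dimensional subspace, are two ways to reduce it.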


#421
Gradient Decomposition and Alignment for Incremental Object Detection

Wenlong Luo · Shizhou Zhang · De Cheng · Yinghui Xing · Guoqiang Liang · PENG WANG · Yanning Zhang

Incremental object detection (IOD) is crucial for enabling AI systems to continuously learn new object classes over time while retaining knowledge of previously learned categories, allowing models to adapt to dynamic environments without forgetting prior information. Existing IOD methods primarily employ knowledge distillation to mitigate catastrophic forgetting, yet these approaches overlook class overlap issues, often resulting in suboptimal performance. In this paper, we propose a novel framework for IOD that leverages a decoupled gradient alignment technique on top of the specially proposed pseudo-labeling strategy. Our method employs a Gaussian Mixture Model to accurately estimate pseudo-labels of previously learned objects in current training images, effectively functioning as a knowledge-replay mechanism. This strategy reinforces prior knowledge retention and prevents the misclassification of unannotated foreground objects from earlier classes as background. Furthermore, we introduce an adaptive gradient decomposition and alignment method to maintain model stability while facilitating positive knowledge transfer. By aligning gradients from both old and new classes, our approach preserves previously learned knowledge while enhancing plasticity for new tasks. Extensive experiments on two IOD benchmarks demonstrate the effectiveness of the proposed method, achieving superior performance to state-of-the-art methods.


#422
Highlight
Synthesizing Near-Boundary OOD Samples for Out-of-Distribution Detection

Jinglun Li · Kaixun Jiang · Zhaoyu Chen · Bo Lin · Yao Tang · Weifeng Ge · Wenqiang Zhang

Pre-trained vision-language models have exhibited remarkable abilities in detecting out-of-distribution (OOD) samples. However, some challenging OOD samples, which lie close to in-distribution (InD) data in image feature space, can still lead to misclassification. The emergence of foundation models like diffusion models and multimodal large language models (MLLMs) offers a potential solution to this issue. In this work, we propose SynOOD, a novel approach that harnesses foundation models to generate synthetic, challenging OOD data for fine-tuning CLIP models, thereby enhancing boundary-level discrimination between InD and OOD samples. Our method uses an iterative in-painting process guided by contextual prompts from MLLMs to produce nuanced, boundary-aligned OOD samples. These samples are refined through noise adjustments based on gradients from OOD scores like the energy score, effectively sampling from the InD/OOD boundary. With these carefully synthesized images, we fine-tune the CLIP image encoder and negative label features derived from the text encoder to strengthen connections between near-boundary OOD samples and a set of negative labels. Finally, SynOOD achieves state-of-the-art performance on the large-scale ImageNet benchmark, with minimal increases in parameters and runtime. Our approach significantly surpasses existing methods, improving AUROC by 2.80% and reducing FPR95 by 11.13%. The code for SynOOD will be made publicly available.


#423
CAVIS: Context-Aware Video Instance Segmentation

Seunghun Lee · Jiwan Seo · Kiljoon Han · Minwoo Choi · Sunghoon Im

In this paper, we introduce the Context-Aware Video Instance Segmentation (CAVIS), a novel framework designed to enhance instance association by integrating contextual information adjacent to each object. To efficiently extract and leverage this information, we propose the Context-Aware Instance Tracker (CAIT), which merges contextual data surrounding the instances with the core instance features to improve tracking accuracy. Additionally, we design the Prototypical Cross-frame Contrastive (PCC) loss, which ensures consistency in object-level features across frames, thereby significantly enhancing instance matching accuracy. CAVIS demonstrates superior performance over state-of-the-art methods on all benchmark datasets in video instance segmentation (VIS) and video panoptic segmentation (VPS). Notably, our method excels on the OVIS dataset, which is known for its particularly challenging videos.


#424
Unlearning the Noisy Correspondence Makes CLIP More Robust

Haochen Han · Alex Jinpeng Wang · Peijun Ye · Fangming Liu

The data appetite for Vision-Language Models (VLMs) has continuously scaled up from the early millions to billions today, which faces an untenable trade-off with data quality and inevitably introduces Noisy Correspondence (NC) samples. Undoubtedly, such semantically unrelated data significantly impairs the performance of VLMs. Previous efforts mainly address this challenge by estimating refined alignment for more precise guidance. However, such resource-intensive pipelines that train VLMs from scratch struggle to meet realistic data demands. In this paper, we present a brand new perspective that seeks to directly eliminate the harmful effects of NC in pre-trained VLMs. Specifically, we propose NCU, a Noisy Correspondence Unlearning fine-tuning framework that efficiently enhances VLMs' robustness by forgetting learned noisy knowledge. The key to NCU is learning the hardest negative information, which can provide explicit unlearning direction for both false positives and false negatives. This twin-goal unlearning process can be formalized as one unified optimal transport objective for fast fine-tuning. We validate our approach with the prevailing CLIP model over various downstream tasks. Remarkably, NCU surpasses the robust pre-trained method on zero-shot transfer with lower computational overhead. The code will be released upon acceptance.


#425
FEVER-OOD: Free Energy Vulnerability Elimination for Robust Out-of-Distribution Detection

Brian Isaac-Medina · Mauricio Che · Yona Falinie A. Gaus · Samet Akcay · Toby Breckon

Modern machine learning models that excel on computer vision tasks such as classification and object detection are often overconfident in their predictions for Out-of-Distribution (OOD) examples, resulting in unpredictable behaviour for open-set environments. Recent works have demonstrated that the free energy score is an effective measure of uncertainty for OOD detection given its close relationship to the data distribution. However, despite free energy-based methods representing a significant empirical advance in OOD detection, our theoretical analysis reveals previously unexplored and inherent vulnerabilities within the free energy score formulation such that in-distribution and OOD instances can have distinct feature representations yet identical free energy scores. This phenomenon occurs when the vector direction representing the feature space difference between the in-distribution and OOD sample lies within the null space of the last layer of a neural-based classifier. To mitigate these issues, we explore lower-dimensional feature spaces to reduce the null space footprint and introduce novel regularisation to maximize the least singular value of the final linear layer, hence enhancing inter-sample free energy separation. We refer to these techniques as Free Energy Vulnerability Elimination for Robust Out-of-Distribution Detection (FEVER-OOD). Our experiments show that FEVER-OOD techniques achieve state-of-the-art OOD detection on ImageNet-100, with an average OOD false positive rate (at 95% true positive rate) of 36.50% and an AUROC of 92.74 when used with the baseline Dream-OOD model, compared with 39.33% and an AUROC of 91.84 without FEVER-OOD.
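
The null-space vulnerability described above can be reproduced with a toy linear head. This is a minimal sketch, assuming the common definition of the free energy score as the negative log-sum-exp of the logits at temperature 1. The weights `W`, bias `b`, features `h_in` and `h_ood`, and the null-space direction `v` are made-up values for illustration, not taken from the paper.

```python
import math


def logits(W, b, h):
    """Logits of a final linear layer: W h + b."""
    return [sum(w * x for w, x in zip(row, h)) + bi for row, bi in zip(W, b)]


def free_energy(W, b, h):
    """Free energy score, assumed here to be -logsumexp(logits)."""
    z = logits(W, b, h)
    m = max(z)  # subtract the max for numerical stability
    return -(m + math.log(sum(math.exp(v - m) for v in z)))


# A 2-class head over 3-dimensional features has a non-trivial null space.
W = [[1.0, 0.0, 1.0],
     [0.0, 1.0, 1.0]]
b = [0.1, -0.2]

h_in = [0.5, 1.5, -1.0]   # stand-in for an in-distribution feature
v = [1.0, 1.0, -1.0]      # W v = 0, so v lies in the null space of W
h_ood = [x + 3.0 * d for x, d in zip(h_in, v)]  # a clearly different feature

print(free_energy(W, b, h_in), free_energy(W, b, h_ood))
```

Both features produce the same logits, so the two printed scores coincide even though the features are far apart. Shrinking the feature dimension leaves less room for such directions, which is the motivation the abstract gives for exploring lower-dimensional feature spaces.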


#426
Local Dense Logit Relations for Enhanced Knowledge Distillation

Liuchi Xu · Kang Liu · Jinshuai Liu · Lu Wang · Lisheng XU · Jun Cheng

State-of-the-art logit distillation methods exhibit versatility, simplicity, and efficiency. Despite the advances, existing studies have yet to delve thoroughly into fine-grained relationships within logit knowledge. In this paper, we propose Local Dense Relational Logit Distillation (LDRLD), a novel method that captures inter-class relationships through recursively decoupling and recombining logit information, thereby providing more detailed and clearer insights for student learning. To further optimize performance, we introduce an Adaptive Decay Weight (ADW) strategy, which can dynamically adjust the weights for critical category pairs using Inverse Rank Weighting (IRW) and Exponential Rank Decay (ERD). Specifically, IRW assigns weights inversely proportional to the rank differences between pairs, while ERD adaptively controls weight decay based on total ranking scores of category pairs. Furthermore, after the recursive decoupling, we distill the remaining non-target knowledge to ensure knowledge completeness and enhance performance. Ultimately, our method improves the student's performance by transferring fine-grained knowledge and emphasizing the most critical relationships. Extensive experiments on datasets such as CIFAR-100, ImageNet-1K, and Tiny-ImageNet demonstrate that our method compares favorably with state-of-the-art logit-based knowledge distillation methods. The code will be made publicly available.


#427
Controllable Feature Whitening for Hyperparameter-Free Bias Mitigation

Yooshin Cho · Hanbyel Cho · Janghyeon Lee · HyeongGwon Hong · Jaesung Ahn · Junmo Kim

As the use of artificial intelligence rapidly increases, the development of trustworthy artificial intelligence has become important. However, recent studies have shown that deep neural networks are susceptible to learning spurious correlations present in datasets. To improve fairness, we propose a simple yet effective framework called controllable feature whitening. We quantify the linear correlation between the target and bias features by the covariance matrix, and eliminate it through the whitening module. Our results systematically demonstrate that removing the linear correlations between features which are passed to the last linear classifier significantly improves the fairness. A particular advantage of the proposed method is that it does not require regularization terms or adversarial learning, which often leads to unstable optimization in practice. Furthermore, we show that two fairness criteria, demographic parity and equalized odds, can be effectively handled by whitening with the re-weighted covariance matrix. Consequently, our method optimizes the trade-off between the utility and fairness of algorithms by adjusting the re-weighting coefficient. Finally, we validate that our method outperforms existing approaches on four benchmark datasets: Corrupted CIFAR-10, Biased FFHQ, WaterBirds, and Celeb-A.
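
To make the covariance-based whitening step concrete, here is a minimal sketch that removes the linear correlation between one scalar "target" feature and one scalar "bias" feature using Cholesky whitening of their 2x2 covariance matrix. The choice of Cholesky whitening, the scalar features, the synthetic data, and the names `covariance` and `whiten_pair` are all assumptions for illustration. The paper's whitening module and its re-weighted covariance are not specified in the abstract.

```python
import math
import random


def mean(xs):
    return sum(xs) / len(xs)


def covariance(xs, ys):
    """Population covariance of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


def whiten_pair(target, bias):
    """Whiten two scalar features so their covariance becomes the identity."""
    a = covariance(target, target)
    b = covariance(target, bias)
    c = covariance(bias, bias)
    # Cholesky factor L of [[a, b], [b, c]]; whitening applies L^{-1}.
    l11 = math.sqrt(a)
    l21 = b / l11
    l22 = math.sqrt(c - l21 * l21)
    mt, mb = mean(target), mean(bias)
    t_white = [(t - mt) / l11 for t in target]
    b_white = [((x - mb) - l21 * tw) / l22 for x, tw in zip(bias, t_white)]
    return t_white, b_white


# Synthetic features with a strong spurious linear correlation.
rng = random.Random(0)
target = [rng.gauss(0.0, 1.0) for _ in range(500)]
bias = [0.8 * t + 0.6 * rng.gauss(0.0, 1.0) for t in target]

t_white, b_white = whiten_pair(target, bias)
print(covariance(target, bias), covariance(t_white, b_white))
```

After whitening, the cross-covariance is zero up to floating-point error, so a linear classifier on top of these features can no longer exploit a linear relationship between them.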


#428
Differentially Private Fine-Tuning of Diffusion Models

Yu-Lin Tsai · Yizhe Li · Zekai Chen · Po-Yu Chen · Francois Buet-Golfouse · Chia-Mu Yu · Xuebin Ren

The integration of Differential Privacy (DP) with diffusion models (DMs) presents a promising yet challenging frontier, particularly due to the substantial memorization capabilities of DMs that pose significant privacy risks. Differential privacy offers a rigorous framework for safeguarding individual data points during model training, with Differential Privacy Stochastic Gradient Descent (DP-SGD) being a prominent implementation. The diffusion method decomposes image generation into iterative steps, theoretically aligning well with DP's incremental noise addition. Despite the natural fit, the unique architecture of DMs necessitates tailored approaches to effectively balance the privacy-utility trade-off. Recent developments in this field have highlighted the potential for generating high-quality synthetic data by pre-training on public data (i.e., ImageNet) and fine-tuning on private data; however, there is a pronounced gap in research on optimizing the trade-offs involved in DP settings, particularly concerning parameter efficiency and model scalability. Our work addresses this by proposing a parameter-efficient fine-tuning strategy optimized for private diffusion models, which minimizes the number of trainable parameters to enhance the privacy-utility trade-off. We empirically demonstrate that our method achieves state-of-the-art performance in DP synthesis, significantly surpassing previous benchmarks on widely studied datasets (e.g., with only 0.47M trainable parameters, achieving a more than 35% improvement over the previous state-of-the-art with a small privacy budget on the CelebA-64 dataset).
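
For readers unfamiliar with DP-SGD, the following is a minimal sketch of one update step: clip each per-example gradient to a fixed norm, sum, add Gaussian noise scaled to that norm, then average. It illustrates the standard DP-SGD recipe only, not the paper's parameter-efficient fine-tuning strategy. The function name `dp_sgd_step` and all numeric values are hypothetical, and no privacy accounting is performed.

```python
import math
import random


def dp_sgd_step(params, per_example_grads, lr, clip_norm, noise_multiplier, rng):
    """One DP-SGD update on a flat parameter list."""
    clipped_sum = [0.0] * len(params)
    for g in per_example_grads:
        norm = math.sqrt(sum(x * x for x in g))
        # Scale the gradient down only if its L2 norm exceeds clip_norm.
        factor = min(1.0, clip_norm / (norm + 1e-12))
        clipped_sum = [s + factor * x for s, x in zip(clipped_sum, g)]
    # Gaussian noise with standard deviation proportional to the clipping norm.
    sigma = noise_multiplier * clip_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in clipped_sum]
    n = len(per_example_grads)
    return [p - lr * x / n for p, x in zip(params, noisy)]


rng = random.Random(0)
grads = [[3.0, 4.0], [0.3, 0.4]]  # the first has norm 5 and gets clipped to norm 1
new_params = dp_sgd_step([0.0, 0.0], grads, lr=1.0, clip_norm=1.0,
                         noise_multiplier=1.0, rng=rng)
print(new_params)
```

Because noise is added to every trainable coordinate, fewer trainable parameters mean less total injected noise for the same privacy budget, which is one common argument for parameter-efficient fine-tuning in DP settings.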

Explainable AI (XAI) methods have demonstrated significant success in recent years at identifying relevant features in input data that drive deep learning model decisions, enhancing interpretability for users. However, the potential of XAI beyond providing model transparency has remained largely unexplored in adjacent machine learning domains. In this paper, we show for the first time how XAI can be utilized in the context of federated learning. Specifically, while federated learning enables collaborative model training without raw data sharing, it suffers from performance degradation when client data distributions exhibit statistical heterogeneity. We introduce FedXDS (Federated Learning via XAI-guided Data Sharing), the first approach to utilize feature attribution techniques to identify precisely which data elements should be selectively shared between clients to mitigate heterogeneity. By employing propagation-based attribution, our method identifies task-relevant features through a single backward pass, enabling selective data sharing that aligns client contributions. To protect sensitive information, we incorporate metric differential privacy techniques that provide formal privacy guarantees while preserving utility. Experimental results demonstrate that our approach consistently achieves higher accuracy and faster convergence compared to existing methods across varying client numbers and heterogeneity settings. We provide theoretical privacy guarantees and empirically demonstrate robustness against both membership inference and feature inversion attacks.


#430
Towards Effective Foundation Model Adaptation for Extreme Cross-Domain Few-Shot Learning

Fei Zhou · Peng Wang · Lei Zhang · Wei Wei · Chen Ding · Guosheng Lin · Yanning Zhang

Large-scale pre-trained foundation models have demonstrated remarkable generalization capabilities across diverse computer vision tasks through fine-tuning. However, existing fine-tuning approaches often encounter challenges in extreme cross-domain few-shot learning scenarios, primarily due to the significant domain shift between pre-training data and target tasks, as well as the scarcity of annotated target samples. To mitigate this issue, we propose a novel absorption adaptation learning framework which regularizes the fine-tuning procedure of the foundation model in two aspects, using an expert model with the same architecture but trained from scratch on the targeted data. On one hand, we first design a masked cross-model unidirectional reconstruction scheme, which forces the foundation model to recover the intermediate feature of the expert model in a randomly masked manner. On the other hand, a decision graph association loss is developed to encourage the consistency of the token similarity matrix between these two models. By doing so, the task-relevant semantic knowledge in the expert model, at both the intermediate feature and the final decision levels, is appropriately extracted and absorbed by the foundation model during its fine-tuning, thus mitigating the performance drop caused by domain gap and limited annotation. Extensive experiments, together with further observations and analyses, support our argument.


#431
Highlight
Consensus-Driven Active Model Selection

Justin Kay · Grant Horn · Subhransu Maji · Daniel Sheldon · Sara Beery

The widespread availability of off-the-shelf machine learning models poses a challenge: which model, of the many available candidates, should be chosen for a given data analysis task? This question of model selection is traditionally answered by collecting and annotating a validation dataset, a costly and time-intensive process. We propose a method for active model selection, using predictions from candidate models to prioritize the labeling of test data points that efficiently differentiate the best candidate. Our method, CODA, performs consensus-driven active model selection by modeling relationships between classifiers, categories, and data points within a probabilistic framework. The framework uses the consensus and disagreement between models in the candidate pool to guide the label acquisition process, and Bayesian inference to update beliefs about which model is best as more information is collected. We validate our approach by curating a collection of 25 benchmark tasks capturing a range of model selection scenarios. CODA outperforms existing methods for active model selection significantly, reducing the annotation effort required to discover the best model by upwards of 50% compared to the previous state-of-the-art. We will make our code and data public.


#432
Adversarial Purification via Super-Resolution and Diffusion

Mincheol Park · Cheonjun Park · Seungseop Lim · Mijin Koo · Hyunwuk Lee · Won Woo Ro · Suhyun Kim

Deep neural networks are widely used in various computer vision tasks, but their vulnerability to adversarial perturbations remains a significant challenge for reliable decision-making. Adversarial purification, a test-time defense strategy, has shown potential in countering these threats by removing noise through diffusion models. This plug-and-play method, using off-the-shelf models, appears highly effective. However, the purified data from diffusion often deviates more from the original data than the adversarial examples, leading to missing critical information and causing misclassification. In this study, we propose that upsampling with Super-Resolution (SR), followed by downsampling, can also aid in eliminating adversarial noise, similar to the noise addition and removal process in diffusion models. While SR alone is not as effective as the diffusion process, it better restores the original features typically associated with the early layers of networks. By combining SR, which initially mitigates damage to early-layer information from adversarial attacks, with diffusion, we observe a synergistic effect, leading to enhanced performance over diffusion models alone. Our comprehensive evaluations demonstrate that this combined approach, PuriFlow, significantly improves accuracy and robustness, working synergistically with state-of-the-art methods.


#433
FedWSQ: Efficient Federated Learning with Weight Standardization and Distribution-Aware Non-Uniform Quantization

Seung-Wook Kim · Seongyeol Kim · Jiah Kim · Seowon Ji · Se-Ho Lee

Federated learning (FL) often suffers from performance degradation due to key challenges such as data heterogeneity and communication constraints. To address these limitations, we present a novel FL framework called FedWSQ, which integrates weight standardization (WS) and the proposed distribution-aware non-uniform quantization (DANUQ). WS enhances FL performance by filtering out biased components in local updates during training, thereby improving the robustness of the model against data heterogeneity and unstable client participation. In addition, DANUQ minimizes quantization errors by leveraging the statistical properties of local model updates. As a result, FedWSQ significantly reduces communication overhead while maintaining superior model accuracy. Extensive experiments on FL benchmark datasets demonstrate that FedWSQ consistently outperforms existing FL methods across various challenging FL settings, including extreme data heterogeneity and ultra-low-bit communication scenarios.
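
As a small illustration of weight standardization in general, the sketch below normalizes each output row of a weight matrix to zero mean and unit variance. How FedWSQ integrates WS into local training, and how DANUQ quantizes the resulting updates, is not described in the abstract, so the function name `standardize_rows`, the row-wise grouping, and the `eps` value are assumptions.

```python
import math


def standardize_rows(weights, eps=1e-5):
    """Standardize each row (one output unit) to zero mean and unit variance."""
    out = []
    for row in weights:
        mean = sum(row) / len(row)
        var = sum((w - mean) ** 2 for w in row) / len(row)
        out.append([(w - mean) / math.sqrt(var + eps) for w in row])
    return out


# Two rows that differ only by a constant offset map to the same result.
plain = standardize_rows([[1.0, 2.0, 3.0, 4.0]])[0]
shifted = standardize_rows([[11.0, 12.0, 13.0, 14.0]])[0]
print(plain)
print(shifted)
```

The shared offset between the two rows disappears after standardization, which gives some intuition for how a per-row shift in the weights can be filtered out.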


#434
CMAD: Correlation-Aware and Modalities-Aware Distillation for Multimodal Sentiment Analysis with Missing Modalities

Yan Zhuang · Minhao Liu · Wei Bai · Yanru Zhang · Xiaoyue Zhang · Jiawen Deng · Fuji Ren

Multimodal Sentiment Analysis (MSA) enhances emotion recognition by integrating information from multiple modalities. However, multimodal learning with missing modalities suffers from representation inconsistency and optimization instability, leading to suboptimal performance. In this paper, we introduce Correlation-Aware and Modalities-Aware Distillation (CMAD), a unified framework designed for MSA under varying missing-modality conditions. Specifically, CMAD comprises two key components: (1) Correlation-Aware Feature Distillation (CAFD), which enforces multi-level representation alignment by preserving both feature similarities and high-order correlation structures between teacher and student models, and (2) Modality-Aware Regularization (MAR), which employs an adaptive weighting strategy guided by modality difficulty, enabling a curriculum learning paradigm to stabilize the training process. Extensive evaluations on five datasets show that CMAD consistently outperforms existing methods, achieving average performance improvements of 1.0\% on MOSEI, 4.4\% on IEMOCAP, 1.9\% on MUStARD, 0.5\% on UR-FUNNY and 1.9\% on CHERMA.


#435
SimpleVQA: Multimodal Factuality Evaluation for Multimodal Large Language Models

Xianfu Cheng · Wei Zhang · Shiwei Zhang · Jian Yang · Xiangyuan Guan · Xianjie Wu · Xiang Li · Ge Zhang · Jiaheng Liu · Yuying Mai · Yutao Zeng · Zhoufutu Wen · JinKe JinKe · Baorui Wang · Weixiao Zhou · Lu Yunhong · Hangyuan Ji · Tongliang Li · Wenhao Huang · Zhoujun Li

The increasing application of multi-modal large language models (MLLMs) across various sectors has spotlighted the importance of their output reliability and accuracy, particularly their ability to produce content grounded in factual information (e.g., common and domain-specific knowledge). In this work, we introduce SimpleVQA, the first comprehensive multi-modal benchmark to evaluate the factuality of MLLMs when answering short natural-language questions. SimpleVQA is characterized by 7 key features: it is bilingual, covers multiple tasks and multiple scenarios, ensures high-quality and challenging queries, maintains static and timeless reference answers, and is straightforward to evaluate. Our approach involves categorizing visual question-answering items into 9 different tasks around objective events or common knowledge and situating these within 9 scenario domains. Rigorous quality control processes are implemented to guarantee high-quality, concise, and clear answers, facilitating evaluation with minimal variance via an LLM-as-a-judge scoring system. Using SimpleVQA, we perform a comprehensive assessment of 18 leading MLLMs and 8 text-only LLMs, delving into their image comprehension and text generation abilities by identifying and analyzing error cases.


#436
Highlight
2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining

Wenqi Zhang · Hang Zhang · Xin Li · Jiashuo Sun · Yongliang Shen · Weiming Lu · Deli Zhao · Yueting Zhuang · Lidong Bing

Compared to image-text pair data, interleaved corpora enable Vision-Language Models (VLMs) to understand the world more naturally, as humans do. However, existing datasets of this kind are crawled from webpages and face challenges such as low knowledge density, loose image-text relations, and poor logical coherence between images. On the other hand, the internet hosts vast numbers of instructional videos (e.g., online geometry courses) that are widely used by humans to learn foundational subjects, yet these valuable resources remain underexplored in VLM training. In this paper, we introduce a high-quality \textbf{multimodal textbook} corpus with richer foundational knowledge for VLM pretraining. It collects over 2.5 years of instructional videos, totaling 22,000 class hours. We first use an LLM-proposed taxonomy to systematically gather instructional videos. Then we progressively extract and refine visual (keyframes), audio (ASR), and textual knowledge (OCR) from the videos, and organize them as an image-text interleaved corpus based on temporal order. Compared to its counterparts, our video-centric textbook offers more coherent context, richer knowledge, and better image-text alignment. Experiments demonstrate its superb pretraining performance, particularly in knowledge- and reasoning-intensive tasks like ScienceQA and MathVista. Moreover, VLMs pre-trained on our textbook exhibit outstanding interleaved context awareness, leveraging visual and textual cues in their few-shot context for task solving.


#437
Revelio: Interpreting and leveraging semantic information in diffusion models

Dahye Kim · Xavier Thomas · Deepti Ghadiyaram

We study $\textit{how}$ rich visual semantic information is represented within various layers and denoising timesteps of different diffusion architectures. We uncover monosemantic interpretable features by leveraging k-sparse autoencoders (k-SAE). We substantiate our mechanistic interpretations via transfer learning using light-weight classifiers on off-the-shelf diffusion models' features. On $4$ datasets, we demonstrate the effectiveness of diffusion features for representation learning. We provide an in-depth analysis of how different diffusion architectures, pre-training datasets, and language model conditioning impact visual representation granularity, inductive biases, and transfer learning capabilities. Our work is a critical step towards deepening interpretability of black-box diffusion models. Code and visualizations are available at: \url{https://github.com/revelio-diffusion/revelio}
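To make the k-sparse autoencoder (k-SAE) idea concrete, here is a minimal sketch of the generic technique: encode, keep only the k largest hidden activations, zero the rest, and decode. It is not the Revelio implementation; the helper names, layer sizes, and random weights are all assumptions chosen for illustration.

```python
import random


def linear(x, weight, bias):
    """Dense layer: one output per row of `weight`."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def top_k_sparsify(hidden, k):
    """Keep the k largest activations and zero out all others."""
    order = sorted(range(len(hidden)), key=lambda i: hidden[i], reverse=True)
    keep = set(order[:k])
    return [h if i in keep else 0.0 for i, h in enumerate(hidden)]


def k_sparse_autoencoder(x, enc_w, enc_b, dec_w, dec_b, k):
    """Return the sparse code and the reconstruction of `x`."""
    code = top_k_sparsify(linear(x, enc_w, enc_b), k)
    return code, linear(code, dec_w, dec_b)


# Toy setup (hypothetical sizes): 4-d input feature, 8 hidden units, k = 2.
rng = random.Random(0)
enc_w = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(8)]
dec_w = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(4)]
x = [rng.gauss(0, 1) for _ in range(4)]
code, recon = k_sparse_autoencoder(x, enc_w, [0.0] * 8, dec_w, [0.0] * 4, k=2)
```

Because only k units are active per input, each surviving unit can be inspected on its own, which is what makes this family of autoencoders convenient for interpretability studies.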


#438
CLIP-GS: Unifying Vision-Language Representation with 3D Gaussian Splatting

Siyu Jiao · Haoye Dong · Yuyang Yin · ZEQUN JIE · Yinlong Qian · Yao Zhao · Humphrey Shi · Yunchao Wei

Recent works in 3D representation learning and multimodal pre-training have made remarkable progress. However, multimodal 3D models are typically only capable of handling point clouds. Compared to the emerging 3D representation technique, 3D Gaussian Splatting (3DGS), the spatially sparse point cloud cannot depict the texture information of 3D objects, resulting in inferior reconstruction capabilities. This limitation constrains the potential of point cloud-based 3D multimodal representation learning. In this paper, we present CLIP-GS, a novel multimodal representation learning framework grounded in 3DGS. We introduce the GS Tokenizer to generate serialized Gaussian tokens, which are then processed through a series of transformer layers pre-initialized with weights from point cloud models, resulting in the 3DGS embeddings. CLIP-GS leverages contrastive loss between 3DGS and the visual-text embeddings of CLIP, and we introduce an image voting loss to guide the directionality and convergence of gradient optimization. Furthermore, we develop an efficient way to generate triplets of 3DGS, images, and text, facilitating CLIP-GS in learning unified multimodal representations. Leveraging the well-aligned multimodal representations, CLIP-GS demonstrates versatility and outperforms point cloud-based models on various 3D tasks, including multimodal retrieval, zero-shot, and few-shot classification.


#439
ProJudge: A Multi-Modal Multi-Discipline Benchmark and Instruction-Tuning Dataset for MLLM-based Process Judges

Jiaxin Ai · Pengfei Zhou · xu Pan · Ming Li · Fanrui Zhang · Zizhen Li · Jianwen Sun · Yukang Feng · Baojin Huang · Zhongyuan Wang · Kaipeng Zhang

As multi-modal large language models (MLLMs) frequently exhibit errors when solving scientific problems, evaluating the validity of their reasoning processes is critical for ensuring reliability and uncovering fine-grained model weaknesses. Since human evaluation is laborious and costly, prompting MLLMs as automated process judges has become a common practice. However, the reliability of these model-based judges remains uncertain. To address this, we introduce ProJudgeBench, the first comprehensive benchmark specifically designed for evaluating abilities of MLLM-based process judges. ProJudgeBench comprises 2,400 test cases and 50,118 step-level labels, spanning four scientific disciplines with diverse difficulty levels and multi-modal content. In ProJudgeBench, each step is meticulously annotated by human experts for correctness, error type, and explanation, enabling a systematic evaluation of judges' capabilities to detect, classify and diagnose errors. Evaluation on ProJudgeBench reveals a significant performance gap between open-source and proprietary models. To bridge this gap, we further propose ProJudge-173k, a large-scale instruction-tuning dataset, and a Dynamic Dual-Phase fine-tuning strategy that encourages models to explicitly reason through problem-solving before assessing solutions. Both contributions significantly enhance the process evaluation capabilities of open-source models. All the resources will be released to foster future research of reliable multi-modal process evaluation.

Adversarial Training (AT) is one of the most effective methods to train robust Deep Neural Networks (DNNs). However, AT creates an inherent trade-off between clean accuracy and adversarial robustness, which is commonly attributed to the more complicated decision boundary caused by the insufficient learning of hard adversarial samples. In this work, we reveal a counterintuitive fact for the first time: from the perspective of perception consistency, hard adversarial samples that can still attack the robust model after AT are already learned better than those successfully defended. Thus, different from previous views, we argue that it is rather the over-sufficient learning of hard adversarial samples that degrades the decision boundary and contributes to the trade-off problem. Specifically, the excessive pursuit of perception consistency would force the model to view the perturbations as noise and ignore the information within them, which should have been utilized to induce a smoother perception transition towards the decision boundary to support its establishment to an appropriate location. In response, we define a new AT objective named Robust Perception, encouraging the model perception to change smoothly with input perturbations, based on which we propose a novel Robust Perception Adversarial Training (RPAT) method, effectively mitigating the current accuracy-robustness trade-off. Experiments on CIFAR-10, CIFAR-100, and Tiny-ImageNet with ResNet-18, PreActResNet-18, and WideResNet-34-10 demonstrate the effectiveness of our method beyond four common baselines and 12 state-of-the-art (SOTA) works.


#441
VehicleMAE: View-asymmetry Mutual Learning for Vehicle Re-identification Pre-training via Masked AutoEncoders

Qi Wang · Zeyu Zhang · Dong Wang · Di Gai · Xin Xiong · Jiyang Xu · Ruihua Zhou

Large-scale pre-training technology has achieved remarkable performance in diversified object re-identification (Re-ID) downstream tasks. Nevertheless, to the best of our knowledge, a pre-training model specifically for vehicle Re-ID, which focuses on tackling the challenge of multi-view variations, has not been fully investigated. In this paper, we first leverage a diffusion model to build a large-scale vehicle Re-ID benchmark dataset, dubbed “DiffVERI”, containing over 1700K images with abundant multi-view annotations. Based on this dataset, we further present VehicleMAE, a novel masked image modeling pre-training paradigm that learns view-invariant representations by performing mutual-distillation in a self-supervised manner. To be specific, the pipeline of VehicleMAE comprises two core modules, i.e., view-asymmetry masked image modeling (VMIM) and past-to-present mutual-distillation (PPMD). Technically, VMIM consists of two homogeneous masked autoencoders (MAE) that simultaneously reconstruct the RGB pixels and multi-view semantic information of the specific vehicle body region via paired asymmetric mask sampling strategies. To progressively distill the knowledge of the model itself, PPMD treats the two MAEs in the current epoch and the previous one as the student models and the teacher models, respectively, and leverages the knowledge learned by the current student and the historical teacher for mutual feature-level distillation. Extensive experimental results have verified that the proposed pre-training paradigm on DiffVERI gains compelling downstream task performance for vehicle Re-ID.


#442
SplatTalk: 3D VQA with Gaussian Splatting

Anh Thai · Kyle Genova · Songyou Peng · Leonidas Guibas · Thomas Funkhouser

Language-guided 3D scene understanding is important for advancing applications in robotics, AR/VR, and human-computer interaction, enabling models to comprehend and interact with 3D environments through natural language. While 2D vision-language models (VLMs) have achieved remarkable success in 2D VQA tasks, progress in the 3D domain has been significantly slower due to the complexity of 3D data and the high cost of manual annotations. In this work, we introduce SplatTalk, a novel method that uses a generalizable 3D Gaussian Splatting (3DGS) framework to produce 3D tokens suitable for direct input into a pretrained LLM, enabling effective zero-shot 3D visual question answering (3D VQA) for scenes with only posed images. During experiments on multiple benchmarks, our approach outperforms both 3D models trained specifically for the task and previous 2D-LMM-based models utilizing only images (our setting), while achieving competitive performance with state-of-the-art 3D LMMs that additionally utilize 3D inputs.


#443
CaO2: Rectifying Inconsistencies in Diffusion-Based Dataset Distillation

Haoxuan Wang · Zhenghao Zhao · Junyi Wu · Yuzhang Shang · Gaowen Liu · Yan Yan

The recent introduction of diffusion models in dataset distillation has shown promising potential in creating compact surrogate datasets for large, high-resolution target datasets, offering improved efficiency and performance over traditional bi-level/uni-level optimization methods. However, current diffusion-based dataset distillation approaches overlook the evaluation process and exhibit two critical inconsistencies in the distillation process: (1) Objective Inconsistency, where the distillation process diverges from the evaluation objective, and (2) Condition Inconsistency, leading to mismatches between generated images and their corresponding conditions. To resolve these issues, we introduce \textbf{C}ondition-\textbf{a}ware \textbf{O}ptimization with \textbf{O}bjective-guided Sampling (\textbf{CaO$_2$}), a two-stage diffusion-based framework that aligns the distillation process with the evaluation objective. The first stage employs a probability-informed sample selection pipeline, while the second stage refines the corresponding latent representations to improve conditional likelihood. CaO$_2$ achieves state-of-the-art performance on ImageNet and its subsets, surpassing the best-performing baselines by an average of 2.3\% accuracy.


#444
Taming the Untamed: Graph-Based Knowledge Retrieval and Reasoning for MLLMs to Conquer the Unknown

Bowen Wang · Zhouqiang Jiang · Yasuaki Susumu · Shotaro Miwa · Tianwei Chen · Yuta Nakashima

The real value of knowledge lies not just in its accumulation, but in its potential to be harnessed effectively to conquer the unknown. Although recent multimodal large language models (MLLMs) exhibit impressive multimodal capabilities, they often fail in rarely encountered domain-specific tasks due to limited relevant knowledge. To explore this, we adopt visual game cognition as a testbed and select "Monster Hunter: World" as the target to construct a multimodal knowledge graph (MH-MMKG), which incorporates multi-modalities and intricate entity relations. We also design a series of challenging queries based on MH-MMKG to evaluate the models’ ability for complex knowledge retrieval and reasoning. Furthermore, we propose a multi-agent retriever that enables a model to autonomously search relevant knowledge without additional training. Experimental results show that our approach significantly enhances the performance of MLLMs, providing a new perspective on multimodal knowledge-augmented reasoning and laying a solid foundation for future research.


#445
MambaML: Exploring State Space Models for Multi-Label Image Classification

Xuelin Zhu · Jian liu · Jiuxin Cao · Bing WANG

Mamba, a selective state-space model, has recently been widely applied to various visual tasks due to its powerful capability to capture long-range dependencies. Although promising performance has been achieved on image classification, the effectiveness of Mamba on multi-label image classification has not been explored yet. In this work, we develop a novel MambaML framework for multi-label image classification, which incorporates a Mamba-based decoder to aggregate visual information from image features into label embeddings, yielding label-specific visual representations for classification. Building upon this, MambaML further employs Mamba to model both the image feature sequence and the label embedding sequence. In this way, MambaML is capable of exploring the spatial relationships of image features, semantic dependencies between label embeddings, as well as their cross-correlations, thereby resulting in robust label-specific visual representations and training binary classifiers for high-performance multi-label image classification. Extensive experimental results demonstrate that our MambaML achieves state-of-the-art performance on multiple benchmarks in multi-label image classification tasks.


#446
MUNBa: Machine Unlearning via Nash Bargaining

Jing Wu · Mehrtash Harandi

Machine Unlearning (MU) aims to selectively erase harmful behaviors from models while retaining the overall utility of the model. As a multi-task learning problem, MU involves balancing objectives related to forgetting specific concepts/data and preserving general performance. A naive integration of these forgetting and preserving objectives can lead to gradient conflicts and dominance, impeding MU algorithms from reaching optimal solutions. To address the gradient conflict and dominance issue, we reformulate MU as a two-player cooperative game, where the two players, namely, the forgetting player and the preservation player, contribute via their gradient proposals to maximize their overall gain and balance their contributions. To this end, inspired by the Nash bargaining theory, we derive a closed-form solution to guide the model toward the Pareto stationary point. Our formulation of MU guarantees an equilibrium solution, where any deviation from the final state would lead to a reduction in the overall objectives for both players, ensuring optimality in each objective. We evaluate our algorithm's effectiveness on a diverse set of tasks across image classification and image generation. Extensive experiments with ResNet, vision-language model CLIP, and text-to-image diffusion models demonstrate that our method outperforms state-of-the-art MU algorithms, achieving a better trade-off between forgetting and preserving. Our results also highlight improvements in forgetting precision, preservation of generalization, and robustness against adversarial attacks.


#447
Highlight
SMSTracker: Tri-path Score Mask Sigma Fusion for Multi-Modal Tracking

Sixian Chan · Zedong Li · Xiaoqin Zhang · Wenhao Li · Shijian Lu · Chunhua Shen

Multi-modal object tracking has emerged as a significant research focus in computer vision due to its robustness in complex environments, such as exposure variations, blur, and occlusions. Existing studies integrate supplementary modal information into pre-trained RGB trackers through visual prompt mechanisms, but this approach has a critical limitation: it inherently prioritizes RGB information as the dominant modality, thereby underutilizing the complementary information of the other modalities. To address this fundamental limitation, we present SMSTracker, an innovative tri-path score mask sigma fusion framework for multi-modal tracking, including three key modules. Firstly, we design a tri-path Score Mask Fusion (SMF) module to evaluate and quantify the reliability of each modality, allowing optimal exploitation of complementary features between modalities. Secondly, we introduce a pioneering Sigma Interaction (SGI) module to facilitate a sophisticated fusion of modal features across tri-branches, representing the first application of Sigma point-based feature interaction in object tracking tasks. Furthermore, we advance a Drop Key Fine-tuning (DKF) strategy to address the inherent challenge of unequal data contribution in multi-modal learning scenarios, thereby enhancing the model's capacity for comprehensive multi-modal information processing. Finally, extensive experiments on RGB+Thermal, RGB+Depth, and RGB+Event datasets demonstrate the significant performance improvements achieved by SMSTracker over existing state-of-the-art methods. The source code will be available after review.


#448
Auxiliary Prompt Tuning of Vision-Language Models for Few-Shot Out-of-Distribution Detection

Wenjun Miao · Guansong Pang · Zihan Wang · Jin Zheng · Xiao Bai

Recent advancements in CLIP-based out-of-distribution (OOD) detection have shown promising results via regularization on prompt tuning, leveraging background features extracted from a few in-distribution (ID) samples as proxies for OOD features. However, these methods suffer from an inherent limitation: a lack of diversity in the extracted OOD features from the few-shot ID data. To address this issue, we propose to leverage external datasets as auxiliary outlier data (i.e., pseudo OOD samples) to extract rich, diverse OOD features, with the features from not only background regions but also foreground object regions, thereby supporting more discriminative prompt tuning for OOD detection. We further introduce Auxiliary Prompt Tuning (APT), a novel framework that can be used as a plug-in module to enable existing prompt tuning-based methods to utilize the auxiliary data for more accurate OOD detection. There are two key challenges in utilizing such auxiliary data in prompt tuning: I) foreground-background decomposition of unlabeled auxiliary data with diverse outlying objects and II) optimization of foreground OOD features. APT tackles challenge I with an adaptive logit-based Kullback–Leibler divergence method and challenge II by constructing foreground-background pairs for each foreground region to enable effective exploitation of foreground OOD features. Extensive experiments on standard and hard OOD benchmarks show that APT achieves state-of-the-art performance, obtaining significant improvements in challenging scenarios, e.g., hard OOD and 1-shot detection.


#449
Enhancing Transformers Through Conditioned Embedded Tokens

Hemanth Saratchandran · Simon Lucey

Transformers have transformed modern machine learning, driving breakthroughs in computer vision, natural language processing, and robotics. At the core of their success lies the attention mechanism, which enables the modeling of global dependencies among input tokens. However, we reveal that the attention block in transformers suffers from inherent ill-conditioning, which hampers gradient-based optimization and leads to inefficient training. To address this, we develop a theoretical framework that establishes a direct relationship between the conditioning of the attention block and that of the embedded tokenized data. Building on this insight, we introduce conditioned embedded tokens, a method that systematically modifies the embedded tokens to improve the conditioning of the attention mechanism. Our analysis demonstrates that this approach significantly mitigates ill-conditioning, leading to more stable and efficient training. We validate our methodology across various transformer architectures, achieving consistent improvements in image classification, object detection, instance segmentation, and natural language processing, highlighting its broad applicability and effectiveness.


#450
Improved Noise Schedule for Diffusion Training

Tiankai Hang · Shuyang Gu · Jianmin Bao · Fangyun Wei · Dong Chen · Xin Geng · Baining Guo

Diffusion models have emerged as the de facto choice for generating high-quality visual signals across various domains. However, training a single model to predict noise across various levels poses significant challenges, necessitating numerous iterations and incurring significant computational costs. Various approaches, such as loss weighting strategy design and architectural refinements, have been introduced to expedite convergence and improve model performance. In this study, we propose a novel approach to design the noise schedule for enhancing the training of diffusion models. Our key insight is that the importance sampling of the logarithm of the Signal-to-Noise ratio ($\log \text{SNR}$), theoretically equivalent to a modified noise schedule, is particularly beneficial for training efficiency when increasing the sample frequency around $\log \text{SNR}=0$. This strategic sampling allows the model to focus on the critical transition point between signal dominance and noise dominance, potentially leading to more robust and accurate predictions. We empirically demonstrate the superiority of our noise schedule over the standard cosine schedule. Furthermore, we highlight the advantages of our noise schedule design on the ImageNet benchmark, showing that the designed schedule consistently benefits different prediction targets. Our findings contribute to the ongoing efforts to optimize diffusion models, potentially paving the way for more efficient and effective training paradigms in the field of generative AI.
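The idea of sampling $\log \text{SNR}$ more densely around 0 can be sketched as follows. The abstract does not say which sampling distribution is used, so the Laplace distribution centered at 0 is an assumption here, as are the function names and the variance-preserving convention $\alpha^2 + \sigma^2 = 1$. Under that convention, $\text{SNR} = \alpha^2 / \sigma^2$ gives $\alpha^2 = \text{sigmoid}(\log \text{SNR})$.

```python
import math
import random


def sample_log_snr(rng, loc=0.0, scale=1.0):
    """Draw log-SNR from a Laplace(loc, scale) via the inverse CDF (assumed choice)."""
    u = rng.random() - 0.5                   # uniform on [-0.5, 0.5)
    tail = max(1.0 - 2.0 * abs(u), 1e-12)    # floor avoids log(0) at the boundary
    return loc - scale * math.copysign(1.0, u) * math.log(tail)


def alpha_sigma(log_snr):
    """Map log-SNR to (alpha, sigma) with alpha^2 + sigma^2 = 1."""
    alpha_sq = 1.0 / (1.0 + math.exp(-log_snr))  # sigmoid(log_snr)
    sigma_sq = 1.0 / (1.0 + math.exp(log_snr))   # sigmoid(-log_snr)
    return math.sqrt(alpha_sq), math.sqrt(sigma_sq)


def noisy_sample(x0, rng):
    """Corrupt a clean vector x0 at a randomly drawn noise level."""
    lam = sample_log_snr(rng)
    alpha, sigma = alpha_sigma(lam)
    return [alpha * x + sigma * rng.gauss(0.0, 1.0) for x in x0], lam


rng = random.Random(0)
xt, lam = noisy_sample([0.2, -0.7, 1.0], rng)
```

Narrowing `scale` concentrates more training samples near the point where signal and noise have equal power, and widening it moves the sampler back toward broader coverage of noise levels.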


#451
How Do Multimodal Large Language Models Handle Complex Multimodal Reasoning? Placing Them in An Extensible Escape Game

Ziyue Wang · Yurui Dong · Fuwen Luo · Minyuan Ruan · Zhili Cheng · Chi Chen · Peng Li · Yang Liu

The rapid advancement of Multimodal Large Language Models (MLLMs) has spurred interest in complex multimodal reasoning tasks in real-world and virtual environments, which require coordinating multiple abilities, including visual perception, visual reasoning, spatial awareness, and target deduction. However, existing evaluations primarily assess the final task completion, often degrading assessments to isolated abilities such as visual grounding and visual question answering. Less attention is given to comprehensively and quantitatively analyzing the reasoning process in multimodal environments, which is crucial for understanding model behaviors and underlying reasoning mechanisms beyond mere task success. To address this, we introduce MM-Escape, an extensible benchmark for investigating multimodal reasoning, inspired by real-world escape games. MM-Escape emphasizes intermediate model behaviors alongside final task completion. To achieve this, we develop EscapeCraft, a customizable and open environment that enables models to engage in free-form exploration for assessing multimodal reasoning. Extensive experiments show that MLLMs, regardless of scale, can successfully complete the simplest room escape tasks, with some exhibiting human-like exploration strategies. Yet, performance dramatically drops as task difficulty increases. Moreover, we observe that performance bottlenecks vary across models, revealing distinct failure modes and limitations in their multimodal reasoning abilities, such as repetitive trajectories without adaptive exploration, getting stuck in corners due to poor visual spatial awareness, and ineffective use of acquired props, such as the key. We hope our work sheds light on new challenges in multimodal reasoning, and uncovers potential improvements in MLLM capabilities.

Representation learning lies at the core of deep reinforcement learning. While CNNs have been default models for encoding image observations so far, modifying the encoder architecture presents challenges, particularly due to the necessity of identifying a new set of hyper-parameters that align with each modification. To address this problem, we propose a powerful representation learning technique for visual reinforcement learning using Fourier Neural Operators (FNO). Our findings demonstrate that the proposed FNO encoder effectively learns representations from images that encapsulate the underlying partial differential equations (PDEs) governing the dynamics of the environment in an online model-free RL framework. The FNO encoder with the Efficient Rainbow algorithm achieves a median Human Normalized Score (HNS) of $26.1\%$ on the Atari100k benchmark across 26 environments, delivering a $10$-point enhancement over the CNN-based Efficient Rainbow algorithm. In the context of offline reinforcement learning on Atari games, we achieve a remarkable $2.89\times$ improvement compared to state-of-the-art transformer-based models. Additionally, upon using our FNO encoder with the A2C algorithm on the ViZDoom environment, we achieve $\sim38\%$ improvement in rewards in the first $200$ episodes. Further, we match the vanilla A2C performance after just $\sim100$ episodes. We also achieve $81\%$ mean normalized score in the CARLA Autonomous Driving task (from just image sensor inputs), which is a $20$-point improvement in the absolute scale over the CNN-based PPO algorithm while requiring only $\sim55\%$ of the samples to match the CNN-PPO performance. We currently hold the state-of-the-art scores (in the model-free RL setting) on both the CARLA Autonomous Driving from image observations benchmark and the Atari 100k benchmark.
Our proposed FNO encoder is compatible with all model-free reinforcement learning algorithms, enhances both rewards and sample efficiency by implicitly learning the underlying dynamics of the environment, and eliminates the need for additional hyper-parameter tuning.


#453
LIFT: Latent Implicit Functions for Task- and Data-Agnostic Encoding

Amirhossein Kazerouni · Soroush Mehraban · Michael Brudno · Babak Taati

Implicit Neural Representations (INRs) are proving to be a powerful paradigm in unifying task modeling across diverse data domains, offering key advantages such as memory efficiency and resolution independence. Conventional deep learning models are typically modality-dependent, often requiring custom architectures and objectives for different types of signals. However, existing INR frameworks frequently rely on global latent vectors or exhibit computational inefficiencies that limit their broader applicability. We introduce LIFT, a novel, high-performance framework that addresses these challenges by capturing multiscale information through meta-learning. LIFT leverages multiple parallel localized implicit functions alongside a hierarchical latent generator to produce unified latent representations that span local, intermediate, and global features. This architecture facilitates smooth transitions across local regions, enhancing expressivity while maintaining inference efficiency. Additionally, we introduce ReLIFT, an enhanced variant of LIFT that incorporates residual connections and expressive frequency encodings. With this straightforward approach, ReLIFT effectively addresses the convergence-capacity gap found in comparable methods, providing an efficient yet powerful solution to improve capacity and speed up convergence. Empirical results show that LIFT achieves state-of-the-art (SOTA) performance in generative modeling and classification tasks, with notable reductions in computational costs. Moreover, in single-task settings, the streamlined ReLIFT architecture proves effective in signal representations and inverse problem tasks.


#454
Improving Noise Efficiency in Privacy-preserving Dataset Distillation

Runkai Zheng · Vishnu Dasu · Yinong Wang · Haohan Wang · Fernando De la Torre

Modern machine learning models heavily rely on large datasets that often include sensitive and private information, raising serious privacy concerns. Differentially private (DP) data generation offers a solution by creating synthetic datasets that limit the leakage of private information within a predefined privacy budget; however, it requires a substantial amount of data to achieve performance comparable to models trained on the original data. To mitigate the significant expense incurred with synthetic data generation, Dataset Distillation (DD) stands out for its remarkable training and storage efficiency. This efficiency is particularly advantageous when integrated with DP mechanisms, curating compact yet informative synthetic datasets without compromising privacy. However, current state-of-the-art private DD methods suffer from a synchronized sampling-optimization process and a dependency on noisy training signals from randomly initialized networks. This results in the inefficient utilization of private information due to the addition of excessive noise. To address these issues, we introduce a novel framework that decouples sampling from optimization and utilizes auxiliary datasets to identify informative subspaces of the signal. Our approach decouples sampling from optimization for better convergence and improves signal quality by mitigating the impact of DP noise through matching in an informative subspace, all without incurring additional privacy costs. On CIFAR-10, our method achieves a 10.0% improvement with 50 images per class and an 8.3% increase with just one-fifth the distilled set size of previous state-of-the-art methods, demonstrating significant potential to advance privacy-preserving dataset distillation.


#455
Boosting MLLM Reasoning with Text-Debiased Hint-GRPO

Qihan Huang · Weilong Dai · Jinlong Liu · Wanggui He · Hao Jiang · Mingli Song · Jingyuan CHEN · Chang Yao · Jie Song

MLLM reasoning has drawn widespread research attention for its excellent problem-solving capability. Current reasoning methods fall into two types: PRM, which supervises the intermediate reasoning steps, and ORM, which supervises the final results. Recently, DeepSeek-R1 has challenged the traditional view that PRM outperforms ORM by demonstrating strong generalization performance using an ORM method (i.e., GRPO). However, current GRPO algorithms for MLLMs still struggle to handle challenging and complex multimodal reasoning tasks (e.g., mathematical reasoning). In this work, we reveal two problems that impede the performance of GRPO on MLLMs: low data utilization and text-bias. Low data utilization means that GRPO cannot acquire positive rewards to update the MLLM on difficult samples, and text-bias is a phenomenon in which the MLLM bypasses the image condition and relies solely on the text condition for generation after GRPO training. To tackle these problems, this work proposes Hint-GRPO, which improves data utilization by adaptively providing hints for samples of varying difficulty, and text-bias calibration, which mitigates text-bias by calibrating the token prediction logits with the image condition at test time. Experimental results on three base MLLMs across eleven datasets demonstrate that our proposed methods advance the reasoning capability of the original MLLMs by a large margin, exhibiting superior performance to existing MLLM reasoning methods. Our code will be made available soon.
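
The low-data-utilization problem can be illustrated with the group-relative advantage commonly used in GRPO. The following is a minimal sketch, not the authors' implementation: the function name, the `eps` constant, and the use of the population standard deviation are assumptions made for illustration. When every sampled response to a hard question receives zero reward, all advantages are zero and the sample contributes no learning signal.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its sampling group.

    Hypothetical helper for illustration; eps avoids division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# A group with mixed outcomes yields positive and negative advantages.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A difficult sample where every response fails yields all-zero advantages,
# so the policy receives no update signal from it.
hard = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

print(mixed)
print(hard)
```

Under these assumptions, supplying a hint that lets at least one response in the group succeed would turn an all-zero group into a mixed one.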


#456
Towards Real Unsupervised Anomaly Detection Via Confident Meta-Learning

Muhammad Aqeel · Shakiba Sharifi · Marco Cristani · Francesco Setti

So-called unsupervised anomaly detection is better described as semi-supervised, as it assumes all training data are nominal. This assumption simplifies training but requires manual data curation, introducing bias and limiting adaptability. We propose Confident Meta-learning (CoMet), a novel training strategy that enables deep anomaly detection models to learn from uncurated datasets where nominal and anomalous samples coexist, eliminating the need for explicit filtering. Our approach integrates Confident Learning, which assigns lower weights to low-confidence samples, and Meta-Learning, which stabilizes training by regularizing updates based on training-validation loss covariance. This prevents overfitting and enhances robustness to noisy data. CoMet is model-agnostic and can be applied to any anomaly detection method trainable via gradient descent. Experiments on MVTec-AD, VIADUCT, and KSDD2 with two state-of-the-art models demonstrate the effectiveness of our approach, consistently improving over the baseline methods, remaining insensitive to anomalies in the training set, and setting a new state-of-the-art across all datasets. Code will be made available upon acceptance.


#457
RIPE: Reinforcement Learning on Unlabeled Image Pairs for Robust Keypoint Extraction

Johannes Künzel · Anna Hilsmann · Peter Eisert

We introduce RIPE, an innovative reinforcement learning-based framework for weakly-supervised training of a keypoint extractor that excels in both detection and description tasks. In contrast to conventional training regimes that depend heavily on artificial transformations, pre-generated models, or 3D data, RIPE requires only a binary label indicating whether paired images represent the same scene. This minimal supervision significantly expands the pool of training data, enabling the creation of a highly generalized and robust keypoint extractor. RIPE utilizes the encoder's intermediate layers for the description of the keypoints with a hyper-column approach to integrate information from different scales. Additionally, we propose an auxiliary loss to enhance the discriminative capability of the learned descriptors. Comprehensive evaluations on standard benchmarks demonstrate that RIPE simplifies data preparation while achieving competitive performance compared to state-of-the-art techniques, marking a significant advancement in robust keypoint extraction and description. Code and data will be made available for research purposes.


#458
Efficient Adaptation of Pre-trained Vision Transformer underpinned by Approximately Orthogonal Fine-Tuning Strategy

Yiting Yang · Hao Luo · Yuan Sun · Qingsen Yan · Haokui Zhang · Wei Dong · Guoqing Wang · Peng Wang · Yang Yang · Heng Tao Shen

A prevalent approach in Parameter-Efficient Fine-Tuning (PEFT) of pre-trained Vision Transformers (ViT) involves freezing the majority of the backbone parameters and solely learning low-rank adaptation weight matrices to accommodate downstream tasks. These low-rank matrices are commonly derived through the multiplication structure of down-projection and up-projection matrices, exemplified by methods such as LoRA and Adapter. In this study, we observe an approximate orthogonality between any two row or column vectors within any weight matrix of the backbone parameters; however, this property is absent in the vectors of the down/up-projection matrices. Approximate orthogonality implies a reduction in the upper bound of the model's generalization error, signifying that the model possesses enhanced generalization capability. If the fine-tuned down/up-projection matrices were to exhibit this same property as the pre-trained backbone matrices, could the generalization capability of fine-tuned ViTs be further augmented? To address this question, we propose an Approximately Orthogonal Fine-Tuning (AOFT) strategy for representing the low-rank weight matrices. This strategy employs a single learnable vector to generate a set of approximately orthogonal vectors, which form the down/up-projection matrices, thereby aligning the properties of these matrices with those of the backbone. Extensive experimental results demonstrate that our method achieves competitive performance across a range of downstream image classification tasks, confirming the efficacy of the enhanced generalization capability embedded in the down/up-projection matrices. Our code is available at an anonymous link: https://drive.google.com/file/d/1rg3JYfkmeLGDbRWXspO22wxVspbtnthV/view?usp=drive_link.
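
One way to quantify the approximate orthogonality described above is the mean absolute cosine similarity over all pairs of row vectors, where values near 0 indicate near-orthogonal rows. The sketch below is illustrative only: the helper names are hypothetical, and the random Gaussian rows are a stand-in for real weight matrices, not data from the paper. It does not reproduce the AOFT construction itself.

```python
import math
import random

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_abs_pairwise_cosine(rows):
    """Average |cosine| over all distinct row pairs (0 means orthogonal)."""
    pairs = [(i, j) for i in range(len(rows)) for j in range(i + 1, len(rows))]
    return sum(abs(cosine(rows[i], rows[j])) for i, j in pairs) / len(pairs)

random.seed(0)
# Stand-in for a weight matrix: 16 random rows in 512 dimensions (assumption).
random_rows = [[random.gauss(0.0, 1.0) for _ in range(512)] for _ in range(16)]
# Exactly orthogonal rows and exactly parallel rows as reference points.
identity_rows = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
parallel_rows = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]]

print(mean_abs_pairwise_cosine(random_rows))
print(mean_abs_pairwise_cosine(identity_rows))
print(mean_abs_pairwise_cosine(parallel_rows))
```

The same measurement could be applied to the rows or columns of a backbone weight matrix and of a down/up-projection matrix to compare the two.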


#459
Highlight
DiffRefine: Diffusion-based Proposal Specific Point Cloud Densification for Cross-Domain Object Detection

Sangyun Shin · Yuhang He · Xinyu Hou · Samuel Hodgson · Andrew Markham · Niki Trigoni

The robustness of 3D object detection in large-scale outdoor point clouds degrades significantly when deployed in an unseen environment due to domain shifts. To minimize the domain gap, existing works on domain adaptive detection focus on several factors, including point density, object shape and sizes, to reduce false negative detections. However, the adaptation results indicate that challenges still remain. We argue that this is due to the difficulty of recognizing comparatively less distinctive regions on object surfaces due to sparsity, occlusion, etc. In this work, we aim to reinforce those features by generating points on object surfaces to make them straightforwardly recognizable. We draw our motivation from a common observation that detection proposals already contain accurate bounding boxes, but with relatively low objectness score predictions, which lead to false negatives. Given these box proposals, we densify sparse object points with a diffusion approach. As a result, our model DiffRefine can act as a simple additional module before second-stage refinement, which most existing two-stage detection models can use. Experimental results on domain adaptive detection show competitive performance across various detection architectures, especially on points vanishing due to distance.


#460
On the Robustness Tradeoff in Fine-Tuning

Kunyang Li · Jean-Charles Noirot Ferrand · Ryan Sheatsley · Blaine Hoak · Yohan Beugin · Eric Pauley · Patrick McDaniel

Fine-tuning has become the standard practice for adapting pre-trained (upstream) models to downstream tasks. However, the impact on model robustness is not well understood. In this work, we characterize the robustness-accuracy trade-off in fine-tuning. We evaluate the robustness and accuracy of fine-tuned models over 6 benchmark datasets and 7 different fine-tuning strategies. We observe a consistent trade-off between adversarial robustness and accuracy. Peripheral updates such as BitFit are more effective for simple tasks (over 75% above the average measured with area under the Pareto frontiers on CIFAR-10 and CIFAR-100). In contrast, fine-tuning information-heavy layers, such as attention layers via Compacter, achieves a better Pareto frontier on more complex tasks (57.5% and 34.6% above the average on Caltech-256 and CUB-200, respectively). Lastly, we observe that robustness of fine-tuning against out-of-distribution data closely tracks accuracy. These insights emphasize the need for robustness-aware fine-tuning to ensure reliable real-world deployments.


#461
Understanding Flatness in Generative Models: Its Role and Benefits

Taehwan Lee · Kyeongkook Seo · Jaejun Yoo · Sung Whan Yoon

Flat minima, known to enhance generalization and robustness in supervised learning, remain largely unexplored in generative models. In this work, we systematically investigate the role of loss surface flatness in generative models, both theoretically and empirically, with a particular focus on diffusion models. We establish a theoretical claim that flatter minima improve robustness against perturbations in target prior distributions, leading to benefits such as reduced exposure bias (where errors in noise estimation accumulate over iterations) and significantly improved resilience to model quantization, preserving generative performance even under strong quantization constraints. We further observe that Sharpness-Aware Minimization (SAM), which explicitly controls the degree of flatness, effectively enhances flatness in diffusion models, whereas other well-known methods such as Stochastic Weight Averaging (SWA) and Exponential Moving Average (EMA), which promote flatness indirectly via ensembling, are less effective. Through extensive experiments on CIFAR-10, LSUN Tower, and FFHQ, we demonstrate that flat minima in diffusion models indeed improve not only generative performance but also robustness.
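
For readers unfamiliar with SAM, the update can be sketched on a toy problem. This is a minimal illustration of the generic two-step SAM rule (ascend to a nearby worst-case point, then descend using the gradient taken there), not the paper's diffusion training setup. The quadratic loss, learning rate `lr`, and radius `rho` are assumptions chosen for illustration.

```python
import math

def loss(w):
    # Toy loss (assumption): sharp along w[0], flat along w[1].
    return 10.0 * w[0] ** 2 + 0.1 * w[1] ** 2

def grad(w):
    # Analytic gradient of the toy loss above.
    return [20.0 * w[0], 0.2 * w[1]]

def sam_step(w, lr=0.01, rho=0.05):
    """One SAM update: perturb along the normalized gradient, then descend."""
    g = grad(w)
    norm = math.sqrt(sum(x * x for x in g)) + 1e-12
    # Ascent step: approximate worst-case point within radius rho.
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    # Descent step: apply the gradient from the perturbed point to the
    # original weights.
    g_adv = grad(w_adv)
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]

w = [1.0, 1.0]
initial_loss = loss(w)
for _ in range(100):
    w = sam_step(w)
final_loss = loss(w)

print(initial_loss, final_loss)
```

In contrast, SWA and EMA leave the gradient step unchanged and average weights across iterations, which is why the abstract describes them as promoting flatness only indirectly.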