Poster

Causal Disentanglement and Cross-Modal Alignment for Enhanced Few-Shot Learning

Tianjiao Jiang · Zhen Zhang · Yuhang Liu · Javen Qinfeng Shi


Abstract:

Few-shot learning (FSL) aims to enable models to learn effectively from limited labeled data. However, existing methods often struggle with overfitting due to the high dimensionality of feature spaces and the small sample sizes typically available. More precisely, the features used in most FSL applications can be viewed as a mixture of latent disentangled features. As a result, the learner is often required to implicitly infer the mixing procedure, which involves estimating a large number of parameters and frequently leads to overfitting. Building on recent theoretical advances in multi-modal contrastive learning, we propose the Causal CLIP Adapter (CCA), a novel approach that disentangles visual features obtained from CLIP by applying independent component analysis (ICA). While ICA effectively disentangles latent features, it may inadvertently introduce misalignment in the feature space. To address this, we leverage CLIP's inherent cross-modal alignment and enhance it both unidirectionally and bidirectionally through fine-tuning and cross-attention mechanisms. The logits from uni-modal and cross-modal classifications are then combined linearly to improve overall classification accuracy. Extensive experiments conducted across 11 benchmark datasets demonstrate that our method consistently outperforms state-of-the-art (SOTA) techniques in terms of robustness to distributional shifts and resistance to adversarial noise, all while maintaining computational efficiency. These results underscore the effectiveness of causal disentanglement and enhanced cross-modal alignment in significantly boosting FSL performance.
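The pipeline the abstract describes can be sketched in a few lines: disentangle (stand-in) CLIP image features with ICA, compute cross-modal logits as cosine similarity against text prototypes, and fuse the uni-modal and cross-modal logits linearly. This is an illustrative sketch under assumed shapes and names (random features in place of real CLIP embeddings, a random projection in place of a fitted classifier head), not the authors' implementation.

```python
# Illustrative sketch of the CCA-style pipeline from the abstract.
# All feature matrices and the classifier head are placeholders.
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)

# Stand-ins for CLIP embeddings: 100 support images, 5 classes, dim 64.
img_feats = rng.normal(size=(100, 64))
txt_feats = rng.normal(size=(5, 64))          # one text prototype per class

# 1) Disentangle visual features with ICA (fewer components than dims).
ica = FastICA(n_components=16, random_state=0)
z = ica.fit_transform(img_feats)              # (100, 16) disentangled codes

# 2) Uni-modal logits: a linear head on the ICA codes
#    (a random projection here, standing in for a trained classifier).
W = rng.normal(size=(16, 5))
uni_logits = z @ W                            # (100, 5)

# 3) Cross-modal logits: cosine similarity between image and text
#    embeddings, mimicking CLIP's zero-shot classification.
img_n = img_feats / np.linalg.norm(img_feats, axis=1, keepdims=True)
txt_n = txt_feats / np.linalg.norm(txt_feats, axis=1, keepdims=True)
cross_logits = img_n @ txt_n.T                # (100, 5)

# 4) Linear combination of the two logit streams, as in the abstract.
alpha = 0.5
logits = alpha * uni_logits + (1 - alpha) * cross_logits
preds = logits.argmax(axis=1)
```

The fine-tuning and cross-attention components mentioned in the abstract are omitted here; the sketch only shows the ICA disentanglement step and the linear fusion of uni-modal and cross-modal scores.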