DiMPLe - Disentangled Multi-Modal Prompt Learning: Enhancing Out-Of-Distribution Alignment with Invariant and Spurious Feature Separation
Abstract
We introduce \textbf{DiMPLe} (\textbf{Di}sentangled \textbf{M}ulti-Modal \textbf{P}rompt \textbf{Le}arning), a novel approach to disentangling invariant and spurious features across vision and language modalities in multi-modal learning. Spurious correlations in visual data often hinder out-of-distribution (OOD) performance. Unlike prior methods that focus solely on image features, DiMPLe \textbf{disentangles} features \textbf{within and across modalities} while maintaining consistent alignment, enabling better generalization to \textbf{novel classes} and robustness to \textbf{distribution shifts}. Our method combines three key objectives: (1) mutual information minimization between invariant and spurious features, (2) spurious feature regularization, and (3) contrastive learning on invariant features. Extensive experiments demonstrate DiMPLe's superior performance compared to CoOp-OOD when averaged across 11 diverse datasets, with absolute gains of 15.27 in base class accuracy and 44.31 in novel class accuracy. The code will be released publicly upon acceptance.
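As a rough sketch of how the three stated objectives might combine into a single training loss (the weighting coefficients $\lambda_1, \lambda_2$, the feature notation $z^{\mathrm{inv}}, z^{\mathrm{spu}}$, and the specific mutual-information estimator are illustrative assumptions, not details given in this abstract):
\[
\mathcal{L}_{\mathrm{DiMPLe}} \;=\; \underbrace{\mathcal{L}_{\mathrm{cont}}\!\left(z^{\mathrm{inv}}_{v},\, z^{\mathrm{inv}}_{t}\right)}_{\text{contrastive, invariant features}} \;+\; \lambda_1\, \underbrace{\widehat{I}\!\left(z^{\mathrm{inv}};\, z^{\mathrm{spu}}\right)}_{\text{mutual information minimization}} \;+\; \lambda_2\, \underbrace{\mathcal{L}_{\mathrm{reg}}\!\left(z^{\mathrm{spu}}\right)}_{\text{spurious feature regularization}}
\]
Here $z^{\mathrm{inv}}_{v}$ and $z^{\mathrm{inv}}_{t}$ denote invariant vision and text features, and $z^{\mathrm{spu}}$ the spurious features; the exact forms of each term are defined in the method section rather than here.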