

Poster

Attention to the Burstiness in Visual Prompt Tuning!

Yuzhu Wang · Manni Duan · Shu Kong


Abstract: Visual Prompt Tuning (VPT) is a parameter-efficient finetuning technique that adapts a pre-trained vision Transformer (ViT) by learning a small set of parameters in the input space, known as prompts. In VPT, we uncover ``burstiness'' in the values arising from the interaction of image patch embeddings and the key and query projectors within the Transformer's self-attention module. Interestingly, the values of patch embeddings and the key and query projectors exhibit Laplacian and hyper-Laplacian distributions, respectively. Intuitively, these non-Gaussian distributions pose challenges for learning prompts. To address this, we propose whitening these data, de-correlating them and equalizing their variance to make them more Gaussian before learning prompts. We derive the whitening matrix over random image patch embeddings and the ViT's key and query projectors, and multiply it with the prompt to be learned in a bilinear manner. Surprisingly, this method significantly accelerates prompt tuning and boosts accuracy, e.g., by $>$25 points on the CUB dataset; interestingly, it learns ``bursty prompts''. As bilinear models are known to introduce burstiness, we present a compact method that learns two small sets of parameters whose multiplication yields the final prompts. We call the proposed methods Bilinear Prompt Tuning (BPT). Extensive experiments demonstrate that BPT methods not only outperform various VPT methods across multiple benchmark datasets but also reduce parameter count and computation overhead.
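The abstract describes two ideas: multiplying a fixed whitening matrix with learnable prompts in a bilinear fashion, and a compact variant that factorizes the prompt into two small parameter sets. Below is a minimal PyTorch sketch of both, for intuition only; it is not the authors' implementation. For simplicity, the whitening matrix here is derived from patch embeddings alone (the paper also folds in the ViT's key and query projectors), and all names, shapes, and the `rank` parameter are illustrative assumptions.

```python
# Illustrative sketch of Bilinear Prompt Tuning (BPT); not the authors' code.
import torch
import torch.nn as nn


def whitening_matrix(x: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """ZCA whitening matrix for rows of x with shape (n, d): de-correlates
    the feature dimensions and equalizes their variance, pushing the data
    toward a more Gaussian distribution."""
    x = x - x.mean(dim=0, keepdim=True)
    cov = x.T @ x / (x.shape[0] - 1)
    eigvals, eigvecs = torch.linalg.eigh(cov)
    # cov^{-1/2} via its eigendecomposition; eps guards tiny eigenvalues.
    return eigvecs @ torch.diag((eigvals + eps).rsqrt()) @ eigvecs.T


class BilinearPrompt(nn.Module):
    """Prompts parameterized bilinearly: a fixed whitening matrix times a
    learnable matrix P. The compact variant factorizes P = A @ B with a
    small rank r, shrinking the learned parameter count."""

    def __init__(self, num_prompts: int, dim: int,
                 whiten: torch.Tensor, rank: int | None = None):
        super().__init__()
        self.register_buffer("whiten", whiten)  # (dim, dim), not trained
        if rank is None:
            self.A = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)
            self.B = None
        else:
            self.A = nn.Parameter(torch.randn(num_prompts, rank) * 0.02)
            self.B = nn.Parameter(torch.randn(rank, dim) * 0.02)

    def forward(self) -> torch.Tensor:
        P = self.A if self.B is None else self.A @ self.B  # (num_prompts, dim)
        return P @ self.whiten  # learnable parameters times fixed whitening


# Usage: derive the whitening matrix from (stand-in) patch embeddings, then
# prepend the produced prompts to the ViT input sequence as in standard VPT.
embeds = torch.randn(4096, 768)  # placeholder for sampled patch embeddings
W = whitening_matrix(embeds)
prompts = BilinearPrompt(num_prompts=10, dim=768, whiten=W, rank=16)()
print(prompts.shape)  # torch.Size([10, 768])
```

Because the whitening matrix is fixed, only `A` (and `B` in the compact variant) receive gradients, which is consistent with the abstract's claim of reduced parameter count relative to learning full prompts directly.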
