

Poster

Hierarchical Variational Test-Time Prompt Generation for Zero-Shot Generalization

Zhaoyang Wu · Fang Liu · Licheng Jiao · Shuo Li · Lingling Li · Xu Liu · Puhua Chen · Wenping Ma


Abstract:

Vision-language models such as CLIP exhibit strong zero-shot generalization, making them valuable for a wide range of downstream tasks through prompt learning. However, existing test-time prompt tuning methods, such as entropy minimization, treat text and visual prompts as fixed learnable parameters, which limits their adaptability to unseen domains. We instead propose Hierarchical Variational Test-Time Prompt Generation, a novel approach in which both text and visual prompts are dynamically generated by a HyperTransformer at inference time. This lets the model produce data-specific prompts for each modality, significantly improving generalization. To further address template sensitivity and distribution shift, we introduce variational prompt generation, which uses variational inference to mitigate the biases introduced by different prompt templates and data augmentations. Finally, our hierarchical variational prompt generation conditions the prompts at each layer on those from previous layers, allowing the model to capture deeper contextual dependencies and refine prompt interactions for robust adaptation. Extensive experiments on domain generalization benchmarks show that our method significantly outperforms existing prompt-learning techniques, achieving state-of-the-art zero-shot accuracy while remaining efficient.
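To make the layer-wise scheme in the abstract concrete, below is a minimal PyTorch sketch of hierarchical variational prompt generation. The class name, hypernetwork sizes, the pooled-image-feature conditioning, and the standard-normal prior are all illustrative assumptions for exposition, not the paper's released implementation.

```python
import torch
import torch.nn as nn

class HierarchicalVariationalPromptGenerator(nn.Module):
    """Sketch: per-layer Gaussian posteriors over prompt tokens, each
    conditioned on the sampled prompt of the previous layer (hypothetical
    module and dimension names)."""

    def __init__(self, num_layers: int, num_tokens: int, dim: int, feat_dim: int):
        super().__init__()
        self.num_layers = num_layers
        self.num_tokens = num_tokens
        self.dim = dim
        # Learned prompt that acts as the "previous prompt" for layer 0.
        self.init_prompt = nn.Parameter(torch.zeros(num_tokens, dim))
        # One small hypernetwork head per layer, producing mean and log-variance.
        self.heads = nn.ModuleList([
            nn.Sequential(
                nn.Linear(feat_dim + num_tokens * dim, 4 * dim),
                nn.GELU(),
                nn.Linear(4 * dim, 2 * num_tokens * dim),
            )
            for _ in range(num_layers)
        ])

    def forward(self, feats: torch.Tensor):
        """feats: (B, feat_dim) pooled features of the test sample(s).
        Returns one (B, num_tokens, dim) prompt per layer, plus the KL
        divergence to a standard-normal prior for the variational objective."""
        b = feats.size(0)
        prev = self.init_prompt.unsqueeze(0).expand(b, -1, -1)
        prompts, kl = [], feats.new_zeros(())
        for head in self.heads:
            # Condition this layer's posterior on the test features and
            # the previous layer's prompt (the hierarchical step).
            h = torch.cat([feats, prev.flatten(1)], dim=-1)
            mu, logvar = head(h).chunk(2, dim=-1)
            std = (0.5 * logvar).exp()
            z = mu + std * torch.randn_like(std)  # reparameterization trick
            prompt = z.view(b, self.num_tokens, self.dim)
            # KL( N(mu, sigma^2) || N(0, I) ), averaged over the batch.
            kl = kl + 0.5 * (mu.pow(2) + logvar.exp() - 1 - logvar).sum(-1).mean()
            prompts.append(prompt)
            prev = prompt
        return prompts, kl
```

Conditioning each layer's posterior on the previous layer's sampled prompt is what makes the generation hierarchical, and sampling prompts rather than regressing point estimates is what allows a variational objective to absorb template and augmentation noise.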
