Poster
SAMPLE: Semantic Alignment through Temporal-Adaptive Multimodal Prompt Learning for Event-Based Open-Vocabulary Action Recognition
Jing Wang · Rui Zhao · Ruiqin Xiong · Xingtao Wang · Xiaopeng Fan · Tiejun Huang
Open-vocabulary action recognition (OVAR) extends recognition systems to identify unseen action categories. While large-scale vision-language models (VLMs) like CLIP have enabled OVAR in image domains, their adaptation to event data remains underexplored. Event cameras offer high temporal resolution and inherent privacy preservation, making them suitable for capturing fine-grained motion dynamics. However, leveraging event data for OVAR presents two challenges: 1) bridging the domain gap between static image-based models and event streams, and 2) preserving the generalization capabilities of pretrained VLMs in open-vocabulary settings. In this paper, we propose SAMPLE, a lightweight adaptation of VLMs for event-based action recognition that balances supervised and open-vocabulary performance. We introduce a Temporal-Adaptive Multimodal Prompt Learning strategy with three components: 1) unimodal prompts on both the event and text branches to learn the data distribution; 2) event-text cross-modal prompts for representation-space alignment; and 3) temporal-adaptive prompts to model temporal dependencies across event data. Extensive evaluations demonstrate that SAMPLE outperforms prior methods across fully supervised, few-shot, base-to-novel, and zero-shot settings. Notably, in zero-shot scenarios, SAMPLE achieves gains of +15.46%, +29.76%, and +23.79% on SeAct, DVS128Gesture, and PAF respectively, with lower compute cost. Our code is included in the supplementary materials; the code and models will be made publicly available.
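To make the three prompt types concrete, below is a minimal structural sketch in PyTorch of how unimodal, cross-modal, and temporal-adaptive prompts could be wired together. It is an illustration under assumed shapes and module choices (e.g., `embed_dim`, `num_prompts`, mean-pooled frame features, a cross-attention layer), not the authors' released implementation.

```python
import torch
import torch.nn as nn


class TemporalAdaptiveMultimodalPrompts(nn.Module):
    """Illustrative sketch of the three prompt types described in the abstract.

    All names and shapes here are hypothetical assumptions for illustration only.
    """

    def __init__(self, embed_dim=512, num_prompts=8):
        super().__init__()
        # 1) Unimodal prompts: learnable tokens prepended to each branch.
        self.event_prompts = nn.Parameter(torch.randn(num_prompts, embed_dim) * 0.02)
        self.text_prompts = nn.Parameter(torch.randn(num_prompts, embed_dim) * 0.02)
        # 2) Cross-modal coupling: project event-side prompts into the text branch
        #    as one simple way to align the two representation spaces.
        self.event_to_text = nn.Linear(embed_dim, embed_dim)
        # 3) Temporal adaptation: let prompts attend over frame-level event features.
        self.temporal_attn = nn.MultiheadAttention(embed_dim, num_heads=8, batch_first=True)

    def forward(self, event_tokens, text_tokens):
        # event_tokens: (B, T, N, D) frame-wise event tokens; text_tokens: (B, L, D)
        B, T, N, D = event_tokens.shape

        # Temporal-adaptive prompts: condition the learnable event prompts
        # on mean-pooled per-frame features via cross-attention.
        frame_feats = event_tokens.mean(dim=2)                        # (B, T, D)
        prompts = self.event_prompts.unsqueeze(0).expand(B, -1, -1)   # (B, P, D)
        temporal_prompts, _ = self.temporal_attn(prompts, frame_feats, frame_feats)

        # Prepend the adapted prompts to the (flattened) event token sequence.
        event_seq = event_tokens.reshape(B, T * N, D)
        event_seq = torch.cat([temporal_prompts, event_seq], dim=1)

        # Cross-modal prompts: share event prompt information with the text branch,
        # alongside the text branch's own unimodal prompts.
        shared = self.event_to_text(temporal_prompts)
        text_unimodal = self.text_prompts.unsqueeze(0).expand(B, -1, -1)
        text_seq = torch.cat([text_unimodal, shared, text_tokens], dim=1)
        return event_seq, text_seq


if __name__ == "__main__":
    module = TemporalAdaptiveMultimodalPrompts()
    ev = torch.randn(2, 16, 49, 512)   # dummy event tokens: 2 clips, 16 frames, 49 patches
    tx = torch.randn(2, 77, 512)       # dummy CLIP-style text tokens
    ev_out, tx_out = module(ev, tx)
    print(ev_out.shape, tx_out.shape)  # torch.Size([2, 792, 512]) torch.Size([2, 93, 512])
```

In this sketch, the prompt parameters and the small projection/attention modules are the only trainable pieces, which is consistent with the abstract's framing of SAMPLE as a lightweight adaptation of a frozen pretrained VLM.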