

Poster

Exploiting Frequency Dynamics for Enhanced Multimodal Event-based Action Recognition

Meiqi Cao · Xiangbo Shu · Xin Jiang · Rui Yan · Yazhou Yao · Jinhui Tang


Abstract:

While event cameras excel at capturing microsecond temporal dynamics, they suffer from sparse spatial representations compared to traditional RGB data. Multimodal event-based action recognition approaches therefore aim to synergize complementary strengths by independently extracting and integrating paired RGB-Event features. However, this paradigm inevitably introduces additional data acquisition costs while eroding the inherent privacy advantages of event-based sensing. Inspired by event-to-image reconstruction, recovering texture-enriched visual representations directly from the asynchronous event streams offers a promising alternative. In response, we propose an Enhanced Multimodal Perceptual (EMP) framework that hierarchically explores multimodal cues (e.g., edges and textures) from raw event streams through two synergistic innovations spanning the representation and feature levels. Specifically, we introduce a Cross-Modal Frequency Enhancer (CFE) that leverages the complementary frequency characteristics of reconstructed frames and stacked event frames to refine event representations. Furthermore, to achieve unified feature encoding across modalities, we develop a High-Frequency Guided Selector (HGS) that performs semantically consistent token selection guided by dynamic edge features while adaptively suppressing interference from redundant multimodal information. Extensive experiments on four benchmark datasets demonstrate the superior effectiveness of our proposed framework.
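The abstract describes two mechanisms without implementation detail. The sketch below is a minimal, hypothetical illustration of the two underlying ideas, not the authors' implementation: (1) blending complementary frequency bands of an event-reconstructed frame and a stacked event frame, and (2) keeping only the patch tokens with the largest high-frequency (edge) energy. All function names, the cutoff radius, and the keep ratio are illustrative assumptions.

```python
import torch
import torch.fft as fft


def frequency_blend(recon: torch.Tensor, stacked: torch.Tensor, radius: int = 8) -> torch.Tensor:
    """Combine low frequencies of the reconstructed frame (texture) with
    high frequencies of the stacked event frame (edges). Shapes: (B, C, H, W)."""
    B, C, H, W = recon.shape
    # Centered 2-D spectra of both representations.
    R = fft.fftshift(fft.fft2(recon), dim=(-2, -1))
    S = fft.fftshift(fft.fft2(stacked), dim=(-2, -1))
    # Circular low-pass mask around the spectrum center.
    yy, xx = torch.meshgrid(
        torch.arange(H) - H // 2, torch.arange(W) - W // 2, indexing="ij"
    )
    low = ((yy ** 2 + xx ** 2) <= radius ** 2).to(recon.dtype)
    # Low band from the reconstruction, high band from the stacked frame.
    fused = R * low + S * (1.0 - low)
    return fft.ifft2(fft.ifftshift(fused, dim=(-2, -1))).real


def select_high_freq_tokens(tokens: torch.Tensor, hf_score: torch.Tensor, keep_ratio: float = 0.5):
    """Keep the tokens with the largest per-patch high-frequency energy.
    tokens: (B, N, D); hf_score: (B, N) edge-energy score per patch token."""
    B, N, D = tokens.shape
    k = max(1, int(N * keep_ratio))
    idx = hf_score.topk(k, dim=1).indices  # (B, k) indices of retained tokens
    return torch.gather(tokens, 1, idx.unsqueeze(-1).expand(-1, -1, D))


if __name__ == "__main__":
    recon = torch.rand(2, 3, 64, 64)    # event-to-image reconstruction (assumed input)
    stacked = torch.rand(2, 3, 64, 64)  # stacked event frame (assumed input)
    fused = frequency_blend(recon, stacked)
    tokens = torch.rand(2, 196, 256)    # patch tokens from a shared encoder
    scores = torch.rand(2, 196)         # per-token high-frequency energy
    kept = select_high_freq_tokens(tokens, scores)
    print(fused.shape, kept.shape)      # (2, 3, 64, 64) and (2, 98, 256)
```

How the frequency split is computed, how edge energy is scored, and how the two modules interact in the full EMP pipeline are specified in the paper itself; the sketch only conveys the general flavor of frequency-domain fusion and energy-guided token pruning.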
