

Poster

LLaFEA: Frame-Event Complementary Fusion for Fine-Grained Spatiotemporal Understanding in LMMs

Hanyu Zhou · Gim Hee Lee


Abstract:

Large multimodal models (LMMs) excel at scene understanding but struggle with fine-grained spatiotemporal reasoning due to weak alignment between linguistic and visual representations. Existing methods map textual positions and durations onto frame-based videos, but suffer from the temporal sparsity of frames, which limits language-vision temporal coordination. To address this issue, we introduce LLaFEA (Large Language and Frame-Event Assistant), which leverages event cameras for temporally dense perception and frame-event fusion. Our approach employs a cross-attention mechanism to integrate complementary spatial and temporal features, followed by self-attention matching for global spatiotemporal associations. We further embed textual position and duration tokens into the fused visual space to enhance fine-grained alignment. This unified framework ensures robust spatiotemporal coordinate alignment, enabling LMMs to interpret scenes at any position and time. In addition, we construct a dataset of real-world frame-event pairs with coordinate instructions and conduct extensive experiments to validate the effectiveness of our method. Our code will be made publicly available.
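The fusion pipeline described above (cross-attention between the frame and event streams, then self-attention over the fused tokens for global spatiotemporal association) can be sketched roughly as follows. This is a minimal illustrative sketch, not the authors' released implementation; the module name, tensor shapes, and layer sizes are assumptions.

```python
import torch
import torch.nn as nn


class FrameEventFusion(nn.Module):
    """Hypothetical sketch of cross-attention frame-event fusion followed by
    global self-attention, loosely following the abstract's description."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        # Each modality attends to the other to borrow its complementary cue:
        # frames contribute dense spatial detail, events dense temporal detail.
        self.frame_to_event = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.event_to_frame = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Self-attention over the concatenated sequence models global
        # spatiotemporal associations across both modalities.
        self.global_self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, frame_tok: torch.Tensor, event_tok: torch.Tensor) -> torch.Tensor:
        # frame_tok: (B, N_f, dim) spatially dense frame features
        # event_tok: (B, N_e, dim) temporally dense event features
        f_fused, _ = self.frame_to_event(frame_tok, event_tok, event_tok)
        e_fused, _ = self.event_to_frame(event_tok, frame_tok, frame_tok)
        fused = torch.cat([frame_tok + f_fused, event_tok + e_fused], dim=1)
        fused, _ = self.global_self_attn(fused, fused, fused)
        # Fused visual tokens of shape (B, N_f + N_e, dim); textual position and
        # duration tokens would then be aligned against this space.
        return self.norm(fused)


if __name__ == "__main__":
    fusion = FrameEventFusion()
    frames = torch.randn(2, 196, 256)   # e.g. 14x14 patch tokens per frame
    events = torch.randn(2, 512, 256)   # e.g. event voxel / token features
    print(fusion(frames, events).shape)  # torch.Size([2, 708, 256])
```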
