

Poster

TemCoCo: Temporally Consistent Multi-modal Video Fusion with Visual-Semantic Collaboration

Gong Meiqi · Hao Zhang · Xunpeng Yi · Linfeng Tang · Jiayi Ma


Abstract:

Existing multi-modal fusion methods typically extend image fusion techniques directly to video fusion tasks, an approach that discards inherent temporal information and struggles to maintain temporal consistency between video frames. To address this limitation, we propose a comprehensive method specifically designed for multi-modal video fusion, leveraging a temporally consistent framework with visual-semantic collaboration to simultaneously ensure visual fidelity, semantic accuracy, and temporal consistency. First, we introduce a visual-semantic interaction module consisting of a semantic branch and a visual branch, with DINOv2 and VGG19 employed for distillation. This design enables, for the first time, the simultaneous and targeted enhancement of both the visual and semantic representations of videos. Second, we are the first to integrate the video degradation enhancement task into the video fusion pipeline, constructing a temporal cooperative module that leverages temporal dependencies to facilitate the recovery of weak information. Third, to ensure temporal consistency, we embed a temporal-enhanced mechanism into the network and devise a temporal loss to guide the optimization process. Finally, we introduce two new metrics tailored to video fusion that evaluate the temporal consistency of the generated fused videos. Extensive experimental results on public video datasets validate the superiority of our method.
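The abstract mentions a temporal loss that guides optimization toward temporally consistent fused videos but does not spell out its form. The sketch below is a minimal, illustrative assumption of one common way to express such an objective in PyTorch: it matches the frame-to-frame differences (temporal gradients) of the fused video to those of the source videos. The modality names (`ir`, `vis`), the tensor shapes, and the max-magnitude blending rule are all hypothetical and are not taken from the paper.

```python
# Hedged sketch of a temporal-consistency objective for multi-modal video fusion.
# Assumption: the fused video's temporal gradient should track whichever source
# stream exhibits the stronger motion at each location. Not the authors' loss.
import torch
import torch.nn.functional as F


def temporal_consistency_loss(fused, ir, vis):
    """fused, ir, vis: tensors of shape (B, T, C, H, W) with values in [0, 1]."""
    # Frame-to-frame differences approximate each video's temporal gradient.
    d_fused = fused[:, 1:] - fused[:, :-1]
    d_ir = ir[:, 1:] - ir[:, :-1]
    d_vis = vis[:, 1:] - vis[:, :-1]
    # Assumed reference: follow the source with the larger motion magnitude.
    d_ref = torch.where(d_ir.abs() >= d_vis.abs(), d_ir, d_vis)
    return F.l1_loss(d_fused, d_ref)


if __name__ == "__main__":
    # Usage with random stand-in clips: batch of 2, 5 frames, 3 channels, 64x64.
    b, t, c, h, w = 2, 5, 3, 64, 64
    fused = torch.rand(b, t, c, h, w, requires_grad=True)
    ir, vis = torch.rand(b, t, c, h, w), torch.rand(b, t, c, h, w)
    loss = temporal_consistency_loss(fused, ir, vis)
    loss.backward()
    print(f"temporal loss: {loss.item():.4f}")
```

In practice such a term would be weighted against visual-fidelity and semantic losses; the paper's actual formulation, including any motion compensation or warping, may differ substantially.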
