Poster
Highlight What You Want: Weakly-Supervised Instance-Level Controllable Infrared-Visible Image Fusion
Zeyu Wang · Jizheng Zhang · Haiyu Song · Mingyu Ge · Jiayu Wang · Haoran Duan
Infrared and visible image fusion (VIS-IR) aims to integrate complementary information from both source images to produce a fused image with enriched details. However, most existing fusion models lack controllability, making it difficult to customize the fused output according to user preferences. To address this challenge, we propose a novel weakly-supervised, instance-level controllable fusion model that adaptively highlights user-specified instances based on input text. Our model consists of two stages: pseudo-label generation and fusion network training. In the first stage, guided by observed multimodal manifold priors, we leverage text and manifold similarity as joint supervisory signals to train a text-to-image response network (TIRN) in a weakly-supervised manner, enabling it to identify the referenced semantic-level objects from instance segmentation outputs. To align text and image features in TIRN, we propose a multimodal feature alignment module (MFA), which uses manifold similarity to guide attention weight assignment and establish precise correspondence between image patches and text embeddings. Moreover, we employ spatial positional relationships to accurately select the referenced instances from among multiple semantic-level objects. In the second stage, the fusion network takes the source images and text as input, using the generated pseudo-labels as supervision to apply distinct fusion strategies to target and non-target regions. Experimental results show that our model not only generates precise pseudo-labels but also achieves state-of-the-art fusion performance while highlighting user-defined instances. Our code will be publicly available.
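As a rough illustration of the MFA idea described in the abstract, the sketch below biases standard cross-attention logits between image patch features and text embeddings with a precomputed manifold-similarity term. The function name, the additive-bias form, and the scale `lam` are assumptions made for illustration only, not the authors' formulation.

```python
# Hypothetical sketch: manifold similarity guiding attention weights between
# image patches and text embeddings. All names and design choices here are
# illustrative assumptions, not the paper's implementation.
import torch
import torch.nn.functional as F


def manifold_guided_cross_attention(patch_feats, text_feats, manifold_sim, lam=1.0):
    """patch_feats: (B, N, D) image patch features; text_feats: (B, T, D) text
    embeddings; manifold_sim: (B, N, T) precomputed patch-token manifold
    similarity, e.g. in [0, 1]."""
    d = patch_feats.size(-1)
    # Scaled dot-product attention logits from patches (queries) to text tokens (keys).
    logits = torch.bmm(patch_feats, text_feats.transpose(1, 2)) / d ** 0.5
    # Bias the logits with the manifold similarity so attention concentrates on
    # patch-token pairs that lie close on the joint manifold.
    attn = F.softmax(logits + lam * manifold_sim, dim=-1)
    # Aggregate text information into each image patch.
    return torch.bmm(attn, text_feats)
```

Similarly, a minimal sketch of how stage two could use the stage-one pseudo-label mask to apply distinct fusion strategies to target and non-target regions; the per-region reference choices (pixel-wise maximum versus average) and the extra weighting are illustrative assumptions rather than the paper's actual loss.

```python
# Hypothetical sketch: region-wise fusion supervision driven by a pseudo-label
# mask. Reference signals and weights are assumptions for illustration.
import torch
import torch.nn.functional as F


def region_wise_fusion_loss(fused, ir, vis, mask, target_weight=2.0):
    """fused, ir, vis: (B, 1, H, W) images; mask: (B, 1, H, W) pseudo-label
    in [0, 1], equal to 1 on the user-referenced instances."""
    # Target regions: favor the per-pixel maximum of the two sources so the
    # referenced instance keeps both thermal saliency and visible texture.
    target_ref = torch.maximum(ir, vis)
    target_loss = F.l1_loss(fused * mask, target_ref * mask)

    # Non-target regions: a plain average of the sources as a neutral baseline.
    bg_ref = 0.5 * (ir + vis)
    bg_loss = F.l1_loss(fused * (1 - mask), bg_ref * (1 - mask))

    # Weight the target regions more heavily so they are visibly highlighted.
    return target_weight * target_loss + bg_loss
```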
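These two sketches are only meant to make the abstract's description concrete; the actual TIRN, MFA, and fusion-network designs are specified in the paper itself.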