Poster Exhibit Hall I #330

VISO: Accelerating In-orbit Object Detection with Language-Guided Mask Learning and Sparse Inference

Meiqi Wang ⋅ Han Qiu

2025 Poster

Abstract

In-orbit object detection is essential for Earth observation missions on satellites equipped with GPUs. A promising approach is to use pre-trained vision-language models (VLMs) to enhance open-vocabulary capability. However, adopting them on satellites poses two challenges: (1) satellite imagery differs substantially from natural images, and (2) satellites' embedded GPUs are insufficient for the inference of complex models. We find that existing VLMs lack a crucial prior: in-orbit detection involves identifying a set of known objects within a cluttered yet monotonous background. Motivated by this observation, we propose VISO, a Vision-language Instructed Satellite Object detection model that focuses on object-specific features while suppressing irrelevant regions through language-guided mask learning. After pre-training on a large-scale satellite dataset with 3.4M region-text pairs, VISO enhances object-text alignment and object-centric features to improve detection accuracy. VISO also suppresses irrelevant regions, enabling highly sparse inference that accelerates detection on satellites. Extensive experiments show that VISO without sparsity outperforms state-of-the-art (SOTA) VLMs in zero-shot detection, gaining 34.1% AP while using 27× fewer FLOPs, and surpasses specialist models in supervised object detection and object referring by 2.3% AP. When VISO is sparsified to a comparable AP, FLOPs can be reduced by up to 8.5×. Real-world tests show that VISO achieves a 2.8–4.8× FPS speed-up on satellites' embedded GPUs.
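The general idea of language-guided masking followed by sparse inference can be illustrated with a toy sketch. This is not the VISO implementation: the abstract describes a learned mask, whereas the sketch below swaps in a fixed cosine-similarity threshold, and every name in it (`language_guided_mask`, `sparse_apply`, `flops_reduction`, the threshold value, the 2-D toy features) is a hypothetical choice made for illustration only.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def language_guided_mask(patch_feats, text_embeds, threshold=0.8):
    """Keep a patch if it is similar enough to at least one text embedding.

    Assumption: a fixed threshold stands in for the learned mask.
    """
    return [
        max(cosine(p, t) for t in text_embeds) >= threshold
        for p in patch_feats
    ]

def sparse_apply(patch_feats, mask, fn):
    """Run the (expensive) function only on kept patches; skip the rest."""
    return [fn(p) if keep else None for p, keep in zip(patch_feats, mask)]

def flops_reduction(mask):
    """Idealized reduction factor if cost scales with the number of kept patches."""
    kept = sum(mask)
    return len(mask) / kept if kept else float("inf")

# Toy scene: 8 patches with 2-D features, one text query embedding.
patches = [
    [1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0],
    [0.0, -1.0], [-0.7, -0.7], [0.1, -1.0], [-0.2, 0.9],
]
text = [[1.0, 0.0]]

mask = language_guided_mask(patches, text, threshold=0.8)
outputs = sparse_apply(patches, mask, lambda p: sum(p))
print(mask)
print(flops_reduction(mask))
```

In this toy scene only the two patches aligned with the text query are kept, so the per-patch computation runs on a quarter of the input. Real speed-ups depend on how well the hardware exploits sparsity, which is why measured FPS gains can be smaller than the FLOPs reduction.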
