Poster
Temporal-aware Query Routing for Real-time Video Instance Segmentation
Zesen Cheng · Kehan Li · Yian Zhao · Hang Zhang · Chang Liu · Jie Chen
With the rise of applications such as embodied intelligence, developing online video instance segmentation (VIS) methods with high real-time performance has become increasingly important. However, through time profiling of the components of advanced online VIS architectures (i.e., transformer-based architectures), we find that the transformer decoder significantly hampers inference speed. Further analysis of the similarities between the outputs of adjacent frames at each transformer decoder layer reveals substantial redundant computation within the transformer decoder. To address this issue, we introduce the Temporal-Aware query Routing (TAR) mechanism, which we embed before each transformer decoder layer. By fusing the optimal queries from the previous frame, the queries output by the preceding decoder layer, and their differential information, TAR predicts a binary classification score and then applies an argmax operation to decide whether the current layer should be skipped. Experimental results demonstrate that integrating TAR into the baselines yields significant efficiency gains (24.7 → 34.6 FPS for MinVIS, 22.4 → 32.8 FPS for DVIS++) while also improving performance (e.g., on YoutubeVIS 2019, 47.4 → 48.4 AP for MinVIS, 55.5 → 55.7 AP for DVIS++). Furthermore, our analysis of the TAR mechanism shows that the number of skipped layers increases as the differences between adjacent video frames decrease, suggesting that our method effectively exploits inter-frame differences to reduce redundant computation in the transformer decoder.
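To make the routing idea concrete, the following is a minimal, hypothetical sketch based only on the abstract's description: a small module fuses the previous frame's queries, the current queries, and their difference, predicts a two-class score, and takes an argmax to decide whether a decoder layer is skipped. All module names, dimensions, and interfaces here are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class TemporalAwareQueryRouter(nn.Module):
    """Hypothetical sketch of TAR-style gating: fuse previous-frame queries,
    current queries, and their difference, then predict skip vs. keep."""

    def __init__(self, embed_dim: int, hidden_dim: int = 256):
        super().__init__()
        # Fuse the concatenation [prev queries, current queries, difference].
        self.fuse = nn.Sequential(
            nn.Linear(3 * embed_dim, hidden_dim),
            nn.ReLU(inplace=True),
        )
        # Binary classification head: logits for (skip, keep).
        self.classifier = nn.Linear(hidden_dim, 2)

    def forward(self, prev_queries: torch.Tensor, cur_queries: torch.Tensor) -> bool:
        # prev_queries, cur_queries: (num_queries, embed_dim)
        diff = cur_queries - prev_queries
        fused = self.fuse(torch.cat([prev_queries, cur_queries, diff], dim=-1))
        # Pool over queries to obtain one routing decision for the layer.
        logits = self.classifier(fused.mean(dim=0))
        # argmax over the two classes; class 0 is taken to mean "skip" here.
        return bool(torch.argmax(logits).item() == 0)


def decode_with_routing(decoder_layers, routers, queries, prev_frame_queries, frame_features):
    """Illustrative decoding loop: each router decides whether its layer runs."""
    for layer, router in zip(decoder_layers, routers):
        if router(prev_frame_queries, queries):
            continue  # predicted redundant for this frame; reuse current queries
        queries = layer(queries, frame_features)
    return queries
```

In this sketch, skipping a layer simply reuses the incoming queries, which is how the mechanism would trade computation for speed when adjacent frames are similar; the actual training of the binary classifier and the definition of "optimal queries from the previous frame" follow the paper and are not shown here.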