Skip to yearly menu bar Skip to main content


Poster

Language Decoupling with Fine-grained Knowledge Guidance for Referring Multi-object Tracking

guangyao li · Siping Zhuang · Yajun Jian · Yan Yan · Hanzi Wang


Abstract:

Referring Multi-Object Tracking (RMOT) aims to detect and track specific objects based on natural language expressions. Previous methods typically rely on sentence-level vision-language alignment, often failing to exploit fine-grained linguistic cues that are crucial for distinguishing objects with similar characteristics. Notably, these cues play distinct roles at different tracking stages and should be leveraged accordingly to provide more explicit guidance. In this work, we propose DKGTrack, a novel RMOT method that enhances language comprehension for precise object tracking by decoupling language expressions into localized descriptions and motion states. To improve the accuracy of language-guided object identification, we introduce a Static Semantic Enhancement (SSE) module, which enhances region-level vision-language alignment through hierarchical cross-modal feature interaction, providing more discriminative object representations for tracking. Furthermore, we propose a Motion Perception Alignment (MPA) module that explicitly aligns object queries with motion descriptions, enabling accurate object trajectory prediction across frames. Experimental results on multiple RMOT benchmarks demonstrate the effectiveness of our method, which achieves competitive performance in challenging tracking scenarios.

Live content is unavailable. Log in and register to view live content