Poster
Player-Centric Multimodal Prompt Generation for Large Language Model Based Identity-Aware Basketball Video Captioning
Zeyu Xi · Haoying Sun · Yaofei Wu · Junchi Yan · Haoran Zhang · Lifang Wu · Liang Wang · Chang Wen Chen
Existing sports video captioning methods often focus on the action content yet overlook player identities, limiting their applicability. Although some methods integrate extra information to generate identity-aware descriptions, the predicted identities are sometimes incorrect because that extra information is independent of the video content. This paper introduces a player-centric multimodal prompt generation network for identity-aware sports video captioning (LLM-VC), which focuses on recognizing player identities from a visual perspective. Specifically, an identity-related information extraction module (IRIEM) is designed to extract player-related multimodal embeddings. IRIEM includes a player identification network (PIN) for extracting visual features and player names, and a bidirectional semantic interaction module (BSIM) that links player features with video content for mutual enhancement. Additionally, a visual context learning module (VCLM) is designed to capture key video context information. Finally, the outputs of these modules are integrated into a multimodal prompt for the large language model (LLM), facilitating the generation of descriptions with player identities. To support this work, we construct NBA-Identity, a large identity-aware basketball video captioning dataset with 9,726 videos covering 9 event types. Experimental results on NBA-Identity and VC-NBA-2022 demonstrate that our model achieves superior performance.
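The following is a minimal sketch of how such a multimodal prompt could be assembled, assuming generic tensor shapes and hypothetical module names (the BSIM, VCLM, and PromptBuilder classes here are illustrative stand-ins, not the paper's actual architecture); the player-name text branch and the LLM decoding step are omitted.

```python
# Illustrative sketch: fuse player features with video features and project
# them into a soft prompt for an LLM. Shapes and module choices are assumptions.
import torch
import torch.nn as nn

class BSIM(nn.Module):
    """Bidirectional semantic interaction: player features <-> video content."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.p2v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.v2p = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, player_feats, video_feats):
        # Player tokens attend to video tokens and vice versa (mutual enhancement).
        p, _ = self.p2v(player_feats, video_feats, video_feats)
        v, _ = self.v2p(video_feats, player_feats, player_feats)
        return player_feats + p, video_feats + v

class PromptBuilder(nn.Module):
    """Assembles a multimodal soft prompt from player and video-context tokens."""
    def __init__(self, vis_dim: int, llm_dim: int):
        super().__init__()
        self.bsim = BSIM(vis_dim)
        # Stand-in for the VCLM: a single self-attention encoder layer over video tokens.
        self.vclm = nn.TransformerEncoderLayer(vis_dim, nhead=8, batch_first=True)
        # Maps visual tokens into the LLM embedding space.
        self.proj = nn.Linear(vis_dim, llm_dim)

    def forward(self, player_feats, video_feats):
        p, v = self.bsim(player_feats, video_feats)
        ctx = self.vclm(v)                   # key video-context tokens
        prompt = torch.cat([p, ctx], dim=1)  # player tokens + context tokens
        return self.proj(prompt)             # soft prompt prepended to text embeddings

# Toy usage: 4 detected players, 32 video tokens, 512-d visual features.
builder = PromptBuilder(vis_dim=512, llm_dim=4096)
players = torch.randn(1, 4, 512)   # e.g., outputs of a player identification network
video = torch.randn(1, 32, 512)    # e.g., frame/clip features
soft_prompt = builder(players, video)
print(soft_prompt.shape)           # torch.Size([1, 36, 4096])
```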