Poster
MobileViCLIP: An Efficient Video-Text Model for Mobile Devices
Min Yang · Zihan Jia · Zhilin Dai · Sheng Guo · Limin Wang
Abstract:
Although large models have achieved strong results on a growing number of vision tasks, efficient lightweight neural networks have received increasing attention due to their faster inference speed and easier deployment on mobile devices. However, existing video models still rely on large ViT architectures, and few works attempt to build efficient alternatives. Since many efficient contrastive language-image pre-training (CLIP) models have shown strong zero-shot classification and retrieval capability, we attempt to fill this gap for video-text understanding and propose MobileViCLIP, a fast and efficient video-text model with strong zero-shot reasoning capability that can be deployed on mobile devices. In particular, our MobileViCLIP-Small obtains zero-shot retrieval performance similar to InternVideo2-L14 on the text-to-video dataset MSR-VTT while being 46.7× faster when deployed on a mobile device. Furthermore, MobileViCLIP-Small generalizes to the zero-shot action recognition task and obtains 1.0% better Top-1 accuracy than InternVideo2-S14 while being 5.6× faster on the mobile device.
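For readers unfamiliar with how CLIP-style zero-shot retrieval is scored, the sketch below illustrates the general idea: captions and videos are embedded into a shared space, and retrieval reduces to ranking by cosine similarity. This is a minimal illustration only; the encoders are stand-ins (random tensors), and none of the names correspond to the released MobileViCLIP API.

import torch
import torch.nn.functional as F

# Minimal sketch of CLIP-style zero-shot text-to-video retrieval scoring.
# The embeddings below are placeholders standing in for the outputs of a
# video tower and a text tower (hypothetical here, not MobileViCLIP's API).

torch.manual_seed(0)
embed_dim = 256
num_pairs = 8  # assume caption i describes video i

video_emb = torch.randn(num_pairs, embed_dim)  # stand-in video features
text_emb = torch.randn(num_pairs, embed_dim)   # stand-in caption features

# L2-normalize so dot products become cosine similarities.
video_emb = F.normalize(video_emb, dim=-1)
text_emb = F.normalize(text_emb, dim=-1)

# Text-to-video retrieval: for each caption, rank all videos by similarity.
similarity = text_emb @ video_emb.T                      # (num_pairs, num_pairs)
ranks = similarity.argsort(dim=-1, descending=True)

# Recall@1: fraction of captions whose ground-truth video is ranked first.
r_at_1 = (ranks[:, 0] == torch.arange(num_pairs)).float().mean()
print(f"R@1 = {r_at_1.item():.3f}")

With real encoder outputs in place of the random tensors, the same ranking procedure yields the text-to-video recall metrics reported on benchmarks such as MSR-VTT.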