Poster
Zero-Shot Compositional Video Learning with Coding Rate Reduction
Heeseok Jung · Jun-Hyeon Bak · Yujin Jeong · Gyugeun Lee · Jinwoo Ahn · Eun-Sol Kim
In this paper, we propose a novel zero-shot compositional video understanding method inspired by how young children efficiently learn new concepts and flexibly expand their existing knowledge. While recent large-scale vision-language models (VLMs) have achieved remarkable advances and impressive performance across various tasks, they require massive amounts of data and computation, and despite their high benchmark scores they often fail at simple zero-shot composition tasks. Moreover, VLMs designed for video data demand even greater computational resources. To address these challenges, we introduce a new video representation learning method inspired by human compositional learning. Specifically, we show that zero-shot compositional learning requires representations that disentangle the data into meaningful semantic units, and we propose a method that learns such disentangled representations based on an information-theoretic measure. By optimizing coding rate reduction, we learn spatio-temporally disentangled features from video, one of the most challenging data modalities. Our approach significantly improves compositional generalization, demonstrating its effectiveness in zero-shot learning scenarios.
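The abstract does not spell out the objective, but "coding rate reduction" typically refers to the MCR² principle (Yu et al., 2020): maximize the coding rate of all features jointly while compressing the rate within each semantic group. The following is a minimal PyTorch sketch of that generic criterion, not the authors' implementation; the function names, the epsilon value, and the use of discrete group labels are illustrative assumptions.

```python
import torch

# Coding rate of a feature matrix Z with n samples of dimension d:
#   R(Z) = 1/2 * logdet(I_d + d / (n * eps^2) * Z^T Z)
def coding_rate(Z, eps=0.5):
    n, d = Z.shape
    I = torch.eye(d, dtype=Z.dtype, device=Z.device)
    return 0.5 * torch.logdet(I + (d / (n * eps ** 2)) * (Z.T @ Z))

# Coding rate reduction: expand the whole feature set while compressing
# each semantic group:  Delta R = R(Z) - sum_j (n_j / n) * R(Z_j).
def coding_rate_reduction(Z, labels, eps=0.5):
    n = Z.shape[0]
    reduction = coding_rate(Z, eps)
    for c in labels.unique():
        Z_c = Z[labels == c]
        reduction = reduction - (Z_c.shape[0] / n) * coding_rate(Z_c, eps)
    return reduction  # maximized during training

# Example: 64 normalized frame features of dim 128 in 4 semantic groups.
Z = torch.nn.functional.normalize(torch.randn(64, 128), dim=1)
labels = torch.randint(0, 4, (64,))
loss = -coding_rate_reduction(Z, labels)  # negate to use as a loss
```

Intuitively, the first term pushes the full set of features to span a large volume, while the subtracted per-group terms push features of the same semantic unit into a compact subspace, which is one way to obtain the disentangled semantic units the method relies on.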