

Poster

Zero-Shot Compositional Video Learning with Coding Rate Reduction

Heeseok Jung · Jun-Hyeon Bak · Yujin Jeong · Gyugeun Lee · Jinwoo Ahn · Eun-Sol Kim


Abstract:

In this paper, we propose a novel zero-shot compositional video understanding method inspired by how young children efficiently learn new concepts and flexibly expand their existing knowledge. While recent large-scale vision-language models (VLMs) have achieved remarkable advances and impressive performance across various tasks, they require massive amounts of data and computational resources, and despite their strong benchmark performance they often fail at simple zero-shot composition tasks. Moreover, VLMs designed for video data demand even greater computational resources. To address these challenges, we introduce a new video representation learning method inspired by human compositional learning. Specifically, we demonstrate that zero-shot compositional learning requires representations that disentangle the given data into meaningful semantic units, and we propose a novel method that learns such disentangled representations based on an information-theoretic measure. By optimizing coding rate reduction, we learn spatio-temporally disentangled features from video, one of the most challenging data modalities. Our approach significantly improves compositional generalizability, demonstrating its effectiveness in zero-shot learning scenarios.
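The "coding rate reduction" objective mentioned in the abstract most likely follows the maximal coding rate reduction (MCR²) principle of Yu et al. (2020): maximize the coding rate R(Z, eps) = 1/2 logdet(I + d/(n eps^2) Z^T Z) of the full feature set while minimizing the rate within each semantic group, so that features of different units span near-orthogonal subspaces. The PyTorch sketch below illustrates that objective under this assumption; the function names, the eps default, and the use of discrete labels to define the semantic partition are illustrative, not the authors' implementation.

import torch

def coding_rate(Z: torch.Tensor, eps: float = 0.5) -> torch.Tensor:
    # R(Z, eps) = 1/2 * logdet(I + d / (n * eps^2) * Z^T Z)
    # Z: (n, d) matrix whose rows are n feature vectors of dimension d.
    n, d = Z.shape
    I = torch.eye(d, dtype=Z.dtype, device=Z.device)
    return 0.5 * torch.logdet(I + (d / (n * eps ** 2)) * (Z.T @ Z))

def rate_reduction(Z: torch.Tensor, labels: torch.Tensor, eps: float = 0.5) -> torch.Tensor:
    # Delta R = R(Z) - sum_j (n_j / n) * R(Z_j): expand the coding rate of the
    # whole feature set while compressing the rate within each semantic group.
    n = Z.shape[0]
    compressed = Z.new_zeros(())
    for c in labels.unique():
        mask = labels == c
        compressed = compressed + (mask.sum() / n) * coding_rate(Z[mask], eps)
    return coding_rate(Z, eps) - compressed

# Example usage (synthetic data):
# Z = torch.nn.functional.normalize(torch.randn(256, 64), dim=1)  # unit-norm features
# labels = torch.randint(0, 4, (256,))                            # 4 semantic units
# loss = -rate_reduction(Z, labels)                               # maximize Delta R

Maximizing rate_reduction (i.e., minimizing its negative as a training loss) expands the span of the full feature set while keeping each group compact, which is the disentanglement property the abstract appeals to.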
