Procedural videos are critical for learning new tasks. Temporal action segmentation (TAS), which classifies the action in every video frame, has become essential for understanding procedural videos. Existing TAS models, however, are limited to a fixed set of tasks learned at training time and cannot adapt to novel tasks at test time. We therefore introduce the new problem of Multi-Modal Few-shot Temporal Action Segmentation (MMF-TAS), which aims to learn models that generalize to novel procedural tasks from minimal visual/textual examples. We propose the first MMF-TAS framework by designing a Prototype Graph Network (PGNet). PGNet contains a Prototype Building Block that summarizes action information from support videos of the novel tasks via an Action Relation Graph and encodes this information into action prototypes via a Dynamic Graph Transformer. It then employs a Matching Block that compares the action prototypes with query videos to infer frame-wise action labels. To exploit the advantages of both the visual and textual modalities, we compute separate action prototypes for each modality and combine them through a prediction fusion method that avoids overfitting to either modality. Through extensive experiments on procedural datasets, we show that our method successfully adapts to novel tasks during inference and significantly outperforms baselines.
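To make the described pipeline concrete, below is a minimal, hypothetical PyTorch sketch of the prototype-building, matching, and fusion steps; it is not the authors' implementation. All module names, feature shapes, and design choices (self-attention standing in for the Action Relation Graph / Dynamic Graph Transformer, cosine-similarity matching, late fusion with a weight alpha) are assumptions made for illustration only.

```python
# Hypothetical sketch of an MMF-TAS style pipeline (not the authors' code).
# Assumptions: support features are pooled into one node per action, a
# self-attention layer stands in for the graph-based prototype encoder, and
# frame-wise labels come from prototype matching plus late fusion.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PrototypeBuilder(nn.Module):
    """Builds action prototypes from per-action support features of one modality."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        # Self-attention over action nodes: a simple stand-in for relation modeling.
        self.relation = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, action_feats: torch.Tensor) -> torch.Tensor:
        # action_feats: (num_actions, dim), one pooled feature per support action.
        x = action_feats.unsqueeze(0)        # (1, num_actions, dim)
        x, _ = self.relation(x, x, x)        # exchange information between actions
        return self.proj(x).squeeze(0)       # (num_actions, dim) action prototypes


def match_frames(prototypes: torch.Tensor, query_frames: torch.Tensor) -> torch.Tensor:
    """Frame-wise logits from cosine similarity between query frames and prototypes."""
    q = F.normalize(query_frames, dim=-1)    # (T, dim)
    p = F.normalize(prototypes, dim=-1)      # (num_actions, dim)
    return q @ p.t()                         # (T, num_actions)


def fuse_predictions(vis_logits: torch.Tensor, txt_logits: torch.Tensor,
                     alpha: float = 0.5) -> torch.Tensor:
    """Late fusion of per-modality frame-wise predictions."""
    probs = alpha * vis_logits.softmax(-1) + (1 - alpha) * txt_logits.softmax(-1)
    return probs.argmax(dim=-1)              # (T,) frame-wise action labels


if __name__ == "__main__":
    dim, num_actions, T = 256, 5, 120
    vis_builder, txt_builder = PrototypeBuilder(dim), PrototypeBuilder(dim)
    vis_proto = vis_builder(torch.randn(num_actions, dim))  # from support videos
    txt_proto = txt_builder(torch.randn(num_actions, dim))  # from text descriptions
    query_vis, query_txt = torch.randn(T, dim), torch.randn(T, dim)
    labels = fuse_predictions(match_frames(vis_proto, query_vis),
                              match_frames(txt_proto, query_txt))
    print(labels.shape)  # torch.Size([120])
```

The per-modality prototypes are kept separate until the final prediction step, mirroring the abstract's late-fusion strategy for avoiding over-reliance on a single modality.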