ICCV Poster OphCLIP: Hierarchical Retrieval-Augmented Learning for Ophthalmic Surgical Video-Language Pretraining

Ming Hu · Kun Yuan · Yaling Shen · Feilong Tang · Xiaohao Xu · Lin Zhou · Wei Li · Ying Chen · Zhongxing Xu · Zelin Peng · Siyuan Yan · Vinkle Srivastav · Diping Song · Tianbin Li · Danli Shi · Jin Ye · Nicolas Padoy · Nassir Navab · Junjun He · Zongyuan Ge

Abstract:

Vision-language pretraining (VLP) enables open-world generalization beyond predefined labels, a critical capability in surgery due to the diversity of procedures, instruments, and patient anatomies. However, applying VLP to ophthalmic surgery presents unique challenges, including limited vision-language data, intricate procedural workflows, and the need for hierarchical understanding, ranging from fine-grained surgical actions to global clinical reasoning. To address these challenges, we introduce OphVL, a large-scale, hierarchically structured dataset containing over 375K video-text pairs, making it 15× larger than existing surgical VLP datasets. OphVL captures a diverse range of ophthalmic surgical attributes, including surgical phases, operations, actions, instruments, medications, disease causes, surgical objectives, and postoperative care recommendations. By aligning short clips with detailed narratives and full-length videos with structured titles, OphVL provides both fine-grained surgical details and high-level procedural context. Building on OphVL, we propose OphCLIP, a hierarchical retrieval-augmented VLP framework. OphCLIP leverages silent surgical videos as a knowledge base, retrieving semantically relevant content to enhance narrated procedure learning. This enables OphCLIP to integrate explicit linguistic supervision with implicit visual knowledge, improving ophthalmic workflow modeling. Evaluations across 11 benchmark datasets for surgical phase recognition and multi-instrument identification demonstrate OphCLIP’s robust generalization and superior performance, establishing it as a foundation model for ophthalmic surgery.
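The retrieval step described in the abstract, in which silent surgical videos act as a knowledge base queried for semantically relevant content, can be illustrated with a generic nearest-neighbor lookup over embeddings. The sketch below is not the authors' implementation. It assumes each video is already represented by a fixed-length embedding vector and uses cosine similarity as the relevance score. The names `cosine`, `retrieve_top_k`, and `knowledge_base`, and the toy three-dimensional vectors, are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top_k(query, knowledge_base, k=2):
    """Return the ids of the k knowledge-base entries most similar to the query.

    knowledge_base maps a (hypothetical) silent-video id to its embedding.
    """
    scored = [(cosine(query, emb), vid) for vid, emb in knowledge_base.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [vid for _, vid in scored[:k]]


# Toy embeddings standing in for encoded silent videos (assumed values).
knowledge_base = {
    "silent_video_a": [1.0, 0.0, 0.0],
    "silent_video_b": [0.9, 0.1, 0.0],
    "silent_video_c": [0.0, 1.0, 0.0],
}

# Toy embedding standing in for a narrated clip used as the query.
query = [1.0, 0.05, 0.0]
top_matches = retrieve_top_k(query, knowledge_base, k=2)
print(top_matches)
```

In a retrieval-augmented setup of this kind, the retrieved entries would then be fed back into training alongside the narrated sample. How OphCLIP does this is not specified in the abstract.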
