ICCV 2025 Sunday 10/19

Timezone: Pacific/Honolulu

Schedule Sun Mon Tue Wed Thu

Registration Desk: Registration / Badge Pickup Sun 19 Oct 07:00 a.m.

Tutorial: Yujun Cai · Yiwei Wang · Kai-Wei Chang · Junsong Yuan · Ziwei Liu · Chi Zhang · Jun Liu · Ming-Hsuan Yang

Towards Comprehensive Reasoning in Vision-Language Models

Vision-Language Models (VLMs) have achieved remarkable progress in image captioning and visual question answering, yet developing genuine reasoning capabilities remains an open challenge. Unlike recent breakthroughs in reasoning-focused LLMs, many VLMs still rely primarily on pattern recognition and struggle with compositional logic. This tutorial provides a comprehensive overview of reasoning capabilities in VLMs, focusing on the transition from basic perception to complex inference. We will explore reasoning-oriented prompting and training techniques in multimodal contexts, reasoning-focused benchmarks, and architectural innovations for visual-textual fusion. Through lectures and hands-on demonstrations, participants will gain insights into current capabilities, persistent challenges in compositional generalization and explainability, and practical guidance for implementing reasoning mechanisms. This tutorial uniquely bridges advances in LLM reasoning with the visual domain, addressing the distinct challenges of spatial information processing and providing a roadmap toward more cognitively capable vision-language systems.

Bio s:

Workshop: Interactive Human-centric Foundation Models Sun 19 Oct 08:00 a.m.

Shixiang Tang, Yizhou Wang, Xin Chen, Wanli Ouyang, Shiyao Xu, Jing Liu, Emily Kim, Xiaowei Zhou, Taku Komura, Gül Varol, Nicu Sebe, Wampfler Rafael

While Human-Centric Foundation Models (HFM) excel at perceiving and generating human data, they remain passive, struggling with real-time interaction and adaptation. This limits real-world deployment. The emerging field of Interactive HFM (I-HFM) addresses this by enabling bidirectional engagement. I-HFMs operate across three critical dimensions: (a) interacting with users for intuitive content creation/refinement, (b) interacting with environments to learn and adapt like humans, and (c) interacting with other agents for collaborative task-solving. This interactivity transforms AI from passive models into proactive, human-like agents, bridging the gap towards responsive, socially intelligent AGI that integrates seamlessly into human societies.

Workshop on Benchmarking Multi-Target Tracking: Towards Spatiotemporal Action Grounding in Videos Sun 19 Oct 08:00 a.m.

Tanveer Hannan, Shuaicong Wu, Mark Weber, Suprosanna Shit, Rajat Koner, Jindong Gu, Aljosa Osep, Prof. Dr. Thomas Seidl, Prof. Dr. Laura Leal-Taixé

The 8th BMTT Workshop focuses on action-aware multi-object tracking, aiming to unify temporal action localization and object tracking through natural language queries. While existing benchmarks often address these tasks separately, this workshop presents unified challenges to evaluate both capabilities. Participants are encouraged to develop models that can understand complex actions, follow detailed language instructions, and track multiple objects across time. The workshop aims to close the gap between vision and language, advancing multimodal video understanding and supporting research on scalable, real-world systems capable of fine-grained, action-driven reasoning in dynamic scenes.

Workshop: Vision-based AI for Digital Health: From Pixels to Practics Sun 19 Oct 08:00 a.m.

Hui Zhang, Bojian Ho, Yuanfang Guan, Guangchen Ruan

The workshop aims to unite researchers and practitioners at the intersection of vision-based AI and large language models (LLMs) to advance digital health innovation. By showcasing deep‑learning applications on high‑resolution imaging modalities (e.g., MRI, CT, retinal photography), we’ll explore how early disease detection and automated image review can boost diagnostic accuracy and streamline clinical workflows. We’ll also delve into emerging “Vision + LLM” systems that fuse visual understanding with natural‑language capabilities for automated report generation, intelligent literature retrieval, and interactive decision support. Through presentations and discussions, participants will identify challenges, exchange best practices, and chart pathways toward more personalized, data‑driven care.

2nd Workshop on Explainable Computer Vision: Quo Vadis? Sun 19 Oct 08:00 a.m.

Sukrut Rao, Robin Hesse, Quentin Bouniot, Sweta Mahajan, Amin Parchami-Araghi, Jayneel Parekh, Simone Schaub-Meyer, Florence d'Alché-Buc, Zeynep Akata, Stefan Roth, Bernt Schiele

This workshop aims to examine the state of the field of explainable AI (XAI) for computer vision, with the following goals: (1) discussion and dissemination of ideas at the cutting-edge of XAI research, and (2) a critical introspection on the challenges faced by the community and the way forward. The workshop includes papers, talks on recent advances, and a formal debate among invited speakers on the field’s core issues. We hope to encourage brainstorming in the community to bridge the gap from theory to practice and address challenges brought forth by the rise of large-scale foundation models, such as fundamentally rethinking what one wants from an explanation, obtaining it, performing appropriate evaluations, complying with regulatory requirements, and maintaining model performance.

Workshop: Learning to See: Advancing Spatial Understanding for Embodied Intelligence Sun 19 Oct 08:00 a.m.

Hongyang Li, Philipp Krähenbühl, Kashyap Chitta, Eric Jang, Andrei Bursuc, Huijie Wang

The world is three-dimensional. This fact was first seen by trilobites, the first organisms capable of sensing light. From that moment, nervous systems began to evolve, gradually transforming mere sight into insight, understanding, and action. All these combined gives rise to intelligence. Despite remarkable technological advancements in recent decades, modern embodied systems remain far from achieving full intelligence. They fall short in several key aspects: (i) contain information necessary for physical interaction, such as temporal dynamics of the scene; (ii) have a prior over semantic relevance, and should focus on task-relevant features like objects and their relationships; and (iii) be compact, avoiding the inclusion of irrelevant details, such as background elements. Attempts have been made, including integrating foundational models and utilizing large-scale data. Yet, the path to true intelligence remains long, with significant progress still required.

Workshop: Authenticity & Provenance in the age of Generative AI Sun 19 Oct 08:00 a.m.

Shruti Agarwal, Sarah Barrington, Maty Bohacek, Cristian Canton, Laura Cassani, Hany Farid, Luisa Verdoliva

Generative AI allows for the rapid and automatic generation of highly realistic audio, images, and videos (so-called deepfakes). The field of media forensics and digital provenance focus on detection and authentication of this content, thus helping in mitigating the potential risks. This workshop aims at bringing a heterogeneous group of specialists from academia, industry, and civil society together to discuss emerging threats, technologies, and mitigation strategies. The workshop will focus on the application of tools from computer vision, pattern recognition, and machine learning, as well as the development of novel approaches for verifying the integrity and tracing the origins of digital media, the creation of novel datasets for evaluation, large-scale evaluations of existing forensic techniques, and ethical/policy considerations around generative AI and forensic techniques.

Workshop: Affective & Behavior Analysis in-the-wild Sun 19 Oct 08:00 a.m.

Dimitrios Kollias, Stefanos Zafeiriou, Irene Kotsia, Greg Slabaugh

The ABAW Workshop is a premier platform highlighting the latest advancements in multimodal analysis, generation, modeling, and understanding of human behavior in unconstrained environments. It emphasizes cutting-edge systems that integrate facial expressions, body movements, gestures, natural language, voice to enable impactful research and practical applications. The workshop fosters interdisciplinary collaboration across fields (e.g. computer vision, AI, HCI, psychology, robotics, ethics & healthcare) and is a vital forum for building equitable, generalizable & human-centered AI systems. Finally, the Workshop also includes 3 challenges (Valence-Arousal Estimation, Compound Expression Recognition and Fine-Grained Violence Detection).

3rd Workshop on Computer Vision for Automated Medical Diagnosis Sun 19 Oct 08:00 a.m.

Fuying Wang, Sheng Liu, Qingyue Wei, Yi Lin, Lequan Yu, Yuyin Zhou, Yuzhe Yang, Angelica Aviles-Rivero, Hao Chen, Tingying Peng, Yifan Peng, Atlas Wang

The striding advances of computer vision techniques are revolutionizing many long-standing automatic medical diagnosis tasks. Emerging trends—such as Large Language Models (LLMs), Foundation Models (FMs), advanced learning paradigms (e.g., un-/semi-/self-supervised learning), and considerations of fairness and generalization—remain underexplored for secure and reliable automated medical diagnosis. Distinctly, this workshop emphasizes integrating insights from clinicians and radiologists alongside technical discussions to better advance the field.

2nd Workshop on the Challenge Of Out-of-Label Hazard Detection in Autonomous Driving Sun 19 Oct 08:00 a.m.

Ali K. AlShami, Ryan Rabinowitz, Maged Shoman, Jianwu Fang, Lukáš Picek, Shao-yuan Lo, Steve Cruz, Khang Lam, Jugal Kalita, Terrance E. Boult

The 2nd Workshop on the Challenge Of Out-Of-Label Hazards in Autonomous Driving (2COOOL) focuses on enhancing safety and robustness in autonomous systems by tackling challenges posed by unknown or out-of-distribution objects and behaviors. The workshop brings together researchers and practitioners from academia and industry, including our diverse team of co-organizers, to explore state-of-the-art solutions and techniques across various domains for real-world driving environments. It features expert keynotes, paper presentations, and a Kaggle challenge on generating hazard and accident reports from dashboard cameras. 2COOOL aims to advance the frontier of autonomous driving by fostering innovation to handle unexpected scenarios and build more robust ADAS systems.

Workshop: Computer Vision for Fashion; Art; and Design: Bridging Creativity and Responsible AI Sun 19 Oct 08:00 a.m.

Leonidas Lefakis. Ziad Al-Halah, Negar Rostamzadeh, Thomas Boquet, Julia Lasserre, Loris Bazzani, Ibtihel Amara, Sahar Mbarek, Reza Shirvany

The Computer Vision for Fashion, Art, and Design workshop series aims to to foster interdisciplinary discussions among researchers and practitioners in computer vision and machine learning, as well as artists, designers, sociotechnical researchers, policymakers, social scientists, and other cultural stakeholders. By creating a collaborative space, it aims to address complex challenges that arise at the intersection of generative AI, creativity, and ethics. This year the workshop includes, in addition to multiple invited talks by scientists working in the field, an Art Gallery, and a related Panel discussion.

Workshop: Camera Calibration and Pose Estimation Sun 19 Oct 08:00 a.m.

Zuzana Kukelova, Gabrielle Flood, Viktor Larsson, Torsten Sattler, Akihiro Sugimoto

This workshop focuses on the closely related problems of camera calibration and pose estimation. These are essential for many advanced 3D computer vision methods, including NeRFs, 3D Gaussian splatting, and scene understanding. The quality of these estimates greatly affects performance, yet many researchers treat them as black boxes. This workshop offers an opportunity for those using calibration and pose algorithms to learn about the latest methods and open challenges. It also provides a forum for researchers working on traditional and learning-based solutions to share ideas, improve methods, and expand the possibilities of 3D vision through better calibration and pose estimation.

The 1st International Workshop and Challenge on Disentangled Representation Learning for Real-world Applications Sun 19 Oct 08:00 a.m.

Xin Jin, Qiuyu Chen, Yue Song, Xihui Liu, Shuai Yang, Tao Yang, Ziqiang Li, Jianguo Huang, Yuntao Wei, Ba'ao Xie, Nicu Sebe, Wenjun (Kevin) Zeng

Disentangled Representation Learning shows promise for enhancing AI's fundamental understanding of the world, potentially addressing hallucination issues in language models and improving controllability in generative systems. Despite significant academic interest, DRL research remains confined to synthetic scenarios due to a lack of realistic benchmarks and unified evaluation metrics. DRL4Real Workshop aims to bridge this gap by introducing novel, realistic datasets and comprehensive benchmarks for evaluating DRL methods in practical applications. We will focus on key areas including controllable generation and autonomous driving, exploring how DRL can advance model robustness, interpretability, and generalization capabilities.

Workshop: Systematic Trust in AI Models: Ensuring Fairness; Reliability; Explainability; and Accountability in Machine Learning Frameworks Sun 19 Oct 08:00 a.m.

Arun George Zachariah, Michael Boone, Ryo Hachiuma, Nikki Pope, Shivika Prasanna, Khulud Alsultan

The STREAM Workshop aims to bring together researchers and practitioners working at the intersection of systems design and trustworthy AI. As AI technologies are increasingly deployed in critical domains such as healthcare, finance, and mobility, STREAM focuses on system-level approaches to embedding trustworthiness across the full pipeline, from data collection and architecture design to training, deployment, and evaluation.

The 4th DataCV Workshop and Challenge Sun 19 Oct 08:00 a.m.

Fatemeh Saleh, Liang Zheng, Qiang Qiu, José Lezama, Xin Zhao, Qiuhong Ke, Manmohan Chandraker, Xiaoxiao Sun, Yue Yao, Kevin W. Bowyer, Haiyu Wu

The 4th DataCV Workshop focuses on advancing data-centric perspectives in computer vision, shifting attention from algorithm-centric research to the analysis and understanding of vision datasets. We aim to explore dataset-level properties, representations, and similarities, as well as challenges in bias, fairness, and generalization. Topics include evaluating vision-language models, improving dataset quality through simulation, and reducing reliance on labeled data. The workshop encourages research on how dataset insights can guide model development, performance prediction, and ethical considerations. By fostering discussion and innovation in dataset analysis, DataCV promotes more robust, generalizable, and responsible vision systems.

Workshop: Foundations Models for V2X-Based Cooperative Autonomous Driving Sun 19 Oct 08:00 a.m.

Walter Zimmer, Ross Greer, Max Ronecker, Lars Ullrich, Arpita Vats, Chuheng Wei, Haibao Yu, Rui Song, Jiajie Zhang, Julie Stephany Berrio Perez, Zewei Zhou, Tianhui Cai, Yifan Liu, Haoxuan Ma, Xingcheng Zhou, Rahul Raja, Zhengzhong Tu, Holger Caesar, Alina Roitberg, Guoyuan Wu, Jiaqi Ma, Daniel Watzenig, Mohan Trivedi, Alois Knoll

DriveX explores the integration of foundation models and V2X-based cooperative systems to improve perception, planning, and decision-making in autonomous vehicles. While traditional single-vehicle systems have advanced tasks like 3D object detection, emerging challenges like holistic scene understanding and 3D occupancy prediction require more comprehensive solutions. Collaborative driving systems, utilizing V2X communication and roadside infrastructure, extend sensory range, provide hazard warnings, and improve decision-making through shared data. Simultaneously, Vision-Language Models (VLMs) offer generalization abilities, enabling zero-shot learning, open-vocabulary recognition, and scene explanation for novel scenarios. DriveX aims to bring together experts to explore these technologies, address challenges, and advance road safety.

Workshop on Computer Vision with Single-Photon Cameras Sun 19 Oct 08:15 a.m.

Atul Ingle, Sotiris Nousias, Mian Wei, Mel White

Single-photon cameras are an emerging class of camera technology with the potential to revolutionize the way today’s computer vision systems capture and process scene information, thanks to their extreme sensitivity, high speed capabilities, and increasing commercial availability. These cameras can be used for a wide range of applications: self-driving cars and autonomous robots, high-sensitivity cameras for night photography and fluorescence-guided surgeries, and high dynamic range cameras for industrial machine vision and biomedical imaging applications. This workshop will showcase the myriad ways in which single-photon cameras are used today in computer vision and inspire new unexplored applications.

Instance-Level Recognition and Generation Workshop Sun 19 Oct 08:30 a.m.

Andre Araujo,Bingyi Cao,Kaifeng Chen,Ondrej Chum,Noa Garcia,Guangxing Han,Giorgos Kordopatis-Zilos,Giorgos Tolias,Hao Yang,Nikolaos-Antonios Ypsilantis,Xu Zhang

The Instance-Level Recognition and Generation (ILR+G) Workshop focuses on computer vision tasks that operate at instance-level granularity, covering both recognition (ILR) and generation (ILG). Unlike category-level, ILR identifies and compares specific objects, scenes, or events, enabling open-world applications with a vast number of distinct classes. ILG, or personalized generation, aims to create content while preserving the identity of particular instances. This 7th edition explores potential synergies between ILR and ILG. The workshop features keynote talks by renowned speakers, invited papers, and a call for papers, aiming to bring together researchers working on instance-level tasks and inspire new research and collaborations.

Workshop: Memory and Vision Sun 19 Oct 08:30 a.m.

Zexue He,Jovana Kondic,Dmitry Krotov,Rogerio Feris

Memory is a core aspect of human intelligence, and artificial memory systems have recently seen a resurgence through foundational breakthroughs. At the same time, advances in computer vision, especially through generative AI, have enabled models to synthesize realistic imagery and understand complex scenes with remarkable generalization. Despite their shared relevance to cognition, memory and vision have evolved largely as separate fields. MemVis is the dedicated platform organized around the growing need to unify memory and vision, in the development of intelligent AI systems that can process, store, and recall visual information in a more human-like manner.

The Eighth International Workshop on Computer Vision for Physiological Measurement (CVPM) Sun 19 Oct 08:30 a.m.

Daniel McDuff, Wenjin Wang, Sander Stuijk, Tim Marks, Hassan Mansour, Vineet R. Shenoy

The Eighth International Workshop on Computer Vision for Physiological Measurement (CVPM) is the top venue for research on computer vision methods for measuring and modeling physiological processes. The goal of the workshop is to bridge the disciplines of computer vision and biomedical science and help effectively translate advances in AI into practice.

Workshop: Geometry-Free Novel View Synthesis and Controllable Video Models Sun 19 Oct 08:30 a.m.

Andrea Tagliasacchi, Sherwin Bahmani, Despoina Paschalidou, David Lindell, Konstantinos Derpanis, Marcus Brubaker, Boyang Deng, Haven (Haiwen) Feng, Qianqian Wang, Siyu Tang, Leonidas Guibas

This workshop focuses on recent advances in video generative models and their applications in 3D and 4D generation and reconstruction. Topics include camera- and motion-controlled video synthesis, large-scale 3D/4D reconstruction, neural rendering, and generative model-guided pipelines. A central focus is geometry-free novel view synthesis with video diffusion models, enabling spatial control without explicit 3D geometry. The program also covers the distillation of temporal models into spatial representations. By highlighting these developments, the workshop aims to chart the path toward more controllable, photorealistic, and efficient generative pipelines that unify video generation with 3D and 4D reconstruction.

3rd Workshop on Vision-based InduStrial InspectiON Sun 19 Oct 08:30 a.m.

Shancong Mou, Hao Yan, Zirui Liu, Juan Du, Gokberk Cinbis, Wan Wang

The VISION workshop will provide a platform for the exchange of scholarly innovations and emerging practical challenges in Vision-based Industrial Inspection. Through a series of keynote talks, technical presentations, and challenge competitions, this workshop aims to (i) bring together researchers from the interdisciplinary research communities related to computer vision-based inspection; and (ii) connect researchers and industry practitioners to synergize recent research progress and current needs in industrial practice.

Tutorial: Zhiyu Huang · Zewei Zhou · Zhihao Zhao

Beyond Self-Driving: Exploring Three Levels of Driving Automation

Self-driving technologies have demonstrated significant potential to transform human mobility. However, single-agent systems face inherent limitations in perception and decision-making capabilities. Transitioning from self-driving vehicles to cooperative multi-vehicle systems and large-scale intelligent transportation systems is essential to enable safer and more efficient mobility. Realizing such sophisticated mobility systems introduces significant challenges, requiring comprehensive tools and models, simulation environments, real-world datasets, and deployment frameworks. This tutorial will delve into key areas of driving automation, beginning with advanced end-to-end self-driving techniques such as vision-language-action (VLA) models, interactive prediction and planning, and scenario generation. The tutorial emphasizes V2X communication and cooperative perception in real-world settings, as well as datasets including V2X-Real and V2XPnP. It also covers simulation and deployment frameworks for urban mobility, such as MetaDrive, MetaUrban, and UrbanSim. By bridging foundational research with real-world deployment, this tutorial offers practical insights into developing future-ready autonomous mobility systems.

Bio s:

Zhiyu Huang is a postdoctoral scholar at the UCLA Mobility Lab. He earned his Ph.D. in Mechanical and Aerospace Engineering from Nanyang Technological University (NTU), where he conducted research in the AutoMan Lab. He has also worked as a research intern at NVIDIA Research's Autonomous Vehicle Group and as a visiting student researcher at UC Berkeley's Mechanical Systems Control (MSC) Lab. His research interests lie at the intersection of robotics, mobility, and artificial intelligence. He focuses on deep learning, reinforcement learning, and generative AI, with applications in perception, prediction, decision-making, and simulation for autonomous driving and human-machine interaction. Dr. Huang aims to develop intelligent systems that make human-centered, adaptive decisions for safer and more efficient transportation.

Workshop: RetailVision6 - Revolutionizing the World of Retail Sun 19 Oct 08:45 a.m.

Ehud Barnea,Yosi Keller,Marina Paolanti,Sean Ma,Austen Groener,Weijian Li,Quanfu Fan,Rocco Pietrini

Recent advances in computer vision have significantly impacted the retail sector, introducing new opportunities and challenges across both physical and online domains. This workshop explores key problems such as shopper-product interaction, fine-grained recognition of visually similar and frequently changing products, and large-scale visual search across over 100,000 product classes. It also showcases advancements in generative models for tasks like product image synthesis and virtual try-on. These are just some of the challenges in the retail domain. By highlighting recent progress and open research directions, the workshop aims to bring together researchers and practitioners to advance the state of computer vision in retail.

Workshop: The Challenge of Detecting Synthetic Manipulations in ID Documents Sun 19 Oct 08:50 a.m.

Pavel Korshunov,Nevena Shamoska,Magdalena Połać,Vedrana Krivokuca,Vidit,Amir Mohammadi,Christophe Ecabert,Sébastien Marcel

The DeepID challenge aims to advance state of the art in detection of digitally manipulated ID documents. Recent increase in fraudulent attempts to bypass know your customer (KYC) services with generated or manipulated images of ID documents calls for automated robust detection methods. In this challenge, we provided participants with a training dataset of fantasy ID cards containing both bona fide and manipulated samples (faces swapped and text inpainted). For evaluation, we created a separate test set of fantasy ID cards and also a private 20K set of real world ID documents with genuine bona fide and digitally manipulated versions. We evaluated the docker submissions from more than 25 participated teams using an air gapped machine on the two datasets. The workshop will feature two keynote talks from renown researchers in media forensics and the top winning teams of the challenge.

Workshop: Structural Priors for Vision Sun 19 Oct 08:50 a.m.

Sangwoo Mo,Congyue Deng,Hila Chefer,Daniel Zoran,Kaichun Mo,Leonidas Guibas,Stella Yu

In recent years, there has been a growing trend toward training data-centric, large-scale foundation models that reduce reliance on structural priors. However, is simply scaling up Transformers truly the ultimate solution for computer vision? In this workshop, we aim to reintroduce structural priors and explore how they can further push the boundaries of foundation models. Our workshop provides an interdisciplinary space for sharing ideas across domains. For example, scene-aware 2D perception can enhance 3D modeling and robotic manipulation, while geometric reasoning can enhance the visual grounding of 2D perception and multimodal models. Through these interactions, we aim to better define the role of priors in vision foundation models.

Workshop: Story-Level Movie Understanding and Audio Description Sun 19 Oct 08:50 a.m.

Junyu Xie, Ridouane Ghermi, Tengda Han, Max Bain, Arsha Nagrani, Gül Varol, Weidi Xie, Vicky Kalogeiton, Ivan Laptev, Andrew Zisserman

The SLoMO workshop brings together researchers focused on the understanding of long-form, edited videos—such as movies and TV episodes. We spotlight two central research directions: (i) Audio Description (AD) Generation: This track explores the generation of concise and coherent descriptions that complement the original audio for blind and visually impaired (BVI) audiences. We have invited four leading experts in movie understanding and AD generation to share their insights and recent advancements in the field. (ii) Movie Question Answering: This track evaluates models’ capabilities in narrative comprehension, emphasizing story-level understanding. As part of this effort, we host the Short-Films 20K (SF20K) Competition, which aims to drive progress in story-level video understanding using the newly introduced SF20K dataset.

Workshop: Generative AI for Audio-Visual Content Creation Sun 19 Oct 08:55 a.m.

Masato Ishii,Takashi Shibuya,Yuki Mitsufuji,Ho Kei Cheng,Alexander Schwing,Prem Seetharaman,Oriol Nieto,Justin Salamon,David Bourgin,Bryan Russell,Ziyang Chen,Sanjoy Chowdhury

Seamless integration of audio and visual elements is crucial for creating immersive and engaging content. Audio-visual generation, involving the synthesis of one modality from the other or both jointly, has become a key research area. This capability holds significant potential for applications like virtual reality, gaming, film production, and interactive media, using advanced generative models to enhance multimedia quality and realism. This workshop highlights the growing importance of audio-visual generation in modern content creation, bringing together researchers and practitioners from academia and industry to explore the latest advances, challenges, and emerging opportunities in this dynamic field.

Workshop: Multispectral Imaging for Robotics and Automation Sun 19 Oct 09:00 a.m.

Shiho Kim, Yagiz Nalcakan, Rui Fan, Kailun Yang, Yalın Baştanlar, Ömer Şahin Taş, Jun Won Choi, Ukcheol Shin, Michal Kovac

The Multispectral Imaging for Robotics and Automation (MIRA) workshop brings together researchers and practitioners at the intersection of multispectral imaging, computer vision, and robotics. By leveraging data beyond the visible spectrum, multispectral imaging enables robust perception in challenging conditions, supporting applications from autonomous driving and industrial inspection to agricultural automation and search and rescue. MIRA aims to foster interdisciplinary collaboration across academia and industry, highlighting advances in sensor technology, spectral image processing, and downstream tasks like detection, segmentation, and decision-making. We welcome contributions exploring novel methods, applications, and datasets that advance the state of multispectral robotics.

Workshop on Biomedical Image and Signal Computing for Unbiasedness; Interpretability; and Trustworthiness Sun 19 Oct 09:00 a.m.

Romeo Lanzino, Bardh Prenkaj, Joanna Materzynska, Ananya Joshi, Silvia Zottin, Axel De Nardin, Tsui-Wei (Lily) Weng, Fabio Galasso, Gian Luca Foresti, Luigi Cinque, Roberto Cipolla

This workshop focuses on the foundational challenges of building AI systems that are unbiased, interpretable, and trustworthy. It aims to uncover the origins of algorithmic and data bias, advance the science of interpretability, and explore rigorous evaluation methods to ensure AI reliability. By bringing together researchers across biomedical imaging and signal processing, the workshop highlights novel methodologies and theoretical insights, emphasizing UIT as a scientific discipline rather than just an application concern. The event will showcase recent advances and foster discussions on future directions for inherently fair and transparent AI systems.

Tutorial: Huaizu Jiang · chuan guo · Yiming Xie · Lingjie Liu · Zhiyang Dou

3D Human Motion Generation and Simulation

3D human motion generation and simulation is an important area of research with applications in virtual reality, gaming, animation, robotics, and AI-driven content creation. Recent advances in deep learning have made it possible to automate motion generation, reducing the need for expensive motion capture and manual animation. Techniques such as diffusion models, generative masking, and variational autoencoders (VAEs) have been used to synthesize diverse and realistic human motion. Transformer-based models have improved the ability to capture temporal dependencies, leading to smoother and more natural movement. In addition, reinforcement learning and physics-based methods have helped create physically consistent and responsive motion, which is useful for applications like robotics and virtual avatars. This tutorial will bridge the gap between computer vision, graphics, and robotics, providing a comprehensive guide to the latest methods, practical applications, and future challenges. This tutorial will be organized into six core parts, guiding you from foundational knowledge to advanced research frontiers: (1) Human Motion Generation Basics: introducing fundamentals, key concepts and data representations; (2) Kinematic-Based Generation Methods: explore popular data-driven techniques that learn from motion capture datasets to produce lifelike animations; (3) Physics-Based Generation Methods: dive into methods that use reinforcement learning and physics simulations to create physically consistent and responsive motion; (4) Controllability of Human Motion Generation: learn how to direct and control motion synthesis using inputs like text, audio, or specific goals; (5) Human-Object/Human/Scene Interactions: cover advanced scenarios involving complex interactions with objects, other people, and the surrounding environment, and (6) Open Research Problems: discussing the major unsolved challenges and exciting opportunities for future work in the field.

Bio s:

Zhiyang Dou

I am a Ph.D. student at MIT CSAIL, supervised by Prof. Wojciech Matusik. I will be affiliated with the Computational Design and Fabrication Group and the Computer Graphics Group. I will obtain my MPhil degree in the Computer Graphics Group at The University of Hong Kong, supervised by Prof. Taku Komura. I received my B.Eng. degree with honors from Shandong University, advised by Prof. Shiqing Xin. I was a visiting scholar at the University of Pennsylvania, working with Prof. Lingjie Liu at the Graphics Lab and GRASP Lab. I also collaborate closely with Prof. Cynthia Sung in the Department of Mechanical Engineering and Applied Mechanics at UPenn. Research Interests: Computer Graphics, Character Animation, Geometric Modeling and Processing, Simulation, Human Behavior Analysis (Capture, Modeling and Simulation).

Workshop: Binocular Egocentric-360 Multi-modal Scene Understanding in the Wild Sun 19 Oct 09:00 a.m.

Jianbo Jiao, Shangzhe Wu, Dylan Campbell, Yunchao Wei, Lu Qi, Yasmine Mellah, Aleš Leonardis, Chenyuan Qu, Han Hu, Qiming Huang, Hao Chen

This workshop mainly looks at multi-modal scene understanding and perception in a human-like manner. Specifically, we will focus on binocular/stereo egocentric and 360° panoramic perspectives, which measure both first-person views and third-person panoptic views, mimicking a human in the scene, by combining with multi‑modal cues such as spatial audio, textual descriptions, and geo‑metadata. This workshop will cover but not be limited to the following topics: Embodied 360° scene understanding & egocentric visual reasoning; Multi-modal scene understanding; Stereo Vision; Open‑world learning & domain adaptation.

Workshop: The Third Perception Test Challenge Sun 19 Oct 09:00 a.m.

Joe Heyward, Nikhil Parthasarathy, Joao Carreira, Dima Damen, Andrew Zisserman, Viorica Patraucean, Eunice Yiu, Shiry Ginosar, Saman Motamed, Priyank Jaini

The 3rd Perception Test challenge comprehensively evaluates the perception capabilities of large multimodal models using the Perception Test benchmark. This year, novel tracks unify diverse tasks under common interfaces: joint object/point tracking, joint action/sound localisation, and unified multiple-choice videoQA (integrating non-semantic tasks via inpainted queries). A new VLM interpretability track is included to investigate model strengths and failures. Guest tracks cover image understanding (KiVA) and video generation (Physics-IQ). Our workshop provides a venue to evaluate all foundation vision models—discriminative, generative, image- or video-based. Prizes up to 50k EUR are available.

Embedded Vision Workshop Sun 19 Oct 09:00 a.m.

Tse-Wei Chen, Branislav Kisacanin, Ahmed Nabil Belbachir, Marius Leordeanu

Embedded vision is an active field of research, bringing together efficient learning models with fast computer vision and pattern recognition algorithms, to tackle many areas of robotics and intelligent systems that are enjoying an impressive growth today. Such strong impact comes with many challenges that stem from the difficulty of understanding complex visual scenes under the tight computational constraints required by real-time solutions on embedded devices. The Embedded Vision Workshop will provide a venue for discussing these challenges by bringing together researchers and practitioners from the different fields outlined above.

Workshop: Multi-modal Localization and Mapping Sun 19 Oct 09:00 a.m.

Timothy D Barfoot, Luca Carlone, Daniel Cremers, Frank Dellaert, Ayoung Kim, Yan Xia, Niclas Zeller

Multi-modal Localization and Mapping is an essential component of computer vision, with diverse applications in fields such as autonomous robotics, augmented reality, and beyond. This workshop aims to unite researchers, practitioners, and enthusiasts to explore the latest advancements, challenges, and innovations in multi-modal localization and mapping. By leveraging information from various sensors (e.g. camera, IMU, LiDAR, radar, and language), multi-modal approaches can significantly enhance localization and mapping accuracy in complex environments.

2nd Beyond Euclidean Workshop: Hyperbolic and Hyperspherical Learning for Computer Vision Sun 19 Oct 09:00 a.m.

Georgios Leontidis, Aiden Durrant, Fabio Galasso, Michael Kampffmeyer, Pascal Mettes, Leyla Mirvakhabova, Adín Ramírez Rivera, Indro Spinelli, Stella Yu

Within deep learning, Euclidean geometry is the default basis for deep neural networks, yet the naive assumption that such a topology is optimal for all data types and tasks does not necessarily hold. A growing body of evidence suggests that data and the representations we aim to learn can be better captured through learning in corresponding geometries that exhibit non-Euclidean structures. Interest in non-Euclidean deep learning has grown dramatically in recent years, driven by advancing methodologies, libraries, and applications. The 2nd Beyond Euclidean workshop brings together computer vision researchers and keynote speakers who share an interest in exploring non-Euclidean geometry.

Workshop: Foundation Data for Industrial Tech Transfer Sun 19 Oct 09:00 a.m.

Yoshihiro Fukuhara, Hirokatsu Kataoka, Püren Güler, Shunsuke Kitada, Xavier Boix, Dan Hendrycks, Keisuke Tateno, Shinichi Mae, Tatsuya Komatsu, Nishant Rai, Ryo Nakamura, Risa Shinoda, Takahiro Itazuri, Yoshiki Kubotani, Guarin Flück, Wadim Kehl, Kazuki Kozuka, Philipp Wirth

Recently, transformer-based foundation models have excelled across a wide range of recognition and generation benchmarks, yet real industrial impact requires robust tech transfer. Adapting them to heterogeneous industries demands domain-specific fine-tuning, reliable MLOps, and abundant, high-quality data. Conventional IID benchmarks are increasingly saturated, prompting evaluations that probe out-of-distribution and long-tail behavior. Both challenges hinge on curating and exploiting broader, deeper — “Foundation Data.” This workshop gathers academia and industry to examine methods for constructing high-quality datasets, refine model-adaptation pipelines, and design novel evaluation tasks grounded in Foundation Data, aiming to unlock new horizons in AI research and application.

Joint Workshop on Marine Vision Sun 19 Oct 09:00 a.m.

David Nakath, Malte Pedersen, Alexandra Branzan Albu, Anthony Hoogs, Derya Akkaynak, Maia Hoeberechts, Kevin Köser, Thomas B. Moeslund, Joakim B. Haurum, Justin Kay, Rupa Kurinchi-Vendhan

This workshop is organized as a collaboration between the 6th Workshop on Computer Vision for Analysis of Underwater Imagery (CVAUI) and the 3rd Automated Analysis of Marine Visual Data for Environmental Monitoring (AAMVEM). Visually monitoring marine environments poses a vastly different task compared to monitoring terrestrial environments. It is physically challenging to acquire underwater data, and the data typically have low signal-to-noise ratios due to the scattering nature of the water body. The aim of this workshop is to deepen the understanding of the challenges related to marine monitoring and to advance computer vision techniques to address them.

Workshop: Driving Simulation from Real-World Data: How Well Can We Render and Drive? Sun 19 Oct 09:00 a.m.

Yiyi Liao, Hongyu Zhou, Yichong Lu, Bingbing Liu, Hongbo Zhang, Jiansheng Wei, Ziqian Ni, Yiming Li, Andreas Geiger

This workshop brings together researchers in autonomous driving, computer vision, and graphics to advance the development of real-world data-driven driving simulators, as well as the autonomous driving algorithms in these photorealistic simulation environments. By tackling novel view synthesis and closed-loop autonomy in photorealistic simulations, we aim to push scalable, high-fidelity simulation forward. To promote community engagement and benchmarking, we also host two challenges: extrapolated novel view synthesis for urban scenes and closed-loop evaluation in photorealistic simulators.

The 2nd AI for Visual Arts Workshop and Challenges Sun 19 Oct 09:00 a.m.

Deblina Bhattacharjee, Bingchen Zhao, Rahul Raja

The AI4VA workshop at ICCV explores the intersection of artificial intelligence and the visual arts, including art, design, exhibitions, photography, and film. It brings together artists, art historians, ethicists, and researchers to foster cross-disciplinary innovation. Topics include generative art, AI for art history, 3D reconstruction from artworks, human pose estimation in art, VQA and captioning for artworks, multimodal interaction, AR/VR for art, and multimedia content analysis. A key aim is fostering participation across diverse creators and researchers. A special focus is AI for Cultural and Artistic Heritage, highlighting advances in analysing, restoring, and interpreting artefacts using multimodal AI across visual, textual, and historical data.

Workshop: Human-Interactive Generation and Editing Sun 19 Oct 09:00 a.m.

Xi Chen, Shaoteng Liu, Jinbo Xing, Xin Yu, Yuanhao Cai, Tianyu Wang, Xiaojuan Qi, Hengshuang Zhao, Scott Cohen, Radu Timofte, Alan Yuille, Zhe Lin

The rapid evolution of generative AI has reshaped content creation across images, video, and 3D/4D visuals. This workshop focuses on cutting-edge methodologies, practical applications, and open challenges in image/video/3D/4D generation and related editing tasks with an emphasis on flexible and friendly human interactions and multi-modal control signals. This workshop will serve as a platform for researchers and practitioners to discuss key topics related to visual content creation and editing with versatile interactions.

Tutorial: Qing Qu · Zhihui Zhu · Sam Buchanan · Liyue Shen · Peihao Wang · Yi Ma

Learning Deep Low-Dimensional Models from High-Dimensional Data: From Theory to Practice

Over the past decade, the advent of deep learning and large-scale computing has immeasurably changed the ways we process, interpret, and predict with data in imaging and computer vision. The ``traditional'' approach to algorithm design, based around parametric models for specific structures of signals and measurements---say sparse and low-rank models---and the associated optimization toolkit, is now significantly enriched with data-driven learning-based techniques, where large-scale networks are pre-trained and then adapted to a variety of specific tasks. Nevertheless, the successes of both modern data-driven and classic model-based paradigms rely crucially on correctly identifying the low-dimensional structures present in real-world data, to the extent that we see the roles of learning and compression of data processing algorithms---whether explicit or implicit, as with deep networks---as inextricably linked. As such, this tutorial provides a timely tutorial that uniquely bridges low-dimensional models with deep learning in imaging and vision. This tutorial will show how (i) these low-dimensional models and principles provide a valuable lens for formulating problems and understanding the behavior of modern deep models in imaging and computer vision, and (ii) how ideas from low-dimensional models can provide valuable guidance for designing new parameter efficient, robust, and interpretable deep learning models for computer vision problems in practice. The tutorial will start by introducing fundamental low-dimensional models (e.g., basic sparse and low-rank models) with motivating computer vision applications. Based on these developments, we will discuss strong conceptual, algorithmic, and theoretical connections between low-dimensional structures and deep models, providing new perspectives to understand state-of-the-art deep models in terms of learned representations and generative models. Finally, we will demonstrate that these connections can lead to new principles for designing deep networks and learning low-dimensional structures in computer vision, with both clear interpretability and practical benefits.

Bio s:

Qing Qu

Dr. Qu received my B.E. degree from Tsinghua University, Beijing, China, in 2011, and obtained his Ph.D. degree from Columbia University with Prof. John Wright in 2018. He was a Moore-Sloan fellow at NYU Center for Data Science from 2018 to 2020. His work has been recognized by a couple of awards, including a Microsoft Ph.D. Fellowship in machine learning in 2016, an NSF Career Award in 2022, an Amazon AWS AI Award in 2023, a UM CHS Junior Faculty Award in 2025, and a Google Research Scholar Award in 2025. He was one of the founding organizers of the Conference on Parsimony and Learning (CPAL), area chairs of ICML, NeurIPS, and ICLR, and action editor of TMLR.

Peihao Wang

I am a PhD student at the Department of Electrical and Computer Engineering, The University of Texas at Austin. I am doing scientific research in the areas of machine learning and computer vision, under the supervision of Prof. Atlas Wang. My research investigates the underlying geometric, symbolic, and topological structures of data, physics, and inference to jointly design theoretically sound neural architectures, training paradigms, and reasoning algorithms for language models, generative models, and physically grounded computer vision.

Tutorial: Marcos Conde · Radu Timofte

A Tour Through AI-powered Photography and Imaging

Computational Photography and low-level vision are pivotal research areas within Computer Vision, significantly impacting both academia and industry. Despite their importance, progress in these fields often lags behind areas like generative AI, primarily due to the scarcity of standardized datasets, clear benchmarks, and limited transparency from camera manufacturers. This tutorial bridges the gap between academic research and industry applications by providing an in-depth, hands-on exploration of computational photography and imaging using deep learning. Collaboratively presented by leading academic researchers and prominent industry experts from Sony, this tutorial systematically covers learned Image Signal Processors (ISPs), cutting-edge transformer and convolutional neural network architectures for image restoration and enhancement, and the development of realistic synthetic data generation pipelines. Attendees will acquire practical skills in dataset creation, realistic pipeline simulation, and evaluation protocols, empowering them with the tools and insights needed to accelerate innovation in this field.

Bio s:

Tutorial: Jiawen Zhu · Chengjie Wang · Guansong Pang · Peng Wu

Foundation Models in Visual Anomaly Detection: Advances, Challenges, and Applications

In recent years, foundation models have emerged as transformative tools in computer vision, offering powerful zero-shot and few-shot learning capabilities across a wide range of tasks. Their integration into visual anomaly detection—a critical and high-stakes field spanning healthcare, industrial inspection, security, and autonomous systems—has opened new frontiers in both research and real-world applications. This tutorial aims to deliver a comprehensive and timely overview of the role of foundation models in visual anomaly detection. We will cover multiple visual modalities, including 2D images, 3D images, and videos—each presenting unique challenges and necessitating modality-specific solutions. Specifically, we will delve into the entire pipeline, from data (pre-)training and prompt engineering to methodological innovations, inference strategies, and deployment in real-world environments. Key topics include zero- and few-shot learning, pseudo-labeling, anomaly generation, and multi-modal alignment between vision and language. To facilitate a deep and practical understanding of these areas, the tutorial will bring together leading experts from both academia and industry. Through in-depth technical presentations and discussions, participants will gain valuable insights into the latest advances, real-world applications, and open challenges shaping this rapidly evolving field.

Bio s:

Workshop on Safe and Trustworthy Multimodal AI Systems Sun 19 Oct 09:00 a.m.

Carlos Hinojosa, Yinpeng Dong, Adel Bibi, Jindong Gu, Yichi Zhang, Wenxuan Zhang, Lama Alssum, Andres Villa, Juan Carlos L. Alcazar, Chen Zhao, Lingjuan Lyu, Mohamed Elhoseiny, Bernard Ghanem, Philip Torr

Multimodal systems are transforming AI by enabling models to understand and act across language, vision, and other modalities, driving advances in robotics, autonomous driving, and scientific discovery. However, these capabilities raise serious safety and trustworthiness concerns, as traditional safeguards often fall short in multimodal contexts. The Workshop on Safe and Trustworthy Multimodal AI Systems (SaFeMM-AI) at ICCV 2025 brings together the computer vision community to address challenges including hallucinations, privacy leakage, and jailbreak vulnerabilities, and to promote the development of safer, more robust, and reliable multimodal models that can handle unsafe or adversarial inputs and consistently produce trustworthy outputs.

Workshop: Computer Vision for Developing Countries Sun 19 Oct 09:00 a.m.

Cuong Dao, Du Tran, Tuan-Anh Vu, Williem, Siddhartha Gairola, Ujjwal Verma, Vannkinh Nom

The Computer Vision for Developing Countries (CV4DC) workshop aims to create a supportive environment where students/researchers in the field of computer vision and related areas in AI can connect with each other, share their latest work, and expand their network for potential future collaborations and mentorships. This workshop empowers students and researchers from underrepresented, developing countries by providing opportunities to network, learn from the field experts, and share their work. We believe giving opportunities to students and researchers from lesser-known countries will foster diversity in computer research, leading to richer and more innovative contributions to the field.

Artificial Social Intelligence Workshop Sun 19 Oct 09:15 a.m.

Fiona Ryan, Leena Mathur, Anshul Gupta, Evonne Ng, Shiry Ginosar, Sangmin Lee, Paul Liang, Judy Hoffman, James M. Rehg, Louis-Philippe Morency

Humans use social intelligence to interpret and navigate interactions with other people and agents in our shared world. As AI systems become pervasive in human social situations, it is crucial to improve the social intelligence of these systems in order for them to seamlessly work with, for, and around humans. This workshop aims to bring together researchers from computer vision and other communities to collaborate towards building computational foundations for core social intelligence abilities. This edition of our workshop centers discussions, keynotes, and paper presentations around the topics of reasoning, multimodality, and embodiment in socially-intelligent AI.

Tutorial: Yangguang Li · Angela Dai · Minghao Chen · Zhaoxi Chen

Foundation Models for 3D Asset Synthesis.

In recent years, thanks to the continuous innovation and progress of diffusion technology, significant advancements have been made in image and video generation. By inputting textual descriptions or images, we can generate high-quality images or videos, which greatly enhance creative efficiency and imagination. However, progress in the 3D generation field has been relatively slow. Initially, optimization routes, represented by DreamFusion, were explored. This was followed by the exploration of reconstruction routes, such as LRM. It was only later that diffusion based on 3D generation techniques, similar to those in image and video generation, were gradually developed.In addition, based on the token-by-token prediction form similar to LLM, 3D generation based on autoregressive method has gradually made significant progress. Therefore, this tutorial focuses on the topic of 3D asset generation using diffusion and autoregression, specifically including: (1) Geometry generation modeling based on the diffusion paradigm; (2) Geometry generation modeling based on the autoregression paradigm; (3) Texture generation modeling based on the diffusion paradigm.

Bio s:

Workshop: Short-Form Video Understanding: The Next Frontier in Video Intelligence Sun 19 Oct 01:00 p.m.

Uttaran Bhattacharya,Ishita Dasgupta,Mehrab Tanjim,Chen-Yi Lu,Kunjal Panchal,Dinesh Manocha

Short-form videos (SVs) have proliferated as primary sources for entertainment, information, advertising, and social communication. Marketers are increasingly turning to SVs to reach their customers, and creative artists have begun to view SVs as a separate form of art and media for designing their content. Currently, SVs account for 90% of internet traffic and are estimated to be about 2.5 times more engaging than longer videos, driving their widespread popularity and diversity. Our workshop aims to consolidate efforts in SV understanding, highlight specific challenges, map the research landscape, and establish a foundation for future development in this rapidly expanding domain.

Tutorial: Changhoon Kim · Yezhou Yang · Sijia Liu

Responsible Vision-Language Generative Models

Vision-language generative models, such as text-to-image and image-to-text systems, have rapidly transitioned from research prototypes to widely deployed tools across domains like education, journalism, and design. However, their real-world adoption has introduced critical challenges surrounding robustness, controllability, and ethical risks—including issues like prompt misalignment, unauthorized content generation, adversarial attacks, and data memorization. This tutorial provides a comprehensive overview of these concerns and emerging solutions by covering recent advances and failure modes in state-of-the-art models, robust concept erasure techniques in diffusion models, and adversarial vulnerabilities and defenses in image-to-text systems. Through a blend of theoretical foundations, participants will examine failure scenarios, explore attack and defense strategies, and gain practical insights into enhancing the trustworthiness of multimodal generative models. Designed for researchers and practitioners in vision, language, and AI safety, this tutorial uniquely focuses on the responsible deployment of these models—bridging technical rigor with societal impact and offering guidance for future research directions in secure and reliable generative AI.

Bio s:

Changhoon Kim is an Assistant Professor at Soongsil University, specializing in robust and reliable AI. Previously, he was a Postdoctoral Scientist with the AWS Bedrock Science team, where he focused on enhancing the robustness of generative models. He completed his Ph.D. in Computer Engineering at Arizona State University under the advisement of Professor Yezhou Yang. His primary research focuses on the creation of secure machine learning systems. He has dedicated his efforts to developing user-attribution methods for generative models, a critical area of research in the age of AI-generated hyper-realistic content for tracing malicious usage, and machine UNlearning for removing private or harmful content from AI models. Kim's pioneering research has been recognized at prestigious conferences such as ICLR, ICML, ECCV, and CVPR. Additionally, he holds a U.S. patent for user-attribution in generative models. To further contribute to the community, he organizes tutorials and workshops at leading conferences to emphasize the importance of secure generative AI.

Workshop: Multi-Modal Foundation Models for Cancer Detection and Prevention Sun 19 Oct 01:00 p.m.

Ali Diba, Biagio Brattoli, Thijs Kooi, Tae Soo Kim, Sergio Pereira, Donggeun Yoo, Kayhan Batmanghelich, Yun Liu, Daniel Golden, Pranav Rajpurkar, Eun Kyoung Hong, Zelda Mariet, Shekoofeh Azizi

This workshop explores how multi-modal foundation models can revolutionize cancer care by integrating AI, computer vision, and machine learning. By leveraging diverse data types—such as medical imaging, genomics, and EHRs—these models enable earlier detection, personalized treatment, and better outcome prediction. Pre-trained on large datasets and fine-tuned for specific tasks, they offer adaptability across cancer types and clinical settings. The event brings together experts from academia, industry, and healthcare to share research, tackle challenges in data integration and model interpretability, and promote clinical translation. The goal is to advance cancer research and accelerate the real-world impact of AI in oncology.

Tutorial: Shaohui Liu · Anusha Krishnan · Jakob Engel · Marc Pollefeys

Benchmarking Egocentric Visual-Inertial SLAM at City Scale

Simultaneous localization and mapping (SLAM) is a fundamental technique with applications spanning robotics, spatial AI, and autonomous navigation. It addresses two tightly coupled challenges: localizing the device while incrementally building a coherent map of the surroundings. Localization, or positioning, involves estimating a 6 Degrees-of-Freedom (6-DoF) pose for each image in a continuous sequence, typically aided by other sensor data, while mapping involves constructing an evolving representation of the surrounding environment. This tutorial specifically addresses the task of accurate positioning for large-scale egocentric data using visual-inertial SLAM and odometry (VIO). It offers a comprehensive overview of the challenges faced by VIO/SLAM methods on egocentric data and introduces a new dataset and benchmark that can serve as a robust testbed for benchmarking these systems. With the help of well-positioned speakers, this tutorial explores the new benchmarking approach by analyzing failure cases, identifying limitations, and highlighting open problems in open-source academic VIO/SLAM systems. Additionally, it provides hands-on experience using the dataset and evaluation tools for researchers to get started with their own SLAM evaluations.

Bio s:

SEA: 1st workshop on Sustainability with Earth observation and AI Sun 19 Oct 01:00 p.m.

Zhuo Zheng, Junjue Wang, Xiaoyan Lu, Xinyu Dou, Gengchen Mai, Yanfei Zhong, Liangpei Zhang, Marshall Burke, David Lobell, Stefano Ermon

The workshop brings together researchers, practitioners, and policy‑makers to advance the state‑of‑the‑art in applying artificial intelligence to Earth observation for sustainability challenges. Technically, this workshop explores how state-of-the-art EO data-tailored foundation models, efficient architectures, and novel learning paradigms can be leveraged or adapted to tackle pressing sustainability challenges. Topics include, but are not limited to, climate monitoring, disaster response, biodiversity, agriculture, urban development, clean energy, and social economics.

Workshop: Multimodal Continual Learning Sun 19 Oct 01:00 p.m.

Yunhui Guo, Yapeng Tian, Mingrui Liu, Sayna Ebrahimi, Henry Gouk, Sarthak Maharana

In recent years, advances in machine learning and computer vision have driven continual learning (CL), allowing models to learn new tasks incrementally while retaining prior knowledge without full retraining. Early CL focused on unimodal data like images for classification, but powerful multimodal models now unify images, videos, text, and audio. Multimodal continual learning (MCL) must tackle unique challenges, including modality-specific forgetting, imbalance, and maintaining cross-modal links. This MCL workshop will address these issues, highlight new research directions, and promote collaboration among researchers, practitioners, and industry, advancing inclusive, efficient continual learning for modern AI systems.

Workshop on Graphic Design Understanding and Generation Sun 19 Oct 01:00 p.m.

Kota Yamaguchi, Cherry Zhao, Rajiv Jain, Sanket Biswas, Akshay Gadi Patil, Yuhui Yuan

The workshop on Graphic Design Understanding and Generation (GDUG) aims to bring together researchers, creators, and practitioners to discuss the important concepts, technical perspectives, limitations, and ethical considerations surrounding recognition and generative approaches to graphic design and documents. While recent advances in generative AI are making impressive strides in creative domains, there is a disconnect between research attempts and the real-world workflow that involves graphics design, such as the creation of a website, posters, online advertisements, social media posts, infographics, or presentation slides, where creators do not paint pixels but instead work with structured documents, such as layered object representation, stylistic attributes, and typography.

Workshop: Comic Intelligence Quotient: Advances and Challenges in AI-driven Comic Analysis Sun 19 Oct 01:00 p.m.

Ragav Sachdeva, Emanuele Vivoli, Artemis Llabrés, Deblina Bhattacharjee, Dimosthenis Karatzas, Andrew Zisserman

Comics are a uniquely compelling visual storytelling medium, blending images and text, but they present significant challenges for Artificial Intelligence. Unlike natural images, comics rely on abstract, stylized panels and implicit transitions that demand complex inference, causing even state-of-the-art vision-language models to struggle with tasks like panel sequencing and cross-panel reasoning. This workshop brings together researchers from computer vision, cognitive science, and multimedia analysis to advance AI-driven comic understanding. Through talks and discussions, we will explore new methodologies for multimodal reasoning.

2nd Workshop and Challenge on Unlearning and Model Editing Sun 19 Oct 01:00 p.m.

Kartik Thakral, Diego Garcia-Olano, Tal Hassner, Iacopo Masi, Mayank Vatsa

The 2nd Workshop and Challenge on Unlearning and Model Editing (U&ME) is a half-day event at ICCV 2025 in Hawaii on October 19, 2025, in the afternoon, and focuses on the growing need for new, efficient, and effective techniques for editing trained models, especially large generative models. Such models have practically unlimited functionality in the output they can generate. To provide this functionality, generative models require massive amounts of data and enormous compute costs to train, making it prohibitively expensive to retrain them whenever the need arises: when safety risks are uncovered, when deploying them to compute or storage-restricted platforms, or simply due to changing requirements. In particular, ensuring these models are safe and compliant with regulations can be difficult due to their broad range of capabilities and a continuously evolving regulatory landscape.

Workshop: Transparent & Reflective objects In the wild Challenges Sun 19 Oct 01:00 p.m.

Alex Costanzino, Pierluigi Zama Ramirez, Fabio Tosi, Matteo Poggi, Luigi Di Stefano, Jean-Baptiste Weibel, Doris Antensteiner, Markus Vincze, Benjamin Busam, Guangyao Zhai, Weihang Li, Junwen Huang

Depth and pose estimation are critical for enabling machines to interact effectively with the real world. Depth estimation provides the spatial structure of a scene, pose estimation localises and orients objects within it, both fundamental for robotics, augmented reality, and 3D understanding. Traditional approaches achieved impressive results on standard benchmarks like KITTI and Middlebury. However, when these methods encounter reflective and transparent objects, their performance degrades significantly. This limitation is particularly problematic as these challenging materials are common in everyday environments. TRICKY 2025 features two complementary challenges encouraging the development of next-generation algorithms capable of advanced reasoning on non-Lambertian objects.

Workshop: Women in Computer Vision Sun 19 Oct 01:00 p.m.

Aishwarya Jadhav,Supatta Viriyavisuthisakul,Elena Govi,Chaitra Desai,Ariana Bermudez Venegas,Uma Mudenagudi,Anuhya Thota

Women in Computer Vision Workshop (WiCV@ICCV 2025) aims to promote and increase the participation of female-identifying researchers in the computer vision community. The workshop features technical talks, poster sessions, a panel discussion, and a mentoring dinner to foster networking, visibility, and collaboration. WiCV provides a platform to present cutting-edge research, share career insights, and discuss challenges faced by women in CV. The event is open to all ICCV attendees and strongly encourages junior researchers and students to participate. Through community support and industry sponsorship, WiCV continues its mission to build a more inclusive and diverse research ecosystem.

Workshop on Knowledge-Intensive Multimodal Reasoning Sun 19 Oct 01:00 p.m.

Arman Cohan, Xiangliang Zhang, Manling Li, Yapeng Tian, Minhao Cheng, Zeynep Akata, Yilun Zhao, Haowei Zhang, Tianyu Yang, Zhenting Qi, Yuyang Liu, Zhiyuan Hu, Simeng Han, Rui Xiao, Xiangru Tang

This workshop aims to advance the frontier of multimodal AI systems that can effectively reason across specialized domains requiring extensive domain knowledge. Recent advancements in multimodal AI—combining information from text, images, audio, and structured data—have unlocked impressive capabilities in general-purpose reasoning. However, significant challenges persist when these systems encounter scenarios demanding deep domain expertise in fields such as medicine, engineering, and scientific research. Such contexts require expert-level perception and reasoning grounded in extensive subject knowledge, highlighting the need for specialized strategies to handle domain-specific complexity. Through invited talks, panel discussions, and interactive poster sessions, researchers and practitioners from diverse backgrounds will share the latest developments, ongoing hurdles, and promising future directions for knowledge-intensive multimodal reasoning. The workshop aims to foster collaboration and stimulate innovation towards the development of next-generation multimodal AI systems capable of reliable, transparent, and contextually grounded reasoning in specialized, high-stakes environments.

Workshop: End-to-End 3D Learning Sun 19 Oct 01:00 p.m.

Zhiwen Fan, Qianqian Wang, Yuanbo Xiangli, Wenyan Cong, Yiqing Liang, Jiachen Li, Zhengzhong Tu, Georgios Pavlakos, Yan Wang, Achuta Kadambi

End-to-End 3D Learning (E2E3D) investigates unified, fully differentiable frameworks to map raw sensor data into comprehensive 3D representations. By merging multiple handcrafted stages into a single trainable pipeline, E2E3D strives to scale spatial understanding. Topics include self-supervised pretraining of large-scale 3D foundation models, efficient real-time inference on resource-limited platforms, and automated, high-fidelity 3D annotation methods. We showcase applications in autonomous driving, robotics, AR/VR, and scientific imaging—demonstrating how integrated 3D systems enhance perception, content generation, and science. Through cross-disciplinary talks, posters, and panels, participants will help define the next generation of robust, real-world 3D AI.

The 6th Face Anti-Spoofing Workshop: Unified Physical-Digital Attacks Detection Sun 19 Oct 01:00 p.m.

Jun Wan,Jiankang Deng,Jun Lan,Weiqiang Wang,Sergio Escalera,Hugo Jair Escalante,Xiaoming Liu,Ajian Liu,Hui Ma,Yanyan Liang,Zhen Lei,Isabelle Guyon

Face Anti-Spoofing (FAS) has become an important part of ensuring the reliability of biometric authentication systems. However, achieving unified detection of physical and digital attacks remains a serious challenge. Physical presentation attacks often introduce artifacts such as color distortion and moiré, while digital forgeries often tamper with facial images at the pixel level in an imperceptible way. To advance the development of this field, we released a massively expanded dataset, UniAttackData+, at the 6th Face Anti-Spoofing Workshop (ICCV 2025). The dataset covers 2,875 participants from three different ethnic groups (Africa, East Asia, and Central Asia), and a total of 18,250 real videos were collected under various lighting, background, and acquisition device conditions. For each participant, we designed and applied 54 attack methods (including 14 physical attacks and 40 digital attacks), generating a total of 679,097 forged videos, providing a rich, diverse, and challenging data resource for unified attack detection.

The 2nd Workshop on Efficient Computing under Limited Resources: Visual Computing Sun 19 Oct 01:00 p.m.

Jinyang Guo, Zhenghao Chen, Yuqing Ma, Yifu Ding, Xianglong Liu, Jinman Kim, Wanli Ouyang, Dacheng Tao

This workshop explores efficient methodologies in visual computing, focusing on data-efficient techniques (e.g., image/video compression), label-efficient strategies (e.g., zero/few-shot learning), and model-efficient approaches (e.g., sparsification, quantization). By bringing together experts in these areas, we aim to foster the exchange of recent findings and discuss future directions. Given the growing importance of efficiency in practical deployments, this topic has attracted significant research interest. The workshop provides a platform for presenting novel perspectives and addressing core challenges in visual computing, ultimately driving advancements that bridge academic research with real-world applications.

Workshop: Visual Quality Assessment Competition Sun 19 Oct 01:00 p.m.

Chris Wei Zhou, Jian Wang, Sizhuo Ma, Xiongkuo Min, Xiaohong Liu, Guangtao Zhai, Zhengzhong Tu, Hadi Amirpour, Shiqi Wang, Hanwei Zhu, Yixiao Li, Fan Huang, Shuo Xing, Fengjun Guo, Xin Li, Wei-Ting Chen, Xiaoshuai Hao, Ying Chen, Huasheng Wang, Pengxiang Xiao

The Visual Quality Assessment Competition (VQualA) Workshop at ICCV 2025 aims to advance perceptual quality evaluation in computer vision by addressing the limitations of traditional metrics such as PSNR and SSIM. Leveraging deep learning, generative models, and multimodal large language models (MLLMs), the workshop emphasizes human-aligned assessments. It features seven diverse challenges spanning low-level vision, document enhancement, face image quality, AIGC video evaluation, and visual comparison via MLLMs. Through both scalar metrics and comparative reasoning tasks, VQualA fosters more interpretable, robust, and perceptually meaningful evaluation. It unites academic and industrial communities to push the frontier of visual quality assessment forward.

10th International Workshop on Recovering 6D Object Pose Sun 19 Oct 01:00 p.m.

Martin Sundermeyer, Tomáš Hodaň, Médéric Fourmy, Van Nguyen Nguyen, Junwen Huang, Stephen Tyree, Jonathan Tremblay, Eric Brachmann, Sindi Shkodrani, Bertram Drost, Carsten Steger, Vincent Lepetit, Carsten Rother, Stan Birchfield, Jiří Matas

The R6D workshop discusses topics related to model-based and model-free 6D object pose estimation which are relevant for applications such as robotic manipulation and augmented reality. The 10th workshop edition is organized in conjunction with the BOP Challenge 2025 that benchmarks the latest pose estimation methods in challenging settings including the new BOP-Industrial datasets. Find out about the latest trends and remaining challenges in object-centric 3D vision and learn how the latest methods perform in the wild on real robots.

Workshop: Representation Learning with Very Limited Resources: When Data; Modalities; Labels; and Computing Resources are Scarce Sun 19 Oct 01:00 p.m.

Hirokatsu Kataoka, Yuki M. Asano, Iro Laina, Rio Yokota, Nakamasa Inoue, Rintaro Yanagi, Partha Das, Connor Anderson, Ryousuke Yamada, Daichi Otsuka, Yoshihiro Fukuhara

Modern vision and multimodal models depend on massive datasets and heavy compute, magnifying costs, energy use, bias, copyright, and privacy risks. The “DeepSeek shock” of January 2025 spotlighted the urgency of learning powerful representations under tight resource limits. Now in its third edition, our workshop continues to explore strategies for robust representation learning when data, labels, modalities, parameters, or compute are scarce. We focus on techniques such as synthetic and distilled data, self-supervision, transfer learning, sparsity, and low-rank adaptation that squeeze maximum performance from minimal resources.

Neural SLAM Workshop Sun 19 Oct 01:00 p.m.

Martin R. Oswald, Matteo Poggi, Fabio Tosi, Youmin Zhang, Yiyi Liao, Vladimir Yugay, Yue Li

Over the past two decades, SLAM (Simultaneous Localization and Mapping) has evolved significantly, transitioning from traditional methods to deep learning and, more recently, to Neural Radiance Fields (NeRFs) and 3D Gaussian Splatting (3DGS). Since 2021, a surge of over 200 papers has reshaped the field, enabling new applications like realistic novel view synthesis. However, this rapid progress also raises challenges, such as lack of standardized benchmarks and understanding of key design choices. This workshop aims to unite researchers interested in dense neural SLAM, fostering discussion through keynotes, posters, and panels to explore emerging trends and future directions.

5th Workshop and Challenge on Open-World 3D Scene Understanding Sun 19 Oct 01:30 p.m.

Francis Engelmann, Ayca Takmaz, Alex Delitzas, Elisabetta Fedele, Anna-Maria Halacheva, Katerina Adam, Yang Miao, Jan-Nico Zaech, Zuria Bauer, Johanna Wald, Danda Pani Paudel, Or Litany, Federico Tombari, Marc Pollefeys, Leonidas Guibas

The ability to perceive, understand, and interact with 3D scenes is crucial for applications in AR/VR, robotics, healthcare, and beyond. Current 3D scene understanding models are largely limited to low-level recognition tasks such as object detection or semantic segmentation, and struggle to generalize beyond predefined training labels. Recently, large VLMs such as LLAVA have demonstrated impressive capabilities. Initial works have shown their potential to extend 3D scene understanding not only to open vocabulary recognition, but also reasoning about affordances, activities, and properties of unseen environments. This workshop aims to define tasks, metrics, and benchmarks to advance this emerging direction.

13th International Workshop on Assistive Computer Vision and Robotics Sun 19 Oct 01:30 p.m.

Giovanni Maria Farinella, Antonino Furnari, Marco Leo, Gerard G. Medioni, Francesco Ragusa, Mohan Trivedi

Designing systems with humans in the loop to assist users is an active research area with potential societal impact. Investigations require many innovations, tools, and evaluation criteria, even compared to fully autonomous systems. Implementing such systems demands significant effort to achieve reliability and raises issues related to usability, privacy, and acceptability. Moreover, multidisciplinary competencies are needed to adapt algorithms to industrial, social, medical, and economic constraints. The goal is to provide a view of how recent findings in computer vision and robotics are changing assistive technologies, emphasizing related issues and how researchers in various fields have addressed them.

First Workshop on Skilled Activity Understanding; Assessment and Feedback Generation Sun 19 Oct 02:00 p.m.

Paritosh Parmar, Angela Yao, Brendan Morris, Basura Fernando

Imagine a world where computer vision-based systems can analyze a video of an athlete, a surgeon, a patient, or a factory worker and instantly provide expert-level actionable feedback---correcting techniques, identifying inefficiencies, and helping people refine their skills in real time. Thanks to rapid progress in video understanding, this vision is becoming reality. AI-powered systems can now analyze complex human activities, assess performance, and generate intelligent feedback, unlocking new possibilities in sports, healthcare, manufacturing, education, rehabilitation, and beyond. Through Expert Keynotes and Invited Contributions, this workshop will explore the cutting edge of skilled activity understanding, assessment, and feedback generation, bridging research and real-world applications. More info at https://sauafg-workshop.github.io