Workshop: Computer Vision for Biometrics, Identity & Behavior Science Mon 20 Oct 08:00 a.m.
With growing global security concerns, biometric-based authentication and identification have become indispensable due to their reliability and robustness. Beyond physical biometrics, behavior understanding is emerging as a critical domain, aiming to interpret complex behavioral patterns that arise during interactions. Integrating both biometric and behavioral insights can lead to more secure, adaptive, and context-aware identity verification systems. Computer vision plays a pivotal role in analyzing and synthesizing biometric, identity, and behavior data. Recent advancements in research, driven by deep learning and multimodal analysis, have significantly expanded the field. However, numerous challenges remain, including effective joint modeling of multimodal cues occurring at different time scales, handling the inherent uncertainty of machine-detectable behavioral evidence, and addressing long-term dependencies in human behavior and identity recognition. This workshop aims to bring together leading researchers, industry experts, and government agencies to discuss the latest breakthroughs. It will serve as a platform to explore cutting-edge solutions, share innovative methodologies, and address the open challenges in this evolving field.
Workshop: Foundation & Generative Models in Biometrics Mon 20 Oct 08:00 a.m.
The ICCV 2025 Workshop on Foundation and Generative Models in Biometrics aims to bring together researchers to discuss state-of-the-art advancements, applications, and challenges in using foundation and generative models for biometric recognition, analysis, and security. While foundation models have gained significant attention in recent years, their applications in biometrics remain relatively underexplored. This workshop seeks to encourage discussions that inspire innovation and address the challenges of applying these advanced models in real-world biometric systems. The program will feature invited talks and paper presentations.
Advances in Image Manipulation Workshop and Challenges Mon 20 Oct 08:00 a.m.
Image manipulation, restoration, and enhancement are key computer vision tasks that serve as an important frontend for further tasks. Each step forward eases the use of images by people or computers. Not surprisingly, there is an ever-growing range of applications in fields such as surveillance and automotive, as well as on mobile and wearable devices. The 6th AIM workshop provides an overview of advances in these areas and an opportunity for academic and industrial attendees to interact and explore collaborations. 32 papers and 18 associated competitions gauge the state of the art on topics such as super-resolution, denoising, deblurring, ISPs, segmentation, efficient models, and quality assessment.
Workshop: Generating Digital Twins from Images and Videos Mon 20 Oct 08:00 a.m.
In this workshop, we focus on 3D models enriched with processes and semantic connections, similar to those in computer game and robotic environments. These models can range in fidelity from simplified 3D representations (Digital Cousins) to highly accurate reconstructions of real-world counterparts (Digital Twins). Recent techniques such as 3D Gaussian Splatting and diffusion models have demonstrated impressive success in generating 3D representations from images and video. The next frontier in 3D representation is enriching models by integrating both physical and semantic object properties through generative AI and retrieval-based approaches.
Digital Twin Generation from Visual Data: A Survey https://arxiv.org/abs/2504.13159
Workshop on Curated Data for Efficient Learning Mon 20 Oct 08:00 a.m.
The ICCV 2025 Workshop on Curated Data for Efficient Learning (CDEL) seeks to advance the understanding and development of data-centric techniques that improve the efficiency of training large-scale machine learning models. As model sizes continue to grow and data requirements scale accordingly, this workshop brings attention to the increasingly critical role of data quality, selection, and synthesis in achieving high model performance with reduced computational cost. Rather than focusing on ever-larger datasets and models, CDEL emphasizes the curation and distillation of high-value data—leveraging techniques such as dataset distillation, data pruning, synthetic data generation, and sampling optimization. These approaches aim to reduce redundancy, improve generalization, and enable learning in data-scarce regimes. The workshop will bring together researchers and practitioners from vision, language, and multimodal learning to share insights and foster collaborations around efficient, scalable, and sustainable data-driven machine learning.
Tutorial: James Fort
Fourth Hands-on Egocentric Research Tutorial with Project Aria, from Meta
Project Aria is a research device that is worn like a regular pair of glasses, for researchers to study the future of computer vision with always-on sensing. Sensors in Project Aria capture egocentric video and audio, in addition to eye-gaze, inertial, and location information. On-device compute power is used to encrypt and store information that, when uploaded to separate designated back-end storage, helps researchers build the capabilities necessary for AR to work in the real world. In this fourth tutorial, in addition to sharing research from academic partner program members, we will also provide an introduction to the second generation of Aria glasses, "Aria Gen 2", announced in February. As part of this introduction, we will provide a live hands-on demo of the Aria Research Kit (including Gen 2 glasses), describe how researchers can gain access to the Project Aria academic program, and demonstrate how open-source tools can be used to accelerate research on specific research challenges, including visual and non-visual localization and mapping, static and dynamic object detection and spatialization, human pose estimation, and building geometry estimation. We will review new open datasets from Meta and academic partners, including a dataset of 6000+ 3D objects with Aria captures for each object to facilitate novel research on egocentric 3D object reconstruction, and a review of ego-perception challenges and benchmarks associated with all datasets, including a demonstration of methods for approaching each challenge.
Workshop: Computer Vision in Advertising and Marketing Mon 20 Oct 08:00 a.m.
The workshop will explore cutting-edge computer vision applications in digital advertising and marketing, covering fundamental visual understanding tasks, marketing optimization systems, brand intelligence, responsible AI practices, creative generation techniques, and emerging technologies that are transforming how brands connect with audiences through visual content. Key focus areas include multimodal data processing, visual similarity analysis, real-time bidding optimization, dynamic creative optimization, brand safety monitoring, and privacy-preserving analytics. The program will also address generative AI applications in advertising, automated visual optimization, and personalized content creation, while emphasizing ethical considerations and bias mitigation in marketing technology.
Tutorial: Aditya Chattopadhyay · Rene Vidal · Jeremias Sulam
Foundations of Interpretable AI
In recent years, the lack of interpretability has emerged as a significant barrier to the widespread adoption of deep learning techniques, particularly in domains where AI decisions can have consequential impacts on human lives, such as healthcare and finance. Recent attempts at interpreting the decisions made by a deep network can be broadly classified into two categories: (i) methods that seek to explain existing models (post-hoc explainability), and (ii) methods that seek to build models that are explainable by design. This tutorial aims to provide a comprehensive overview of both approaches along with a discussion of their limitations. More specifically, this tutorial will consist of three lectures covering the following topics: post-hoc explainability methods; explaining deep networks using Shapley values and statistical testing; and explainable-by-design deep networks.
Workshop: Computer Vision for Materials Science Mon 20 Oct 08:00 a.m.
Computer vision and machine learning are critical tools for supporting large-scale materials characterization and the development of new materials. Quantified structure features extracted from the data can be leveraged in statistical and machine learning models that establish processing-structure-property-performance (PSPP) relationships and identify non-linear, unintuitive trends in the high-dimensional materials development space, further accelerating materials development. The aim of the workshop is to bring together cross-disciplinary researchers to demonstrate recent advancements in machine learning, computer vision, and materials microscopy, and to discuss open problems such as representation learning, uncertainty quantification, and explainability in materials microscopy analysis.
Workshop on Distillation of Foundation Models for Autonomous Driving Mon 20 Oct 08:00 a.m.
The 2nd Workshop on Distillation of Foundation Models for Autonomous Driving (WDFM-AD) focuses on advancing the state of the art in deploying large foundation models—such as vision-language models (VLMs) and generative AI (GenAI) models—into autonomous vehicles through efficient distillation techniques. Building on the success of our previous workshops on large language and vision models for autonomous driving, WDFM-AD aims to bring together researchers and industry professionals to explore innovative approaches that accelerate the safe, efficient, and scalable adoption of cutting-edge AI technologies in autonomous vehicles.
Workshop: 9th AI City Challenge Mon 20 Oct 08:00 a.m.
The ninth AI City Challenge advanced real-world AI applications in transportation, automation, and safety, attracting 245 teams from 15 countries—a 17% increase. Featuring four tracks, the 2025 edition introduced challenges in 3D multi-camera tracking, traffic video question answering, warehouse spatial reasoning, and efficient fisheye camera detection. Tracks 1 and 3 utilized synthetic data from NVIDIA Omniverse. The evaluation platform ensured fair benchmarking with submission limits and held-out test sets. Public dataset releases reached over 30,000 downloads. Final rankings, announced post-competition, highlighted strong global participation and new benchmarks across multiple tasks, driving progress in intelligent visual perception and reasoning.
Workshop: Ego-Exo Sensing for Smart Mobility Mon 20 Oct 08:00 a.m.
This workshop explores a holistic approach to ego-exo sensing, integrating vehicle sensors, roadside cameras, aerial imagery, and V2V communications to advance transportation intelligence and drive progress toward smart mobility. We examine how ego-exo sensing networks enhance safety-critical scenario detection and generation, comprehensive environmental perception, cooperative driving, and multi-agent decision making, among other crucial tasks shaping the future of mobility. This workshop bridges siloed research efforts to create unified approaches for heterogeneous sensor fusion that will define next-generation mobility systems.
The Third Workshop on AI for 3D Content Creation Mon 20 Oct 08:00 a.m.
Generating realistic 3D content has been a long-standing problem in computer vision and graphics, and has recently attracted increasing attention. This workshop aims to bring together researchers to explore recent advances and future directions toward building fully controllable 3D content generation pipelines. We focus on four key aspects: (1) Representations suitable for generating high-quality and controllable 3D assets; (2) Modeling techniques that enable scalable, diverse, and photorealistic generation of humans, objects, and scenes; (3) Interaction modeling for capturing dynamic human-object relations with physical realism; (4) Applications of 3D content creation in areas such as embodied AI, construction, and digital design.
Workshop: Multi-Modal Reasoning for Agentic Intelligence Mon 20 Oct 08:00 a.m.
AI agents powered by Large Language Models (LLMs) have shown strong reasoning abilities across tasks like coding and research. With the rise of Multimodal Foundation Models (MFMs), agents can now integrate visual, textual, and auditory inputs for richer perception and decision-making. This workshop explores the development of Multimodal AI Agents across four categories: Digital, Virtual, Wearable, and Physical. We will discuss their applications in science, robotics, and human-computer interaction, as well as key challenges in cross-modal integration, real-time responsiveness, and interpretability. The goal is to advance robust, context-aware agents for complex, real-world environments.
Workshop on Neuromorphic Vision (NeVi): Advantages and Applications of Event Cameras Mon 20 Oct 08:00 a.m.
Neuromorphic vision sensors, or event cameras, mimic biological vision by asynchronously detecting changes in illumination, enabling high temporal resolution, low power consumption, and no motion blur. These unique features support advanced applications in robotics, autonomous vehicles, and human behavior analysis, especially for motion-centric tasks. Event cameras excel in low-light conditions and fast dynamics, enabling real-time obstacle avoidance, emotion recognition, defect detection, and more. Their microsecond latency and high dynamic range offer significant advantages over conventional cameras. Moreover, their inherent data sparsity contributes to privacy preservation.
Workshop: Embodied Spatial Reasoning Mon 20 Oct 08:00 a.m.
The 1st Embodied Spatial Reasoning Workshop at ICCV 2025 explores the integration of spatial understanding in intelligent agents. Focus areas include Embodied AI, which allows agents to perceive, reason, and act in real-world or simulated environments, and Spatial Reasoning, which involves interpreting spatial relations and sensory feedback. The workshop also delves into the development of Embodied World Models for building spatially coherent, semantically grounded internal representations, and Robot Spatial Reasoning, addressing challenges in planning and acting under uncertainty and task constraints. The goal is to advance robust, generalizable spatial reasoning for embodied agents.
Workshop: Generative AI for Biomedical Image Analysis: Opportunities, Challenges, and Futures Mon 20 Oct 08:00 a.m.
The GAIA (Generative AI for Biomedical Image Analysis) workshop at ICCV 2025 explores how generative AI is transforming medical imaging and healthcare. The workshop focuses on three key areas: (1) Data synthesis and clinical modeling using generative models for anatomically accurate image creation and disease simulation, (2) Multimodal learning that integrates visual data with medical reports through large language models, and (3) Workflow automation streamlining medical imaging from acquisition to diagnosis. Bringing together experts from computer vision, healthcare, and AI research, the workshop addresses challenges in interpretability, regulatory compliance, and clinical reliability while showcasing opportunities for interdisciplinary collaboration in advancing biomedical image analysis.
Workshop: Vision-Language Modeling in 3D Medical Imaging Mon 20 Oct 08:00 a.m.
The VLM3D workshop brings together pioneers in vision-language modeling and 3D medical imaging to tackle the limitations of current models, which remain primitive and clinically unfit for real-world deployment. Through keynotes, discussions, and a dedicated benchmark challenge, we will explore why today’s AI struggles with the complexity of 3D data and how to advance towards robust, deployable solutions. We aim to bridge the gap between research and clinical practice by defining critical next steps for generating reliable, interpretable, and clinically useful models. Join us to help shape the future of AI-driven 3D medical imaging.
Workshop on Advanced Perception for Autonomous Healthcare Mon 20 Oct 08:00 a.m.
This workshop explores cutting-edge technologies, focusing on AI and computer vision, to advance autonomous, efficient, and patient-centered healthcare. It addresses challenges such as medical errors and the need for precise diagnostics amid rapid technological advancements. The event facilitates collaboration among AI researchers, clinicians, and industry professionals through invited talks and paper presentations, covering theoretical and practical applications of visual perception technologies that enhance workflow efficiency and diagnostic accuracy, reduce errors, and improve patient care. By fostering partnerships, the workshop tackles issues such as staff shortages and rising healthcare costs, promoting innovative solutions for a more effective healthcare system.
Workshop: Large-scale Video Object Segmentation Mon 20 Oct 08:00 a.m.
The 7th LSVOS Workshop focuses on advancing research in Video Object Segmentation (VOS) and Referring Video Object Segmentation (RVOS). For VOS, we offer two tracks: one based on the Complex Video Object Segmentation (MOSEv2) dataset, the other based on the MOSEv1 and LVOS datasets, targeting long-term videos and complex real-world scenes with challenges such as object disappearance and reappearance, inconspicuous small objects, heavy occlusions, and crowded environments. The RVOS track continues with the MeViS dataset, which emphasizes motion-based language expressions and demands fine-grained temporal reasoning. In addition to the challenges, the workshop hosts invited talks from leading researchers, covering topics such as vision-and-language, motion understanding, cognitive modeling, and embodied intelligence in video understanding.
Workshop on Cultural Continuity of Artists: Leveraging Artistic Legacies for AI-Driven Cultural Heritage Mon 20 Oct 08:00 a.m.
The Workshop on Cultural Continuity of Artists (WCCA) brings together researchers, creators, and cultural institutions to explore how computer vision, multimodal AI, and XR technologies can safeguard and reinterpret artistic legacies. Our inaugural edition, co‑located with ICCV 2025, highlights the visionary South Korean fashion designer André Kim and introduces a rich, newly curated dataset from his archives.
Workshop: Generative AI for Storytelling Mon 20 Oct 08:00 a.m.
Generative AI excels at producing stunning visuals but often fails at creating coherent, engaging stories. Storytelling demands consistency in character, plot, and setting—areas where current models fall short. This workshop explores combining advanced visual models, large language models, and multi-modal AI to generate narratives that are visually consistent and compelling. Our goal is to push generative AI beyond impressive graphics, enabling it to deliver dynamic, cohesive stories and broaden its role in content creation.
The 12th IEEE International Workshop on Analysis and Modeling of Faces and Gestures Mon 20 Oct 08:00 a.m.
AMFG 2025 invites cutting-edge work in face, gesture, and multimodal recognition, where deep learning has unlocked unprecedented gains—but also raised concerns around generalization, transparency, and robustness. As models saturate benchmarks yet falter in real-world scenes with occlusion, motion, or lighting shifts, new challenges demand innovative solutions. Topics include detection and tracking, neural rendering, generative modeling, vision-language systems, kinship and soft biometrics, cross-modal fusion, benchmark creation, and ethical AI. With applications spanning HCI, surveillance, AR/VR, and behavioral science, AMFG aims to push beyond recognition into systems that interpret, adapt, and interact. Submit your work and shape the future of embodied vision.
Workshop: Multimodal reasoning and slow thinking in the large model era: towards System 2 and beyond Mon 20 Oct 08:00 a.m.
This workshop aims to bridge the gap between computer vision and large language/reasoning models, focusing on complex tasks requiring advanced reasoning capabilities. We will explore how models can comprehend complex relationships through slow-thinking approaches like Neuro-Symbolic reasoning, Chain-of-Thought, and Multi-step Reasoning, pushing beyond traditional fixed tasks to understand object interactions within complex scenes. The goal is to bring together perspectives from computer vision, multimodal learning, and large language models to address outstanding challenges in multimodal reasoning and slow thinking in the context of large reasoning models, fostering more flexible and robust understanding in AI systems.
6th Workshop on Continual Learning in Computer Vision Mon 20 Oct 08:00 a.m.
The Workshop on Continual Learning in Computer Vision (CLVision) aims to gather researchers and engineers from academia and industry to discuss the latest advances in Continual Learning. In this workshop, there will be regular paper presentations, invited speakers, and technical benchmark challenges to present the current state of the art, as well as the limitations and future directions for Continual Learning, arguably one of the most crucial milestones of AI.
Workshop: Closing the Loop Between Vision and Language (Decade Mark) Mon 20 Oct 08:00 a.m.
This workshop explores the intersection of Computer Vision and NLP, focusing on joint vision-language understanding. Recent advances, particularly in large-scale multimodal pretraining with transformers, have driven progress in various tasks. Topics include visual-linguistic representation learning, VQA, captioning, visual dialog, referring expressions, vision-and-language navigation, embodied QA, and text-to-image generation. We emphasize joint video-language understanding due to its unique challenges. Additionally, we welcome critical work on dataset and algorithmic bias, generalization issues, and efforts toward transparency and explainability.
Workshop: What is Next in Multimodal Foundation Models? Mon 20 Oct 08:00 a.m.
The intersection of foundation models and multimodal learning is a significant and widely discussed topic that complements the main ICCV conference. This workshop aims to encourage an interdisciplinary discussion on recent advancements, ongoing challenges, and future directions in multimodal foundation models, which have achieved breakthroughs by applying techniques across computer vision, natural language, and robotics.
Workshop: Robust and Interactable World Models in Computer Vision Mon 20 Oct 08:00 a.m.
The workshop will focus on physical reliability and effective interactivity in world models for applications requiring precise physical reasoning and dense environmental interactions, such as robotics, autonomous systems, and multi-agent interactions. Beyond generating realistic predictions, world models must enforce physical consistency through differentiable physics, hybrid modeling, and adaptive simulation techniques. By bringing together researchers from machine learning, computer graphics, and physics-based modeling, the workshop will explore classical and cutting-edge approaches to aligning world models with real-world physics and extending them beyond simulation.
Personalization in Generative AI Workshop Mon 20 Oct 08:00 a.m.
Personalization in Generative AI Workshop (P13N) is a full-day workshop that brings together leading researchers and industry experts to explore cutting-edge personalization techniques in generative AI. The event will feature paper presentations, panel discussions, and a competition focusing on personalized generative models across images, and videos. Topics include advanced optimization methods for personalizing diffusion models, multi-subject composition, cross-modal personalization, AR/VR personalization, dataset curation and benchmarking, as well as ethical and privacy considerations.
Workshop: Wild3D: 3D Modeling, Reconstruction, and Generation in the Wild Mon 20 Oct 08:00 a.m.
Despite recent advances in 3D modeling, reconstruction, and generation, many methods remain limited to static scenes or dense viewpoints, making them less effective in real-world, dynamic, and often sparse or noisy settings. This workshop aims to gather researchers and practitioners focused on modeling, reconstructing, and generating dynamic 3D objects or scenes under challenging, in-the-wild conditions. Leveraging progress in 3D learning, the abundance of 2D/3D data, and powerful generative models, now is an opportune time to make 3D vision more robust and accessible. The workshop encourages contributions from standard 3D topics and broader 4D directions involving dynamics and video generation.
Workshop: Mobile Intelligent Photography and Imaging Mon 20 Oct 08:10 a.m.
Developing and integrating advanced image sensors with novel algorithms in camera systems is prevalent with the increasing demand for computational photography and imaging on mobile platforms. However, the lack of high-quality data for research and the rare opportunity for in-depth exchange of views from industry and academia constrain the development of Mobile Intelligent Photography and Imaging (MIPI). The workshop's main focus is on MIPI, emphasizing the integration of novel image sensors and imaging algorithms.
The 3rd workshop on Binary and Extreme Quantization for Computer Vision Mon 20 Oct 08:15 a.m.
The 3rd edition of the Workshop seeks to explore novel directions for making deep learning models more efficient. We'll delve into low-bit quantization, a technique that significantly reduces model size and computational demand by representing model weights and activations with fewer bits. This is crucial for deploying models on-device, especially given the ever-growing model size. A core focus of the workshop will be to study ways of maintaining accuracy under extreme quantization, with recent breakthroughs demonstrating exciting potential for achieving this. Hear about the latest trends from our invited speakers and presented papers.
Workshop: Anomaly Detection with Foundation Models Mon 20 Oct 08:25 a.m.
The rapid advancement of foundation models in fields like healthcare, cybersecurity, and finance highlights the urgent need to improve their anomaly detection capabilities. Despite their growing application in high-stakes areas, the challenges of using these models for anomaly detection remain underexplored. The Anomaly Detection with Foundation Models (ADFM 2025) workshop aims to address this gap by focusing on the intersection of foundation models and anomaly detection. Our organizing and technical committee, composed of leading experts, provides a platform for advancing research and discussing the recent breakthroughs, and the technical and ethical implications of deploying these models. ADFM 2025 will foster interdisciplinary collaboration and contribute to the development of more reliable and effective anomaly detection systems in artificial intelligence.
Workshop: Vision Foundation Models and Generative AI for Accessibility: Challenges and Opportunities Mon 20 Oct 08:25 a.m.
The CV4A11y Workshop focuses on how new advances in vision foundation models and generative AI can help improve accessibility for people with disabilities. These technologies have great potential, but there are still important challenges, such as bias, limited data, lack of explainability, and real-world deployment issues. This workshop brings together experts in computer vision, AI, human-computer interaction, and accessibility to share new ideas, discuss open problems, and explore future directions. Our goal is to support the development of AI-powered tools that are more inclusive, useful, and effective in improving daily life for individuals with disabilities.
Workshop: Fairness and Ethics in AI: facing the ChalLEnge through Model Debiasing Mon 20 Oct 08:30 a.m.
Despite the unprecedented surge in AI adoption, the presence of biases within AI models is a critical concern, perpetuating disparities and ethical dilemmas. In recent years, the scientific community has increasingly focused on understanding and addressing model bias, as evidenced by a significant uptick in research across various disciplines. In this context, we present the second edition of the FAILED workshop. This initiative aims to convene experts and practitioners from diverse backgrounds to explore innovative strategies for rectifying biases and promoting fairness and transparency in AI systems. Join us in this collaborative endeavor and be a part of this transformative journey!
Workshop: From street to space: 3D Vision AcrosS altiTudes Mon 20 Oct 08:30 a.m.
As large-scale 3D scene modeling becomes increasingly important for applications such as urban planning, robotics, autonomous navigation, and virtual simulations, the need for diverse, high-quality visual data is greater than ever. However, acquiring dense and high-resolution ground-level imagery at scale is often impractical due to access limitations, cost, and environmental variability. In contrast, aerial and satellite imagery provide broader spatial coverage but lack the fine-grained details needed for many downstream applications. Combining images from multiple altitudes — from ground cameras to aerial drones and satellites—offers a promising solution to overcome these limitations, enabling richer, more complete 3D reconstructions. How can we achieve coherent and accurate 3D scene modeling when our visual world is captured from vastly different altitudes—ground, aerial, and satellite—under varying conditions? Each altitude offers distinct advantages, but cross-altitude data fusion introduces significant challenges: sparse and incomplete views, visual ambiguities, spatio-temporal inconsistencies, image quality variations, dynamic scene changes, and environmental factors that alter topology over time. Traditional 3D reconstruction methods, optimized for dense and structured inputs, struggle with such heterogeneous multi-altitude data. Advances in multi-scale feature alignment, neural scene representations, and robust cross-view fusion offer promising solutions, but key challenges remain.
Workshop: Generative Scene Completion for Immersive Worlds Mon 20 Oct 08:30 a.m.
This workshop focuses on generative scene completion, which is indispensable for world models, VR/AR, telepresence, autonomous driving, and robotics. It explores how generative models can help reconstruct photorealistic 3D environments from sparse or partial input data by filling in occluded or unseen spaces. Topics include world models, generative models, inpainting, artifact removal, uncertainty, controllability, and handling of casual data. We will discuss how related directions like text-to-3D and single-image-to-3D compare with scene completion, where more input constraints must be satisfied. The workshop highlights key challenges and recent progress in transforming incomplete real-world captures into immersive environments. https://scenecomp.github.io/
Workshop: Human-inspired Computer Vision Mon 20 Oct 08:30 a.m.
The goal of the Human-inspired Computer Vision workshop is to link and disseminate parallel findings in the fields of neuroscience, psychology, cognitive science, and computer vision, to inform the development of human-inspired computational models capable of solving visual tasks in a human-like fashion. Despite the high performance reached by recent computer vision approaches, the relationship between machine and human vision remains unclear. Investigating this relationship is timely and important, both to improve machine vision, by identifying and tackling gaps between humans and machines, and to understand and enhance human vision, by developing interpretable models that can help explain neuroscientific and cognitive observations.
Workshop & Competition on Computationally Optimal Gaussian Splatting Mon 20 Oct 08:30 a.m.
The first Workshop & Competition on Computationally Optimal Gaussian Splatting (COGS) welcomes researchers working on techniques for efficient 3D Gaussian Splatting (3DGS). While 3DGS has advanced rapidly, real-time rendering of Gaussian splats on resource-limited devices (e.g., smartphones, AR/VR headsets) remains a challenge. COGS aims to lower the barrier to entry and encourage new research in this area. The event will feature keynote talks from leading experts and a panel session exploring the current state and future directions of this promising technique.
1st Workshop on Multimodal Sign Language Recognition Mon 20 Oct 08:30 a.m.
Sign language is a rich and expressive visual language that uses hand gestures, body movements, and facial expressions to convey meaning. With hearing impairment increasingly prevalent worldwide, Sign Language Recognition research is advancing to enable more inclusive communication technologies. The 1st Multimodal Sign Language Recognition Workshop (MSLR 2025) brings together researchers to explore vision-, sensor-, and generation-based approaches. Emphasizing multimodal fusion of RGB video, depth maps, skeletal and facial keypoints, and radar data, the workshop highlights systems designed for real-world variability and privacy. Topics include statistical and neural sign-to-text and text-to-sign translation, cross-lingual and multilingual methods, multimodal generative synthesis, and inclusive dataset creation. Through keynotes, presentations, and challenges on continuous and isolated sign recognition, participants will engage with new benchmarks, metrics, and ethical data practices. The workshop also highlights privacy-preserving sensing and healthcare accessibility, inviting contributions from researchers across disciplines to shape the future of multimodal sign language technologies.
The Second Workshop on Multimodal Representation and Retrieval Mon 20 Oct 08:30 a.m.
Multimodal representation learning is central to modern AI, enabling applications across retrieval, generation, RAG, reasoning, agentic AI, and embodied intelligence. With the growing ubiquity of multimodal data—from e-commerce listings to social media and video content—new challenges arise in multimodal retrieval, where both queries and indexed content span multiple modalities. This task requires deeper semantic understanding and reasoning, especially at scale, where data complexity and noise become significant hurdles. The half-day event will feature keynote talks, oral and poster presentations.
Workshop: Findings of the ICCV Mon 20 Oct 08:45 a.m.
This workshop introduces the concept of a findings-style track to the computer vision community. NLP conferences have included Findings tracks since 2020 to publish work that is technically sound but may not meet the main conference's threshold for novelty, impact, or excitement. There are many important results the community should be made aware of; this venue provides them an audience without the delays of further submission iterations, whereas they might otherwise be lost if never published. This workshop provides a vehicle to discuss creating a computer vision Findings track and presents Findings-quality papers to demonstrate their impact and benefits, informing future conferences.
Workshop: BioImage Computing Mon 20 Oct 09:00 a.m.
Bio-image computing (BIC) is a rapidly growing field at the interface of engineering, biology and computer science. Advanced light microscopy can deliver 2D and 3D image sequences of living cells with unprecedented image quality and ever increasing resolution in space and time. The emergence of novel and diverse microscopy modalities has provided biologists with unprecedented means to explore cellular mechanisms, embryogenesis, and neural development, to mention only a few fundamental biological questions. The enormous size and complexity of these data sets, which can exceed multiple TB per volume or video, requires state-of-the-art computer vision methods.
Workshop: 2nd AI for Content Generation, Quality Enhancement and Streaming Mon 20 Oct 09:00 a.m.
Welcome to the 2nd Workshop on AI for Content Generation, Quality Enhancement and Streaming. This workshop focuses on unifying new streaming technologies, computer graphics, and computer vision, from the modern deep learning point of view. Streaming is a huge industry in which hundreds of millions of users demand high-quality content every day on different platforms. Computer vision and deep learning have emerged as revolutionary forces for rendering content, image and video compression, enhancement, and quality assessment. From neural codecs for efficient compression to deep learning-based video enhancement and quality assessment, these advanced techniques are setting new standards for streaming quality and efficiency. Moreover, novel neural representations pose new challenges and opportunities in rendering streamable content, allowing us to redefine computer graphics pipelines and visual content.
1st Workshop and Challenge on Category-Level Object Pose Estimation for Robotic Manipulation Mon 20 Oct 09:00 a.m.
This workshop addresses the critical problem of category-level object pose estimation and its applications within complex robotic manipulation scenarios. Pose estimation, a fundamental challenge in both 3D computer vision and robotics perception, involves accurately determining an object's complete 6-degree-of-freedom (6DoF) pose, comprising its 3D rotation and translation. Our workshop specifically focuses on advancing category-level pose estimation methods under realistic and demanding robotic manipulation settings, particularly emphasizing articulated objects, dynamic environments with potential human-object interactions, and objects subject to severe occlusions and partial visibility.
Tutorial: Andrew Westbury · Shoubhik Debnath · Weiyao Wang · Laura Gustafson · Daniel Bolya · Xitong Yang · Kate Saenko · Chaitanya Ryali · Haitham Khedr · Christoph Feichtenhofer
From Segment Anything to Generalized Visual Grounding
In this tutorial, Meta AI and its academic partners will overview frontier research on visual grounding. We will cover each building block necessary to move toward future general-purpose visual grounding systems, including universal image and video encoding, multimodal language understanding, semantic instance segmentation and tracking, and the latest in 3D reconstruction methods. We will provide practical guidance on using SAM open source models, resources, and tooling to tackle the field's biggest open research problems. A new suite of SAM systems to be released this year will provide a foundation for our tutorial, offering practical entry points for each course component.
2nd Workshop on Computer Vision for Ecology Mon 20 Oct 09:00 a.m.
The Computer Vision for Ecology workshop aims to bring together experts to foster discussion on the automation of ecological data collection, collation, and analysis. The goal is to establish a hub for the broader computer vision and ecology community at ICCV. The workshop encompasses applications of computer vision across a wide variety of ecological systems, spanning both terrestrial and aquatic systems, diverse geographic regions, and urban to wildland settings. The topics we aim to address include, but are not limited to, remote sensing, bioacoustics, video and image-based monitoring, citizen science, long-tailed recognition, zero-shot learning, expert AI systems, and model deployment.
Workshop: UniLight: Unifying Evaluation Metrics for Image-based Lighting, Relighting and Compositing Mon 20 Oct 01:00 p.m.
Recent advancements in image editing applications such as relighting, compositing, harmonization, and virtual object insertion have opened up new horizons in visual media, augmented reality, and virtual production, especially with the rise of powerful image generative models. However, evaluating the quality of results for these applications is still a significant challenge. Traditional image quality metrics are not always effective in capturing the perceptual realism and subtle effects these technologies aim to achieve. Additionally, relying on user studies is time-consuming and introduces variability, making it challenging to compare methods consistently. To address these issues, this workshop explores and develops standardized evaluation metrics to bridge the gap between quantitative assessment and qualitative perception.
TrustFM: Workshop on Trustworthy Foundation Models Mon 20 Oct 01:00 p.m.
Foundation models are revolutionizing the way we interact with AI—powering everything from search engines to scientific discovery. But as their reach expands, so do the risks. Can we truly trust these systems—before putting them to use? From AlexNet to LLaVA, the pace of innovation is staggering. Yet one thing remains constant: the urgent need for trustworthiness. In the foundation model era, we ask: What does trust mean at scale? Can classical insights still guide us? This workshop brings together researchers, engineers, and thought leaders to confront these challenges head-on. We’ll explore how to create models that are not just powerful, but robust, fair, interpretable, and accountable.
Workshop: Critical Evaluation of Generative Models and their Impact on Society Mon 20 Oct 01:00 p.m.
Visual generative models have revolutionized our ability to generate realistic images, videos, and other visual content. However, with great power comes great responsibility. While the computer vision community continues to innovate with models trained on vast datasets to improve visual quality, questions regarding the adequacy of evaluation protocols arise. Automatic measures such as CLIPScore and FID may not fully capture human perception, while human evaluation methods are costly and lack reproducibility. Alongside technical considerations, critical concerns have been raised by artists and social scientists regarding the ethical, legal, and social implications of visual generative technologies. The democratization and accessibility of these technologies exacerbate issues such as privacy, copyright violations, and the perpetuation of social biases, necessitating urgent attention from our community. This interdisciplinary workshop aims to convene experts from computer vision, machine learning, social sciences, digital humanities, and other relevant fields. By fostering collaboration and dialogue, we seek to address the complex challenges associated with visual generative models and their evaluation, benchmarking, and auditing.
1st Workshop on Long Multi-Scene Video Foundations: Generation, Understanding and Evaluation Mon 20 Oct 01:00 p.m.
This workshop aims to advance the state-of-the-art in long multi-scene video modelling, covering generation, understanding, evaluation, and ethical considerations. Long videos offer a powerful means of expression and communication, with applications in diverse fields such as entertainment, education, and health. However, current video generation and understanding techniques are typically confined to short, single-scene videos, limiting both our ability to create and comprehend complex video narratives. Thus, a growing need and research area is the development of methods for generating and understanding long-form videos of multiple dynamic scenes.
Workshop: Egocentric Body Motion Tracking, Synthesis and Action Recognition Mon 20 Oct 01:00 p.m.
EgoMotion, in its second edition, is a continuation workshop focusing on human motion modeling using egocentric, multi-modal data from wearable devices. We focus on motion tracking, synthesis, and understanding algorithms using egocentric/exocentric cameras, non-visual sensors, and high-level derived data. The workshop also covers research that applies egocentric motion to character animation, simulation, robot learning, etc. In addition to algorithms, the workshop promotes recent open-source projects, research platforms, datasets, and associated challenges to encourage and accelerate research in the field. We will include live demo sessions to encourage discussions.
Workshop on Computer Vision Systems for Document Analysis and Recognition Mon 20 Oct 01:00 p.m.
In today’s rapidly digitalizing world, the ability to analyze documents automatically is becoming increasingly important in our daily life. Document Analysis plays a growing role in both industrial and cultural contexts, highlighting the need for AI systems capable of handling highly diverse documents, presenting significant challenges. This workshop seeks to address these issues by fostering interdisciplinary collaboration. By bringing together researchers and professionals from different domains, it aims to facilitate knowledge exchange, promote innovation, and advance the development of intelligent, adaptable solutions for Document Analysis in a wide range of applications.
Workshop: Multimodal Spatial Intelligence Mon 20 Oct 01:00 p.m.
Our workshop will feature insightful keynote talks and a panel discussion around multi-modal spatial intelligence. Key topics include enhancing MLLMs' reasoning with images and 3D data, advancing 2D/3D perception, and enabling embodied AI. We will also delve into dynamic physical world modeling and critically examine the trust, ethics, and societal impact of these technologies. This workshop is a hub for advancing the future of spatially-aware AI, from core reasoning to real-world application and responsible deployment.
Workshop: Large Scale Cross Device Localization Mon 20 Oct 01:00 p.m.
As computer vision moves into real-world use, robust localization across diverse devices is crucial. The CroCoDL workshop unites experts in vision, robotics, and AR to tackle cross-device, multi-agent localization. Focusing on 3D vision, visual localization, embodied AI, and AR/VR/MR, it bridges academic research and real-world deployment. The inaugural event features invited talks, papers, and a competition. It also introduces CroCoDL, a new large-scale benchmark with synchronized data from phones, headsets, and robots. By connecting efforts in structure-from-motion, neural rendering, and embodied AI, the workshop advances scalable localization across domains, sensors, and dynamic environments.
2nd Workshop on Audio-Visual Generation and Learning Mon 20 Oct 01:00 p.m.
In this workshop, we aim to shine a spotlight on this exciting yet underinvestigated field by prioritizing new approaches in audio-visual generation, as well as covering a wide range of topics related to audio-visual learning, where the convergence of auditory and visual signals unlocks a plethora of opportunities for advancing creativity, understanding, and also machine perception. We hope our workshop can bring together researchers, practitioners, and enthusiasts from diverse disciplines in both academia and industry to delve into the latest developments, challenges, and breakthroughs in audio-visual generation and learning.
Workshop: PHAROS - Adaptation, Fairness, Explainability in AI Medical Imaging Mon 20 Oct 01:00 p.m.
PHAROS-AFE-AIMI aims to present innovative approaches for predictive modeling using large medical image datasets, emphasizing deep learning models and transparent, human-centered integration of GenAI and LLMs in health services. It tackles key challenges at the intersection of computer vision and healthcare AI, including multi-disease diagnosis, model explainability, fairness, domain adaptation, and continual learning. With rising interest in trustworthy and interpretable AI, PHAROS-AFE-AIMI fosters discussion on responsible deployment in sensitive applications. PHAROS-AFE-AIMI is organised under the PHAROS AI Factory, ensuring its topics have real-world relevance and a strong foundation in cutting-edge research. Finally, the workshop includes two challenges (Multi-Source-Covid-19 Detection and Fair Disease Diagnosis).
Workshop: Responsible Imaging Mon 20 Oct 01:00 p.m.
As imaging technologies advance, they surpass traditional capabilities, capturing and interpreting visual information beyond the limits of human perception. While these cutting-edge computational imaging systems push the boundaries of what can be seen and understood, they also quietly introduce critical ethical concerns related to privacy, safety, and robustness. Since these systems operate beyond human vision, many potential threats remain imperceptible, making them more difficult to detect and mitigate. This workshop aims to bring attention to these challenges and explore innovative solutions to them as imaging technologies push the boundaries of perceiving the invisible.
Workshop: Biometrics for Art Mon 20 Oct 01:00 p.m.
The Biometrics for Arts (ArtMetrics) workshop aims to explore the intersection of biometrics, computer vision and the arts to provide a more nuanced understanding of an artwork's provenance and maker(s). In the same way as Biometrics serves as a tool for person identification from unique phenotypic or behavioural traits, Biometrics for Arts aims at artist recognition from unique attributes detected on works of art, thus fostering dialogue between engineers, computer scientists, heritage scientists, conservators, and art historians. This workshop will showcase innovative applications of computer vision in the visual art domain, emphasizing the role of technology in supporting conservation practices and enhancing the management of museum and private collections. Key topics include pattern recognition in works of art, AI-driven artistic generation, AI-driven analysis of multimodal imaging data of works of art, and digital restoration of different media. ArtMetrics seeks to inspire interdisciplinary collaboration, highlighting how computer vision can both interpret and enhance the diverse world of art.
Workshop: Computer Vision in Plant Phenotyping and Agriculture Mon 20 Oct 01:00 p.m.
The CVPPA aims to advance computer vision techniques for applications in plant phenotyping and agriculture to support sustainable food, feed, fiber, and plant-based fuel production. The workshop seeks to highlight unsolved challenges, showcase current methods, and expand the research community at the intersection of plant and computer sciences. Topics include segmentation, tracking, detection, and reconstruction in agricultural contexts, open-source tools, and annotated datasets with benchmarks. Effective plant phenotyping is urgently needed to support the sustainability of our planet and its inhabitants: building strong community structures and bringing computer vision scientists into this field is more crucial now than ever.
Tutorial: Manling Li · Yunzhu Li · Jiayuan Mao · Wenlong Huang
Foundation Models Meet Embodied Agents
An embodied agent is a generalist agent that can take natural language instructions from humans and perform a wide range of tasks in diverse environments. Recent years have witnessed the emergence of foundation models, which have shown remarkable success in supporting embodied agents across abilities such as goal interpretation, subgoal decomposition, action sequencing, and transition modeling (causal transitions from preconditions to post-effects). We categorize the foundation models into Large Language Models (LLMs), Vision-Language Models (VLMs), and Vision-Language-Action Models (VLAs). In this tutorial, we will comprehensively review existing paradigms of foundation models for embodied agents, focus on their different formulations within the fundamental mathematical framework of robot learning, the Markov Decision Process (MDP), and present a structured view of the robot's decision-making process. This tutorial will give a systematic overview of recent advances in foundation models for embodied agents. We compare these models and explore their design space to guide future developments, focusing on Lower-Level Environment Encoding and Interaction and Longer-Horizon Decision Making.
Tutorial: Daniel Barath
RANSAC in 2025
RANSAC (Random Sample Consensus) has been a cornerstone of robust estimation in computer vision since its introduction in 1981. It remains highly relevant in 2025, as many vision applications still rely on detecting and handling outliers in data. This tutorial, “RANSAC in 2025”, aims to provide a comprehensive update on the latest advancements of RANSAC and its family of algorithms. We will balance theoretical foundations (to understand how and why RANSAC works) with practical applications (to demonstrate its use in real-world vision problems). By covering both classic principles and cutting-edge improvements, the tutorial will equip researchers and practitioners with state-of-the-art techniques for robust model fitting in computer vision.
Tutorial: Xi Li · Muchao Ye · Manling Li
Towards Safe Multi-Modal Learning: Unique Challenges and Future Directions
Modern multi-modal learning leverages large models, such as large language models (LLMs), to integrate diverse data sources (e.g., text, images, audio, and video) and enhance understanding and decision-making. However, the inherent complexities of multi-modal learning introduce unique safety challenges that existing frameworks, primarily designed for uni-modal models, fail to address. This tutorial explores the emerging safety risks in multi-modal learning and provides insights into future research directions. We begin by examining the unique characteristics of multi-modal learning -- modality integration, alignment, and fusion. We then review existing safety studies across adversarial attacks, data poisoning, jailbreak exploits, and hallucinations. Next, we analyze emerging safety threats exploiting multi-modal challenges, including risks from additional modalities, modality misalignment, and fused representations. Finally, we discuss potential directions for enhancing the safety of multi-modal learning. As multi-modal learning expands, addressing its safety risks is crucial. This tutorial lays the foundation for understanding these challenges and fostering discussions on trustworthy systems.
Visual Object Tracking and Segmentation Challenge Workshop Mon 20 Oct 01:30 p.m.
The VOTS2025 workshop is the thirteenth annual benchmarking activity of the VOT initiative, which has successfully identified key trends in tracking research, most recently the rise of video segmentation models as a promising direction for general object tracking. Continuing to connect the tracking community, VOTS2025 pushes the boundaries of tracking research. The workshop will present results of 32 trackers from three sub-challenges, focusing on holistic targets, targets undergoing topological transformations, and real-time tracking. Additionally, the program features presentations of winning methods, a panel discussion, and keynotes outlining future directions in object tracking and video understanding.
Workshop on Scene Graphs and Graph Representation Learning Mon 20 Oct 01:30 p.m.
The workshop focuses on the topic of scene graphs and graph representation learning for visual perception applications in different domains. Through a series of keynote talks, the audience will learn about defining, generating and predicting scene graphs, as well as about employing them for other tasks. Oral presentations of accepted submissions to the workshop will further enrich discussed topics with state-of-the-art advancements and engage the community. The objective is for attendees to learn about current developments and application domains of scene graphs and graph representation learning, as well as to draw inspiration and identify commonalities across these domains. Furthermore, this workshop will create an opportunity to discuss limitations, challenges, and next steps from research, practical, and ethical perspectives.
Workshop: Human-Robot-Scene Interaction and Collaboration Mon 20 Oct 01:30 p.m.
Intelligent robots are advancing rapidly, with embodied agents increasingly expected to work and live alongside humans in households, factories, hospitals, schools, etc. For these agents to operate safely, socially, and intelligently, they must effectively interact with humans and adapt to changing environments. Moreover, such interactions can transform human behavior and even reshape the environment—for example, through adjustments in human motion during robot-assisted handovers or the redesign of objects for improved robotic grasping. Beyond established research in human-human and human-scene interactions, vast opportunities remain in exploring human-robot-scene collaboration. This workshop will explore the integration of embodied agents into dynamic human-robot-scene interactions.
2nd Workshop on Scalable 3D Generation and 3D Geometric Scene Understanding Mon 20 Oct 02:00 p.m.
The objective of this workshop is to bring together engineers and researchers from academia and industry to discuss state-of-the-art methods and challenges in computer vision for 3D scene generation, 3D scene reconstruction, and 3D compositional scene geometric representation learning at large scale. Moreover, this edition of the workshop will also highlight 3D scene generation and understanding from multimodal data such as video, audio, and text, driven by a growing range of industrial applications.
International Workshop on Observing and Understanding Hands in Action Mon 20 Oct 02:00 p.m.
The ninth edition of this workshop will emphasize the use of multimodal LLMs for hand-related tasks. Multimodal LLMs have revolutionized perceptions of AI and made groundbreaking contributions to multimodal understanding, zero-shot learning, and transfer learning. These models can process and integrate information from different types of hand data (modalities), allowing them to better understand complex hand-object and hand-hand interaction situations by capturing richer, more diverse representations.