


Tutorials
Tutorial
Huaizu Jiang
Abstract
3D human motion generation and simulation is an important area of research with applications in virtual reality, gaming, animation, robotics, and AI-driven content creation. Generating realistic and controllable human motion is essential for creating interactive digital environments, improving character animation, and enhancing human-computer interaction. Recent advances in deep learning have made it possible to automate motion generation, reducing the need for expensive motion capture and manual animation. Techniques such as diffusion models, generative masking, and variational autoencoders (VAEs) have been used to synthesize diverse and realistic human motion. Transformer-based models have improved the ability to capture temporal dependencies, leading to smoother and more natural movement. In addition, reinforcement learning and physics-based methods have helped create physically consistent and responsive motion, which is useful for applications like robotics and virtual avatars. This tutorial will bridge the gap between computer vision, graphics, and robotics, providing a comprehensive guide to the latest methods, practical applications, and future challenges. It will be organized into six core parts, guiding you from foundational knowledge to advanced research frontiers: (1) Human Motion Generation Basics: introducing fundamentals, key concepts, and data representations; (2) Kinematic-Based Generation Methods: exploring popular data-driven techniques that learn from motion capture datasets to …
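To make the VAE-based generation idea above concrete, here is a minimal sketch (our illustration, not the presenters' code) of a sequence VAE over motion clips, assuming motion is represented as per-frame joint features; the feature size, hidden width, and latent size are arbitrary placeholders.

```python
# Minimal sequence VAE for motion clips (illustrative sketch, not from the tutorial).
import torch
import torch.nn as nn

class MotionVAE(nn.Module):
    """Encode a motion clip (T x feat_dim) into a latent code and decode it back."""
    def __init__(self, feat_dim=66, hidden=256, latent=32):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)
        self.to_mu = nn.Linear(hidden, latent)
        self.to_logvar = nn.Linear(hidden, latent)
        self.decoder = nn.GRU(latent, hidden, batch_first=True)
        self.to_frame = nn.Linear(hidden, feat_dim)

    def forward(self, motion):                    # motion: (B, T, feat_dim)
        _, h = self.encoder(motion)               # h: (1, B, hidden)
        mu, logvar = self.to_mu(h[-1]), self.to_logvar(h[-1])
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterization trick
        z_seq = z.unsqueeze(1).expand(-1, motion.shape[1], -1) # broadcast latent over time
        out, _ = self.decoder(z_seq)
        return self.to_frame(out), mu, logvar

def vae_loss(recon, target, mu, logvar, beta=1e-3):
    rec = ((recon - target) ** 2).mean()                        # reconstruction term
    kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())  # KL regularizer
    return rec + beta * kld

# Usage: x = torch.randn(8, 60, 66); recon, mu, logvar = MotionVAE()(x)
```

Sampling a random latent and decoding it gives a new motion clip, which is the generative use of the model.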
Tutorial
Yujun Cai · Yiwei Wang · Kai-Wei Chang · Junsong Yuan · Ziwei Liu · Chi Zhang · Jun Liu · Ming-Hsuan Yang
Abstract
Vision-Language Models (VLMs) have achieved remarkable progress in image captioning and visual question answering, yet developing genuine reasoning capabilities remains an open challenge. Unlike recent breakthroughs in reasoning-focused LLMs, many VLMs still rely primarily on pattern recognition and struggle with compositional logic. This tutorial provides a comprehensive overview of reasoning capabilities in VLMs, focusing on the transition from basic perception to complex inference. We will explore reasoning-oriented prompting and training techniques in multimodal contexts, reasoning-focused benchmarks, and architectural innovations for visual-textual fusion. Through lectures and hands-on demonstrations, participants will gain insights into current capabilities, persistent challenges in compositional generalization and explainability, and practical guidance for implementing reasoning mechanisms. This tutorial uniquely bridges advances in LLM reasoning with the visual domain, addressing the distinct challenges of spatial information processing and providing a roadmap toward more cognitively capable vision-language systems.
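As a toy illustration of reasoning-oriented prompting in a multimodal context (a generic template we wrote for illustration, not a technique endorsed by the tutorial), a prompt can ask the model to ground its answer in observed objects and spatial relations before committing to an answer:

```python
# Illustrative reasoning-oriented prompt template for a VLM (assumed format).
def build_reasoning_prompt(question: str) -> str:
    return (
        "You are given an image and a question about it.\n"
        f"Question: {question}\n"
        "First, list the relevant objects and their spatial relations you observe.\n"
        "Then reason step by step over these observations.\n"
        "Finally, give a single short phrase prefixed by 'Answer:'."
    )

# Usage: prompt = build_reasoning_prompt("Is the mug to the left of the laptop?")
```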
Tutorial
Zhiyu Huang · Zewei Zhou · Zhihao Zhao
Abstract
Self-driving technologies have demonstrated significant potential to transform human mobility. However, single-agent systems face inherent limitations in perception and decision-making capabilities. Transitioning from self-driving vehicles to cooperative multi-vehicle systems and large-scale intelligent transportation systems is essential to enable safer and more efficient mobility. Realizing such sophisticated mobility systems introduces significant challenges, requiring comprehensive tools and models, simulation environments, real-world datasets, and deployment frameworks. This tutorial will delve into key areas of driving automation, beginning with advanced end-to-end self-driving techniques such as vision-language-action (VLA) models, interactive prediction and planning, and scenario generation. The tutorial emphasizes V2X communication and cooperative perception in real-world settings, as well as datasets including V2X-Real and V2XPnP. It also covers simulation and deployment frameworks for urban mobility, such as MetaDrive, MetaUrban, and UrbanSim. By bridging foundational research with real-world deployment, this tutorial offers practical insights into developing future-ready autonomous mobility systems.
Tutorial
Qing Qu · Zhihui Zhu · Sam Buchanan · Liyue Shen · Peihao Wang · Yi Ma
Abstract
Over the past decade, the advent of deep learning and large-scale computing has immeasurably changed the ways we process, interpret, and predict with data in imaging and computer vision. The "traditional" approach to algorithm design, based around parametric models for specific structures of signals and measurements (say, sparse and low-rank models) and the associated optimization toolkit, is now significantly enriched with data-driven, learning-based techniques, where large-scale networks are pre-trained and then adapted to a variety of specific tasks. Nevertheless, the successes of both modern data-driven and classic model-based paradigms rely crucially on correctly identifying the low-dimensional structures present in real-world data, to the extent that we see the roles of learning and compression in data processing algorithms, whether explicit or implicit, as with deep networks, as inextricably linked. As such, this tutorial offers a timely treatment that uniquely bridges low-dimensional models with deep learning in imaging and vision. It will show how (i) these low-dimensional models and principles provide a valuable lens for formulating problems and understanding the behavior of modern deep models in imaging and computer vision, and (ii) ideas from low-dimensional models can provide valuable guidance for designing new parameter-efficient, robust, and interpretable deep learning models for computer vision …
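For readers unfamiliar with the classic sparse-modeling toolkit referenced above, here is a minimal sketch of ISTA (iterative soft-thresholding) for the Lasso problem min_x 1/2||Ax - y||^2 + lam*||x||_1; the dictionary size, sparsity level, and regularization weight below are illustrative choices, not from the tutorial.

```python
# ISTA sketch for sparse coding (illustrative example).
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(A, y, lam=0.1, n_iters=200):
    """Recover a sparse code x such that A @ x approximates y."""
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iters):
        grad = A.T @ (A @ x - y)           # gradient of the quadratic term
        x = soft_threshold(x - grad / L, lam / L)
    return x

# Usage on synthetic data: a random dictionary and a 5-sparse ground-truth code.
rng = np.random.default_rng(0)
A = rng.standard_normal((64, 256))
x_true = np.zeros(256); x_true[rng.choice(256, 5, replace=False)] = 1.0
x_hat = ista(A, A @ x_true, lam=0.05)
```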
Tutorial
Marcos Conde · Radu Timofte
Abstract
Computational Photography and low-level vision are pivotal research areas within Computer Vision, significantly impacting both academia and industry. Despite their importance, progress in these fields often lags behind areas like generative AI, primarily due to the scarcity of standardized datasets, clear benchmarks, and limited transparency from camera manufacturers. This tutorial bridges the gap between academic research and industry applications by providing an in-depth, hands-on exploration of computational photography and imaging using deep learning. Collaboratively presented by leading academic researchers and prominent industry experts from Sony, this tutorial systematically covers learned Image Signal Processors (ISPs), cutting-edge transformer and convolutional neural network architectures for image restoration and enhancement, and the development of realistic synthetic data generation pipelines. Attendees will acquire practical skills in dataset creation, realistic pipeline simulation, and evaluation protocols, empowering them with the tools and insights needed to accelerate innovation in this field.
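As a rough sketch of what a learned ISP can look like (our simplified illustration; the architectures covered by the presenters will differ), a small CNN can map a packed 4-channel Bayer RAW mosaic directly to an RGB image:

```python
# Tiny learned-ISP sketch: packed RGGB RAW -> RGB (illustrative, not the tutorial's model).
import torch
import torch.nn as nn

class TinyLearnedISP(nn.Module):
    def __init__(self, width=32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(4, width, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(width, width, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(width, 3 * 4, 3, padding=1),   # predict RGB at 2x resolution
            nn.PixelShuffle(2),                      # (B, 12, H, W) -> (B, 3, 2H, 2W)
        )

    def forward(self, raw):    # raw: (B, 4, H/2, W/2) packed RGGB planes in [0, 1]
        return torch.sigmoid(self.body(raw))

# Usage: rgb = TinyLearnedISP()(torch.rand(1, 4, 128, 128))  # -> (1, 3, 256, 256)
```

Real learned ISPs are trained on paired RAW/RGB data and fold in white balance, denoising, and tone mapping, which is where the dataset and simulation topics above come in.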
Tutorial
Jiawen Zhu · Chengjie Wang · Guansong Pang
Abstract
In recent years, foundation models have emerged as transformative tools in computer vision, offering powerful zero-shot and few-shot learning capabilities across a wide range of tasks. Their integration into visual anomaly detection—a critical and high-stakes field spanning healthcare, industrial inspection, security, and autonomous systems—has opened new frontiers in both research and real-world applications. This tutorial aims to deliver a comprehensive and timely overview of the role of foundation models in visual anomaly detection. We will cover multiple visual modalities, including 2D images, 3D images, and videos—each presenting unique challenges and necessitating modality-specific solutions. Specifically, we will delve into the entire pipeline, from data (pre-)training and prompt engineering to methodological innovations, inference strategies, and deployment in real-world environments. Key topics include zero- and few-shot learning, pseudo-labeling, anomaly generation, and multi-modal alignment between vision and language. To facilitate a deep and practical understanding of these areas, the tutorial will bring together leading experts from both academia and industry. Through in-depth technical presentations and discussions, participants will gain valuable insights into the latest advances, real-world applications, and open challenges shaping this rapidly evolving field.
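To illustrate the zero-shot, vision-language-alignment idea mentioned above, here is a minimal sketch that scores image patches by comparing their embeddings against "normal" versus "anomalous" text-prompt embeddings; the embeddings below are random stand-ins for features from any CLIP-style model, and the temperature is an arbitrary choice.

```python
# Zero-shot anomaly scoring via vision-language alignment (illustrative sketch).
import numpy as np

def zero_shot_anomaly_map(patch_emb, normal_text_emb, anomalous_text_emb, tau=0.07):
    """patch_emb: (H*W, D); text embeddings: (D,). Returns per-patch anomaly scores."""
    def normalize(v):
        return v / np.linalg.norm(v, axis=-1, keepdims=True)
    p = normalize(patch_emb)
    t = normalize(np.stack([normal_text_emb, anomalous_text_emb]))   # (2, D)
    logits = p @ t.T / tau                                           # cosine similarities
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)                        # softmax over the two prompts
    return probs[:, 1]            # probability mass on the "anomalous" prompt

# Usage with random stand-in features (D = 512):
rng = np.random.default_rng(0)
scores = zero_shot_anomaly_map(rng.standard_normal((14 * 14, 512)),
                               rng.standard_normal(512), rng.standard_normal(512))
```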
Tutorial
Shaohui Liu · Anusha Krishnan · Jakob Engel · Marc Pollefeys
Abstract
Simultaneous localization and mapping (SLAM) is a fundamental technique with applications spanning robotics, spatial AI, and autonomous navigation. It addresses two tightly coupled challenges: localizing the device while incrementally building a coherent map of the surroundings. Localization, or positioning, involves estimating a 6 Degrees-of-Freedom (6-DoF) pose for each image in a continuous sequence, typically aided by other sensor data, while mapping involves constructing an evolving representation of the surrounding environment. This tutorial specifically addresses the task of accurate positioning for large-scale egocentric data using visual-inertial odometry (VIO) and SLAM. It offers a comprehensive overview of the challenges faced by VIO/SLAM methods on egocentric data and introduces a new dataset and benchmark that can serve as a robust testbed for evaluating these systems. With the help of well-positioned speakers, this tutorial explores the new benchmarking approach by analyzing failure cases, identifying limitations, and highlighting open problems in open-source academic VIO/SLAM systems. Additionally, it provides hands-on experience using the dataset and evaluation tools for researchers to get started with their own SLAM evaluations.
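As a small, concrete example of the kind of positioning evaluation such a benchmark involves (a generic sketch, not the tutorial's evaluation tools), absolute trajectory error (ATE) can be computed after rigidly aligning the estimated trajectory to ground truth:

```python
# ATE after rigid (rotation + translation) alignment of corresponding positions.
import numpy as np

def align_rigid(est, gt):
    """Least-squares rotation/translation mapping est onto gt (Umeyama, no scale)."""
    mu_e, mu_g = est.mean(0), gt.mean(0)
    cov = (gt - mu_g).T @ (est - mu_e) / est.shape[0]
    U, _, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U @ Vt) < 0:       # avoid reflections
        S[2, 2] = -1
    R = U @ S @ Vt
    t = mu_g - R @ mu_e
    return R, t

def ate_rmse(est, gt):
    """est, gt: (N, 3) arrays of corresponding positions. Returns RMSE in meters."""
    R, t = align_rigid(est, gt)
    err = (est @ R.T + t) - gt
    return np.sqrt((err ** 2).sum(axis=1).mean())

# Usage: ate_rmse(np.random.rand(100, 3), np.random.rand(100, 3))
```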
Tutorial
Yangguang Li · Angela Dai · Minghao Chen · Zhaoxi Chen
Abstract
In recent years, thanks to the continuous innovation and progress of diffusion technology, significant advancements have been made in image and video generation. By inputting textual descriptions or images, we can generate high-quality images or videos, which greatly enhance creative efficiency and imagination. However, progress in the 3D generation field has been relatively slow. Initially, optimization routes, represented by DreamFusion, were explored. This was followed by the exploration of reconstruction routes, such as LRM. Only later were diffusion-based 3D generation techniques, similar to those in image and video generation, gradually developed. In addition, 3D generation based on autoregressive methods, using a token-by-token prediction scheme similar to LLMs, has gradually made significant progress. Therefore, this tutorial focuses on 3D asset generation using diffusion and autoregression, specifically including: (1) Geometry generation modeling based on the diffusion paradigm; (2) Geometry generation modeling based on the autoregression paradigm; (3) Texture generation modeling based on the diffusion paradigm.
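To give a feel for the autoregressive, token-by-token paradigm in (2), here is a minimal sampling-loop sketch; `model` is a placeholder for any network that maps a token sequence to next-token logits, and the vocabulary, special tokens, and downstream geometry decoder are assumptions rather than a specific published method.

```python
# Token-by-token sampling loop for discrete shape tokens (illustrative sketch).
import torch

@torch.no_grad()
def sample_shape_tokens(model, start_token, max_tokens=1024, temperature=1.0, eos_token=None):
    tokens = [start_token]
    for _ in range(max_tokens):
        logits = model(torch.tensor([tokens]))[0, -1]        # next-token logits: (vocab_size,)
        probs = torch.softmax(logits / temperature, dim=-1)
        nxt = torch.multinomial(probs, 1).item()             # sample one token
        if eos_token is not None and nxt == eos_token:
            break
        tokens.append(nxt)                                   # grow the sequence one token at a time
    return tokens   # discrete tokens, to be decoded into geometry by a separate (e.g., VQ) decoder
```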
Tutorial
Changhoon Kim · Yezhou Yang · Sijia Liu
Abstract
Vision-language generative models, such as text-to-image and image-to-text systems, have rapidly transitioned from research prototypes to widely deployed tools across domains like education, journalism, and design. However, their real-world adoption has introduced critical challenges surrounding robustness, controllability, and ethical risks, including issues like prompt misalignment, unauthorized content generation, adversarial attacks, and data memorization. This tutorial provides a comprehensive overview of these concerns and emerging solutions by covering recent advances and failure modes in state-of-the-art models, robust concept erasure techniques in diffusion models, and adversarial vulnerabilities and defenses in image-to-text systems. Building on these theoretical foundations, participants will examine failure scenarios, explore attack and defense strategies, and gain practical insights into enhancing the trustworthiness of multimodal generative models. Designed for researchers and practitioners in vision, language, and AI safety, this tutorial uniquely focuses on the responsible deployment of these models, bridging technical rigor with societal impact and offering guidance for future research directions in secure and reliable generative AI.
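As one concrete example of the adversarial vulnerabilities discussed above (a generic FGSM-style sketch, not a method from the tutorial), a single gradient-sign step on the input can already shift a differentiable model's output; `model` and `loss_fn` are placeholders for any differentiable vision-language pipeline.

```python
# FGSM-style input perturbation (illustrative sketch).
import torch

def fgsm_perturb(model, loss_fn, image, target, epsilon=2 / 255):
    """image: float tensor in [0, 1]. Returns an adversarially perturbed copy."""
    image = image.clone().detach().requires_grad_(True)
    loss = loss_fn(model(image), target)
    loss.backward()
    adv = image + epsilon * image.grad.sign()     # step in the gradient-sign direction
    return adv.clamp(0, 1).detach()
```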
Tutorial
Aditya Chattopadhyay · Rene Vidal · Jeremias Sulam
Abstract
In recent years, interpretability has emerged as a significant barrier to the widespread adoption of deep learning techniques, particularly in domains where AI decisions can have consequential impacts on human lives, such as healthcare and finance. Recent attempts at interpreting the decisions made by a deep network can be broadly classified into two categories: (i) methods that seek to explain existing models (post-hoc explainability), and (ii) methods that seek to build models that are explainable by design. This tutorial aims to provide a comprehensive overview of both approaches along with a discussion of their limitations. More specifically, the tutorial will consist of three lectures covering the following topics: Post-hoc explainability methods; Explaining deep networks using Shapley values and statistical testing; Explainable-by-design deep networks.
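To ground the Shapley-value lecture for readers new to it, here is a minimal Monte Carlo sketch of Shapley-style feature attribution (our illustration; the baseline, sampling budget, and toy model are arbitrary): each feature's value is its average marginal contribution over random feature orderings, with "absent" features held at a baseline.

```python
# Monte Carlo Shapley-value estimation for feature attribution (illustrative sketch).
import numpy as np

def shapley_attribution(predict, x, baseline, n_permutations=200, seed=0):
    """predict: callable on a (d,) input returning a scalar score."""
    rng = np.random.default_rng(seed)
    d = x.shape[0]
    phi = np.zeros(d)
    for _ in range(n_permutations):
        order = rng.permutation(d)
        current = baseline.copy()
        prev_score = predict(current)
        for j in order:                      # reveal features one by one
            current[j] = x[j]
            score = predict(current)
            phi[j] += score - prev_score     # marginal contribution of feature j
            prev_score = score
    return phi / n_permutations

# Usage with a toy linear model (recovers w * (x - baseline), here [1, -2, 0.5]):
w = np.array([1.0, -2.0, 0.5])
print(shapley_attribution(lambda v: float(w @ v), np.ones(3), np.zeros(3)))
```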
Tutorial
James Fort
Abstract
Project Aria is a research device that is worn like a regular pair of glasses, for researchers to study the future of computer vision with always-on sensing. Sensors in Project Aria capture egocentric video and audio, in addition to eye-gaze, inertial, and location information. On-device compute power is used to encrypt and store information that, when uploaded to separate designated back-end storage, helps researchers build the capabilities necessary for AR to work in the real world. In this fourth tutorial, in addition to sharing research from academic partner program members, we will also provide an introduction to the second generation of Aria glasses, 'Aria Gen 2', announced in February. As part of this introduction, we will provide a live hands-on demo of the Aria Research Kit (including Gen 2 glasses), describe how researchers can gain access to the Project Aria academic program, and demonstrate how open-source tools can be used to accelerate research for specific research challenges, including visual and non-visual localization and mapping, static and dynamic object detection and spatialization, human pose estimation, and building geometry estimation. We will review new open datasets from Meta and academic partners, including a dataset of 6000+ 3D objects with Aria captures for each …
Tutorial
Andrew Westbury · Shoubhik Debnath · Weiyao Wang · Laura Gustafson · Daniel Bolya · Xitong Yang · Kate Saenko · Chaitanya Ryali · Haitham Khedr · Christoph Feichtenhofer
Abstract
In this tutorial, Meta AI and its academic partners will provide an overview of frontier research on visual grounding. We will cover each building block necessary to move toward future general-purpose visual grounding systems, including universal image and video encoding, multimodal language understanding, semantic instance segmentation and tracking, and the latest in 3D reconstruction methods. We will provide practical guidance on using SAM open-source models, resources, and tooling to tackle the field's biggest open research problems. A new suite of SAM systems to be released this year will provide a foundation for our tutorial, offering practical entry points for each course component.
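As a starting point for the hands-on material, the sketch below shows typical prompted-mask prediction with the publicly released segment-anything package; the checkpoint path, image file, and click coordinates are placeholders you would substitute with your own setup.

```python
# Minimal prompt-to-mask sketch with the open-source segment-anything package.
import numpy as np
from PIL import Image
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")   # placeholder checkpoint file
predictor = SamPredictor(sam)

image = np.array(Image.open("example.jpg").convert("RGB"))      # placeholder image
predictor.set_image(image)

# One foreground click (x, y) as the prompt; label 1 marks it as foreground.
masks, scores, _ = predictor.predict(
    point_coords=np.array([[500, 300]]),
    point_labels=np.array([1]),
    multimask_output=True,
)
best_mask = masks[np.argmax(scores)]                            # boolean (H, W) mask
```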
Tutorial
Daniel Barath
Abstract
RANSAC (Random Sample Consensus) has been a cornerstone of robust estimation in computer vision since its introduction in 1981. It remains highly relevant in 2025, as many vision applications still rely on detecting and handling outliers in data. This tutorial, “RANSAC in 2025”, aims to provide a comprehensive update on the latest advancements of RANSAC and its family of algorithms. We will balance theoretical foundations (to understand how and why RANSAC works) with practical applications (to demonstrate its use in real-world vision problems). By covering both classic principles and cutting-edge improvements, the tutorial will equip researchers and practitioners with state-of-the-art techniques for robust model fitting in computer vision.
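For readers who want the classic loop spelled out before diving into the modern variants, here is a minimal vanilla-RANSAC sketch for 2D line fitting: hypothesize from minimal samples, score by inlier count, keep the best. Thresholds and iteration counts are illustrative; real implementations add refinements such as adaptive stopping and local optimization.

```python
# Vanilla RANSAC for 2D line fitting (illustrative sketch).
import numpy as np

def ransac_line(points, n_iters=1000, inlier_thresh=0.01, seed=0):
    """points: (N, 2). Returns line (a, b, c) with a*x + b*y + c = 0 and an inlier mask."""
    rng = np.random.default_rng(seed)
    best_model, best_inliers = None, np.zeros(len(points), dtype=bool)
    for _ in range(n_iters):
        p1, p2 = points[rng.choice(len(points), 2, replace=False)]   # minimal sample
        direction = p2 - p1
        norm = np.linalg.norm(direction)
        if norm < 1e-12:
            continue
        a, b = -direction[1] / norm, direction[0] / norm             # unit normal of the line
        c = -(a * p1[0] + b * p1[1])
        dist = np.abs(points @ np.array([a, b]) + c)                 # point-to-line distances
        inliers = dist < inlier_thresh
        if inliers.sum() > best_inliers.sum():                       # keep the best hypothesis
            best_model, best_inliers = (a, b, c), inliers
    return best_model, best_inliers

# Usage: x = np.linspace(0, 1, 100); pts = np.c_[x, 2 * x + 0.01 * np.random.randn(100)]
# model, mask = ransac_line(np.vstack([pts, np.random.rand(30, 2)]))
```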
Tutorial
Xi Li · Muchao Ye · Manling Li
Abstract
Modern multi-modal learning leverages large models, such as large language models (LLMs), to integrate diverse data sources (e.g., text, images, audio, and video) and enhance understanding and decision-making. However, the inherent complexities of multi-modal learning introduce unique safety challenges that existing frameworks, primarily designed for uni-modal models, fail to address. This tutorial explores the emerging safety risks in multi-modal learning and provides insights into future research directions. We begin by examining the unique characteristics of multi-modal learning -- modality integration, alignment, and fusion. We then review existing safety studies across adversarial attacks, data poisoning, jailbreak exploits, and hallucinations. Next, we analyze emerging safety threats exploiting multi-modal challenges, including risks from additional modalities, modality misalignment, and fused representations. Finally, we discuss potential directions for enhancing the safety of multi-modal learning. As multi-modal learning expands, addressing its safety risks is crucial. This tutorial lays the foundation for understanding these challenges and fostering discussions on trustworthy systems.
Tutorial
Manling Li
Abstract
An embodied agent is a generalist agent that can take natural language instructions from humans and perform a wide range of tasks in diverse environments. Recent years have witnessed the emergence of foundation models, which have shown remarkable success in supporting embodied agents with different abilities such as goal interpretation, subgoal decomposition, action sequencing, and transition modeling (causal transitions from preconditions to post-effects). We categorize the foundation models into Large Language Models (LLMs), Vision-Language Models (VLMs), and Vision-Language-Action Models (VLAs). In this tutorial, we will comprehensively review existing paradigms of foundation models for embodied agents, focus on their different formulations within the fundamental mathematical framework of robot learning, the Markov Decision Process (MDP), and present a structured view for investigating the robot's decision-making process. The tutorial will offer a systematic overview of recent advances in foundation models for embodied agents. We compare these models and explore their design space to guide future developments, focusing on Lower-Level Environment Encoding and Interaction and Longer-Horizon Decision Making.
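To make the MDP framing concrete, below is a minimal, self-contained sketch of the ingredients (states, actions, transition model, reward) and a rollout loop; all names and the toy task are our illustrative assumptions, not part of any specific foundation model.

```python
# Minimal MDP and rollout loop (illustrative sketch of the framing, not a method).
import random
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class MDP:
    states: List[str]
    actions: List[str]
    transition: Callable[[str, str], str]      # (s, a) -> next state (preconditions to effects)
    reward: Callable[[str, str, str], float]   # (s, a, s') -> scalar reward

def rollout(mdp: MDP, policy: Callable[[str], str], start: str, horizon: int = 10) -> float:
    s, total = start, 0.0
    for _ in range(horizon):
        a = policy(s)                          # e.g., an LLM/VLM-backed action sequencer
        s_next = mdp.transition(s, a)
        total += mdp.reward(s, a, s_next)
        s = s_next
    return total

# Usage: a trivial two-state task where "act" moves the agent to the goal.
toy = MDP(["start", "goal"], ["wait", "act"],
          transition=lambda s, a: "goal" if a == "act" else s,
          reward=lambda s, a, s2: 1.0 if s2 == "goal" and s != "goal" else 0.0)
print(rollout(toy, policy=lambda s: random.choice(["wait", "act"]), start="start"))
```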