


Tutorials
Tutorial
Huaizu Jiang
Abstract
3D human motion generation and simulation is an important area of research with applications in virtual reality, gaming, animation, robotics, and AI-driven content creation. Generating realistic and controllable human motion is essential for creating interactive digital environments, improving character animation, and enhancing human-computer interaction. Recent advances in deep learning have made it possible to automate motion generation, reducing the need for expensive motion capture and manual animation. Techniques such as diffusion models, generative masking, and variational autoencoders (VAEs) have been used to synthesize diverse and realistic human motion. Transformer-based models have improved the ability to capture temporal dependencies, leading to smoother and more natural movement. In addition, reinforcement learning and physics-based methods have helped create physically consistent and responsive motion, which is useful for applications like robotics and virtual avatars. This tutorial will bridge the gap between computer vision, graphics, and robotics, providing a comprehensive guide to the latest methods, practical applications, and future challenges. It will be organized into six core parts, guiding you from foundational knowledge to advanced research frontiers: (1) Human Motion Generation Basics: introducing fundamentals, key concepts, and data representations; (2) Kinematic-Based Generation Methods: exploring popular data-driven techniques that learn from motion capture datasets to …
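To make the VAE-based generation idea above concrete, here is a minimal sketch (our illustration, not the presenters' code) of a sequence VAE over motion clips, assuming motion is represented as per-frame joint features; the feature size, hidden width, and latent size are arbitrary placeholders.

```python
# Minimal sequence VAE for motion clips (illustrative sketch, not from the tutorial).
import torch
import torch.nn as nn

class MotionVAE(nn.Module):
    """Encode a motion clip (T x feat_dim) into a latent code and decode it back."""
    def __init__(self, feat_dim=66, hidden=256, latent=32):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)
        self.to_mu = nn.Linear(hidden, latent)
        self.to_logvar = nn.Linear(hidden, latent)
        self.decoder = nn.GRU(latent, hidden, batch_first=True)
        self.to_frame = nn.Linear(hidden, feat_dim)

    def forward(self, motion):                    # motion: (B, T, feat_dim)
        _, h = self.encoder(motion)               # h: (1, B, hidden)
        mu, logvar = self.to_mu(h[-1]), self.to_logvar(h[-1])
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterization trick
        z_seq = z.unsqueeze(1).expand(-1, motion.shape[1], -1) # broadcast latent over time
        out, _ = self.decoder(z_seq)
        return self.to_frame(out), mu, logvar

def vae_loss(recon, target, mu, logvar, beta=1e-3):
    rec = ((recon - target) ** 2).mean()                        # reconstruction term
    kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())  # KL regularizer
    return rec + beta * kld

# Usage: x = torch.randn(8, 60, 66); recon, mu, logvar = MotionVAE()(x)
```

Sampling a random latent and decoding it gives a new motion clip, which is the generative use of the model.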
Tutorial
Yujun Cai · Yiwei Wang · Kai-Wei Chang · Junsong Yuan · Ziwei Liu · Chi Zhang · Jun Liu · Ming-Hsuan Yang
Abstract
Vision-Language Models (VLMs) have achieved remarkable progress in image captioning and visual question answering, yet developing genuine reasoning capabilities remains an open challenge. Unlike recent breakthroughs in reasoning-focused LLMs, many VLMs still rely primarily on pattern recognition and struggle with compositional logic. This tutorial provides a comprehensive overview of reasoning capabilities in VLMs, focusing on the transition from basic perception to complex inference. We will explore reasoning-oriented prompting and training techniques in multimodal contexts, reasoning-focused benchmarks, and architectural innovations for visual-textual fusion. Through lectures and hands-on demonstrations, participants will gain insights into current capabilities, persistent challenges in compositional generalization and explainability, and practical guidance for implementing reasoning mechanisms. This tutorial uniquely bridges advances in LLM reasoning with the visual domain, addressing the distinct challenges of spatial information processing and providing a roadmap toward more cognitively capable vision-language systems.
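As a toy illustration of reasoning-oriented prompting in a multimodal context (a generic template we wrote for illustration, not a technique endorsed by the tutorial), a prompt can ask the model to ground its answer in observed objects and spatial relations before committing to an answer:

```python
# Illustrative reasoning-oriented prompt template for a VLM (assumed format).
def build_reasoning_prompt(question: str) -> str:
    return (
        "You are given an image and a question about it.\n"
        f"Question: {question}\n"
        "First, list the relevant objects and their spatial relations you observe.\n"
        "Then reason step by step over these observations.\n"
        "Finally, give a single short phrase prefixed by 'Answer:'."
    )

# Usage: prompt = build_reasoning_prompt("Is the mug to the left of the laptop?")
```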
Tutorial
Zhiyu Huang · Zewei Zhou · Zhihao Zhao
Abstract
Self-driving technologies have demonstrated significant potential to transform human mobility. However, single-agent systems face inherent limitations in perception and decision-making capabilities. Transitioning from self-driving vehicles to cooperative multi-vehicle systems and large-scale intelligent transportation systems is essential to enable safer and more efficient mobility. Realizing such sophisticated mobility systems introduces significant challenges, requiring comprehensive tools and models, simulation environments, real-world datasets, and deployment frameworks. This tutorial will delve into key areas of driving automation, beginning with advanced end-to-end self-driving techniques such as vision-language-action (VLA) models, interactive prediction and planning, and scenario generation. The tutorial emphasizes V2X communication and cooperative perception in real-world settings, as well as datasets including V2X-Real and V2XPnP. It also covers simulation and deployment frameworks for urban mobility, such as MetaDrive, MetaUrban, and UrbanSim. By bridging foundational research with real-world deployment, this tutorial offers practical insights into developing future-ready autonomous mobility systems.
Tutorial
Qing Qu · Zhihui Zhu · Sam Buchanan · Liyue Shen · Peihao Wang · Yi Ma
Abstract
Over the past decade, the advent of deep learning and large-scale computing has immeasurably changed the ways we process, interpret, and predict with data in imaging and computer vision. The "traditional" approach to algorithm design, based around parametric models for specific structures of signals and measurements (say, sparse and low-rank models) and the associated optimization toolkit, is now significantly enriched with data-driven, learning-based techniques, where large-scale networks are pre-trained and then adapted to a variety of specific tasks. Nevertheless, the successes of both modern data-driven and classic model-based paradigms rely crucially on correctly identifying the low-dimensional structures present in real-world data, to the extent that we see the roles of learning and compression in data processing algorithms, whether explicit or implicit, as with deep networks, as inextricably linked. As such, this tutorial offers a timely treatment that uniquely bridges low-dimensional models with deep learning in imaging and vision. It will show how (i) these low-dimensional models and principles provide a valuable lens for formulating problems and understanding the behavior of modern deep models in imaging and computer vision, and (ii) ideas from low-dimensional models can provide valuable guidance for designing new parameter-efficient, robust, and interpretable deep learning models for computer vision …
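For readers unfamiliar with the classic sparse-modeling toolkit referenced above, here is a minimal sketch of ISTA (iterative soft-thresholding) for the Lasso problem min_x 1/2||Ax - y||^2 + lam*||x||_1; the dictionary size, sparsity level, and regularization weight below are illustrative choices, not from the tutorial.

```python
# ISTA sketch for sparse coding (illustrative example).
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(A, y, lam=0.1, n_iters=200):
    """Recover a sparse code x such that A @ x approximates y."""
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iters):
        grad = A.T @ (A @ x - y)           # gradient of the quadratic term
        x = soft_threshold(x - grad / L, lam / L)
    return x

# Usage on synthetic data: a random dictionary and a 5-sparse ground-truth code.
rng = np.random.default_rng(0)
A = rng.standard_normal((64, 256))
x_true = np.zeros(256); x_true[rng.choice(256, 5, replace=False)] = 1.0
x_hat = ista(A, A @ x_true, lam=0.05)
```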
Tutorial
Marcos Conde · Radu Timofte
Abstract
Computational Photography and low-level vision are pivotal research areas within Computer Vision, significantly impacting both academia and industry. Despite their importance, progress in these fields often lags behind areas like generative AI, primarily due to the scarcity of standardized datasets, clear benchmarks, and limited transparency from camera manufacturers. This tutorial bridges the gap between academic research and industry applications by providing an in-depth, hands-on exploration of computational photography and imaging using deep learning. Collaboratively presented by leading academic researchers and prominent industry experts from Sony, this tutorial systematically covers learned Image Signal Processors (ISPs), cutting-edge transformer and convolutional neural network architectures for image restoration and enhancement, and the development of realistic synthetic data generation pipelines. Attendees will acquire practical skills in dataset creation, realistic pipeline simulation, and evaluation protocols, empowering them with the tools and insights needed to accelerate innovation in this field.
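As a rough sketch of what a learned ISP can look like (our simplified illustration; the architectures covered by the presenters will differ), a small CNN can map a packed 4-channel Bayer RAW mosaic directly to an RGB image:

```python
# Tiny learned-ISP sketch: packed RGGB RAW -> RGB (illustrative, not the tutorial's model).
import torch
import torch.nn as nn

class TinyLearnedISP(nn.Module):
    def __init__(self, width=32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(4, width, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(width, width, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(width, 3 * 4, 3, padding=1),   # predict RGB at 2x resolution
            nn.PixelShuffle(2),                      # (B, 12, H, W) -> (B, 3, 2H, 2W)
        )

    def forward(self, raw):    # raw: (B, 4, H/2, W/2) packed RGGB planes in [0, 1]
        return torch.sigmoid(self.body(raw))

# Usage: rgb = TinyLearnedISP()(torch.rand(1, 4, 128, 128))  # -> (1, 3, 256, 256)
```

Real learned ISPs are trained on paired RAW/RGB data and fold in white balance, denoising, and tone mapping, which is where the dataset and simulation topics above come in.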
Tutorial
Jiawen Zhu · Chengjie Wang · Guansong Pang
Abstract
In recent years, foundation models have emerged as transformative tools in computer vision, offering powerful zero-shot and few-shot learning capabilities across a wide range of tasks. Their integration into visual anomaly detection—a critical and high-stakes field spanning healthcare, industrial inspection, security, and autonomous systems—has opened new frontiers in both research and real-world applications. This tutorial aims to deliver a comprehensive and timely overview of the role of foundation models in visual anomaly detection. We will cover multiple visual modalities, including 2D images, 3D images, and videos—each presenting unique challenges and necessitating modality-specific solutions. Specifically, we will delve into the entire pipeline, from data (pre-)training and prompt engineering to methodological innovations, inference strategies, and deployment in real-world environments. Key topics include zero- and few-shot learning, pseudo-labeling, anomaly generation, and multi-modal alignment between vision and language. To facilitate a deep and practical understanding of these areas, the tutorial will bring together leading experts from both academia and industry. Through in-depth technical presentations and discussions, participants will gain valuable insights into the latest advances, real-world applications, and open challenges shaping this rapidly evolving field.
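To illustrate the zero-shot, vision-language-alignment idea mentioned above, here is a minimal sketch that scores image patches by comparing their embeddings against "normal" versus "anomalous" text-prompt embeddings; the embeddings below are random stand-ins for features from any CLIP-style model, and the temperature is an arbitrary choice.

```python
# Zero-shot anomaly scoring via vision-language alignment (illustrative sketch).
import numpy as np

def zero_shot_anomaly_map(patch_emb, normal_text_emb, anomalous_text_emb, tau=0.07):
    """patch_emb: (H*W, D); text embeddings: (D,). Returns per-patch anomaly scores."""
    def normalize(v):
        return v / np.linalg.norm(v, axis=-1, keepdims=True)
    p = normalize(patch_emb)
    t = normalize(np.stack([normal_text_emb, anomalous_text_emb]))   # (2, D)
    logits = p @ t.T / tau                                           # cosine similarities
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)                        # softmax over the two prompts
    return probs[:, 1]            # probability mass on the "anomalous" prompt

# Usage with random stand-in features (D = 512):
rng = np.random.default_rng(0)
scores = zero_shot_anomaly_map(rng.standard_normal((14 * 14, 512)),
                               rng.standard_normal(512), rng.standard_normal(512))
```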
Tutorial
Shaohui Liu · Anusha Krishnan · Jakob Engel · Marc Pollefeys
Abstract
Simultaneous localization and mapping (SLAM) is a fundamental technique with applications spanning robotics, spatial AI, and autonomous navigation. It addresses two tightly coupled challenges: localizing the device while incrementally building a coherent map of the surroundings. Localization, or positioning, involves estimating a 6 Degrees-of-Freedom (6-DoF) pose for each image in a continuous sequence, typically aided by other sensor data, while mapping involves constructing an evolving representation of the surrounding environment. This tutorial specifically addresses the task of accurate positioning for large-scale egocentric data using visual-inertial odometry (VIO) and SLAM. It offers a comprehensive overview of the challenges faced by VIO/SLAM methods on egocentric data and introduces a new dataset and benchmark that can serve as a robust testbed for evaluating these systems. With the help of well-positioned speakers, this tutorial explores the new benchmarking approach by analyzing failure cases, identifying limitations, and highlighting open problems in open-source academic VIO/SLAM systems. Additionally, it provides hands-on experience using the dataset and evaluation tools for researchers to get started with their own SLAM evaluations.
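As a small, concrete example of the kind of positioning evaluation such a benchmark involves (a generic sketch, not the tutorial's evaluation tools), absolute trajectory error (ATE) can be computed after rigidly aligning the estimated trajectory to ground truth:

```python
# ATE after rigid (rotation + translation) alignment of corresponding positions.
import numpy as np

def align_rigid(est, gt):
    """Least-squares rotation/translation mapping est onto gt (Umeyama, no scale)."""
    mu_e, mu_g = est.mean(0), gt.mean(0)
    cov = (gt - mu_g).T @ (est - mu_e) / est.shape[0]
    U, _, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U @ Vt) < 0:       # avoid reflections
        S[2, 2] = -1
    R = U @ S @ Vt
    t = mu_g - R @ mu_e
    return R, t

def ate_rmse(est, gt):
    """est, gt: (N, 3) arrays of corresponding positions. Returns RMSE in meters."""
    R, t = align_rigid(est, gt)
    err = (est @ R.T + t) - gt
    return np.sqrt((err ** 2).sum(axis=1).mean())

# Usage: ate_rmse(np.random.rand(100, 3), np.random.rand(100, 3))
```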
Tutorial
Yangguang Li · Angela Dai · Minghao Chen · Zhaoxi Chen
Abstract
In recent years, thanks to the continuous innovation and progress of diffusion technology, significant advancements have been made in image and video generation. By inputting textual descriptions or images, we can generate high-quality images or videos, which greatly enhance creative efficiency and imagination. However, progress in the 3D generation field has been relatively slow. Initially, optimization routes, represented by DreamFusion, were explored. This was followed by the exploration of reconstruction routes, such as LRM. Only later were diffusion-based 3D generation techniques, similar to those in image and video generation, gradually developed. In addition, 3D generation based on autoregressive methods, using a token-by-token prediction scheme similar to LLMs, has gradually made significant progress. Therefore, this tutorial focuses on 3D asset generation using diffusion and autoregression, specifically including: (1) Geometry generation modeling based on the diffusion paradigm; (2) Geometry generation modeling based on the autoregression paradigm; (3) Texture generation modeling based on the diffusion paradigm.
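To give a feel for the autoregressive, token-by-token paradigm in (2), here is a minimal sampling-loop sketch; `model` is a placeholder for any network that maps a token sequence to next-token logits, and the vocabulary, special tokens, and downstream geometry decoder are assumptions rather than a specific published method.

```python
# Token-by-token sampling loop for discrete shape tokens (illustrative sketch).
import torch

@torch.no_grad()
def sample_shape_tokens(model, start_token, max_tokens=1024, temperature=1.0, eos_token=None):
    tokens = [start_token]
    for _ in range(max_tokens):
        logits = model(torch.tensor([tokens]))[0, -1]        # next-token logits: (vocab_size,)
        probs = torch.softmax(logits / temperature, dim=-1)
        nxt = torch.multinomial(probs, 1).item()             # sample one token
        if eos_token is not None and nxt == eos_token:
            break
        tokens.append(nxt)                                   # grow the sequence one token at a time
    return tokens   # discrete tokens, to be decoded into geometry by a separate (e.g., VQ) decoder
```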
Tutorial
Changhoon Kim · Yezhou Yang · Sijia Liu
Abstract
Vision-language generative models, such as text-to-image and image-to-text systems, have rapidly transitioned from research prototypes to widely deployed tools across domains like education, journalism, and design. However, their real-world adoption has introduced critical challenges surrounding robustness, controllability, and ethical risks, including issues like prompt misalignment, unauthorized content generation, adversarial attacks, and data memorization. This tutorial provides a comprehensive overview of these concerns and emerging solutions by covering recent advances and failure modes in state-of-the-art models, robust concept erasure techniques in diffusion models, and adversarial vulnerabilities and defenses in image-to-text systems. Building on these theoretical foundations, participants will examine failure scenarios, explore attack and defense strategies, and gain practical insights into enhancing the trustworthiness of multimodal generative models. Designed for researchers and practitioners in vision, language, and AI safety, this tutorial uniquely focuses on the responsible deployment of these models, bridging technical rigor with societal impact and offering guidance for future research directions in secure and reliable generative AI.
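As one concrete example of the adversarial vulnerabilities discussed above (a generic FGSM-style sketch, not a method from the tutorial), a single gradient-sign step on the input can already shift a differentiable model's output; `model` and `loss_fn` are placeholders for any differentiable vision-language pipeline.

```python
# FGSM-style input perturbation (illustrative sketch).
import torch

def fgsm_perturb(model, loss_fn, image, target, epsilon=2 / 255):
    """image: float tensor in [0, 1]. Returns an adversarially perturbed copy."""
    image = image.clone().detach().requires_grad_(True)
    loss = loss_fn(model(image), target)
    loss.backward()
    adv = image + epsilon * image.grad.sign()     # step in the gradient-sign direction
    return adv.clamp(0, 1).detach()
```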
Tutorial
Aditya Chattopadhyay · Rene Vidal · Jeremias Sulam
Abstract
In recent years, interpretability has emerged as a significant barrier to the widespread adoption of deep learning techniques, particularly in domains where AI decisions can have consequential impacts on human lives, such as healthcare and finance. Recent attempts at interpreting the decisions made by a deep network can be broadly classified into two categories: (i) methods that seek to explain existing models (post-hoc explainability), and (ii) methods that seek to build models that are explainable by design. This tutorial aims to provide a comprehensive overview of both approaches along with a discussion of their limitations. More specifically, the tutorial will consist of three lectures covering the following topics: Post-hoc explainability methods; Explaining deep networks using Shapley values and statistical testing; Explainable-by-design deep networks.
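To ground the Shapley-value lecture for readers new to it, here is a minimal Monte Carlo sketch of Shapley-style feature attribution (our illustration; the baseline, sampling budget, and toy model are arbitrary): each feature's value is its average marginal contribution over random feature orderings, with "absent" features held at a baseline.

```python
# Monte Carlo Shapley-value estimation for feature attribution (illustrative sketch).
import numpy as np

def shapley_attribution(predict, x, baseline, n_permutations=200, seed=0):
    """predict: callable on a (d,) input returning a scalar score."""
    rng = np.random.default_rng(seed)
    d = x.shape[0]
    phi = np.zeros(d)
    for _ in range(n_permutations):
        order = rng.permutation(d)
        current = baseline.copy()
        prev_score = predict(current)
        for j in order:                      # reveal features one by one
            current[j] = x[j]
            score = predict(current)
            phi[j] += score - prev_score     # marginal contribution of feature j
            prev_score = score
    return phi / n_permutations

# Usage with a toy linear model (recovers w * (x - baseline), here [1, -2, 0.5]):
w = np.array([1.0, -2.0, 0.5])
print(shapley_attribution(lambda v: float(w @ v), np.ones(3), np.zeros(3)))
```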
Tutorial
James Fort
Abstract
Project Aria is a research device that is worn like a regular pair of glasses, for researchers to study the future of computer vision with always-on sensing. Sensors in Project Aria capture egocentric video and audio, in addition to eye-gaze, inertial, and location information. On-device compute power is used to encrypt and store information that, when uploaded to separate designated back-end storage, helps researchers build the capabilities necessary for AR to work in the real world. In this fourth tutorial, in addition to sharing research from academic partner program members, we will also provide an introduction to the second generation of Aria glasses, 'Aria Gen 2', announced in February. As part of this introduction, we will provide a live hands-on demo of the Aria Research Kit (including Gen 2 glasses), describe how researchers can gain access to the Project Aria academic program, and demonstrate how open-source tools can be used to accelerate research for specific research challenges, including visual and non-visual localization and mapping, static and dynamic object detection and spatialization, human pose estimation, and building geometry estimation. We will review new open datasets from Meta and academic partners, including a dataset of 6000+ 3D objects with Aria captures for each …
Tutorial
Andrew Westbury · Shoubhik Debnath · Weiyao Wang · Laura Gustafson · Daniel Bolya · Xitong Yang · Kate Saenko · Chaitanya Ryali · Haitham Khedr · Christoph Feichtenhofer
Abstract
In this tutorial, Meta AI and its academic partners will provide an overview of frontier research on visual grounding. We will cover each building block necessary to move toward future general-purpose visual grounding systems, including universal image and video encoding, multimodal language understanding, semantic instance segmentation and tracking, and the latest in 3D reconstruction methods. We will provide practical guidance on using SAM open-source models, resources, and tooling to tackle the field's biggest open research problems. A new suite of SAM systems to be released this year will provide a foundation for our tutorial, offering practical entry points for each course component.
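As a starting point for the hands-on material, the sketch below shows typical prompted-mask prediction with the publicly released segment-anything package; the checkpoint path, image file, and click coordinates are placeholders you would substitute with your own setup.

```python
# Minimal prompt-to-mask sketch with the open-source segment-anything package.
import numpy as np
from PIL import Image
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")   # placeholder checkpoint file
predictor = SamPredictor(sam)

image = np.array(Image.open("example.jpg").convert("RGB"))      # placeholder image
predictor.set_image(image)

# One foreground click (x, y) as the prompt; label 1 marks it as foreground.
masks, scores, _ = predictor.predict(
    point_coords=np.array([[500, 300]]),
    point_labels=np.array([1]),
    multimask_output=True,
)
best_mask = masks[np.argmax(scores)]                            # boolean (H, W) mask
```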
Tutorial
Daniel Barath
Abstract
RANSAC (Random Sample Consensus) has been a cornerstone of robust estimation in computer vision since its introduction in 1981. It remains highly relevant in 2025, as many vision applications still rely on detecting and handling outliers in data. This tutorial, “RANSAC in 2025”, aims to provide a comprehensive update on the latest advancements of RANSAC and its family of algorithms. We will balance theoretical foundations (to understand how and why RANSAC works) with practical applications (to demonstrate its use in real-world vision problems). By covering both classic principles and cutting-edge improvements, the tutorial will equip researchers and practitioners with state-of-the-art techniques for robust model fitting in computer vision.
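For readers who want the classic loop spelled out before diving into the modern variants, here is a minimal vanilla-RANSAC sketch for 2D line fitting: hypothesize from minimal samples, score by inlier count, keep the best. Thresholds and iteration counts are illustrative; real implementations add refinements such as adaptive stopping and local optimization.

```python
# Vanilla RANSAC for 2D line fitting (illustrative sketch).
import numpy as np

def ransac_line(points, n_iters=1000, inlier_thresh=0.01, seed=0):
    """points: (N, 2). Returns line (a, b, c) with a*x + b*y + c = 0 and an inlier mask."""
    rng = np.random.default_rng(seed)
    best_model, best_inliers = None, np.zeros(len(points), dtype=bool)
    for _ in range(n_iters):
        p1, p2 = points[rng.choice(len(points), 2, replace=False)]   # minimal sample
        direction = p2 - p1
        norm = np.linalg.norm(direction)
        if norm < 1e-12:
            continue
        a, b = -direction[1] / norm, direction[0] / norm             # unit normal of the line
        c = -(a * p1[0] + b * p1[1])
        dist = np.abs(points @ np.array([a, b]) + c)                 # point-to-line distances
        inliers = dist < inlier_thresh
        if inliers.sum() > best_inliers.sum():                       # keep the best hypothesis
            best_model, best_inliers = (a, b, c), inliers
    return best_model, best_inliers

# Usage: x = np.linspace(0, 1, 100); pts = np.c_[x, 2 * x + 0.01 * np.random.randn(100)]
# model, mask = ransac_line(np.vstack([pts, np.random.rand(30, 2)]))
```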
Tutorial
Xi Li · Muchao Ye · Manling Li
Abstract
Modern multi-modal learning leverages large models, such as large language models (LLMs), to integrate diverse data sources (e.g., text, images, audio, and video) and enhance understanding and decision-making. However, the inherent complexities of multi-modal learning introduce unique safety challenges that existing frameworks, primarily designed for uni-modal models, fail to address. This tutorial explores the emerging safety risks in multi-modal learning and provides insights into future research directions. We begin by examining the unique characteristics of multi-modal learning -- modality integration, alignment, and fusion. We then review existing safety studies across adversarial attacks, data poisoning, jailbreak exploits, and hallucinations. Next, we analyze emerging safety threats exploiting multi-modal challenges, including risks from additional modalities, modality misalignment, and fused representations. Finally, we discuss potential directions for enhancing the safety of multi-modal learning. As multi-modal learning expands, addressing its safety risks is crucial. This tutorial lays the foundation for understanding these challenges and fostering discussions on trustworthy systems.
Tutorial
Manling Li
Abstract
An embodied agent is a generalist agent that can take natural language instructions from humans and perform a wide range of tasks in diverse environments. Recent years have witnessed the emergence of foundation models, which have shown remarkable success in supporting embodied agents with different abilities such as goal interpretation, subgoal decomposition, action sequencing, and transition modeling (causal transitions from preconditions to post-effects). We categorize the foundation models into Large Language Models (LLMs), Vision-Language Models (VLMs), and Vision-Language-Action Models (VLAs). In this tutorial, we will comprehensively review existing paradigms of foundation models for embodied agents, focus on their different formulations within the fundamental mathematical framework of robot learning, the Markov Decision Process (MDP), and present a structured view for investigating the robot's decision-making process. The tutorial will offer a systematic overview of recent advances in foundation models for embodied agents. We compare these models and explore their design space to guide future developments, focusing on Lower-Level Environment Encoding and Interaction and Longer-Horizon Decision Making.
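To make the MDP framing concrete, below is a minimal, self-contained sketch of the ingredients (states, actions, transition model, reward) and a rollout loop; all names and the toy task are our illustrative assumptions, not part of any specific foundation model.

```python
# Minimal MDP and rollout loop (illustrative sketch of the framing, not a method).
import random
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class MDP:
    states: List[str]
    actions: List[str]
    transition: Callable[[str, str], str]      # (s, a) -> next state (preconditions to effects)
    reward: Callable[[str, str, str], float]   # (s, a, s') -> scalar reward

def rollout(mdp: MDP, policy: Callable[[str], str], start: str, horizon: int = 10) -> float:
    s, total = start, 0.0
    for _ in range(horizon):
        a = policy(s)                          # e.g., an LLM/VLM-backed action sequencer
        s_next = mdp.transition(s, a)
        total += mdp.reward(s, a, s_next)
        s = s_next
    return total

# Usage: a trivial two-state task where "act" moves the agent to the goal.
toy = MDP(["start", "goal"], ["wait", "act"],
          transition=lambda s, a: "goal" if a == "act" else s,
          reward=lambda s, a, s2: 1.0 if s2 == "goal" and s != "goal" else 0.0)
print(rollout(toy, policy=lambda s: random.choice(["wait", "act"]), start="start"))
```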