Tutorial 306 B

Foundation Models Meet Embodied Agents

Manling Li ⋅ Yunzhu Li ⋅ Jiayuan Mao ⋅ Wenlong Huang

2025 Tutorial

Abstract

An embodied agent is a generalist agent that can take natural language instructions from humans and perform a wide range of tasks in diverse environments. Recent years have witnessed the emergence of foundation models, which have shown remarkable success in supporting embodied agents across abilities such as goal interpretation, subgoal decomposition, action sequencing, and transition modeling (causal transitions from preconditions to post-effects). We categorize these foundation models into Large Language Models (LLMs), Vision-Language Models (VLMs), and Vision-Language-Action Models (VLAs). In this tutorial, we will present a systematic overview of recent advances in foundation models for embodied agents. We organize existing paradigms by how they are formulated within the Markov Decision Process (MDP), the fundamental mathematical framework of robot learning, and use this formulation to build a structured view of the robot’s decision-making process. We compare these models and explore their design space to guide future developments, focusing on Lower-Level Environment Encoding and Interaction and Longer-Horizon Decision Making.
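To make the terms "transition modeling" and "action sequencing" concrete, here is a minimal sketch and not the tutorial's own method. It assumes a deterministic, reward-free simplification of an MDP in which states are sets of symbolic facts. The `Action` class, the fridge example, and every fact name are hypothetical and chosen only for illustration. Here a breadth-first search stands in for the action-sequencing role; in a foundation-model pipeline an LLM, VLM, or VLA might fill that role instead.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """Hypothetical symbolic action: preconditions and post-effects over facts."""
    name: str
    preconditions: frozenset  # facts that must hold before the action
    add_effects: frozenset    # facts the action makes true
    del_effects: frozenset    # facts the action makes false


def transition(state, action):
    """Transition model: return the next state, or None if preconditions fail."""
    if not action.preconditions <= state:
        return None
    return (state - action.del_effects) | action.add_effects


def plan(initial, goal, actions):
    """Action sequencing: breadth-first search for a shortest plan reaching the goal."""
    frontier = deque([(initial, [])])
    visited = {initial}
    while frontier:
        state, path = frontier.popleft()
        if goal <= state:  # every goal fact holds in this state
            return path
        for action in actions:
            nxt = transition(state, action)
            if nxt is not None and nxt not in visited:
                visited.add(nxt)
                frontier.append((nxt, path + [action.name]))
    return None  # no action sequence reaches the goal


# Hypothetical toy domain (not from the tutorial).
ACTIONS = [
    Action("open_fridge", frozenset({"fridge_closed"}),
           frozenset({"fridge_open"}), frozenset({"fridge_closed"})),
    Action("grab_milk", frozenset({"fridge_open", "hand_empty"}),
           frozenset({"holding_milk"}), frozenset({"hand_empty"})),
    Action("close_fridge", frozenset({"fridge_open"}),
           frozenset({"fridge_closed"}), frozenset({"fridge_open"})),
]

# Goal interpretation would map an instruction such as "bring me the milk"
# to goal facts; here they are written by hand.
initial_state = frozenset({"fridge_closed", "hand_empty"})
goal_facts = frozenset({"holding_milk", "fridge_closed"})

action_sequence = plan(initial_state, goal_facts, ACTIONS)
print(action_sequence)
```

In this toy setting the goal facts play the role of goal interpretation, `transition` plays the role of transition modeling, and `plan` plays the role of action sequencing. A full MDP would add stochastic transitions and a reward function, which this sketch deliberately omits.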
