Multi-Modal Reasoning for Agentic Intelligence
Zhenfei Yin, Naji Khosravan, Tao Ji, Yin Wang, Roozbeh Mottaghi, Iro Armeni, Zhuqiang Lu, Annie S. Chen, Yufang Liu, Zixian Ma, Mahtab Bigverdi, Amita Kamath, Chen Feng, Lei Bai, Gordon Wetzstein, Philip Torr
Abstract
AI agents powered by Large Language Models (LLMs) have shown strong reasoning abilities across tasks such as coding and research. With the rise of Multimodal Foundation Models (MFMs), agents can now integrate visual, textual, and auditory inputs for richer perception and decision-making. This workshop explores the development of Multimodal AI Agents across four categories: Digital, Virtual, Wearable, and Physical. We will discuss their applications in science, robotics, and human-computer interaction, as well as key challenges in cross-modal integration, real-time responsiveness, and interpretability. The goal is to advance robust, context-aware agents for complex, real-world environments.