

Tutorial

Towards Comprehensive Reasoning in Vision-Language Models

Yujun Cai · Yiwei Wang · Kai-Wei Chang · Junsong Yuan · Ziwei Liu · Chi Zhang · Jun Liu · Ming-Hsuan Yang

Sun 19 Oct 11 a.m. PDT — 3 p.m. PDT

Abstract:

Vision-Language Models (VLMs) have achieved remarkable progress in image captioning and visual question answering, yet developing genuine reasoning capabilities remains an open challenge. In contrast to recent reasoning-focused LLMs, many VLMs still rely primarily on pattern recognition and struggle with compositional logic. This tutorial provides a comprehensive overview of reasoning capabilities in VLMs, focusing on the transition from basic perception to complex inference. We will explore reasoning-oriented prompting and training techniques in multimodal contexts, reasoning-focused benchmarks, and architectural innovations for visual-textual fusion. Through lectures and hands-on demonstrations, participants will gain insights into current capabilities, persistent challenges in compositional generalization and explainability, and practical guidance for implementing reasoning mechanisms. This tutorial uniquely bridges advances in LLM reasoning with the visual domain, addressing the distinct challenges of spatial information processing and providing a roadmap toward more cognitively capable vision-language systems.
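To give a concrete flavor of the reasoning-oriented prompting the abstract mentions, here is a minimal sketch of a chain-of-thought style prompt for a visual question. The OpenAI-style multimodal message schema, the example URL, and the `build_cot_vlm_prompt` helper are illustrative assumptions, not material from the tutorial itself.

```python
# Illustrative sketch: wrapping a visual question in a chain-of-thought
# style instruction. The message format below follows a common
# OpenAI-style multimodal chat schema (an assumption, not the tutorial's
# own API); no model call is made here.

def build_cot_vlm_prompt(image_url: str, question: str) -> list[dict]:
    """Build messages that ask a VLM to reason step by step before answering."""
    system = (
        "You are a careful visual reasoner. First describe the relevant "
        "objects and their spatial relations, then reason step by step, "
        "and only then state the final answer."
    )
    return [
        {"role": "system", "content": system},
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text", "text": f"{question}\nLet's think step by step."},
            ],
        },
    ]

# Hypothetical usage with a placeholder image URL.
messages = build_cot_vlm_prompt(
    "https://example.com/scene.jpg",
    "Is the mug to the left of the laptop?",
)
```

The key idea, common to much of the reasoning-prompting literature, is that the instruction forces an intermediate trace (objects, spatial relations, steps) rather than a direct answer, which is where spatial and compositional reasoning tends to fail silently.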
