Poster
UrbanLLaVA: A Multi-modal Large Language Model for Urban Intelligence
Jie Feng · Shengyuan Wang · Tianhui Liu · Yanxin Xi · Yong Li
Urban research involves a wide range of scenarios and tasks that require the understanding of multi-modal data, such as structured geospatial data, trajectory data, satellite image data, and street view image data. Current methods often focus on specific data types and lack a unified framework in the urban field for processing them comprehensively. The recent success of multi-modal large language models (MLLMs) presents a promising opportunity to overcome this limitation. In this paper, we introduce UrbanLLaVA, a multi-modal large language model designed to process these four types of data simultaneously and achieve strong performance across diverse urban tasks compared with general MLLMs. In UrbanLLaVA, we first curate a diverse urban instruction dataset encompassing both single-modal and cross-modal urban data, spanning from the location view to the global view of the urban environment. Additionally, we design an effective multi-stage training pipeline to ensure training stability and compatibility across various urban tasks. We also extend existing benchmarks for urban research to assess the performance of MLLMs across a wide range of urban tasks. Experimental results from three cities demonstrate that UrbanLLaVA outperforms open-source and commercial MLLMs on both single-modal tasks and complex cross-modal tasks, and shows robust generalization abilities across cities. UrbanLLaVA sheds light on building a unified foundation model with powerful perception and reasoning abilities for general urban intelligence.
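As a concrete illustration only, the minimal sketch below shows one plausible way to represent mixed single- and cross-modal urban instruction samples and a staged training schedule of the kind the abstract describes. All class names, field names, modality labels, and hyperparameter values here are hypothetical assumptions, not the authors' released data format or pipeline.

```python
# Hypothetical sketch (not the paper's actual schema or pipeline): a mixed
# single-/cross-modal urban instruction sample plus a multi-stage schedule.
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class UrbanInstructionSample:
    """One instruction-tuning example over urban data."""
    instruction: str                           # natural-language task prompt
    answer: str                                # target response
    modalities: List[str]                      # e.g. ["satellite", "street_view"]
    satellite_image: Optional[str] = None      # path to a satellite tile, if used
    street_view_image: Optional[str] = None    # path to a street view photo, if used
    trajectory: Optional[List[Tuple[float, float, float]]] = None  # (lon, lat, t)
    geo_entities: List[dict] = field(default_factory=list)         # structured POIs/roads


@dataclass
class TrainingStage:
    """One stage of a multi-stage instruction-tuning pipeline."""
    name: str
    modality_filter: List[str]   # which modality signatures this stage trains on
    epochs: int
    learning_rate: float


# Illustrative schedule: single-modal alignment first, then cross-modal tasks.
PIPELINE = [
    TrainingStage("single_modal",
                  ["geospatial", "trajectory", "satellite", "street_view"],
                  epochs=1, learning_rate=2e-5),
    TrainingStage("cross_modal",
                  ["satellite+street_view", "geospatial+trajectory"],
                  epochs=1, learning_rate=1e-5),
]


def select_stage_data(samples: List[UrbanInstructionSample],
                      stage: TrainingStage) -> List[UrbanInstructionSample]:
    """Keep samples whose modality signature matches the stage's filter."""
    keep = set(stage.modality_filter)
    return [s for s in samples if "+".join(sorted(s.modalities)) in keep]
```

In this sketch, each stage sees only the samples whose modality signature it targets, which is one simple way to keep single-modal alignment and cross-modal reasoning data from interfering during training.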