UrbanLLaVA: A Multi-modal Large Language Model for Urban Intelligence
Abstract
Urban research involves a wide range of scenarios and tasks that require the understanding of multi-modal data, such as structured geospatial data, trajectory data, satellite imagery, and street view imagery. Current methods often focus on specific data types and lack a unified framework for processing them comprehensively in the urban field. The recent success of multi-modal large language models (MLLMs) presents a promising opportunity to overcome this limitation. In this paper, we introduce UrbanLLaVA, a multi-modal large language model designed to process these four types of data simultaneously and achieve strong performance across diverse urban tasks compared with general MLLMs. In UrbanLLaVA, we first curate a diverse urban instruction dataset encompassing both single-modal and cross-modal urban data, spanning from the local view to the global view of the urban environment. Additionally, we design an effective multi-stage training pipeline to ensure training stability and compatibility across various urban tasks. We also extend existing benchmarks for urban research to assess the performance of MLLMs across a wide range of urban tasks. Experimental results from three cities demonstrate that UrbanLLaVA outperforms open-source and commercial MLLMs on both single-modal tasks and complex cross-modal tasks, and shows robust generalization across cities. UrbanLLaVA sheds light on building a unified foundation model with powerful perception and reasoning abilities for general urban intelligence.