Poster
OURO: A Self-Bootstrapped Framework for Enhancing Multimodal Scene Understanding
Tianrun Xu · Guanyu Chen · Ye Li · Xi Yuxin · Zeyu Mu · Ruichen Wang · Tianren Zhang · Haichuan Gao · Feng Chen
Multimodal large models have made significant progress, yet fine-grained understanding of complex scenes remains a challenge. High-quality, large-scale vision-language datasets are essential for addressing this issue. However, existing approaches often rely on labor-intensive manual annotation or on the strongest closed-source models, making large-scale data collection costly. To overcome these limitations, we propose a self-bootstrapped training pipeline that leverages the model's own multimodal capabilities to recursively refine its understanding. By decomposing existing multimodal data into localized sub-regions and generating hierarchical scene descriptions and multi-faceted question-answer pairs, we construct a 1.4M-image dataset. We then use this dataset to train the base model, significantly enhancing its ability to interpret complex visual scenes and perform a range of vision-related tasks. Our OURO model, fine-tuned from Qwen2-VL-7B-Instruct with LoRA, achieves substantial improvements over both the base model and similarly sized counterparts across multiple multimodal benchmarks. These results demonstrate the effectiveness of our method in advancing scene understanding and multimodal reasoning, and our self-bootstrapped training pipeline offers a new paradigm for the continuous improvement of multimodal models. Code and datasets will be released upon acceptance.
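The abstract describes the pipeline only at a high level; the sketch below illustrates one plausible shape of the self-bootstrapping step, under stated assumptions. The helpers `propose_regions` and `vlm_generate` are hypothetical stand-ins for the paper's region decomposition and model-inference code (e.g. a wrapper around Qwen2-VL-7B-Instruct), and the prompts, facets, and the commented LoRA settings at the end are assumptions, not the authors' released configuration.

```python
# Minimal sketch of a self-bootstrapped data-generation loop, assuming
# hypothetical helpers:
#   propose_regions(image_path) -> list of bounding boxes (localized sub-regions)
#   vlm_generate(image_path, box_or_None, prompt) -> str  (crops and queries the base VLM)
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixel coordinates


@dataclass
class SceneSample:
    image_path: str
    global_caption: str = ""
    region_captions: List[Tuple[Box, str]] = field(default_factory=list)
    qa_pairs: List[Tuple[str, str]] = field(default_factory=list)


def bootstrap_sample(
    image_path: str,
    propose_regions: Callable[[str], List[Box]],
    vlm_generate: Callable[[str, Optional[Box], str], str],
) -> SceneSample:
    """Build hierarchical annotations for one image from the base model's own outputs."""
    sample = SceneSample(image_path=image_path)

    # 1) Scene-level description of the full image.
    sample.global_caption = vlm_generate(image_path, None, "Describe this scene in detail.")

    # 2) Localized descriptions for each proposed sub-region.
    for box in propose_regions(image_path):
        caption = vlm_generate(image_path, box, "Describe the content of this region.")
        sample.region_captions.append((box, caption))

    # 3) Multi-faceted question-answer pairs generated and answered by the same model.
    for facet in ("objects and attributes", "spatial relations", "activities"):
        question = vlm_generate(image_path, None, f"Ask one question about {facet} in this image.")
        answer = vlm_generate(image_path, None, question)
        sample.qa_pairs.append((question, answer))

    return sample


# Hypothetical LoRA setup for the subsequent fine-tuning stage (rank and target
# modules are assumptions, not the paper's reported hyperparameters):
#   from peft import LoraConfig, get_peft_model
#   lora_cfg = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"])
#   model = get_peft_model(base_model, lora_cfg)
```

Samples produced this way would be serialized into the instruction-tuning format of the base model and added to the fine-tuning set, which is the "recursive refinement" the abstract refers to.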