Poster
Robin3D: Improving 3D Large Language Model via Robust Instruction Tuning
Weitai Kang · Haifeng Huang · Yuzhang Shang · Mubarak Shah · Yan Yan
Recent advancements in 3D Large Language Models (3DLLMs) have shown their potential to build general-purpose agents in the 3D real world, yet challenges remain due to the lack of high-quality, robust instruction-following data, which limits the discriminative power and generalization of 3DLLMs. In this paper, we introduce Robin3D, a powerful 3DLLM trained on large-scale instruction-following data generated by our novel data engine, the Robust Instruction Generation (RIG) engine. RIG generates two key types of instruction data: 1) Adversarial Instruction-following data, which mixes negative and positive samples to enhance the model's discriminative understanding, and 2) Diverse Instruction-following data, which contains varied instruction styles to enhance the model's generalization. In total, we construct 1 million instruction-following samples, consisting of 344K Adversarial samples, 508K Diverse samples, and 165K benchmark training-set samples. To better handle these complex instructions, Robin3D further integrates an improved vision projector and enhanced sequence organization. Notably, we achieve a 7.8% improvement on the grounding task (Multi3DRefer) and a 6.9% improvement on the captioning task (Scan2Cap).
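The abstract does not describe the RIG engine's implementation, but the idea of adversarial instruction data, pairing a query with its true target plus hard negative distractors so the model must discriminate, can be illustrated with a minimal sketch. The schema, field names, and `build_adversarial_sample` function below are hypothetical assumptions for illustration, not the authors' actual pipeline.

```python
# Hypothetical sketch: assemble an "adversarial" instruction-following sample
# by mixing the positive target object with same-category negatives from the
# same scene. All field names and the <OBJ...> token format are illustrative.
import random

def build_adversarial_sample(scene_objects, target_id, query):
    """Mix the positive target with distractor (negative) objects of the
    same category, then ask the model to identify the correct one."""
    target = next(o for o in scene_objects if o["id"] == target_id)
    negatives = [o for o in scene_objects
                 if o["category"] == target["category"] and o["id"] != target_id]
    candidates = random.sample(negatives, k=min(3, len(negatives))) + [target]
    random.shuffle(candidates)
    return {
        "instruction": f"{query} Choose the matching object from: "
                       + ", ".join(f"<OBJ{o['id']}>" for o in candidates),
        "answer": f"<OBJ{target['id']}>",
    }

# Toy usage with a few scene objects
scene = [
    {"id": 1, "category": "chair", "desc": "red chair near the window"},
    {"id": 2, "category": "chair", "desc": "black office chair"},
    {"id": 3, "category": "table", "desc": "wooden dining table"},
]
print(build_adversarial_sample(scene, target_id=1,
                               query="Find the red chair near the window."))
```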