DexVLG: Dexterous Vision-Language-Grasp Model at Scale
Abstract
As large models gain momentum, vision-language foundation models are enabling robots to perform an ever-wider range of tasks in a generalizable way. However, due to the difficulty of data collection, these benefits remain largely limited to simple embodiments. In this paper, we present \textbf{DexVLG}, a vision-language model that predicts dexterous grasp poses aligned with language instructions from single-view RGBD perception. To achieve this, we first synthesize a dataset of 170M dexterous grasp poses aligned with semantic parts of 174k objects in simulation, paired with informative part-level captions. With this large-scale dataset, named \textbf{DexGraspNet 3.0}, we train a flow-matching VLM to generate instruction-aligned grasp poses on tabletop objects. To evaluate DexVLG, we curate benchmarks in physics-based simulation and conduct real-world experiments. Extensive experiments demonstrate DexVLG's strong zero-shot generalizability, achieving a zero-shot execution success rate of over 76\% and state-of-the-art part-grasp accuracy in simulation, as well as successful part-aligned grasps on real-world objects.