

Poster

RMultiplex200K: Toward Reliable Multimodal Process Supervision for Visual Language Models on Telecommunications

Sijia Chen · Bin Song


Abstract:

Visual Language Models (VLMs) have achieved remarkable success in many domains thanks to their ability to perform step-by-step reasoning. However, progress in the telecommunications (Telecom) domain remains limited, primarily due to the lack of high-quality datasets and domain-specific insights. In this paper, we introduce RMultiplex200K, a multimodal dataset that provides step-wise reasoning rationales and correctness scores for real-world Telecom questions. It enables VLMs to perform step-level reasoning and verification with multimodal information, thereby facilitating reliable problem solving. RMultiplex200K is highly scalable because it is constructed without human annotation, relying instead on our automatic plan-based annotation (ApPA) method, which synthesizes reasoning steps labeled with reward scores. With this dataset, we introduce TC-NAVIGATOR, a mechanism for training multimodal process reward models that serve as reliable reasoning verifiers for VLMs. For example, Qwen-2-VL-72B and Llama-3.2-90B, which initially achieve only 21.3% and 19.8% accuracy on practice Telecom questions, reach 48.5% and 46.1%, respectively, after training with RMultiplex200K and verification with TC-NAVIGATOR.
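To illustrate how a process reward model can act as a step-level verifier at inference time, the sketch below shows a generic best-of-N selection loop in Python. This is a minimal illustration under stated assumptions, not the paper's actual implementation: the page does not describe TC-NAVIGATOR's interface, so `score_step` is a hypothetical stand-in for a trained multimodal PRM, and the aggregation and example data are ours.

```python
# Hypothetical sketch: using a process reward model (PRM) as a step-level
# verifier to pick the best of N candidate reasoning chains.
# score_step() is a placeholder for a real trained verifier such as
# TC-NAVIGATOR; its actual interface is an assumption, not the paper's API.

from typing import List


def score_step(question: str, prefix: List[str], step: str) -> float:
    """Placeholder PRM call: rate the correctness of `step` in [0, 1]
    given the question and the reasoning prefix. Replace with a real model."""
    # Toy heuristic so the sketch runs end to end: reward lexical overlap
    # between the step and the question. A real PRM would use learned scores.
    overlap = len(set(step.lower().split()) & set(question.lower().split()))
    return min(1.0, overlap / 5)


def chain_score(question: str, steps: List[str]) -> float:
    """Aggregate per-step scores; taking the minimum penalizes a chain
    for its single weakest step, a common choice for PRM aggregation."""
    scores = [score_step(question, steps[:i], s) for i, s in enumerate(steps)]
    return min(scores) if scores else 0.0


def best_of_n(question: str, candidates: List[List[str]]) -> List[str]:
    """Return the candidate reasoning chain the verifier trusts most."""
    return max(candidates, key=lambda chain: chain_score(question, chain))


if __name__ == "__main__":
    q = "Which multiplexing scheme assigns each user a distinct frequency band?"
    chains = [
        ["TDMA divides the channel into time slots.", "So the answer is TDMA."],
        ["FDMA assigns each user a distinct frequency band.", "So the answer is FDMA."],
    ]
    print("\n".join(best_of_n(q, chains)))
```

In this setup the VLM would sample N candidate chains and the PRM would rerank them; only the scoring function needs to change to plug in a trained multimodal verifier.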
