

Poster

Chimera: Improving Generalist Model with Domain-Specific Experts

Tianshuo Peng · Mingsheng Li · Jiakang Yuan · Hongbin Zhou · Renqiu Xia · Renrui Zhang · LEI BAI · Song Mao · Bin Wang · Aojun Zhou · Botian Shi · Tao Chen · Bo Zhang · Xiangyu Yue


Abstract:

Large Multi-modal Models (LMMs), trained on web-scale datasets predominantly composed of natural images, have demonstrated remarkable performance on general tasks. However, these models often exhibit limited capabilities on domain-specific tasks that require extensive domain prior knowledge. An intuitive solution is to post-train LMMs on a specific domain, but this often suffers from the labor-intensive annotation process and the inaccessibility of private training data. Directly integrating expert models tailored for those tasks is also challenging due to representational gaps and imbalanced optimization. To address these challenges, we introduce Chimera, a scalable and low-cost multi-modal pipeline designed to boost the capabilities of existing LMMs with domain-specific experts. Specifically, we design a progressive training strategy to integrate features from expert models into the input of a generalist LMM. To address the imbalanced optimization caused by the well-aligned general visual encoder, we introduce a novel Generalist-Specialist Collaboration Masking (GSCM) mechanism. The result is a versatile model that excels across the chart, table, math, and document domains, achieving state-of-the-art performance on multi-modal reasoning and visual content extraction tasks, both of which remain challenging for existing LMMs. We will release the model weights, along with the data used for training and evaluation, to facilitate future research on LMMs.
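To make the integration idea concrete, below is a minimal PyTorch sketch of the mechanism the abstract describes: projecting features from a frozen domain expert into the LLM embedding space alongside the generalist visual tokens, with an optional random mask over the generalist tokens standing in for the GSCM idea. All class, dimension, and parameter names here are illustrative assumptions, not the paper's actual implementation, which may differ substantially.

```python
import torch
import torch.nn as nn

class ExpertFeatureFusion(nn.Module):
    """Minimal sketch of fusing domain-expert features with generalist visual
    tokens before they enter the LLM. Names and dimensions are hypothetical."""

    def __init__(self, expert_dim: int, general_dim: int, llm_dim: int):
        super().__init__()
        # Separate projectors, since expert and generalist encoders
        # typically produce features of different widths.
        self.expert_proj = nn.Linear(expert_dim, llm_dim)
        self.general_proj = nn.Linear(general_dim, llm_dim)

    def forward(self, expert_feats, general_feats, mask_prob: float = 0.0):
        # expert_feats:  (B, N_e, expert_dim), e.g. from a chart/table/math expert
        # general_feats: (B, N_g, general_dim), from the generalist visual encoder
        expert_tokens = self.expert_proj(expert_feats)
        general_tokens = self.general_proj(general_feats)
        if self.training and mask_prob > 0:
            # Hypothetical stand-in for the GSCM idea: randomly mask generalist
            # tokens during training so optimization does not collapse onto the
            # already well-aligned general encoder branch.
            keep = torch.rand(general_tokens.shape[:2], device=general_tokens.device) > mask_prob
            general_tokens = general_tokens * keep.unsqueeze(-1).float()
        # The concatenated visual tokens are fed to the LLM with the text embeddings.
        return torch.cat([general_tokens, expert_tokens], dim=1)

# Usage with made-up sizes: a 768-d expert, a 1024-d generalist encoder,
# and an LLM hidden size of 4096.
fusion = ExpertFeatureFusion(expert_dim=768, general_dim=1024, llm_dim=4096)
tokens = fusion(torch.randn(2, 256, 768), torch.randn(2, 576, 1024), mask_prob=0.3)
print(tokens.shape)  # torch.Size([2, 832, 4096])
```

In this reading, masking only the generalist branch rebalances gradient flow toward the expert projectors; the paper's progressive training strategy and the exact form of GSCM are not specified in the abstract and are not reproduced here.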
