Poster
X-Fusion: Introducing New Modality to Frozen Large Language Models
Sicheng Mo · Thao Nguyen · Xun Huang · Siddharth Iyer · Yijun Li · Yuchen Liu · Abhishek Tandon · Eli Shechtman · Krishna Kumar Singh · Yong Jae Lee · Bolei Zhou · Yuheng Li
We propose X-Fusion, a framework that extends pretrained Large Language Models (LLMs) to multimodal tasks while preserving their language capabilities. X-Fusion employs a dual-tower design with modality-specific weights, keeping the LLM's parameters frozen while integrating vision-specific information for both understanding and generation. We find that incorporating understanding-focused data improves generation quality, that reducing noise in the image data enhances overall performance, and that feature alignment accelerates convergence for smaller models but has minimal impact on larger ones. Our findings provide valuable insights into building efficient unified multimodal models.
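Below is a minimal, illustrative sketch of how a dual-tower layer with a frozen language tower and a trainable vision tower could be wired up. It is not the authors' implementation: the module and argument names (DualTowerLayer, is_image) are hypothetical, and the routing of per-token outputs by modality is an assumption based on the description above.

```python
# Illustrative sketch only, not the X-Fusion code. Assumes each layer pairs a
# frozen language block with a trainable vision copy and selects per-token
# outputs by modality; names (DualTowerLayer, is_image) are hypothetical.
import copy
import torch
import torch.nn as nn


class DualTowerLayer(nn.Module):
    def __init__(self, language_block: nn.Module):
        super().__init__()
        # Language tower: pretrained LLM block, kept frozen.
        self.language_block = language_block
        for p in self.language_block.parameters():
            p.requires_grad = False
        # Vision tower: trainable copy with the same architecture.
        self.vision_block = copy.deepcopy(language_block)
        for p in self.vision_block.parameters():
            p.requires_grad = True

    def forward(self, hidden: torch.Tensor, is_image: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq, dim) mixed text/image token features
        # is_image: (batch, seq) boolean mask marking image tokens
        text_out = self.language_block(hidden)   # frozen weights
        image_out = self.vision_block(hidden)    # trainable weights
        # Keep the language-tower output for text tokens and the
        # vision-tower output for image tokens.
        return torch.where(is_image.unsqueeze(-1), image_out, text_out)


# Usage with a stand-in transformer block:
block = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
layer = DualTowerLayer(block)
tokens = torch.randn(2, 16, 512)
is_image = torch.zeros(2, 16, dtype=torch.bool)
is_image[:, 8:] = True            # treat the last 8 positions as image tokens
out = layer(tokens, is_image)     # (2, 16, 512)
```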