Towards Safe Multi-Modal Learning: Unique Challenges and Future Directions
Abstract
Modern multi-modal learning leverages large models, such as large language models (LLMs), to integrate diverse data sources (e.g., text, images, audio, and video) and enhance understanding and decision-making. However, the inherent complexities of multi-modal learning introduce unique safety challenges that existing frameworks, primarily designed for uni-modal models, fail to address. This tutorial explores the emerging safety risks in multi-modal learning and provides insights into future research directions. We begin by examining the unique characteristics of multi-modal learning -- modality integration, alignment, and fusion. We then review existing safety studies on adversarial attacks, data poisoning, jailbreak exploits, and hallucinations. Next, we analyze emerging safety threats that exploit these multi-modal characteristics, including risks from additional modalities, modality misalignment, and fused representations. Finally, we discuss potential directions for enhancing the safety of multi-modal learning. As multi-modal learning expands, addressing its safety risks becomes crucial. This tutorial lays the foundation for understanding these challenges and fostering discussions on trustworthy multi-modal systems.