Poster
WSI-LLaVA: A Multimodal Large Language Model for Whole Slide Image
Yuci Liang · Xinheng Lyu · Meidan Ding · Wenting Chen · Xiaohan Xing · Jipeng Zhang · Sen Yang · Xiangjian He · Song Wu · Xiyue Wang · Linlin Shen
Recent advances in computational pathology have introduced whole slide image (WSI)-level multimodal large language models (MLLMs) for automated pathological analysis. However, current WSI-level MLLMs face two critical challenges: limited explainability in their decision-making process and insufficient attention to the morphological features crucial for accurate diagnosis. To address these challenges, we first introduce WSI-Bench, a large-scale morphology-aware benchmark containing 180k VQA pairs from 9,850 WSIs across 30 cancer types, specifically designed to evaluate MLLMs' understanding of the morphological characteristics essential for accurate diagnosis. To the best of our knowledge, WSI-Bench is the first benchmark to systematically evaluate morphological understanding capabilities in WSI analysis. To enhance model explainability, we present WSI-LLaVA, an MLLM framework for gigapixel WSI understanding trained with a three-stage strategy, which provides detailed morphological findings to explain its final answers. For more precise model assessment in pathological contexts, we develop two specialized WSI metrics: WSI-Precision and WSI-Relevance. Extensive evaluation on WSI-Bench reveals both the capabilities and limitations of current WSI MLLMs in morphological analysis and various pathology tasks, while demonstrating WSI-LLaVA's superior performance across all capabilities.