Metsch Jacqueline Michelle, Hauschild Anne-Christin
Institute for Medical Informatics, University Medical Center Göttingen, Germany.
Institute for Medical Informatics, University Medical Center Göttingen, Germany; Institute for Predictive Deep Learning in Medicine and Healthcare, Justus-Liebig University, Gießen, Germany.
Comput Biol Med. 2025 Jun;191:110124. doi: 10.1016/j.compbiomed.2025.110124. Epub 2025 Apr 15.
The increasing digitalization of multi-modal data in medicine and the emergence of novel artificial intelligence (AI) algorithms open up numerous opportunities for predictive models. In particular, deep learning models have shown strong performance in the medical field. A major limitation of such powerful but complex models originates from their 'black-box' nature. Recently, a variety of explainable AI (XAI) methods have been introduced to address this lack of transparency and trust in medical AI. However, most such methods have been evaluated only on single data modalities. Meanwhile, as the number of XAI methods grows, integrative XAI frameworks and benchmarks become essential for comparing their performance across different tasks. For that reason, we developed BenchXAI, a novel XAI benchmarking package supporting comprehensive evaluation of fifteen XAI methods, investigating their robustness, suitability, and limitations on biomedical data. We employed BenchXAI to validate these methods on three common types of biomedical data: clinical data, medical image and signal data, and biomolecular data. Our newly designed sample-wise normalization approach for post-hoc XAI methods enables the statistical evaluation and visualization of performance and robustness. We found that the XAI methods Integrated Gradients, DeepLift, DeepLiftShap, and GradientShap performed well across all three tasks, while methods such as Deconvolution, Guided Backpropagation, and LRP-α1-β0 struggled on some tasks. With regulations such as the EU AI Act, the application of XAI in the biomedical domain is becoming increasingly essential. Our evaluation study represents a first step towards verifying the suitability of different XAI methods for various medical domains.
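To make the benchmarking idea concrete, the following is a minimal sketch of how per-sample attributions from a post-hoc XAI method might be computed and normalized sample-wise so that they become comparable across samples and methods. It uses Captum's IntegratedGradients (one of the methods named in the abstract); the toy model, the data, and the per-sample L1 normalization scheme are assumptions for illustration only, and the paper's exact normalization may differ. This is not the BenchXAI API.

```python
import torch
from captum.attr import IntegratedGradients

# Assumed normalization scheme (per-sample L1 scaling); the paper's
# sample-wise normalization may be defined differently.
def sample_wise_normalize(attributions: torch.Tensor) -> torch.Tensor:
    """Scale each sample's attribution vector to unit L1 norm."""
    flat = attributions.flatten(start_dim=1)                 # (n_samples, n_features)
    norms = flat.abs().sum(dim=1, keepdim=True).clamp(min=1e-12)
    return (flat / norms).reshape(attributions.shape)

# Toy stand-in for a trained biomedical classifier and a small batch.
model = torch.nn.Sequential(torch.nn.Linear(10, 2))
model.eval()
inputs = torch.randn(4, 10, requires_grad=True)

ig = IntegratedGradients(model)
attr = ig.attribute(inputs, target=0)                        # raw attributions
attr_norm = sample_wise_normalize(attr)                      # comparable across samples
print(attr_norm.abs().flatten(start_dim=1).sum(dim=1))      # each row sums to 1
```

With such a normalization, attribution distributions from different samples (or different methods) live on a common scale, which is what makes the kind of statistical evaluation and visualization of robustness described above feasible.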