Domínguez-Rodríguez Sara, Liz-López Helena, Panizo-LLedot Angel, Ballesteros Álvaro, Dagan Ron, Greenberg David, Gutiérrez Lourdes, Rojo Pablo, Otheo Enrique, Galán Juan Carlos, Villanueva Sara, García Sonsoles, Mosquera Pablo, Tagarro Alfredo, Moraleda Cinta, Camacho David
Pediatric Research and Clinical Trials Unit (UPIC). Instituto de Investigación Sanitaria Hospital 12 de Octubre (imas12), Fundación para la Investigación Biomédica del Hospital 12 de Octubre, Madrid, Spain.
Computer Systems Engineering Department, Universidad Politécnica de Madrid, Spain.
Comput Methods Programs Biomed. 2023 Dec;242:107765. doi: 10.1016/j.cmpb.2023.107765. Epub 2023 Sep 9.
Community-acquired pneumonia (CAP) is a common childhood infectious disease. Deep learning models show promise in chest X-ray interpretation and diagnosis, but their validation should be extended because of limitations in the current validation workflow. To extend the standard validation workflow, we propose conducting a pilot test with the following characteristics. First, the assumption of a perfect ground truth (100% sensitive and specific) is unrealistic, as high intra- and inter-observer variability has been reported. To address this, we propose using Bayesian latent class models (BLCA) to estimate accuracy during the pilot. Additionally, assessing only the performance of a model, without considering its applicability and acceptance by physicians, is insufficient if AI systems are to be integrated into day-to-day clinical practice. We therefore propose employing explainable artificial intelligence (XAI) methods during the pilot test to involve physicians, to evaluate how well a deep learning model is accepted and how helpful it is for routine decisions, and to analyze its limitations by assessing the etiology. This study aims to apply the proposed pilot to test a deep convolutional neural network (CNN)-based model for identifying consolidation in pediatric chest X-ray (CXR) images that had already been validated using the standard workflow.
For the standard validation workflow, a total of 5856 public CXRs and 950 private CXRs were used to train and validate the CNN model, and its performance was estimated assuming a perfect ground truth. For the pilot test proposed in this article, a total of 190 pediatric CXR images were used to test the CNN model as a support decision tool (SDT). The performance of the model on the pilot test was estimated using extensions of the two-test Bayesian latent-class model (BLCA). The sensitivity, specificity, and accuracy of the model were also assessed, and the clinical characteristics of the patients were compared according to model performance. The adequacy and applicability of the SDT were tested using XAI techniques: adequacy was assessed by asking two senior physicians to report their rate of agreement with the SDT, and applicability was tested by surveying three medical residents before and after using the SDT, with agreement between residents and experts quantified using the kappa index.
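The two-test BLCA mentioned above estimates sensitivity and specificity without assuming either rater is a perfect gold standard. The following is a minimal, hypothetical sketch of this kind of Hui-Walter latent-class model fitted with a Gibbs sampler; it is not the authors' implementation, and the simulated data, flat Beta(1, 1) priors, and sampler settings are illustrative assumptions only.

```python
# Hedged sketch: two-test Bayesian latent-class (Hui-Walter) model via Gibbs
# sampling. Test 1 plays the role of the CNN model, test 2 the expert panel.
# Data, priors, and settings are hypothetical, not the paper's.
import numpy as np

rng = np.random.default_rng(0)

# --- simulate data with a known latent disease status -----------------
N, prev = 2000, 0.6
true_se = np.array([0.90, 0.85])   # assumed sensitivities of tests 1 and 2
true_sp = np.array([0.80, 0.90])   # assumed specificities
d = rng.random(N) < prev           # latent "true consolidation" status
y = np.empty((N, 2), dtype=bool)   # observed binary test results
for j in range(2):
    y[:, j] = np.where(d, rng.random(N) < true_se[j],
                          rng.random(N) < 1 - true_sp[j])

# --- Gibbs sampler with flat Beta(1, 1) priors ------------------------
pi, se, sp = 0.5, np.array([0.8, 0.8]), np.array([0.8, 0.8])
draws = []
for it in range(3000):
    # P(diseased | test results) for each subject, assuming conditional
    # independence of the two tests given the latent status
    p1 = pi * np.prod(np.where(y, se, 1 - se), axis=1)
    p0 = (1 - pi) * np.prod(np.where(y, 1 - sp, sp), axis=1)
    z = rng.random(N) < p1 / (p1 + p0)          # sampled latent status
    # conjugate Beta updates for prevalence, sensitivities, specificities
    pi = rng.beta(1 + z.sum(), 1 + N - z.sum())
    for j in range(2):
        se[j] = rng.beta(1 + y[z, j].sum(), 1 + (~y[z, j]).sum())
        sp[j] = rng.beta(1 + (~y[~z, j]).sum(), 1 + y[~z, j].sum())
    if it >= 1000:                               # discard burn-in
        draws.append((pi, se.copy(), sp.copy()))

post_se = np.mean([s for _, s, _ in draws], axis=0)
print("posterior mean sensitivity per test:", post_se.round(2))
```

Note that a one-population, two-test latent-class model is only weakly identified; in practice, informative priors on at least one test (e.g., from the literature on expert readers) are typically used, which is presumably why the paper speaks of "extensions" of the basic BLCA.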
The CXRs of the pilot test were labeled by the panel of experts as consolidation (124/176, 70.4%) or no consolidation/other infiltrates (52/176, 29.5%). A total of 31/176 (17.6%) discrepancies were found between the model and the panel of experts, with a kappa index of 0.6. The sensitivity and specificity reached medians of 90.9 (95% credible interval (CrI), 81.2-99.9) and 77.7 (95% CrI, 63.3-98.1), respectively. The senior physicians reported a high agreement rate (70%) with the system in identifying logical consolidation patterns. The three medical residents reached higher agreement with the experts when using the SDT than when reading alone (0.66±0.1 vs. 0.75±0.2).
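The kappa index reported above measures chance-corrected agreement between the model and the expert panel. The sketch below computes Cohen's kappa from a 2x2 confusion matrix; the counts are hypothetical, chosen only to match the marginal totals in this abstract (124 consolidation, 52 no consolidation, 31 discrepancies), and the exact split of the discrepancies is our assumption.

```python
# Cohen's kappa for agreement between a model and an expert panel.
# The confusion-matrix counts below are hypothetical.

def cohen_kappa(confusion):
    """confusion[i][j] = count where the model gave label i and the panel gave j."""
    n = sum(sum(row) for row in confusion)
    k = len(confusion)
    # observed agreement: fraction of cases on the diagonal
    p_obs = sum(confusion[i][i] for i in range(k)) / n
    # chance agreement: product of each rater's marginal label frequencies
    p_exp = sum(
        (sum(confusion[i]) / n) * (sum(row[i] for row in confusion) / n)
        for i in range(k)
    )
    return (p_obs - p_exp) / (1 - p_exp)

# rows = model label, cols = panel label (consolidation, no consolidation)
cm = [[110, 17],   # model says consolidation
      [14, 35]]    # model says no consolidation / other infiltrate
print(round(cohen_kappa(cm), 2))  # → 0.57
```

With this assumed split, kappa comes out near the 0.6 reported in the abstract; equivalent results can be obtained with `sklearn.metrics.cohen_kappa_score` on per-image label vectors.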
Through the pilot test, we verified that the performance of the deep learning model was underestimated when a perfect ground truth was assumed. Furthermore, by conducting adequacy and applicability tests, we can ensure that the model identifies logical patterns within the CXRs, and that augmenting clinicians with automated preliminary-read assistants could accelerate their workflows and enhance accuracy in identifying consolidation in pediatric CXR images.