Chen Hongbo, Alfred Myrtede, Brown Andrew D, Atinga Angela, Cohen Eldan
Department of Mechanical and Industrial Engineering, University of Toronto, Toronto, ON, Canada.
St Michael's Hospital, Toronto, ON, Canada.
JMIR Form Res. 2024 Dec 5;8:e59045. doi: 10.2196/59045.
While deep learning classifiers have shown remarkable results in detecting chest X-ray (CXR) pathologies, their adoption in clinical settings is often hampered by a lack of transparency. To bridge this gap, this study introduces the neural prototype tree (NPT), an interpretable image classifier that combines the diagnostic capability of deep learning models with the interpretability of a decision tree for CXR pathology detection.
This study aimed to investigate the utility of the NPT classifier across 3 dimensions: performance, interpretability, and fairness. It subsequently examined the complex interactions between these dimensions. We highlight both local and global explanations of the NPT classifier and discuss its potential utility in clinical settings.
This study used CXRs from the publicly available Chest X-ray 14, CheXpert, and MIMIC-CXR datasets. We trained 6 separate classifiers for each CXR pathology in all datasets: 1 baseline residual neural network (ResNet-152) and 5 NPT classifiers with varying levels of interpretability. Performance, interpretability, and fairness were measured using the area under the receiver operating characteristic curve (ROC AUC), interpretation complexity (IC), and mean true positive rate (TPR) disparity, respectively. Linear regression analyses were performed to investigate the relationship between IC and ROC AUC, as well as between IC and mean TPR disparity.
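To make the fairness metric concrete, the sketch below shows one plausible formulation of mean TPR disparity: the average absolute gap between each subgroup's true positive rate and the overall TPR. The function names, the toy data, and this exact definition are illustrative assumptions, not the study's published implementation.

```python
import numpy as np

def tpr(y_true, y_pred):
    # True positive rate: fraction of actual positives predicted positive
    pos = y_true == 1
    return float(np.mean(y_pred[pos] == 1)) if pos.any() else float("nan")

def mean_tpr_disparity(y_true, y_pred, groups):
    # Hypothetical definition: mean absolute gap between each subgroup's
    # TPR and the overall TPR, averaged over subgroups (e.g., age or sex).
    overall = tpr(y_true, y_pred)
    gaps = [abs(tpr(y_true[groups == g], y_pred[groups == g]) - overall)
            for g in np.unique(groups)]
    return float(np.mean(gaps))

# Toy example with two subgroups ("a" and "b"), e.g., a sex attribute
y_true = np.array([1, 1, 1, 1, 0, 0, 1, 1, 0, 0])
y_pred = np.array([1, 1, 0, 1, 0, 1, 1, 0, 0, 0])
groups = np.array(["a"] * 5 + ["b"] * 5)

disparity = mean_tpr_disparity(y_true, y_pred, groups)
```

A disparity of 0 would mean every subgroup has the same TPR; larger values indicate that the classifier misses positives more often in some subgroups than others.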
The performance of the NPT classifier improved as the IC level increased, surpassing that of ResNet-152 at IC level 15 for the Chest X-ray 14 dataset and IC level 31 for the CheXpert and MIMIC-CXR datasets. The NPT classifier at IC level 1 exhibited the highest degree of unfairness, as indicated by the mean TPR disparity. The magnitude of unfairness, as measured by the mean TPR disparity, was more pronounced in groups differentiated by age (Chest X-ray 14: 0.112, SD 0.015; CheXpert: 0.097, SD 0.010; MIMIC-CXR: 0.093, SD 0.017) than by sex (Chest X-ray 14: 0.054, SD 0.012; CheXpert: 0.062, SD 0.008; MIMIC-CXR: 0.066, SD 0.013). A significant positive relationship between interpretability (ie, IC level) and performance (ie, ROC AUC) was observed across all CXR pathologies (P<.001). Furthermore, linear regression analysis revealed a significant negative relationship between interpretability and fairness (ie, mean TPR disparity) across age and sex subgroups (P<.001).
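The reported IC-versus-performance trend can be illustrated with a minimal ordinary least-squares fit. The IC levels and ROC AUC values below are toy numbers chosen only to mimic the qualitative pattern described above; they are not the study's data, and the study's own regression would additionally report significance tests.

```python
import numpy as np

# Hypothetical IC levels and illustrative ROC AUC values: performance
# rises as interpretation complexity increases (toy data, not study data).
ic_levels = np.array([1, 3, 7, 15, 31], dtype=float)
roc_auc = np.array([0.70, 0.74, 0.78, 0.81, 0.83])

# Degree-1 polynomial fit = ordinary least-squares line; a positive slope
# corresponds to the positive IC-performance relationship reported above.
slope, intercept = np.polyfit(ic_levels, roc_auc, 1)
```

An analogous fit of mean TPR disparity on IC level would yield a negative slope under the study's finding that fairness improves (disparity shrinks) as interpretability increases.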
By illuminating the intricate relationship between performance, interpretability, and fairness of the NPT classifier, this research offers insightful perspectives that could guide future developments in effective, interpretable, and equitable deep learning classifiers for CXR pathology detection.