Institute of Medical and Biological Engineering, Medical Research Center, Seoul National University, Seoul, Korea.
Mathematical Institute, University of Oxford, United Kingdom.
Korean J Radiol. 2023 Feb;24(2):155-165. doi: 10.3348/kjr.2022.0548.
Little is known about the effects of using different expert-determined reference standards when evaluating the performance of deep learning-based automatic detection (DLAD) models and their added value to radiologists. We assessed the concordance of expert-determined standards with a clinical gold standard (herein, pathological confirmation) and the effects of different expert-determined reference standards on the estimates of radiologists' diagnostic performance to detect malignant pulmonary nodules on chest radiographs with and without the assistance of a DLAD model.
This study included chest radiographs from 50 patients with pathologically proven lung cancer and 50 controls. Five expert-determined standards were constructed using the interpretations of 10 experts: individual judgment by the most experienced expert, majority vote, consensus judgments of two and three experts, and a latent class analysis (LCA) model. In separate reader tests, an additional 10 radiologists independently interpreted the radiographs, first without and then with the assistance of the DLAD model. Their diagnostic performance was estimated using the clinical gold standard and the various expert-determined standards as the reference standard, and the results were compared using the chi-square test with Bonferroni correction.
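The majority-vote and LCA reference standards can be illustrated with a short sketch. The Python code below is a minimal, hypothetical implementation, not the authors' code: it assumes each expert provides one binary read per radiograph and fits a standard two-class latent class model with expectation-maximization under the usual conditional-independence assumption; all names and the simulated reads are illustrative.

```python
import numpy as np

def majority_vote(reads):
    """Majority-vote reference standard from a (n_images, n_experts) 0/1 array."""
    return (reads.mean(axis=1) > 0.5).astype(int)   # ties resolve to negative

def lca_posterior(reads, n_iter=500, tol=1e-8, seed=0):
    """Two-class latent class analysis fitted with EM.

    Each expert j is modeled by a sensitivity se[j] and specificity sp[j]
    conditional on the latent (true) label, with reads assumed conditionally
    independent given that label; returns P(latent positive | reads) per image.
    """
    x = np.asarray(reads, dtype=float)
    rng = np.random.default_rng(seed)
    prev = 0.5
    se = rng.uniform(0.6, 0.9, x.shape[1])   # init above chance to fix label orientation
    sp = rng.uniform(0.6, 0.9, x.shape[1])
    post = np.full(len(x), prev)
    for _ in range(n_iter):
        # E-step: per-image posterior probability of the positive latent class
        lp = np.log(prev) + (x @ np.log(se) + (1 - x) @ np.log(1 - se))
        ln = np.log(1 - prev) + (x @ np.log(1 - sp) + (1 - x) @ np.log(sp))
        new = 1.0 / (1.0 + np.exp(ln - lp))
        # M-step: re-estimate prevalence and per-expert accuracies
        prev = np.clip(new.mean(), 1e-6, 1 - 1e-6)
        se = np.clip(new @ x / new.sum(), 1e-6, 1 - 1e-6)
        sp = np.clip((1 - new) @ (1 - x) / (1 - new).sum(), 1e-6, 1 - 1e-6)
        if np.abs(new - post).max() < tol:
            return new
        post = new
    return post

# Hypothetical reads: 100 radiographs x 10 experts.
reads = np.random.default_rng(1).integers(0, 2, (100, 10))
lca_standard = (lca_posterior(reads) > 0.5).astype(int)
mv_standard = majority_vote(reads)
```

Unlike the majority vote, the LCA standard weights each expert by an estimated sensitivity and specificity rather than counting reads equally, which may be why it tracked the pathological gold standard most closely in this study.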
The LCA model (sensitivity, 72.6%; specificity, 100%) was most similar to the clinical gold standard. When expert-determined standards were used, the sensitivities of the radiologists and the DLAD model alone were overestimated, and their specificities were underestimated (all P-values < 0.05). DLAD assistance diminished the overestimation of sensitivity but exaggerated the underestimation of specificity (all P-values < 0.001). The DLAD model improved sensitivity and specificity to a greater extent when using the clinical gold standard than when using the expert-determined standards (all P-values < 0.001), except for sensitivity with the LCA model (P = 0.094).
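As a rough illustration of how the choice of reference standard shifts these estimates, the sketch below computes one reader's sensitivity and specificity against a pathological gold standard and against a hypothetical expert-determined standard, then applies a generic Bonferroni-corrected chi-square test on a 2x2 table. This is a toy construction with made-up counts, not the paper's data or its exact statistical procedure.

```python
import numpy as np
from scipy.stats import chi2_contingency

def sens_spec(pred, ref):
    """Sensitivity and specificity of binary calls against a reference standard."""
    tp = np.sum((pred == 1) & (ref == 1)); fn = np.sum((pred == 0) & (ref == 1))
    tn = np.sum((pred == 0) & (ref == 0)); fp = np.sum((pred == 1) & (ref == 0))
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical 100-case set: indices 0-49 are pathologically proven cancers.
gold = np.repeat([1, 0], 50)
expert = gold.copy()
expert[:12] = 0            # expert standard misses 12 subtle cancers
reader = gold.copy()
reader[:6] = 0             # reader misses 6 of those cancers ...
reader[50:58] = 1          # ... and false-calls 8 controls

for name, ref in (("gold", gold), ("expert", expert)):
    se, sp = sens_spec(reader, ref)
    print(f"vs {name:6s} sens={se:.2f} spec={sp:.2f}")
# The 6 cancers detected by the reader but absent from the expert standard
# count there as false positives, so sensitivity is inflated (0.88 -> 1.00)
# and specificity is deflated (0.84 -> 0.77), mirroring the bias reported above.

# Bonferroni-corrected chi-square test on hypothetical hit/miss counts
# under two reading conditions, adjusted for, e.g., 5 comparisons.
table = np.array([[44, 6], [38, 12]])
_, p, _, _ = chi2_contingency(table)
print("adjusted P =", min(1.0, p * 5))
```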
The LCA model was most similar to the clinical gold standard for malignant pulmonary nodule detection on chest radiographs. Expert-determined standards caused bias in measuring the diagnostic performance of the artificial intelligence model.