Department of Diagnostic Imaging, National University Hospital, Singapore.
Saw Swee Hock School of Public Health, Institute of Data Science, Yong Loo Lin School of Medicine, National University Health System, National University of Singapore, Singapore.
Acad Radiol. 2022 Sep;29(9):1350-1358. doi: 10.1016/j.acra.2021.09.013. Epub 2021 Oct 12.
RATIONALE AND OBJECTIVES: To compare the performance of pneumothorax deep learning detection models trained with radiologist versus natural language processing (NLP) labels on the NIH ChestX-ray14 dataset.

MATERIALS AND METHODS: The ChestX-ray14 dataset consisted of 112,120 frontal chest radiographs with 5,302 positive and 106,818 negative labels for pneumothorax derived by NLP (dataset A). All 112,120 radiographs were also inspected by 4 radiologists, yielding a visually confirmed set of 5,138 positive and 104,751 negative labels for pneumothorax (dataset B). Datasets A and B were used independently to train 3 convolutional neural network (CNN) architectures (ResNet-50, DenseNet-121, and EfficientNetB3). The area under the receiver operating characteristic curve (AUC) of each model was evaluated on the official NIH test set and on an external test set of 525 chest radiographs from our emergency department.

RESULTS: AUCs on the NIH internal test set were significantly higher for CNN models trained with radiologist versus NLP labels across all architectures. AUCs for the NLP/radiologist-label models were 0.838 (95% CI: 0.830, 0.846)/0.881 (95% CI: 0.873, 0.887) for ResNet-50 (p = 0.034), 0.839 (95% CI: 0.831, 0.847)/0.880 (95% CI: 0.873, 0.887) for DenseNet-121, and 0.869 (95% CI: 0.863, 0.876)/0.943 (95% CI: 0.939, 0.946) for EfficientNetB3 (p ≤ 0.001). Evaluation with the external test set also showed higher AUCs (p < 0.001) for the CNN models trained with radiologist versus NLP labels across all architectures. AUCs for the NLP/radiologist-label models were 0.686 (95% CI: 0.632, 0.740)/0.806 (95% CI: 0.758, 0.854) for ResNet-50, 0.736 (95% CI: 0.686, 0.787)/0.871 (95% CI: 0.830, 0.912) for DenseNet-121, and 0.822 (95% CI: 0.775, 0.868)/0.915 (95% CI: 0.882, 0.948) for EfficientNetB3.

CONCLUSION: We demonstrated improved performance and generalizability of pneumothorax detection deep learning models trained with radiologist labels compared to models trained with NLP labels.
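The evaluation above reports each model's AUC with a 95% confidence interval. The abstract does not state which CI method the authors used; a minimal sketch of one common approach, a percentile bootstrap over test-set resamples with scikit-learn's `roc_auc_score`, is shown below. The function name `auc_with_ci` and the synthetic labels/scores are illustrative assumptions, not the paper's code.

```python
# Hedged sketch: point AUC plus a percentile-bootstrap 95% CI.
# auc_with_ci and the synthetic data below are assumptions for illustration.
import numpy as np
from sklearn.metrics import roc_auc_score

def auc_with_ci(y_true, y_score, n_boot=1000, alpha=0.05, seed=0):
    """Return (AUC, CI lower bound, CI upper bound) via percentile bootstrap."""
    rng = np.random.default_rng(seed)
    auc = roc_auc_score(y_true, y_score)
    n = len(y_true)
    boot = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)          # resample cases with replacement
        if len(np.unique(y_true[idx])) < 2:  # AUC needs both classes present
            continue
        boot.append(roc_auc_score(y_true[idx], y_score[idx]))
    lo, hi = np.percentile(boot, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return auc, lo, hi

# Toy usage with synthetic labels and informative scores
rng = np.random.default_rng(42)
y = rng.integers(0, 2, 500)
s = y * 0.5 + rng.normal(0, 0.5, 500)
auc, lo, hi = auc_with_ci(y, s)
```

Comparing two correlated AUCs on the same test set (as done here between NLP- and radiologist-label models) is typically handled with a paired test such as DeLong's; the bootstrap sketch covers only the single-model interval.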