From the Drexel University College of Medicine, Philadelphia, Pa (S.M.S.); University of Maryland Medical Intelligent Imaging (UM2ii) Center, Department of Diagnostic Radiology and Nuclear Medicine, University of Maryland School of Medicine, 670 W Baltimore St, 1st Fl, Room 1172, Baltimore, MD 21201 (S.M.S., K.P., E.B., V.S.P., P.H.Y.); and Malone Center for Engineering in Healthcare, Johns Hopkins University, Baltimore, Md (P.H.Y.).
Radiol Artif Intell. 2024 May;6(3):e230240. doi: 10.1148/ryai.230240.
Purpose
To evaluate the robustness of an award-winning bone age deep learning (DL) model to extensive variations in image appearance.

Materials and Methods
In December 2021, the DL bone age model that won the 2017 RSNA Pediatric Bone Age Challenge was retrospectively evaluated using the RSNA validation set (1425 pediatric hand radiographs; internal test set in this study) and the Digital Hand Atlas (DHA) (1202 pediatric hand radiographs; external test set). Each test image underwent seven types of transformations (rotations, flips, brightness, contrast, inversion, laterality marker, and resolution) to represent a range of image appearances, many of which simulate real-world variations. Computational "stress tests" were performed by comparing the model's predictions on baseline versus transformed images. Mean absolute differences (MADs) between predicted bone ages and radiologist-determined ground truth on baseline versus transformed images were compared using Wilcoxon signed rank tests. The proportions of clinically significant errors (CSEs) were compared using McNemar tests.

Results
There was no evidence of a difference in MAD between the two baseline test sets (RSNA = 6.8 months, DHA = 6.9 months; P = .05), indicating good model generalization to external data. Except for RSNA dataset images with an appended radiologic laterality marker (P = .86), there were significant differences in MAD for both the DHA and RSNA datasets across the other transformation groups (rotations, flips, brightness, contrast, inversion, and resolution). There were significant differences in the proportion of CSEs for 57% (19 of 33) of the image transformations performed on the DHA dataset.

Conclusion
Although an award-winning pediatric bone age DL model generalized well to curated external images, its predictions were inconsistent on images that had undergone simple transformations reflecting several real-world variations in image appearance.
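The seven transformation families described above might be approximated as in the following NumPy sketch on an 8-bit grayscale radiograph array. The exact parameters the study used (rotation angles, brightness/contrast factors, marker appearance, downsampling method) are not given in the abstract, so the values and implementations here are illustrative assumptions.

```python
import numpy as np

def rotate90(img: np.ndarray, k: int = 1) -> np.ndarray:
    # Rotation in 90-degree steps (arbitrary angles would need interpolation).
    return np.rot90(img, k)

def flip(img: np.ndarray, horizontal: bool = True) -> np.ndarray:
    # Horizontal or vertical flip.
    return np.fliplr(img) if horizontal else np.flipud(img)

def brightness(img: np.ndarray, factor: float) -> np.ndarray:
    # Scale pixel intensities, clipped to the 8-bit range.
    return np.clip(img.astype(np.float32) * factor, 0, 255).astype(np.uint8)

def contrast(img: np.ndarray, factor: float) -> np.ndarray:
    # Stretch or compress intensities around the image mean.
    mean = img.mean()
    return np.clip((img.astype(np.float32) - mean) * factor + mean, 0, 255).astype(np.uint8)

def invert(img: np.ndarray) -> np.ndarray:
    # Grayscale inversion (bone appears dark instead of bright).
    return 255 - img

def add_marker(img: np.ndarray, size: int = 16) -> np.ndarray:
    # Crude stand-in for a radiologic laterality marker: a bright corner block.
    out = img.copy()
    out[:size, -size:] = 255
    return out

def downsample(img: np.ndarray, step: int = 2) -> np.ndarray:
    # Reduce resolution by keeping every `step`-th pixel (nearest-neighbor style).
    return img[::step, ::step]

# Random array standing in for a 256 x 256 hand radiograph.
img = np.random.default_rng(0).integers(0, 256, (256, 256), dtype=np.uint8)
variants = {
    "rotation": rotate90(img),
    "flip": flip(img),
    "brightness": brightness(img, 1.5),
    "contrast": contrast(img, 0.5),
    "inversion": invert(img),
    "marker": add_marker(img),
    "resolution": downsample(img),
}
```

In the stress-test protocol, each variant would be fed to the bone age model and its prediction compared against the prediction on the untransformed baseline image.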
Keywords: Pediatrics, Hand, Convolutional Neural Network, Radiography
© RSNA, 2024
See also commentary by Faghani and Erickson in this issue.
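The paired statistical comparisons the abstract describes (Wilcoxon signed rank tests on absolute errors; McNemar tests on CSE proportions) might be sketched as follows on simulated data. The error distributions and the 24-month CSE threshold are assumptions for illustration, not the study's values, and McNemar's test is implemented here as an exact binomial test on the discordant pairs.

```python
import numpy as np
from scipy.stats import wilcoxon, binomtest

# Simulated predictions: transformed-image predictions are noisier than baseline.
rng = np.random.default_rng(42)
n = 200
truth = rng.uniform(12, 216, n)           # ground-truth bone ages (months)
pred_base = truth + rng.normal(0, 7, n)   # baseline predictions
pred_trans = truth + rng.normal(0, 14, n) # predictions on transformed images

err_base = np.abs(pred_base - truth)
err_trans = np.abs(pred_trans - truth)
mad_base, mad_trans = err_base.mean(), err_trans.mean()

# Paired comparison of absolute errors (Wilcoxon signed rank test).
w_stat, w_p = wilcoxon(err_base, err_trans)

# Clinically significant error: placeholder 24-month threshold (assumption).
CSE_THRESHOLD = 24.0
cse_base = err_base > CSE_THRESHOLD
cse_trans = err_trans > CSE_THRESHOLD

# Exact McNemar test: binomial test on the discordant pairs.
b = int(np.sum(cse_base & ~cse_trans))  # CSE at baseline only
c = int(np.sum(~cse_base & cse_trans))  # CSE after transformation only
mcnemar_p = binomtest(b, b + c, 0.5).pvalue if (b + c) > 0 else 1.0

print(f"MAD baseline={mad_base:.1f} mo, transformed={mad_trans:.1f} mo, Wilcoxon P={w_p:.3g}")
print(f"CSE: {cse_base.mean():.2%} vs {cse_trans.mean():.2%}, McNemar P={mcnemar_p:.3g}")
```

Because both tests are paired, each transformed image is compared against its own baseline, which isolates the effect of the transformation from between-image variability.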