From the University of Maryland Medical Intelligent Imaging (UM2ii) Center, Department of Diagnostic Radiology and Nuclear Medicine, University of Maryland School of Medicine, 670 W Baltimore St, First Floor, Room 1172, Baltimore, MD 21201.
Radiology. 2023 Feb;306(2):e220505. doi: 10.1148/radiol.220505. Epub 2022 Sep 27.
Background Although deep learning (DL) models have demonstrated expert-level ability for pediatric bone age prediction, they have shown poor generalizability and bias in other use cases.
Purpose To quantify generalizability and bias in a bone age DL model measured by performance on external versus internal test sets and performance differences between different demographic groups, respectively.
Materials and Methods The winning DL model of the 2017 RSNA Pediatric Bone Age Challenge was retrospectively evaluated and trained on 12 611 pediatric hand radiographs from two U.S. hospitals. The DL model was tested from September 2021 to December 2021 on an internal validation set and an external test set of pediatric hand radiographs with diverse demographic representation. Images reporting ground-truth bone age were included for study. Mean absolute difference (MAD) between ground-truth bone age and the model prediction bone age was calculated for each set. Generalizability was evaluated by comparing MAD between internal and external evaluation sets with use of t tests. Bias was evaluated by comparing MAD and clinically significant error rate (rate of errors changing the clinical diagnosis) between demographic groups with use of t tests or analysis of variance and χ² tests, respectively (statistically significant difference defined as P < .05).
Results The internal validation set had images from 1425 individuals (773 boys), and the external test set had images from 1202 individuals (mean age, 133 months ± 60 [SD]; 614 boys). The bone age model generalized well to the external test set, with no difference in MAD (6.8 months in the validation set vs 6.9 months in the external set; P = .64). Model predictions would have led to clinically significant errors in 194 of 1202 images (16%) in the external test set. The MAD was greater for girls than boys in the internal validation set (P = .01) and in the subcategories of age and Tanner stage in the external test set (P < .001 for both).
Conclusion A deep learning (DL) bone age model generalized well to an external test set, although clinically significant sex-, age-, and sexual maturity-based biases in DL bone age were identified. © RSNA, 2022 See also the editorial by Larson in this issue.
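The metrics named in Materials and Methods can be illustrated with a minimal Python sketch. It is not the authors' code. The function names and the toy bone ages (in months) are hypothetical. The abstract says only that t tests were used, so the choice of Welch's unequal-variance statistic is an assumption. The only figure taken from the abstract is the 194 of 1202 clinically significant errors.

```python
from math import sqrt
from statistics import mean, variance


def absolute_errors(truth_months, predicted_months):
    """Per-image absolute difference between ground truth and prediction."""
    return [abs(t - p) for t, p in zip(truth_months, predicted_months)]


def mean_absolute_difference(truth_months, predicted_months):
    """MAD: mean of the per-image absolute differences, in months."""
    return mean(absolute_errors(truth_months, predicted_months))


def welch_t_statistic(sample_a, sample_b):
    """Welch's t statistic for two independent samples (assumed variant)."""
    standard_error = sqrt(
        variance(sample_a) / len(sample_a) + variance(sample_b) / len(sample_b)
    )
    return (mean(sample_a) - mean(sample_b)) / standard_error


# Hypothetical toy data, NOT values from the study.
internal_truth = [120, 96, 150, 60]
internal_pred = [126, 90, 143, 66]
external_truth = [130, 100, 140, 72]
external_pred = [137, 94, 133, 79]

internal_mad = mean_absolute_difference(internal_truth, internal_pred)
external_mad = mean_absolute_difference(external_truth, external_pred)

# Generalizability check: compare per-image errors between the two sets.
t_stat = welch_t_statistic(
    absolute_errors(internal_truth, internal_pred),
    absolute_errors(external_truth, external_pred),
)

# Clinically significant error rate reported for the external test set.
error_rate_percent = 100 * 194 / 1202

print(internal_mad, external_mad, round(t_stat, 3), round(error_rate_percent))
```

Turning the statistic into a P value would also require the t distribution with Welch-Satterthwaite degrees of freedom, which this sketch leaves out.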