Pan Ian, Thodberg Hans Henrik, Halabi Safwan S, Kalpathy-Cramer Jayashree, Larson David B
Department of Radiology, Warren Alpert Medical School, Brown University, 593 Eddy St, Providence, RI 02903 (I.P.); Department of Diagnostic Imaging, Rhode Island Hospital, Providence, RI (I.P.); Visiana, Hørsholm, Denmark (H.H.T.); Department of Radiology, Stanford University, Palo Alto, Calif (S.S.H., D.B.L.); and Department of Radiology, Athinoula A. Martinos Center for Biomedical Imaging, Massachusetts General Hospital, Harvard Medical School, Boston, Mass (J.K.C.).
Radiol Artif Intell. 2019 Nov 20;1(6):e190053. doi: 10.1148/ryai.2019190053.
To investigate improvements in performance for automatic bone age estimation that can be gained through model ensembling.
A total of 48 submissions from the 2017 RSNA Pediatric Bone Age Machine Learning Challenge were used. Participants were provided with 12 611 pediatric hand radiographs, with bone ages determined by a pediatric radiologist, to develop models for bone age determination. The final results were determined using a test set of 200 radiographs labeled with the weighted average of six ratings. For all possible combinations of up to 10 models, the mean pairwise model correlation and the ensemble performance, measured as the mean absolute deviation (MAD), were evaluated. A bootstrap analysis using the 200 test radiographs was conducted to estimate the true generalization MAD.
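The quantities named in the methods can be illustrated with a minimal Python sketch. Several details here are assumptions rather than the study's stated procedure: the ensemble prediction is taken to be the unweighted mean of the member models' predictions, the pairwise correlation is computed on prediction errors (prediction minus reference bone age), and the bootstrap is a plain resampling of test cases with replacement. All function names are hypothetical.

```python
import random
from itertools import combinations
from statistics import mean


def mad(pred, truth):
    """Mean absolute deviation between predictions and reference bone ages (months)."""
    return mean(abs(p - t) for p, t in zip(pred, truth))


def ensemble_prediction(preds, members):
    """Case-by-case unweighted mean of the selected models' predictions (assumed rule)."""
    n_cases = len(preds[members[0]])
    return [mean(preds[m][i] for m in members) for i in range(n_cases)]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def mean_pairwise_correlation(preds, truth, members):
    """Mean correlation of prediction errors over all model pairs (assumed definition)."""
    errors = {m: [p - t for p, t in zip(preds[m], truth)] for m in members}
    return mean(pearson(errors[a], errors[b]) for a, b in combinations(members, 2))


def bootstrap_mad(pred, truth, n_boot=1000, seed=0):
    """Average MAD over test sets resampled with replacement (simple bootstrap)."""
    rng = random.Random(seed)
    n = len(truth)
    scores = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(mad([pred[i] for i in idx], [truth[i] for i in idx]))
    return mean(scores)


# Toy data: two models whose errors are perfectly anticorrelated.
truth = [10, 20, 30, 40]
preds = [[12, 18, 32, 38], [8, 22, 28, 42]]
combined = ensemble_prediction(preds, (0, 1))
```

In this toy case each model alone is off by 2 months on every radiograph, yet their errors cancel in the average, which is the mechanism by which less-correlated models can outperform any single member.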
The estimated generalization MAD of a single model was 4.55 months. The best-performing ensemble consisted of four models with an MAD of 3.79 months. The mean pairwise correlation of models within this ensemble was 0.47. In comparison, the lowest achievable MAD by combining the highest-ranking models based on individual scores was 3.93 months using eight models with a mean pairwise model correlation of 0.67.
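The contrast between the two selection strategies in these results can be sketched as follows. This is an illustration under assumptions, not the study's code: ensembles are formed by unweighted averaging, the "naive" strategy takes the top k models ranked by individual MAD, and the exhaustive strategy scores every combination up to a size limit. Function names are hypothetical.

```python
from itertools import combinations
from statistics import mean


def mad(pred, truth):
    """Mean absolute deviation between predictions and reference bone ages (months)."""
    return mean(abs(p - t) for p, t in zip(pred, truth))


def ensemble_mad(preds, truth, members):
    """MAD of the unweighted mean of the selected models' predictions (assumed rule)."""
    averaged = [mean(preds[m][i] for m in members) for i in range(len(truth))]
    return mad(averaged, truth)


def best_top_k(preds, truth, max_size):
    """Naive strategy: rank models by individual MAD, then try the top 1..max_size."""
    ranked = sorted(range(len(preds)), key=lambda m: mad(preds[m], truth))
    candidates = [tuple(ranked[:k]) for k in range(1, max_size + 1)]
    return min(candidates, key=lambda c: ensemble_mad(preds, truth, c))


def best_exhaustive(preds, truth, max_size):
    """Exhaustive strategy: score every combination of 1..max_size models."""
    candidates = (
        c
        for k in range(1, max_size + 1)
        for c in combinations(range(len(preds)), k)
    )
    return min(candidates, key=lambda c: ensemble_mad(preds, truth, c))


# Toy data: models 0 and 1 overestimate, model 2 underestimates.
truth = [10, 20, 30, 40]
preds = [
    [11, 21, 31, 41],          # individual MAD 1.0 (best single model)
    [11.5, 21.5, 31.5, 41.5],  # individual MAD 1.5
    [8, 18, 28, 38],           # individual MAD 2.0 (worst single model)
]
naive = best_top_k(preds, truth, max_size=2)
exhaustive = best_exhaustive(preds, truth, max_size=2)
```

Here the exhaustive search picks the two individually weaker models because their errors offset each other, while the top-ranked selection cannot improve on the best single model. With 48 models and ensembles of up to 10, the exhaustive search space runs to billions of combinations, so a pure-Python loop like this one is only practical for small illustrations.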
Combining less-correlated, high-performing models resulted in better performance than naively combining the top-performing models. Machine learning competitions within radiology should be encouraged to spur development of heterogeneous models whose predictions can be combined to achieve optimal performance. © RSNA, 2019. See also the commentary by Siegel in this issue.