Impact of test set composition on AI performance in pediatric wrist fracture detection in X-rays.

Author information

Till Tristan, Scherkl Mario, Stranger Nikolaus, Singer Georg, Hankel Saskia, Flucher Christina, Hržić Franko, Štajduhar Ivan, Tschauner Sebastian

Affiliations

Division of Pediatric Radiology, Department of Radiology, Medical University of Graz, Auenbruggerplatz 34, Graz, 8036, Austria.

Department of Pediatric and Adolescent Surgery, Medical University of Graz, Auenbruggerplatz 34, Graz, 8036, Austria.

Publication information

Eur Radiol. 2025 May 16. doi: 10.1007/s00330-025-11669-z.

Abstract

OBJECTIVES

To evaluate how two test set sampling strategies (random selection and balanced sampling) affect the performance of artificial intelligence (AI) models in pediatric wrist fracture detection on radiographs, and to highlight the need for standardization in test set design.

MATERIALS AND METHODS

This retrospective study used the open-source GRAZPEDWRI-DX dataset of pediatric wrist radiographs from 6091 patients. Two test sets, each containing 4588 images, were constructed: one balanced by case difficulty, projection type, and fracture presence, and the other selected at random. EfficientNet and YOLOv11 models were trained and validated on 18,762 radiographs and tested on both sets. Binary classification and object detection were evaluated using precision, recall, F1 score, AP50, and AP50-95. Differences between the test sets were assessed with nonparametric tests.
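The two sampling strategies can be illustrated with a minimal Python sketch. It is not the authors' procedure: the record fields (`difficulty`, `projection`, `fracture`), the equal-quota rule, and the toy data are assumptions made only to show how a balanced draw differs from a random one.

```python
import random
from collections import defaultdict

# Hypothetical stratification keys; the abstract names these three factors
# but does not say how they were encoded.
KEYS = ("difficulty", "projection", "fracture")


def make_toy_records():
    """Build synthetic records in which hard cases are rare (1 in 6)."""
    records = []
    for difficulty, share in (("easy", 5), ("hard", 1)):
        for projection in ("ap", "lateral"):
            for fracture in (False, True):
                for _ in range(50 * share):
                    records.append({
                        "id": len(records),
                        "difficulty": difficulty,
                        "projection": projection,
                        "fracture": fracture,
                    })
    return records


def random_sample(records, n, seed=0):
    """Draw n records uniformly at random without replacement."""
    rng = random.Random(seed)
    return rng.sample(records, n)


def balanced_sample(records, n, keys=KEYS, seed=0):
    """Draw about n records with an equal quota per stratum.

    A stratum is one combination of the values of `keys`. Strata smaller
    than the quota contribute all their records, so the result can hold
    fewer than n records.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for rec in records:
        strata[tuple(rec[k] for k in keys)].append(rec)
    quota = n // len(strata)
    picked = []
    for name in sorted(strata):
        group = strata[name]
        picked.extend(rng.sample(group, min(quota, len(group))))
    return picked


def hard_share(sample):
    """Fraction of records labeled as hard."""
    return sum(r["difficulty"] == "hard" for r in sample) / len(sample)


if __name__ == "__main__":
    data = make_toy_records()
    print("random  :", round(hard_share(random_sample(data, 400)), 3))
    print("balanced:", round(hard_share(balanced_sample(data, 400)), 3))
```

In this toy setup the random draw mirrors the pool, where hard cases are a minority, while the balanced draw gives every stratum the same weight and so contains a much larger share of hard cases.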

RESULTS

Performance metrics significantly decreased in the balanced test set with more challenging cases. For example, the precision for YOLOv11 models decreased from 0.95 in the random set to 0.83 in the balanced set. Similar trends were observed for recall, accuracy, and F1 score, indicating that models trained on easy-to-recognize cases performed poorly on more complex ones. These results were consistent across all model variants tested.
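For reference, the classification metrics named here follow directly from confusion-matrix counts. The sketch below uses the standard definitions; the function name and the example counts are hypothetical and are not taken from the study.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute precision, recall, F1, and accuracy from confusion counts.

    tp/fp/fn/tn are the numbers of true positives, false positives,
    false negatives, and true negatives. Undefined ratios fall back to 0.0.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": accuracy,
    }


if __name__ == "__main__":
    # Hypothetical counts for illustration only, not results from the paper.
    for name, value in binary_metrics(tp=90, fp=10, fn=10, tn=90).items():
        print(f"{name}: {value:.2f}")
```

Because every one of these metrics depends on which cases are in the test set, the same model can score differently on two test sets drawn from the same data.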

CONCLUSION

AI models for pediatric wrist fracture detection exhibit reduced performance when tested on balanced datasets containing more difficult cases, compared to randomly selected cases. This highlights the importance of constructing representative and standardized test sets that account for clinical complexity to ensure robust AI performance in real-world settings.

KEY POINTS

Question: Do test set sampling strategies based on sample complexity influence the performance of deep learning models in fracture detection?

Findings: AI performance in pediatric wrist fracture detection drops significantly when tested on balanced datasets with more challenging cases, compared to randomly selected cases.

Clinical relevance: Without standardized and validated test datasets that reflect clinical complexity, AI performance metrics may be overestimated, limiting the utility of AI in real-world settings.
