D. W. G. Langerhuizen, R. L. Jaarsma, J. N. Doornberg, Flinders University, Department of Orthopaedic and Trauma Surgery, Flinders Medical Centre, Adelaide, Australia S. J. Janssen, Department of Orthopaedic Surgery, Amphia Hospital, Breda, the Netherlands W. H. Mallee, G. M. M. J. Kerkhoffs, Department of Orthopaedic Surgery, Amsterdam Movement Sciences, Amsterdam University Medical Centre, Amsterdam, the Netherlands M. P. J. van den Bekerom, Department of Orthopaedic Surgery, Onze Lieve Vrouwe Gasthuis, Amsterdam, the Netherlands D. Ring, Department of Surgery and Perioperative Care, Dell Medical School, the University of Texas at Austin, Austin, TX, USA.
Clin Orthop Relat Res. 2019 Nov;477(11):2482-2491. doi: 10.1097/CORR.0000000000000848.
BACKGROUND: Artificial-intelligence algorithms derive rules and patterns from large amounts of data to calculate the probabilities of various outcomes using new sets of similar data. In medicine, artificial intelligence (AI) has been applied primarily to image-recognition diagnostic tasks and to evaluating the probabilities of particular outcomes after treatment. However, the performance and limitations of AI in the automated detection and classification of fractures have not been examined comprehensively.
QUESTIONS/PURPOSES: In this systematic review, we asked (1) What is the proportion of correctly detected or classified fractures and the area under the receiver operating characteristic curve (AUC) of AI fracture detection and classification models? (2) What is the performance of AI in this setting compared with the performance of human examiners?
METHODS: The PubMed, Embase, and Cochrane databases were systematically searched from the start of each respective database until September 6, 2018, using terms related to "fracture", "artificial intelligence", and "detection, prediction, or evaluation." Of 1221 identified studies, we retained 10: eight studies involved fracture detection (ankle, hand, hip, spine, wrist, and ulna), one addressed fracture classification (diaphyseal femur), and one addressed both fracture detection and classification (proximal humerus). We registered the review before data collection (PROSPERO: CRD42018110167) and used the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA). We reported the range of the accuracy and AUC for the performance of the predicted fracture detection and/or classification task. An AUC of 1.0 would indicate perfect prediction, whereas 0.5 would indicate a prediction no better than a coin flip. We conducted quality assessment using a seven-item checklist based on a modified methodologic index for nonrandomized studies (MINORS) instrument.
RESULTS: For fracture detection, the AUC in five studies reflected near-perfect prediction (range, 0.95-1.0), and the accuracy in seven studies ranged from 83% to 98%. For fracture classification, the AUC was 0.94 in one study, and the accuracy in two studies ranged from 77% to 90%. In two studies, AI outperformed human examiners for detecting and classifying hip and proximal humerus fractures, and one study showed equivalent performance for detecting wrist, hand, and ankle fractures.
CONCLUSIONS: Preliminary experience with fracture detection and classification using AI shows promising performance. AI may enhance the processing and communication of probabilistic tasks in medicine, including orthopaedic surgery. At present, inadequate reference-standard assignment for training and testing AI is the biggest hurdle to integration into the clinical workflow. The next step will be to apply AI to more challenging diagnostic and therapeutic scenarios in which certitude is absent. Future studies should also seek to address legal regulation and better determine the feasibility of implementation in clinical practice.
LEVEL OF EVIDENCE: Level II, diagnostic study.
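The review's two outcome measures are accuracy and the area under the ROC curve (AUC). The following minimal Python sketch (illustration only, not drawn from any of the included studies; the labels, probability scores, and the 0.5 decision threshold are assumptions) shows how these two metrics are computed for a hypothetical binary fracture-detection model using scikit-learn's roc_auc_score.

```python
# Minimal sketch of the two metrics reported in the review (AUC and accuracy),
# applied to a hypothetical fracture-detection model. All data are made up.
from sklearn.metrics import roc_auc_score

# Reference standard: 1 = fracture present, 0 = no fracture (hypothetical)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
# Model-predicted probabilities of fracture (hypothetical)
y_score = [0.92, 0.10, 0.75, 0.30, 0.35, 0.05, 0.88, 0.41]

# AUC: 1.0 = perfect ranking of fractures above non-fractures, 0.5 = chance
auc = roc_auc_score(y_true, y_score)
print(f"AUC = {auc:.2f}")

# Accuracy at an assumed 0.5 decision threshold
preds = [1 if s >= 0.5 else 0 for s in y_score]
accuracy = sum(p == t for p, t in zip(preds, y_true)) / len(y_true)
print(f"Accuracy = {accuracy:.0%}")
```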