Can a Deep-learning Model for the Automated Detection of Vertebral Fractures Approach the Performance Level of Human Subspecialists?

Author affiliations

Institute of Data Science and Engineering, National Chiao Tung University, Hsinchu, Taiwan.

Center of Teaching and Learning Development, National Chiao Tung University, Hsinchu, Taiwan.

Publication information

Clin Orthop Relat Res. 2021 Jul 1;479(7):1598-1612. doi: 10.1097/CORR.0000000000001685.

DOI:10.1097/CORR.0000000000001685

PMID:33651768

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8208416/

Abstract

BACKGROUND

Vertebral fractures are the most common osteoporotic fractures in older individuals. Recent studies suggest that the performance of artificial intelligence is equal to that of humans in detecting osteoporotic fractures, such as fractures of the hip, distal radius, and proximal humerus. However, whether artificial intelligence performs as well in detecting vertebral fractures on plain lateral spine radiographs has not yet been reported.

QUESTIONS/PURPOSES

(1) What are the accuracy, sensitivity, specificity, and interobserver reliability (kappa value) of an artificial intelligence model in detecting vertebral fractures, based on Genant fracture grades, using plain lateral spine radiographs compared with values obtained by human observers? (2) Do patients' clinical data, including the anatomic location of the fracture (thoracic or lumbar spine), T-score on dual-energy x-ray absorptiometry, or fracture grade severity, affect the performance of an artificial intelligence model? (3) How does the artificial intelligence model perform on external validation?

METHODS

Between 2016 and 2018, 1019 patients older than 60 years were treated for vertebral fractures in our institution. Seventy-eight patients were excluded because of missing CT or MRI scans (24% [19]), poor image quality on plain lateral spine radiographs (54% [42]), multiple myeloma (5% [4]), or prior spine instrumentation (17% [13]). The plain lateral radiographs of the remaining 941 patients (one radiograph per person), who had a mean age of 76 ± 12 years and 1101 vertebral fractures between T7 and L5, were retrospectively evaluated for training (n = 565), validation (n = 188), and testing (n = 188) of an artificial intelligence deep-learning model. The gold standard for diagnosis (ground truth) of a vertebral fracture was the independent interpretation of the CT or MRI reports by a spine surgeon and a radiologist. If the two observers disagreed, they rechecked the corresponding CT or MRI images together to reach a consensus. For the Genant classification, the height of the injured vertebral body was measured in the anterior, middle, and posterior thirds. Fractures were classified by height loss as Grade 1 (< 25%), Grade 2 (26% to 40%), or Grade 3 (> 40%). The framework of the artificial intelligence deep-learning model included object detection, data preprocessing of radiographs, and classification to detect vertebral fractures. Approximately 90 seconds were needed to complete the procedure and obtain the artificial intelligence model results when applied clinically. The accuracy, sensitivity, specificity, interobserver reliability (kappa value), receiver operating characteristic curve, and area under the curve (AUC) were analyzed. The bootstrapping method was applied to our testing dataset and external validation dataset. The accuracy, sensitivity, and specificity were used to investigate whether fracture anatomic location or the T-score in the dual-energy x-ray absorptiometry report affected the performance of the artificial intelligence model. The receiver operating characteristic curve and AUC were used to investigate the relationship between the performance of the artificial intelligence model and fracture grade. External validation, using plain lateral radiographs of a similar age population from another medical institute, was also performed to investigate the performance of the artificial intelligence model.
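The two statistical tools named above, the kappa value and the bootstrapping method, can be illustrated with a minimal sketch. The abstract does not specify which kappa variant, resampling scheme, or software the authors used, so the code below assumes Cohen's kappa for two raters and a simple percentile bootstrap over per-vertebra labels. The function names and the toy labels are hypothetical and are not study data.

```python
import random


def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length lists of binary labels (0 or 1)."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    p_a1 = sum(a) / n  # fraction labelled "fracture" by rater A
    p_b1 = sum(b) / n  # fraction labelled "fracture" by rater B
    expected = p_a1 * p_b1 + (1 - p_a1) * (1 - p_b1)  # chance agreement
    return (observed - expected) / (1 - expected)


def accuracy(truth, pred):
    """Fraction of vertebrae where the prediction matches the ground truth."""
    return sum(t == p for t, p in zip(truth, pred)) / len(truth)


def bootstrap_ci(truth, pred, metric, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval: resample cases with replacement."""
    rng = random.Random(seed)
    n = len(truth)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(metric([truth[i] for i in idx], [pred[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Toy labels (hypothetical): 1 = fracture, 0 = no fracture.
truth = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
pred = [1, 1, 0, 0, 0, 0, 1, 0, 1, 0]

kappa = cohen_kappa(truth, pred)
acc = accuracy(truth, pred)
ci_low, ci_high = bootstrap_ci(truth, pred, accuracy)
print(f"kappa={kappa:.3f} accuracy={acc:.2f} 95% CI=({ci_low:.2f}, {ci_high:.2f})")
```

With only ten toy cases the interval is very wide; the point is the resampling mechanics, not the numbers.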

RESULTS

The artificial intelligence model with ensemble method demonstrated excellent accuracy (93% [773 of 830] of vertebrae), sensitivity (91% [129 of 141]), and specificity (93% [644 of 689]) for detecting vertebral fractures of the lumbar spine. The interobserver reliability (kappa value) between the artificial intelligence model and human observers for thoracic and lumbar vertebrae was 0.72 (95% CI 0.65 to 0.80; p < 0.001) and 0.77 (95% CI 0.72 to 0.83; p < 0.001), respectively. The AUCs for Grades 1, 2, and 3 vertebral fractures were 0.919, 0.989, and 0.990, respectively. The artificial intelligence model with ensemble method was less able to identify normal vertebrae in osteoporotic lumbar spines (specificity 91% [260 of 285]) than in nonosteoporotic lumbar spines (specificity 95% [222 of 234]). Sensitivity was higher for detecting osteoporotic (dual-energy x-ray absorptiometry T-score ≤ -2.5) lumbar vertebral fractures (97% [60 of 62]), implying easier detection, than for nonosteoporotic vertebral fractures (83% [39 of 47]). On the external dataset, which was acquired with various radiographic techniques, the artificial intelligence model also detected lumbar vertebral fractures better than thoracic vertebral fractures. Based on the dataset for external validation, the overall accuracy, sensitivity, and specificity with the bootstrapping method were 89%, 83%, and 95%, respectively.
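The lumbar percentages reported above follow directly from the stated counts, and a short arithmetic check makes the relationship explicit. The sketch below assumes the standard confusion-matrix definitions of sensitivity, specificity, and accuracy; the function and variable names are illustrative, and only the counts come from the abstract.

```python
def diagnostic_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity, and accuracy from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
    }


# Lumbar vertebrae in the testing set, using the counts stated in the abstract:
# 129 of 141 fractured vertebrae detected, 644 of 689 normal vertebrae
# correctly classified, 830 vertebrae in total.
lumbar = diagnostic_metrics(tp=129, fn=141 - 129, tn=644, fp=689 - 644)

# Sensitivity by bone density status (T-score <= -2.5 versus above).
sens_osteoporotic = 60 / 62
sens_nonosteoporotic = 39 / 47

for name, value in lumbar.items():
    print(f"{name}: {value:.1%}")
print(f"osteoporotic sensitivity: {sens_osteoporotic:.1%}")
print(f"nonosteoporotic sensitivity: {sens_nonosteoporotic:.1%}")
```

The counts are internally consistent: 141 fractured plus 689 normal vertebrae give the 830 total, and 129 plus 644 correct calls give the 773 reported for accuracy.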

CONCLUSION

The artificial intelligence model detected vertebral fractures on plain lateral radiographs with high accuracy, sensitivity, and specificity, especially for osteoporotic lumbar vertebral fractures (Genant Grades 2 and 3). The rapid reporting of results using this artificial intelligence model may improve the efficiency of diagnosing vertebral fractures. The testing model is available at http://140.113.114.104/vght_demo/corr/. One or multiple plain lateral radiographs of the spine in the Digital Imaging and Communications in Medicine format can be uploaded to see the performance of the artificial intelligence model.

LEVEL OF EVIDENCE

Level II, diagnostic study.
