Ground truth generalizability affects performance of the artificial intelligence model in automated vertebral fracture detection on plain lateral radiographs of the spine.

Authors

Chou Po-Hsin, Jou Tony Hong-Ting, Wu Hung-Ta Hondar, Yao Yu-Cheng, Lin Hsi-Hsien, Chang Ming-Chau, Wang Shih-Tien, Lu Henry Horng-Shing, Chen Hung-Hsun

Affiliations

School of Medicine, National Yang Ming Chiao Tung University, Taipei, Taiwan; Department of Orthopedics and Traumatology, Taipei Veterans General Hospital, Taipei, Taiwan.

School of Medicine, National Yang Ming Chiao Tung University, Taipei, Taiwan.

Publication information

Spine J. 2022 Apr;22(4):511-523. doi: 10.1016/j.spinee.2021.10.020. Epub 2021 Nov 1.

Abstract

BACKGROUND CONTEXT

Computer-aided diagnosis with artificial intelligence (AI) has been used clinically, and ground truth generalizability is important for AI performance in medical image analyses. An AI model trained on one specific group of older adults (aged ≥60) has not yet been shown to work equally well in a younger adult group (aged 18-59).

PURPOSE

To compare the performance of the developed ensemble AI model, trained with ground truth from patients aged 60 years or older, in identifying vertebral fractures (VFs) on plain lateral radiographs of the spine (PLRS) between younger and older adult populations.

STUDY DESIGN/SETTING

Retrospective analysis of PLRS in a single medical institution.

OUTCOME MEASURES

Accuracy, sensitivity, specificity, and interobserver reliability (kappa value) were used to compare diagnostic performance of the AI model and subspecialists' consensus between the two groups.
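The four outcome measures can all be computed from a per-vertebra 2x2 table of AI output against the subspecialists' consensus. The following is a minimal sketch, assuming a binary fractured/normal label per vertebra; the `Confusion` class, the function names, and the toy counts are illustrative and do not come from the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Per-vertebra 2x2 table (hypothetical helper, not from the study)."""
    tp: int  # fractured vertebrae flagged as fractured
    fn: int  # fractured vertebrae the model missed
    fp: int  # normal vertebrae flagged as fractured
    tn: int  # normal vertebrae correctly left unflagged

    @property
    def total(self) -> int:
        return self.tp + self.fn + self.fp + self.tn


def accuracy(c: Confusion) -> float:
    return (c.tp + c.tn) / c.total


def sensitivity(c: Confusion) -> float:
    return c.tp / (c.tp + c.fn)


def specificity(c: Confusion) -> float:
    return c.tn / (c.tn + c.fp)


def cohen_kappa(c: Confusion) -> float:
    """Agreement between model and reference, corrected for chance."""
    observed = accuracy(c)
    # Chance agreement from the row and column marginals.
    both_pos = ((c.tp + c.fn) / c.total) * ((c.tp + c.fp) / c.total)
    both_neg = ((c.fp + c.tn) / c.total) * ((c.fn + c.tn) / c.total)
    expected = both_pos + both_neg
    return (observed - expected) / (1 - expected)


# Toy counts chosen only to make the arithmetic easy to follow.
toy = Confusion(tp=40, fn=10, fp=5, tn=45)
print(accuracy(toy), sensitivity(toy), specificity(toy), cohen_kappa(toy))
```

Kappa is reported alongside accuracy because normal vertebrae far outnumber fractured ones in this setting, so raw agreement alone can look high even when many fractures are missed.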

METHODS

Between January 2016 and December 2018, the ground truth of 941 patients (one PLRS per person) aged 60 years and older, comprising 1101 VFs and 6358 normal vertebrae, was used to set up the AI model. The framework of the developed AI model includes object detection with You Only Look Once Version 3 (YOLOv3) at the T0-L5 levels in the PLRS, data preprocessing with image-size and quality processing, and an AI ensemble model (ResNet34, DenseNet121, and DenseNet201) for identifying or grading VFs. The reported overall accuracy, sensitivity and specificity were 92%, 91% and 93%, respectively, and external validation was also performed. Thereafter, patients diagnosed with VFs and treated in our institution between October 2019 and August 2020 formed the study group regardless of age. In total, 258 patients (339 VFs and 1725 normal vertebrae) in the older adult population (mean age 78±10.4; range, 60-106) were enrolled. In the younger adult population (mean age 36±9.43; range, 20-49), 106 patients (120 VFs and 728 normal vertebrae) were enrolled. After VFs were identified and graded with the Genant method by consensus between two subspecialists, the human-labeled VFs in each PLRS were defined as the testing dataset. The corresponding CT or MRI scan was used for labeling in the PLRS. The bootstrap method was applied to the testing dataset.
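The abstract does not state how the bootstrap was configured (resampling unit, number of replicates, or interval type). As one possible reading, the sketch below draws vertebrae with replacement and takes a percentile interval for accuracy; `bootstrap_ci`, its defaults, and the toy labels are all assumptions made for illustration.

```python
import random


def accuracy(labels, preds):
    """Fraction of vertebrae where the prediction matches the reference label."""
    return sum(l == p for l, p in zip(labels, preds)) / len(labels)


def bootstrap_ci(labels, preds, metric, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for `metric` (hypothetical helper).

    Assumption: each vertebra is resampled independently. Vertebrae from the
    same radiograph are correlated, so resampling whole patients may be the
    more defensible choice in practice.
    """
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        stats.append(metric([labels[i] for i in idx], [preds[i] for i in idx]))
    stats.sort()
    lo_idx = max(0, round(alpha / 2 * n_resamples))
    hi_idx = min(n_resamples - 1, round((1 - alpha / 2) * n_resamples) - 1)
    return stats[lo_idx], stats[hi_idx]


# Toy data: 20 fractured (1) and 80 normal (0) vertebrae, 10 errors in total.
labels = [1] * 20 + [0] * 80
preds = [1] * 14 + [0] * 6 + [1] * 4 + [0] * 76
point = accuracy(labels, preds)
low, high = bootstrap_ci(labels, preds, accuracy)
print(f"accuracy {point:.2f}, 95% CI {low:.2f}-{high:.2f}")
```

The same function accepts any metric with the signature `metric(labels, preds)`, so sensitivity or specificity can be passed in place of `accuracy`.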

RESULTS

The model is available for clinical application, with images in Digital Imaging and Communications in Medicine (DICOM) format uploaded directly (available at: http://140.113.114.104/vght_demo/svf-model (grading) and http://140.113.114.104/vght_demo/svf-model2 (labeling)). Overall accuracy, sensitivity and specificity in the older adult population were 93.36% (95% CI 93.34%-93.38%), 88.97% (95% CI 88.59%-88.99%) and 94.26% (95% CI 94.23%-94.29%), respectively. Overall accuracy, sensitivity and specificity in the younger adult population were 93.75% (95% CI 93.70%-93.80%), 65.00% (95% CI 64.33%-65.67%) and 98.49% (95% CI 98.45%-98.52%), respectively. Accuracy reached 100% in VF grading once the VFs were labeled accurately. A unique pattern of limbus-like VFs (n=43, 35.8%) was found only in the younger adult population. When limbus-like VFs were excluded from the dataset, accuracy increased from 93.75% (95% CI 93.70%-93.80%) to 95.78% (95% CI 95.73%-95.82%), sensitivity increased from 65.00% (95% CI 64.33%-65.67%) to 70.13% (95% CI 68.98%-71.27%), and specificity remained unchanged at 98.49% (95% CI 98.45%-98.52%). The main causes of false negative results in older adults were patients' lung markings, diaphragm or bowel gas (37%, n=14), followed by type I fracture (29%, n=11). The main causes of false negatives in younger adults were limbus-like VFs (45%, n=19), followed by type I fracture (26%, n=11). The overall kappa values between AI discrimination and subspecialists' consensus in the older and younger adult populations were 0.77 (95% CI, 0.733-0.805) and 0.72 (95% CI, 0.6524-0.80), respectively.
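The younger-cohort percentages can be traced back to whole-vertebra counts. The abstract does not print a confusion matrix, so the counts below are inferred from the reported totals (120 VFs, 728 normal vertebrae) and rates; the sketch also assumes that the 19 limbus-like false negatives are the only limbus-like VFs the model missed, which is what makes the post-exclusion figures come out.

```python
# Reported totals for the younger adult population.
n_vf, n_normal = 120, 728

# Inferred counts (not stated in the abstract): round the reported rates.
tp = round(0.65 * n_vf)          # 65.00% sensitivity -> 78 detected VFs
fn = n_vf - tp                   # 42 missed VFs
tn = round(0.9849 * n_normal)    # 98.49% specificity -> 717 true negatives
fp = n_normal - tn               # 11 false positives

acc = 100 * (tp + tn) / (n_vf + n_normal)
print(f"accuracy {acc:.2f}%")                                 # accuracy 93.75%
print(f"limbus-like share of misses {100 * 19 / fn:.0f}%")    # 45%
print(f"type I share of misses {100 * 11 / fn:.0f}%")         # 26%

# Exclude the 43 limbus-like VFs: 19 were misses, so 24 were detections.
n_limbus, missed_limbus = 43, 19
tp_excl = tp - (n_limbus - missed_limbus)   # 54
n_vf_excl = n_vf - n_limbus                 # 77

sens_excl = 100 * tp_excl / n_vf_excl
acc_excl = 100 * (tp_excl + tn) / (n_vf_excl + n_normal)
print(f"sensitivity without limbus-like VFs {sens_excl:.2f}%")  # 70.13%
print(f"accuracy without limbus-like VFs {acc_excl:.2f}%")      # 95.78%
```

Specificity does not move under this exclusion because only fractured vertebrae are removed, so the normal-vertebra counts (717 of 728) are untouched.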

CONCLUSIONS

The developed VF-identifying AI ensemble model, based on ground truth from older adults, performed better at identifying VFs in older adults and at identifying non-fractured thoracic and lumbar vertebrae in younger adults. Different age distributions may carry different disease patterns, which implicates the effect of ground truth generalizability on AI model performance.
