D. W. G. Langerhuizen, S. J. Janssen, G. M. M. J. Kerkhoffs, Department of Orthopaedic Surgery, Amsterdam Movement Sciences (AMS), Amsterdam University Medical Centre, Amsterdam, The Netherlands.
A. E. J. Bulstra, R. L. Jaarsma, J. N. Doornberg, Flinders University, Department of Orthopaedic & Trauma Surgery, Flinders Medical Centre, Adelaide, Australia.
Clin Orthop Relat Res. 2020 Nov;478(11):2653-2659. doi: 10.1097/CORR.0000000000001318.
BACKGROUND: Preliminary experience suggests that deep learning algorithms are nearly as good as humans at detecting common, displaced, and relatively obvious fractures (such as distal radius or hip fractures). However, it is not known whether this is also true for subtle or relatively nondisplaced fractures that are often difficult to see on radiographs, such as scaphoid fractures.
QUESTIONS/PURPOSES: (1) What is the diagnostic accuracy, sensitivity, and specificity of a deep learning algorithm in detecting radiographically visible and occult scaphoid fractures using four radiographic imaging views? (2) Does adding patient demographic information (age and sex) improve the diagnostic performance of the deep learning algorithm? (3) Do orthopaedic surgeons have better diagnostic accuracy, sensitivity, and specificity than the deep learning algorithm? (4) What is the interobserver reliability among five human observers, and what is the reliability between the human consensus and the deep learning algorithm?
METHODS: We retrospectively searched the picture archiving and communication system (PACS) for patients with a radiographic scaphoid series until we had identified 300 patients: 150 with fractures (127 visible on radiographs and 23 visible only on MRI) and 150 without fractures, each with a corresponding CT or MRI as the reference standard for fracture diagnosis. At our institution, an MRI is usually ordered for patients with scaphoid tenderness and normal radiographs, and a CT for patients with a radiographically visible scaphoid fracture. We used a deep learning algorithm (a convolutional neural network [CNN]) for automated fracture detection on radiographs. Deep learning, an advanced subset of artificial intelligence, combines layers of artificial neurons loosely modeled on biological neurons. CNNs, deep learning architectures inspired by the interconnected neurons of the human brain, are the type most commonly used for image analysis. The area under the receiver operating characteristic curve (AUC) was used to evaluate the algorithm's diagnostic performance. An AUC of 1.0 indicates perfect prediction, whereas 0.5 indicates that a prediction is no better than a coin flip. The probability of a scaphoid fracture generated by the CNN, sex, and age were included in a multivariable logistic regression to determine whether this would improve the algorithm's diagnostic performance. Diagnostic performance characteristics (accuracy, sensitivity, and specificity) and reliability (kappa statistic) were calculated for the CNN and for the five orthopaedic surgeon observers in our study.
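As a minimal illustration of the type of analysis described above (not the authors' actual pipeline), the following Python sketch shows how a CNN-derived fracture probability could be combined with age and sex in a multivariable logistic regression and evaluated with the AUC. All variable names and data are hypothetical.

```python
# Hypothetical sketch: combine a CNN fracture probability with age and sex
# in a multivariable logistic regression and evaluate with AUC.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Hypothetical per-patient data (not from the study)
rng = np.random.default_rng(0)
n = 300
cnn_prob = rng.uniform(0, 1, n)                      # CNN-generated fracture probability
age = rng.integers(18, 80, n)                        # patient age in years
sex = rng.integers(0, 2, n)                          # 0 = female, 1 = male
y = (rng.uniform(0, 1, n) < cnn_prob).astype(int)    # reference-standard label (CT/MRI)

# AUC of the CNN probability alone
auc_cnn = roc_auc_score(y, cnn_prob)

# Multivariable logistic regression: CNN probability + demographics
X = np.column_stack([cnn_prob, age, sex])
model = LogisticRegression().fit(X, y)
auc_combined = roc_auc_score(y, model.predict_proba(X)[:, 1])

print(f"AUC (CNN alone): {auc_cnn:.2f}")
print(f"AUC (CNN + age + sex): {auc_combined:.2f}")
```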
RESULTS: The algorithm had an AUC of 0.77 (95% CI 0.66 to 0.85), 72% accuracy (95% CI 60% to 84%), 84% sensitivity (95% CI 74% to 94%), and 60% specificity (95% CI 46% to 74%). Adding age and sex did not improve diagnostic performance (AUC 0.81 [95% CI 0.73 to 0.89]). Orthopaedic surgeons had better specificity (93% [95% CI 93% to 99%]; p < 0.01), while accuracy (84% [95% CI 81% to 88%]) and sensitivity (76% [95% CI 70% to 82%]; p = 0.29) did not differ between the algorithm and the human observers. Although the CNN was less specific in diagnosing relatively obvious fractures, it detected five of six occult scaphoid fractures that were missed by all human observers. The interobserver reliability among the five surgeons was substantial (Fleiss' kappa = 0.74 [95% CI 0.66 to 0.83]), but the reliability between the algorithm and the human observers was only fair (Cohen's kappa = 0.34 [95% CI 0.17 to 0.50]).
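To illustrate the reliability statistics reported above, here is a minimal sketch (with made-up binary ratings, not the study data) of computing Cohen's kappa between the algorithm and the human consensus and Fleiss' kappa among multiple observers, assuming scikit-learn and statsmodels are available.

```python
# Hypothetical sketch: agreement statistics of the kind reported in the abstract.
import numpy as np
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.inter_rater import fleiss_kappa, aggregate_raters

# Made-up binary ratings (1 = fracture, 0 = no fracture) for 10 cases
algorithm = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 1])
consensus = np.array([1, 0, 1, 0, 0, 1, 0, 0, 1, 0])

# Cohen's kappa: algorithm vs. human consensus
print("Cohen's kappa:", cohen_kappa_score(algorithm, consensus))

# Made-up ratings from five observers (rows = cases, columns = observers)
ratings = np.array([
    [1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0],
    [1, 1, 0, 1, 1],
    [0, 0, 0, 1, 0],
    [1, 1, 1, 1, 0],
])

# Fleiss' kappa: agreement among the five observers
table, _ = aggregate_raters(ratings)  # per-case counts for each rating category
print("Fleiss' kappa:", fleiss_kappa(table))
```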
CONCLUSION: Initial experience with our deep learning algorithm suggests that it has trouble identifying scaphoid fractures that are obvious to human observers. The CNN made 13 false-positive suggestions that all five surgeons correctly identified as non-fractures. Research with larger datasets (preferably also including information from physical examination) or further algorithm refinement is merited.
LEVEL OF EVIDENCE: Level III, diagnostic study.