Department of Orthopedic Surgery, St. Vincent's Hospital, College of Medicine, the Catholic University of Korea, Seoul, Republic of Korea.
Department of Medical Informatics, College of Medicine, the Catholic University of Korea, Seoul, Republic of Korea.
Clin Orthop Relat Res. 2023 Nov 1;481(11):2247-2256. doi: 10.1097/CORR.0000000000002771. Epub 2023 Aug 23.
BACKGROUND: Improvement in survival in patients with advanced cancer is accompanied by an increased probability of bone metastasis and related pathologic fractures (especially in the proximal femur). The few systems proposed and used to diagnose impending fractures owing to metastasis and to ultimately prevent future fractures have practical limitations; thus, novel screening tools are essential. A CT scan of the abdomen and pelvis is a standard modality for staging and follow-up in patients with cancer, and radiologic assessments of the proximal femur are possible with CT-based digitally reconstructed radiographs. Deep-learning models, such as convolutional neural networks (CNNs), may be able to predict pathologic fractures from digitally reconstructed radiographs, but to our knowledge, they have not been tested for this application.
QUESTIONS/PURPOSES: (1) How accurate is a CNN model for predicting a pathologic fracture in a proximal femur with metastasis using digitally reconstructed radiographs generated from abdomen and pelvis CT images in patients with advanced cancer? (2) Do CNN models perform better than clinicians with varying backgrounds and experience levels in predicting a pathologic fracture on abdomen and pelvis CT images when neither has any knowledge of the patients' histories, except for metastasis in the proximal femur?
METHODS: A total of 392 patients received radiation treatment of the proximal femur at three hospitals from January 2011 to December 2021. The patients had 2945 CT scans of the abdomen and pelvis for systemic evaluation and follow-up in relation to their primary cancer. In 33% of the CT scans (974 of 2945), it was impossible to determine whether a pathologic fracture developed within 3 months after the CT image was acquired, and these were excluded. The remaining 1971 cases, with a mean age of 59 ± 12 years, were included in this study. A total of 47% (936 of 1971) were women. A pathologic fracture developed within 3 months after CT in 3% (60 of 1971) of cases; the other 1911 cases had no pathologic fracture within 3 months after CT. The mean age of the cases in the former and latter groups was 64 ± 11 years and 59 ± 12 years, respectively, and 32% (19 of 60) and 53% (1016 of 1911) of cases, respectively, were female. Digitally reconstructed radiographs were generated with perspective projections of three-dimensional CT volumes onto two-dimensional planes. Then, 1557 images from one hospital were used as the training set. To verify that the deep-learning models could operate consistently in hospitals with a different medical environment, 414 images from the other hospitals were used for external validation. Data augmentation, an established way to improve the performance of deep-learning models, increased the number of images in the group without a pathologic fracture within 3 months after CT from 1911 to 22,932 and in the group with a fracture from 60 to 720. Three CNNs (VGG16, ResNet50, and DenseNet121) were fine-tuned using digitally reconstructed radiographs. For performance measures, the area under the receiver operating characteristic curve, accuracy, sensitivity, specificity, precision, and F1 score were determined.
The area under the receiver operating characteristic curve was the primary measure used to evaluate the three CNN models, and the optimal accuracy, sensitivity, and specificity were calculated using the Youden J statistic. Accuracy refers to the proportion of all cases, with or without a pathologic fracture within 3 months after CT, whose outcome was correctly predicted by the CNN model. Sensitivity and specificity represent the proportion of correctly predicted cases among those with and without a pathologic fracture within 3 months after CT, respectively. Precision is the proportion of predicted fractures that were actual fractures, so it is high when the model produces few false-positives. The F1 score is the harmonic mean of sensitivity and precision, which have a tradeoff relationship. Gradient-weighted class activation mapping images were created to check whether the CNN model correctly focused on potential pathologic fracture regions. The performance of the best CNN model was then compared with that of clinicians.
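The performance measures defined above can be illustrated with a short Python sketch. This is not the authors' code (their implementation is in the repository cited in this abstract); it is a minimal, dependency-free illustration that assumes binary labels (1 = fracture within 3 months) and model scores between 0 and 1. The function names `metrics_at` and `best_youden_threshold` and the toy data are hypothetical.

```python
from typing import Dict, List, Tuple


def _safe_div(numerator: float, denominator: float) -> float:
    """Return 0.0 instead of raising when a denominator is zero."""
    return numerator / denominator if denominator else 0.0


def metrics_at(y_true: List[int], y_score: List[float],
               threshold: float) -> Dict[str, float]:
    """Compute the measures when scores >= threshold are called 'fracture'."""
    tp = fp = tn = fn = 0
    for label, score in zip(y_true, y_score):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        else:
            fn += 1
    sensitivity = _safe_div(tp, tp + fn)   # correct among true fractures
    specificity = _safe_div(tn, tn + fp)   # correct among non-fractures
    precision = _safe_div(tp, tp + fp)     # true fractures among predicted
    f1 = _safe_div(2 * precision * sensitivity, precision + sensitivity)
    return {
        "accuracy": _safe_div(tp + tn, tp + fp + tn + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
        "youden_j": sensitivity + specificity - 1.0,
    }


def best_youden_threshold(y_true: List[int],
                          y_score: List[float]) -> Tuple[float, Dict[str, float]]:
    """Pick the observed score that maximizes Youden J = sens + spec - 1."""
    candidates = sorted(set(y_score))
    best = max(candidates,
               key=lambda t: metrics_at(y_true, y_score, t)["youden_j"])
    return best, metrics_at(y_true, y_score, best)


# Toy data (hypothetical, not from the study): 3 fractures among 10 cases.
labels = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.05, 0.10, 0.15, 0.20, 0.30, 0.35, 0.40, 0.60, 0.70, 0.90]
threshold, measures = best_youden_threshold(labels, scores)
print(threshold, measures)
```

With a fracture prevalence of about 3%, as in this cohort, accuracy is dominated by the non-fracture cases, which is one reason to read it alongside sensitivity, precision, and the F1 score.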
RESULTS: DenseNet121 showed the best performance in identifying pathologic fractures; the area under the receiver operating characteristic curve for DenseNet121 was larger than those for VGG16 (0.77 ± 0.07 [95% CI 0.75 to 0.79] versus 0.71 ± 0.08 [95% CI 0.69 to 0.73]; p = 0.001) and ResNet50 (0.77 ± 0.07 [95% CI 0.75 to 0.79] versus 0.72 ± 0.09 [95% CI 0.69 to 0.74]; p = 0.001). Specifically, DenseNet121 scored the highest in sensitivity (0.22 ± 0.07 [95% CI 0.20 to 0.24]), precision (0.72 ± 0.19 [95% CI 0.67 to 0.77]), and F1 score (0.34 ± 0.10 [95% CI 0.31 to 0.37]), and it focused accurately on the region with the expected pathologic fracture. Further, DenseNet121 mispredicted cases without a pathologic fracture less often than cases with a fracture, and it outperformed clinicians in terms of specificity (0.98 ± 0.01 [95% CI 0.98 to 0.99] versus 0.86 ± 0.09 [95% CI 0.81 to 0.91]; p = 0.01), precision (0.72 ± 0.19 [95% CI 0.67 to 0.77] versus 0.11 ± 0.10 [95% CI 0.05 to 0.17]; p = 0.0001), and F1 score (0.34 ± 0.10 [95% CI 0.31 to 0.37] versus 0.17 ± 0.15 [95% CI 0.08 to 0.26]; p = 0.0001).
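As a consistency check on the reported DenseNet121 values, the harmonic-mean definition of the F1 score can be applied to the mean sensitivity and mean precision. This is only an approximation: the reported figures are means with standard deviations, and the mean of several F1 scores need not equal the F1 score of the mean sensitivity and precision.

```latex
F_1 = \frac{2 \cdot \text{precision} \cdot \text{sensitivity}}
           {\text{precision} + \text{sensitivity}}
    = \frac{2 \times 0.72 \times 0.22}{0.72 + 0.22}
    = \frac{0.3168}{0.94}
    \approx 0.34
```

The result agrees with the reported F1 score of 0.34 and shows how the low sensitivity (0.22) holds the F1 score down despite the relatively high precision (0.72).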
CONCLUSION: CNN models may be able to accurately predict impending pathologic fractures that clinicians may not anticipate, using digitally reconstructed radiographs generated from abdomen and pelvis CT images; this could assist medical, radiation, and orthopaedic oncologists in clinical practice. To achieve better performance, ensemble-learning models using knowledge of the patients' histories should be developed and validated. The code for our model is publicly available online (https://github.com/taehoonko/CNN_path_fx_prediction).
LEVEL OF EVIDENCE: Level III, diagnostic study.