Wang Cheng-Tzu, Huang Brady, Thogiti Nagaraju, Zhu Wan-Xuan, Chang Chih-Hung, Pao Jwo-Luen, Lai Feipei
Department of Orthopaedic Surgery, Far Eastern Memorial Hospital, New Taipei City, Taiwan.
Graduate Institute of Networking and Multimedia, National Taiwan University, Taipei, Taiwan.
J Orthop Res. 2023 Apr;41(4):737-746. doi: 10.1002/jor.25415. Epub 2022 Jul 21.
This study aimed to evaluate the performance of a deep-learning model for grading knee osteoarthritis on real-life knee radiographs using the Kellgren-Lawrence system. A deep convolutional neural network was trained on 8964 knee radiographs from the Osteoarthritis Initiative (OAI), of which 962 images formed the testing set. Another 246 knee radiographs from the Far Eastern Memorial Hospital were used for external validation. The OAI testing set and the external validation images were evaluated by three experienced specialists: two orthopedic surgeons and a musculoskeletal radiologist. Accuracy, interobserver agreement, F1 score, precision, recall, specificity, and the ability to identify surgical candidates were used to compare the performance of the model with that of the specialists. Attention maps illustrated the interpretability of the model's classifications. The model achieved 78% accuracy and consistent interobserver agreement on both the OAI (model-surgeon 1 κ = 0.80, model-surgeon 2 κ = 0.84, model-radiologist κ = 0.86) and external validation (model-surgeon 1 κ = 0.81, model-surgeon 2 κ = 0.82, model-radiologist κ = 0.83) images. Interobserver agreement was markedly lower on the images the model misclassified (model-surgeon 1 κ = 0.57, model-surgeon 2 κ = 0.47, model-radiologist κ = 0.65). The model outperformed the specialists in identifying surgical candidates (Kellgren-Lawrence grades 3 and 4), with an F1 score of 0.923. Our model not only matched the specialists in identifying surgical candidates but also performed consistently on open-database and real-life radiographs. We believe the disagreement over the misclassified knee osteoarthritis images reflects their significantly lower interobserver agreement.
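The two headline metrics of the abstract, unweighted Cohen's kappa for model-specialist agreement and a binary F1 score for surgical-candidate detection (Kellgren-Lawrence grade ≥ 3), can be computed as follows. This is a minimal illustrative sketch, not the authors' code; the function names and the toy grade lists are hypothetical.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa between two raters' label sequences."""
    n = len(rater_a)
    # Observed agreement: fraction of cases where the raters assign the same grade.
    po = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected chance agreement from each rater's marginal grade frequencies.
    ca, cb = Counter(rater_a), Counter(rater_b)
    pe = sum(ca[c] * cb[c] for c in ca) / (n * n)
    return (po - pe) / (1 - pe)

def f1_surgical(true_grades, pred_grades, threshold=3):
    """Binary F1 for identifying surgical candidates (KL grade >= threshold)."""
    t = [g >= threshold for g in true_grades]
    p = [g >= threshold for g in pred_grades]
    tp = sum(a and b for a, b in zip(t, p))
    fp = sum((not a) and b for a, b in zip(t, p))
    fn = sum(a and (not b) for a, b in zip(t, p))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical KL grades (0-4) for a handful of radiographs.
model_grades = [0, 1, 2, 3, 4, 0, 1]
rater_grades = [0, 1, 2, 3, 4, 1, 1]
print(cohens_kappa(model_grades, rater_grades))
print(f1_surgical(rater_grades, model_grades))
```

Note that collapsing the five KL grades to a binary surgical/non-surgical decision at grade 3 is exactly why the model's F1 of 0.923 can coexist with a lower 78% five-class accuracy: adjacent-grade confusions below or above the threshold do not affect the binary score.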