Gültekin Onur, Inoue Jumpei, Yilmaz Baris, Cerci Mehmet Halis, Kilinc Bekir Eray, Yilmaz Hüsnü, Prill Robert, Kayaalp Mahmut Enes
Department of Orthopaedics and Traumatology, Istanbul Fatih Sultan Mehmet Training and Research Hospital, University of Health Sciences, Istanbul, Turkey.
Department of Orthopaedic Surgery, Nagoya Tokushukai General Hospital, Kasugai, Aichi, Japan.
Knee Surg Sports Traumatol Arthrosc. 2025 Jun 1. doi: 10.1002/ksa.12711.
This study compares ChatGPT-4o, equipped with its deep research feature, and DeepSeek R1, equipped with its DeepThink feature (both enabling real-time online data access), in generating responses to frequently asked questions (FAQs) about anterior cruciate ligament (ACL) surgery. The aim is to evaluate and compare their performance in terms of accuracy, clarity, completeness, consistency and readability for evidence-based patient education.
A list of ten FAQs about ACL surgery was compiled after reviewing the webpages of sports medicine fellowship institutions. These questions were posed to ChatGPT and DeepSeek in their research-enabled modes. Orthopaedic sports surgeons evaluated the responses for accuracy, clarity, completeness, and consistency using a 4-point Likert scale. Inter-rater reliability of the evaluations was assessed using intraclass correlation coefficients (ICCs). In addition, a readability analysis was conducted using the Flesch-Kincaid Grade Level (FKGL) and Flesch Reading Ease Score (FRES) metrics via an established online calculator to objectively measure textual complexity. Paired t-tests were used to compare the mean scores of the two models for each criterion, with significance set at p < 0.05.
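For context, FKGL and FRES are closed-form functions of sentence, word, and syllable counts. The sketch below shows the standard formulas; the abstract does not name the online calculator used, and the syllable counter here is a rough vowel-group heuristic, so exact scores may differ from the study's.

```python
import re

def count_syllables(word: str) -> int:
    # Rough heuristic: one syllable per vowel group. Published calculators
    # use more elaborate rules, so counts are approximate.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def readability(text: str) -> tuple[float, float]:
    """Return (FKGL, FRES) for a plain-text passage."""
    words = re.findall(r"[A-Za-z']+", text)
    if not words:
        return 0.0, 0.0
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    syllables = sum(count_syllables(w) for w in words)
    wps = len(words) / sentences   # mean words per sentence
    spw = syllables / len(words)   # mean syllables per word
    fkgl = 0.39 * wps + 11.8 * spw - 15.59     # Flesch-Kincaid Grade Level
    fres = 206.835 - 1.015 * wps - 84.6 * spw  # Flesch Reading Ease Score
    return fkgl, fres
```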
Both models demonstrated high accuracy (mean scores of 3.9/4) and consistency (4/4). Significant differences were observed in clarity and completeness: ChatGPT provided more comprehensive responses (mean completeness 4.0 vs. 3.2, p < 0.001), while DeepSeek's answers were clearer and more accessible to laypersons (mean clarity 3.9 vs. 3.0, p < 0.001). DeepSeek had lower FKGL (8.9 vs. 14.2, p < 0.001) and higher FRES (61.3 vs. 32.7, p < 0.001), indicating greater ease of reading for a general audience. ICC analysis indicated substantial inter-rater agreement (composite ICC = 0.80).
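To illustrate the comparison reported above, a paired t-test matches the two models' scores question by question, as in the sketch below; the ratings shown are hypothetical placeholders, since per-question data are not given in the abstract.

```python
from scipy import stats

# Hypothetical per-question clarity ratings on the 4-point Likert scale;
# the study's raw per-question scores are not reported in the abstract.
chatgpt  = [3.0, 3.2, 2.8, 3.1, 3.0, 2.9, 3.1, 3.0, 2.9, 3.0]
deepseek = [3.9, 4.0, 3.8, 3.9, 4.0, 3.9, 3.8, 3.9, 4.0, 3.9]

# Paired t-test: each of the ten FAQs yields one score per model,
# so observations are naturally paired by question.
t_stat, p_value = stats.ttest_rel(chatgpt, deepseek)
print(f"t = {t_stat:.2f}, p = {p_value:.4g}")
```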
ChatGPT-4o, leveraging its deep research feature, and DeepSeek R1, utilizing its DeepThink feature, both deliver high-quality, accurate information for ACL surgery patient education. While ChatGPT excels in comprehensiveness, DeepSeek outperforms in clarity and readability, suggesting that integrating the strengths of both models could optimize patient education outcomes.
Level V.