Large language models' performances regarding common patient questions about osteoarthritis: A comparative analysis of ChatGPT-3.5, ChatGPT-4.0, and Perplexity.

Author information

Cao Mingde, Wang Qianwen, Zhang Xueyou, Liang Zuru, Qiu Jihong, Yung Patrick Shu-Hang, Ong Michael Tim-Yun

Affiliations

Department of Orthopaedics and Traumatology, Faculty of Medicine, The Chinese University of Hong Kong, Hong Kong 999077, China; Center for Neuromusculoskeletal Restorative Medicine (CNRM), The Chinese University of Hong Kong, Hong Kong 999077, China.

Department of Orthopaedics and Traumatology, Faculty of Medicine, The Chinese University of Hong Kong, Hong Kong 999077, China.

Publication information

J Sport Health Sci. 2024 Nov 28;14:101016. doi: 10.1016/j.jshs.2024.101016.

DOI: 10.1016/j.jshs.2024.101016
PMID: 39613294

Abstract

BACKGROUND

Large language models (LLMs) have gained much attention and have partly replaced conventional search engines as a popular channel for obtaining information because of their contextually relevant responses. Osteoarthritis (OA) is a common topic among musculoskeletal disorders, and patients often seek information about it online. Our study evaluated the ability of 3 LLMs (ChatGPT-3.5, ChatGPT-4.0, and Perplexity) to accurately answer common OA-related queries.

METHODS

We grouped 25 frequently asked questions about OA into 6 themes (pathogenesis, risk factors, clinical presentation, diagnosis, treatment and prevention, and prognosis). Three consultant-level orthopedic specialists independently rated the LLMs' replies on a 4-point accuracy scale. The final rating for each response was determined by majority consensus. Responses classified as "satisfactory" were then evaluated for comprehensiveness on a 5-point scale.
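
The majority-consensus step described above can be sketched as follows. This is a minimal illustration, not the authors' procedure: the function name `majority_rating` is hypothetical, and the abstract does not say how a three-way disagreement was resolved, so this sketch simply returns `None` in that case.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def majority_rating(ratings: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the rating chosen by more than half of the raters, else None.

    Assumption: a response with no strict majority is left unresolved (None);
    the source does not describe a tie-breaking rule.
    """
    if not ratings:
        return None
    # most_common(1) yields the single most frequent rating and its count
    label, count = Counter(ratings).most_common(1)[0]
    return label if count > len(ratings) / 2 else None


# Two of three raters agree, so their rating becomes the final one.
print(majority_rating(["excellent", "excellent", "satisfactory"]))  # excellent
# All three raters disagree (numeric scores used here), so no majority exists.
print(majority_rating([4, 3, 2]))  # None
```

With three raters, a strict majority means at least two matching ratings, so only a complete three-way split is left unresolved.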

RESULTS

ChatGPT-4.0 demonstrated superior accuracy, with 64% of responses rated as "excellent", compared to 40% for ChatGPT-3.5 and 28% for Perplexity (Pearson's χ² test with Fisher's exact test, all p < 0.001). All 3 LLM-chatbots had high mean comprehensiveness ratings (Perplexity = 3.88; ChatGPT-4.0 = 4.56; ChatGPT-3.5 = 3.96, out of a maximum score of 5). The LLM-chatbots performed reliably across domains, except for "treatment and prevention". Even in that domain, ChatGPT-4.0 still outperformed ChatGPT-3.5 and Perplexity, garnering 53.8% "excellent" ratings (Pearson's χ² test with Fisher's exact test, all p < 0.001).
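
As a quick arithmetic check, the overall "excellent" percentages can be converted back to response counts. This sketch assumes each chatbot answered each of the 25 questions exactly once, which the abstract implies but does not state; the variable names are illustrative only.

```python
# Assumption: 25 responses per chatbot, one for each frequently asked question.
N_QUESTIONS = 25

# Overall share of responses rated "excellent", as reported in the abstract.
excellent_pct = {"ChatGPT-4.0": 64, "ChatGPT-3.5": 40, "Perplexity": 28}

# Convert each percentage to a whole number of responses.
excellent_counts = {
    model: round(pct * N_QUESTIONS / 100) for model, pct in excellent_pct.items()
}

for model, count in excellent_counts.items():
    print(f"{model}: {count}/{N_QUESTIONS} responses rated excellent")
```

Under that assumption, the percentages correspond to 16, 10, and 7 responses out of 25, respectively.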

CONCLUSION

Our findings underscore the potential of LLMs, specifically ChatGPT-4.0 and Perplexity, to deliver accurate and thorough responses to OA-related queries. Targeted correction of specific misconceptions to improve the accuracy of LLMs remains crucial.
