

Are Large Language Model-Based Chatbots Effective in Providing Reliable Medical Advice for Achilles Tendinopathy? An International Multispecialist Evaluation.

Author Information

Liang Zuru, Wang Ming, Abdelatif Nasef Mohamed Nasef, Arunakul Marut, Borbon Carlo Angelo V, Chong Keen Wai, Chow Man Wai, Hua Yinghui, Oji David, Ahumada Ximena, Siu Kwai Ming, Tan Ken Jin, Tanaka Yasuhito, Taniguchi Akira, Yung Patrick Shu-Hang, Ling Samuel Ka-Kin

Affiliations

Department of Orthopaedics and Traumatology, The Chinese University of Hong Kong, Hong Kong, SAR, China.

DrNasef OrthoClinic for Foot and Ankle Orthopedic Disorders, Cairo, Egypt.

Publication Information

Orthop J Sports Med. 2025 Apr 30;13(4):23259671251332596. doi: 10.1177/23259671251332596. eCollection 2025 Apr.

DOI:10.1177/23259671251332596
PMID:40322749
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12046157/
Abstract

BACKGROUND

Large language model (LLM)-based chatbots have shown potential in providing health information and patient education. However, the reliability of these chatbots in offering medical advice for specific conditions like Achilles tendinopathy remains uncertain. Mixed outcomes in the field of orthopaedics highlight the need for further examination of these chatbots' reliability.

HYPOTHESIS

Three leading LLM-based chatbots can provide accurate and complete responses to inquiries related to Achilles tendinopathy.

STUDY DESIGN

Cross-sectional study.

METHODS

Eighteen questions derived from the Dutch clinical guideline on Achilles tendinopathy were posed to 3 leading LLM-based chatbots: ChatGPT 4.0, Claude 2, and Gemini. The responses were incorporated into an online survey assessed by orthopaedic surgeons specializing in Achilles tendinopathy. Responses were evaluated using a 4-point scoring system, where 1 indicates unsatisfactory and 4 indicates excellent. The total scores for the 18 responses were aggregated for each rater and compared across the chatbots. The intraclass correlation coefficient was calculated to assess consistency among the raters' evaluations.
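The abstract does not state which ICC form the authors computed; a two-way random-effects, single-measure ICC(2,1) is a common choice for multiple raters scoring the same set of responses. A minimal sketch under that assumption (the ratings matrix below is illustrative, not study data):

```python
import numpy as np

def icc2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `ratings` is an (n_items x k_raters) matrix. Mean squares come from the
    standard two-way ANOVA decomposition of the ratings table."""
    ratings = np.asarray(ratings, dtype=float)
    n, k = ratings.shape
    grand = ratings.mean()
    row_means = ratings.mean(axis=1)   # per-item (response) means
    col_means = ratings.mean(axis=0)   # per-rater means
    ss_total = ((ratings - grand) ** 2).sum()
    ss_rows = k * ((row_means - grand) ** 2).sum()
    ss_cols = n * ((col_means - grand) ** 2).sum()
    ss_err = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )

# Illustrative only: 4 responses scored 1-4 by 2 raters.
example = [[1, 2], [2, 2], [3, 4], [4, 4]]
print(icc2_1(example))
```

With the study's design this would be run on an 18 x 13 matrix (18 responses, 13 specialists) per chatbot; values near the reported 0.42-0.58 range indicate poor-to-moderate agreement.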

RESULTS

Thirteen specialists from 9 diverse countries and regions participated. Analysis showed no significant difference in the mean total scores among the chatbots: ChatGPT (59.7 ± 5.5), Claude 2 (53.4 ± 9.7), and Gemini (53.6 ± 8.4). The proportions of unsatisfactory responses (score 1) were low and comparable across chatbots: 0.9% for ChatGPT 4.0, 3.4% for Claude 2, and 3.4% for Gemini. In terms of excellent responses (score 4), ChatGPT 4.0 outperformed the others, with 43.6% of the responses rated as excellent, significantly higher than Claude 2 at 27.4% and Gemini at 25.2% (P < .001 for both comparisons). Intraclass correlation coefficients indicated poor reliability for ChatGPT 4.0 (0.420) and moderate reliability for Claude 2 (0.522) and Gemini (0.575).
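The test behind the P < .001 comparison of excellent-response proportions is not named in the abstract. Assuming 13 raters x 18 questions = 234 ratings per chatbot, 43.6% and 27.4% correspond to roughly 102 and 64 excellent ratings; a pooled two-proportion z-test on those reconstructed counts (an assumption, not the authors' stated method) reproduces the significance:

```python
import math

def two_prop_z(x1, n1, x2, n2):
    """Two-sided two-proportion z-test with pooled standard error.

    Returns (z, p_value)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal tail via erfc.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Reconstructed counts (assumption): ChatGPT 4.0 ~102/234 "excellent"
# vs Claude 2 ~64/234.
z, p = two_prop_z(102, 234, 64, 234)
```

Under these assumed counts z is roughly 3.7, comfortably past the P < .001 threshold, which is consistent with the reported result.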

CONCLUSION

While LLM-based chatbots such as ChatGPT 4.0 can deliver high-quality responses to queries regarding Achilles tendinopathy, the inconsistency among specialist evaluations and the absence of standardized assessment criteria significantly challenge our ability to draw definitive conclusions. These issues underscore the need for a cautious and standardized approach when considering the integration of LLM-based chatbots into clinical settings.


Figures (PMC12046157):
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/45fe/12046157/f7040673279a/10.1177_23259671251332596-fig1.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/45fe/12046157/df9ec616dd4a/10.1177_23259671251332596-fig2.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/45fe/12046157/fd604a290b42/10.1177_23259671251332596-fig3.jpg

Similar Articles

1
Are Large Language Model-Based Chatbots Effective in Providing Reliable Medical Advice for Achilles Tendinopathy? An International Multispecialist Evaluation.
Orthop J Sports Med. 2025 Apr 30;13(4):23259671251332596. doi: 10.1177/23259671251332596. eCollection 2025 Apr.
2
Evaluating the Efficacy of Artificial Intelligence-Driven Chatbots in Addressing Queries on Vernal Conjunctivitis.
Cureus. 2025 Feb 26;17(2):e79688. doi: 10.7759/cureus.79688. eCollection 2025 Feb.
3
Evaluating the Quality and Readability of Generative Artificial Intelligence (AI) Chatbot Responses in the Management of Achilles Tendon Rupture.
Cureus. 2025 Jan 31;17(1):e78313. doi: 10.7759/cureus.78313. eCollection 2025 Jan.
4
Performance of Artificial Intelligence in Addressing Questions Regarding Management of Osteochondritis Dissecans.
Sports Health. 2025 Apr 1:19417381251326549. doi: 10.1177/19417381251326549.
5
Assessing the Capability of Large Language Model Chatbots in Generating Plain Language Summaries.
Cureus. 2025 Mar 21;17(3):e80976. doi: 10.7759/cureus.80976. eCollection 2025 Mar.
6
Performance of Large Language Models ChatGPT and Gemini on Workplace Management Questions in Radiology.
Diagnostics (Basel). 2025 Feb 19;15(4):497. doi: 10.3390/diagnostics15040497.
7
Accuracy of Prospective Assessments of 4 Large Language Model Chatbot Responses to Patient Questions About Emergency Care: Experimental Comparative Study.
J Med Internet Res. 2024 Nov 4;26:e60291. doi: 10.2196/60291.
8
Claude, ChatGPT, Copilot, and Gemini performance versus students in different topics of neuroscience.
Adv Physiol Educ. 2025 Jun 1;49(2):430-437. doi: 10.1152/advan.00093.2024. Epub 2025 Jan 17.
9
Evaluating Accuracy and Readability of Responses to Midlife Health Questions: A Comparative Analysis of Six Large Language Model Chatbots.
J Midlife Health. 2025 Jan-Mar;16(1):45-50. doi: 10.4103/jmh.jmh_182_24. Epub 2025 Apr 5.
10
[Evaluating the reliability and readability of chatbot responses as a patient information resource for the most common PET-CT examinations].
Rev Esp Med Nucl Imagen Mol (Engl Ed). 2025 Jan-Feb;44(1):500065. doi: 10.1016/j.remnie.2024.500065. Epub 2024 Sep 28.

Cited By

1
The assessment of ChatGPT-4's performance compared to expert's consensus on chronic lateral ankle instability.
J Exp Orthop. 2025 Aug 5;12(3):e70393. doi: 10.1002/jeo2.70393. eCollection 2025 Jul.

References

1
Evaluating the accuracy and relevance of ChatGPT responses to frequently asked questions regarding total knee replacement.
Knee Surg Relat Res. 2024 Apr 2;36(1):15. doi: 10.1186/s43019-024-00218-5.
2
Artificial Intelligence-Generated Draft Replies to Patient Inbox Messages.
JAMA Netw Open. 2024 Mar 4;7(3):e243201. doi: 10.1001/jamanetworkopen.2024.3201.
3
Editorial Commentary: At Present, ChatGPT Cannot Be Relied Upon to Answer Patient Questions and Requires Physician Expertise to Interpret Answers for Patients.
Arthroscopy. 2024 Jul;40(7):2080-2082. doi: 10.1016/j.arthro.2024.02.039. Epub 2024 Mar 12.
4
ChatGPT Provides Unsatisfactory Responses to Frequently Asked Questions Regarding Anterior Cruciate Ligament Reconstruction.
Arthroscopy. 2024 Jul;40(7):2067-2079.e1. doi: 10.1016/j.arthro.2024.01.017. Epub 2024 Feb 2.
5
ChatGPT Responses to Common Questions About Anterior Cruciate Ligament Reconstruction Are Frequently Satisfactory.
Arthroscopy. 2024 Jul;40(7):2058-2066. doi: 10.1016/j.arthro.2023.12.009. Epub 2024 Jan 1.
6
Common Painful Foot and Ankle Conditions: A Review.
JAMA. 2023 Dec 19;330(23):2285-2294. doi: 10.1001/jama.2023.23906.
7
Performance of Large Language Models on a Neurology Board-Style Examination.
JAMA Netw Open. 2023 Dec 1;6(12):e2346721. doi: 10.1001/jamanetworkopen.2023.46721.
8
Reporting standards for the use of large language model-linked chatbots for health advice.
Nat Med. 2023 Dec;29(12):2988. doi: 10.1038/s41591-023-02656-2.
9
Large Language Model-Based Chatbot vs Surgeon-Generated Informed Consent Documentation for Common Procedures.
JAMA Netw Open. 2023 Oct 2;6(10):e2336997. doi: 10.1001/jamanetworkopen.2023.36997.
10
Benchmarking large language models' performances for myopia care: a comparative analysis of ChatGPT-3.5, ChatGPT-4.0, and Google Bard.
EBioMedicine. 2023 Sep;95:104770. doi: 10.1016/j.ebiom.2023.104770. Epub 2023 Aug 23.