Collins Christopher E, Giammanco Peter A, Guirgus Monica, Kricfalusi Mikayla, Rice Richard C, Nayak Rusheel, Ruckle David, Filler Ryan, Elsissy Joseph G
Orthopedic Surgery, California University of Science and Medicine, Colton, USA.
Orthopedic Surgery, Arrowhead Regional Medical Center, Colton, USA.
Cureus. 2025 Jan 31;17(1):e78313. doi: 10.7759/cureus.78313. eCollection 2025 Jan.
The rise of artificial intelligence (AI), including generative chatbots like ChatGPT (OpenAI, San Francisco, CA, USA), has revolutionized many fields, including healthcare. Patients can now prompt chatbots to generate purportedly accurate and individualized healthcare content. This study analyzed the readability and quality of answers to Achilles tendon rupture questions from six generative AI chatbots to evaluate and compare their potential as patient education resources.
The six AI models evaluated were ChatGPT 3.5, ChatGPT 4, Gemini 1.0 (previously Bard; Google, Mountain View, CA, USA), Gemini 1.5 Pro, Claude (Anthropic, San Francisco, CA, USA), and Grok (xAI, Palo Alto, CA, USA); each was queried without prior prompting. Each model was asked 10 common patient questions about Achilles tendon rupture, as determined by five orthopaedic surgeons. The readability of the generated responses was measured using the Flesch-Kincaid Grade Level, the Gunning Fog Index, and SMOG (Simple Measure of Gobbledygook). Response quality was then graded against the DISCERN criteria by five blinded orthopaedic surgeons.
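For reference, the three readability indices are standard published formulas based on sentence length and syllable counts. The abstract does not name the tool the authors used to compute them; the following is a minimal Python sketch with a naive vowel-group syllable heuristic, so its scores will differ slightly from dedicated readability software.

    import math
    import re

    def count_syllables(word: str) -> int:
        """Naive syllable estimate: count groups of consecutive vowels."""
        return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

    def readability_scores(text: str) -> dict:
        """Compute the three indices used in the study (approximate).

        Note: SMOG is formally defined for samples of 30+ sentences, and
        Gunning Fog excludes some polysyllabic words (e.g., proper nouns);
        this sketch ignores those refinements.
        """
        sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
        words = re.findall(r"[A-Za-z']+", text)
        syllables = [count_syllables(w) for w in words]
        n_sent, n_words, n_syll = len(sentences), len(words), sum(syllables)
        n_poly = sum(1 for s in syllables if s >= 3)  # "complex" words

        fk = 0.39 * (n_words / n_sent) + 11.8 * (n_syll / n_words) - 15.59
        fog = 0.4 * ((n_words / n_sent) + 100 * (n_poly / n_words))
        smog = 1.0430 * math.sqrt(n_poly * (30 / n_sent)) + 3.1291
        return {"Flesch-Kincaid": fk, "Gunning Fog": fog, "SMOG": smog}

    print(readability_scores(
        "An Achilles tendon rupture is a tear of the tendon connecting "
        "the calf muscles to the heel bone. Treatment may be surgical "
        "or nonsurgical."
    ))

All three indices map a passage to an approximate US school grade level, which is why a lower score corresponds to easier reading in the results below.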
Responses from Gemini 1.0 were significantly easier to read (closest to the average American reading level) than those from ChatGPT 3.5, ChatGPT 4, and Claude. Additionally, mean DISCERN scores demonstrated significantly higher response quality from Gemini 1.0 (63.0±5.1) and ChatGPT 4 (63.8±6.2) than from ChatGPT 3.5 (53.8±3.8), Claude (55.0±3.8), and Grok (54.2±4.8). However, when the overall quality rating (DISCERN question 16) was averaged for each model, all models were graded at an above-average level (range, 3.4-4.4).
Our results indicate that generative chatbots can potentially serve as patient education resources alongside physicians. Although some models lacked sufficient content, each performed above average in overall quality. With the lowest readability scores (i.e., the simplest language) and among the highest DISCERN scores, Gemini 1.0 outperformed ChatGPT, Claude, and Grok, emerging as potentially the simplest and most reliable generative chatbot regarding management of Achilles tendon rupture.