Collins Christopher E, Giammanco Peter A, Guirgus Monica, Kricfalusi Mikayla, Rice Richard C, Nayak Rusheel, Ruckle David, Filler Ryan, Elsissy Joseph G
Orthopedic Surgery, California University of Science and Medicine, Colton, USA.
Orthopedic Surgery, Arrowhead Regional Medical Center, Colton, USA.
Cureus. 2025 Jan 31;17(1):e78313. doi: 10.7759/cureus.78313. eCollection 2025 Jan.
The rise of artificial intelligence (AI), including generative chatbots like ChatGPT (OpenAI, San Francisco, CA, USA), has revolutionized many fields, including healthcare. Patients can now prompt chatbots to generate purportedly accurate and individualized healthcare content. This study analyzed the readability and quality of answers to Achilles tendon rupture questions from six generative AI chatbots to evaluate and compare their potential as patient education resources.
The six AI models evaluated were ChatGPT 3.5, ChatGPT 4, Gemini 1.0 (previously Bard; Google, Mountain View, CA, USA), Gemini 1.5 Pro, Claude (Anthropic, San Francisco, CA, USA), and Grok (xAI, Palo Alto, CA, USA); each was queried without prior prompting. Each model was asked 10 common patient questions about Achilles tendon rupture, as determined by five orthopaedic surgeons. The readability of the generated responses was measured using the Flesch-Kincaid Grade Level, the Gunning Fog Index, and SMOG (Simple Measure of Gobbledygook). Response quality was then graded against the DISCERN criteria by five blinded orthopaedic surgeons.
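For reference, the three readability indices are standard published formulas based on sentence length and syllable counts. The abstract does not name the tool the authors used to compute them; the following is a minimal Python sketch with a naive vowel-group syllable heuristic, so its scores will differ slightly from dedicated readability software.

    import math
    import re

    def count_syllables(word: str) -> int:
        """Naive syllable estimate: count groups of consecutive vowels."""
        return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

    def readability_scores(text: str) -> dict:
        """Compute the three indices used in the study (approximate).

        Note: SMOG is formally defined for samples of 30+ sentences, and
        Gunning Fog excludes some polysyllabic words (e.g., proper nouns);
        this sketch ignores those refinements.
        """
        sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
        words = re.findall(r"[A-Za-z']+", text)
        syllables = [count_syllables(w) for w in words]
        n_sent, n_words, n_syll = len(sentences), len(words), sum(syllables)
        n_poly = sum(1 for s in syllables if s >= 3)  # "complex" words

        fk = 0.39 * (n_words / n_sent) + 11.8 * (n_syll / n_words) - 15.59
        fog = 0.4 * ((n_words / n_sent) + 100 * (n_poly / n_words))
        smog = 1.0430 * math.sqrt(n_poly * (30 / n_sent)) + 3.1291
        return {"Flesch-Kincaid": fk, "Gunning Fog": fog, "SMOG": smog}

    print(readability_scores(
        "An Achilles tendon rupture is a tear of the tendon connecting "
        "the calf muscles to the heel bone. Treatment may be surgical "
        "or nonsurgical."
    ))

All three indices map a passage to an approximate US school grade level, which is why a lower score corresponds to easier reading in the results below.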
Responses from Gemini 1.0 were significantly easier to read (closest to the average American reading level) than those from ChatGPT 3.5, ChatGPT 4, and Claude. Additionally, mean DISCERN scores demonstrated significantly higher response quality from Gemini 1.0 (63.0±5.1) and ChatGPT 4 (63.8±6.2) than from ChatGPT 3.5 (53.8±3.8), Claude (55.0±3.8), and Grok (54.2±4.8). However, when the overall quality rating (DISCERN question 16) was averaged for each model, all models were graded at an above-average level (range, 3.4-4.4).
Our results indicate that generative chatbots can potentially serve as patient education resources alongside physicians. Although some models lacked sufficient content, each performed above average in overall quality. With the lowest readability scores (i.e., the simplest language) and among the highest DISCERN scores, Gemini 1.0 outperformed ChatGPT, Claude, and Grok, emerging as potentially the simplest and most reliable generative chatbot regarding management of Achilles tendon rupture.