Liang Zuru, Wang Ming, Abdelatif Nasef Mohamed Nasef, Arunakul Marut, Borbon Carlo Angelo V, Chong Keen Wai, Chow Man Wai, Hua Yinghui, Oji David, Ahumada Ximena, Siu Kwai Ming, Tan Ken Jin, Tanaka Yasuhito, Taniguchi Akira, Yung Patrick Shu-Hang, Ling Samuel Ka-Kin
Department of Orthopaedics and Traumatology, The Chinese University of Hong Kong, Hong Kong SAR, China.
DrNasef OrthoClinic for Foot and Ankle Orthopedic Disorders, Cairo, Egypt.
Orthop J Sports Med. 2025 Apr 30;13(4):23259671251332596. doi: 10.1177/23259671251332596. eCollection 2025 Apr.
Background: Large language model (LLM)-based chatbots have shown potential in providing health information and patient education. However, the reliability of these chatbots in offering medical advice for specific conditions such as Achilles tendinopathy remains uncertain. Mixed outcomes in the field of orthopaedics highlight the need for further examination of these chatbots' reliability.
Hypothesis: Three leading LLM-based chatbots can provide accurate and complete responses to inquiries related to Achilles tendinopathy.
Study Design: Cross-sectional study.
Methods: Eighteen questions derived from the Dutch clinical guideline on Achilles tendinopathy were posed to 3 leading LLM-based chatbots: ChatGPT 4.0, Claude 2, and Gemini. The responses were incorporated into an online survey assessed by orthopaedic surgeons specializing in Achilles tendinopathy. Responses were evaluated using a 4-point scoring system, where 1 indicated unsatisfactory and 4 indicated excellent. The total scores for the 18 responses were aggregated for each rater and compared across the chatbots. The intraclass correlation coefficient was calculated to assess consistency among the raters' evaluations.
Results: Thirteen specialists from 9 countries and regions participated. Analysis showed no significant difference in the mean total scores among the chatbots: ChatGPT 4.0 (59.7 ± 5.5), Claude 2 (53.4 ± 9.7), and Gemini (53.6 ± 8.4). The proportions of unsatisfactory responses (score 1) were low and comparable across chatbots: 0.9% for ChatGPT 4.0, 3.4% for Claude 2, and 3.4% for Gemini. In terms of excellent responses (score 4), ChatGPT 4.0 outperformed the others, with 43.6% of its responses rated as excellent, significantly higher than Claude 2 at 27.4% and Gemini at 25.2% (P < .001 for both comparisons). Intraclass correlation coefficients indicated poor reliability for ChatGPT 4.0 (0.420) and moderate reliability for Claude 2 (0.522) and Gemini (0.575).
Conclusion: While LLM-based chatbots such as ChatGPT 4.0 can deliver high-quality responses to queries regarding Achilles tendinopathy, the inconsistency among specialist evaluations and the absence of standardized assessment criteria significantly challenge our ability to draw definitive conclusions. These issues underscore the need for a cautious and standardized approach when considering the integration of LLM-based chatbots into clinical settings.