Lang Siegmund, Vitale Jacopo, Galbusera Fabio, Fekete Tamás, Boissiere Louis, Charles Yann Philippe, Yucekul Altug, Yilgor Caglar, Núñez-Pereira Susana, Haddad Sleiman, Gomez-Rice Alejandro, Mehta Jwalant, Pizones Javier, Pellisé Ferran, Obeid Ibrahim, Alanay Ahmet, Kleinstück Frank, Loibl Markus
Department of Trauma Surgery, University Hospital Regensburg, Franz-Josef-Strauss-Allee 11, 93053, Regensburg, Germany.
Department of Spine Surgery, Schulthess Klinik, Zurich, Switzerland.
Spine Deform. 2025 Mar;13(2):361-372. doi: 10.1007/s43390-024-00955-3. Epub 2024 Nov 4.
Large language models (LLMs) have the potential to bridge knowledge gaps in patient education and enrich patient-surgeon interactions. This study evaluated three chatbots on their ability to deliver empathetic and precise information and management advice related to adolescent idiopathic scoliosis (AIS). Specifically, we assessed the accuracy, clarity, and relevance of the information provided, aiming to determine the effectiveness of LLMs in addressing common patient queries and enhancing patients' understanding of AIS.
We sourced 20 webpages for the top frequently asked questions (FAQs) about AIS and formulated 10 critical questions based on them. Three advanced LLMs (ChatGPT 3.5, ChatGPT 4.0, and Google Bard) were selected to answer these questions, with responses limited to 200 words. The LLMs' responses were evaluated by a blinded group of experienced deformity surgeons (members of the European Spine Study Group) from seven European spine centers. A pre-established 4-level rating system, from excellent to unsatisfactory, was used, with additional ratings for clarity, comprehensiveness, and empathy on a 5-point Likert scale. For any response not rated 'excellent', raters were asked to report the reasons for their decision. Lastly, raters answered six questions on their general opinion of AI in healthcare.
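To make the querying procedure concrete, the following is a minimal sketch, assuming the questions were posed programmatically through the OpenAI Python client; the study itself may have used the public chat interfaces, and the model name, system prompt, and question wording below are illustrative assumptions, not the study's materials.

# A minimal sketch, assuming programmatic access via the OpenAI Python
# client (v1.x); the study may instead have used the public chat UIs.
# Model name, system prompt, and questions are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

QUESTIONS = [
    "What is adolescent idiopathic scoliosis (AIS)?",  # e.g., Q1: definition
    "How is AIS diagnosed?",                           # e.g., Q2: diagnosis
    # ... the remaining questions derived from the 20 FAQ webpages
]

def ask(model: str, question: str) -> str:
    """Pose one patient question, instructing the model to stay under 200 words."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system",
             "content": ("Answer a patient's question about adolescent "
                         "idiopathic scoliosis in at most 200 words.")},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

for q in QUESTIONS:
    print(ask("gpt-4", q))  # repeat per model under comparison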
Across all LLMs, 26% of responses were rated 'excellent', with ChatGPT-4.0 leading (39%), followed by Bard (17%). ChatGPT-4.0 was rated superior to Bard and ChatGPT 3.5 (p = 0.003). Discrepancies among raters were significant (p < 0.0001), calling inter-rater reliability into question. No substantial differences were noted in the distribution of answers by question (p = 0.43). The answers on the diagnosis (Q2) and causes (Q4) of AIS were top-rated. Dissatisfaction was greatest with the answers regarding definitions (Q1) and long-term results (Q7). Comprehensiveness, clarity, empathy, and length of the answers were rated positively (> 3.0 out of 5.0) and did not differ among LLMs. However, GPT-3.5 struggled with language suitability and empathy, while Bard's responses were overly detailed and less empathetic. Overall, raters found that 9% of answers were off-topic and 22% contained clear mistakes.
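The abstract does not name the statistical tests behind these p-values; as a hedged illustration, a Kruskal-Wallis test on the ordinal ratings is one plausible way the between-model comparison could be computed. The data below are synthetic placeholders, not the study's ratings.

# A sketch of one plausible between-model comparison: a Kruskal-Wallis
# test on ordinal ratings. The test choice is an assumption (the abstract
# does not specify it), and the data are synthetic placeholders.
import numpy as np
from scipy.stats import kruskal

rng = np.random.default_rng(0)

# Ratings coded 1 (unsatisfactory) to 4 (excellent), one value per
# rater-question pair and per model (synthetic example data).
gpt35 = rng.integers(1, 5, size=100)
gpt4 = rng.integers(1, 5, size=100)
bard = rng.integers(1, 5, size=100)

stat, p = kruskal(gpt35, gpt4, bard)
print(f"Kruskal-Wallis H = {stat:.2f}, p = {p:.4f}")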
Our study offers crucial insights into the strengths and weaknesses of current LLMs in AIS patient and parent education, highlighting the promise of advancements like ChatGPT-4.0 and Gemini alongside the need for continuous improvement in empathy, contextual understanding, and language appropriateness.