Rossettini Giacomo, Bargeri Silvia, Cook Chad, Guida Stefania, Palese Alvisa, Rodeghiero Lia, Pillastrini Paolo, Turolla Andrea, Castellini Greta, Gianola Silvia
School of Physiotherapy, University of Verona, Verona, Italy.
Department of Physiotherapy, Faculty of Medicine, Health and Sports, Universidad Europea de Madrid, Madrid, Spain.
Front Digit Health. 2025 Jun 27;7:1574287. doi: 10.3389/fdgth.2025.1574287. eCollection 2025.
Artificial intelligence (AI) chatbots, which generate human-like responses from extensive training data, are becoming important tools in healthcare, acting as virtual assistants that provide information on health conditions, treatments, and preventive measures. However, how well their answers to complex clinical questions on lumbosacral radicular pain align with clinical practice guidelines (CPGs) remains unclear. We aimed to evaluate AI chatbots' performance against CPG recommendations for diagnosing and treating lumbosacral radicular pain.
We performed a cross-sectional study assessing AI chatbots' responses against CPG recommendations for diagnosing and treating lumbosacral radicular pain. Clinical questions based on these CPGs were posed to the latest versions (as of 2024) of six AI chatbots: ChatGPT-3.5, ChatGPT-4o, Microsoft Copilot, Google Gemini, Claude, and Perplexity. The chatbots' responses were evaluated for (a) consistency of the text responses, using Plagiarism Checker X; (b) intra- and inter-rater reliability, using Fleiss' Kappa; and (c) match rate with the CPG recommendations. Statistical analyses were performed with STATA/MP 16.1.
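As a concrete illustration of the reliability analysis, the minimal Python sketch below computes Fleiss' Kappa with statsmodels, the same statistic the study used to quantify rater agreement. This is not the authors' code: the number of items, the three raters, and the three-category scoring scheme are all hypothetical, chosen only to show the mechanics.

    import numpy as np
    from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

    # Hypothetical data: 9 chatbot answers (rows) scored by 3 raters (columns)
    # as 0 = "not matching CPG", 1 = "matching CPG", 2 = "unclear".
    ratings = np.array([
        [1, 1, 1],
        [0, 0, 1],
        [1, 1, 1],
        [2, 2, 2],
        [0, 0, 0],
        [1, 1, 0],
        [1, 1, 1],
        [0, 1, 0],
        [2, 2, 1],
    ])

    # aggregate_raters converts the (subjects x raters) label matrix into the
    # (subjects x categories) count table that fleiss_kappa expects.
    table, _ = aggregate_raters(ratings)
    kappa = fleiss_kappa(table, method="fleiss")
    print(f"Fleiss' kappa = {kappa:.2f}")

    # The conventional Landis-Koch benchmarks map kappa to the labels quoted
    # in the results: 0.41-0.60 "moderate", 0.61-0.80 "substantial",
    # 0.81-1.00 "almost perfect".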
We found high variability in the text consistency of the AI chatbots' responses (medians ranging from 26% to 68%). Intra-rater reliability ranged from "almost perfect" to "substantial," while inter-rater reliability ranged from "almost perfect" to "moderate." Perplexity had the highest match rate (67%), followed by Google Gemini (63%) and Microsoft Copilot (44%); ChatGPT-3.5, ChatGPT-4o, and Claude showed the lowest performance, each with a 33% match rate.
Despite the variability in internal consistency and good intra- and inter-rater reliability, the AI chatbots' recommendations often did not align with CPG recommendations for diagnosing and treating lumbosacral radicular pain. Clinicians and patients should exercise caution when relying on these AI models, since one-third to two-thirds of the recommendations provided may be inappropriate or misleading, depending on the specific chatbot.