Scaff Simone P S, Reis Felipe J J, Ferreira Giovanni E, Jacob Maria Fernanda, Saragiotto Bruno T
Masters and Doctoral Programs in Physical Therapy, Universidade Cidade de Sao Paulo, Sao Paulo, Brazil.
Physical Therapy Department, Instituto Federal do Rio de Janeiro, Rio de Janeiro, Brazil; Department of Physiotherapy, Human Physiology and Anatomy, Vrije Universiteit Brussel, Brussel, Belgium.
Ann Rheum Dis. 2025 Jan;84(1):143-149. doi: 10.1136/ard-2024-226202. Epub 2025 Jan 2.
The aim of this study was to assess the accuracy and readability of the answers generated by large language model (LLM)-chatbots to common patient questions about low back pain (LBP).
This cross-sectional study analysed responses to 30 LBP-related questions covering self-management, risk factors and treatment. The questions were developed by experienced clinicians and researchers and piloted with a group of consumer representatives with lived experience of LBP. The questions were entered as prompts into ChatGPT 3.5, Bing, Bard (Gemini) and ChatGPT 4.0. Responses were evaluated for accuracy, readability and the presence of disclaimers about health advice. Accuracy was assessed by comparing the generated recommendations against the main clinical guidelines for LBP. Responses were analysed by two independent reviewers and classified as accurate, inaccurate or unclear. Readability was measured with the Flesch Reading Ease Score (FRES).
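For context, the FRES is a fixed formula over average sentence length and average syllables per word: FRES = 206.835 - 1.015 x (words/sentences) - 84.6 x (syllables/words). A minimal Python sketch of the computation follows; it is not from the paper, and the vowel-group syllable counter is a naive assumption for illustration (real readability tools use dictionaries or better heuristics).

import re

def count_syllables(word):
    # Assumption: approximate syllables as runs of consecutive vowels.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text):
    # FRES = 206.835 - 1.015*(words/sentences) - 84.6*(syllables/words)
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * (len(words) / sentences) - 84.6 * (syllables / len(words))

print(round(flesch_reading_ease("Stay active. Gentle walking often helps low back pain."), 1))

Higher scores indicate easier text; patient-education materials are commonly recommended to score well above the 50-point range reported in this study.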
Out of 120 responses yielding 1069 recommendations, 55.8% were accurate, 42.1% inaccurate and 1.9% unclear. The treatment and self-management domains showed the highest accuracy, while risk factors had the most inaccuracies. Overall, LLM-chatbots produced answers that were 'fairly difficult' to read, with a mean (SD) FRES of 50.94 (3.06). Disclaimers about health advice were present in approximately 70%-100% of responses.
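On the standard Flesch scale, the reported mean of 50.94 falls in the 50-60 band, conventionally labelled 'fairly difficult' (roughly a 10th-12th grade reading level). A minimal lookup illustrating the standard bands (the band labels follow the usual Flesch classification; the function name is illustrative):

def flesch_band(score):
    # Standard Flesch Reading Ease bands (higher = easier to read).
    bands = [(90, "very easy"), (80, "easy"), (70, "fairly easy"),
             (60, "standard"), (50, "fairly difficult"), (30, "difficult")]
    for cutoff, label in bands:
        if score >= cutoff:
            return label
    return "very confusing"

print(flesch_band(50.94))  # fairly difficult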
The use of LLM-chatbots as tools for patient education and counselling in LBP shows promising but variable results. These chatbots generally provide moderately accurate recommendations, although accuracy varies with the topic of each question. The readability level of the answers was inadequate, potentially limiting patients' ability to comprehend the information.