Gerontology, Public Health and Education Department, National Institute of Geriatrics, Rheumatology and Rehabilitation, Warsaw, Poland; Department of Ultrasound, Institute of Fundamental Technological Research, Polish Academy of Sciences.
Gerontology, Public Health and Education Department, National Institute of Geriatrics, Rheumatology and Rehabilitation, Warsaw, Poland.
Int J Med Inform. 2024 Oct;190:105562. doi: 10.1016/j.ijmedinf.2024.105562. Epub 2024 Jul 19.
BACKGROUND: Chatbots based on large language models (LLMs) generate human-like responses to questions across all categories. Owing to staff shortages in healthcare systems, patients waiting for an appointment increasingly use chatbots to obtain information about their condition. Given the number of chatbots currently available, assessing the responses they generate is essential.

METHODS: Five freely accessible chatbots were selected (Gemini, Microsoft Copilot, PiAI, ChatGPT, ChatSpot) and blinded with letters (A, B, C, D, E). Each chatbot was asked questions about cardiology, oncology, and psoriasis. Responses were compared with guidelines from the European Society of Cardiology, the American Academy of Dermatology, and the American Society of Clinical Oncology. All answers were assessed with readability scales (Flesch Reading Ease, Gunning Fog Index, Flesch-Kincaid Grade Level, and Dale-Chall Score). Two independent medical professionals rated the compliance of the responses with the guidelines on a 3-point Likert scale.

RESULTS: A total of 45 questions were asked of all chatbots. Chatbot C gave the shortest answers, 7.0 (6.0-8.0), and Chatbot A the longest, 17.5 (13.0-24.5). Flesch Reading Ease ranged from 16.3 (12.2-21.9) for Chatbot D to 39.8 (29.0-50.4) for Chatbot A. Flesch-Kincaid Grade Level ranged from 12.5 (10.6-14.6) for Chatbot A to 15.9 (15.1-17.1) for Chatbot D. Gunning Fog Index ranged from 15.77 (Chatbot A) to 19.73 (Chatbot D). Dale-Chall Score ranged from 10.3 (9.3-11.3) for Chatbot A to 11.9 (11.5-12.4) for Chatbot D.

CONCLUSION: This study indicates that chatbots vary in the length, quality, and readability of their answers. They answer each question in their own way, based on the data they have pulled from the web. The reliability of the responses generated by chatbots is high. Even so, people seeking information from a chatbot need to be careful and verify the answers they receive, particularly when asking about medical and health topics.
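For context on the readability metrics named in the METHODS, three of the four scales follow standard published formulas and can be computed directly from sentence, word, and syllable counts. The Python sketch below is illustrative only and is not the study's scoring pipeline (the authors presumably used dedicated readability tools); it uses a naive vowel-group syllable estimate and omits the Dale-Chall Score, which additionally requires a list of roughly 3,000 "familiar" words.

    import re

    def count_syllables(word: str) -> int:
        # Naive estimate: count vowel groups, trim one trailing silent 'e'.
        word = word.lower()
        n = len(re.findall(r"[aeiouy]+", word))
        if word.endswith("e") and n > 1:
            n -= 1
        return max(n, 1)

    def readability(text: str) -> dict:
        sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
        words = re.findall(r"[A-Za-z']+", text)
        syllables = sum(count_syllables(w) for w in words)
        complex_words = sum(1 for w in words if count_syllables(w) >= 3)

        wps = len(words) / max(len(sentences), 1)   # words per sentence
        spw = syllables / max(len(words), 1)        # syllables per word

        return {
            # Flesch Reading Ease: higher scores mean easier text.
            "flesch_reading_ease": 206.835 - 1.015 * wps - 84.6 * spw,
            # Flesch-Kincaid Grade Level: approximate US school grade required.
            "flesch_kincaid_grade": 0.39 * wps + 11.8 * spw - 15.59,
            # Gunning Fog Index: years of education, based on long-word share.
            "gunning_fog": 0.4 * (wps + 100 * complex_words / max(len(words), 1)),
        }

    if __name__ == "__main__":
        sample = ("Psoriasis is a chronic inflammatory skin disease. "
                  "Treatment depends on severity.")
        for name, score in readability(sample).items():
            print(f"{name}: {score:.1f}")

Scores around 60-70 on the Flesch Reading Ease scale correspond to plain English; the values reported in the RESULTS (16.3-39.8) therefore indicate difficult, college-level text.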