Assessing the accuracy and quality of artificial intelligence (AI) chatbot-generated responses in making patient-specific drug-therapy and healthcare-related decisions.

Authors

Shiferaw Meron W, Zheng Taylor, Winter Abigail, Mike Leigh Ann, Chan Lingtak-Neander

Affiliation

School of Pharmacy, University of Washington, 1959 NE Pacific Street, Box 357630, Seattle, WA, 98195, USA.

Publication information

BMC Med Inform Decis Mak. 2024 Dec 24;24(1):404. doi: 10.1186/s12911-024-02824-5.

DOI:10.1186/s12911-024-02824-5

PMID:39719573

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11668057/

Abstract

BACKGROUND

Interactive artificial intelligence tools such as ChatGPT have gained popularity, yet little is known about their reliability as a reference tool for healthcare-related information for healthcare providers and trainees. The objective of this study was to assess the consistency, quality, and accuracy of the responses generated by ChatGPT on healthcare-related inquiries.

METHODS

Based on real-world usage experience, a total of 18 open-ended questions were submitted to ChatGPT v3.5: six questions in each of three defined clinical areas (2 each to address "what", "why", and "how", respectively). The experiment was conducted in duplicate using 2 computers. Five investigators independently rated the quality of each of the bot's responses on a 4-point scale. The Delphi method was used to compare each investigator's score with the goal of reaching at least 80% consistency. The accuracy of the responses was checked using established professional references and resources. When a response was in question, the bot was asked to provide the reference material it had used so that the investigators could determine the accuracy and quality. The investigators determined the consistency, accuracy, and quality by establishing a consensus.
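The design and the consensus threshold described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the abstract does not name the three clinical areas (the `area_*` labels are placeholders), and it does not define how "80% consistency" was computed. Here it is assumed to mean that at least 80% of the five raters gave the same score, i.e. 4 of 5. The function names are hypothetical.

```python
from collections import Counter
from itertools import product

# Study design as described in the abstract: 3 clinical areas, 3 question
# types, 2 questions per type, each submitted in duplicate on 2 computers.
AREAS = ["area_1", "area_2", "area_3"]  # placeholders; the abstract does not name the areas
QUESTION_TYPES = ["what", "why", "how"]
QUESTIONS_PER_TYPE = 2
RUNS = 2  # duplicate submissions

questions = [
    (area, qtype, n)
    for area, qtype in product(AREAS, QUESTION_TYPES)
    for n in range(1, QUESTIONS_PER_TYPE + 1)
]
responses = [(q, run) for q in questions for run in range(1, RUNS + 1)]


def agreement(scores):
    """Fraction of raters who gave the most common score (assumed definition)."""
    most_common_count = Counter(scores).most_common(1)[0][1]
    return most_common_count / len(scores)


def reached_consensus(scores, threshold=0.8):
    """True when the modal score is shared by at least `threshold` of the raters."""
    return agreement(scores) >= threshold


if __name__ == "__main__":
    print(len(questions), len(responses))      # 18 questions, 36 responses
    print(reached_consensus([3, 3, 3, 3, 2]))  # 4 of 5 agree: True
    print(reached_consensus([3, 3, 2, 2, 4]))  # 2 of 5 agree: False
```

Under this assumed definition, a response whose ratings fall short of the threshold would go back to the raters for another Delphi round; the abstract does not say how many rounds were needed.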

RESULTS

The speech pattern and length of the responses were consistent within the same user but differed between users. Occasionally, ChatGPT provided 2 completely different responses to the same question. Overall, ChatGPT provided more accurate responses (8 out of 12) to the "what" questions and performed less reliably on the "why" and "how" questions. We identified errors in calculation and units of measurement, as well as misuse of protocols, by ChatGPT. Some of these errors could result in clinical decisions leading to harm. We also identified citations and references shown by ChatGPT that did not exist in the literature.
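The denominator of 12 is not explained in the abstract. One reading consistent with the Methods, offered here as an assumption, is 3 clinical areas × 2 "what" questions × 2 duplicate runs. A short sketch of that arithmetic:

```python
from fractions import Fraction

AREAS = 3
WHAT_QUESTIONS_PER_AREA = 2
RUNS = 2  # assumed: each question was assessed once per duplicate run

what_responses = AREAS * WHAT_QUESTIONS_PER_AREA * RUNS
accurate_what = 8  # reported in the abstract

share = Fraction(accurate_what, what_responses)
print(what_responses)         # 12
print(share)                  # 2/3
print(f"{float(share):.0%}")  # 67%
```

On this reading, roughly two thirds of the "what" responses were judged accurate; the abstract gives no corresponding counts for the "why" and "how" questions.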

CONCLUSIONS

ChatGPT is not ready to take on the coaching role for either healthcare learners or healthcare professionals. The lack of consistency in the responses to the same question is problematic for both learners and decision-makers. The intrinsic assumptions made by the chatbot could lead to erroneous clinical decisions. The unreliability in providing valid references is a serious flaw in using ChatGPT to drive clinical decision making.
