Shiferaw Meron W, Zheng Taylor, Winter Abigail, Mike Leigh Ann, Chan Lingtak-Neander
School of Pharmacy, University of Washington, 1959 NE Pacific Street, Box 357630, Seattle, WA, 98195, USA.
BMC Med Inform Decis Mak. 2024 Dec 24;24(1):404. doi: 10.1186/s12911-024-02824-5.
Interactive artificial intelligence tools such as ChatGPT have gained popularity, yet little is known about their reliability as a reference tool for healthcare providers and trainees seeking healthcare-related information. The objective of this study was to assess the consistency, quality, and accuracy of the responses generated by ChatGPT to healthcare-related inquiries.
A total of 18 open-ended questions, six in each of three defined clinical areas (two each addressing "what", "why", and "how"), were submitted to ChatGPT v3.5 based on real-world usage experience. The experiment was conducted in duplicate using two computers. Five investigators independently rated the quality of each response on a 4-point scale. The Delphi method was used to compare each investigator's scores, with the goal of reaching at least 80% consistency. The accuracy of the responses was checked against established professional references and resources. When a response was in question, the bot was asked to provide the reference material it had used so the investigators could determine accuracy and quality. The investigators determined consistency, accuracy, and quality by consensus.
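To illustrate the ≥ 80% consistency target used in the Delphi scoring, the sketch below shows one way percent agreement among five raters on a 4-point scale might be computed; the scores and the pairwise-agreement definition are illustrative assumptions, not the study's actual procedure.

# Hypothetical sketch: percent agreement among five raters on a 4-point scale.
# The ratings and the pairwise-agreement definition are assumptions for
# illustration, not the scoring procedure reported in the study.
from itertools import combinations

def percent_agreement(scores):
    # Fraction of rater pairs that assigned identical scores to one response.
    pairs = list(combinations(scores, 2))
    matches = sum(1 for a, b in pairs if a == b)
    return matches / len(pairs)

ratings = [3, 3, 3, 4, 3]  # five investigators, 4-point scale
agreement = percent_agreement(ratings)
print(f"Agreement: {agreement:.0%}")  # 60% here; below the 80% target, so the
                                      # panel would rediscuss and rescore.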
The language pattern and length of the responses were consistent within the same user but differed between users. Occasionally, ChatGPT provided two completely different responses to the same question. Overall, ChatGPT provided more accurate responses to the "what" questions (8 out of 12), with less reliable performance on the "why" and "how" questions. We identified errors in calculations, units of measurement, and misuse of protocols by ChatGPT. Some of these errors could result in clinical decisions leading to harm. We also identified citations and references provided by ChatGPT that do not exist in the literature.
ChatGPT is not ready to take on a coaching role for either healthcare learners or healthcare professionals. The lack of consistency in the responses to the same question is problematic for both learners and decision-makers. The intrinsic assumptions made by the chatbot could lead to erroneous clinical decisions, and its unreliability in providing valid references is a serious flaw in using ChatGPT to drive clinical decision-making.