Jin Zhaosheng, Abola Ramon, Bargnes Vincent, Tsivitis Alexandra, Rahman Sadiq, Schwartz Jonathon, Bergese Sergio D, Schabel Joy E
Department of Anesthesiology, Stony Brook University Hospital, Stony Brook, NY, United States.
Front Artif Intell. 2025 May 21;8:1582096. doi: 10.3389/frai.2025.1582096. eCollection 2025.
The popularization of large language chatbots such as ChatGPT has led to growing utility in various biomedical fields. It has been shown that chatbots can provide reasonably accurate responses to medical exam-style questions. On the other hand, chatbots have known limitations that may hinder their utility in medical education. We conducted a pragmatically designed study to evaluate the accuracy and completeness of ChatGPT-generated responses to various styles of prompts based on entry-level anesthesiology topics. Ninety-five unique prompts were constructed using topics from the Anesthesia Knowledge Test 1 (AKT-1), a standardized exam taken by US anesthesiology residents after 1 month of specialty training. A combination of focused and open-ended prompts was used to evaluate the chatbot's ability to present and organize information. We also included prompts for journal references and lecture outlines, as well as biased (medically inaccurate) prompts. The responses were independently scored on a 3-point Likert scale by two board-certified anesthesiologists with extensive experience in medical education. Fifty-two responses (55%) were rated as completely accurate by both evaluators. For prompts requiring longer responses, most responses were also deemed complete. Notably, the chatbot frequently generated inaccurate responses when asked for specific literature references and when the input prompt contained deliberate errors (biased prompts). Another recurring observation was the conflation of adjacent concepts (e.g., a specific characteristic attributed to the wrong drug within the same pharmacological class). Some of the inaccuracies could result in significant harm if applied in clinical situations. While chatbots such as ChatGPT can generate medically accurate responses in most cases, their reliability is not yet sufficient for medical and clinical education. Content generated by ChatGPT and other chatbots will require validation prior to use.