Department of Surgery, Peninsula Health, Melbourne, Victoria, Australia.
Faculty of Science, Medicine, and Health, Monash University, Melbourne, Victoria, Australia.
ANZ J Surg. 2024 Feb;94(1-2):68-77. doi: 10.1111/ans.18666. Epub 2023 Aug 21.
The COVID-19 pandemic has significantly disrupted the clinical experience and exposure of medical students and junior doctors. Integrating artificial intelligence (AI) into medical education has the potential to enhance learning and improve patient care. This study aimed to evaluate the effectiveness of three popular large language models (LLMs) as clinical decision-making support tools for junior doctors.
A series of increasingly complex clinical scenarios was presented to ChatGPT, Google's Bard, and Bing's AI. Their responses were evaluated against standard guidelines: readability was measured with the Flesch Reading Ease Score, the Flesch-Kincaid Grade Level, and the Coleman-Liau Index, and reliability and suitability with the modified DISCERN score. Finally, three experienced specialists rated the LLMs' outputs for accuracy, informativeness, and accessibility on a Likert scale.
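The three readability indices named above are standard published formulas computed from word, sentence, letter, and syllable counts. The sketch below is a minimal illustration in Python using a naive vowel-group syllable heuristic; it is not the tool used in the study, and real readability software uses more careful tokenization and syllable rules.

```python
import re


def text_counts(text):
    """Naive word, sentence, letter, and syllable counts (simplified heuristics)."""
    words = re.findall(r"[A-Za-z]+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    letters = sum(len(w) for w in words)

    def syllables(word):
        # Approximate syllables as runs of vowels; at least one per word.
        return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

    return len(words), len(sentences), letters, sum(syllables(w) for w in words)


def readability(text):
    """Return (Flesch Reading Ease, Flesch-Kincaid Grade Level, Coleman-Liau Index)."""
    w, s, letters, syl = text_counts(text)
    fre = 206.835 - 1.015 * (w / s) - 84.6 * (syl / w)
    fkgl = 0.39 * (w / s) + 11.8 * (syl / w) - 15.59
    # Coleman-Liau uses letters (L) and sentences (S) per 100 words.
    cli = 0.0588 * (letters / w * 100) - 0.296 * (s / w * 100) - 15.8
    return fre, fkgl, cli
```

Higher Flesch Reading Ease means easier text, while the other two indices approximate the US school grade level needed to understand it; the abstract's scores (FRE ≈ 31, FKGL ≈ 13.5) correspond to college-level prose.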
In terms of readability and reliability, ChatGPT stood out among the three LLMs, recording the highest scores on the Flesch Reading Ease (31.2 ± 3.5), Flesch-Kincaid Grade Level (13.5 ± 0.7), Coleman-Liau Index (13), and DISCERN (62 ± 4.4). These results suggest that ChatGPT's medical advice was significantly more comprehensible and better aligned with clinical guidelines. Bard followed closely, with BingAI trailing in all categories. The only statistically non-significant differences (P > 0.05) were between the readability indices of ChatGPT and Bard, and between the Flesch Reading Ease scores of ChatGPT/Bard and BingAI.
This study demonstrates the potential utility of LLMs in fostering self-directed and personalized learning, as well as in bolstering clinical decision-making support for junior doctors. However, further development is needed before these models can be integrated into medical education.