Balta Kaan Y, Javidan Arshia P, Walser Eric, Arntfield Robert, Prager Ross
Schulich School of Medicine & Dentistry, Western University, London, Ontario, Canada.
Division of Vascular Surgery, Department of Surgery, University of Toronto, Toronto, Ontario, Canada.
J Intensive Care Med. 2025 Feb;40(2):184-190. doi: 10.1177/08850666241267871. Epub 2024 Aug 8.
We assessed 2 versions of the large language model (LLM) ChatGPT (versions 3.5 and 4.0) in generating appropriate, consistent, and readable recommendations on core critical care topics. How do successive large language models compare in generating appropriate, consistent, and readable recommendations on core critical care topics? A set of 50 LLM-generated responses to clinical questions was evaluated by 2 independent intensivists on a 5-point Likert scale for appropriateness, consistency, and readability. ChatGPT 4.0 showed significantly higher median appropriateness scores than ChatGPT 3.5 (4.0 vs 3.0, P < .001). However, there was no significant difference in consistency between the 2 versions (40% vs 28%, P = .291). Readability, assessed by the Flesch-Kincaid Grade Level, was also not significantly different between the 2 models (14.3 vs 14.4, P = .93). Both models produced "hallucinations" (misinformation delivered with high confidence), which highlights the risk of relying on these tools without domain expertise. Despite their potential for clinical application, both models lacked consistency, producing different results when asked the same question multiple times. The study underscores the need for clinicians to understand the strengths and limitations of LLMs for safe and effective implementation in critical care settings. https://osf.io/8chj7/.
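The Flesch-Kincaid Grade Level cited in the abstract is a standard readability formula: 0.39 × (words/sentence) + 11.8 × (syllables/word) − 15.59. As an illustration only, a minimal Python sketch follows; it uses a crude vowel-group heuristic for syllable counting, which will not exactly match the tokenizer the study used.

```python
import re

def count_syllables(word: str) -> int:
    """Approximate syllables by counting runs of consecutive vowels.
    This is a common heuristic, not a dictionary-accurate count."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid Grade Level:
    0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)
```

A score near 14, as reported for both models, corresponds roughly to college-level text; mature readability libraries (e.g., textstat) implement more careful sentence and syllable detection than this sketch.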