Faculty of School of Life and Health Sciences, Nursing Department, The Jerusalem College of Technology-Lev Academic Center, Jerusalem, Israel; The Department of Vascular Surgery, The Chaim Sheba Medical Center, Tel Hashomer, Ramat Gan, Tel Aviv, Israel.
Shaare Zedek Medical Center, Jerusalem, Israel.
Int J Nurs Stud. 2024 Jul;155:104771. doi: 10.1016/j.ijnurstu.2024.104771. Epub 2024 Apr 9.
To assess the clinical reasoning capabilities of two large language models, ChatGPT-4 and Claude-2.0, compared to those of neonatal nurses during neonatal care scenarios.
A cross-sectional study with a comparative evaluation using a survey instrument that included six neonatal intensive care unit clinical scenarios.
32 neonatal intensive care nurses with 5-10 years of experience working in the neonatal intensive care units of three medical centers.
Participants responded to six written clinical scenarios. Simultaneously, we asked ChatGPT-4 and Claude-2.0 to provide initial assessments and treatment recommendations for the same scenarios. The responses from ChatGPT-4 and Claude-2.0 were then scored by certified neonatal nurse practitioners for accuracy, completeness, and response time.
Both models demonstrated capabilities in clinical reasoning for neonatal care, with Claude-2.0 significantly outperforming ChatGPT-4 in clinical accuracy and speed. However, limitations were identified across the cases in diagnostic precision, treatment specificity, and response lag.
Although both models show promise, their current limitations reinforce the need for substantial refinement before ChatGPT-4 and Claude-2.0 can be considered for integration into clinical practice. Additional validation of these tools is needed to safely leverage this artificial intelligence technology for enhancing clinical decision-making.
The study offers insight into the reasoning accuracy of new artificial intelligence models in neonatal clinical care. The current accuracy gaps of ChatGPT-4 and Claude-2.0 need to be addressed prior to clinical use.