Hose Bat-Zion, Handley Jessica L, Biro Joshua, Reddy Sahithi, Krevat Seth, Hettinger Aaron Zachary, Ratwani Raj M
National Center for Human Factors in Healthcare, MedStar Health Research Institute, Washington, District of Columbia, USA.
Georgetown University Medical Center, Washington, District of Columbia, USA.
BMJ Qual Saf. 2025 Jan 28;34(2):130-132. doi: 10.1136/bmjqs-2024-017918.
Generative artificial intelligence (AI) technologies have the potential to revolutionise healthcare delivery but require classification and monitoring of patient safety risks. To address this need, we developed and evaluated a preliminary classification system for categorising generative AI patient safety errors. Our classification system is organised around two AI system stages (input and output) with specific error types by stage. We applied our classification system to two generative AI applications to assess its effectiveness in categorising safety issues: patient-facing conversational large language models (LLMs) and an ambient digital scribe (ADS) system for clinical documentation. In the LLM analysis, we identified 45 errors across 27 patient medical queries, with omission being the most common (42% of errors). Of the identified errors, 50% were categorised as low clinical significance, 25% as moderate clinical significance and 25% as high clinical significance. Similarly, in the ADS simulation, we identified 66 errors across 11 patient visits, with omission being the most common (83% of errors). Of the identified errors, 55% were categorised as low clinical significance and 45% were categorised as moderate clinical significance. These findings demonstrate the classification system's utility in categorising output errors from two different AI healthcare applications, providing a starting point for developing a robust process to better understand AI-enabled errors.