Jt Comm J Qual Patient Saf. 2024 Dec;50(12):877-881. doi: 10.1016/j.jcjq.2024.08.001. Epub 2024 Aug 6.
Data collected through incident reporting systems are difficult to use because they comprise a large volume of primarily qualitative information. Large language models (LLMs), such as ChatGPT, provide novel capabilities in text summarization and labeling that could support safety data trending and early identification of opportunities to prevent patient harm. This study assessed the capability of a proprietary LLM (GPT-3.5) to automatically label a cross-sectional sample of real-world obstetric incident reports.
A sample of 370 incident reports submitted to inpatient obstetric units between December 2022 and May 2023 was extracted. Human-annotated labels were assigned by a clinician reviewer and considered the gold standard. The LLM was prompted to label incident reports relying solely on its pretrained knowledge and the information included in the prompt. The primary outcomes were sensitivity, specificity, positive predictive value, and negative predictive value. A secondary outcome was the human-perceived quality of the model's justification for the label(s) applied.
The LLM demonstrated the ability to label incident reports with high sensitivity and specificity. The model applied a total of 79 labels compared to the reviewer's 49 labels. Overall sensitivity for the model was 85.7%, and specificity was 97.9%. Positive and negative predictive values were 53.2% and 99.6%, respectively. For 60.8% of labels, the reviewer approved of the model's justification for applying the label.
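The reported percentages can be checked against the label counts with a short calculation. The sketch below is illustrative only and is not the authors' analysis code. The true-positive count of 42 is inferred from the reported sensitivity (85.7% of 49 reviewer labels), and the true-negative count of 1,725 is an assumption back-calculated from the reported specificity. Neither count is stated in the abstract.

```python
def diagnostic_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Return the four standard metrics from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # share of reviewer labels the model found
        "specificity": tn / (tn + fp),  # share of non-labels the model left blank
        "ppv": tp / (tp + fp),          # share of model labels the reviewer agreed with
        "npv": tn / (tn + fn),          # share of model blanks that were truly blank
    }


# Counts stated in the abstract.
model_labels = 79      # labels applied by the LLM
reviewer_labels = 49   # gold-standard labels applied by the clinician reviewer

# Inferred or assumed counts (NOT stated in the abstract).
tp = 42                     # inferred: 85.7% of 49 rounds to 42
fp = model_labels - tp      # 37 model labels without a matching reviewer label
fn = reviewer_labels - tp   # 7 reviewer labels the model missed
tn = 1725                   # assumption: back-calculated from 97.9% specificity

metrics = diagnostic_metrics(tp, fp, fn, tn)
for name, value in metrics.items():
    print(f"{name}: {value * 100:.1f}%")
```

Under these assumed counts the calculation reproduces all four reported values. It also shows why the positive predictive value is low despite high specificity: the model applied 37 labels that the reviewer did not, which is nearly half of the 79 labels it applied, while true negatives dominate the denominator for specificity.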
The proprietary LLM demonstrated the ability to label obstetric incident reports with high sensitivity and specificity. LLMs offer the potential to enable more efficient use of data from incident reporting systems.