Asgari Elham, Montaña-Brown Nina, Dubois Magda, Khalil Saleh, Balloch Jasmine, Yeung Joshua Au, Pimenta Dominic
Tortus AI, London, UK.
Guy's and St Thomas' NHS Trust, London, UK.
NPJ Digit Med. 2025 May 13;8(1):274. doi: 10.1038/s41746-025-01670-7.
Integrating large language models (LLMs) into healthcare can enhance workflow efficiency and patient care by automating tasks such as summarising consultations. However, fidelity between LLM outputs and ground truth information is vital to prevent miscommunication that could compromise patient safety. We propose a framework comprising (1) an error taxonomy for classifying LLM outputs, (2) an experimental structure for iterative comparisons in our LLM document generation pipeline, (3) a clinical safety framework for evaluating the harms of errors, and (4) a graphical user interface, CREOLA, to facilitate these processes. Our clinical error metrics were derived from 18 experimental configurations involving LLMs for clinical note generation, covering 12,999 clinician-annotated sentences. We observed a 1.47% hallucination rate and a 3.45% omission rate. By refining prompts and workflows, we reduced major errors below previously reported human note-taking rates, highlighting the framework's potential for safer clinical documentation.
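For illustration only (this sketch is not from the paper): one way to derive per-category error rates of the kind reported above is to count clinician annotation labels over the evaluated sentences. The label names, counts, and the error_rates function below are hypothetical; the split shown is chosen merely so the totals reproduce the reported 1.47% and 3.45% figures over 12,999 sentences.

    # Hypothetical sketch: computing per-category error rates from
    # clinician sentence annotations. Labels and counts are illustrative,
    # not taken from the paper's data.
    from collections import Counter

    def error_rates(labels: list[str]) -> dict[str, float]:
        """Fraction of annotated sentences falling in each error category."""
        counts = Counter(labels)
        total = len(labels)
        return {cat: counts[cat] / total for cat in ("hallucination", "omission")}

    # 191/12,999 ~= 1.47% hallucinations; 448/12,999 ~= 3.45% omissions.
    annotations = (["correct"] * 12360
                   + ["hallucination"] * 191
                   + ["omission"] * 448)
    print(error_rates(annotations))
    # {'hallucination': 0.0147..., 'omission': 0.0344...}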