Department of Applied Artificial Intelligence, Hanyang University ERICA, Ansan, Republic of Korea.
Department of Emergency Medicine, College of Medicine, Hanyang University, Seoul, Republic of Korea.
Am J Emerg Med. 2024 Mar;77:29-38. doi: 10.1016/j.ajem.2023.11.063. Epub 2023 Dec 10.
OBJECTIVE: Manual recording of electronic health records (EHRs) by clinicians in the emergency department (ED) is time-consuming and challenging. In light of recent advances in large language models (LLMs) such as GPT and BERT, this study aimed to design and validate LLMs for automatic clinical diagnosis. The models were designed to identify 12 medical symptoms and 2 patient histories from simulated clinician-patient conversations spanning 6 primary-symptom scenarios in emergency triage rooms.

MATERIALS AND METHODS: We developed classification models by fine-tuning BERT, a transformer-based pre-trained model. We then analyzed these models with explainable artificial intelligence (XAI) using the Shapley additive explanations (SHAP) method. A Turing test was conducted to assess the reliability of the XAI results by comparing them with the outcomes of the same tasks performed and explained by medical workers. An emergency medicine specialist evaluated the results of both the XAI and the medical workers.

RESULTS: We fine-tuned four pre-trained LLMs and compared their classification performance. The KLUE-RoBERTa-based model performed best (F1-score: 0.965, AUROC: 0.893) on human-transcribed script data. Across 15 samples, the SHAP-based XAI explanations showed an average Jaccard similarity of 0.722 with the explanations of medical workers. The Turing test revealed a small 6% gap, with XAI and medical workers receiving mean scores of 3.327 and 3.52, respectively.

CONCLUSION: This paper highlights the potential of LLMs for automatic EHR recording in Korean EDs. The KLUE-RoBERTa-based model demonstrated superior classification performance, and XAI using SHAP provided reliable explanations for model outputs, whose reliability was confirmed by a Turing test.
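The paper does not publish code, but the fine-tuning setup the abstract describes can be sketched as a multi-label classifier over the public klue/roberta-base checkpoint using Hugging Face transformers. The checkpoint choice, label indices, example utterance, and training details below are illustrative assumptions, not the authors' published configuration.

```python
# Minimal sketch: fine-tuning KLUE-RoBERTa for multi-label classification of
# 12 medical symptoms + 2 patient histories (14 binary labels). Checkpoint,
# label indices, and the example utterance are assumptions for illustration.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

NUM_LABELS = 14  # 12 symptoms + 2 patient histories

tokenizer = AutoTokenizer.from_pretrained("klue/roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "klue/roberta-base",
    num_labels=NUM_LABELS,
    problem_type="multi_label_classification",  # uses BCEWithLogitsLoss
)

# One simulated triage conversation paired with its 14-dim binary label vector.
text = "환자: 어제부터 가슴이 답답하고 숨이 차요."  # hypothetical utterance
labels = torch.zeros(NUM_LABELS)
labels[[0, 3]] = 1.0  # e.g., chest pain + dyspnea present (indices assumed)

enc = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
out = model(**enc, labels=labels.unsqueeze(0))  # float labels for BCE loss
out.loss.backward()  # one illustrative gradient step inside a training loop
```

Setting problem_type="multi_label_classification" applies an independent sigmoid per label, so each symptom or history is predicted separately, which is consistent with the per-label F1 and AUROC evaluation reported above.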
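For the XAI comparison, a hedged sketch of the pipeline the abstract implies is shown below: SHAP token attributions for a fine-tuned classifier, followed by a Jaccard similarity between the top-attributed tokens and tokens a clinician highlighted. The model path, label index, token sets, and top-k cutoff are all assumptions.

```python
# Sketch: SHAP token attributions for a text classifier, compared with a
# clinician's highlighted tokens via Jaccard similarity. Model path, label
# index, token sets, and the top-k cutoff are illustrative assumptions.
import numpy as np
import shap
from transformers import pipeline

clf = pipeline(
    "text-classification",
    model="path/to/finetuned-klue-roberta",  # hypothetical checkpoint
    top_k=None,  # return scores for every label
)
explainer = shap.Explainer(clf)  # text masker inferred from the pipeline
sv = explainer(["환자: 어제부터 가슴이 답답하고 숨이 차요."])

label_idx = 0  # e.g., the "chest pain" output (index assumed)
tokens = sv.data[0]                  # tokenized input text
scores = sv.values[0][:, label_idx]  # per-token attribution for one label
xai_tokens = {tokens[i].strip() for i in np.argsort(scores)[-5:]}  # top-5

def jaccard(a: set, b: set) -> float:
    """|A ∩ B| / |A ∪ B|, defined as 0 when both sets are empty."""
    return len(a & b) / len(a | b) if (a | b) else 0.0

clinician_tokens = {"가슴", "답답", "숨"}  # hypothetical clinician annotation
print(jaccard(xai_tokens, clinician_tokens))
```

The reported 0.722 figure would then be the mean of this Jaccard score over the 15 evaluated samples; the per-sample token extraction above is one plausible way to operationalize that comparison.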