Choi Dong Hyun, Kim Yoonjic, Choi Sae Won, Kim Ki Hong, Choi Yeongho, Shin Sang Do
Department of Biomedical Engineering, Seoul National University College of Medicine, Seoul, Korea.
Laboratory of Emergency Medical Services, Seoul National University Hospital Biomedical Research Institute, Seoul, Korea.
J Korean Med Sci. 2024 Dec 2;39(46):e291. doi: 10.3346/jkms.2024.39.e291.
Injuries pose a significant global health challenge due to their high incidence and mortality rates. Although injury surveillance is essential for prevention, it is resource-intensive. This study aimed to develop and validate locally deployable large language models (LLMs) to extract core injury-related information from Emergency Department (ED) clinical notes.
We conducted a diagnostic study using data collected retrospectively between January 2014 and December 2020 at two urban academic tertiary hospitals. One hospital served as the derivation cohort and the other as the external test cohort. Adult patients presenting to the ED with injury-related complaints were included. Primary outcomes were the classification accuracies for five information extraction tasks: injury mechanism, place of occurrence, activity, intent, and severity. To extract this information from initial ED physician notes, we fine-tuned a single generalizable Llama-2 model covering all five tasks and five distinct Bidirectional Encoder Representations from Transformers (BERT) models, one per task. The Llama-2 model switched between tasks through changes to its instruction prompt. Data recorded in injury registries provided the gold standard labels. Model performance was assessed using accuracy and macro-average F1 scores.
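As an illustration of the two evaluation metrics named above, the following minimal sketch computes accuracy and macro-average F1 for a single classification task. The function names and the toy intent labels are hypothetical and are not taken from the study; the sketch assumes the common convention of averaging per-class F1 over every label that appears in either the gold standard or the predictions.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of notes whose predicted label matches the registry label."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (macro-average)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    # Assumption: average over labels seen in either the gold or predicted set.
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example with made-up labels for the "intent" task (not study data).
gold = ["unintentional", "unintentional", "assault", "self-harm"]
pred = ["unintentional", "unintentional", "unintentional", "self-harm"]
acc = accuracy(gold, pred)
f1 = macro_f1(gold, pred)
print(f"accuracy={acc:.3f} macro_f1={f1:.3f}")
```

Because macro-averaging weights every class equally, a rare category that is always missed (here, "assault") lowers the macro-average F1 much more than it lowers accuracy, which is one reason the two metrics are often reported together.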
The derivation and external test cohorts comprised 36,346 and 32,232 patients, respectively. In the derivation cohort's test set, the Llama-2 model achieved accuracies (95% confidence intervals) of 0.899 (0.889-0.909) for injury mechanism, 0.774 (0.760-0.789) for place of occurrence, 0.679 (0.665-0.694) for activity, 0.972 (0.967-0.977) for intent, and 0.935 (0.926-0.943) for severity. The Llama-2 model outperformed the BERT models in accuracy and macro-average F1 scores across all tasks in both cohorts. Imposing constraints on the Llama-2 model to avoid uncertain predictions further improved its accuracy.
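The abstract does not describe how uncertain predictions were avoided. One common approach, shown here only as an assumed sketch, is selective prediction: the model abstains whenever a confidence score falls below a threshold, and accuracy is then reported on the retained cases together with the share of cases retained (coverage). The function name, the confidence scores, and the threshold values are all hypothetical.

```python
def selective_accuracy(y_true, y_pred, confidence, threshold):
    """Return (accuracy on retained predictions, coverage).

    Assumption: each prediction carries a confidence score in [0, 1];
    predictions with a score below `threshold` are treated as abstentions.
    Accuracy is None when every prediction is withheld.
    """
    if not (len(y_true) == len(y_pred) == len(confidence)) or not y_true:
        raise ValueError("inputs must be non-empty and equal in length")
    kept = [(t, p) for t, p, c in zip(y_true, y_pred, confidence) if c >= threshold]
    coverage = len(kept) / len(y_true)
    if not kept:
        return None, coverage
    acc = sum(t == p for t, p in kept) / len(kept)
    return acc, coverage


# Toy example with made-up injury-mechanism labels and scores (not study data).
gold = ["fall", "fall", "traffic", "burn", "fall"]
pred = ["fall", "traffic", "traffic", "burn", "burn"]
conf = [0.95, 0.55, 0.90, 0.85, 0.40]

for thr in (0.0, 0.8):
    acc, cov = selective_accuracy(gold, pred, conf, thr)
    print(f"threshold={thr:.1f} accuracy={acc:.2f} coverage={cov:.2f}")
```

In this toy data, raising the threshold removes the two low-confidence errors, so accuracy on the retained notes rises while coverage falls. The withheld notes would still need another route, such as manual review, which is the trade-off any abstention scheme makes.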
Locally deployable LLMs trained to extract core injury-related information from free-text ED clinical notes performed well, with Llama-2 accuracies in the derivation cohort's test set ranging from 0.679 for activity to 0.972 for intent. Generative LLMs can serve as versatile solutions for various injury-related information extraction tasks.