Kim Junmo, Kim Joo Seong, Lee Ji-Hyang, Kim Min-Gyu, Kim Taehyun, Cho Chaeeun, Park Rae Woong, Kim Kwangsoo
Interdisciplinary Program in Bioengineering, Seoul National University, Seoul, Republic of Korea.
Division of Gastroenterology, Department of Internal Medicine, Dongguk University Ilsan Hospital, Dongguk University College of Medicine, Goyang, Republic of Korea.
Commun Med (Lond). 2025 Jun 13;5(1):232. doi: 10.1038/s43856-025-00914-7.
Pretraining language models on electronic health record (EHR) data has improved performance across a range of medical tasks. Despite this potential, pretrained EHR models have not yet been explored for predicting adverse drug events (ADEs).
We used Observational Medical Outcomes Partnership (OMOP) common data model (CDM)-based EHR data from Seoul National University Hospital (SNUH) between January 2001 and December 2023 and from Ajou University Medical Center (AUMC) between January 2004 and December 2023. In total, 510,879 and 419,505 adult inpatients from SNUH and AUMC were included in the internal and external datasets, respectively. For pretraining, the model was trained to infer randomly masked tokens from the preceding and following history. In this process, we introduced domain embedding (DE) to provide information about the domain of each masked token, preventing the model from predicting codes from irrelevant domains. For qualitative analysis, we identified important features using the attention matrix from each finetuned model.
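As a rough illustration of the pretraining setup described above, the following Python (PyTorch) sketch adds a domain embedding to the token and position embeddings of a masked-token model over EHR codes. This is a minimal, hypothetical sketch, not the authors' implementation: the vocabulary size, number of domains, model dimensions, and module names are illustrative assumptions.

# Minimal sketch of masked-token pretraining with a domain embedding (DE).
# All sizes and names are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn

class EHRMaskedLM(nn.Module):
    def __init__(self, vocab_size=20000, n_domains=6, d_model=256,
                 n_heads=4, n_layers=4, max_len=512):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        # Domain embedding: indicates the domain (e.g. drug, condition,
        # measurement) of every token, including the masked ones.
        self.domain_emb = nn.Embedding(n_domains, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.mlm_head = nn.Linear(d_model, vocab_size)

    def forward(self, token_ids, domain_ids):
        # token_ids, domain_ids: (batch, seq_len); masked positions keep their
        # true domain id, so the model knows which domain the masked code is from.
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        x = self.token_emb(token_ids) + self.pos_emb(positions) + self.domain_emb(domain_ids)
        h = self.encoder(x)
        return self.mlm_head(h)  # per-position logits over the code vocabulary

# Pretraining objective: mask ~15% of positions and compute cross-entropy
# only there (non-masked targets set to -100 are ignored).
model = EHRMaskedLM()
tokens = torch.randint(1, 20000, (2, 128))
domains = torch.randint(0, 6, (2, 128))
MASK_ID, IGNORE = 0, -100
mask = torch.rand(tokens.shape) < 0.15
inputs = tokens.masked_fill(mask, MASK_ID)
targets = tokens.masked_fill(~mask, IGNORE)
logits = model(inputs, domains)
loss = nn.functional.cross_entropy(logits.view(-1, 20000), targets.view(-1), ignore_index=IGNORE)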
Here we show that EHR pretraining models with DE outperform models without pretraining and without DE in predicting various ADEs, with average areas under the receiver operating characteristic curve (AUROC) of 0.958 and 0.964 in internal and external validation, respectively. In the feature importance analysis, we demonstrate that the results are consistent with previously reported clinical knowledge. In addition to cohort-level interpretation, patient-level interpretation is also available.
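One plausible way to aggregate attention-based feature importance at the cohort level, sketched below under stated assumptions, is to average the attention each input code receives across heads and patients in the finetuned model and rank codes by that score. The return_attention flag and id_to_code mapping are hypothetical and may differ from the paper's exact procedure.

# Illustrative sketch (an assumption, not the paper's exact method) of
# cohort-level feature importance from a finetuned model's attention matrix.
import torch
from collections import defaultdict

def cohort_feature_importance(model, batches, id_to_code):
    scores, counts = defaultdict(float), defaultdict(int)
    model.eval()
    with torch.no_grad():
        for token_ids, domain_ids in batches:
            # Assumes the model can return attention weights per layer with
            # shape (batch, heads, seq_len, seq_len); this interface is hypothetical.
            _, attentions = model(token_ids, domain_ids, return_attention=True)
            attn = attentions[-1].mean(dim=1)   # last layer, averaged over heads
            received = attn.mean(dim=1)         # attention each token receives
            for ids, weights in zip(token_ids, received):
                for tid, w in zip(ids.tolist(), weights.tolist()):
                    scores[id_to_code[tid]] += w
                    counts[id_to_code[tid]] += 1
    # Rank codes by their mean received attention across the cohort.
    return sorted(((c, scores[c] / counts[c]) for c in scores),
                  key=lambda x: x[1], reverse=True)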
The CDM-based EHR pretraining model with DE improves prediction performance for various ADEs and provides appropriate explanations at both the cohort and patient levels. Our model has the potential to serve as a foundation model owing to its strong predictive performance, interpretability, and compatibility.