Zhou Xinyu, Dhingra Lovedeep Singh, Aminorroaya Arya, Adejumo Philip, Khera Rohan
Department of Biostatistics, Yale School of Public Health, New Haven, CT, USA.
Yale School of Medicine, New Haven, CT, USA.
AMIA Annu Symp Proc. 2025 May 22;2024:1332-1339. eCollection 2024.
Mapping electronic health record (EHR) data to common data models (CDMs) standardizes clinical records, improving interoperability and enabling large-scale, multicenter clinical investigations. Using two large publicly available datasets, we developed transformer-based natural language processing models to map medication-related concepts from the EHR of a large and diverse healthcare system to standard concepts in the OMOP CDM. We validated the model outputs against standard concepts manually mapped by clinicians. Our best model reached out-of-the-box accuracies of 96.5% in mapping the 200 most common drugs and 83.0% in mapping 200 random drugs in the EHR. On these tasks, it outperformed a state-of-the-art large language model (SFR-Embedding-Mistral; 89.5% and 66.5% accuracy on the two tasks, respectively), a widely used schema-mapping tool (Usagi; 90.0% and 70.0%), and direct string matching (7.5% and 7.5%). Transformer-based deep learning models outperform existing approaches in the standardized mapping of EHR elements and can facilitate an end-to-end automated EHR transformation pipeline.
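The evaluation described above compares mappers by top-1 accuracy against a clinician-mapped gold standard. A minimal sketch of that comparison follows, contrasting direct string matching with nearest-neighbor mapping by cosine similarity. It is an illustration under stated assumptions, not the authors' implementation: the character-trigram vectorizer is a toy stand-in for a transformer encoder, and all function names, drug strings, and gold mappings are hypothetical and not taken from the study data.

```python
from collections import Counter
from math import sqrt


def trigram_vector(text):
    """Toy stand-in for a transformer embedding: character-trigram counts."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def map_by_similarity(source_name, standard_concepts):
    """Return the standard concept whose vector is closest to the source name."""
    src = trigram_vector(source_name)
    return max(standard_concepts, key=lambda c: cosine(src, trigram_vector(c)))


def map_by_exact_match(source_name, standard_concepts):
    """Baseline: case-insensitive direct string match, or None if no match."""
    lookup = {c.lower(): c for c in standard_concepts}
    return lookup.get(source_name.lower())


def top1_accuracy(mapper, gold, standard_concepts):
    """Fraction of source names whose predicted concept equals the gold concept."""
    hits = sum(mapper(src, standard_concepts) == target for src, target in gold.items())
    return hits / len(gold)


# Hypothetical target vocabulary and gold mappings (illustrative only).
STANDARD = [
    "metformin 500 MG Oral Tablet",
    "lisinopril 10 MG Oral Tablet",
    "atorvastatin 20 MG Oral Tablet",
    "aspirin 81 MG Oral Tablet",
]
GOLD = {
    "METFORMIN 500 MG TABLET": "metformin 500 MG Oral Tablet",
    "LISINOPRIL 10 MG TABLET": "lisinopril 10 MG Oral Tablet",
    "atorvastatin 20 mg oral tablet": "atorvastatin 20 MG Oral Tablet",
    "Aspirin 81 mg tab": "aspirin 81 MG Oral Tablet",
}

if __name__ == "__main__":
    print("exact match:", top1_accuracy(map_by_exact_match, GOLD, STANDARD))
    print("similarity: ", top1_accuracy(map_by_similarity, GOLD, STANDARD))
```

Swapping `trigram_vector` for a real sentence encoder would leave the mapping and scoring logic unchanged, which is why the vectorizer is kept as a separate function here.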