Luo Jiawei, Huang Shixin, Lan Lan, Yang Shu, Cao Tingqian, Yin Jin, Qiu Jiajun, Yang Xiaoyan, Guo Yingqiang, Zhou Xiaobo
Department of Cardiovascular Surgery and West China Biomedical Big Data Center, West China Hospital/West China School of Medicine, Sichuan University, Chengdu, Sichuan, 610041, China; Med-X Center for Informatics, Sichuan University, Chengdu, 610041, China.
Department of Scientific Research, The People's Hospital of Yubei District of Chongqing, Chongqing, 401120, China; School of Communications and Information Engineering, Chongqing University of Posts and Telecommunications, Chongqing, 400065, China.
Comput Methods Programs Biomed. 2025 Feb;259:108521. doi: 10.1016/j.cmpb.2024.108521. Epub 2024 Nov 24.
Longitudinal data from Electronic Medical Records (EMRs) are increasingly used to build predictive models for a range of clinical tasks, offering deeper insight into patient health. However, studies differ substantially in how they preprocess irregular, complex EMR data, because no universally accepted tools or standardization methods exist. This study introduces the Electronic Medical Record Longitudinal Irregular Data Preprocessing (EMR-LIP) framework, a lightweight approach to preprocessing longitudinal, irregular EMR data that aims to improve research efficiency, consistency, reproducibility, and comparability.
EMR-LIP modularizes the preprocessing of longitudinal irregular EMR data and offers tools with a low level of encapsulation. Compared with other pipelines, EMR-LIP categorizes variables at a finer granularity and applies a preprocessing technique designed specifically for each type. To demonstrate its versatility, EMR-LIP was applied in an empirical study to two public EMR databases, MIMIC-IV and eICU-CRD. The data processed with EMR-LIP were then used to evaluate several well-known deep learning models on a range of commonly used benchmark tasks.
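As an illustration of what type-specific preprocessing of irregular longitudinal data can look like, the following is a minimal sketch and not the EMR-LIP implementation. It assumes events arrive as (hours since admission, variable, value) tuples, that numeric variables are averaged within each resampling interval, that categorical variables keep their last observed value, and that gaps are forward-filled. The function name `resample_stay`, its parameters, and the example variables are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def resample_stay(events, interval_hours, n_bins, numeric_vars, categorical_vars):
    """Resample irregular events onto a fixed grid of n_bins intervals.

    events: iterable of (hours_since_admission, variable, value) tuples.
    Returns a list of n_bins dicts, one per interval.
    """
    # Group observations by (interval index, variable), in time order.
    binned = defaultdict(list)
    for t, var, value in sorted(events, key=lambda e: e[0]):
        idx = int(t // interval_hours)
        if 0 <= idx < n_bins:  # drop events outside the observation window
            binned[(idx, var)].append(value)

    grid = []
    last_seen = {}  # most recent aggregated value per variable (forward fill)
    for idx in range(n_bins):
        row = {}
        for var in numeric_vars:
            values = binned.get((idx, var))
            if values:
                last_seen[var] = mean(values)  # assumed rule: mean per interval
            row[var] = last_seen.get(var)      # None until first observation
        for var in categorical_vars:
            values = binned.get((idx, var))
            if values:
                last_seen[var] = values[-1]    # assumed rule: last value wins
            row[var] = last_seen.get(var)
        grid.append(row)
    return grid


# Hypothetical example: one stay, 1-hour intervals, 4 intervals.
events = [
    (0.2, "heart_rate", 80),
    (0.7, "heart_rate", 90),
    (0.5, "rhythm", "sinus"),
    (2.4, "heart_rate", 100),
    (2.9, "rhythm", "afib"),
]
grid = resample_stay(events, 1, 4, ["heart_rate"], ["rhythm"])
```

Changing `interval_hours` in this sketch corresponds to choosing a different resampling interval; the aggregation and filling rules per variable type are the assumptions most likely to differ in a real pipeline.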
In both the MIMIC-IV and eICU-CRD databases, models based on EMR-LIP showed superior baseline performance compared with previous studies. Notably, on data preprocessed by EMR-LIP, traditional models such as LSTM and GRU outperformed more complex models, achieving an AUROC of up to 0.94 for in-hospital death prediction. Models based on EMR-LIP also performed stably across various resampling intervals and showed fairer performance across different ethnic groups.
EMR-LIP streamlines the preprocessing of irregular longitudinal EMR data, offering an end-to-end solution for model-ready data creation, and has been open-sourced for collaborative refinement by the research community.