Vafaei Sadr Alireza, Li Jiang, Hwang Wenke, Yeasin Mohammed, Wang Ming, Lehmann Harold, Zand Ramin, Abedi Vida
Department of Public Health Sciences, College of Medicine, Pennsylvania State University, Hershey, PA, USA.
Département de Physique Théorique and Center for Astroparticle Physics, University of Geneva, Geneva, Switzerland.
Sci Rep. 2025 May 17;15(1):17176. doi: 10.1038/s41598-025-02276-5.
Missing data in electronic health records (EHRs) poses a significant challenge for analysis. This study introduces Pympute, a Python package designed for efficient and robust missing value imputation in EHRs. Pympute's core algorithm, Flexible, selects the best-performing imputation method for each variable based on that variable's characteristics. We benchmark ten existing machine learning imputation algorithms against Flexible on real-world EHR datasets containing laboratory measurements. Pympute also supports data simulation, generating datasets that mimic real-world data distributions so that imputation performance can be evaluated under controlled conditions. Finally, we investigate how missingness and skewness influence which imputation algorithm Flexible selects. Our findings show that Flexible significantly improves imputation performance compared with single-model approaches. Notably, data simulated solely from the covariance structure do not accurately reproduce the selection behavior observed on real-world data. Furthermore, skewness in the data distribution prompts Flexible to favor nonlinear imputation models. This study highlights the importance of considering data distribution patterns when selecting imputation algorithms. Pympute addresses this challenge by offering a versatile and user-friendly solution for diverse EHR data scenarios.
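The per-variable selection idea behind Flexible can be illustrated with a minimal sketch. This is not Pympute's API or the authors' implementation: every name below (`select_imputer`, `impute_flexible`, `CANDIDATES`, the `helper` column, the toy lab columns) is hypothetical. Three simple candidates (mean, median, one-predictor least squares) stand in for the machine learning models in the paper, and a single fully observed helper column is assumed as the only predictor. For each variable, a fraction of the observed values is held out, every candidate is scored by RMSE on that hold-out, and the winner is refit on all observed values to fill the gaps.

```python
import math
import random
import statistics


# Candidate imputers (illustrative stand-ins, not the models used in the paper).
# Each takes observed (x, y) pairs and returns a function x -> predicted y.
def fit_mean(xs, ys):
    m = statistics.fmean(ys)
    return lambda x: m


def fit_median(xs, ys):
    m = statistics.median(ys)
    return lambda x: m


def fit_linear(xs, ys):
    # Ordinary least squares with a single predictor.
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        return lambda x: my
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return lambda x: my + slope * (x - mx)


CANDIDATES = {"mean": fit_mean, "median": fit_median, "linear": fit_linear}


def select_imputer(xs, ys, holdout_frac=0.25, seed=0):
    """Score every candidate on held-out observed values of one variable."""
    idx = list(range(len(ys)))
    random.Random(seed).shuffle(idx)
    n_val = max(1, int(len(idx) * holdout_frac))
    val, train = idx[:n_val], idx[n_val:]
    scores = {}
    for name, fit in CANDIDATES.items():
        predict = fit([xs[i] for i in train], [ys[i] for i in train])
        sq_err = [(predict(xs[i]) - ys[i]) ** 2 for i in val]
        scores[name] = math.sqrt(statistics.fmean(sq_err))
    best = min(scores, key=scores.get)  # lowest hold-out RMSE wins
    return best, scores


def impute_flexible(helper, columns):
    """Choose one imputer per column, then fill that column's missing entries."""
    chosen, filled = {}, {}
    for name, col in columns.items():
        observed = [i for i, v in enumerate(col) if v is not None]
        xs = [helper[i] for i in observed]
        ys = [col[i] for i in observed]
        best, _ = select_imputer(xs, ys)
        predict = CANDIDATES[best](xs, ys)  # refit on all observed values
        filled[name] = [
            v if v is not None else predict(helper[i]) for i, v in enumerate(col)
        ]
        chosen[name] = best
    return chosen, filled


# Toy data (made up): lab_a depends linearly on the helper, lab_b is skewed.
helper = [float(i) for i in range(20)]
lab_a = [2.0 * x + 1.0 for x in helper]
lab_b = [1.0, 1.2, 0.9, 1.1, 30.0, 1.0, 1.3, 0.8, 1.1, 1.0,
         1.2, 45.0, 0.9, 1.0, 1.1, 1.2, 0.9, 1.0, 1.1, 1.0]
for i in (3, 11):
    lab_a[i] = None
for i in (5, 14):
    lab_b[i] = None

chosen, filled = impute_flexible(helper, {"lab_a": lab_a, "lab_b": lab_b})
print(chosen)
```

Because each column is scored independently, different columns can end up with different imputers, which is the behavior the abstract attributes to Flexible. A realistic version would use multivariate and nonlinear models and repeated hold-out splits rather than the single split used here.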