Characterizing and Managing Missing Structured Data in Electronic Health Records: Data Analysis.

Authors

Beaulieu-Jones Brett K, Lavage Daniel R, Snyder John W, Moore Jason H, Pendergrass Sarah A, Bauer Christopher R

Affiliations

Genomics and Computational Biology Graduate Group, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, United States.

Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, PA, United States.

Publication information

JMIR Med Inform. 2018 Feb 23;6(1):e11. doi: 10.2196/medinform.8960.

DOI:10.2196/medinform.8960

PMID:29475824

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC5845101/

Abstract

BACKGROUND

Missing data is a challenge for all studies; however, this is especially true for electronic health record (EHR)-based analyses. Failure to appropriately consider missing data can lead to biased results. While there has been extensive theoretical work on imputation, and many sophisticated methods are now available, it remains quite challenging for researchers to implement these methods appropriately. Here, we provide detailed procedures for when and how to conduct imputation of EHR laboratory results.

OBJECTIVE

The objective of this study was to demonstrate how the mechanism of missingness can be assessed, evaluate the performance of a variety of imputation methods, and describe some of the most frequent problems that can be encountered.

METHODS

We analyzed clinical laboratory measures from 602,366 patients in the EHR of Geisinger Health System in Pennsylvania, USA. Using these data, we constructed a representative set of complete cases and assessed the performance of 12 different imputation methods for missing data that was simulated based on 4 mechanisms of missingness (missing completely at random, missing not at random, missing at random, and real data modelling).
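The simulate-then-impute evaluation described above can be illustrated with a minimal sketch. It is not the authors' code: the synthetic two-column data, the function names, the 30% missingness rate, and the use of simple mean imputation as the method under test are all assumptions made for illustration. It covers three of the four mechanisms (MCAR, MAR, MNAR) and leaves out the real data modelling approach.

```python
import math
import random


def simulate_missing(rows, mechanism, rate=0.3, seed=0):
    """Mask the second value of each (x, y) pair as None.

    MCAR: every y is dropped with the same probability.
    MAR:  the drop probability depends on the always-observed x.
    MNAR: the drop probability depends on the value of y itself.
    (Hypothetical helper; thresholds and rates are illustrative.)
    """
    rng = random.Random(seed)
    x_median = sorted(x for x, _ in rows)[len(rows) // 2]
    y_median = sorted(y for _, y in rows)[len(rows) // 2]
    masked = []
    for x, y in rows:
        if mechanism == "MCAR":
            p = rate
        elif mechanism == "MAR":
            p = 1.5 * rate if x > x_median else 0.5 * rate
        elif mechanism == "MNAR":
            p = 1.5 * rate if y > y_median else 0.5 * rate
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        masked.append((x, None if rng.random() < p else y))
    return masked


def mean_impute(rows):
    """Fill every missing y with the mean of the observed y values."""
    observed = [y for _, y in rows if y is not None]
    mean_y = sum(observed) / len(observed)
    return [(x, mean_y if y is None else y) for x, y in rows]


def rmse_on_masked(truth, masked, imputed):
    """Root mean squared error, computed only where a value was masked."""
    errors = [
        (t[1] - i[1]) ** 2
        for t, m, i in zip(truth, masked, imputed)
        if m[1] is None
    ]
    return math.sqrt(sum(errors) / len(errors))


# Synthetic "complete cases": y is correlated with x (assumed data).
data_rng = random.Random(42)
complete = []
for _ in range(2000):
    x = data_rng.gauss(0.0, 1.0)
    complete.append((x, 0.8 * x + data_rng.gauss(0.0, 0.6)))

results = {}
for mechanism in ("MCAR", "MAR", "MNAR"):
    masked = simulate_missing(complete, mechanism)
    imputed = mean_impute(masked)
    results[mechanism] = rmse_on_masked(complete, masked, imputed)
    print(mechanism, round(results[mechanism], 3))
```

Because the true values are known for the complete cases, the error of any imputation method can be measured directly on the masked entries. Swapping `mean_impute` for another method gives a like-for-like comparison under each mechanism.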

RESULTS

Our results showed that several methods, including variations of Multivariate Imputation by Chained Equations (MICE) and softImpute, consistently imputed missing values with low error; however, only a subset of the MICE methods was suitable for multiple imputation.
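The abstract does not say why only some methods were suitable for multiple imputation. As background, multiple imputation usually combines the analyses of several imputed datasets with Rubin's rules, which is only informative when the imputations differ from one another. The sketch below shows that pooling step under this assumption; the function name and the example numbers are hypothetical.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool m per-dataset estimates with Rubin's rules.

    estimates: point estimate from each imputed dataset
    variances: squared standard error from each imputed dataset
    Returns (pooled estimate, total variance).
    """
    m = len(estimates)
    q_bar = mean(estimates)      # pooled point estimate
    within = mean(variances)     # average within-imputation variance
    between = variance(estimates)  # between-imputation variance (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return q_bar, total


# Imputations that vary contribute between-imputation variance.
varied = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])

# Identical imputations add nothing beyond the within-imputation variance.
identical = pool_rubin([2.0, 2.0, 2.0], [0.5, 0.5, 0.5])

print(varied, identical)
```

If a method returns the same imputed values on every run, the between-imputation term is zero and the pooled variance does not reflect the uncertainty introduced by the missing data.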

CONCLUSIONS

The analyses we describe provide an outline of considerations for dealing with missing EHR data, steps that researchers can perform to characterize missingness within their own data, and an evaluation of methods that can be applied to impute clinical data. While the performance of methods may vary between datasets, the process we describe can be generalized to the majority of structured data types that exist in EHRs, and all of our methods and code are publicly available.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1998/5845101/8d75eb3c1c48/medinform_v6i1e11_fig1.jpg
