Performance of Multiple Imputation Using Modern Machine Learning Methods in Electronic Health Records Data.

Author information

From the Department of Biostatistics and Epidemiology, School of Public Health, Rutgers University, Piscataway, NJ.

Department of Biostatistics, Epidemiology & Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA.

Publication information

Epidemiology. 2023 Mar 1;34(2):206-215. doi: 10.1097/EDE.0000000000001578. Epub 2022 Dec 9.

Abstract

BACKGROUND

Missing data are common in studies using electronic health record (EHR)-derived data. Missingness in EHR data is related to healthcare utilization patterns, which produces complex missingness mechanisms in which data are potentially missing not at random. Prior research has suggested that machine learning-based multiple imputation methods may outperform traditional methods and may perform well even when data are missing not at random.

METHODS

We used plasmode simulations based on a nationwide, de-identified, EHR-derived database of patients with metastatic urothelial carcinoma to compare three multiple imputation approaches: chained equations, random forests, and denoising autoencoders. We assessed the bias and precision of hazard ratio estimates under varying proportions of observations with missing values and under three missingness mechanisms (missing completely at random, missing at random, and missing not at random).
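The three missingness mechanisms named above differ only in what the probability of a value being missing is allowed to depend on. The following is a minimal sketch of how each could be induced in a simulated covariate; it is not the authors' plasmode procedure. The function name `make_missing_mask`, the logistic form, the unit slope, and the standard normal covariates are all illustrative assumptions.

```python
import numpy as np


def make_missing_mask(x, z, mechanism, target_rate, rng):
    """Return a boolean mask (True = value of x is set to missing).

    Hypothetical helper for illustration only.
    x: covariate that will have missing values
    z: a fully observed covariate
    mechanism: "MCAR", "MAR", or "MNAR"
    target_rate: desired average proportion of missing values
    """
    n = x.shape[0]
    if mechanism == "MCAR":
        # Missingness depends on nothing in the data.
        score = np.zeros(n)
    elif mechanism == "MAR":
        # Missingness depends only on an observed covariate.
        score = (z - z.mean()) / z.std()
    elif mechanism == "MNAR":
        # Missingness depends on the value that goes missing.
        score = (x - x.mean()) / x.std()
    else:
        raise ValueError(f"unknown mechanism: {mechanism}")

    # Logistic model with slope 1 (an assumption). The intercept is found
    # by bisection so that the mean missingness probability matches
    # target_rate.
    lo, hi = -20.0, 20.0
    for _ in range(60):
        mid = (lo + hi) / 2.0
        p = 1.0 / (1.0 + np.exp(-(mid + score)))
        if p.mean() < target_rate:
            lo = mid
        else:
            hi = mid
    intercept = (lo + hi) / 2.0
    p = 1.0 / (1.0 + np.exp(-(intercept + score)))
    return rng.random(n) < p


if __name__ == "__main__":
    rng = np.random.default_rng(2023)
    x = rng.normal(size=5000)  # covariate subject to missingness
    z = rng.normal(size=5000)  # fully observed covariate
    for mech in ("MCAR", "MAR", "MNAR"):
        mask = make_missing_mask(x, z, mech, 0.3, rng)
        print(mech, "missing:", round(mask.mean(), 3),
              "mean of observed x:", round(x[~mask].mean(), 3))
```

In this sketch the observed values of `x` remain representative of the full sample under MCAR, whereas under MNAR larger values are preferentially removed, so the observed mean shifts and no fully observed variable records why.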

RESULTS

Multiple imputation by chained equations and random forest methods had low bias and similar standard errors for parameter estimates when data were missing completely at random. When data were missing at random, denoising autoencoders had higher bias than multiple imputation by chained equations and random forests. Contrary to results of prior studies of denoising autoencoders, all methods exhibited substantial bias when data were missing not at random, with bias increasing in direct proportion to the amount of missing data.

CONCLUSIONS

We found no advantage of denoising autoencoders for multiple imputation in the setting of an epidemiologic study conducted using EHR data. Results suggested that denoising autoencoders may overfit the data, leading to poor confounder control. Use of more flexible imputation approaches does not mitigate bias induced by missingness not at random and can produce estimates with spurious precision.
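The precision of a multiply imputed estimate depends on how the per-imputation results are combined. The abstract does not state the pooling method used; the sketch below assumes Rubin's rules, the standard approach, applied to log hazard ratios. The function name `pool_rubin` and the input numbers are hypothetical.

```python
import math


def pool_rubin(estimates, variances):
    """Pool m per-imputation estimates with Rubin's rules.

    estimates: point estimates (e.g., log hazard ratios), one per imputed dataset
    variances: squared standard errors of those estimates
    Returns (pooled estimate, pooled standard error).
    """
    m = len(estimates)
    if m < 2 or len(variances) != m:
        raise ValueError("need at least two imputations and matching lengths")
    q_bar = sum(estimates) / m                               # pooled point estimate
    w = sum(variances) / m                                   # within-imputation variance
    b = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)   # between-imputation variance
    t = w + (1.0 + 1.0 / m) * b                              # total variance
    return q_bar, math.sqrt(t)


if __name__ == "__main__":
    # Made-up log hazard ratios and variances from five imputed datasets.
    log_hr = [0.42, 0.38, 0.45, 0.40, 0.41]
    var = [0.010, 0.011, 0.009, 0.010, 0.010]
    est, se = pool_rubin(log_hr, var)
    print("pooled HR:", round(math.exp(est), 3), "SE(log HR):", round(se, 4))
```

Under these rules the pooled standard error grows with the spread of the estimates across imputed datasets. If an imputation model yields nearly identical imputations, the between-imputation term approaches zero and the pooled standard error can look small even when the point estimate is biased.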
