Comparison of random forest and parametric imputation models for imputing missing data using MICE: a CALIBER study.

Publication information

Am J Epidemiol. 2014 Mar 15;179(6):764-74. doi: 10.1093/aje/kwt312. Epub 2014 Jan 12.

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC3939843/

Abstract

Multivariate imputation by chained equations (MICE) is commonly used for imputing missing data in epidemiologic research. The "true" imputation model may contain nonlinearities which are not included in default imputation models. Random forest imputation is a machine learning technique which can accommodate nonlinearities and interactions and does not require a particular regression model to be specified. We compared parametric MICE with a random forest-based MICE algorithm in 2 simulation studies. The first study used 1,000 random samples of 2,000 persons drawn from the 10,128 stable angina patients in the CALIBER database (Cardiovascular Disease Research using Linked Bespoke Studies and Electronic Records; 2001-2010) with complete data on all covariates. Variables were artificially made "missing at random," and the bias and efficiency of parameter estimates obtained using different imputation methods were compared. Both MICE methods produced unbiased estimates of (log) hazard ratios, but random forest was more efficient and produced narrower confidence intervals. The second study used simulated data in which the partially observed variable depended on the fully observed variables in a nonlinear way. Parameter estimates were less biased using random forest MICE, and confidence interval coverage was better. This suggests that random forest imputation may be useful for imputing complex epidemiologic data sets in which some patients have missing data.
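The second simulation study's point, that an imputation model which ignores a nonlinear dependence can perform poorly, can be illustrated with a small sketch. This is not the authors' algorithm and not MICE: it is a single stochastic imputation of one variable, and a nearest-neighbour donor draw is swapped in as a simple nonparametric stand-in for random forest. The quadratic relationship, the missingness probabilities, the sample size, and all function names are assumptions chosen for illustration only.

```python
import heapq
import math
import random
from statistics import fmean


def simulate(n, rng):
    """Simulate x (fully observed) and y = x^2 + noise (partially observed)."""
    x = [rng.uniform(-2.0, 2.0) for _ in range(n)]
    y = [xi * xi + rng.gauss(0.0, 0.3) for xi in x]
    # Missingness depends only on the observed x, so y is missing at random.
    missing = [rng.random() < (0.6 if xi > 0 else 0.1) for xi in x]
    return x, y, missing


def impute_linear(x, y, missing, rng):
    """Stochastic imputation from a (misspecified) linear model of y on x."""
    obs = [(xi, yi) for xi, yi, m in zip(x, y, missing) if not m]
    mx = fmean(p[0] for p in obs)
    my = fmean(p[1] for p in obs)
    sxx = sum((xi - mx) ** 2 for xi, _ in obs)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in obs)
    slope = sxy / sxx
    intercept = my - slope * mx
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in obs)
    sigma = math.sqrt(rss / (len(obs) - 2))  # residual standard deviation
    return [intercept + slope * xi + rng.gauss(0.0, sigma) if m else yi
            for xi, yi, m in zip(x, y, missing)]


def impute_donor(x, y, missing, rng, k=10):
    """Nonparametric imputation: copy y from a random one of the k nearest
    observed neighbours in x (a stand-in for random forest, not the same)."""
    obs = [(xi, yi) for xi, yi, m in zip(x, y, missing) if not m]
    out = []
    for xi, yi, m in zip(x, y, missing):
        if m:
            nearest = heapq.nsmallest(k, obs, key=lambda p: abs(p[0] - xi))
            yi = rng.choice(nearest)[1]
        out.append(yi)
    return out


def mean_abs_error(imputed, truth, missing):
    """Mean absolute error over the entries that were made missing."""
    return fmean(abs(a - b) for a, b, m in zip(imputed, truth, missing) if m)


rng = random.Random(2014)
x, y, missing = simulate(2000, rng)
mae_linear = mean_abs_error(impute_linear(x, y, missing, rng), y, missing)
mae_donor = mean_abs_error(impute_donor(x, y, missing, rng), y, missing)
print(f"linear model: {mae_linear:.2f}  nearest-neighbour donor: {mae_donor:.2f}")
```

Because the true values are retained in a simulation, imputation error can be measured directly. The study itself assessed bias, efficiency, and confidence interval coverage of parameter estimates over repeated samples, which this sketch does not attempt.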

Figure (image from the PMC full text): https://cdn.ncbi.nlm.nih.gov/pmc/blobs/2a06/3939843/ae858bd2decf/kwt31201.jpg


PMID:24589914
