Am J Epidemiol. 2014 Mar 15;179(6):764-74. doi: 10.1093/aje/kwt312. Epub 2014 Jan 12.
Multivariate imputation by chained equations (MICE) is commonly used for imputing missing data in epidemiologic research. The "true" imputation model may contain nonlinearities which are not included in default imputation models. Random forest imputation is a machine learning technique which can accommodate nonlinearities and interactions and does not require a particular regression model to be specified. We compared parametric MICE with a random forest-based MICE algorithm in 2 simulation studies. The first study used 1,000 random samples of 2,000 persons drawn from the 10,128 stable angina patients in the CALIBER database (Cardiovascular Disease Research using Linked Bespoke Studies and Electronic Records; 2001-2010) with complete data on all covariates. Variables were artificially made "missing at random," and the bias and efficiency of parameter estimates obtained using different imputation methods were compared. Both MICE methods produced unbiased estimates of (log) hazard ratios, but random forest was more efficient and produced narrower confidence intervals. The second study used simulated data in which the partially observed variable depended on the fully observed variables in a nonlinear way. Parameter estimates were less biased using random forest MICE, and confidence interval coverage was better. This suggests that random forest imputation may be useful for imputing complex epidemiologic data sets in which some patients have missing data.
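The simulation design described above (make a variable "missing at random", impute, then compare bias, efficiency, and confidence interval coverage across replicates) can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the authors' code: the function names `make_mar` and `summarize`, the logistic missingness mechanism, and its `slope` and `intercept` defaults are all assumptions introduced here. The imputation step itself (parametric or random forest MICE) is deliberately left out.

```python
import math
import random
from statistics import mean, stdev


def make_mar(rows, target, driver, rng, slope=1.0, intercept=-1.0):
    """Return a copy of `rows` in which `target` is set to None with a
    probability that depends only on the fully observed `driver` variable,
    i.e. a missing-at-random mechanism.

    The logistic form and the default slope/intercept are hypothetical
    choices for illustration, not values taken from the study.
    """
    out = []
    for row in rows:
        p_missing = 1.0 / (1.0 + math.exp(-(intercept + slope * row[driver])))
        new_row = dict(row)  # copy so the complete data stay untouched
        if rng.random() < p_missing:
            new_row[target] = None
        out.append(new_row)
    return out


def summarize(estimates, std_errors, true_value, z=1.96):
    """Summarize one imputation method across simulation replicates.

    estimates   : one point estimate (e.g. a log hazard ratio) per replicate
    std_errors  : the matching standard errors
    true_value  : the parameter value the estimates are compared against
    """
    # Bias: average estimate minus the reference value.
    bias = mean(estimates) - true_value
    # Efficiency: spread of the estimates across replicates.
    empirical_se = stdev(estimates)
    # Coverage: share of replicates whose normal-approximation interval
    # (estimate +/- z * SE) contains the reference value.
    covered = [
        abs(est - true_value) <= z * se
        for est, se in zip(estimates, std_errors)
    ]
    coverage = sum(covered) / len(covered)
    # Average full width of the confidence intervals.
    mean_ci_width = mean(2 * z * se for se in std_errors)
    return {
        "bias": bias,
        "empirical_se": empirical_se,
        "coverage": coverage,
        "mean_ci_width": mean_ci_width,
    }
```

In a full study, `make_mar` would be applied to each random sample of complete cases, each imputation method would be run on the result, and `summarize` would be called once per method so that bias, coverage, and interval width can be compared side by side.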