Department of Statistical Science, Duke University, Durham, North Carolina 27708, USA.
Am J Epidemiol. 2010 Nov 1;172(9):1070-6. doi: 10.1093/aje/kwq260. Epub 2010 Sep 14.
Multiple imputation is particularly well suited to deal with missing data in large epidemiologic studies, because typically these studies support a wide range of analyses by many data users. Some of these analyses may involve complex modeling, including interactions and nonlinear relations. Identifying such relations and encoding them in imputation models, for example, in the conditional regressions for multiple imputation via chained equations, can be daunting tasks with large numbers of categorical and continuous variables. The authors present a nonparametric approach for implementing multiple imputation via chained equations by using sequential regression trees as the conditional models. This has the potential to capture complex relations with minimal tuning by the data imputer. Using simulations, the authors demonstrate that the method can result in more plausible imputations, and hence more reliable inferences, in complex settings than the naive application of standard sequential regression imputation techniques. They apply the approach to impute missing values in data on adverse birth outcomes with more than 100 clinical and survey variables. They evaluate the imputations using posterior predictive checks with several epidemiologic analyses of interest.
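The chained-equations procedure with regression trees as the conditional models can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. It handles continuous variables only, and it assumes each imputed value is drawn at random from the observed values in the leaf that the incomplete case falls into, a detail the abstract does not specify. The function names (`grow_tree`, `leaf_donors`, `cart_mice`), the `min_leaf` and `n_iter` settings, and the toy data are all hypothetical.

```python
import random

def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def grow_tree(rows, y, min_leaf=5):
    """Grow a regression tree by greedy SSE splits; leaves keep their observed responses."""
    best = None
    if len(y) >= 2 * min_leaf:
        for j in range(len(rows[0])):
            for cut in sorted(set(r[j] for r in rows))[1:]:
                left = [y[i] for i, r in enumerate(rows) if r[j] < cut]
                right = [y[i] for i, r in enumerate(rows) if r[j] >= cut]
                if len(left) < min_leaf or len(right) < min_leaf:
                    continue
                score = sse(left) + sse(right)
                if best is None or score < best[0]:
                    best = (score, j, cut)
    if best is None or best[0] >= sse(y):
        return {"donors": list(y)}  # leaf: no admissible split improves the fit
    _, j, cut = best
    li = [i for i, r in enumerate(rows) if r[j] < cut]
    ri = [i for i, r in enumerate(rows) if r[j] >= cut]
    return {"var": j, "cut": cut,
            "left": grow_tree([rows[i] for i in li], [y[i] for i in li], min_leaf),
            "right": grow_tree([rows[i] for i in ri], [y[i] for i in ri], min_leaf)}

def leaf_donors(tree, row):
    """Follow the splits for one case and return the observed responses in its leaf."""
    while "donors" not in tree:
        tree = tree["left"] if row[tree["var"]] < tree["cut"] else tree["right"]
    return tree["donors"]

def predictors(row, j):
    """All entries of a row except column j."""
    return row[:j] + row[j + 1:]

def cart_mice(data, n_iter=5, min_leaf=5, seed=0):
    """Return one completed copy of `data` (rows of numbers, None marks missing).

    Assumes every column has at least one observed value.
    """
    rng = random.Random(seed)
    n, p = len(data), len(data[0])
    miss = [[v is None for v in row] for row in data]
    done = [list(row) for row in data]
    # Starting values: random draws from each column's observed values.
    for j in range(p):
        observed = [data[i][j] for i in range(n) if not miss[i][j]]
        for i in range(n):
            if miss[i][j]:
                done[i][j] = rng.choice(observed)
    # Cycle through the incomplete variables, refitting a tree for each in turn.
    for _ in range(n_iter):
        for j in range(p):
            if not any(miss[i][j] for i in range(n)):
                continue
            obs = [i for i in range(n) if not miss[i][j]]
            tree = grow_tree([predictors(done[i], j) for i in obs],
                             [done[i][j] for i in obs], min_leaf)
            for i in range(n):
                if miss[i][j]:
                    done[i][j] = rng.choice(leaf_donors(tree, predictors(done[i], j)))
    return done

# Toy example: y depends on x1 and x2 only through their interaction.
gen = random.Random(1)
rows = []
for _ in range(200):
    x1, x2 = gen.uniform(-1, 1), gen.uniform(-1, 1)
    y = x1 * x2 + gen.gauss(0, 0.1)
    rows.append([x1, x2, y if gen.random() > 0.3 else None])

# Multiple imputation: repeat the procedure with different seeds.
imputations = [cart_mice(rows, seed=s) for s in range(5)]
```

Because the tree chooses its own splits, the interaction between `x1` and `x2` does not have to be written into the imputation model by hand, which is the practical appeal described above. Categorical variables would need classification splits, which this sketch leaves out.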