Donders A Rogier T, van der Heijden Geert J M G, Stijnen Theo, Moons Karel G M
Center for Biostatistics, Utrecht University, Utrecht, The Netherlands.
J Clin Epidemiol. 2006 Oct;59(10):1087-91. doi: 10.1016/j.jclinepi.2006.01.014. Epub 2006 Jul 11.
In most situations, simple techniques for handling missing data (such as complete case analysis, overall mean imputation, and the missing-indicator method) produce biased results, whereas imputation techniques yield valid results without complicating the analysis once the imputations are carried out. Imputation techniques are based on the idea that any subject in a study sample can be replaced by a new randomly chosen subject from the same source population. Imputing missing data on a variable means replacing each missing value with a value drawn from an estimate of the distribution of that variable. In single imputation, only one estimate of this distribution is used. In multiple imputation, several estimates are used, reflecting the uncertainty in the estimation of this distribution. Under the general conditions of so-called missing at random and missing completely at random, both single and multiple imputation result in unbiased estimates of study associations. However, single imputation yields estimated standard errors that are too small, whereas multiple imputation yields correctly estimated standard errors and confidence intervals. In this article we explain why this is the case, and use a simple simulation study to demonstrate our explanations. We also explain and illustrate why two frequently used methods to handle missing data, i.e., overall mean imputation and the missing-indicator method, almost always result in biased estimates.
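The contrast drawn in the abstract can be illustrated with a small simulation. The following is a minimal sketch and not the authors' simulation study: the data-generating model (a linear outcome with true slope 1), the choice to make the outcome missing completely at random, the bootstrap used to vary the imputation model across imputations, and all function names (`ols_slope`, `draw_imputations`, `simulate`) are assumptions made for illustration. It compares overall mean imputation, a single stochastic regression imputation, and multiple imputation pooled with Rubin's rules.

```python
import numpy as np


def ols_slope(x, y):
    """Return the OLS slope of y on x and its model-based standard error."""
    xc = x - x.mean()
    sxx = np.sum(xc ** 2)
    slope = np.sum(xc * (y - y.mean())) / sxx
    intercept = y.mean() - slope * x.mean()
    resid = y - intercept - slope * x
    sigma2 = np.sum(resid ** 2) / (len(x) - 2)
    return slope, np.sqrt(sigma2 / sxx)


def draw_imputations(x, y, missing, fit_idx, rng):
    """Fit y ~ x on the rows in fit_idx, then draw each missing y from that fit."""
    xf, yf = x[fit_idx], y[fit_idx]
    slope, _ = ols_slope(xf, yf)
    intercept = yf.mean() - slope * xf.mean()
    resid_sd = np.std(yf - intercept - slope * xf, ddof=2)
    y_imp = y.copy()
    noise = rng.normal(0.0, resid_sd, missing.sum())
    y_imp[missing] = intercept + slope * x[missing] + noise
    return y_imp


def simulate(n=2000, p_missing=0.5, m=20, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    y_full = x + rng.normal(size=n)          # assumed model: true slope is 1
    missing = rng.random(n) < p_missing      # missing completely at random
    y = np.where(missing, np.nan, y_full)
    observed = np.flatnonzero(~missing)

    # Overall mean imputation: every missing y gets the observed mean.
    y_mean = np.where(missing, np.nanmean(y), y)
    mean_result = ols_slope(x, y_mean)

    # Single imputation: one estimate of the distribution, one set of draws.
    single_result = ols_slope(x, draw_imputations(x, y, missing, observed, rng))

    # Multiple imputation: m estimates of the distribution. Here each one is
    # obtained by refitting on a bootstrap sample of the complete cases
    # (an assumed, simple way to reflect estimation uncertainty).
    slopes, variances = [], []
    for _ in range(m):
        boot = rng.choice(observed, size=observed.size, replace=True)
        slope, se = ols_slope(x, draw_imputations(x, y, missing, boot, rng))
        slopes.append(slope)
        variances.append(se ** 2)

    # Rubin's rules: total variance = within + (1 + 1/m) * between.
    within = np.mean(variances)
    between = np.var(slopes, ddof=1)
    mi_result = (np.mean(slopes), np.sqrt(within + (1 + 1 / m) * between))

    return {"mean": mean_result, "single": single_result, "multiple": mi_result}


if __name__ == "__main__":
    for method, (slope, se) in simulate().items():
        print(f"{method:>8}: slope = {slope:.3f}, SE = {se:.4f}")
```

Under these assumptions, the mean-imputed slope is pulled toward zero, while the single and multiple imputation slopes stay near the true value. The naive standard error from the single imputed data set is smaller than the pooled standard error from multiple imputation, because it treats the imputed values as if they had been observed.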