He Yulei, Shimizu Iris, Schappert Susan, Xu Jianmin, Beresovsky Vladislav, Khan Diba, Valverde Roberto, Schenker Nathaniel
National Center for Health Statistics, Centers for Disease Control and Prevention, Hyattsville, MD, 20782, U.S.A.
J Off Stat. 2016;32(1):147-164. doi: 10.1515/jos-2016-0007. Epub 2016 Mar 10.
Multiple imputation is a popular approach to handling missing data. Although it was originally motivated by survey nonresponse problems, it has been readily applied to other data settings. However, its general behavior remains unclear when applied to survey data with complex sample designs, including clustering. Recently, Lewis et al. (2014) compared single- and multiple-imputation analyses for certain incomplete variables in the 2008 National Ambulatory Medical Care Survey, which has a nationally representative, multistage, and clustered sampling design. Their results suggested that the increase in the variance estimate under multiple imputation, relative to single imputation, largely disappears for estimates with large design effects. We complement their empirical research with some theoretical reasoning. We consider data sampled from an equally weighted, single-stage cluster design and characterize the process using a balanced, one-way normal random-effects model. Assuming that the data are missing completely at random, we derive analytic expressions for the within- and between-imputation variance estimators of the mean estimator, which reveal the impact of design effects on these variance estimators. We propose approximations for the fraction of missing information in clustered samples, extending previous results for simple random samples. We discuss some generalizations of this research and its practical implications for data release by statistical agencies.
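As a rough illustration of the quantities named in the abstract, the sketch below computes the standard multiple-imputation combining rules (within-imputation variance, between-imputation variance, total variance) and the usual design effect for equal-sized clusters. It does not reproduce the paper's analytic expressions. The function names `design_effect` and `rubin_combine` and the input numbers are hypothetical, and the fraction of missing information is approximated by the simple ratio of between-imputation to total variance, ignoring the degrees-of-freedom correction.

```python
from statistics import mean, variance


def design_effect(cluster_size, icc):
    """Design effect for equal-sized clusters: 1 + (n - 1) * rho,
    where rho is the intraclass correlation."""
    return 1.0 + (cluster_size - 1) * icc


def rubin_combine(estimates, variances):
    """Combine M completed-data estimates and their variance estimates
    using the standard multiple-imputation combining rules."""
    m = len(estimates)
    q_bar = mean(estimates)        # pooled point estimate
    u_bar = mean(variances)        # within-imputation variance
    b = variance(estimates)        # between-imputation variance (divisor M - 1)
    t = u_bar + (1 + 1 / m) * b    # total variance
    lam = (1 + 1 / m) * b / t      # simple approximation to the fraction of missing information
    return {"estimate": q_bar, "within": u_bar, "between": b,
            "total": t, "fmi_approx": lam}


# Hypothetical inputs: three imputed-data means and their variance estimates.
result = rubin_combine([10.0, 12.0, 11.0], [4.0, 4.0, 4.0])
deff = design_effect(cluster_size=20, icc=0.05)
print(result, deff)
```

When the within-imputation variance is inflated by a large design effect while the between-imputation variance is not inflated to the same degree, the ratio `fmi_approx` shrinks, which is consistent with the pattern the abstract describes.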