Greenland S, Finkle W D
Department of Epidemiology, UCLA School of Public Health, 90095-1772, USA.
Am J Epidemiol. 1995 Dec 15;142(12):1255-64. doi: 10.1093/oxfordjournals.aje.a117592.
Epidemiologic studies often encounter missing covariate values. While simple methods such as stratification on missing-data status, conditional-mean imputation, and complete-subject analysis are commonly employed for handling this problem, several studies have shown that these methods can be biased under reasonable circumstances. The authors review these results in the context of logistic regression and present simulation experiments showing the limitations of the methods. The method based on missing-data indicators can exhibit severe bias even when the data are missing completely at random, and regression (conditional-mean) imputation can be inordinately sensitive to model misspecification. Even complete-subject analysis can outperform these methods. More sophisticated methods, such as maximum likelihood, multiple imputation, and weighted estimating equations, have been given extensive attention in the statistics literature. While these methods are superior to simple methods, they are not commonly used in epidemiology, no doubt due to their complexity and the lack of packaged software to apply these methods. The authors contrast the results of multiple imputation to simple methods in the analysis of a case-control study of endometrial cancer, and they find a meaningful difference in results for age at menarche. In general, the authors recommend that epidemiologists avoid using the missing-indicator method and use more sophisticated methods whenever a large proportion of data are missing.
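The claim that the missing-indicator method can be biased even when data are missing completely at random can be illustrated with a small simulation. The sketch below is not the authors' simulation design. The sample size, coefficient values, missingness fraction, and the names `fit_logistic` and `simulate` are all assumptions chosen for illustration. It generates a binary exposure `x` with no true effect on the outcome, a confounder `z` that is deleted completely at random for half of the subjects, and then compares the exposure coefficient from a complete-subject logistic regression with the one from the missing-indicator method (missing `z` set to 0, plus an indicator for missingness).

```python
import numpy as np


def fit_logistic(X, y, n_iter=25):
    """Maximum-likelihood logistic regression via Newton-Raphson.

    X must already contain an intercept column.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        w = p * (1.0 - p)
        grad = X.T @ (y - p)               # score vector
        hess = X.T @ (X * w[:, None])      # observed information
        beta = beta + np.linalg.solve(hess, grad)
    return beta


def simulate(n=20000, beta_x=0.0, beta_z=1.0, p_missing=0.5, seed=0):
    """Compare complete-subject and missing-indicator estimates of beta_x.

    All parameter values are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)

    # Confounder z and a binary exposure x that is correlated with z.
    z = rng.normal(size=n)
    x = (z + rng.normal(size=n) > 0).astype(float)

    # Outcome depends on z; x has log odds ratio beta_x (0 by default).
    lin = -1.0 + beta_x * x + beta_z * z
    y = (rng.random(n) < 1.0 / (1.0 + np.exp(-lin))).astype(float)

    # Delete z completely at random (independent of x, z, and y).
    miss = rng.random(n) < p_missing
    ones = np.ones(n)

    # Complete-subject analysis: drop every record with missing z.
    keep = ~miss
    X_cc = np.column_stack([ones[keep], x[keep], z[keep]])
    beta_cc = fit_logistic(X_cc, y[keep])

    # Missing-indicator method: set missing z to 0 and add an indicator.
    z_filled = np.where(miss, 0.0, z)
    X_mi = np.column_stack([ones, x, z_filled, miss.astype(float)])
    beta_mi = fit_logistic(X_mi, y)

    return {"complete_subject": beta_cc[1], "missing_indicator": beta_mi[1]}


if __name__ == "__main__":
    for method, estimate in simulate().items():
        print(f"{method}: estimated log odds ratio for x = {estimate:.3f}")
```

Under these assumptions one would expect the complete-subject estimate to fall near the true value of 0, while the missing-indicator estimate is pulled toward the confounded crude association, because records with missing `z` contribute an exposure effect that is not adjusted for `z`.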