Division of Intramural Population Health Research, Eunice Kennedy Shriver National Institute of Child Health and Human Development, Rockville, Maryland.
Department of Epidemiology, Gillings School of Global Public Health, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina.
Am J Epidemiol. 2018 Mar 1;187(3):568-575. doi: 10.1093/aje/kwx348.
Principled methods for analyzing missing data have long existed; however, broad implementation of these methods remains challenging. In this and 2 companion papers (Am J Epidemiol. 2018;187(3):576-584 and Am J Epidemiol. 2018;187(3):585-591), we discuss issues pertaining to missing data in the epidemiologic literature. We provide details regarding missing-data mechanisms and nomenclature and encourage the conduct of principled analyses through a detailed comparison of multiple imputation and inverse probability weighting. Data from the Collaborative Perinatal Project, a multisite US study conducted from 1959 to 1974, are used to create a masked data-analytical challenge with missing data induced by known mechanisms. We illustrate the deleterious effects of analyzing missing data with naive methods and show how principled methods can sometimes mitigate such effects. For example, when data were missing at random, naive methods showed a spurious protective effect of smoking on the risk of spontaneous abortion (odds ratio (OR) = 0.43, 95% confidence interval (CI): 0.19, 0.93), whereas principled methods, namely multiple imputation (OR = 1.30, 95% CI: 0.95, 1.77) and augmented inverse probability weighting (OR = 1.40, 95% CI: 1.00, 1.97), provided estimates closer to the "true" full-data effect (OR = 1.31, 95% CI: 1.05, 1.64). We call for greater acknowledgement of and attention to missing data and for the broad use of principled missing-data methods in epidemiologic research.
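As a rough illustration of how a naive complete-case analysis can manufacture a spurious protective effect under data missing at random, and how weighting can undo it, the following minimal sketch simulates a binary exposure and outcome. It is not the authors' analysis and does not use Collaborative Perinatal Project data: the sample size, prevalences, and completeness probabilities are invented for illustration, the helper `odds_ratio` is hypothetical, and the weighting shown is plain inverse probability weighting rather than the augmented estimator reported in the abstract. The sketch assumes that exposure and outcome are recorded for everyone, that some other analysis covariate (not simulated here) is missing, and that a naive analysis therefore drops the incomplete records.

```python
import numpy as np

def odds_ratio(x, y, w=None):
    """Odds ratio for binary exposure x and binary outcome y, optionally weighted."""
    w = np.ones(len(x)) if w is None else w
    a = w[(x == 1) & (y == 1)].sum()  # exposed, outcome present
    b = w[(x == 1) & (y == 0)].sum()  # exposed, outcome absent
    c = w[(x == 0) & (y == 1)].sum()  # unexposed, outcome present
    d = w[(x == 0) & (y == 0)].sum()  # unexposed, outcome absent
    return (a * d) / (b * c)

rng = np.random.default_rng(0)
n = 200_000

# Invented data-generating values: the true odds ratio is about 1.3.
x = rng.binomial(1, 0.3, n)                      # exposure
y = rng.binomial(1, np.where(x == 1, 0.1262, 0.10))  # outcome

# Missing at random: the chance that a record is complete depends only on
# the fully observed exposure and outcome. Exposed cases are rarely complete.
p_complete = np.where((x == 1) & (y == 1), 0.3, 0.9)
complete = rng.binomial(1, p_complete).astype(bool)

full_or = odds_ratio(x, y)                      # benchmark using all records
cc_or = odds_ratio(x[complete], y[complete])    # naive complete-case estimate

# Estimate P(complete | x, y) from the data, one proportion per (x, y) cell.
p_hat = np.empty(n)
for xv in (0, 1):
    for yv in (0, 1):
        cell = (x == xv) & (y == yv)
        p_hat[cell] = complete[cell].mean()

# Weight each complete record by the inverse of its estimated completeness probability.
ipw_or = odds_ratio(x[complete], y[complete], w=1.0 / p_hat[complete])

print(f"full data OR:     {full_or:.2f}")
print(f"complete-case OR: {cc_or:.2f}")
print(f"IPW OR:           {ipw_or:.2f}")
```

In this toy setting the weights come from a saturated model (one proportion per exposure-outcome cell), so the weighted table reproduces the full-data table and the weighted odds ratio matches the benchmark by construction. In practice the weights would come from a fitted missingness model with more covariates, and the agreement would only be approximate.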