Moons Karel G M, Donders Rogier A R T, Stijnen Theo, Harrell Frank E
Julius Center for Health Sciences and General Practice, University Medical Center, Utrecht, P.O. Box 80035, 3508 GA Utrecht, The Netherlands.
J Clin Epidemiol. 2006 Oct;59(10):1092-101. doi: 10.1016/j.jclinepi.2006.01.009. Epub 2006 Jun 19.
Epidemiologic studies commonly estimate associations between predictors (risk factors) and outcome. Most software automatically excludes subjects with missing values. This commonly causes bias because missing values seldom occur completely at random (MCAR); more often they occur selectively, depending on other observed variables, which is termed missing at random (MAR). Multiple imputation (MI) of missing predictor values using all observed information, including the outcome, is advocated to deal with selective missing values. Using the outcome to impute the predictors seems a self-fulfilling prophecy.
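The difference between the two mechanisms, and the bias from dropping incomplete subjects, can be illustrated with a minimal sketch. It uses synthetic data and a continuous outcome, not the study data, and it assumes for illustration that the MAR missingness depends on the outcome. The helper names `make_missing_mask` and `ols_slope` are hypothetical.

```python
import numpy as np

def make_missing_mask(rng, n, driver=None, rate=0.3, strength=2.0):
    """Return a boolean mask (True = value is set to missing).

    driver=None  -> MCAR: every subject has the same probability `rate`.
    driver given -> MAR: the probability rises with an observed variable.
    """
    if driver is None:
        prob = np.full(n, rate)
    else:
        z = (driver - driver.mean()) / driver.std()
        logit = np.log(rate / (1 - rate)) + strength * z
        prob = 1 / (1 + np.exp(-logit))
    return rng.random(n) < prob

def ols_slope(x, y):
    """Slope of a simple least-squares regression of y on x."""
    X = np.column_stack([np.ones_like(x), x])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

rng = np.random.default_rng(0)
n = 20000
x = rng.normal(size=n)               # predictor that will lose values
y = 1.0 * x + rng.normal(size=n)     # outcome, assumed true slope of 1.0

mcar = make_missing_mask(rng, n)             # unrelated to any variable
mar = make_missing_mask(rng, n, driver=y)    # assumed to depend on the outcome

# Complete-case analysis: subjects with a missing x are simply dropped.
slope_full = ols_slope(x, y)
slope_cc_mcar = ols_slope(x[~mcar], y[~mcar])
slope_cc_mar = ols_slope(x[~mar], y[~mar])

print(f"full data:          {slope_full:.3f}")
print(f"complete case MCAR: {slope_cc_mcar:.3f}")
print(f"complete case MAR:  {slope_cc_mar:.3f}")
```

In this setup the complete-case slope under MCAR stays close to the full-data slope, whereas under the outcome-dependent MAR mechanism it is pulled toward zero.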
We tested this hypothesis using data from a study on diagnosis of pulmonary embolism. We selected five predictors of pulmonary embolism without missing values. Their regression coefficients and standard errors (SEs) estimated from the original sample were considered as "true" values. We assigned missing values to these predictors (both MCAR and MAR) and repeated this 1,000 times using simulations. Per simulation we multiply imputed the missing values without and with the outcome, and compared the regression coefficients and SEs to the truth.
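The design of such a simulation can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' procedure: the data are synthetic, the outcome is continuous and analysed by linear regression, only one predictor (`x1`) receives MCAR missing values, and imputation is stochastic regression imputation repeated `m` times without drawing the imputation-model parameters, so it is not fully proper MI. It runs 200 simulations rather than 1,000 to keep the run time short. All function names are hypothetical.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares coefficients, with the intercept in position 0."""
    A = np.column_stack([np.ones(len(y)), X])
    return np.linalg.lstsq(A, y, rcond=None)[0]

def impute_once(rng, x1, covars, miss):
    """Fill missing x1 with a regression prediction plus random noise."""
    obs = ~miss
    beta = fit_ols(covars[obs], x1[obs])          # fitted on observed cases only
    A = np.column_stack([np.ones(len(x1)), covars])
    resid_sd = np.std(x1[obs] - A[obs] @ beta)
    filled = x1.copy()
    filled[miss] = A[miss] @ beta + rng.normal(0, resid_sd, miss.sum())
    return filled

def mi_coefficient(rng, x1, x2, y, miss, use_outcome, m=5):
    """Average the x1 coefficient over m imputed data sets."""
    covars = np.column_stack([x2, y]) if use_outcome else x2[:, None]
    estimates = []
    for _ in range(m):
        x1_filled = impute_once(rng, x1, covars, miss)
        estimates.append(fit_ols(np.column_stack([x1_filled, x2]), y)[1])
    return float(np.mean(estimates))

def run_simulation(n_sim=200, n=1000, miss_rate=0.4, true_beta=1.0, seed=1):
    rng = np.random.default_rng(seed)
    with_y, without_y = [], []
    for _ in range(n_sim):
        x2 = rng.normal(size=n)
        x1 = 0.5 * x2 + rng.normal(size=n)
        y = true_beta * x1 + 0.5 * x2 + rng.normal(size=n)
        miss = rng.random(n) < miss_rate          # MCAR missingness in x1
        with_y.append(mi_coefficient(rng, x1, x2, y, miss, True))
        without_y.append(mi_coefficient(rng, x1, x2, y, miss, False))
    return float(np.mean(with_y)), float(np.mean(without_y))

mean_with, mean_without = run_simulation()
print(f"true coefficient:           1.000")
print(f"imputation with outcome:    {mean_with:.3f}")
print(f"imputation without outcome: {mean_without:.3f}")
```

Under these assumptions, imputed values drawn without the outcome carry no information about `y` beyond what `x2` provides, so the pooled coefficient shrinks roughly in proportion to the fraction of imputed values.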
Regression coefficients based on MI including the outcome were close to the truth. MI without the outcome yielded strongly biased (underestimated) coefficients. SEs and coverage of the 90% confidence intervals did not differ between MI with and without the outcome. Results were the same for MCAR and MAR.
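The abstract does not state how coefficients and SEs were combined across imputed data sets. The standard approach is Rubin's rules, shown here as an assumption about the pooling step, with $m$ imputed data sets, per-data-set estimates $\hat\beta_j$ and standard errors $\widehat{SE}_j$:

```latex
\begin{aligned}
\bar\beta &= \frac{1}{m}\sum_{j=1}^{m}\hat\beta_j
  && \text{(pooled coefficient)}\\
\bar U &= \frac{1}{m}\sum_{j=1}^{m}\widehat{SE}_j^{\,2}
  && \text{(within-imputation variance)}\\
B &= \frac{1}{m-1}\sum_{j=1}^{m}\bigl(\hat\beta_j-\bar\beta\bigr)^2
  && \text{(between-imputation variance)}\\
T &= \bar U + \Bigl(1+\frac{1}{m}\Bigr)B,
  \qquad SE_{\text{pooled}} = \sqrt{T}\\
\nu &= (m-1)\Bigl(1+\frac{\bar U}{(1+1/m)\,B}\Bigr)^{2},
  \qquad \text{90\% CI: } \bar\beta \pm t_{\nu,\,0.95}\sqrt{T}
\end{aligned}
```

Coverage is then the proportion of simulations in which this interval contains the "true" coefficient.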
For all types of missing values, imputation of missing predictor values using the outcome is preferred over imputation without the outcome and is not a self-fulfilling prophecy.