Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, Utrecht, The Netherlands.
J Clin Epidemiol. 2010 Jul;63(7):721-7. doi: 10.1016/j.jclinepi.2009.12.008. Epub 2010 Mar 24.
We compared popular methods for handling missing data with multiple imputation, a more sophisticated method that preserves the observed data.
We used data from 804 patients with suspected deep venous thrombosis (DVT). We studied three covariates to predict the presence of DVT: d-dimer level, difference in calf circumference, and history of leg trauma. We introduced missing values (missing at random) in proportions ranging from 10% to 90%. The risk of DVT was modeled with logistic regression under each of three methods: complete case analysis, exclusion of d-dimer level from the model, and multiple imputation.
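As a rough illustration of the setup described above, the following sketch imposes missing-at-random (MAR) missingness on a d-dimer column in synthetic data and counts how many patients a complete case analysis would discard. Everything in it is an assumption made for illustration: the synthetic values, the names `make_patients`, `impose_mar`, and `complete_cases`, and the choice to let the probability of missingness depend on calf circumference difference through a logistic function. The abstract does not describe the missingness mechanism beyond stating that it was MAR.

```python
import math
import random


def make_patients(n, rng):
    """Synthetic stand-in for a cohort (illustrative values, not study data)."""
    return [
        {
            "d_dimer": rng.lognormvariate(6.0, 1.0),        # hypothetical scale
            "calf_diff_cm": max(0.0, rng.gauss(2.0, 1.5)),  # hypothetical scale
            "leg_trauma": rng.random() < 0.15,
        }
        for _ in range(n)
    ]


def impose_mar(patients, target_fraction, rng, slope=1.0):
    """Return copies of the records with d_dimer set to None under MAR.

    The probability that d_dimer is missing depends only on the fully
    observed calf_diff_cm, which is what makes the mechanism MAR.
    """
    xs = [p["calf_diff_cm"] for p in patients]
    centre = sum(xs) / len(xs)

    def prob(x, intercept):
        return 1.0 / (1.0 + math.exp(-(intercept + slope * (x - centre))))

    # Bisection on the intercept so the expected missing fraction hits the target.
    lo, hi = -20.0, 20.0
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if sum(prob(x, mid) for x in xs) / len(xs) < target_fraction:
            lo = mid
        else:
            hi = mid
    intercept = (lo + hi) / 2.0

    result = []
    for p in patients:
        q = dict(p)  # copy, so the original cohort stays intact
        if rng.random() < prob(p["calf_diff_cm"], intercept):
            q["d_dimer"] = None
        result.append(q)
    return result


def complete_cases(patients):
    """Keep only the records in which d_dimer was observed."""
    return [p for p in patients if p["d_dimer"] is not None]


rng = random.Random(0)
cohort = make_patients(804, rng)
for fraction in (0.1, 0.5, 0.9):
    incomplete = impose_mar(cohort, fraction, rng)
    kept = complete_cases(incomplete)
    print(f"target {fraction:.0%} missing: {len(kept)} of {len(incomplete)} patients remain")
```

Each discarded record still carries observed values for calf circumference difference and leg trauma, which is the information a complete case analysis throws away.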
Multiple imputation showed less bias in the regression coefficients of the three variables, and more accurate coverage of the corresponding 90% confidence intervals, than either complete case analysis or dropping d-dimer level from the analysis. Multiple imputation gave unbiased estimates of the area under the receiver operating characteristic curve (0.88), compared with lower estimates from complete case analysis (0.77) and from the model in which the variable with missing values was dropped (0.65).
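The abstract does not say how coefficient estimates from the imputed datasets were combined. The conventional approach is Rubin's rules, sketched here under that assumption. The function name `pool_rubin`, the example numbers, and the normal approximation used for the 90% interval (Rubin's rules proper use a t reference distribution) are illustrative choices, not taken from the study.

```python
from statistics import NormalDist, mean, variance


def pool_rubin(estimates, variances, level=0.90):
    """Pool one coefficient across m imputed datasets with Rubin's rules.

    estimates: the coefficient estimate from each imputed dataset
    variances: the squared standard error from each imputed dataset
    Returns (pooled estimate, total variance, confidence interval).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations with matching variances")

    pooled = mean(estimates)                  # average of the point estimates
    within = mean(variances)                  # average sampling variance
    between = variance(estimates)             # spread across imputations (divisor m - 1)
    total = within + (1.0 + 1.0 / m) * between

    # Simplification: normal quantile instead of the t quantile with Rubin's
    # degrees of freedom, so the interval is slightly too narrow for small m.
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    half_width = z * total ** 0.5
    return pooled, total, (pooled - half_width, pooled + half_width)


# Hypothetical log-odds estimates for one covariate from five imputed datasets.
coefs = [1.0, 1.2, 0.8, 1.1, 0.9]
ses_squared = [0.04, 0.04, 0.04, 0.04, 0.04]
estimate, total_var, interval = pool_rubin(coefs, ses_squared)
print(estimate, total_var, interval)
```

The between-imputation term is what widens the interval to reflect uncertainty about the missing values; analysing a single imputed dataset would omit it and overstate precision.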
Because this study shows that simple methods for dealing with missing data can lead to seriously misleading results, we advise considering multiple imputation. The purpose of multiple imputation is not to create data, but to prevent the exclusion of observed data.