van der Heijden Geert J M G, Donders A Rogier T, Stijnen Theo, Moons Karel G M
Julius Center for Health Sciences and Primary Care, University Medical Center, P.O. Box 80035, 3508 GA Utrecht, The Netherlands.
J Clin Epidemiol. 2006 Oct;59(10):1102-9. doi: 10.1016/j.jclinepi.2006.01.015. Epub 2006 Jul 11.
To illustrate the effects of different methods for handling missing data (complete case analysis, the missing-indicator method, single imputation of the unconditional and conditional mean, and multiple imputation [MI]) in multivariable diagnostic research that aims to identify potential predictors (test results) that independently contribute to the prediction of disease presence or absence.
We used data from 398 subjects from a prospective study on the diagnosis of pulmonary embolism. Various diagnostic predictors or tests had (varying percentages of) missing values. Per method of handling these missing values, we fitted a diagnostic prediction model using multivariable logistic regression analysis.
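The handling methods compared here can be sketched on a toy example. This is a minimal illustration, not the study's procedure: the abstract does not describe the software or imputation models used, and the variable names (`age`, `d_dimer`), the data values, and every function name below are hypothetical. The MI step is deliberately simplified to stochastic regression imputation with fixed regression coefficients, whereas a full MI procedure would also draw the coefficients from their estimated distribution and pool the fitted model estimates across the imputed data sets.

```python
import random
import statistics

# Hypothetical toy data: (age, d_dimer) pairs; None marks a missing test result.
rows = [(30, 0.4), (45, None), (50, 0.9), (62, 1.3), (70, None), (81, 1.8)]


def complete_case(rows):
    """Complete case analysis: drop every subject with any missing value."""
    return [r for r in rows if None not in r]


def missing_indicator(rows, fill=0.0):
    """Missing-indicator method: fill with a constant and add a 0/1 flag."""
    return [(x, fill if y is None else y, int(y is None)) for x, y in rows]


def unconditional_mean(rows):
    """Single imputation with the overall mean of the observed values."""
    mean_y = statistics.mean(y for _, y in rows if y is not None)
    return [(x, mean_y if y is None else y) for x, y in rows]


def fit_line(rows):
    """Least-squares fit of y on x in the observed cases.

    Returns (intercept, slope, residual standard deviation).
    """
    obs = complete_case(rows)
    xs = [x for x, _ in obs]
    ys = [y for _, y in obs]
    mx, my = statistics.mean(xs), statistics.mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in obs) / sum((x - mx) ** 2 for x in xs)
    intercept = my - slope * mx
    resid = [y - (intercept + slope * x) for x, y in obs]
    sd = (sum(e * e for e in resid) / (len(obs) - 2)) ** 0.5
    return intercept, slope, sd


def conditional_mean(rows):
    """Single imputation with the value predicted from the other variable."""
    a, b, _ = fit_line(rows)
    return [(x, a + b * x if y is None else y) for x, y in rows]


def multiple_imputation(rows, m=5, seed=0):
    """Simplified MI: m data sets, each imputed as prediction plus random noise."""
    rng = random.Random(seed)
    a, b, sd = fit_line(rows)
    return [
        [(x, a + b * x + rng.gauss(0, sd) if y is None else y) for x, y in rows]
        for _ in range(m)
    ]
```

In this sketch the single imputation methods each return one filled-in data set, while `multiple_imputation` returns `m` data sets whose imputed values differ, which is how MI carries the uncertainty about the missing values into the analysis. A prediction model would then be fitted on each data set and the estimates combined.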
The area under the receiver operating characteristic curve was above 0.75 for all diagnostic models. The predictors in the final models based on complete case analysis and on the missing-indicator method differed markedly from those in the other models. The models based on MI did not differ much from the models derived after single conditional and unconditional mean imputation.
In multivariable diagnostic research, complete case analysis and the missing-indicator method should be avoided, even when data are missing completely at random. MI methods are known to be superior to single imputation methods. In our example study, the single imputation methods performed equally well, but this was most likely because of the low overall number of missing values.