Missing covariate data in medical research: to impute is better than to ignore.

Affiliation

Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, Utrecht, The Netherlands.

Publication information

J Clin Epidemiol. 2010 Jul;63(7):721-7. doi: 10.1016/j.jclinepi.2009.12.008. Epub 2010 Mar 24.

DOI:10.1016/j.jclinepi.2009.12.008

PMID:20338724

Abstract

OBJECTIVE

We compared popular methods to handle missing data with multiple imputation (a more sophisticated method that preserves data).

STUDY DESIGN AND SETTING

We used data from 804 patients with suspected deep venous thrombosis (DVT). We studied three covariates to predict the presence of DVT: d-dimer level, difference in calf circumference, and history of leg trauma. We introduced missing values (missing at random) in proportions ranging from 10% to 90%. The risk of DVT was modeled with logistic regression under each of three methods: complete case analysis, exclusion of d-dimer level from the model, and multiple imputation.
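The design above can be illustrated with a minimal sketch. It is not the authors' code and does not use the study data: the cohort is synthetic, the column names (`d_dimer`, `calf_diff_cm`, `leg_trauma`), the distributions, and the rule that makes d-dimer more likely to be missing when the calf difference is above the median are all assumptions chosen for illustration. The sketch shows what "missing at random" means in practice (missingness depends only on an observed covariate) and what each simple method discards before any model is fitted.

```python
import random
from statistics import median


def make_patients(n, seed=0):
    """Synthetic stand-in cohort; the values are NOT the study data."""
    rng = random.Random(seed)
    return [
        {
            "d_dimer": rng.lognormvariate(6.0, 1.0),  # hypothetical scale
            "calf_diff_cm": rng.gauss(2.0, 1.5),      # hypothetical scale
            "leg_trauma": rng.random() < 0.15,        # hypothetical prevalence
        }
        for _ in range(n)
    ]


def add_mar_missingness(rows, fraction, seed=1):
    """Blank out d_dimer so that about `fraction` of values are missing.

    The probability of being missing depends only on the fully observed
    calf_diff_cm (above vs. at or below the median), which makes the
    mechanism missing at random. The 1.5x weighting is an arbitrary choice.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must lie in [0, 1]")
    rng = random.Random(seed)
    cut = median(r["calf_diff_cm"] for r in rows)
    p_high = min(1.0, 1.5 * fraction)   # patients above the median
    p_low = 2.0 * fraction - p_high     # patients at or below the median
    holed = []
    for r in rows:
        p = p_high if r["calf_diff_cm"] > cut else p_low
        new = dict(r)                   # leave the input rows untouched
        if rng.random() < p:
            new["d_dimer"] = None
        holed.append(new)
    return holed


def complete_cases(rows):
    """Complete case analysis: drop every patient with a missing d_dimer."""
    return [r for r in rows if r["d_dimer"] is not None]


def drop_d_dimer(rows):
    """Drop-the-variable analysis: keep every patient, lose the covariate."""
    return [{k: v for k, v in r.items() if k != "d_dimer"} for r in rows]


if __name__ == "__main__":
    cohort = make_patients(804)
    for fraction in (0.1, 0.5, 0.9):
        holed = add_mar_missingness(cohort, fraction)
        kept = len(complete_cases(holed))
        print(f"target {fraction:.0%} missing: "
              f"complete cases keep {kept} of {len(holed)} patients; "
              f"dropping d_dimer keeps {len(drop_d_dimer(holed))} patients "
              f"but only 2 of 3 covariates")
```

Multiple imputation would instead keep all rows and all three covariates, filling each missing d-dimer value several times from a model based on the observed data; that step is omitted here because the abstract does not specify the imputation model.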

RESULTS

Multiple imputation showed less bias in the regression coefficients of the three variables and more accurate coverage of the corresponding 90% confidence intervals than complete case analysis and dropping d-dimer level from the analysis. Multiple imputation showed unbiased estimates of the area under the receiver operating characteristic curve (0.88) compared with complete case analysis (0.77) and when the variable with missing values was dropped (0.65).
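For readers unfamiliar with how a single coefficient and confidence interval are obtained from several imputed data sets, the sketch below applies Rubin's rules, the standard pooling step in multiple imputation. The abstract does not state which pooling procedure or software the authors used, so this is an assumption; the function name `pool_rubin` is hypothetical, and the interval uses a normal approximation rather than the t-distribution with Rubin's degrees of freedom that is more common in practice.

```python
from statistics import NormalDist, mean, variance


def pool_rubin(estimates, variances, level=0.90):
    """Pool one coefficient across m imputed data sets with Rubin's rules.

    estimates: the coefficient estimated in each imputed data set
    variances: its squared standard error in each imputed data set
    Returns (pooled estimate, total variance, (lower, upper) interval).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations and matching lengths")
    pooled = mean(estimates)                  # average point estimate
    within = mean(variances)                  # average sampling variance
    between = variance(estimates)             # spread across imputations (m - 1 divisor)
    total = within + (1 + 1 / m) * between    # Rubin's total variance
    z = NormalDist().inv_cdf(0.5 + level / 2)  # normal approximation (assumption)
    half_width = z * total ** 0.5
    return pooled, total, (pooled - half_width, pooled + half_width)
```

The between-imputation term is what keeps the interval honest: it widens the interval to reflect uncertainty about the missing values, which is one reason coverage can stay close to the nominal level.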

CONCLUSION

Because this study shows that simple methods of dealing with missing data can lead to seriously misleading results, we advise considering multiple imputation. The purpose of multiple imputation is not to create data, but to prevent the exclusion of observed data.

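Abstract (translated)

OBJECTIVE

We compared common methods of handling missing data with multiple imputation (a more sophisticated method that retains data).

STUDY DESIGN AND SETTING

We used data from 804 patients with suspected deep venous thrombosis (DVT). We studied three covariates for predicting the presence of DVT: d-dimer level, difference in calf circumference, and history of leg trauma. We introduced missing values (missing at random) in proportions ranging from 10% to 90%. For each of the three methods, namely complete case analysis, exclusion of d-dimer level from the model, and multiple imputation, we modeled the risk of DVT with logistic regression.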