Heymans Martijn W, van Buuren Stef, Knol Dirk L, van Mechelen Willem, de Vet Henrica C W
Vrije Universiteit, Institute for Health Sciences, Department of Methodology and Applied Biostatistics, Amsterdam, The Netherlands.
BMC Med Res Methodol. 2007 Jul 13;7:33. doi: 10.1186/1471-2288-7-33.
Missing data are a challenging problem in many prognostic studies. Multiple imputation (MI) accounts for imputation uncertainty, which allows for adequate statistical testing. We developed and tested a methodology that combines MI with bootstrapping techniques for studying prognostic variable selection.
In our prospective cohort study we merged data from three different randomized controlled trials (RCTs) to assess prognostic variables for chronicity of low back pain. Among the outcome and prognostic variables, the proportion of missing data ranged from 0% to 48.1%. We used four methods to investigate the influence of sampling variation and imputation variation: MI only, bootstrap only, and two methods that combine MI and bootstrapping. Variables were selected based on the inclusion frequency of each prognostic variable, i.e. the proportion of times that the variable appeared in the model. The discriminative and calibrative abilities of the prognostic models developed by the four methods were assessed at different inclusion levels.
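The inclusion frequency described above can be illustrated with a minimal sketch. It assumes that the variable-selection step has already been run on each resampled and/or imputed data set and that only the resulting sets of retained variables are available. The function names, the variable names, and the toy input are hypothetical and are not taken from the study.

```python
from collections import Counter


def inclusion_frequencies(selected_models, candidates):
    """Proportion of fitted models in which each candidate variable appears."""
    if not selected_models:
        raise ValueError("need at least one fitted model")
    # Count each variable at most once per model.
    counts = Counter(var for model in selected_models for var in set(model))
    n_models = len(selected_models)
    return {var: counts[var] / n_models for var in candidates}


def select_at_level(freqs, level):
    """Keep variables whose inclusion frequency is at least `level` (0.0 to 1.0)."""
    return sorted(var for var, freq in freqs.items() if freq >= level)


# Hypothetical toy input: variables retained in each of four selection runs
# (for example, one run per bootstrap sample of an imputed data set).
runs = [
    {"age", "pain_intensity"},
    {"pain_intensity"},
    {"age", "pain_intensity", "duration"},
    {"pain_intensity", "duration"},
]
candidates = ["age", "pain_intensity", "duration", "sex"]

freqs = inclusion_frequencies(runs, candidates)
full_model = select_at_level(freqs, 0.0)    # level 0% keeps every candidate
strict_model = select_at_level(freqs, 0.6)  # keeps only frequently selected variables
```

With a level of 0 every candidate is retained, which corresponds to the full model; raising the level yields progressively smaller models.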
We found that the effect of imputation variation on the inclusion frequency was larger than the effect of sampling variation. When MI and bootstrapping were combined, with inclusion levels for variable selection ranging from 0% (full model) to 90%, bootstrap-corrected c-index values of 0.70 to 0.71 and slope values of 0.64 to 0.86 were found.
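For readers unfamiliar with the c-index, the following minimal sketch computes the apparent (uncorrected) c-index for a binary outcome as the proportion of concordant event/non-event pairs, counting ties as one half. It does not implement the bootstrap correction reported above, and the function name and toy data are hypothetical.

```python
def c_index(y_true, y_prob):
    """Apparent c-index for a binary outcome (1 = event, 0 = non-event)."""
    concordant = 0.0
    pairs = 0
    for y_i, p_i in zip(y_true, y_prob):
        if y_i != 1:
            continue
        for y_j, p_j in zip(y_true, y_prob):
            if y_j != 0:
                continue
            pairs += 1
            if p_i > p_j:
                concordant += 1.0   # event ranked above non-event
            elif p_i == p_j:
                concordant += 0.5   # tie counts as half
    if pairs == 0:
        raise ValueError("need at least one event and one non-event")
    return concordant / pairs


# Hypothetical toy data: observed outcomes and predicted probabilities.
outcomes = [1, 0, 1, 0]
predicted = [0.8, 0.3, 0.4, 0.5]
apparent_c = c_index(outcomes, predicted)  # 3 of 4 pairs are concordant
```

A value of 0.5 indicates no discrimination and 1.0 indicates perfect discrimination.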
We recommend accounting for both imputation and sampling variation in data sets with missing values. The new procedure of combining MI with bootstrapping for variable selection results in multivariable prognostic models with good performance and is therefore attractive to apply to data sets with missing values.