A comparison of model selection methods for prediction in the presence of multiply imputed data.

Author information

Thao Le Thi Phuong, Geskus Ronald

Affiliations

Biostatistics group, Oxford University Clinical Research Unit, Ho Chi Minh City, Vietnam.

Nuffield Department of Medicine, University of Oxford, Oxford, UK.

Publication information

Biom J. 2019 Mar;61(2):343-356. doi: 10.1002/bimj.201700232. Epub 2018 Oct 23.

Many approaches have been proposed for variable selection with multiply imputed (MI) data in the development of a prognostic model. However, no method prevails as uniformly best. We conducted a simulation study with a binary outcome and a logistic regression model to compare two classes of variable selection methods in the presence of MI data: (I) model selection on bootstrap data, using backward elimination based on AIC or lasso, and fitting the final model based on the most frequently (e.g. ) selected variables over all MI and bootstrap data sets; (II) model selection on the original MI data, using lasso. The final model is obtained by (i) averaging estimates of variables that were selected in any MI data set or (ii) in 50% of the MI data; (iii) performing lasso on the stacked MI data; or (iv) as in (iii) but using individual weights as determined by the fraction of missingness. In all lasso models, we used both the optimal penalty and the 1-se rule. We considered recalibrating models to correct for overshrinkage due to the suboptimal penalty by refitting the linear predictor or all individual variables. We applied the methods to a real dataset of 951 adult patients with tuberculous meningitis to predict mortality within nine months. Overall, lasso selection with the 1-se penalty shows the best performance in both approaches I and II. Stacking MI data is an attractive approach because it does not require choosing a selection threshold when combining results from separate MI data sets.
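As an illustration only, combination rules (i) and (ii) and the 1-se rule might be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names `combine_selected` and `one_se_penalty`, the variable names `x1` to `x3`, and all numbers are hypothetical. It also assumes that "in 50% of the MI data" means "in at least 50%", and that estimates are averaged over all imputed data sets with unselected coefficients counted as zero; the abstract does not specify either detail.

```python
from statistics import mean


def combine_selected(coefs_per_imputation, min_fraction):
    """Keep variables selected in at least `min_fraction` of the imputed data sets.

    coefs_per_imputation: list of dicts mapping variable name -> lasso coefficient
    (0.0 or a missing key means "not selected" in that imputed data set).
    Returns a dict of variable -> coefficient averaged over ALL imputed data sets
    (assumption: unselected coefficients count as zero in the average).
    """
    m = len(coefs_per_imputation)
    names = sorted({name for coefs in coefs_per_imputation for name in coefs})
    combined = {}
    for name in names:
        values = [coefs.get(name, 0.0) for coefs in coefs_per_imputation]
        n_selected = sum(1 for v in values if v != 0.0)
        # A variable must be selected at least once, and often enough.
        if n_selected > 0 and n_selected / m >= min_fraction:
            combined[name] = mean(values)
    return combined


def one_se_penalty(penalties, cv_errors, cv_ses):
    """Largest penalty whose CV error is within one SE of the minimum CV error."""
    best = min(range(len(penalties)), key=lambda i: cv_errors[i])
    threshold = cv_errors[best] + cv_ses[best]
    eligible = [p for p, e in zip(penalties, cv_errors) if e <= threshold]
    return max(eligible)


# Hypothetical lasso coefficients from M = 4 imputed data sets (made-up numbers).
fits = [
    {"x1": 0.5, "x2": -0.25, "x3": 0.0},
    {"x1": 0.5, "x2": -0.25, "x3": 0.0},
    {"x1": 0.25, "x2": 0.0, "x3": 0.0},
    {"x1": 0.75, "x2": 0.0, "x3": 0.5},
]
any_rule = combine_selected(fits, min_fraction=0.0)   # rule (i): selected in any data set
half_rule = combine_selected(fits, min_fraction=0.5)  # rule (ii): selected in >= 50%
```

Under these assumptions, rule (i) keeps all three variables, while rule (ii) drops `x3`, which is selected in only one of the four imputed data sets. The threshold passed as `min_fraction` is exactly the tuning choice that the stacked approaches (iii) and (iv) avoid.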
