Department of Epidemiology and Data Science, Amsterdam UMC, Vrije Universiteit Amsterdam, Amsterdam Public Health Research Institute, Amsterdam, The Netherlands.
Physical Therapy Practice Panken, Roermond, The Netherlands.
BMC Med Res Methodol. 2022 Aug 4;22(1):214. doi: 10.1186/s12874-022-01693-8.
When developing prognostic models on multiply imputed data, variable selection should be applied to the pooled model. This study uses a simulation study and a practical data example to evaluate the performance of four pooling methods for variable selection in multiply imputed datasets. The methods are D1, D2, D3, and the recently extended Median-P-Rule (MPR), applied to categorical, dichotomous, and continuous variables in logistic regression models.
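As a minimal sketch of the idea behind the MPR, the pooled P-value of a variable can be taken as the median of the P-values obtained for that variable in each imputed dataset. The function name `median_p_rule`, the example P-values, and the choice of per-dataset test are assumptions made for illustration (for a categorical variable with more than two levels, an overall test of the variable in each imputed dataset is assumed); the 0.05 threshold is the one reported for the NHANES analyses.

```python
from statistics import median


def median_p_rule(p_values, alpha=0.05):
    """Pool one variable's P-values across imputed datasets by their median.

    p_values: one P-value per imputed dataset, each from the same overall
              significance test of the variable (assumed, not from the source).
    Returns (pooled P-value, True if the variable would be retained).
    """
    if not p_values:
        raise ValueError("need at least one P-value")
    pooled = median(p_values)
    return pooled, pooled < alpha


# Hypothetical P-values for one categorical variable in m = 5 imputed datasets.
p_per_imputation = [0.03, 0.08, 0.04, 0.02, 0.06]
pooled_p, keep = median_p_rule(p_per_imputation)
print(pooled_p, keep)
```

Because only the per-dataset P-values are needed, this rule requires no pooling of coefficients or covariance matrices, which is one reading of why the MPR is described as simple to apply.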
Four datasets (n = 200 and n = 500) were simulated, each with 9 variables and correlations between these variables of 0.2 or 0.6. These datasets included 2 categorical and 2 continuous variables with 20% of values missing at random. Multiple imputation (m = 5) was applied, and the four methods were compared with selection from the full model (without missing data). The same analyses were repeated in five multiply imputed real-world datasets (NHANES) (m = 5, p = 0.05, N = 250/300/400/500/1000).
In the simulated datasets, the differences between the pooling methods were most evident in the smaller datasets. The MPR performed as well as all other pooling methods on selection frequency and on the P-values of the continuous and dichotomous variables. However, the MPR performed consistently better for pooling and selecting categorical variables in multiply imputed datasets, and also in the stability of the selected prognostic models. Analyses in the NHANES dataset showed that all methods mostly selected the same models. Compared with each other, however, the D2 method seemed to be the least sensitive and the MPR the most sensitive, as well as the simplest and easiest to apply.
Considering that the MPR is the simplest and easiest pooling method for epidemiologists and applied researchers to use, we cautiously recommend using the MPR to pool categorical variables with more than two levels after multiple imputation, in combination with backward selection (BWS) procedures. Because the MPR never performed worse than the other methods for continuous and dichotomous variables, we also advise using the MPR for these types of variables.
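The combination of the MPR with backward selection can be sketched as a loop that repeatedly drops the variable with the largest median P-value until every remaining variable falls below the threshold. This is an illustrative outline under stated assumptions, not the authors' implementation: `backward_select_mpr`, the callback `p_values_for`, and the toy variable names and P-values are all hypothetical, and in practice the callback would refit the logistic model in each imputed dataset at every step.

```python
from statistics import median


def backward_select_mpr(variables, p_values_for, alpha=0.05):
    """Backward selection using median P-values across imputed datasets.

    p_values_for(selected) must return {variable: [P-value per imputed
    dataset]} for a model containing exactly the variables in `selected`.
    """
    selected = list(variables)
    while selected:
        per_variable = p_values_for(selected)
        pooled = {v: median(per_variable[v]) for v in selected}
        worst = max(selected, key=pooled.get)  # largest pooled P-value
        if pooled[worst] < alpha:
            break  # every remaining variable is significant
        selected.remove(worst)
    return selected


# Toy stand-in for model fitting: fixed P-values for m = 5 imputed datasets.
# A real callback would refit the model each time `selected` changes.
toy = {
    "age": [0.01, 0.02, 0.01, 0.03, 0.02],
    "bmi": [0.20, 0.04, 0.30, 0.25, 0.10],
    "smoking": [0.04, 0.06, 0.03, 0.02, 0.07],
}


def toy_p_values(selected):
    return {v: toy[v] for v in selected}


result = backward_select_mpr(["age", "bmi", "smoking"], toy_p_values)
print(result)
```

Passing the model fitting in as a callback keeps the selection logic independent of any particular statistics library.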