A simple pooling method for variable selection in multiply imputed datasets outperformed complex methods.

Author information

Department of Epidemiology and Data Science, Amsterdam UMC, Vrije Universiteit Amsterdam, Amsterdam Public Health Research Institute, Amsterdam, The Netherlands.

Physical Therapy Practice Panken, Roermond, The Netherlands.

Publication information

BMC Med Res Methodol. 2022 Aug 4;22(1):214. doi: 10.1186/s12874-022-01693-8.

DOI:10.1186/s12874-022-01693-8

PMID:35927610

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9351113/

Abstract

BACKGROUND

When prognostic models are developed after multiple imputation, it is advised to apply variable selection to the pooled model. This study uses a simulation study and a practical data example to evaluate the performance of four pooling methods for variable selection in multiply imputed datasets. These methods are D1, D2, D3, and the recently extended Median-P-Rule (MPR), applied to categorical, dichotomous, and continuous variables in logistic regression models.
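As a minimal sketch of the idea behind the MPR, assuming it pools a variable's test across imputations by taking the median of the per-imputation P-values: the function name `median_p_rule` and the P-values below are hypothetical and chosen only for illustration; they are not taken from the study.

```python
from statistics import median


def median_p_rule(p_values):
    """Pool per-imputation P-values for one variable by taking their median."""
    if not p_values:
        raise ValueError("need at least one P-value")
    return median(p_values)


# Hypothetical P-values for the overall test of one categorical variable,
# one value per imputed dataset (m = 5). These numbers are made up.
p_per_imputation = [0.03, 0.08, 0.04, 0.06, 0.02]

pooled_p = median_p_rule(p_per_imputation)

# Selection decision at a significance level of 0.05.
keep_variable = pooled_p < 0.05
```

Because the pooled value is just the middle P-value, no variance components or reference distributions have to be combined, which is what makes this rule simpler to apply than D1, D2, or D3.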

METHODS

Four datasets were simulated (n = 200 and n = 500), each with 9 variables and with correlations between these variables of 0.2 or 0.6. The datasets included 2 categorical and 2 continuous variables with 20% of data missing at random. Multiple imputation (m = 5) was applied, and the four methods were compared with selection from the full model (without missing data). The same analyses were repeated in five multiply imputed real-world datasets (NHANES) (m = 5, p = 0.05, N = 250/300/400/500/1000).

RESULTS

In the simulated datasets, the differences between the pooling methods were most evident in the smaller datasets. The MPR performed as well as all other pooling methods on selection frequency and on the P-values of the continuous and dichotomous variables. For pooling and selecting categorical variables in multiply imputed datasets, and for the stability of the selected prognostic models, the MPR performed consistently better. Analyses in the NHANES datasets showed that all methods mostly selected the same models. Compared with each other, however, the D2 method seemed to be the least sensitive, and the MPR was the most sensitive as well as the simplest and easiest method to apply.

CONCLUSIONS

Considering that the MPR is the simplest and easiest pooling method for epidemiologists and applied researchers to use, we cautiously recommend using the MPR to pool categorical variables with more than two levels after multiple imputation, in combination with backward selection (BWS) procedures. Because the MPR never performed worse than the other methods for continuous and dichotomous variables, we also advise using the MPR for these types of variables.
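The combination of the MPR with backward selection can be sketched as follows. This is an illustrative outline under stated assumptions, not the authors' implementation: `backward_select_mpr` and the callback `pvalues_per_imputation` are hypothetical names, and the callback stands in for refitting the logistic regression model on every imputed dataset and returning each variable's P-values.

```python
from statistics import median


def backward_select_mpr(variables, pvalues_per_imputation, alpha=0.05):
    """Backward selection using the median P-value across imputations.

    pvalues_per_imputation(current_vars) is a caller-supplied (hypothetical)
    function that refits the model on each imputed dataset and returns
    {variable: [p_1, ..., p_m]} for the variables currently in the model.
    """
    current = list(variables)
    while current:
        per_var = pvalues_per_imputation(current)
        pooled = {v: median(per_var[v]) for v in current}
        # The candidate for removal is the variable with the largest pooled P-value.
        worst = max(current, key=lambda v: pooled[v])
        if pooled[worst] < alpha:
            break  # every remaining variable is significant
        current.remove(worst)
    return current


def fake_pvalues(current_vars):
    """Stub with made-up P-values (m = 5). A real implementation would refit
    the model, so the P-values would change after each removal."""
    table = {
        "age": [0.01, 0.02, 0.01, 0.03, 0.02],
        "sex": [0.20, 0.04, 0.30, 0.25, 0.06],
        "smoking": [0.04, 0.06, 0.03, 0.02, 0.045],
    }
    return {v: table[v] for v in current_vars}


selected = backward_select_mpr(["age", "sex", "smoking"], fake_pvalues)
```

With these made-up values, `sex` has a pooled P-value of 0.20 and is removed first; the remaining variables both have pooled P-values below 0.05, so the loop stops.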

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/1e69/9351113/3f65d5e8dafe/12874_2022_1693_Fig1_HTML.jpg
