Selection of variables for multivariable models: Opportunities and limitations in quantifying model stability by resampling.

Affiliations

Center for Medical Statistics, Informatics and Intelligent Systems, Section for Clinical Biometrics, Medical University of Vienna, Vienna, Austria.

Institute of Biometry and Clinical Epidemiology, Charité - Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin, Humboldt-Universität zu Berlin, and Berlin Institute of Health, Berlin, Germany.

Publication information

Stat Med. 2021 Jan 30;40(2):369-381. doi: 10.1002/sim.8779. Epub 2020 Oct 21.

DOI:10.1002/sim.8779

PMID:33089538

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC7820988/

Abstract

Statistical models are often fitted to obtain a concise description of the association of an outcome variable with some covariates. Even if background knowledge is available to guide preselection of covariates, stepwise variable selection is commonly applied to remove irrelevant ones. This practice may introduce additional variability and selection is rarely certain. However, these issues are often ignored and model stability is not questioned. Several resampling-based measures were proposed to describe model stability, including variable inclusion frequencies (VIFs), model selection frequencies, relative conditional bias (RCB), and root mean squared difference ratio (RMSDR). The latter two were recently proposed to assess bias and variance inflation induced by variable selection. Here, we study the consistency and accuracy of resampling estimates of these measures and the optimal choice of the resampling technique. In particular, we compare subsampling and bootstrapping for assessing stability of linear, logistic, and Cox models obtained by backward elimination in a simulation study. Moreover, we exemplify the estimation and interpretation of all suggested measures in a study on cardiovascular risk. The VIF and the model selection frequency are only consistently estimated in the subsampling approach. By contrast, the bootstrap is advantageous in terms of bias and precision for estimating the RCB as well as the RMSDR. However, unbiased estimation of the latter quantity requires independence of covariates, which is rarely encountered in practice. Our study stresses the importance of addressing model stability after variable selection and shows how to cope with it.
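To make the resampling idea in the abstract concrete, the following is a minimal sketch of how variable inclusion frequencies and model selection frequencies could be estimated for a linear model. It is not the authors' implementation. Everything specific here is an assumption made for illustration: the AIC-based stopping rule for backward elimination, the subsample fraction of 0.5, the simulated data, and the hypothetical function names `aic_linear`, `backward_elimination`, and `stability`.

```python
import numpy as np
from collections import Counter


def aic_linear(X, y):
    """AIC (up to an additive constant) of an OLS fit with intercept."""
    n = len(y)
    design = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = float(np.sum((y - design @ beta) ** 2))
    return n * np.log(rss / n) + 2 * design.shape[1]


def backward_elimination(X, y):
    """Drop one covariate at a time for as long as that lowers the AIC.

    Assumption: the AIC is used as the elimination criterion here;
    the abstract does not state which criterion the study used.
    """
    selected = list(range(X.shape[1]))
    best = aic_linear(X[:, selected], y)
    while selected:
        candidates = [
            (aic_linear(X[:, [j for j in selected if j != k]], y), k)
            for k in selected
        ]
        cand_aic, drop = min(candidates)
        if cand_aic >= best:
            break  # no single removal improves the AIC
        best = cand_aic
        selected.remove(drop)
    return tuple(selected)


def stability(X, y, n_resamples=200, fraction=0.5, bootstrap=False, seed=0):
    """Estimate inclusion and model selection frequencies by resampling."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    inclusion = np.zeros(p)
    models = Counter()
    for _ in range(n_resamples):
        if bootstrap:
            # Bootstrap: n draws with replacement.
            idx = rng.integers(0, n, size=n)
        else:
            # Subsampling: a fraction of n drawn without replacement.
            idx = rng.choice(n, size=int(fraction * n), replace=False)
        model = backward_elimination(X[idx], y[idx])
        for j in model:
            inclusion[j] += 1
        models[model] += 1
    vif = inclusion / n_resamples
    model_freq = {m: c / n_resamples for m, c in models.items()}
    return vif, model_freq


# Simulated data (assumption): two true predictors and three noise covariates.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = 1.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=200)

vif, model_freq = stability(X, y)
print("variable inclusion frequencies:", vif)
print("most frequently selected model:", max(model_freq, key=model_freq.get))
```

Setting `bootstrap=True` switches the same routine to bootstrap resampling, which allows the two schemes contrasted in the abstract to be compared on the same data.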


Figure (image from the article): https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ae6c/7820988/17dcc34de0c7/SIM-40-369-g001.jpg

Similar articles

Selection of variables for multivariable models: Opportunities and limitations in quantifying model stability by resampling.

Stat Med. 2021 Jan 30;40(2):369-381. doi: 10.1002/sim.8779. Epub 2020 Oct 21.

Subsampling versus bootstrapping in resampling-based model selection for multivariable regression.

Biometrics. 2016 Mar;72(1):272-80. doi: 10.1111/biom.12381. Epub 2015 Aug 19.

Bootstrap model selection had similar performance for selecting authentic and noise variables compared to backward variable elimination: a simulation study.

J Clin Epidemiol. 2008 Oct;61(10):1009-17.e1. doi: 10.1016/j.jclinepi.2007.11.014. Epub 2008 Jun 9.

Assessing the accuracy and stability of variable selection methods for random forest modeling in ecology.

Environ Monit Assess. 2017 Jul;189(7):316. doi: 10.1007/s10661-017-6025-0. Epub 2017 Jun 6.

Variable selection - A review and recommendations for the practicing statistician.

Biom J. 2018 May;60(3):431-449. doi: 10.1002/bimj.201700067. Epub 2018 Jan 2.

Categorical variables with many categories are preferentially selected in bootstrap-based model selection procedures for multivariable regression models.

Biom J. 2016 May;58(3):652-73. doi: 10.1002/bimj.201400185. Epub 2016 Mar 22.

On stability issues in deriving multivariable regression models.

Biom J. 2015 Jul;57(4):531-55. doi: 10.1002/bimj.201300222. Epub 2014 Dec 15.

Variable selection in the presence of missing data: resampling and imputation.

Biostatistics. 2015 Jul;16(3):596-610. doi: 10.1093/biostatistics/kxv003. Epub 2015 Feb 18.

Augmented backward elimination: a pragmatic and purposeful way to develop statistical models.

PLoS One. 2014 Nov 21;9(11):e113677. doi: 10.1371/journal.pone.0113677. eCollection 2014.

A bootstrap resampling procedure for model building: application to the Cox regression model.

Stat Med. 1992 Dec;11(16):2093-109. doi: 10.1002/sim.4780111607.

Cited by

Descriptive Analysis and Factors Associated With Relapse in Dogs With Presumptive Idiopathic Immune-Mediated Polyarthritis.

J Vet Intern Med. 2025 Sep-Oct;39(5):e70241. doi: 10.1111/jvim.70241.

Letter to: identifying influencing factors and constructing a prediction model for long COVID-19 in hemodialysis patients.

Int Urol Nephrol. 2025 Aug 9. doi: 10.1007/s11255-025-04633-9.

Development and validation of a prognostic staging system for primary plasma cell leukemia.

J Hematol Oncol. 2025 Jul 15;18(1):72. doi: 10.1186/s13045-025-01723-0.

Response to Letter to "Predictors of Pathologic Non-response to Neoadjuvant Approaches in Locally Advanced Rectal Cancer".

Ann Surg Oncol. 2025 Jul 8. doi: 10.1245/s10434-025-17834-4.

Letter to Predictors of Pathologic Nonresponse to Neoadjuvant Approaches in Locally Advanced Rectal Cancer.

Ann Surg Oncol. 2025 Sep;32(9):6748-6749. doi: 10.1245/s10434-025-17744-5. Epub 2025 Jun 24.

Variable selection methods for descriptive modeling.

PLoS One. 2025 Jun 2;20(6):e0321601. doi: 10.1371/journal.pone.0321601. eCollection 2025.

Prediction Modeling With Many Correlated and Zero-Inflated Predictors: Assessing the Nonnegative Garrote Approach.

Stat Med. 2025 Apr;44(8-9):e70062. doi: 10.1002/sim.70062.

Socioeconomic determinants of low birth weight and its association with peripubertal obesity in Brazil.

Front Public Health. 2025 Mar 19;13:1424342. doi: 10.3389/fpubh.2025.1424342. eCollection 2025.

Circulating Metabolite Profiles and Risk of Coronary Heart Disease Among Racially and Geographically Diverse Populations.

Circ Genom Precis Med. 2024 Aug;17(4):e004437. doi: 10.1161/CIRCGEN.123.004437. Epub 2024 Jul 1.

What question are we trying to answer? Embracing causal inference.

Front Vet Sci. 2024 May 21;11:1402981. doi: 10.3389/fvets.2024.1402981. eCollection 2024.

References

Re-estimation improved the performance of two Framingham cardiovascular risk equations and the Pooled Cohort equations: A nationwide registry analysis.

Sci Rep. 2020 May 18;10(1):8140. doi: 10.1038/s41598-020-64629-6.

Using simulation studies to evaluate statistical methods.

Stat Med. 2019 May 20;38(11):2074-2102. doi: 10.1002/sim.8086. Epub 2019 Jan 16.

External validation of two Framingham cardiovascular risk equations and the Pooled Cohort equations: A nationwide registry analysis.

Int J Cardiol. 2019 May 15;283:165-170. doi: 10.1016/j.ijcard.2018.11.001. Epub 2018 Nov 5.

Variable selection - A review and recommendations for the practicing statistician.

Biom J. 2018 May;60(3):431-449. doi: 10.1002/bimj.201700067. Epub 2018 Jan 2.

Categorical variables with many categories are preferentially selected in bootstrap-based model selection procedures for multivariable regression models.

Biom J. 2016 May;58(3):652-73. doi: 10.1002/bimj.201400185. Epub 2016 Mar 22.

Pitfalls of hypothesis tests and model selection on bootstrap samples: Causes and consequences in biometrical applications.

Biom J. 2016 May;58(3):447-73. doi: 10.1002/bimj.201400246. Epub 2015 Sep 15.

Subsampling versus bootstrapping in resampling-based model selection for multivariable regression.

Biometrics. 2016 Mar;72(1):272-80. doi: 10.1111/biom.12381. Epub 2015 Aug 19.

On stability issues in deriving multivariable regression models.

Biom J. 2015 Jul;57(4):531-55. doi: 10.1002/bimj.201300222. Epub 2014 Dec 15.

2013 ACC/AHA guideline on the assessment of cardiovascular risk: a report of the American College of Cardiology/American Heart Association Task Force on Practice Guidelines.

J Am Coll Cardiol. 2014 Jul 1;63(25 Pt B):2935-2959. doi: 10.1016/j.jacc.2013.11.005. Epub 2013 Nov 12.

Adapting prediction error estimates for biased complexity selection in high-dimensional bootstrap samples.

Stat Appl Genet Mol Biol. 2008;7(1):Article12. doi: 10.2202/1544-6115.1346. Epub 2008 Mar 14.
