

Selection of variables for multivariable models: Opportunities and limitations in quantifying model stability by resampling.

Affiliations

Center for Medical Statistics, Informatics and Intelligent Systems, Section for Clinical Biometrics, Medical University of Vienna, Vienna, Austria.

Institute of Biometry and Clinical Epidemiology, Charité - Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin, Humboldt-Universität zu Berlin, and Berlin Institute of Health, Berlin, Germany.

Publication information

Stat Med. 2021 Jan 30;40(2):369-381. doi: 10.1002/sim.8779. Epub 2020 Oct 21.

Abstract

Statistical models are often fitted to obtain a concise description of the association of an outcome variable with some covariates. Even if background knowledge is available to guide preselection of covariates, stepwise variable selection is commonly applied to remove irrelevant ones. This practice may introduce additional variability, and the selection is rarely certain. However, these issues are often ignored and model stability is not questioned. Several resampling-based measures have been proposed to describe model stability, including variable inclusion frequencies (VIFs), model selection frequencies, relative conditional bias (RCB), and root mean squared difference ratio (RMSDR). The latter two were recently proposed to assess bias and variance inflation induced by variable selection. Here, we study the consistency and accuracy of resampling estimates of these measures and the optimal choice of the resampling technique. In particular, we compare subsampling and bootstrapping for assessing the stability of linear, logistic, and Cox models obtained by backward elimination in a simulation study. Moreover, we exemplify the estimation and interpretation of all suggested measures in a study on cardiovascular risk. The VIF and the model selection frequency are consistently estimated only by the subsampling approach. By contrast, the bootstrap is advantageous in terms of bias and precision for estimating the RCB as well as the RMSDR. However, unbiased estimation of the latter quantity requires independence of covariates, which is rarely encountered in practice. Our study stresses the importance of addressing model stability after variable selection and shows how to cope with it.
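
The resampling idea behind variable inclusion frequencies and model selection frequencies can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a linear model only, uses AIC as the elimination criterion, takes a subsample fraction of 0.632 as an arbitrary default, and runs on synthetic data. The function names (`ols_aic`, `backward_elimination`, `stability`) are hypothetical.

```python
import numpy as np
from collections import Counter


def ols_aic(X, y, cols):
    """AIC (up to an additive constant) of an OLS fit with intercept on `cols`."""
    n = len(y)
    design = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = float(np.sum((y - design @ beta) ** 2))
    return n * np.log(rss / n) + 2 * design.shape[1]


def backward_elimination(X, y):
    """Drop one covariate at a time while doing so lowers the AIC.

    Assumption: AIC is the stopping criterion; the abstract does not
    state which criterion the study used.
    """
    selected = list(range(X.shape[1]))
    current = ols_aic(X, y, selected)
    while selected:
        candidates = [
            (ols_aic(X, y, [k for k in selected if k != j]), j) for j in selected
        ]
        best_aic, drop = min(candidates)
        if best_aic >= current:
            break  # no single removal improves the AIC
        selected.remove(drop)
        current = best_aic
    return tuple(selected)


def stability(X, y, n_resamples=200, scheme="subsample", fraction=0.632, seed=0):
    """Estimate inclusion frequencies and model selection frequencies.

    scheme="bootstrap": draw n rows with replacement.
    scheme="subsample": draw int(fraction * n) rows without replacement
    (the fraction is an assumed default, not taken from the abstract).
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    inclusion = np.zeros(p)
    models = Counter()
    for _ in range(n_resamples):
        if scheme == "bootstrap":
            idx = rng.integers(0, n, size=n)
        elif scheme == "subsample":
            idx = rng.choice(n, size=int(fraction * n), replace=False)
        else:
            raise ValueError(f"unknown scheme: {scheme}")
        chosen = backward_elimination(X[idx], y[idx])
        for j in chosen:
            inclusion[j] += 1
        models[chosen] += 1
    model_freq = {m: c / n_resamples for m, c in models.items()}
    return inclusion / n_resamples, model_freq


# Synthetic data (an assumption for illustration): only columns 0 and 1 matter.
data_rng = np.random.default_rng(1)
X = data_rng.normal(size=(200, 5))
y = 2.0 * X[:, 0] + 1.5 * X[:, 1] + data_rng.normal(size=200)

vif_sub, models_sub = stability(X, y, scheme="subsample")
vif_boot, models_boot = stability(X, y, scheme="bootstrap")
```

Comparing `vif_sub` with `vif_boot` for the three noise covariates gives a feel for how the two resampling schemes can disagree on inclusion frequencies, which is the kind of difference the abstract reports for subsampling versus bootstrapping.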


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ae6c/7820988/17dcc34de0c7/SIM-40-369-g001.jpg
