Variable selection - A review and recommendations for the practicing statistician.

Authors

Heinze Georg, Wallisch Christine, Dunkler Daniela

Affiliation

Section for Clinical Biometrics, Center for Medical Statistics, Informatics and Intelligent Systems, Medical University of Vienna, Vienna, 1090, Austria.

Publication information

Biom J. 2018 May;60(3):431-449. doi: 10.1002/bimj.201700067. Epub 2018 Jan 2.

DOI:10.1002/bimj.201700067
PMID:29292533
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC5969114/
Abstract

Statistical models support medical research by facilitating individualized outcome prognostication conditional on independent variables or by estimating effects of risk factors adjusted for covariates. Theory of statistical models is well-established if the set of independent variables to consider is fixed and small. Hence, we can assume that effect estimates are unbiased and the usual methods for confidence interval estimation are valid. In routine work, however, it is not known a priori which covariates should be included in a model, and often we are confronted with the number of candidate variables in the range 10-30. This number is often too large to be considered in a statistical model. We provide an overview of various available variable selection methods that are based on significance or information criteria, penalized likelihood, the change-in-estimate criterion, background knowledge, or combinations thereof. These methods were usually developed in the context of a linear regression model and then transferred to more generalized linear models or models for censored survival data. Variable selection, in particular if used in explanatory modeling where effect estimates are of central interest, can compromise stability of a final model, unbiasedness of regression coefficients, and validity of p-values or confidence intervals. Therefore, we give pragmatic recommendations for the practicing statistician on application of variable selection methods in general (low-dimensional) modeling problems and on performing stability investigations and inference. We also propose some quantities based on resampling the entire variable selection process to be routinely reported by software packages offering automated variable selection algorithms.
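
The abstract mentions quantities based on resampling the entire variable selection process. As a rough illustration only, the sketch below repeats a selection procedure on bootstrap resamples and records how often each candidate variable is retained. The choice of backward elimination by AIC in a linear model, the simulated data, and all function names (`ols_rss`, `aic`, `backward_elimination`, `bootstrap_inclusion_frequencies`) are assumptions made for this example and are not taken from the abstract.

```python
import math
import random

def ols_rss(design, y):
    """Residual sum of squares of the least-squares fit of y on the design columns."""
    n, p = len(design), len(design[0])
    # Augmented normal equations [X'X | X'y], solved by Gaussian elimination.
    A = [[sum(design[i][a] * design[i][b] for i in range(n)) for b in range(p)]
         + [sum(design[i][a] * y[i] for i in range(n))] for a in range(p)]
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(A[r][c]))  # partial pivoting
        A[c], A[piv] = A[piv], A[c]
        for r in range(c + 1, p):
            f = A[r][c] / A[c][c]
            for k in range(c, p + 1):
                A[r][k] -= f * A[c][k]
    beta = [0.0] * p
    for c in reversed(range(p)):
        beta[c] = (A[c][p] - sum(A[c][k] * beta[k] for k in range(c + 1, p))) / A[c][c]
    return sum((y[i] - sum(design[i][j] * beta[j] for j in range(p))) ** 2
               for i in range(n))

def aic(X, y, cols):
    """Gaussian AIC (up to an additive constant) for intercept plus the given columns."""
    n = len(y)
    design = [[1.0] + [row[j] for j in cols] for row in X]
    return n * math.log(ols_rss(design, y) / n) + 2 * (len(cols) + 1)

def backward_elimination(X, y):
    """Start from all candidates; drop one variable at a time while AIC decreases."""
    selected = list(range(len(X[0])))
    best = aic(X, y, selected)
    while selected:
        score, drop = min((aic(X, y, [c for c in selected if c != j]), j)
                          for j in selected)
        if score >= best:
            break
        best = score
        selected.remove(drop)
    return selected

def bootstrap_inclusion_frequencies(X, y, n_boot=50, seed=0):
    """Share of bootstrap resamples in which each candidate variable is selected."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    counts = [0] * p
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample rows with replacement
        Xb, yb = [X[i] for i in idx], [y[i] for i in idx]
        # The whole selection procedure is rerun on every resample.
        for j in backward_elimination(Xb, yb):
            counts[j] += 1
    return [c / n_boot for c in counts]

# Simulated data (hypothetical): 5 candidates, only the first two affect y.
data_rng = random.Random(42)
X = [[data_rng.gauss(0, 1) for _ in range(5)] for _ in range(100)]
y = [1.0 * row[0] + 0.5 * row[1] + data_rng.gauss(0, 1) for row in X]

freqs = bootstrap_inclusion_frequencies(X, y, n_boot=50, seed=1)
for j, f in enumerate(freqs):
    print(f"x{j}: selected in {f:.0%} of resamples")
```

In a sketch like this, a variable that is selected in nearly every resample is a stable choice, while frequencies well below 100% indicate that the selected model depends on the particular sample. A real analysis would normally use an established regression library rather than the hand-written solver shown here.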

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5bd1/5969114/b9e3ddb56cab/BIMJ-60-431-g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/5bd1/5969114/5b2205f78a43/BIMJ-60-431-g001.jpg
