Instability of Variable-selection Algorithms Used to Identify True Predictors of an Outcome in Intermediate-dimension Epidemiologic Studies.

Affiliation

From the Team of Environmental Epidemiology, IAB, Institute for Advanced Biosciences, Inserm, CNRS, CHU-Grenoble-Alpes, University Grenoble-Alpes, Grenoble, France.

Publication information

Epidemiology. 2021 May 1;32(3):402-411. doi: 10.1097/EDE.0000000000001340.

DOI: 10.1097/EDE.0000000000001340
PMID: 33652445

Abstract

BACKGROUND

Machine-learning algorithms are increasingly used in epidemiology to identify true predictors of a health outcome when many potential predictors are measured. However, these algorithms can provide different outputs when repeatedly applied to the same dataset, which can compromise research reproducibility. We aimed to illustrate that commonly used algorithms are unstable and, using the example of Least Absolute Shrinkage and Selection Operator (LASSO), that stabilization method choice is crucial.

METHODS

In a simulation study, we tested the stability and performance of widely used machine-learning algorithms (LASSO, Elastic-Net, and Deletion-Substitution-Addition [DSA]). We then assessed the effectiveness of six methods to stabilize LASSO and their impact on performance. We assumed that a linear combination of factors drawn from a simulated set of 173 quantitative variables assessed in 1,301 subjects influenced to varying extents a continuous health outcome. We assessed model stability, sensitivity, and false discovery proportion.
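
The three evaluation criteria named above can be made concrete with a small sketch. The abstract does not define how stability was quantified, so the mean pairwise Jaccard index used here is an assumption chosen for illustration, as are the function names and the toy variable names.

```python
from itertools import combinations


def sensitivity(selected, true_predictors):
    """Share of the true predictors that were selected."""
    selected, true_predictors = set(selected), set(true_predictors)
    if not true_predictors:
        return float("nan")
    return len(selected & true_predictors) / len(true_predictors)


def false_discovery_proportion(selected, true_predictors):
    """Share of selected variables that are not true predictors (0 if nothing is selected)."""
    selected, true_predictors = set(selected), set(true_predictors)
    if not selected:
        return 0.0
    return len(selected - true_predictors) / len(selected)


def mean_pairwise_jaccard(runs):
    """Illustrative stability score (an assumption, not the paper's definition):
    average Jaccard overlap between the variable sets of every pair of runs."""
    sets = [set(r) for r in runs]
    if len(sets) < 2:
        raise ValueError("need at least two runs to measure stability")
    scores = []
    for a, b in combinations(sets, 2):
        union = a | b
        scores.append(1.0 if not union else len(a & b) / len(union))
    return sum(scores) / len(scores)


# Toy example: three runs of the same algorithm on the same dataset.
true_predictors = {"x1", "x2", "x3"}
runs = [{"x1", "x2", "x9"}, {"x1", "x2"}, {"x1", "x2", "x9"}]

print(round(sensitivity(runs[0], true_predictors), 3))                 # 0.667
print(round(false_discovery_proportion(runs[0], true_predictors), 3))  # 0.333
print(round(mean_pairwise_jaccard(runs), 3))                           # 0.778
```

A score of 1 would mean every run returned the same variable set; the toy runs differ only in whether `x9` is selected, which is enough to pull the score below 1.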

RESULTS

All tested algorithms were unstable. For LASSO, stabilization methods improved stability without ensuring perfect stability, a finding confirmed by application to an exposome study. Stabilization methods also affected performance. Specifically, stabilization based on hyperparameter optimization, frequently implemented in epidemiology, increased the false discovery proportion dramatically when predictors explained a low share of outcome variability. In contrast, stabilization based on stability selection procedure often decreased the false discovery proportion, while sometimes simultaneously lowering sensitivity.

CONCLUSIONS

The instability of machine-learning methods should concern epidemiologists relying on them for variable selection, as stabilizing a model can impact its performance. For LASSO, stabilization methods based on the stability selection procedure (rather than those addressing prediction stability) should be preferred to identify true predictors.
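
As a rough illustration of the stability selection idea recommended above, the sketch below refits LASSO on random half-samples and keeps only the variables selected in a large share of the fits. It is a minimal pure-Python sketch, not the study's implementation: the coordinate-descent solver, the penalty `lam`, the 0.8 threshold, the number of subsamples, and the small simulated dataset (120 subjects, 8 variables, 2 true predictors) are all assumptions made for illustration.

```python
import math
import random


def standardize(column):
    """Center a column and scale it to unit (population) variance."""
    mean = sum(column) / len(column)
    sd = math.sqrt(sum((v - mean) ** 2 for v in column) / len(column))
    return [(v - mean) / sd for v in column]


def lasso_support(cols, y, lam, n_iter=50):
    """Indices with nonzero LASSO coefficients, via cyclic coordinate descent.
    Assumes standardized columns and a centered outcome."""
    n, p = len(y), len(cols)
    beta = [0.0] * p
    resid = list(y)
    for _ in range(n_iter):
        for j in range(p):
            col = cols[j]
            # Covariance of column j with the partial residual.
            rho = sum(c * r for c, r in zip(col, resid)) / n + beta[j]
            # Soft-thresholding update.
            new = math.copysign(max(abs(rho) - lam, 0.0), rho)
            delta = new - beta[j]
            if delta != 0.0:
                resid = [r - delta * c for r, c in zip(resid, col)]
                beta[j] = new
    return {j for j in range(p) if beta[j] != 0.0}


def stability_selection(X, y, lam, n_subsamples=20, threshold=0.8, seed=0):
    """Selection frequency of each variable over random half-samples,
    plus the set of variables whose frequency reaches the threshold."""
    rng = random.Random(seed)
    n, p = len(y), len(X[0])
    counts = [0] * p
    for _ in range(n_subsamples):
        idx = rng.sample(range(n), n // 2)
        cols = [standardize([X[i][j] for i in idx]) for j in range(p)]
        y_mean = sum(y[i] for i in idx) / len(idx)
        y_sub = [y[i] - y_mean for i in idx]
        for j in lasso_support(cols, y_sub, lam):
            counts[j] += 1
    freq = [c / n_subsamples for c in counts]
    return freq, {j for j in range(p) if freq[j] >= threshold}


# Hypothetical data: only variables 0 and 1 influence the outcome.
data_rng = random.Random(1)
X = [[data_rng.gauss(0, 1) for _ in range(8)] for _ in range(120)]
y = [2.0 * row[0] - 2.0 * row[1] + data_rng.gauss(0, 1) for row in X]

freq, stable = stability_selection(X, y, lam=0.5)
print(sorted(stable))
```

Because a variable must be selected in most subsamples to be retained, occasional chance selections of noise variables are filtered out. This matches the trade-off reported in the abstract: a lower false discovery proportion, at the possible cost of sensitivity when true effects are weak.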

Similar articles

1. Instability of Variable-selection Algorithms Used to Identify True Predictors of an Outcome in Intermediate-dimension Epidemiologic Studies.
   Epidemiology. 2021 May 1;32(3):402-411. doi: 10.1097/EDE.0000000000001340.
2. Stability selection for mixed effect models with large numbers of predictor variables: A simulation study.
   Prev Vet Med. 2022 Sep;206:105714. doi: 10.1016/j.prevetmed.2022.105714. Epub 2022 Jul 12.
3. Variable selection in social-environmental data: sparse regression and tree ensemble machine learning approaches.
   BMC Med Res Methodol. 2020 Dec 10;20(1):302. doi: 10.1186/s12874-020-01183-9.
4. Recursive Random Forests Enable Better Predictive Performance and Model Interpretation than Variable Selection by LASSO.
   J Chem Inf Model. 2015 Apr 27;55(4):736-46. doi: 10.1021/ci500715e. Epub 2015 Mar 16.
5. A systematic comparison of statistical methods to detect interactions in exposome-health associations.
   Environ Health. 2017 Jul 14;16(1):74. doi: 10.1186/s12940-017-0277-6.
6. Stability selection for lasso, ridge and elastic net implemented with AFT models.
   Stat Appl Genet Mol Biol. 2019 Oct 7;18(5). doi: 10.1515/sagmb-2017-0001.
7. Least absolute shrinkage and selection operator type methods for the identification of serum biomarkers of overweight and obesity: simulation and application.
   BMC Med Res Methodol. 2016 Nov 14;16(1):154. doi: 10.1186/s12874-016-0254-8.
8. The effect of machine learning regression algorithms and sample size on individualized behavioral prediction with functional connectivity features.
   Neuroimage. 2018 Sep;178:622-637. doi: 10.1016/j.neuroimage.2018.06.001. Epub 2018 Jun 2.
9. Machine learning algorithms for outcome prediction in (chemo)radiotherapy: An empirical comparison of classifiers.
   Med Phys. 2018 Jul;45(7):3449-3459. doi: 10.1002/mp.12967. Epub 2018 Jun 13.
10. Overall Survival Prognostic Modelling of Non-small Cell Lung Cancer Patients Using Positron Emission Tomography/Computed Tomography Harmonised Radiomics Features: The Quest for the Optimal Machine Learning Algorithm.
    Clin Oncol (R Coll Radiol). 2022 Feb;34(2):114-127. doi: 10.1016/j.clon.2021.11.014. Epub 2021 Dec 3.

Cited by

1. Characteristics of ChatGPT users from Germany: Implications for the digital divide from web tracking data.
   PLoS One. 2025 Jan 17;20(1):e0309047. doi: 10.1371/journal.pone.0309047. eCollection 2025.
2. A bootstrap model comparison test for identifying genes with context-specific patterns of genetic regulation.
   Ann Appl Stat. 2024 Sep;18(3):1840-1857. doi: 10.1214/23-aoas1859. Epub 2024 Aug 5.
3. Association of High-Dose Erythropoietin With Circulating Biomarkers and Neurodevelopmental Outcomes Among Neonates With Hypoxic Ischemic Encephalopathy: A Secondary Analysis of the HEAL Randomized Clinical Trial.
   JAMA Netw Open. 2023 Jul 3;6(7):e2322131. doi: 10.1001/jamanetworkopen.2023.22131.
4. A BOOTSTRAP MODEL COMPARISON TEST FOR IDENTIFYING GENES WITH CONTEXT-SPECIFIC PATTERNS OF GENETIC REGULATION.
   bioRxiv. 2023 Oct 22:2023.03.06.531446. doi: 10.1101/2023.03.06.531446.
5. The Exposome Approach to Decipher the Role of Multiple Environmental and Lifestyle Determinants in Asthma.
   Int J Environ Res Public Health. 2021 Jan 28;18(3):1138. doi: 10.3390/ijerph18031138.