Stable variable ranking and selection in regularized logistic regression for severely imbalanced big binary data.

Affiliation

University of Guelph, Guelph, Ontario, Canada.

Publication information

PLoS One. 2023 Jan 17;18(1):e0280258. doi: 10.1371/journal.pone.0280258. eCollection 2023.

DOI:10.1371/journal.pone.0280258
PMID:36649281
Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9844919/
Abstract

We develop a novel covariate ranking and selection algorithm for regularized ordinary logistic regression (OLR) models in the presence of severe class-imbalance in high dimensional datasets with correlated signal and noise covariates. Class-imbalance is resolved using response-based subsampling which we also employ to achieve stability in variable selection by creating an ensemble of regularized OLR models fitted to subsampled (and balanced) datasets. The regularization methods considered in our study include Lasso, adaptive Lasso (adaLasso) and ridge regression. Our methodology is versatile in the sense that it works effectively for regularization techniques involving both hard- (e.g. Lasso) and soft-shrinkage (e.g. ridge) of the regression coefficients. We assess selection performance by conducting a detailed simulation experiment involving varying moderate-to-severe class-imbalance ratios and highly correlated continuous and discrete signal and noise covariates. Simulation results show that our algorithm is robust against severe class-imbalance under the presence of highly correlated covariates, and consistently achieves stable and accurate variable selection with very low false discovery rate. We illustrate our methodology using a case study involving a severely imbalanced high-dimensional wildland fire occurrence dataset comprising 13 million instances. The case study and simulation results demonstrate that our framework provides a robust approach to variable selection in severely imbalanced big binary data.
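The subsample-and-ensemble idea described in the abstract can be illustrated with a small sketch. This is not the authors' algorithm: the abstract does not state how covariates are ranked, so the sketch assumes a simple selection-frequency ranking (how often a covariate receives a nonzero Lasso coefficient across balanced subsamples). The function names, the penalty value `lam`, the toy data with independent noise covariates, and the proximal-gradient solver are all illustrative assumptions, chosen so the example runs with the Python standard library alone.

```python
import math
import random


def sigmoid(z):
    # Clamp the argument so math.exp cannot overflow.
    return 1.0 / (1.0 + math.exp(-max(-30.0, min(30.0, z))))


def soft_threshold(value, threshold):
    # Proximal operator of the L1 penalty: shrink toward zero, exactly zero inside the band.
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def balanced_subsample(y, rng):
    # Response-based subsampling: keep every minority-class row and
    # draw an equal number of majority-class rows at random.
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    return minority + rng.sample(majority, len(minority))


def fit_lasso_logistic(X, y, lam, step=0.5, n_iter=150):
    # L1-penalised logistic regression via proximal gradient descent
    # (a stand-in solver, assumed here for self-containment).
    n, p = len(X), len(X[0])
    beta, intercept = [0.0] * p, 0.0
    for _ in range(n_iter):
        grad, grad0 = [0.0] * p, 0.0
        for row, label in zip(X, y):
            resid = sigmoid(intercept + sum(b * x for b, x in zip(beta, row))) - label
            grad0 += resid
            for j in range(p):
                grad[j] += resid * row[j]
        intercept -= step * grad0 / n  # the intercept is not penalised
        beta = [soft_threshold(b - step * g / n, step * lam) for b, g in zip(beta, grad)]
    return beta


def rank_covariates(X, y, n_models=10, lam=0.1, seed=0):
    # Assumed ranking rule: fraction of subsampled models in which
    # each covariate keeps a nonzero coefficient.
    rng = random.Random(seed)
    p = len(X[0])
    counts = [0] * p
    for _ in range(n_models):
        idx = balanced_subsample(y, rng)
        beta = fit_lasso_logistic([X[i] for i in idx], [y[i] for i in idx], lam)
        for j, b in enumerate(beta):
            if b != 0.0:
                counts[j] += 1
    freq = [c / n_models for c in counts]
    ranking = sorted(range(p), key=lambda j: -freq[j])
    return ranking, freq


# Toy imbalanced data (hypothetical): covariates 0 and 1 carry signal, 2..5 are noise.
data_rng = random.Random(1)
X, y = [], []
for _ in range(1500):
    row = [data_rng.gauss(0.0, 1.0) for _ in range(6)]
    X.append(row)
    y.append(1 if data_rng.random() < sigmoid(-5.0 + 2.0 * row[0] - 2.0 * row[1]) else 0)

ranking, freq = rank_covariates(X, y)
print("selection frequency per covariate:", freq)
print("ranking (most stable first):", ranking)
```

Averaging selection over many balanced subsamples is what gives the ranking its stability in this sketch: a covariate that enters only a few of the fitted models ends up with a low frequency, while one that enters every model stays at the top regardless of which majority-class rows were drawn.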


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9eb8/9844919/2d2eeda68981/pone.0280258.g001.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9eb8/9844919/288537ab385c/pone.0280258.g002.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9eb8/9844919/833b5de49ed0/pone.0280258.g003.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9eb8/9844919/e32a42e9ba3b/pone.0280258.g004.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9eb8/9844919/003b905c7844/pone.0280258.g005.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9eb8/9844919/2a0876b34430/pone.0280258.g006.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9eb8/9844919/5a56378cfef9/pone.0280258.g007.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9eb8/9844919/33d854e33ce4/pone.0280258.g008.jpg
https://cdn.ncbi.nlm.nih.gov/pmc/blobs/9eb8/9844919/6e1ae82fa83c/pone.0280258.g009.jpg
