Stable variable ranking and selection in regularized logistic regression for severely imbalanced big binary data.

Affiliation

University of Guelph, Guelph, Ontario, Canada.

Publication information

PLoS One. 2023 Jan 17;18(1):e0280258. doi: 10.1371/journal.pone.0280258. eCollection 2023.

DOI:10.1371/journal.pone.0280258

PMID:36649281

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC9844919/

Abstract

We develop a novel covariate ranking and selection algorithm for regularized ordinary logistic regression (OLR) models in the presence of severe class imbalance in high-dimensional datasets with correlated signal and noise covariates. Class imbalance is resolved using response-based subsampling, which we also employ to achieve stability in variable selection by creating an ensemble of regularized OLR models fitted to subsampled (and balanced) datasets. The regularization methods considered in our study include Lasso, adaptive Lasso (adaLasso) and ridge regression. Our methodology is versatile in the sense that it works effectively for regularization techniques involving both hard shrinkage (e.g. Lasso) and soft shrinkage (e.g. ridge) of the regression coefficients. We assess selection performance by conducting a detailed simulation experiment involving varying moderate-to-severe class-imbalance ratios and highly correlated continuous and discrete signal and noise covariates. Simulation results show that our algorithm is robust against severe class imbalance in the presence of highly correlated covariates, and consistently achieves stable and accurate variable selection with a very low false discovery rate. We illustrate our methodology using a case study involving a severely imbalanced high-dimensional wildland fire occurrence dataset comprising 13 million instances. The case study and simulation results demonstrate that our framework provides a robust approach to variable selection in severely imbalanced big binary data.
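
The subsample-and-ensemble idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not state the ranking criterion, so ranking by selection frequency (with mean absolute coefficient as a tie-breaker) is an assumption here, as are all function names, the simulated data, the penalty value `lam`, and the use of plain proximal gradient descent (ISTA) as the Lasso solver. The code uses only the Python standard library so that it runs as-is; real data of the size mentioned above would call for an optimized solver.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def soft_threshold(v, t):
    # Proximal operator of the L1 penalty: shrink toward zero by t.
    return math.copysign(max(abs(v) - t, 0.0), v)

def fit_lasso_logistic(X, y, lam, step=0.5, n_iter=300):
    """L1-penalized logistic regression by proximal gradient descent (ISTA).
    The intercept is left unpenalized. Returns (intercept, coefficients)."""
    n, p = len(X), len(X[0])
    b0, beta = 0.0, [0.0] * p
    for _ in range(n_iter):
        g0, g = 0.0, [0.0] * p
        for xi, yi in zip(X, y):
            r = sigmoid(b0 + sum(b * v for b, v in zip(beta, xi))) - yi
            g0 += r
            for j in range(p):
                g[j] += r * xi[j]
        b0 -= step * g0 / n
        beta = [soft_threshold(beta[j] - step * g[j] / n, step * lam)
                for j in range(p)]
    return b0, beta

def balanced_subsample(X, y, rng):
    """Response-based subsampling: keep every rare case (y == 1) and draw
    an equal number of controls (y == 0) without replacement."""
    cases = [i for i, yi in enumerate(y) if yi == 1]
    controls = [i for i, yi in enumerate(y) if yi == 0]
    idx = cases + rng.sample(controls, len(cases))
    return [X[i] for i in idx], [y[i] for i in idx]

def rank_covariates(X, y, lam=0.05, n_models=10, seed=0):
    """Fit one Lasso model per balanced subsample, then rank covariates by
    selection frequency (assumed criterion), ties broken by mean |coef|."""
    rng = random.Random(seed)
    p = len(X[0])
    counts, size = [0] * p, [0.0] * p
    for _ in range(n_models):
        Xs, ys = balanced_subsample(X, y, rng)
        _, beta = fit_lasso_logistic(Xs, ys, lam)
        for j in range(p):
            counts[j] += beta[j] != 0.0
            size[j] += abs(beta[j]) / n_models
    freq = [c / n_models for c in counts]
    ranking = sorted(range(p), key=lambda j: (-freq[j], -size[j]))
    return ranking, freq, size

def simulate(n=3000, seed=1):
    """Hypothetical imbalanced data: x0 and x1 are signal covariates,
    x2 is noise correlated with x0, x3 and x4 are independent noise."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        x = [rng.gauss(0, 1) for _ in range(5)]
        x[2] = 0.5 * x[0] + math.sqrt(0.75) * x[2]
        eta = -6.0 + 2.0 * x[0] - 2.0 * x[1]
        X.append(x)
        y.append(1 if rng.random() < sigmoid(eta) else 0)
    return X, y

X, y = simulate()
ranking, freq, size = rank_covariates(X, y)
print("case fraction:", sum(y) / len(y))
print("ranking (most to least stable):", ranking)
print("selection frequency:", freq)
```

Because every subsample reuses all of the rare cases and only the controls change, the fitted models are correlated; the selection frequencies are therefore best read as a relative stability score rather than as independent votes.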
