

To tune or not to tune, a case study of ridge logistic regression in small or sparse datasets.

Affiliations

Section for Clinical Biometrics, Center for Medical Statistics, Informatics and Intelligent Systems, Medical University of Vienna, Spitalgasse 23, 1090, Vienna, Austria.

Institute for Biostatistics and Medical Informatics, University of Ljubljana, Ljubljana, Slovenia.

Publication information

BMC Med Res Methodol. 2021 Sep 30;21(1):199. doi: 10.1186/s12874-021-01374-y.

Abstract

BACKGROUND

For finite samples with binary outcomes, penalized logistic regression, such as ridge logistic regression, has the potential to achieve smaller mean squared errors (MSE) of coefficients and predictions than maximum likelihood estimation. There is evidence, however, that ridge logistic regression can result in highly variable calibration slopes in small or sparse data situations.

METHODS

In this paper, we elaborate on this issue by performing a comprehensive simulation study, investigating the performance of ridge logistic regression in terms of coefficients and predictions and comparing it to Firth's correction, which has been shown to perform well in low-dimensional settings. In addition to tuned ridge regression, where the penalty strength is estimated from the data by minimizing a measure of out-of-sample prediction error or an information criterion, we also considered ridge regression with a pre-specified degree of shrinkage. We included 'oracle' models in the simulation study, in which the complexity parameter was chosen based on the true event probabilities (prediction oracle) or regression coefficients (explanation oracle), to demonstrate the capability of ridge regression if the truth were known.
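The two ridge variants compared above can be sketched in a few lines. This is not the authors' code: it is a minimal illustration using scikit-learn on a small synthetic dataset, where `LogisticRegressionCV` stands in for tuned ridge (penalty strength chosen by cross-validated log loss, i.e. out-of-sample prediction error) and a plain `LogisticRegression` with a fixed `C` stands in for a pre-specified degree of shrinkage.

```python
# Illustrative sketch (not the paper's implementation): tuned vs.
# pre-specified ridge logistic regression on a small simulated dataset.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV

# A deliberately small sample, mimicking the "small dataset" setting.
X, y = make_classification(n_samples=60, n_features=5, n_informative=3,
                           random_state=0)

# Tuned ridge: penalty strength C estimated from the data by minimizing
# cross-validated log loss (an out-of-sample prediction error measure).
tuned = LogisticRegressionCV(Cs=20, cv=5, penalty="l2",
                             scoring="neg_log_loss", max_iter=1000)
tuned.fit(X, y)

# Pre-specified shrinkage: the penalty (here C=1.0, an arbitrary choice
# for illustration) is fixed in advance rather than estimated from the
# same small sample.
fixed = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)
fixed.fit(X, y)

print("tuned C:", tuned.C_[0])
print("tuned coefficients:", tuned.coef_.round(3))
print("fixed coefficients:", fixed.coef_.round(3))
```

Rerunning the tuned fit on resampled small datasets makes the paper's point visible: the selected `C` varies widely from sample to sample, while the fixed-penalty coefficients are stable by construction.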

RESULTS

The performance of ridge regression strongly depends on the choice of complexity parameter. As shown in our simulation and illustrated by a data example, values optimized in small or sparse datasets are negatively correlated with optimal values and suffer from substantial variability, which translates into large MSE of coefficients and large variability of calibration slopes. In contrast, in our simulations, pre-specifying the degree of shrinkage prior to fitting led to accurate coefficients and predictions even in non-ideal settings such as those encountered in the context of rare outcomes or sparse predictors.

CONCLUSIONS

Applying tuned ridge regression in small or sparse datasets is problematic as it results in unstable coefficients and predictions. In contrast, determining the degree of shrinkage according to some meaningful prior assumptions about true effects has the potential to reduce bias and stabilize the estimates.


Fig. 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/670c/8482588/6998777d69a8/12874_2021_1374_Fig1_HTML.jpg
