

To tune or not to tune, a case study of ridge logistic regression in small or sparse datasets.

Affiliations

Section for Clinical Biometrics, Center for Medical Statistics, Informatics and Intelligent Systems, Medical University of Vienna, Spitalgasse 23, 1090, Vienna, Austria.

Institute for Biostatistics and Medical Informatics, University of Ljubljana, Ljubljana, Slovenia.

Publication information

BMC Med Res Methodol. 2021 Sep 30;21(1):199. doi: 10.1186/s12874-021-01374-y.

Abstract

BACKGROUND

For finite samples with binary outcomes, penalized logistic regression, such as ridge logistic regression, has the potential to achieve smaller mean squared errors (MSE) of coefficients and predictions than maximum likelihood estimation. There is evidence, however, that ridge logistic regression can result in highly variable calibration slopes in small or sparse data situations.

METHODS

In this paper, we elaborate on this issue by performing a comprehensive simulation study, investigating the performance of ridge logistic regression in terms of coefficients and predictions and comparing it to Firth's correction, which has been shown to perform well in low-dimensional settings. In addition to tuned ridge regression, where the penalty strength is estimated from the data by minimizing a measure of out-of-sample prediction error or an information criterion, we also considered ridge regression with a pre-specified degree of shrinkage. We included 'oracle' models in the simulation study, in which the complexity parameter was chosen based on the true event probabilities (prediction oracle) or regression coefficients (explanation oracle), to demonstrate the capability of ridge regression if the truth were known.
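The two ridge variants compared above can be sketched in a few lines. This is not the authors' code: it is a minimal illustration using scikit-learn on a small synthetic dataset, where `LogisticRegressionCV` stands in for tuned ridge (penalty strength chosen by cross-validated log loss, i.e. out-of-sample prediction error) and a plain `LogisticRegression` with a fixed `C` stands in for a pre-specified degree of shrinkage.

```python
# Illustrative sketch (not the paper's implementation): tuned vs.
# pre-specified ridge logistic regression on a small simulated dataset.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV

# A deliberately small sample, mimicking the "small dataset" setting.
X, y = make_classification(n_samples=60, n_features=5, n_informative=3,
                           random_state=0)

# Tuned ridge: penalty strength C estimated from the data by minimizing
# cross-validated log loss (an out-of-sample prediction error measure).
tuned = LogisticRegressionCV(Cs=20, cv=5, penalty="l2",
                             scoring="neg_log_loss", max_iter=1000)
tuned.fit(X, y)

# Pre-specified shrinkage: the penalty (here C=1.0, an arbitrary choice
# for illustration) is fixed in advance rather than estimated from the
# same small sample.
fixed = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)
fixed.fit(X, y)

print("tuned C:", tuned.C_[0])
print("tuned coefficients:", tuned.coef_.round(3))
print("fixed coefficients:", fixed.coef_.round(3))
```

Rerunning the tuned fit on resampled small datasets makes the paper's point visible: the selected `C` varies widely from sample to sample, while the fixed-penalty coefficients are stable by construction.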

RESULTS

The performance of ridge regression strongly depends on the choice of complexity parameter. As shown in our simulation and illustrated by a data example, values optimized in small or sparse datasets are negatively correlated with optimal values and suffer from substantial variability, which translates into large MSE of coefficients and large variability of calibration slopes. In contrast, in our simulations, pre-specifying the degree of shrinkage prior to fitting led to accurate coefficients and predictions even in non-ideal settings such as those encountered in the context of rare outcomes or sparse predictors.

CONCLUSIONS

Applying tuned ridge regression in small or sparse datasets is problematic as it results in unstable coefficients and predictions. In contrast, determining the degree of shrinkage according to some meaningful prior assumptions about true effects has the potential to reduce bias and stabilize the estimates.


Fig. 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/670c/8482588/6998777d69a8/12874_2021_1374_Fig1_HTML.jpg
