Review and evaluation of penalised regression methods for risk prediction in low-dimensional data with few events.

Authors

Pavlou Menelaos, Ambler Gareth, Seaman Shaun, De Iorio Maria, Omar Rumana Z

Affiliations

Department of Statistical Science, University College London, London, WC1E 6BT, U.K.

MRC Biostatistics Unit, Cambridge, CB2 0SR, U.K.

Publication information

Stat Med. 2016 Mar 30;35(7):1159-77. doi: 10.1002/sim.6782. Epub 2015 Oct 29.

Abstract

Risk prediction models are used to predict a clinical outcome for patients using a set of predictors. We focus on predicting low-dimensional binary outcomes typically arising in epidemiology, health services and public health research where logistic regression is commonly used. When the number of events is small compared with the number of regression coefficients, model overfitting can be a serious problem. An overfitted model tends to demonstrate poor predictive accuracy when applied to new data. We review frequentist and Bayesian shrinkage methods that may alleviate overfitting by shrinking the regression coefficients towards zero (some methods can also provide more parsimonious models by omitting some predictors). We evaluated their predictive performance in comparison with maximum likelihood estimation using real and simulated data. The simulation study showed that maximum likelihood estimation tends to produce overfitted models with poor predictive performance in scenarios with few events, and penalised methods can offer improvement. Ridge regression performed well, except in scenarios with many noise predictors. Lasso performed better than ridge in scenarios with many noise predictors and worse in the presence of correlated predictors. Elastic net, a hybrid of the two, performed well in all scenarios. Adaptive lasso and smoothly clipped absolute deviation performed best in scenarios with many noise predictors; in other scenarios, their performance was inferior to that of ridge and lasso. Bayesian approaches performed well when the hyperparameters for the priors were chosen carefully. Their use may aid variable selection, and they can be easily extended to clustered-data settings and to incorporate external information.
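
The contrast the abstract draws between maximum likelihood, ridge, lasso and elastic net can be illustrated with a minimal sketch. This is not the authors' code or simulation design: the data-generating model, the sample size, the fixed penalty weight `lam` and the helper names `soft_threshold` and `fit_penalised_logistic` are all assumptions made for illustration, and the fit uses a plain proximal-gradient loop rather than any particular package. In practice the penalty weight would normally be tuned, for example by cross-validation.

```python
import numpy as np


def soft_threshold(z, t):
    """Proximal operator of the L1 norm: shrink z towards zero by t."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)


def fit_penalised_logistic(X, y, lam, alpha, n_iter=5000):
    """Proximal-gradient fit of logistic regression with an elastic net penalty.

    Minimises  mean negative log-likelihood
               + lam * (alpha * ||b||_1 + (1 - alpha) / 2 * ||b||_2^2),
    leaving the intercept unpenalised. alpha=0 gives ridge, alpha=1 gives
    lasso, and lam=0 approximates maximum likelihood.
    Returns the array [intercept, b_1, ..., b_p].
    """
    n, p = X.shape
    Xa = np.column_stack([np.ones(n), X])
    # Step size from a Lipschitz bound on the gradient of the smooth part.
    lipschitz = 0.25 * np.linalg.norm(Xa, 2) ** 2 / n + lam * (1.0 - alpha)
    step = 1.0 / lipschitz
    beta = np.zeros(p + 1)
    for _ in range(n_iter):
        prob = 1.0 / (1.0 + np.exp(-Xa @ beta))
        grad = Xa.T @ (prob - y) / n
        grad[1:] += lam * (1.0 - alpha) * beta[1:]  # ridge part of the penalty
        beta = beta - step * grad
        beta[1:] = soft_threshold(beta[1:], step * lam * alpha)  # lasso part
    return beta


# Hypothetical low-dimensional data with few events: 10 predictors, of which
# only the first 3 carry signal, and a low baseline risk.
rng = np.random.default_rng(0)
n, p = 150, 10
X = rng.standard_normal((n, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)  # standardise before penalising
true_beta = np.array([0.5, 0.5, 0.5] + [0.0] * (p - 3))
risk = 1.0 / (1.0 + np.exp(-(-2.0 + X @ true_beta)))
y = rng.binomial(1, risk).astype(float)

lam = 0.05  # assumed value, chosen only for illustration
ml = fit_penalised_logistic(X, y, lam=0.0, alpha=0.0)
ridge = fit_penalised_logistic(X, y, lam=lam, alpha=0.0)
lasso = fit_penalised_logistic(X, y, lam=lam, alpha=1.0)
enet = fit_penalised_logistic(X, y, lam=lam, alpha=0.5)

print("events:", int(y.sum()), "of", n)
for name, beta in [("ML", ml), ("ridge", ridge), ("lasso", lasso), ("elastic net", enet)]:
    print(f"{name:12s} L2 norm = {np.linalg.norm(beta[1:]):.3f}, "
          f"non-zero coefficients = {np.count_nonzero(beta[1:])}")
```

Comparing the coefficient norms shows the shrinkage towards zero that the abstract describes, and the count of non-zero coefficients shows how the lasso-type penalties can omit predictors while ridge keeps them all. Judging predictive performance would additionally require validation on new data, which this sketch does not attempt.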

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e37e/4982098/552919676d5d/SIM-35-1159-g001.jpg
