
What you see may not be what you get: a brief, nontechnical introduction to overfitting in regression-type models.

Author information

Babyak Michael A

Affiliations

Duke University Medical Center, Durham, NC 27710, USA.

Publication information

Psychosom Med. 2004 May-Jun;66(3):411-21. doi: 10.1097/01.psy.0000127692.23278.a9.

Abstract

Statistical models, such as linear or logistic regression or survival analysis, are frequently used as a means to answer scientific questions in psychosomatic research. Many who use these techniques, however, apparently fail to appreciate fully the problem of overfitting, ie, capitalizing on the idiosyncrasies of the sample at hand. Overfitted models will fail to replicate in future samples, thus creating considerable uncertainty about the scientific merit of the finding. The present article is a nontechnical discussion of the concept of overfitting and is intended to be accessible to readers with varying levels of statistical expertise. The notion of overfitting is presented in terms of asking too much from the available data. Given a certain number of observations in a data set, there is an upper limit to the complexity of the model that can be derived with any acceptable degree of uncertainty. Complexity arises as a function of the number of degrees of freedom expended (the number of predictors including complex terms such as interactions and nonlinear terms) against the same data set during any stage of the data analysis. Theoretical and empirical evidence--with a special focus on the results of computer simulation studies--is presented to demonstrate the practical consequences of overfitting with respect to scientific inference. Three common practices--automated variable selection, pretesting of candidate predictors, and dichotomization of continuous variables--are shown to pose a considerable risk for spurious findings in models. The dilemma between overfitting and exploring candidate confounders is also discussed. Alternative means of guarding against overfitting are discussed, including variable aggregation and the fixing of coefficients a priori. Techniques that account and correct for complexity, including shrinkage and penalization, also are introduced.
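The abstract's central warning, that asking too much of a small sample produces models that capitalize on chance, is easy to see in simulation. The sketch below (a minimal illustration, not the article's own simulation design) fits an ordinary least-squares regression with many candidate predictors to an outcome that is pure noise: the in-sample fit looks respectable, but the coefficients fail to generalize to a fresh sample drawn from the same process.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 20  # small sample, many candidate predictors

# Predictors and outcome are independent noise: there is nothing to find.
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# Fit OLS with all p predictors plus an intercept.
Xd = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
yhat = Xd @ beta
r2_in = 1 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)

# Apply the same coefficients to a fresh sample from the same process.
X_new = rng.normal(size=(n, p))
y_new = rng.normal(size=n)
yhat_new = np.column_stack([np.ones(n), X_new]) @ beta
r2_out = 1 - np.sum((y_new - yhat_new) ** 2) / np.sum((y_new - y_new.mean()) ** 2)

print(f"in-sample R^2:     {r2_in:.2f}")   # inflated by overfitting
print(f"out-of-sample R^2: {r2_out:.2f}")  # near zero or negative
```

With 20 noise predictors and only 50 observations, the expected in-sample R² is roughly p/(n-1) ≈ 0.4 even though the true R² is zero, which is exactly the "what you see may not be what you get" problem; the out-of-sample fit collapses. Adding automated variable selection on top of this would make the surviving predictors look even more convincingly, and spuriously, significant.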

