Statistical agnostic regression: A machine learning method to validate regression models.

Authors

Gorriz J M, Ramirez J, Segovia F, Jimenez-Mesa C, Martinez-Murcia F J, Suckling J

Affiliations

Dpt. of Psychiatry, University of Cambridge, UK; DaSCI Institute, University of Granada, Spain; ibs.Granada, Granada, Spain.

DaSCI Institute, University of Granada, Spain.

Publication

J Adv Res. 2025 May 1. doi: 10.1016/j.jare.2025.04.026.

Abstract

INTRODUCTION

Regression analysis is a central topic in statistical modeling, aimed at estimating the relationships between a dependent variable, commonly referred to as the response variable, and one or more independent variables, i.e., explanatory variables. Linear regression is by far the most popular method for performing this task in various fields of research, such as data integration and predictive modeling when combining information from multiple sources.

OBJECTIVES

Classical methods for solving linear regression problems, such as Ordinary Least Squares (OLS), Ridge, or Lasso regression, often form the foundation for more advanced machine learning (ML) techniques. These techniques have been applied successfully, though without a formal definition of statistical significance. At most, permutation tests or analyses based on empirical measures (e.g., residuals or accuracy) have been conducted, leveraging the greater sensitivity of ML estimations for detection.
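
As an illustration of the kind of permutation analysis mentioned above, the following is a minimal sketch of a permutation test for the slope of a one-feature OLS fit. It is a generic textbook procedure, not code from the paper; the function names `ols_slope`, `permutation_p_value`, and `simulate` and all parameter values are hypothetical choices made for this example.

```python
import random


def ols_slope(xs, ys):
    """Closed-form OLS slope for a single explanatory variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def permutation_p_value(xs, ys, n_perm=500, seed=0):
    """Two-sided permutation p-value for the null hypothesis of no association.

    Shuffling the responses breaks any link between x and y, so the shuffled
    slopes form an empirical null distribution for the observed slope.
    """
    rng = random.Random(seed)
    observed = abs(ols_slope(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(ols_slope(xs, shuffled)) >= observed:
            hits += 1
    # The +1 terms count the observed labelling as one permutation,
    # which keeps the p-value strictly positive.
    return (hits + 1) / (n_perm + 1)


def simulate(n, slope, noise_sd, seed):
    """Hypothetical data generator: y = slope * x + Gaussian noise."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [slope * x + rng.gauss(0.0, noise_sd) for x in xs]
    return xs, ys
```

A call such as `permutation_p_value(*simulate(100, 0.8, 0.1, seed=1))` returns a small p-value when the simulated slope is clearly non-zero. The test is empirical: it does not by itself bound the risk of the fitted model in the population.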

METHODS

In this paper, we introduce Statistical Agnostic Regression (SAR) for evaluating the statistical significance of ML-based linear regression models. This is achieved by analyzing concentration inequalities of the actual risk (expected loss) and considering the worst-case scenario. To this end, we define a threshold that ensures there is sufficient evidence, with a probability of at least 1-η, to conclude the existence of a linear relationship in the population between the explanatory (feature) and the response (label) variables.
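
The worst-case reasoning can be sketched in code. The abstract does not state which concentration inequality or threshold SAR uses, so the example below substitutes a plain Hoeffding bound on held-out data and should be read as an illustration of the general idea, not as the authors' method. Its assumptions are all ours: a loss clipped to [0, `loss_bound`], an independent test split, an intercept-only null model, and the hypothetical names `hoeffding_margin`, `worst_case_test`, and `simulate`.

```python
import math
import random


def hoeffding_margin(n, eta, loss_bound=1.0):
    """One-sided Hoeffding deviation for n i.i.d. losses in [0, loss_bound].

    With probability at least 1 - eta, the actual risk exceeds the
    empirical risk by no more than this margin.
    """
    return loss_bound * math.sqrt(math.log(1.0 / eta) / (2.0 * n))


def fit_line(xs, ys):
    """OLS fit for one feature; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def worst_case_test(train, test, eta=0.05, loss_bound=1.0):
    """Compare the worst-case risk of a linear fit with the best-case risk
    of an intercept-only (no relationship) model.

    Assumption of this sketch: both models are fixed on `train`, so their
    clipped absolute losses on `test` are i.i.d. and Hoeffding applies.
    Each bound gets eta / 2, so both hold jointly with probability >= 1 - eta.
    """
    x_train, y_train = train
    x_test, y_test = test
    slope, intercept = fit_line(x_train, y_train)
    null_prediction = sum(y_train) / len(y_train)

    n = len(x_test)
    risk_model = sum(
        min(abs(y - (slope * x + intercept)), loss_bound)
        for x, y in zip(x_test, y_test)
    ) / n
    risk_null = sum(min(abs(y - null_prediction), loss_bound) for y in y_test) / n

    margin = hoeffding_margin(n, eta / 2.0, loss_bound)
    upper_model = risk_model + margin  # worst case for the linear model
    lower_null = risk_null - margin    # best case for the null model
    return upper_model < lower_null, upper_model, lower_null


def simulate(n, slope, noise_sd, seed):
    """Hypothetical data generator: y = slope * x + Gaussian noise."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [slope * x + rng.gauss(0.0, noise_sd) for x in xs]
    return xs, ys
```

When `worst_case_test` returns `True` as its first element, the linear model beats the null model even under the least favourable risks allowed by the bounds, which mirrors the "probability of at least 1-η" statement. Because the decision rests on worst-case bounds, the sketch is conservative and needs a fairly large test set before it can detect weak relationships.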

CONCLUSIONS

Simulations demonstrate that the proposed agnostic (non-parametric) test can perform an analysis of variance comparable to the classical multivariate F-test for the slope parameter, without relying on the underlying assumptions of classical methods. A power analysis on a putative regression task revealed an inflated false positive rate in standard ML methods, whereas the SAR test exhibited excellent control. Moreover, the residuals computed using this method represent a trade-off between those obtained from ML approaches and those from classical OLS.
