Statistical agnostic regression: A machine learning method to validate regression models.

Authors

Gorriz J M, Ramirez J, Segovia F, Jimenez-Mesa C, Martinez-Murcia F J, Suckling J

Affiliations

Dpt. of Psychiatry, University of Cambridge, UK; DaSCI Institute, University of Granada, Spain; ibs.Granada, Granada, Spain.

DaSCI Institute, University of Granada, Spain.

Publication information

J Adv Res. 2025 May 1. doi: 10.1016/j.jare.2025.04.026.

DOI: 10.1016/j.jare.2025.04.026
PMID: 40318765

Abstract

INTRODUCTION

Regression analysis is a central topic in statistical modeling, aimed at estimating the relationships between a dependent variable, commonly referred to as the response variable, and one or more independent variables, i.e., explanatory variables. Linear regression is by far the most popular method for performing this task in various fields of research, such as data integration and predictive modeling when combining information from multiple sources.

OBJECTIVES

Classical methods for solving linear regression problems, such as Ordinary Least Squares (OLS), Ridge, or Lasso regression, often form the foundation for more advanced machine learning (ML) techniques. These techniques have been applied successfully, but without a formal definition of statistical significance. At most, permutation tests or analyses based on empirical measures (e.g., residuals or accuracy) have been conducted, leveraging the greater sensitivity of ML estimates for detection.

METHODS

In this paper, we introduce Statistical Agnostic Regression (SAR) for evaluating the statistical significance of ML-based linear regression models. This is achieved by analyzing concentration inequalities of the actual risk (expected loss) and considering the worst-case scenario. To this end, we define a threshold that ensures there is sufficient evidence, with a probability of at least 1-η, to conclude the existence of a linear relationship in the population between the explanatory (feature) and the response (label) variables.
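
The decision rule described above can be illustrated with a minimal sketch. This is not the authors' SAR procedure: the abstract does not say which concentration inequality or threshold is used, so the example assumes Hoeffding's inequality, a squared-error loss clipped to a hypothetical bound `loss_bound`, and an independent held-out set. The function names `hoeffding_margin` and `worst_case_linear_test` are illustrative only.

```python
import numpy as np


def hoeffding_margin(n, eta, loss_bound):
    """One-sided Hoeffding deviation for a loss in [0, loss_bound].

    eta is split across two bounds (union bound), hence log(2 / eta).
    """
    return loss_bound * np.sqrt(np.log(2.0 / eta) / (2.0 * n))


def worst_case_linear_test(x_train, y_train, x_test, y_test,
                           eta=0.05, loss_bound=4.0):
    """Compare worst-case risks of a linear fit and an intercept-only fit.

    Assumption (not from the paper): both models are fitted on the
    training split, and risks are bounded on the independent test split.
    """
    # Fit ordinary least squares with an intercept on the training split.
    design_train = np.column_stack([np.ones(len(y_train)), x_train])
    coef, *_ = np.linalg.lstsq(design_train, y_train, rcond=None)

    # Null model: predict the training mean, ignoring the features.
    null_prediction = y_train.mean()

    # Clipped squared-error losses on the held-out split.
    design_test = np.column_stack([np.ones(len(y_test)), x_test])
    loss_linear = np.minimum((y_test - design_test @ coef) ** 2, loss_bound)
    loss_null = np.minimum((y_test - null_prediction) ** 2, loss_bound)

    margin = hoeffding_margin(len(y_test), eta, loss_bound)
    upper_linear = loss_linear.mean() + margin  # worst case for the linear fit
    lower_null = loss_null.mean() - margin      # best case for the null model

    # With probability at least 1 - eta, both bounds hold simultaneously.
    significant = bool(upper_linear < lower_null)
    return significant, float(upper_linear), float(lower_null)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=4000)
    y = 2.0 * x + rng.normal(scale=0.5, size=4000)
    print(worst_case_linear_test(x[:2000], y[:2000], x[2000:], y[2000:]))
```

Clipping the loss is what makes Hoeffding's inequality applicable here, since it requires a bounded random variable. The test declares a linear relationship only when the pessimistic risk estimate for the linear fit still falls below the optimistic estimate for the intercept-only fit, which is one simple way to read "considering the worst-case scenario".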

CONCLUSIONS

Simulations demonstrate that the proposed agnostic (non-parametric) test can perform an analysis of variance comparable to the classical multivariate F-test for the slope parameter, without relying on the underlying assumptions of classical methods. A power analysis on a putative regression task revealed an inflated false positive rate in standard ML methods, whereas the SAR test exhibited excellent control. Moreover, the residuals computed using this method represent a trade-off between those obtained from ML approaches and those from classical OLS.
