Performance of Regression Models as a Function of Experiment Noise.

Authors

Li Gang, Zrimec Jan, Ji Boyang, Geng Jun, Larsbrink Johan, Zelezniak Aleksej, Nielsen Jens, Engqvist Martin KM

Affiliations

Department of Biology and Biological Engineering, Chalmers University of Technology, Gothenburg, Sweden.

Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, Kgs. Lyngby, Denmark.

Publication information

Bioinform Biol Insights. 2021 Jun 27;15:11779322211020315. doi: 10.1177/11779322211020315. eCollection 2021.

DOI:10.1177/11779322211020315

PMID:34262264

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC8243133/

Abstract

BACKGROUND

A challenge in developing machine learning regression models is that it is difficult to know whether maximal performance has been reached on the test dataset, or whether further model improvement is possible. In biology, this problem is particularly pronounced as sample labels (response variables) are typically obtained through experiments and therefore have experiment noise associated with them. Such label noise puts a fundamental limit to the metrics of performance attainable by regression models on the test dataset.

RESULTS

We address this challenge by deriving an expected upper bound for the coefficient of determination (R²) for regression models when tested on the holdout dataset. This upper bound depends only on the noise associated with the response variable in a dataset as well as its variance. The upper bound estimate was validated via Monte Carlo simulations and then used as a tool to bootstrap performance of regression models trained on biological datasets, including protein sequence data, transcriptomic data, and genomic data.
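The abstract does not state the closed form of the bound, so the following is only a minimal sketch under stated assumptions. It assumes the bound takes the common form 1 − σ²_noise / σ²_y, where σ²_y is the variance of the observed (noisy) labels, and checks that form with a small Monte Carlo simulation in which an "oracle" model predicts the noise-free values exactly. The function names, parameter values, and the Gaussian signal and noise model are illustrative choices, not taken from the paper.

```python
import random
import statistics


def r2_score(y_obs, y_pred):
    """Coefficient of determination of predictions against observed labels."""
    mean_obs = statistics.fmean(y_obs)
    ss_res = sum((o - p) ** 2 for o, p in zip(y_obs, y_pred))
    ss_tot = sum((o - mean_obs) ** 2 for o in y_obs)
    return 1.0 - ss_res / ss_tot


def r2_upper_bound(noise_var, y_obs_var):
    """Assumed form of the bound: 1 - (noise variance) / (observed label variance)."""
    return 1.0 - noise_var / y_obs_var


def simulate_oracle_r2(n_samples=2000, signal_sd=2.0, noise_sd=1.0,
                       n_trials=50, seed=0):
    """Mean R^2 of an oracle that predicts the noise-free value of each sample."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_trials):
        # Hypothetical data: Gaussian true values plus independent Gaussian noise.
        y_true = [rng.gauss(0.0, signal_sd) for _ in range(n_samples)]
        y_obs = [t + rng.gauss(0.0, noise_sd) for t in y_true]
        # Even a perfect model cannot predict the noise term in y_obs.
        scores.append(r2_score(y_obs, y_true))
    return statistics.fmean(scores)


# Observed label variance = signal variance + noise variance (independent terms).
analytic = r2_upper_bound(noise_var=1.0 ** 2, y_obs_var=2.0 ** 2 + 1.0 ** 2)
simulated = simulate_oracle_r2()
print(f"assumed analytic bound: {analytic:.3f}")
print(f"simulated oracle R^2:   {simulated:.3f}")
```

In this toy setting the simulated oracle score should land close to the assumed analytic value, which illustrates why a model whose test R² is already near that value has little room left to improve on the noisy labels.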

CONCLUSIONS

The new method for estimating upper bounds for model performance on test data should aid researchers in developing ML regression models that reach their maximum potential. Although we study biological datasets in this work, the new upper bound estimates will hold true for regression models from any research field or application area where response variables have associated noise.

Figure 1: https://cdn.ncbi.nlm.nih.gov/pmc/blobs/40d5/8243133/f36ef36a574a/10.1177_11779322211020315-fig1.jpg
