On the Upper Bounds of the Real-Valued Predictions.

Authors

Benevenuta Silvia, Fariselli Piero

Affiliation

Department of Medical Sciences, University of Turin, Turin, Italy.

Publication information

Bioinform Biol Insights. 2019 Aug 23;13:1177932219871263. doi: 10.1177/1177932219871263. eCollection 2019.

DOI:10.1177/1177932219871263

PMID:31488948

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC6710671/

Abstract

Predictions are fundamental in science because they allow us to test and falsify theories. They are ubiquitous in bioinformatics and are also useful when no first principles are available. Predictions can be divided into classification (a label is associated with a given input) and regression (a real value is assigned). Different scores are used to assess the performance of regression predictors; the most widely adopted include the mean square error, the Pearson correlation (ρ), and the coefficient of determination (or R²). The common conception about the last 2 indices is that their theoretical upper bound is 1; however, their upper bounds depend on both the experimental uncertainty and the distribution of the target variable. A narrow distribution of the target variable may induce a low upper bound. Knowledge of the theoretical upper bounds also has 2 practical applications: (1) it shows that comparing different predictors tested on different data sets may lead to a wrong ranking and (2) performances higher than the theoretical upper bounds indicate overtraining and improper usage of the learning data sets. Here, we derive the upper bound for the coefficient of determination, showing that it is lower than that of the square of the Pearson correlation. We provide analytical equations for both indices that can be used to evaluate the upper bound of the predictions when the experimental uncertainty and the target distribution are available. Our considerations are general and applicable to all regression predictors.
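The abstract does not reproduce the analytical equations, so the following is only a minimal sketch of the idea under a simple noise model that is an assumption here, not necessarily the authors' derivation. Assume each target has a true value drawn from a distribution with standard deviation `sigma_db`, and that both the observed value and the best available "prediction" are independent measurements of that true value with Gaussian noise of standard deviation `sigma`. Under this assumed model the expected Pearson correlation is `sigma_db**2 / (sigma_db**2 + sigma**2)` and the expected coefficient of determination is `(sigma_db**2 - sigma**2) / (sigma_db**2 + sigma**2)`, which is never larger than the square of the correlation. All function and parameter names below are hypothetical.

```python
import numpy as np


def pearson(x, y):
    """Sample Pearson correlation between two 1-D arrays."""
    return float(np.corrcoef(x, y)[0, 1])


def r_squared(y_obs, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_obs - y_pred) ** 2)
    ss_tot = np.sum((y_obs - np.mean(y_obs)) ** 2)
    return float(1.0 - ss_res / ss_tot)


def analytic_bounds(sigma_db, sigma):
    """Expected (rho, R^2) under the ASSUMED model in which prediction and
    observation are two independent noisy measurements of the same truth."""
    a, s = sigma_db ** 2, sigma ** 2
    rho = a / (a + s)
    r2 = (a - s) / (a + s)
    return rho, r2


def simulate(sigma_db, sigma, n=200_000, seed=0):
    """Monte Carlo check of the assumed model."""
    rng = np.random.default_rng(seed)
    truth = rng.normal(0.0, sigma_db, n)           # true target values
    observed = truth + rng.normal(0.0, sigma, n)   # noisy experimental labels
    predicted = truth + rng.normal(0.0, sigma, n)  # second noisy measurement
    return pearson(observed, predicted), r_squared(observed, predicted)


if __name__ == "__main__":
    # Wide versus narrow target distribution at the same experimental noise.
    for sigma_db in (2.0, 0.5):
        rho_a, r2_a = analytic_bounds(sigma_db, 1.0)
        rho_s, r2_s = simulate(sigma_db, 1.0)
        print(f"sigma_db={sigma_db}: analytic rho={rho_a:.3f}, R2={r2_a:.3f}; "
              f"simulated rho={rho_s:.3f}, R2={r2_s:.3f}")
```

With the noise held fixed, shrinking `sigma_db` lowers both values, which illustrates the abstract's point that a narrow target distribution can induce a low upper bound, and the R² value stays below the squared correlation in both cases.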

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e72a/6710671/06708d8eefa4/10.1177_1177932219871263-fig1.jpg
