On the Upper Bounds of the Real-Valued Predictions.

Authors

Benevenuta Silvia, Fariselli Piero

Affiliation

Department of Medical Sciences, University of Turin, Turin, Italy.

Publication information

Bioinform Biol Insights. 2019 Aug 23;13:1177932219871263. doi: 10.1177/1177932219871263. eCollection 2019.

Abstract

Predictions are fundamental in science because they allow theories to be tested and falsified. Predictions are ubiquitous in bioinformatics and also help when no first principles are available. Predictions can be divided into classification (when a label is associated with a given input) and regression (when a real value is assigned). Different scores are used to assess the performance of regression predictors; the most widely adopted include the mean square error, the Pearson correlation (ρ), and the coefficient of determination (R²). The common conception about the last 2 indices is that their theoretical upper bound is 1; however, their upper bounds depend on both the experimental uncertainty and the distribution of the target variable. A narrow distribution of the target variable may induce a low upper bound. Knowledge of the theoretical upper bounds also has 2 practical implications: (1) comparing different predictors tested on different data sets may lead to a wrong ranking and (2) performances higher than the theoretical upper bounds indicate overtraining and improper usage of the learning data sets. Here, we derive the upper bound for the coefficient of determination, showing that it is lower than that of the square of the Pearson correlation. We provide analytical equations for both indices that can be used to evaluate the upper bound of the predictions when the experimental uncertainty and the target distribution are available. Our considerations are general and applicable to all regression predictors.
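
The dependence of the upper bounds on experimental uncertainty and target spread can be illustrated with a small simulation. The sketch below is not taken from the paper; it assumes one simple noise model: each observed value is a hidden true value plus independent Gaussian noise, and the best achievable "predictor" is a second, equally noisy replicate of the same experiment. Under that assumption the expected Pearson correlation is σ_T² / (σ_T² + σ_E²) and the expected coefficient of determination is (σ_T² − σ_E²) / (σ_T² + σ_E²), where σ_T is the standard deviation of the true targets and σ_E that of the noise. The function names (`pearson`, `r_squared`, `simulate`, `expected`) and all parameter values are hypothetical.

```python
import random
import statistics


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def r_squared(observed, predicted):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    mean_obs = statistics.fmean(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def simulate(sigma_target, sigma_noise, n=100_000, seed=0):
    """Score a noisy replicate against noisy observations (assumed model)."""
    rng = random.Random(seed)
    truth = [rng.gauss(0.0, sigma_target) for _ in range(n)]
    observed = [t + rng.gauss(0.0, sigma_noise) for t in truth]
    replicate = [t + rng.gauss(0.0, sigma_noise) for t in truth]
    return pearson(observed, replicate), r_squared(observed, replicate)


def expected(sigma_target, sigma_noise):
    """Closed-form expectations under the assumed replicate model."""
    vt, vn = sigma_target ** 2, sigma_noise ** 2
    return vt / (vt + vn), (vt - vn) / (vt + vn)


if __name__ == "__main__":
    # Fixed noise, increasingly wide target distribution.
    for sigma_target in (0.5, 1.0, 2.0, 4.0):
        rho, r2 = simulate(sigma_target, sigma_noise=1.0, n=20_000)
        exp_rho, exp_r2 = expected(sigma_target, 1.0)
        print(f"sigma_T={sigma_target}: rho={rho:.3f} (expected {exp_rho:.3f}), "
              f"R2={r2:.3f} (expected {exp_r2:.3f})")
```

With the noise level held fixed, narrowing the target distribution lowers both scores, and in this model the coefficient of determination stays below the squared Pearson correlation, which matches the qualitative statements in the abstract. A different noise model would give different closed forms.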

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/e72a/6710671/06708d8eefa4/10.1177_1177932219871263-fig1.jpg
