Benevenuta Silvia, Fariselli Piero
Department of Medical Sciences, University of Turin, Turin, Italy.
Bioinform Biol Insights. 2019 Aug 23;13:1177932219871263. doi: 10.1177/1177932219871263. eCollection 2019.
Predictions are fundamental in science as they allow us to test and falsify theories. Predictions are ubiquitous in bioinformatics and also help when no first principles are available. Predictions can be divided into classification (when a label is associated with a given input) and regression (when a real value is assigned). Different scores are used to assess the performance of regression predictors; the most widely adopted include the mean square error, the Pearson correlation (ρ), and the coefficient of determination (R²). The common conception related to the last 2 indices is that the theoretical upper bound is 1; however, their upper bounds depend both on the experimental uncertainty and on the distribution of the target variable. A narrow distribution of the target variable may induce a low upper bound. Knowledge of the theoretical upper bounds also has 2 practical implications: (1) comparing predictors tested on different data sets may lead to a wrong ranking, because each data set has its own upper bound, and (2) performances higher than the theoretical upper bound indicate overtraining and improper usage of the learning data sets. Here, we derive the upper bound for the coefficient of determination, showing that it is lower than that of the square of the Pearson correlation. We provide analytical equations for both indices that can be used to evaluate the upper bound of the predictions when the experimental uncertainty and the target distribution are available. Our considerations are general and applicable to all regression predictors.
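The dependence of the attainable scores on experimental uncertainty and on the spread of the target variable can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the derivation of the paper: the target values are taken to be Gaussian with standard deviation `sd_target`, the measurement noise to be Gaussian with standard deviation `sd_noise`, and the "best possible predictor" is modelled as a second, independent noisy measurement of the same quantity. All function and parameter names are hypothetical. Under this assumed model the expected Pearson correlation is sd_target² / (sd_target² + sd_noise²) and the expected coefficient of determination is (sd_target² − sd_noise²) / (sd_target² + sd_noise²), so both stay below 1 and the latter stays below the square of the former.

```python
import math
import random


def pearson(x, y):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def simulate(sd_target, sd_noise, n=20000, seed=0):
    """Score an idealised predictor against noisy measurements.

    Assumption (not from the paper): the idealised predictor is as good as
    a second, independent experiment, i.e. true value plus fresh noise.
    """
    rng = random.Random(seed)
    truth = [rng.gauss(0.0, sd_target) for _ in range(n)]
    observed = [t + rng.gauss(0.0, sd_noise) for t in truth]
    predicted = [t + rng.gauss(0.0, sd_noise) for t in truth]
    return pearson(observed, predicted), r_squared(observed, predicted)


# Same experimental noise, wide versus narrow target distribution.
wide_rho, wide_r2 = simulate(sd_target=1.0, sd_noise=0.5)
narrow_rho, narrow_r2 = simulate(sd_target=0.5, sd_noise=0.5)

print(f"wide target:   rho={wide_rho:.3f}  rho^2={wide_rho**2:.3f}  R2={wide_r2:.3f}")
print(f"narrow target: rho={narrow_rho:.3f}  rho^2={narrow_rho**2:.3f}  R2={narrow_r2:.3f}")
```

With the noise held fixed, narrowing the target distribution lowers both scores, which is why predictors evaluated on data sets with different spreads are not directly comparable.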