Canbek Gürol
Pointr, Ankara, Turkey.
Int J Mach Learn Cybern. 2023 Apr 19:1-31. doi: 10.1007/s13042-023-01826-5.
Probabilistic error/loss performance evaluation instruments originally used for regression and time-series forecasting are also applied to some binary-class or multi-class classifiers, such as artificial neural networks. This study systematically assesses probabilistic instruments for binary classification performance evaluation using a proposed two-stage benchmarking method called BenchMetrics Prob. The method employs five criteria and fourteen simulation cases based on hypothetical classifiers on synthetic datasets. The goal is to reveal specific weaknesses of performance instruments and to identify the most robust instrument for binary classification problems. BenchMetrics Prob was tested on 31 instruments/instrument variants, and the results identified four instruments as the most robust in a binary classification context: Sum Squared Error (SSE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE, as the variant of MSE), and Mean Absolute Error (MAE). As SSE has lower interpretability due to its [0, ∞) range, MSE, with its [0, 1] range, is the most convenient and robust probabilistic metric for generic purposes. In classification problems where large errors are more important than small errors, RMSE may be a better choice. Additionally, the results showed that instrument variants with summarization functions other than the mean (i.e., median and geometric mean), and the error instruments with relative/percentage/symmetric-percentage subtypes for regression, such as Mean Absolute Percentage Error (MAPE), Symmetric MAPE (SMAPE), and Mean Relative Absolute Error (MRAE), were less robust and should be avoided. These findings suggest that researchers should employ robust probabilistic metrics when measuring and reporting performance in binary classification problems.
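The instruments named above follow the standard regression-style error formulas, applied here to predicted positive-class probabilities against binary labels. The following is a minimal sketch of those textbook definitions (function and variable names are illustrative, not from the paper); it also shows concretely why a percentage-based instrument such as MAPE is ill-suited to binary targets, since it divides by a true value that is zero for every negative-class sample:

```python
import math

def probabilistic_errors(y_true, p_pred):
    """Standard probabilistic error instruments for a binary classifier.

    y_true: binary labels (0 or 1); p_pred: predicted positive-class
    probabilities in [0, 1]. Returns SSE, MSE, RMSE, and MAE using their
    usual regression definitions.
    """
    errors = [y - p for y, p in zip(y_true, p_pred)]
    n = len(errors)
    sse = sum(e * e for e in errors)        # Sum Squared Error, range [0, n]
    mse = sse / n                           # Mean Squared Error, range [0, 1]
    rmse = math.sqrt(mse)                   # Root Mean Squared Error, range [0, 1]
    mae = sum(abs(e) for e in errors) / n   # Mean Absolute Error, range [0, 1]
    # MAPE = mean(|(y - p) / y|) is undefined whenever y = 0, i.e. for every
    # negative-class sample -- one structural reason percentage-type
    # instruments are problematic for binary classification.
    return {"SSE": sse, "MSE": mse, "RMSE": rmse, "MAE": mae}

# Three samples: errors are 0.1, -0.2, and 0.4.
metrics = probabilistic_errors([1, 0, 1], [0.9, 0.2, 0.6])
```

Note that squaring (SSE/MSE/RMSE) weights large per-sample errors more heavily than MAE does, which is the usual rationale for preferring a squared-error instrument when large errors matter more.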