Classification-algorithm evaluation: five performance measures based on confusion matrices.

Author information

Forbes A D

Affiliation

Medical Department, Hewlett-Packard Laboratories, Palo Alto, CA 94303-0867, USA.

Publication information

J Clin Monit. 1995 May;11(3):189-206. doi: 10.1007/BF01617722.

PMID:7623060

Abstract

OBJECTIVE

The objective of this paper is to introduce, explain, and extend methods for comparing the performance of classification algorithms using error tallies obtained on properly sized, populated, and labeled data sets.

METHODS

Two distinct contexts of classification are defined, involving "objects-by-inspection" and "objects-by-segmentation." In the former context, the total number of objects to be classified is unambiguously and self-evidently defined. In the latter, there is troublesome ambiguity. All five of the measures of performance here considered are based on confusion matrices, tables of counts revealing the extent of an algorithm's "confusion" regarding the true classifications. A proper measure of classification-algorithm performance must meet four requirements. A proper measure should obey six additional constraints.
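As a minimal sketch of what such a table of counts looks like, the snippet below tallies a confusion matrix for the "objects-by-inspection" context, where every object has exactly one true label and one assigned label. The function name `confusion_matrix`, the class names, and the six labels are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
from collections import Counter


def confusion_matrix(true_labels, predicted_labels, classes):
    """Return counts[i][j]: number of objects whose true class is
    classes[i] and which the algorithm assigned to classes[j]."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("each object needs one true and one predicted label")
    pairs = Counter(zip(true_labels, predicted_labels))
    return [[pairs[(t, p)] for p in classes] for t in classes]


# Hypothetical labels for six objects (illustration only).
truth = ["beat", "beat", "beat", "noise", "noise", "beat"]
guess = ["beat", "noise", "beat", "noise", "beat", "beat"]

matrix = confusion_matrix(truth, guess, ["beat", "noise"])
for name, row in zip(["beat", "noise"], matrix):
    print(name, row)
```

Rows index the true class and columns the assigned class, so the off-diagonal entries are the error tallies. In the "objects-by-segmentation" context the total count is ambiguous, so a tally like this one would need an additional convention that the abstract does not specify.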

RESULTS

Four traditional measures of performance are critiqued in terms of the requirements and constraints. Each measure meets the requirements, but fails to obey at least one of the constraints. A nontraditional measure of algorithm performance, the normalized mutual information (NMI), is therefore introduced. Based on the NMI, methods for comparing algorithm performance using confusion matrices are devised.
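To make the idea of an information-based measure concrete, here is a hedged sketch that computes a normalized mutual information from a confusion matrix. The abstract does not give the formula, so the normalization used here (mutual information divided by the entropy of the true classes) is an assumption, and the "modified NMI" mentioned in the conclusions is not reproduced. The function name and the example matrices are hypothetical.

```python
import math


def normalized_mutual_information(matrix):
    """Mutual information between true class (rows) and assigned class
    (columns), divided by the entropy of the true classes.

    Assumption: this normalization is one common choice; the abstract
    does not state which normalization the paper uses.
    """
    total = sum(sum(row) for row in matrix)
    if total == 0:
        raise ValueError("confusion matrix is empty")
    row_sums = [sum(row) for row in matrix]
    col_sums = [sum(col) for col in zip(*matrix)]

    mutual_info = 0.0
    for i, row in enumerate(matrix):
        for j, count in enumerate(row):
            if count > 0:  # empty cells contribute nothing
                joint = count / total
                mutual_info += joint * math.log2(
                    count * total / (row_sums[i] * col_sums[j])
                )

    true_entropy = -sum(
        (r / total) * math.log2(r / total) for r in row_sums if r > 0
    )
    if true_entropy == 0:
        raise ValueError("undefined when only one true class is present")
    return mutual_info / true_entropy


perfect = [[5, 0], [0, 5]]        # every object classified correctly
uninformative = [[2, 2], [2, 2]]  # assignments independent of the truth
print(normalized_mutual_information(perfect))
print(normalized_mutual_information(uninformative))
```

Under this assumed definition the value is 1 when the assigned classes determine the true classes exactly and 0 when the two are statistically independent, which is what makes it comparable across confusion matrices of different sizes.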

CONCLUSIONS

The five performance measures lead to similar inferences when comparing a trio of QRS-detection algorithms using a large data set. The modified NMI is preferred, however, because it obeys each of the constraints and is the most conservative measure of performance.

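Abstract (translated from the Chinese)

OBJECTIVE

The purpose of this paper is to introduce, explain, and extend methods for comparing the performance of classification algorithms using error tallies obtained on properly sized, populated, and labeled data sets.

METHODS

Two distinct contexts of classification are defined, involving "objects-by-inspection" and "objects-by-segmentation." In the former context, the total number of objects to be classified is unambiguously and self-evidently defined. In the latter, there is troublesome ambiguity. All five of the performance measures considered here are based on confusion matrices, tables of counts revealing the extent of an algorithm's "confusion" regarding the true classifications. A proper measure of classification-algorithm performance must meet four requirements. A proper measure should also obey six additional constraints.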