Departamento de Lenguajes y Sistemas Informáticos, Universidad Nacional de Educación a Distancia, Madrid, Spain.
Signal Theory and Communications Department, University Carlos III Madrid, Madrid, Spain.
PLoS One. 2014 Jan 10;9(1):e84217. doi: 10.1371/journal.pone.0084217. eCollection 2014.
The most widely used measure of performance, accuracy, suffers from a paradox: predictive models with a given level of accuracy may have greater predictive power than models with higher accuracy. Despite optimizing the classification error rate, high-accuracy models may fail to capture crucial information transfer in the classification task. We present evidence of this behavior by means of a combinatorial analysis in which every possible contingency matrix of 2-, 3- and 4-class classifiers is depicted on the entropy triangle, a more reliable information-theoretic tool for classification assessment. Motivated by this, we develop from first principles a measure of classification performance that takes into consideration the information learned by classifiers. We are then able to obtain the entropy-modulated accuracy (EMA), a pessimistic estimate of the expected accuracy with the influence of the input distribution factored out, and the normalized information transfer (NIT) factor, a measure of how efficiently information is transmitted from the input to the output set of classes. The EMA is a more natural measure of classification performance than accuracy when the heuristic to maximize is the transfer of information through the classifier rather than the classification error count. The NIT factor measures the effectiveness of the learning process in classifiers and also makes it harder for them to "cheat" using techniques like specialization, while also promoting the interpretability of results. Their use is demonstrated in a mind-reading task competition that aims at decoding the identity of a video stimulus from magnetoencephalography recordings. We show how the EMA and the NIT factor reject rankings based on accuracy, selecting more meaningful and interpretable classifiers.
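The abstract's two measures are both functions of the entropies of the joint distribution estimated from a contingency (confusion) matrix. The sketch below, a hypothetical illustration rather than the paper's reference implementation, assumes the definitions EMA = 2^(-H(X|Y)) and NIT = 2^(I(X;Y)) / k for a k-class problem, with all entropies in bits:

```python
import numpy as np

def entropy_measures(cm):
    """Estimate EMA and the NIT factor from a k-by-k contingency matrix.

    Assumed definitions (hedged, not taken from the paper verbatim):
      EMA = 2 ** (-H(X|Y))        -- accuracy with input distribution factored out
      NIT = 2 ** I(X;Y) / k       -- efficiency of information transfer
    where X is the true class and Y the predicted class.
    """
    p = cm / cm.sum()              # joint distribution P(X, Y)
    px = p.sum(axis=1)             # true-class marginal P(X)
    py = p.sum(axis=0)             # predicted-class marginal P(Y)

    def H(dist):
        d = dist[dist > 0]         # ignore zero-probability cells
        return -(d * np.log2(d)).sum()

    h_x, h_y, h_xy = H(px), H(py), H(p.ravel())
    mi = h_x + h_y - h_xy          # mutual information I(X; Y)
    h_x_given_y = h_xy - h_y       # conditional entropy H(X | Y)
    k = cm.shape[0]                # number of classes

    ema = 2.0 ** (-h_x_given_y)    # entropy-modulated accuracy
    nit = 2.0 ** mi / k            # normalized information transfer factor
    return ema, nit
```

On a perfect, balanced 2-class confusion matrix both measures are 1; for a classifier that sends every example to a single class, I(X;Y) = 0 and both drop to 1/2, matching the abstract's point that accuracy-style scores can be achieved without any information transfer.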