Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms.

Author Information

Dietterich TG

Affiliation

Department of Computer Science, Oregon State University, Dearborn Hall 303, Corvallis, OR 97331, USA.

Publication Information

Neural Comput. 1998 Sep 15;10(7):1895-1923. doi: 10.1162/089976698300017197.

Abstract

This article reviews five approximate statistical tests for determining whether one learning algorithm outperforms another on a particular learning task. These tests are compared experimentally to determine their probability of incorrectly detecting a difference when no difference exists (type I error). Two widely used statistical tests are shown to have high probability of type I error in certain situations and should never be used: a test for difference of two proportions and a paired-differences t test based on taking several random train-test splits. A third test, a paired-differences t test based on 10-fold cross-validation, exhibits somewhat elevated probability of type I error. A fourth test, McNemar's test, is shown to have low type I error. The fifth test is a new test, 5 x 2 cv, based on five iterations of twofold cross-validation. Experiments show that this test also has acceptable type I error. The article also measures the power (ability to detect algorithm differences when they do exist) of these tests. The cross-validated t test is the most powerful. The 5 x 2 cv test is shown to be slightly more powerful than McNemar's test. The choice of the best test is determined by the computational cost of running the learning algorithm. For algorithms that can be executed only once, McNemar's test is the only test with acceptable type I error. For algorithms that can be executed 10 times, the 5 x 2 cv test is recommended, because it is slightly more powerful and because it directly measures variation due to the choice of training set.
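
To make the recommendations concrete, here is a minimal sketch of McNemar's test for the single-train/test-run setting the abstract describes. It assumes NumPy and SciPy are available and that the two classifiers' predictions on a shared held-out test set are already in hand; the function name mcnemar_test and its arguments are illustrative, not from the paper.

```python
import numpy as np
from scipy import stats

def mcnemar_test(y_true, pred_a, pred_b):
    """McNemar's test (with continuity correction) for two classifiers
    trained once and evaluated on the same held-out test set.

    Only the examples on which the classifiers disagree carry
    information about their difference.
    """
    correct_a = pred_a == y_true
    correct_b = pred_b == y_true
    n01 = int(np.sum(~correct_a & correct_b))  # A wrong, B right
    n10 = int(np.sum(correct_a & ~correct_b))  # A right, B wrong
    if n01 + n10 == 0:
        return 0.0, 1.0  # the classifiers never disagree
    chi2 = (abs(n01 - n10) - 1) ** 2 / (n01 + n10)
    p_value = stats.chi2.sf(chi2, df=1)  # chi-squared, 1 degree of freedom
    return chi2, p_value
```

Under the null hypothesis that the two algorithms have the same error rate, the statistic is approximately chi-squared with 1 degree of freedom, which is why this test needs only one training run per algorithm.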
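
For algorithms cheap enough to run ten times, the abstract recommends the 5 x 2 cv paired t test. The sketch below follows the construction described there: five replications of twofold cross-validation, a per-replication variance estimate, and a t statistic with 5 degrees of freedom whose numerator is the error difference from the first fold of the first replication. It assumes scikit-learn-style estimators with fit/predict methods and NumPy array inputs; names such as five_by_two_cv_ttest are hypothetical.

```python
import numpy as np
from scipy import stats

def five_by_two_cv_ttest(clf_a, clf_b, X, y, seed=0):
    """5 x 2 cv paired t test: five replications of twofold
    cross-validation, with the statistic referred to a
    t distribution with 5 degrees of freedom."""
    rng = np.random.RandomState(seed)
    n = len(y)
    variances = []
    first_diff = None  # p_1^(1): difference on fold 1 of replication 1

    for _ in range(5):
        # One replication: randomly split the data into two halves.
        perm = rng.permutation(n)
        s1, s2 = perm[: n // 2], perm[n // 2 :]

        diffs = []
        for train_idx, test_idx in ((s1, s2), (s2, s1)):
            errs = []
            for clf in (clf_a, clf_b):
                clf.fit(X[train_idx], y[train_idx])
                errs.append(np.mean(clf.predict(X[test_idx]) != y[test_idx]))
            diffs.append(errs[0] - errs[1])  # p^(1), then p^(2)

        if first_diff is None:
            first_diff = diffs[0]
        mean = (diffs[0] + diffs[1]) / 2.0
        variances.append((diffs[0] - mean) ** 2 + (diffs[1] - mean) ** 2)

    # t = p_1^(1) / sqrt((1/5) * sum of the five s_i^2 estimates)
    t_stat = first_diff / np.sqrt(np.mean(variances))
    p_value = 2 * stats.t.sf(abs(t_stat), df=5)
    return t_stat, p_value
```

Because each replication retrains both algorithms on fresh random halves of the data, the variance estimate directly reflects variation due to the choice of training set, which is the property the abstract highlights.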
