Chao Sima, Ulisses Braga-Neto, Edward R. Dougherty
Department of Electrical Engineering, Texas A&M University, College Station, TX, USA.
Bioinformatics. 2005 Apr 1;21(7):1046-54. doi: 10.1093/bioinformatics/bti081. Epub 2004 Oct 28.
Ranking feature sets is a key issue for classification, for instance, phenotype classification based on gene expression. Since ranking is often based on error estimation, and error estimators suffer from differing degrees of imprecision in small-sample settings, it is important to choose a computationally feasible error estimator that yields good feature-set ranking.
This paper examines the feature-ranking performance of several kinds of error estimators: resubstitution, cross-validation, bootstrap and bolstered error estimation. It does so for three classification rules: linear discriminant analysis, three-nearest-neighbor classification and classification trees. Two measures of performance are considered. One counts the number of truly best feature sets appearing among the best feature sets discovered by the error estimator; the other computes the mean absolute error between the top ranks of the truly best feature sets and their ranks as given by the error estimator. Our results indicate that bolstering is superior to bootstrap, and bootstrap is better than cross-validation, at discovering top-performing feature sets for classification when using small samples. A key point is that bolstered error estimation is tens of times faster than bootstrap, and faster than cross-validation, and is therefore feasible for feature-set ranking when the number of feature sets is extremely large.
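The two ranking-performance measures described above can be sketched directly. The following is a minimal illustration, not the authors' code: it assumes each feature set has a "true" classification error and an estimated error, both given as arrays indexed by feature set, and the function names are hypothetical.

```python
import numpy as np

def top_k_overlap(true_errors, est_errors, k):
    """Count how many of the truly best k feature sets (lowest true
    error) also appear among the best k ranked by the estimated error."""
    true_top = set(np.argsort(true_errors)[:k])
    est_top = set(np.argsort(est_errors)[:k])
    return len(true_top & est_top)

def rank_mae(true_errors, est_errors, k):
    """Mean absolute difference between the ranks (1 = best) of the
    truly best k feature sets and their ranks under the estimator."""
    n = len(true_errors)
    true_rank = np.empty(n, dtype=int)
    true_rank[np.argsort(true_errors)] = np.arange(1, n + 1)
    est_rank = np.empty(n, dtype=int)
    est_rank[np.argsort(est_errors)] = np.arange(1, n + 1)
    best = np.argsort(true_errors)[:k]  # indices of truly best feature sets
    return float(np.mean(np.abs(true_rank[best] - est_rank[best])))
```

An estimator that preserves the identity of the top feature sets scores high on the first measure; one that also preserves their ordering scores low on the second.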