Dabney Alan R
Department of Biostatistics, University of Washington, Seattle, 98195, USA.
Bioinformatics. 2005 Nov 15;21(22):4148-54. doi: 10.1093/bioinformatics/bti681. Epub 2005 Sep 20.
Classification of biological samples by microarrays is a topic of much interest, and a number of methods have been proposed and successfully applied to the problem. It has recently been shown that classification by nearest centroids provides an accurate predictor that may outperform much more complicated methods. The 'Prediction Analysis of Microarrays' (PAM) approach is one such example, and its authors motivate it largely by its simplicity and interpretability. In this spirit, I assess the performance of classifiers that are even simpler than PAM.
Surprisingly, I find that the modified t-statistics and shrunken centroids employed by PAM tend to increase misclassification error when compared with their simpler counterparts. Based on these observations, I propose a classification method called 'Classification to Nearest Centroids' (ClaNC). ClaNC ranks genes by standard t-statistics, does not shrink centroids and uses a class-specific gene-selection procedure. Because of these modifications, ClaNC is arguably simpler and easier to interpret than PAM, and it can be viewed as a traditional nearest centroid classifier that uses specially selected genes. I demonstrate that, for a given number of active genes, ClaNC error rates tend to be significantly lower than those of PAM.
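To make the three ingredients named above concrete (ranking by standard t-statistics, unshrunken centroids, class-specific gene selection), the following is a minimal sketch and not the ClaNC software itself. Everything specific in it is an assumption made for illustration: the function names `fit_nearest_centroids` and `predict`, the parameter `genes_per_class`, the choice of a pooled within-class standard deviation, and the use of a class-versus-overall-mean t-statistic. The published selection procedure may differ in detail, for example in how it handles a gene that ranks highly for more than one class.

```python
import numpy as np

def fit_nearest_centroids(X, y, genes_per_class):
    """Fit an unshrunken nearest-centroid classifier (illustrative sketch).

    X: (n_samples, n_genes) expression matrix; y: (n_samples,) class labels.
    Assumes at least two classes, each with at least one sample.
    """
    classes = np.unique(y)
    n, p = X.shape
    overall = X.mean(axis=0)
    # Plain class means: no shrinkage toward the overall centroid.
    centroids = np.vstack([X[y == k].mean(axis=0) for k in classes])

    # Pooled within-class standard deviation per gene (assumed variance model).
    resid = X - centroids[np.searchsorted(classes, y)]
    pooled_sd = np.sqrt((resid ** 2).sum(axis=0) / (n - len(classes)))
    pooled_sd = np.maximum(pooled_sd, 1e-12)  # guard against zero variance

    # Class-specific selection: each class contributes its own top-ranked genes.
    active = np.zeros(p, dtype=bool)
    for i, k in enumerate(classes):
        n_k = np.sum(y == k)
        # Standard error of (class mean - overall mean) under a common variance.
        se = pooled_sd * np.sqrt(1.0 / n_k - 1.0 / n)
        t = (centroids[i] - overall) / se
        top = np.argsort(-np.abs(t))[:genes_per_class]
        active[top] = True

    return {"classes": classes, "centroids": centroids,
            "pooled_sd": pooled_sd, "active": active}

def predict(model, X):
    """Assign each sample to the class with the nearest centroid,
    using only the active genes and distances standardized by pooled SD."""
    g = model["active"]
    z = (X[:, None, g] - model["centroids"][None, :, g]) / model["pooled_sd"][g]
    dist = (z ** 2).sum(axis=2)
    return model["classes"][dist.argmin(axis=1)]
```

In this sketch the active gene set is the union of the per-class selections, so it may contain fewer than `genes_per_class` times the number of classes when selections overlap. Comparing error rates "for a given number of active genes", as the abstract does, would amount to varying `genes_per_class` and estimating error by cross-validation.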
Point-and-click software is freely available at http://students.washington.edu/adabney/clanc.