Yeang C H, Ramaswamy S, Tamayo P, Mukherjee S, Rifkin R M, Angelo M, Reich M, Lander E, Mesirov J, Golub T
Center for Genome Research, MIT Whitehead Institute, One Kendall Square, Cambridge, MA 02139, USA.
Bioinformatics. 2001;17 Suppl 1:S316-22. doi: 10.1093/bioinformatics/17.suppl_1.s316.
Using gene expression data to classify tumor types is a promising tool for cancer diagnosis. Previous work has shown that several pairs of tumor types can be successfully distinguished by their gene expression patterns (Golub et al. 1999, Ben-Dor et al. 2000, Alizadeh et al. 2000). However, simultaneous classification across a heterogeneous set of tumor types has not yet been well studied. We obtained 190 samples from 14 tumor classes and generated a combined expression dataset containing 16063 genes for each of those samples. We performed multi-class classification by combining the outputs of binary classifiers. Three binary classifiers (k-nearest neighbors, weighted voting, and support vector machines) were applied in conjunction with three combination scenarios (one-vs-all, all-pairs, hierarchical partitioning). The one-vs-all support vector machine algorithm gave the best results, with a cross-validation error rate of 18.75% and a test error rate of 21.74%. The results demonstrate the feasibility of performing clinically useful classification from samples of multiple tumor types.
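The one-vs-all combination scheme named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the binary scorer here is a simple nearest-centroid rule standing in for the support vector machine, and the function names, toy data, and class labels are all hypothetical. The sketch trains one binary scorer per class (that class against all the others) and assigns a new sample to the class whose scorer is most confident.

```python
from math import dist


def centroid(rows):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def train_one_vs_all(samples, labels):
    """Fit one binary scorer per class: that class versus all the others."""
    models = {}
    for cls in sorted(set(labels)):
        pos = [x for x, y in zip(samples, labels) if y == cls]
        neg = [x for x, y in zip(samples, labels) if y != cls]
        # Assumption: a centroid pair stands in for a trained binary classifier.
        models[cls] = (centroid(pos), centroid(neg))
    return models


def predict_one_vs_all(models, x):
    """Return the class whose binary scorer is most confident about x."""
    def score(cls):
        pos_c, neg_c = models[cls]
        # Positive when x is closer to the class centroid than to the rest.
        return dist(x, neg_c) - dist(x, pos_c)
    return max(models, key=score)


# Toy data (hypothetical): three classes, two "genes" per sample.
samples = [[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9], [0.0, 5.0], [0.1, 5.2]]
labels = ["A", "A", "B", "B", "C", "C"]
models = train_one_vs_all(samples, labels)
```

Calling `predict_one_vs_all(models, [0.1, 0.0])` returns the label of the winning class. With 14 tumor classes this scheme needs 14 binary scorers, whereas an all-pairs scheme would need one per pair of classes, that is 14 × 13 / 2 = 91.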