Multiple SVM-RFE for gene selection in cancer classification with expression data.

Author information

Duan Kai-Bo, Rajapakse Jagath C, Wang Haiying, Azuaje Francisco

Affiliations

BioInformatics Research Centre, School of Computer Engineering, Nanyang Technological University, Singapore 639798.

Publication information

IEEE Trans Nanobioscience. 2005 Sep;4(3):228-34. doi: 10.1109/tnb.2005.853657.

PMID:16220686

Abstract

This paper proposes a new feature selection method that uses a backward elimination procedure similar to that implemented in support vector machine recursive feature elimination (SVM-RFE). Unlike the SVM-RFE method, at each step, the proposed approach computes the feature ranking score from a statistical analysis of weight vectors of multiple linear SVMs trained on subsamples of the original training data. We tested the proposed method on four gene expression datasets for cancer classification. The results show that the proposed feature selection method selects better gene subsets than the original SVM-RFE and improves the classification accuracy. A Gene Ontology-based similarity assessment indicates that the selected subsets are functionally diverse, further validating our gene selection method. This investigation also suggests that, for gene expression-based cancer classification, average test error from multiple partitions of training and test sets can be recommended as a reference of performance quality.
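The backward elimination loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation, and several pieces are assumptions made for illustration: the abstract does not name the statistic computed over the weight vectors, so the sketch scores each feature by the mean of its squared, norm-scaled weights divided by their standard deviation across the SVMs; the linear SVM is a small Pegasos-style subgradient trainer without a bias term, swapped in so the example needs only the Python standard library; and the function names, subsample fraction, number of SVMs, and toy data are all hypothetical.

```python
import math
import random


def train_linear_svm(X, y, epochs=50, lam=0.01, rng=None):
    """Pegasos-style subgradient descent for a linear SVM (no bias term)."""
    rng = rng or random.Random(0)
    n, d = len(X), len(X[0])
    w = [0.0] * d
    t = 0
    for _ in range(epochs):
        for i in rng.sample(range(n), n):  # one shuffled pass over the data
            t += 1
            eta = 1.0 / (lam * t)
            margin = y[i] * sum(wj * xj for wj, xj in zip(w, X[i]))
            w = [(1.0 - eta * lam) * wj for wj in w]  # regularization shrink
            if margin < 1.0:  # hinge loss is active for this sample
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, X[i])]
    return w


def ranking_scores(X, y, features, n_svms=10, frac=0.5, rng=None):
    """Score each feature from the weights of SVMs trained on subsamples.

    Assumed statistic: mean / std of squared, norm-scaled weights.
    """
    rng = rng or random.Random(0)
    n = len(X)
    k = max(2, int(frac * n))
    squared = []
    for _ in range(n_svms):
        idx = rng.sample(range(n), k)  # subsample without replacement
        Xs = [[X[i][f] for f in features] for i in idx]
        ys = [y[i] for i in idx]
        w = train_linear_svm(Xs, ys, rng=rng)
        norm = math.sqrt(sum(wj * wj for wj in w)) or 1.0
        squared.append([(wj / norm) ** 2 for wj in w])
    scores = []
    for j in range(len(features)):
        col = [row[j] for row in squared]
        mean = sum(col) / len(col)
        std = math.sqrt(sum((c - mean) ** 2 for c in col) / len(col))
        scores.append(mean / (std + 1e-12))  # small constant avoids 0-division
    return scores


def msvm_rfe(X, y, n_svms=10, frac=0.5, seed=0):
    """Backward elimination: drop the lowest-scoring feature at each step."""
    rng = random.Random(seed)
    remaining = list(range(len(X[0])))
    eliminated = []
    while len(remaining) > 1:
        scores = ranking_scores(X, y, remaining, n_svms, frac, rng)
        worst = min(range(len(remaining)), key=scores.__getitem__)
        eliminated.append(remaining.pop(worst))
    eliminated.extend(remaining)
    return eliminated[::-1]  # feature indices, last-eliminated first


def make_toy_data(n=40, n_noise=4, seed=0):
    """Hypothetical data: features 0 and 1 carry the label, the rest are noise."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = 1 if i % 2 == 0 else -1
        row = [2.0 * label + rng.gauss(0, 0.5) for _ in range(2)]
        row += [rng.gauss(0, 1.0) for _ in range(n_noise)]
        X.append(row)
        y.append(label)
    return X, y


if __name__ == "__main__":
    X, y = make_toy_data()
    print(msvm_rfe(X, y))
```

Removing one feature per step keeps the sketch short; with thousands of genes, a practical variant would likely remove several features per step while the remaining set is large. The original SVM-RFE corresponds roughly to the special case of a single SVM trained on all the training data and ranked by squared weight alone.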
