Vinaya Vijayan, Bulsara Nadeem, Gadgil Chetan J, Gadgil Mugdha
Department of Bioinformatics, Dr. D.Y. Patil Biotechnology and Bioinformatics Institute, Akurdi, Pune 411044, India.
Int J Bioinform Res Appl. 2009;5(4):417-31. doi: 10.1504/IJBRA.2009.027515.
High-throughput gene expression data can be used to identify biomarker profiles for classification. The accuracy of microarray-based sample classification depends on both the algorithm used to select the features (genes) and the classification algorithm itself. We evaluated the performance of over 2000 combinations of feature selection and classification algorithms in classifying cancer datasets. One of these combinations (SVM for ranking genes + SMO) shows excellent classification accuracy using a small number of genes across the three cancer datasets tested. Notably, classification using 15 selected genes yields 96% accuracy for a dataset obtained on an independent microarray platform.
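The "rank genes with an SVM, then classify on the top-ranked genes" idea can be illustrated with a minimal sketch. It is not the authors' pipeline. It assumes centered expression values (so no bias term is fitted), uses synthetic toy data, and swaps in plain subgradient descent on the hinge loss for the SMO solver named in the abstract. All function names, the gene indices 3 and 7, and the parameter values are hypothetical.

```python
import random

def train_linear_svm(X, y, epochs=50, eta=0.01, lam=0.01):
    """Bias-free linear SVM fitted by subgradient descent on the
    L2-regularized hinge loss (a stand-in for SMO, not SMO itself)."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * sum(wj * xj for wj, xj in zip(w, xi))
            # Shrink every weight (L2 penalty) ...
            w = [wj * (1.0 - eta * lam) for wj in w]
            # ... then step towards samples that violate the margin.
            if margin < 1.0:
                w = [wj + eta * yi * xj for wj, xj in zip(w, xi)]
    return w

def rank_genes(w):
    """Gene indices ordered by decreasing absolute SVM weight."""
    return sorted(range(len(w)), key=lambda j: abs(w[j]), reverse=True)

def select(X, genes):
    """Keep only the listed gene columns."""
    return [[row[j] for j in genes] for row in X]

def predict(w, X):
    """Class label (+1 or -1) from the sign of the linear score."""
    return [1 if sum(wj * xj for wj, xj in zip(w, xi)) >= 0 else -1 for xi in X]

def make_toy_data(n_samples, n_genes, informative, seed=0):
    """Synthetic data: informative genes track the label, the rest are noise."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n_samples):
        label = 1 if i % 2 == 0 else -1
        row = [rng.uniform(-1.0, 1.0) for _ in range(n_genes)]
        for j in informative:
            row[j] = 2.0 * label + rng.uniform(-0.5, 0.5)
        X.append(row)
        y.append(label)
    return X, y

# Hypothetical setup: 60 samples, 20 genes, genes 3 and 7 carry the signal.
X, y = make_toy_data(60, 20, informative=(3, 7))
X_train, y_train, X_test, y_test = X[:40], y[:40], X[40:], y[40:]

# Step 1: rank genes by the weights of an SVM trained on all genes.
ranking = rank_genes(train_linear_svm(X_train, y_train))
top_genes = ranking[:2]

# Step 2: retrain on the selected genes only and score held-out samples.
w_small = train_linear_svm(select(X_train, top_genes), y_train)
predictions = predict(w_small, select(X_test, top_genes))
accuracy = sum(p == t for p, t in zip(predictions, y_test)) / len(y_test)
print(sorted(top_genes), accuracy)
```

Ranking and final classification are kept as separate steps so that either can be replaced independently, which mirrors the idea of evaluating many selector and classifier combinations. On real microarray data the ranking should be computed inside each cross-validation fold to avoid selection bias.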