Pirooznia Mehdi, Yang Jack Y, Yang Mary Qu, Deng Youping
Department of Biological Sciences, University of Southern Mississippi, Hattiesburg 39406, USA.
BMC Genomics. 2008;9 Suppl 1(Suppl 1):S13. doi: 10.1186/1471-2164-9-S1-S13.
Several classification and feature selection methods have been studied for the identification of differentially expressed genes in microarray data. Classification methods such as SVM, RBF Neural Nets, MLP Neural Nets, Bayesian, Decision Tree and Random Forest methods have been used in recent studies. The accuracy of these methods has been calculated with validation methods such as v-fold validation. However, there is a lack of comparison between these methods to find a better framework for classification, clustering and analysis of microarray gene expression results.
In this study, we compared the efficiency of classification methods including SVM, RBF Neural Nets, MLP Neural Nets, Bayesian, Decision Tree and Random Forest methods. V-fold cross-validation was used to calculate the accuracy of the classifiers. Some of the common clustering methods, including K-means, DBC, and EM clustering, were applied to the datasets and the efficiency of these methods was analysed. Further, the efficiency of feature selection methods including support vector machine recursive feature elimination (SVM-RFE), Chi Squared, and CSF was compared. In each case these methods were applied to eight different binary (two-class) microarray datasets. We evaluated the class prediction efficiency of each gene list in training and test cross-validation using supervised classifiers.
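As an illustration of the v-fold cross-validation step, the following is a minimal sketch in plain Python, not the code used in the study. It assumes a synthetic two-class dataset and a simple nearest-centroid classifier as a stand-in for the classifiers listed above; all function names, the toy data generator, and its parameters are hypothetical.

```python
import random
from statistics import mean


def v_fold_indices(n_samples, v, seed=0):
    """Shuffle the sample indices and split them into v nearly equal folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::v] for i in range(v)]


def nearest_centroid_fit(X, y):
    """Return one mean expression profile (centroid) per class label."""
    centroids = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [mean(col) for col in zip(*rows)]
    return centroids


def nearest_centroid_predict(centroids, x):
    """Assign x to the class whose centroid is closest (squared Euclidean)."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: dist(centroids[label]))


def v_fold_accuracy(X, y, v=10, seed=0):
    """Train on v-1 folds, test on the held-out fold, and average accuracy."""
    folds = v_fold_indices(len(X), v, seed)
    scores = []
    for k, test_idx in enumerate(folds):
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        model = nearest_centroid_fit([X[i] for i in train_idx],
                                     [y[i] for i in train_idx])
        hits = sum(nearest_centroid_predict(model, X[i]) == y[i]
                   for i in test_idx)
        scores.append(hits / len(test_idx))
    return mean(scores)


def make_toy_data(n_per_class=20, n_genes=50, n_informative=5,
                  shift=2.0, seed=1):
    """Synthetic stand-in for a binary microarray dataset (assumption)."""
    rng = random.Random(seed)
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            row = [rng.gauss(0.0, 1.0) for _ in range(n_genes)]
            for g in range(n_informative):
                row[g] += shift * label  # only these genes differ by class
            X.append(row)
            y.append(label)
    return X, y


X, y = make_toy_data()
acc = v_fold_accuracy(X, y, v=10)
print(f"10-fold cross-validated accuracy: {acc:.2f}")
```

Each sample is used for testing exactly once, so the averaged score estimates how the classifier would perform on samples it has not seen. Swapping in another classifier only requires replacing the fit and predict functions.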
We presented a study comparing some of the commonly used classification, clustering, and feature selection methods. We applied these methods to eight publicly available datasets and compared how they performed in class prediction of test datasets. We reported that the choice of feature selection method, the number of genes in the gene list, and the number of cases (samples) substantially influence classification success. Based on the features chosen by these methods, error rates and accuracies of several classification algorithms were obtained. The results revealed the importance of feature selection in accurately classifying new samples, and showed how an integrated feature selection and classification algorithm performs and that it is capable of identifying significant genes.
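To make the role of the gene list concrete, here is a minimal sketch of chi-squared gene ranking in plain Python. It is an illustration under stated assumptions, not the study's procedure: the abstract does not say how expression values were discretized, so the median split used here is an assumption, and the function names and toy data are hypothetical.

```python
import random
from statistics import median


def chi2_2x2(a, b, c, d):
    """Pearson chi-squared statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # a degenerate table carries no class information
    return n * (a * d - b * c) ** 2 / denom


def chi2_gene_scores(X, y):
    """Score each gene by chi-squared between high/low expression and class.

    Expression is binarized at the gene's median (an assumed discretization).
    """
    scores = []
    for col in zip(*X):
        cut = median(col)
        a = sum(1 for v, lab in zip(col, y) if v > cut and lab == 1)
        b = sum(1 for v, lab in zip(col, y) if v > cut and lab == 0)
        c = sum(1 for v, lab in zip(col, y) if v <= cut and lab == 1)
        d = sum(1 for v, lab in zip(col, y) if v <= cut and lab == 0)
        scores.append(chi2_2x2(a, b, c, d))
    return scores


def top_k_genes(X, y, k):
    """Return the indices of the k highest-scoring genes (the gene list)."""
    scores = chi2_gene_scores(X, y)
    ranked = sorted(range(len(scores)), key=lambda g: scores[g], reverse=True)
    return ranked[:k]


# Toy two-class data: genes 0-2 are shifted in class 1, the rest are noise.
rng = random.Random(7)
X, y = [], []
for label in (0, 1):
    for _ in range(20):
        row = [rng.gauss(0.0, 1.0) for _ in range(30)]
        for g in range(3):
            row[g] += 4.0 * label
        X.append(row)
        y.append(label)

selected = top_k_genes(X, y, k=3)
print("Selected gene indices:", sorted(selected))
```

When a ranking like this is combined with cross-validation, the gene list should be recomputed from the training folds only; selecting genes on the full dataset first lets information from the test samples leak into the accuracy estimate.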