Yang Kun, Cai Zhipeng, Li Jianzhong, Lin Guohui
Department of Computer Science and Engineering, Harbin Institute of Technology, Harbin 150001, China.
BMC Bioinformatics. 2006 Apr 27;7:228. doi: 10.1186/1471-2105-7-228.
Microarray data analysis is notorious for involving a huge number of genes compared to a relatively small number of samples. Gene selection aims to detect the genes that are most significantly differentially expressed under different conditions, and it has been a central research focus. In general, a better gene selection method can significantly improve classification performance. One of the difficulties in gene selection is that the number of samples can differ substantially from one condition to another.
This paper proposes two novel gene selection methods that are not affected by unbalanced sample class sizes and do not assume any explicit statistical model for the gene expression values. They were evaluated on eight publicly available microarray datasets, using leave-one-out cross-validation and 5-fold cross-validation. Performance is measured by the classification accuracy obtained using the top-ranked genes, where the ranking is computed on the training datasets.
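The evaluation protocol described above can be sketched as follows. This is a minimal illustration only, not the authors' implementation: the ranking score (absolute difference of class means divided by the sum of class standard deviations) is a generic stand-in rather than either of the two proposed methods, a 1-nearest-neighbour rule stands in for the KNN-based classifier, and the function names `rank_genes`, `predict_1nn`, and `loocv_accuracy` and the toy data are hypothetical. The point it shows is that genes are ranked on the training fold only, so the held-out sample never influences the selection.

```python
from statistics import mean, pstdev


def rank_genes(samples, labels):
    """Rank gene indices from most to least discriminative (two classes).

    Stand-in score, NOT the paper's method:
    |mean_a - mean_b| / (std_a + std_b), computed per gene.
    """
    classes = sorted(set(labels))
    assert len(classes) == 2, "this sketch handles two classes only"
    a, b = classes
    n_genes = len(samples[0])
    scores = []
    for g in range(n_genes):
        xa = [s[g] for s, y in zip(samples, labels) if y == a]
        xb = [s[g] for s, y in zip(samples, labels) if y == b]
        spread = pstdev(xa) + pstdev(xb) + 1e-9  # avoid division by zero
        scores.append(abs(mean(xa) - mean(xb)) / spread)
    return sorted(range(n_genes), key=lambda g: scores[g], reverse=True)


def predict_1nn(train_x, train_y, query, genes):
    """Label of the nearest training sample, using only the selected genes."""
    def sq_dist(u, v):
        return sum((u[g] - v[g]) ** 2 for g in genes)

    nearest = min(range(len(train_x)), key=lambda i: sq_dist(train_x[i], query))
    return train_y[nearest]


def loocv_accuracy(samples, labels, top_k):
    """Leave-one-out accuracy; genes are re-ranked inside every fold."""
    correct = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        # Selection sees the training fold only, never the held-out sample.
        genes = rank_genes(train_x, train_y)[:top_k]
        if predict_1nn(train_x, train_y, samples[i], genes) == labels[i]:
            correct += 1
    return correct / len(samples)


# Hypothetical toy data: 5 samples of class "A", 2 of class "B" (unbalanced).
# Gene 0 separates the classes; genes 1 and 2 are noise.
samples = [
    [1.0, 5.0, 3.0], [1.2, 2.0, 7.0], [0.9, 8.0, 1.0],
    [1.1, 4.0, 6.0], [1.0, 6.5, 2.0],
    [3.0, 5.5, 4.0], [3.2, 3.0, 5.0],
]
labels = ["A", "A", "A", "A", "A", "B", "B"]

accuracy = loocv_accuracy(samples, labels, top_k=1)
print(f"LOOCV accuracy with the top-ranked gene: {accuracy:.2f}")
```

A 5-fold variant would differ only in how the held-out indices are chosen. Each class needs at least two samples for this sketch, since leaving out the only sample of a class would leave nothing to rank against.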
The experimental results showed that the proposed gene selection methods are efficient, effective, and robust in identifying differentially expressed genes. With existing SVM-based and KNN-based classifiers, the genes selected by the proposed methods generally give more accurate classification results, particularly when the sample class sizes in the training dataset are unbalanced.