Chuang L-Y, Yang C-S, Wu K-C, Yang C-H
Institute of Biotechnology and Chemical Engineering, I-Shou University, Kaohsiung, Taiwan.
Methods Inf Med. 2010;49(3):254-68. doi: 10.3414/ME09-01-0010. Epub 2010 Feb 5.
Microarray gene expression profiles have yielded valuable results for a variety of problems and have contributed to advances in clinical medicine. Microarray data characteristically have a high dimension and a small sample size, which makes it difficult for a general classification method to classify samples correctly. However, not every gene is relevant for distinguishing the sample class. Thus, to analyze gene expression profiles correctly, feature (gene) selection is crucial for the classification process, and an effective gene selection method is necessary for eliminating irrelevant genes and decreasing the classification error rate.
The purpose of gene expression analysis is to discriminate between classes of samples, and to predict the relative importance of each gene for sample classification.
In this paper, correlation-based feature selection (CFS) and Taguchi-binary particle swarm optimization (TBPSO) were combined into a hybrid method, and the K-nearest neighbor (K-NN) with leave-one-out cross-validation (LOOCV) method served as a classifier for ten gene expression profiles.
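The evaluation step named above, K-NN scored by leave-one-out cross-validation on a candidate gene subset, can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name `loocv_error_rate`, the Euclidean distance, the default `k=1`, and the toy data are all assumptions, since the abstract does not state them. In a wrapper method such as a binary PSO, an error rate of this kind would typically serve as the fitness of each candidate subset.

```python
from collections import Counter
from math import dist


def loocv_error_rate(samples, labels, selected, k=1):
    """Leave-one-out error rate of K-NN restricted to the selected feature indices.

    Hypothetical helper for illustration; Euclidean distance and k=1 are assumptions.
    """
    # Project every sample onto the selected features (genes).
    projected = [[row[j] for j in selected] for row in samples]
    errors = 0
    for i, held_out in enumerate(projected):
        # Distances from the held-out sample to every other sample, nearest first.
        neighbours = sorted(
            (dist(held_out, other), labels[j])
            for j, other in enumerate(projected)
            if j != i
        )
        # Majority vote among the k nearest neighbours.
        votes = Counter(label for _, label in neighbours[:k])
        predicted = votes.most_common(1)[0][0]
        errors += predicted != labels[i]
    return errors / len(samples)


# Toy data invented for illustration: feature 0 separates the classes, feature 1 is noise.
X = [[0.1, 5.0], [0.2, 1.0], [0.9, 5.1], [1.0, 1.1]]
y = ["A", "A", "B", "B"]
print(loocv_error_rate(X, y, selected=[0]))     # informative feature only
print(loocv_error_rate(X, y, selected=[0, 1]))  # informative plus noisy feature
```

On this toy data, dropping the noisy feature lowers the leave-one-out error, which is the effect a feature selection step aims for on real expression profiles.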
Experimental results show that this hybrid method effectively simplifies feature selection by reducing the number of features needed. The proposed method achieved the lowest classification error rate on all ten gene expression data sets tested. For six of the data sets, a classification error rate of zero could be reached.
The proposed method outperformed five other methods from the literature in terms of classification error rate. It could thus constitute a valuable tool for gene expression analysis in future studies.