Department of Chemical Engineering, I-Shou University, Kaohsiung 80041, Taiwan.
Comput Biol Med. 2011 Apr;41(4):228-37. doi: 10.1016/j.compbiomed.2011.02.004. Epub 2011 Mar 3.
Gene expression profiles, which represent the state of a cell at the molecular level, have great potential as a medical diagnostic tool. In cancer classification, the available training data sets generally have a fairly small sample size relative to the number of genes involved, and this imbalance poses a challenge for certain classification methods. Feature (gene) selection can extract the genes that directly influence classification accuracy and eliminate those that have no influence on it, which improves both computational performance and classification accuracy. In this paper, correlation-based feature selection (CFS) and the Taguchi-genetic algorithm (TGA) were combined into a hybrid method, and the K-nearest neighbor (KNN) classifier, evaluated with leave-one-out cross-validation (LOOCV), was used to calculate the classification accuracy for eleven classification profiles. Experimental results show that the proposed method effectively reduced redundant features and achieved superior classification accuracy: compared with other classification methods from the literature, it obtained higher accuracy on ten of the eleven gene expression data set test problems.
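The evaluation step described in the abstract, scoring a candidate gene subset by the LOOCV accuracy of a KNN classifier, can be sketched as follows. This is a minimal illustration and not the authors' implementation: the abstract does not state the value of K or the distance metric, so `k=1` and Euclidean distance are assumptions, the function names `knn_predict` and `loocv_accuracy` are hypothetical, and the six-sample data set is invented purely for demonstration. The CFS and TGA search steps are not shown.

```python
from collections import Counter
from math import dist


def knn_predict(train_X, train_y, query, k=1):
    """Predict a label by majority vote among the k nearest training samples."""
    # Rank training samples by Euclidean distance to the query (assumed metric).
    neighbors = sorted(zip(train_X, train_y), key=lambda pair: dist(pair[0], query))
    votes = Counter(label for _, label in neighbors[:k])
    return votes.most_common(1)[0][0]


def loocv_accuracy(X, y, selected, k=1):
    """Leave-one-out accuracy of KNN restricted to the feature indices in `selected`."""
    # Keep only the selected features (genes) of every sample.
    reduced = [[row[j] for j in selected] for row in X]
    correct = 0
    for i in range(len(reduced)):
        # Hold out sample i and train on all remaining samples.
        train_X = reduced[:i] + reduced[i + 1:]
        train_y = y[:i] + y[i + 1:]
        if knn_predict(train_X, train_y, reduced[i], k) == y[i]:
            correct += 1
    return correct / len(reduced)


# Invented toy data: feature 0 separates the classes, features 1 and 2 do not.
X = [
    [0.1, 5.0, 9.0],
    [0.2, 1.0, 2.0],
    [0.3, 9.0, 5.0],
    [0.9, 5.1, 9.1],
    [1.0, 1.1, 2.1],
    [1.1, 9.1, 5.1],
]
y = ["normal", "normal", "normal", "tumor", "tumor", "tumor"]

acc_selected = loocv_accuracy(X, y, selected=[0])
acc_all = loocv_accuracy(X, y, selected=[0, 1, 2])
print(acc_selected, acc_all)
```

On this toy data, the single informative feature classifies every held-out sample correctly, while including the two uninformative features misleads the nearest-neighbor search. A feature selection method such as the hybrid described above would use a score of this kind to compare candidate gene subsets.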