Chuang Li-Yeh, Chang Hsueh-Wei, Tu Chung-Jui, Yang Cheng-Hong
Department of Chemical Engineering, I-Shou University, Kaohsiung 840, Taiwan.
Comput Biol Chem. 2008 Feb;32(1):29-37. doi: 10.1016/j.compbiolchem.2007.09.005. Epub 2007 Sep 25.
Gene expression profiles, which represent the state of a cell at a molecular level, have great potential as a medical diagnosis tool. In cancer type classification, available training data sets generally have a small sample size relative to the number of genes involved. These limitations pose a challenge to certain classification methodologies. A reliable method for selecting genes relevant to sample classification is needed to speed up processing, decrease the predictive error rate, and avoid the incomprehensibility that comes with the large number of genes investigated. In this study, improved binary particle swarm optimization (IBPSO) is used to implement feature selection, and the K-nearest neighbor (K-NN) method serves as the evaluator of the IBPSO for gene expression data classification problems. Experimental results show that this method effectively simplifies feature selection and reduces the total number of features needed. Compared with the best results previously published, the proposed method achieves the highest classification accuracy in nine of the 11 gene expression data test problems and comparable accuracy in the other two.
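The wrapper scheme described in the abstract, a binary particle swarm proposing gene subsets and a K-NN classifier scoring them, can be illustrated with a minimal sketch. The abstract does not spell out what makes the swarm "improved", so the code below is plain binary PSO with a sigmoid transfer function, not the authors' IBPSO. The function names (`loo_1nn_accuracy`, `bpso_select`), the parameter values (`w`, `c1`, `c2`, `v_max`, swarm size, iteration count), and the choice of leave-one-out 1-NN accuracy as the fitness are all assumptions made for illustration.

```python
import numpy as np


def loo_1nn_accuracy(X, y, mask):
    """Leave-one-out 1-NN accuracy using only the features where mask is True."""
    if not mask.any():
        return 0.0  # an empty feature subset cannot classify anything
    Xs = X[:, mask]
    # Pairwise squared Euclidean distances between all samples.
    d = ((Xs[:, None, :] - Xs[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d, np.inf)  # a sample may not be its own neighbor
    nearest = d.argmin(axis=1)
    return float((y[nearest] == y).mean())


def bpso_select(X, y, n_particles=20, n_iter=30, w=0.9, c1=2.0, c2=2.0,
                v_max=6.0, seed=0):
    """Plain binary PSO feature selection (hypothetical parameters, not IBPSO)."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]

    def fitness(p):
        return loo_1nn_accuracy(X, y, p.astype(bool))

    # Each particle is a 0/1 vector: 1 means the gene is selected.
    pos = (rng.random((n_particles, n_features)) < 0.5).astype(int)
    vel = rng.uniform(-1.0, 1.0, (n_particles, n_features))
    pbest = pos.copy()
    pbest_fit = np.array([fitness(p) for p in pos])
    g = pbest_fit.argmax()
    gbest, gbest_fit = pbest[g].copy(), pbest_fit[g]

    for _ in range(n_iter):
        r1 = rng.random(vel.shape)
        r2 = rng.random(vel.shape)
        # Velocity update pulls each bit toward personal and global bests.
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        vel = np.clip(vel, -v_max, v_max)
        # Sigmoid turns velocity into the probability that a bit is set.
        prob = 1.0 / (1.0 + np.exp(-vel))
        pos = (rng.random(vel.shape) < prob).astype(int)

        fit = np.array([fitness(p) for p in pos])
        improved = fit > pbest_fit
        pbest[improved] = pos[improved]
        pbest_fit[improved] = fit[improved]
        g = pbest_fit.argmax()
        if pbest_fit[g] > gbest_fit:
            gbest, gbest_fit = pbest[g].copy(), pbest_fit[g]

    return gbest.astype(bool), float(gbest_fit)
```

Calling `bpso_select(X, y)` on an expression matrix `X` of shape (samples, genes) with integer class labels `y` returns a boolean gene mask and the accuracy that mask achieved. The fitness here rewards accuracy only; a variant that also penalizes the number of selected genes would push harder toward the smaller feature sets the abstract reports.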