Department of Computer Science and Engineering, Michigan State University, East Lansing, MI 48824, USA.
J Healthc Eng. 2013;4(2):255-83. doi: 10.1260/2040-2295.4.2.255.
Classification of cancer based on gene expression has provided insight into possible treatment strategies. Thus, developing machine learning methods that can successfully distinguish among cancer subtypes, or between normal and cancer samples, is important. This work discusses supervised learning techniques that have been employed to classify cancers. Furthermore, a two-step feature selection method based on an attribute estimation method (e.g., ReliefF) and a genetic algorithm was employed to find a set of genes that best differentiates between cancer subtypes or between normal and cancer samples. The application of different classification methods (e.g., decision tree, k-nearest neighbor, support vector machine (SVM), bagging, and random forest) to five cancer datasets shows that no classification method universally outperforms all the others. However, k-nearest neighbor and linear SVM generally achieve better classification performance than the other classifiers. Finally, incorporating diverse types of genomic data (e.g., protein-protein interaction data together with gene expression) increases prediction accuracy compared with using gene expression alone.
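The two-step feature selection idea described above (rank genes with an attribute estimator, then search the shortlist with a genetic algorithm) can be illustrated with a minimal sketch. This is not the authors' implementation. It makes several assumptions: the data are synthetic; the estimator is basic Relief (one nearest hit and one nearest miss per sample) rather than full ReliefF; the genetic algorithm's fitness is leave-one-out 1-nearest-neighbor accuracy with a small size penalty; and every function name and parameter value (`make_toy_data`, `top_k`, `pop_size`, the 0.01 penalty, and so on) is hypothetical.

```python
import random


def make_toy_data(n_samples=40, n_genes=12, informative=(0, 1), seed=0):
    """Synthetic two-class 'expression' matrix; only `informative` genes carry signal."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n_samples):
        label = i % 2
        row = [rng.gauss(0.0, 1.0) for _ in range(n_genes)]
        for g in informative:
            row[g] += 3.0 * label  # shift informative genes for class 1
        X.append(row)
        y.append(label)
    return X, y


def distance(a, b, genes):
    """Euclidean distance between two samples restricted to the given genes."""
    return sum((a[g] - b[g]) ** 2 for g in genes) ** 0.5


def relief_scores(X, y):
    """Step 1 (assumption: basic Relief, no range normalization, not full ReliefF)."""
    genes = range(len(X[0]))
    scores = [0.0] * len(X[0])
    for i, xi in enumerate(X):
        hits = [j for j in range(len(X)) if j != i and y[j] == y[i]]
        misses = [j for j in range(len(X)) if y[j] != y[i]]
        hit = min(hits, key=lambda j: distance(xi, X[j], genes))
        miss = min(misses, key=lambda j: distance(xi, X[j], genes))
        for g in genes:
            # Reward genes that differ from the nearest miss but not the nearest hit.
            scores[g] += abs(xi[g] - X[miss][g]) - abs(xi[g] - X[hit][g])
    return [s / len(X) for s in scores]


def loo_1nn_accuracy(X, y, genes):
    """Leave-one-out accuracy of a 1-nearest-neighbor classifier on a gene subset."""
    if not genes:
        return 0.0
    correct = 0
    for i, xi in enumerate(X):
        nearest = min((j for j in range(len(X)) if j != i),
                      key=lambda j: distance(xi, X[j], genes))
        correct += y[nearest] == y[i]
    return correct / len(X)


def genetic_search(X, y, candidates, generations=15, pop_size=12, seed=0):
    """Step 2: tiny elitist GA over bit masks of the shortlisted genes."""
    rng = random.Random(seed)

    def decode(mask):
        return [g for g, bit in zip(candidates, mask) if bit]

    def fitness(mask):
        # Hypothetical penalty of 0.01 per gene to favor compact subsets.
        return loo_1nn_accuracy(X, y, decode(mask)) - 0.01 * sum(mask)

    pop = [[rng.randint(0, 1) for _ in candidates] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]          # keep the better half unchanged
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, len(candidates))
            child = a[:cut] + b[cut:]           # one-point crossover
            if rng.random() < 0.2:              # occasional single-bit mutation
                k = rng.randrange(len(child))
                child[k] = 1 - child[k]
            children.append(child)
        pop = parents + children
    return decode(max(pop, key=fitness))


def two_step_selection(X, y, top_k=6):
    """Shortlist the top_k genes by Relief score, then refine with the GA."""
    scores = relief_scores(X, y)
    ranked = sorted(range(len(scores)), key=lambda g: scores[g], reverse=True)
    candidates = ranked[:top_k]
    return candidates, genetic_search(X, y, candidates)


if __name__ == "__main__":
    X, y = make_toy_data()
    candidates, selected = two_step_selection(X, y)
    print("Relief shortlist:", candidates)
    print("GA-selected subset:", selected)
    print("LOO 1-NN accuracy:", loo_1nn_accuracy(X, y, selected))
```

The filter step keeps the search space small enough for the genetic algorithm to explore, which is the usual motivation for combining a ranking method with a wrapper search. On real expression data, the fitness would normally be estimated with cross-validation kept separate from the final evaluation, so that gene selection does not leak into the reported accuracy.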