Chen Yuan, Wang Lifeng, Li Lanzhi, Zhang Hongyan, Yuan Zheming
Hunan Provincial Key Laboratory for Biology and Control of Plant Diseases and Insect Pests, Changsha, China.
Hunan Provincial Key Laboratory for Germplasm Innovation and Utilization of Crop, Hunan Agricultural University, Changsha, China.
BMC Bioinformatics. 2016 Jan 20;17:44. doi: 10.1186/s12859-016-0893-0.
Selecting a parsimonious set of informative genes to build a classifier with high generalization performance is the most important task in the analysis of tumor microarray expression data. Many existing gene-pair evaluation methods use only one of two strategies, vertical comparison or horizontal comparison, and therefore cannot highlight the diverse patterns of gene pairs, while individual-gene-ranking methods ignore redundancy and synergy among genes.
Here we propose a novel score measure named relative simplicity (RS). We evaluated gene pairs by integrating vertical comparison with horizontal comparison, and then built an RS-based direct classifier (RS-based DC) from a set of informative genes capable of binary discrimination, using a paired-votes strategy. Nine multi-class gene expression datasets involving human cancers were used to validate the performance of the new method. Compared with the nine reference models, RS-based DC achieved the highest average independent test accuracy (91.40%), the best generalization performance, and the smallest average number of informative genes (20.56). Compared with the four reference feature selection methods, RS also achieved the highest average test accuracy with three classifiers (Naïve Bayes, k-Nearest Neighbor and Support Vector Machine), and only RS improved the performance of SVM.
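The abstract gives neither the RS formula nor the exact voting rule of the DC classifier, so the following is only an illustrative sketch of the general idea of combining the two comparison strategies through votes. It assumes that "vertical comparison" means comparing one gene's expression across samples of different classes (here, against two class means) and that "horizontal comparison" means comparing two genes within the same sample. All function names, rule formats, and toy values are hypothetical and are not taken from the paper.

```python
from collections import Counter

def vertical_vote(sample, gene, mean_a, mean_b):
    """Assumed vertical comparison: one gene, compared across classes.
    Votes for the class whose (hypothetical) mean expression is nearer."""
    x = sample[gene]
    return "A" if abs(x - mean_a) <= abs(x - mean_b) else "B"

def horizontal_vote(sample, gene_i, gene_j, class_if_i_higher):
    """Assumed horizontal comparison: two genes, compared within one sample.
    Votes for class_if_i_higher when gene_i is expressed above gene_j."""
    other = "B" if class_if_i_higher == "A" else "A"
    return class_if_i_higher if sample[gene_i] > sample[gene_j] else other

def classify(sample, vertical_rules, horizontal_rules):
    """Binary decision by simple majority over all rule votes.
    This is a plain vote count, not the RS-based DC of the paper."""
    votes = Counter()
    for gene, mean_a, mean_b in vertical_rules:
        votes[vertical_vote(sample, gene, mean_a, mean_b)] += 1
    for gene_i, gene_j, cls in horizontal_rules:
        votes[horizontal_vote(sample, gene_i, gene_j, cls)] += 1
    return votes.most_common(1)[0][0], dict(votes)

# Toy expression values and rules, invented purely for illustration.
sample = {"g1": 5.0, "g2": 2.0, "g3": 7.5}
vertical_rules = [("g1", 5.5, 1.0)]            # (gene, mean in A, mean in B)
horizontal_rules = [("g1", "g2", "A"),         # g1 > g2 suggests class A
                    ("g3", "g1", "B")]         # g3 > g1 suggests class B

label, votes = classify(sample, vertical_rules, horizontal_rules)
print(label, votes)
```

With an odd number of rules, as here, the binary vote cannot tie. A multi-class problem could be handled by repeating such binary decisions over class pairs, though how the paper does this is not stated in the abstract.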
Integrating the vertical and horizontal comparison strategies highlights the diverse patterns of gene pairs more fully. The DC core classifier can effectively control over-fitting. The RS-based feature selection method combined with the DC classifier can lead to more robust selection of informative genes and higher classification accuracy.