Zhang Hongyan, Li Lanzhi, Luo Chao, Sun Congwei, Chen Yuan, Dai Zhijun, Yuan Zheming
Hunan Provincial Key Laboratory of Crop Germplasm Innovation and Utilization, Changsha 410128, China; College of Information Science and Technology, Hunan Agricultural University, Changsha 410128, China; Hunan Provincial Key Laboratory for Biology and Control of Plant Diseases and Insect Pests, Changsha 410128, China.
Hunan Provincial Key Laboratory of Crop Germplasm Innovation and Utilization, Changsha 410128, China; Hunan Provincial Key Laboratory for Biology and Control of Plant Diseases and Insect Pests, Changsha 410128, China.
Biomed Res Int. 2014;2014:589290. doi: 10.1155/2014/589290. Epub 2014 Jul 23.
To uncover disease mechanisms and improve the clinical diagnosis of tumors, it is useful to mine gene-expression profiles for informative genes with definite biological meaning and to build robust, high-precision classifiers. In this study, we developed a new method for tumor-gene selection, the chi-square test-based integrated rank gene and direct classifier (χ²-IRG-DC). First, we obtained a weighted integrated rank of gene importance from chi-square tests of single genes and pairwise gene interactions. Then, within the training set, we sequentially introduced the ranked genes and removed redundant ones, using leave-one-out cross-validation of the chi-square test-based direct classifier (χ²-DC) to obtain the informative genes. Finally, we evaluated accuracy on independent test data by applying χ²-DC with the selected informative genes. We also analyzed the robustness of χ²-IRG-DC by comparing the generalization performance of different models, the efficiency of different feature-selection methods, and the accuracy of different classifiers. Independent tests on ten multiclass tumor gene-expression datasets showed that χ²-IRG-DC efficiently controlled overfitting and had higher generalization performance. The informative genes selected by χ²-IRG-DC dramatically improved the independent-test precision of other classifiers; conversely, informative genes selected by other feature-selection methods also performed well in χ²-DC.
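The ranking step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how expression values are discretized or how the single-gene and pairwise statistics are weighted. The median split in `binarize`, the `pair_weight` blend, and all function and gene names below are therefore assumptions made purely for illustration.

```python
from collections import Counter
from itertools import combinations
from statistics import median


def chi_square(xs, ys):
    """Pearson chi-square statistic for two categorical sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    row = Counter(xs)
    col = Counter(ys)
    stat = 0.0
    for r, r_count in row.items():
        for c, c_count in col.items():
            expected = r_count * c_count / n
            observed = joint.get((r, c), 0)
            stat += (observed - expected) ** 2 / expected
    return stat


def binarize(values):
    """Assumed discretization: 1 if above the gene's median, else 0."""
    m = median(values)
    return [1 if v > m else 0 for v in values]


def rank_genes(expr, labels, pair_weight=0.5):
    """Rank genes by a hypothetical blend of single and pairwise scores.

    expr maps a gene name to its expression values, one per sample.
    pair_weight is an illustrative parameter, not taken from the paper.
    """
    states = {g: binarize(v) for g, v in expr.items()}
    single = {g: chi_square(s, labels) for g, s in states.items()}
    pair_sum = {g: 0.0 for g in expr}
    for a, b in combinations(expr, 2):
        # Treat the two binarized genes as one joint categorical variable.
        joint_state = list(zip(states[a], states[b]))
        score = chi_square(joint_state, labels)
        pair_sum[a] += score
        pair_sum[b] += score
    n_partners = max(len(expr) - 1, 1)
    combined = {
        g: (1 - pair_weight) * single[g] + pair_weight * pair_sum[g] / n_partners
        for g in expr
    }
    return sorted(combined, key=combined.get, reverse=True)


# Toy data (made up): six samples, two tumor classes, two genes.
labels = ["A", "A", "A", "B", "B", "B"]
expr = {
    "g_informative": [1.0, 1.2, 0.9, 5.0, 5.5, 4.8],
    "g_noise": [2.0, 3.0, 2.0, 3.0, 2.0, 3.0],
}
ranking = rank_genes(expr, labels)
print(ranking)  # ['g_informative', 'g_noise']
```

In a fuller pipeline, the ranked genes would then be introduced one at a time and kept only if they improve leave-one-out accuracy on the training set. That step depends on the details of χ²-DC, which the abstract does not give, so it is left out here.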