Intelligent Databases, Data mining and Bioinformatics Laboratory, Isfahan University of Technology, Isfahan, Iran.
BMC Med Genomics. 2011 Jan 26;4:12. doi: 10.1186/1755-8794-4-12.
Monitoring gene expression values across samples with microarray technology is among the most accurate methods for identifying disease-causing genes. A shortcoming of microarray data is that the number of samples is small relative to the number of genes. This imbalance lowers the accuracy of classifiers trained on the data, so gene selection is essential both to improve predictive accuracy and to identify potential marker genes for a disease. Among the many existing gene selection methods, support vector machine-based recursive feature elimination (SVMRFE) has become one of the leading approaches, but its performance can suffer from small sample sizes and noisy data, and it does not remove redundant genes.
We propose a novel framework for gene selection that builds on the strengths of conventional methods and addresses their weaknesses. We combine the Fisher method with SVMRFE to gain the advantages of both a filter method and an embedded method. We also add a redundancy reduction stage to address a weakness shared by the Fisher method and SVMRFE. In addition to gene expression values, the proposed method uses Gene Ontology, a reliable source of information on genes. Using Gene Ontology can compensate, in part, for limitations of microarrays such as small sample sizes and measurement errors.
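The filter and redundancy reduction stages described above can be illustrated with a minimal sketch. It is not the authors' implementation: the two-class Fisher criterion shown is one common form, Pearson correlation between expression profiles stands in for the redundancy measure (the abstract does not specify one), and the SVMRFE and Gene Ontology stages are left out. The names `fisher_score`, `pearson`, `select_genes`, `top_k`, `max_corr` and the toy data are all hypothetical.

```python
from math import sqrt
from statistics import mean, pvariance


def fisher_score(values, labels):
    """Two-class Fisher criterion for one gene: (m1 - m0)^2 / (v1 + v0).

    Assumption: population variances are used; other variants exist.
    """
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    diff_sq = (mean(pos) - mean(neg)) ** 2
    denom = pvariance(pos) + pvariance(neg)
    if denom == 0:
        # No within-class spread: perfectly separating or constant gene.
        return float("inf") if diff_sq > 0 else 0.0
    return diff_sq / denom


def pearson(a, b):
    """Pearson correlation of two equal-length expression profiles."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    norm = sqrt(sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b))
    return cov / norm if norm else 0.0


def select_genes(expr, labels, top_k, max_corr=0.9):
    """Keep the top_k genes by Fisher score, then drop redundant ones.

    expr maps gene name -> list of expression values, one per sample.
    A gene is treated as redundant (an assumption of this sketch) when its
    absolute correlation with an already kept gene reaches max_corr.
    """
    ranked = sorted(expr, key=lambda g: fisher_score(expr[g], labels), reverse=True)
    kept = []
    for gene in ranked[:top_k]:
        if all(abs(pearson(expr[gene], expr[k])) < max_corr for k in kept):
            kept.append(gene)
    return kept


# Hypothetical toy data: six samples, three per class.
toy_labels = [0, 0, 0, 1, 1, 1]
toy_expr = {
    "g1": [1.0, 1.2, 0.8, 3.0, 3.2, 2.8],  # informative
    "g2": [1.1, 1.2, 0.7, 3.0, 3.3, 2.7],  # near-copy of g1
    "g3": [1.0, 3.0, 2.0, 2.0, 1.0, 3.0],  # same mean in both classes
    "g4": [4.0, 6.0, 5.0, 1.0, 2.0, 0.0],  # informative, noisier
}
selected = select_genes(toy_expr, toy_labels, top_k=3)
print(selected)
```

In a full pipeline along the lines of the abstract, the genes surviving this step would then be passed to SVMRFE; the greedy top-down pass is chosen here only because it keeps the higher-scoring member of each redundant pair.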
The proposed method was applied to colon, diffuse large B-cell lymphoma (DLBCL) and prostate cancer datasets. The empirical results show that it improves classification performance in terms of accuracy, sensitivity and specificity. In addition, a study of the molecular functions of the selected genes supports the hypothesis that these genes are involved in cancer growth.
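For reference, the three reported measures follow the standard confusion-matrix definitions. The sketch below assumes binary labels with 1 for the positive (for example, tumour) class; the function name and the example labels are hypothetical and are not data from the study.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, sensitivity and specificity for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    nan = float("nan")
    return {
        "accuracy": (tp + tn) / len(pairs) if pairs else nan,
        # Sensitivity: fraction of actual positives that were detected.
        "sensitivity": tp / (tp + fn) if (tp + fn) else nan,
        # Specificity: fraction of actual negatives that were rejected.
        "specificity": tn / (tn + fp) if (tn + fp) else nan,
    }


# Made-up labels for illustration only.
example_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
example_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
metrics = classification_metrics(example_true, example_pred)
print(metrics)
```

Reporting sensitivity and specificity alongside accuracy matters on small, possibly imbalanced microarray datasets, where accuracy alone can hide poor performance on one class.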
The proposed method addresses the weaknesses of conventional methods by adding a redundancy reduction stage and using Gene Ontology information. It predicts marker genes for colon cancer, DLBCL and prostate cancer with high accuracy. The predictions made in this study can serve as a list of candidates for subsequent wet-lab verification and might help in the search for cures for these cancers.