Yousef Malik, Ülgen Ege, Uğur Sezerman Osman
Galilee Digital Health Research Center (GDH), Zefat Academic College, Zefat, Israel.
Department of Information Systems, Zefat Academic College, Zefat, Israel.
PeerJ Comput Sci. 2021 Feb 22;7:e336. doi: 10.7717/peerj-cs.336. eCollection 2021.
Most traditional gene selection approaches are borrowed from other fields, such as statistics and computer science. However, they do not prioritize biologically relevant genes, because their ultimate goal is to find features that optimize model performance metrics, not to build a biologically meaningful model. There is therefore a pressing need for new computational tools that integrate biological knowledge about the data into gene selection and machine learning. Integrative gene selection enables the incorporation of biological domain knowledge from external biological resources. In this study, we propose a new computational approach named CogNet, an integrative gene selection tool that exploits biological knowledge to group genes for the computational modeling tasks of ranking and classification. In CogNet, pathfindR serves as the biological grouping tool, allowing the main algorithm to rank active-subnetwork-oriented KEGG pathway enrichment analysis results and build a biologically relevant model. CogNet provides a list of significant KEGG pathways that can classify the data with very high accuracy. The list also gives the differentially expressed genes belonging to these pathways, which are used as features in the classification problem. The list facilitates in-depth analysis and better interpretability of the role of KEGG pathways in classifying the data, and thus better establishes the biological relevance of these differentially expressed genes. Although the main aim of our study is not to improve on the accuracy of any existing tool, CogNet outperforms a similar approach called maTE while performing comparably to other similar tools, including SVM-RCE. CogNet was tested on 13 gene expression datasets concerning a variety of diseases.
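The group-then-rank idea described above can be illustrated with a minimal sketch. This is not the CogNet implementation and does not call pathfindR: the gene-to-pathway groups are supplied by hand, the classifier is a plain nearest-centroid rule scored by leave-one-out accuracy (a stand-in chosen only to keep the example dependency-free), and all gene names, pathway names, and expression values are hypothetical toy data.

```python
from statistics import mean


def nearest_centroid_predict(train_rows, train_labels, test_row):
    """Return the label whose class centroid is closest to test_row."""
    centroids = {}
    for label in set(train_labels):
        members = [r for r, l in zip(train_rows, train_labels) if l == label]
        centroids[label] = [mean(col) for col in zip(*members)]

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # sorted() makes tie-breaking deterministic (alphabetical by label).
    return min(sorted(centroids), key=lambda lab: sq_dist(centroids[lab], test_row))


def loo_accuracy(rows, labels):
    """Leave-one-out accuracy of the nearest-centroid rule."""
    hits = 0
    for i in range(len(rows)):
        train_rows = rows[:i] + rows[i + 1:]
        train_labels = labels[:i] + labels[i + 1:]
        if nearest_centroid_predict(train_rows, train_labels, rows[i]) == labels[i]:
            hits += 1
    return hits / len(rows)


def rank_gene_groups(expression, labels, groups):
    """Score each gene group by how well its genes alone classify the samples.

    expression: dict mapping gene -> list of expression values, one per sample
    labels:     list of class labels, one per sample
    groups:     dict mapping group (e.g. pathway) name -> list of member genes
    Returns (group name, accuracy) pairs, best first.
    """
    scores = []
    for name, genes in groups.items():
        present = [g for g in genes if g in expression]
        if not present:
            continue  # skip groups with no measured genes
        rows = [[expression[g][i] for g in present] for i in range(len(labels))]
        scores.append((name, loo_accuracy(rows, labels)))
    return sorted(scores, key=lambda s: (-s[1], s[0]))


# Hypothetical toy data: six samples, four genes, two hand-made "pathways".
expression = {
    "g1": [1.0, 1.2, 0.9, 5.0, 5.1, 4.8],  # separates the two classes
    "g2": [2.0, 2.1, 1.9, 6.0, 6.2, 5.9],  # separates the two classes
    "g3": [3.0, 1.0, 2.0, 1.0, 3.0, 2.0],  # uninformative
    "g4": [0.5, 2.5, 1.5, 2.5, 0.5, 1.5],  # uninformative
}
labels = ["ctrl", "ctrl", "ctrl", "case", "case", "case"]
groups = {"pathway_A": ["g1", "g2"], "pathway_B": ["g3", "g4"]}

ranking = rank_gene_groups(expression, labels, groups)
for name, accuracy in ranking:
    print(f"{name}: {accuracy:.2f}")
```

In this toy setting the group whose genes differ between classes ranks first, and its member genes double as the feature list, which mirrors the interpretability argument made in the abstract. A real analysis would obtain the groups from an enrichment tool and use a stronger classifier with proper cross-validation.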