Laboratory of Systems Biology and Bioinformatics (LBB), Institute of Biochemistry and Biophysics, University of Tehran, Tehran, Iran.
School of Electrical and Computer Engineering, College of Engineering, University of Tehran, Tehran, Iran.
Genomics. 2019 Jul;111(4):612-618. doi: 10.1016/j.ygeno.2018.03.017. Epub 2018 Mar 28.
Gene prioritization ranks candidate genes from most to least promising before further experimental validation. Integrating the results of multiple data sources and methods tends to improve performance on this problem. Therefore, a wide range of datasets and algorithms was investigated; these included topological features of protein networks, physicochemical characteristics and BLAST similarity scores of protein sequences, gene ontology, biological pathways, and tissue-based data sources. The novelty of this study lies in how the best-performing methods and reliable multi-genomic data sources were combined in an efficient two-step approach. In the first step, various multi-genomic data sources and algorithms were evaluated, and the seven best-performing rankers were then applied to prioritize candidate genes in different ways. In the second step, a global prioritization was obtained by aggregating several scoring schemes. The results showed that protein networks, functional linkage networks, gene ontology, and biological pathway data sources have a significant impact on the quality of the gene prioritization approach. The findings also demonstrated a direct relationship between the degree of genes and the ranking quality of the evaluated tools. This approach outperformed previously published algorithms (e.g., DIR, GPEC, GeneDistiller, and Endeavour) in all evaluation metrics and led to the development of the GPS software. Its user-friendly interface and accuracy make GPS a powerful tool for the identification of human disease genes. GPS is available at http://gpsranker.com and http://LBB.ut.ac.ir.
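The second step, aggregating the outputs of several rankers into one global ranking, can be illustrated with a minimal sketch. The abstract does not state which aggregation scheme GPS uses, so the mean-normalized-rank rule below is an assumption chosen for simplicity, and the function name `aggregate_rankings` and the gene lists are hypothetical.

```python
from collections import defaultdict


def aggregate_rankings(rankings):
    """Merge several ranked gene lists into one global ranking.

    Assumption: each ranker returns genes ordered from most to least
    promising, and the global score is the mean normalized position
    (lower is better). This is an illustrative rule, not the GPS method.
    """
    # Collect every gene seen by at least one ranker.
    genes = set()
    for ranked in rankings:
        genes.update(ranked)

    totals = defaultdict(float)
    for ranked in rankings:
        n = len(ranked)
        position = {gene: i for i, gene in enumerate(ranked)}
        for gene in genes:
            # A gene missing from this ranker gets the worst position.
            idx = position.get(gene, n)
            totals[gene] += idx / max(n, 1)

    k = len(rankings)
    # Sort by mean normalized rank; break ties alphabetically.
    return sorted(genes, key=lambda g: (totals[g] / k, g))


# Hypothetical output of three rankers over the same candidates.
rankers = [
    ["BRCA1", "TP53", "EGFR"],
    ["TP53", "BRCA1", "EGFR"],
    ["TP53", "EGFR", "BRCA1"],
]
print(aggregate_rankings(rankers))
```

Normalizing each position by the list length keeps rankers with long candidate lists from dominating the combined score; other schemes, such as weighted or score-based aggregation, would fit the same interface.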