IEEE J Biomed Health Inform. 2019 Jul;23(4):1805-1815. doi: 10.1109/JBHI.2018.2870728.
The discovery of disease-causing genes is a critical step towards understanding the nature of a disease and determining a possible cure for it. In recent years, many computational methods to identify disease genes have been proposed. However, making full use of disease-related (e.g., symptoms) and gene-related (e.g., gene ontology and protein-protein interactions) information to improve the performance of disease gene prediction remains an open problem. Here, we develop HerGePred, a heterogeneous disease-gene-related network (HDGN) embedding representation framework for disease gene prediction. Based on this framework, a low-dimensional vector representation (LVR) of the nodes in the HDGN can be obtained. We then propose two specific algorithms to predict disease genes with high performance: LVR-based similarity prediction, and a random walk with restart on a reconstructed heterogeneous disease-gene network (RW-RDGN). First, to validate the rationality of the framework, we analyze the similarity-based overlap distribution of disease pairs and design an experiment for disease-gene association recovery; the results show that the LVR of nodes preserves the local and global network structure of the HDGN well. Then, we apply tenfold cross-validation and external validation to compare our methods with other well-known disease gene prediction algorithms. The experimental results show that RW-RDGN performs better than the state-of-the-art algorithm. The prediction results of disease candidate genes are essential for molecular mechanism investigation and experimental validation. The source code of HerGePred and the experimental data are available at https://github.com/yangkuoone/HerGePred.
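As a rough illustration of the two prediction strategies named in the abstract, the following minimal sketch shows (a) scoring genes by the similarity of their vectors to a disease vector and (b) a generic random walk with restart on an adjacency matrix. This is not the HerGePred implementation: the abstract does not state the similarity measure, the restart probability, or how the network is reconstructed, so the use of cosine similarity, the value `restart=0.7`, the function names, and the toy graph are all assumptions made for illustration.

```python
import numpy as np


def cosine_scores(disease_vec, gene_vecs):
    """Cosine similarity between one disease vector and each gene vector (one per row).

    Cosine similarity is an assumed choice; the abstract only says "LVR-based similarity".
    """
    d = disease_vec / np.linalg.norm(disease_vec)
    g = gene_vecs / np.linalg.norm(gene_vecs, axis=1, keepdims=True)
    return g @ d


def random_walk_with_restart(adj, seeds, restart=0.7, tol=1e-10, max_iter=1000):
    """Generic RWR: iterate p <- (1 - r) * W p + r * p0 until the L1 change is below tol.

    adj     : square, non-negative adjacency matrix (disease and gene nodes together)
    seeds   : indices of the nodes the walker restarts from
    restart : restart probability r (hypothetical default, not taken from the paper)
    """
    adj = np.asarray(adj, dtype=float)
    col_sums = adj.sum(axis=0)
    col_sums[col_sums == 0] = 1.0  # isolated nodes keep an all-zero column
    W = adj / col_sums             # column-normalized transition matrix
    p0 = np.zeros(adj.shape[0])
    p0[list(seeds)] = 1.0 / len(seeds)
    p = p0.copy()
    for _ in range(max_iter):
        p_next = (1 - restart) * (W @ p) + restart * p0
        if np.abs(p_next - p).sum() < tol:
            return p_next
        p = p_next
    return p


# Toy path graph: node 0 is a disease, nodes 1-3 are genes (0-1, 1-2, 2-3).
toy_adj = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
])
toy_scores = random_walk_with_restart(toy_adj, seeds=[0])
```

In the toy graph, genes closer to the seed disease receive higher steady-state scores, which is the ranking signal a random-walk predictor relies on. In the setting the abstract describes, the adjacency matrix would instead come from the reconstructed heterogeneous disease-gene network.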