A scalable random walk with restart on heterogeneous networks with Apache Spark for ranking disease-related genes through type-II fuzzy data fusion.

Author information

Department of Electrical and Computer Engineering, Isfahan University of Technology, Isfahan 84156-83111, Iran.

Publication information

J Biomed Inform. 2021 Mar;115:103688. doi: 10.1016/j.jbi.2021.103688. Epub 2021 Feb 2.

DOI:10.1016/j.jbi.2021.103688

Abstract

Finding disease-related genes is one of the central tasks of biology and medical science. Recent research uses gene/protein networks to find such genes, but false-positive interactions in these networks often make the results inaccurate and unreliable. Integrating multiple gene/protein networks can overcome this drawback by producing a network with fewer false-positive interactions, and the integration method plays a crucial role in the quality of the constructed network. In this paper, we integrate several sources to build a reliable heterogeneous network, i.e., a network that includes nodes of different types. Because the gene/protein sources differ, four gene-gene similarity networks are first constructed and then integrated by applying a type-II fuzzy voter scheme. The resulting gene-gene network is linked to a disease-disease similarity network (itself the outcome of integrating four sources) through a bipartite disease-gene network. We propose a novel algorithm, random walk with restart on the heterogeneous network with fuzzy fusion (RWRHN-FF), and run it over the heterogeneous network to rank disease-related genes. Experimental results using leave-one-out cross-validation indicate that RWRHN-FF outperforms existing methods. The proposed algorithm can be applied to find new genes for prostate, breast, gastric, and colon cancers. Since RWRHN-FF converges slowly on large heterogeneous networks, we also propose a parallel implementation on the Apache Spark platform for high-throughput and reliable network inference. Experiments on heterogeneous networks of different sizes indicate faster convergence than non-distributed implementations.
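To make the core idea concrete, the following is a minimal single-machine sketch of a random walk with restart over a small heterogeneous gene/disease network. It is not the authors' RWRHN-FF implementation: the fuzzy fusion step and the Spark parallelization are omitted, the toy matrices, node names, restart probability of 0.7, and all function names are assumptions made for illustration, and the transition matrix is obtained by simply column-normalizing the combined adjacency matrix rather than by any scheme stated in the abstract.

```python
from typing import List

Matrix = List[List[float]]


def build_heterogeneous_adjacency(gene_net: Matrix, disease_net: Matrix,
                                  gene_disease: Matrix) -> Matrix:
    """Stack gene-gene, disease-disease, and gene-disease edges into one matrix.

    Node order is all genes first, then all diseases (an illustrative choice).
    """
    n_g, n_d = len(gene_net), len(disease_net)
    size = n_g + n_d
    a = [[0.0] * size for _ in range(size)]
    for i in range(n_g):
        for j in range(n_g):
            a[i][j] = gene_net[i][j]
        for k in range(n_d):
            # Bipartite gene-disease edges are undirected, so mirror them.
            a[i][n_g + k] = gene_disease[i][k]
            a[n_g + k][i] = gene_disease[i][k]
    for k in range(n_d):
        for m in range(n_d):
            a[n_g + k][n_g + m] = disease_net[k][m]
    return a


def column_normalize(m: Matrix) -> Matrix:
    """Scale each column to sum to 1; all-zero columns are left as zeros."""
    n = len(m)
    out = [[0.0] * n for _ in range(n)]
    for j in range(n):
        col_sum = sum(m[i][j] for i in range(n))
        if col_sum > 0:
            for i in range(n):
                out[i][j] = m[i][j] / col_sum
    return out


def random_walk_with_restart(w: Matrix, p0: List[float], restart: float = 0.7,
                             tol: float = 1e-10, max_iter: int = 1000) -> List[float]:
    """Iterate p <- (1 - restart) * W p + restart * p0 until the L1 change < tol."""
    n = len(p0)
    p = list(p0)
    for _ in range(max_iter):
        nxt = [(1 - restart) * sum(w[i][j] * p[j] for j in range(n))
               + restart * p0[i] for i in range(n)]
        if sum(abs(x - y) for x, y in zip(nxt, p)) < tol:
            return nxt
        p = nxt
    return p


# Toy data (hypothetical): 3 genes, 2 diseases.
genes = ["g0", "g1", "g2"]
gene_net = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]   # g0-g1, g1-g2
disease_net = [[0, 1], [1, 0]]                 # d0-d1
gene_disease = [[1, 0], [0, 0], [0, 1]]        # g0-d0, g2-d1

w = column_normalize(
    build_heterogeneous_adjacency(gene_net, disease_net, gene_disease))

# Seed the walk at the query disease d0 and its known gene g0.
p0 = [0.5, 0.0, 0.0, 0.5, 0.0]
scores = random_walk_with_restart(w, p0)

# Rank genes by their steady-state visiting probability.
ranked = sorted(genes, key=lambda g: scores[genes.index(g)], reverse=True)
print(ranked)
```

In a leave-one-out evaluation like the one the abstract mentions, a known gene-disease link would be removed before building the network and the held-out gene's position in `ranked` would be recorded. The matrix-vector product inside the loop is the step that a distributed implementation would parallelize.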

Similar articles

Prioritization of potential candidate disease genes by topological similarity of protein-protein interaction network and phenotype data.

J Biomed Inform. 2015 Feb;53:229-36. doi: 10.1016/j.jbi.2014.11.004. Epub 2014 Nov 15.

Network-based ranking methods for prediction of novel disease associated microRNAs.

Comput Biol Chem. 2015 Oct;58:139-48. doi: 10.1016/j.compbiolchem.2015.07.003. Epub 2015 Jul 21.

Gene gravity-like algorithm for disease gene prediction based on phenotype-specific network.

BMC Syst Biol. 2017 Dec 6;11(1):121. doi: 10.1186/s12918-017-0519-9.

Constructing an integrated gene similarity network for the identification of disease genes.

J Biomed Semantics. 2017 Sep 20;8(Suppl 1):32. doi: 10.1186/s13326-017-0141-1.

Laplacian normalization and random walk on heterogeneous networks for disease-gene prioritization.

Comput Biol Chem. 2015 Aug;57:21-8. doi: 10.1016/j.compbiolchem.2015.02.008. Epub 2015 Feb 7.

An extensive analysis of disease-gene associations using network integration and fast kernel-based gene prioritization methods.

Artif Intell Med. 2014 Jun;61(2):63-78. doi: 10.1016/j.artmed.2014.03.003. Epub 2014 Mar 20.

Characterizing gene sets using discriminative random walks with restart on heterogeneous biological networks.

Bioinformatics. 2016 Jul 15;32(14):2167-75. doi: 10.1093/bioinformatics/btw151. Epub 2016 Mar 19.

A novel target convergence set based random walk with restart for prediction of potential LncRNA-disease associations.

BMC Bioinformatics. 2019 Dec 3;20(1):626. doi: 10.1186/s12859-019-3216-4.

Enhancing gene regulatory networks inference through hub-based data integration.

Comput Biol Chem. 2021 Dec;95:107589. doi: 10.1016/j.compbiolchem.2021.107589. Epub 2021 Oct 6.

Cited by

Influence of multi-species data on gene-disease associations in substance use disorder using random walk with restart models.

PLoS One. 2025 Jun 16;20(6):e0325201. doi: 10.1371/journal.pone.0325201. eCollection 2025.

Enhancing Molecular Network-Based Cancer Driver Gene Prediction Using Machine Learning Approaches: Current Challenges and Opportunities.

J Cell Mol Med. 2025 Jan;29(1):e70351. doi: 10.1111/jcmm.70351.

Disease gene prioritization with quantum walks.

Bioinformatics. 2024 Aug 2;40(8). doi: 10.1093/bioinformatics/btae513.

Speos: an ensemble graph representation learning framework to predict core gene candidates for complex diseases.

Nat Commun. 2023 Nov 8;14(1):7206. doi: 10.1038/s41467-023-42975-z.

Graph Representation Learning and Its Applications: A Survey.

Sensors (Basel). 2023 Apr 21;23(8):4168. doi: 10.3390/s23084168.

Framing Apache Spark in life sciences.

Heliyon. 2023 Feb 9;9(2):e13368. doi: 10.1016/j.heliyon.2023.e13368. eCollection 2023 Feb.

Predicting Parkinson disease related genes based on PyFeat and gradient boosted decision tree.

Sci Rep. 2022 Jun 15;12(1):10004. doi: 10.1038/s41598-022-14127-8.
