García-Criado Federico, Seoane Pedro, Rojano Elena, Ranea Juan A G, Perkins James R
Department of Molecular Biology and Biochemistry, University of Malaga, 29010 Malaga, Spain.
Center for Biomedical Network Research on Rare Diseases (CIBERER), Instituto de Salud Carlos III (ISCIII), 28029 Madrid, Spain.
Brief Bioinform. 2025 Jul 2;26(4). doi: 10.1093/bib/bbaf320.
Understanding and predicting biological processes from protein-protein interaction (PPI) networks requires accurate and efficient representations of their structure. However, many existing methods fail to capture the complex, overlapping modular structure of biological systems. To address this, we propose a network embedding strategy that improves both biological interpretability and predictive power. By transforming networks into a low-dimensional space while preserving key topological properties, embedding enables the discovery of novel functional relationships. Pre-clustering a network before embedding enhances representation quality, i.e. the ability to preserve meaningful structural and functional properties in the embedding space. However, traditional non-overlapping clustering methods can introduce bias by ignoring the overlapping nature of biological communities. We overcome this limitation by integrating the Hierarchical Link Clustering (HLC) algorithm into an embedding workflow tailored for large, weighted, undirected networks. First, we introduce two optimized HLC implementations for Python and R, both outperforming existing methods in clustering accuracy and scalability. Then, by restricting random walks to HLC-defined communities, we improve the representation of biological pathways, as shown using Reactome on the human PPI network. We also apply our full cluster embedding workflow to analyze RASopathies, a group of interrelated disorders with a diverse range of phenotypes, caused by mutations in genes from the RAS/MAPK pathway. This approach was used not only to represent known pathways, but also to identify potential novel gene candidates associated with RASopathies, including Noonan and Costello syndromes. HLC implementations are available for Python in the CDLIB library (https://github.com/GiulioRossetti/cdlib) and for R at https://github.com/jimrperkins/linkcomm.
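The idea of restricting random walks to overlapping communities can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy graph, the node labels, the two communities and the helper names (build_adjacency, restricted_walk, community_walks) are all hypothetical, and the communities are simply assumed to be the node sets induced by some link clustering. The sketch only shows that a weighted walk started inside a community never follows an edge that leaves it, while a node shared by two communities (here "C") contributes walks to both.

```python
import random
from collections import defaultdict


def build_adjacency(edges):
    """Build a symmetric adjacency map from (u, v, weight) triples."""
    adj = defaultdict(dict)
    for u, v, w in edges:
        adj[u][v] = w
        adj[v][u] = w
    return adj


def restricted_walk(adj, community, start, length, rng):
    """Weighted random walk that never leaves `community` (a set of nodes)."""
    if start not in community:
        raise ValueError("start node must belong to the community")
    walk = [start]
    current = start
    for _ in range(length - 1):
        # Keep only neighbours that belong to the same community.
        candidates = [(n, w) for n, w in adj[current].items() if n in community]
        if not candidates:
            break  # dead end inside the community
        nodes, weights = zip(*candidates)
        # Next step is drawn with probability proportional to edge weight.
        current = rng.choices(nodes, weights=weights, k=1)[0]
        walk.append(current)
    return walk


def community_walks(adj, communities, walks_per_node, length, seed=0):
    """Generate walks community by community, starting from every member."""
    rng = random.Random(seed)
    walks = []
    for community in communities:
        for node in sorted(community):
            for _ in range(walks_per_node):
                walks.append(restricted_walk(adj, community, node, length, rng))
    return walks


# Hypothetical toy network: two triangles sharing node "C",
# plus one cross-community edge (B-D) that restricted walks must ignore.
edges = [
    ("A", "B", 1.0), ("B", "C", 0.5), ("A", "C", 0.8),
    ("C", "D", 0.7), ("D", "E", 1.0), ("C", "E", 0.4),
    ("B", "D", 0.9),
]
adj = build_adjacency(edges)

# Assumed overlapping communities (node sets); "C" belongs to both.
communities = [{"A", "B", "C"}, {"C", "D", "E"}]

walks = community_walks(adj, communities, walks_per_node=2, length=5)
for walk in walks:
    print(" -> ".join(walk))
```

In an embedding workflow, walks of this kind would typically be passed to a sequence-based embedding model; that step is omitted here because it is not detailed in the abstract.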