Department of Computer Science and Engineering, University of South Florida, Tampa, FL 33620, USA and Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 77843, USA.
Bioinformatics. 2014 Jan 1;30(1):81-93. doi: 10.1093/bioinformatics/btt569. Epub 2013 Oct 1.
Identifying functional modules in protein-protein interaction (PPI) networks may shed light on cellular functional organization and thereafter underlying cellular mechanisms. Many existing module identification algorithms aim to detect densely connected groups of proteins as potential modules. However, based on this simple topological criterion of 'higher than expected connectivity', those algorithms may miss biologically meaningful modules of functional significance, in which proteins have similar interaction patterns to other proteins in networks but may not be densely connected to each other. A few blockmodel module identification algorithms have been proposed to address the problem but the lack of global optimum guarantee and the prohibitive computational complexity have been the bottleneck of their applications in real-world large-scale PPI networks.
In this article, we propose a novel optimization formulation LCP(2) (low two-hop conductance sets) using the concept of Markov random walk on graphs, which enables simultaneous identification of both dense and sparse modules based on protein interaction patterns in given networks through searching for LCP(2) by random walk. A spectral approximate algorithm SLCP(2) is derived to identify non-overlapping functional modules. Based on a bottom-up greedy strategy, we further extend LCP(2) to a new algorithm (greedy algorithm for LCP(2)) GLCP(2) to identify overlapping functional modules. We compare SLCP(2) and GLCP(2) with a range of state-of-the-art algorithms on synthetic networks and real-world PPI networks. The performance evaluation based on several criteria with respect to protein complex prediction, high level Gene Ontology term prediction and especially sparse module detection, has demonstrated that our algorithms based on searching for LCP(2) outperform all other compared algorithms.
All data and code are available at http://www.cse.usf.edu/~xqian/fmi/slcp2hop/.
在蛋白质-蛋白质相互作用 (PPI) 网络中识别功能模块可以揭示细胞的功能组织,进而揭示潜在的细胞机制。许多现有的模块识别算法旨在检测蛋白质的密集连接组作为潜在模块。然而,基于“高于预期的连接度”这一简单拓扑标准,这些算法可能会错过具有生物学意义的功能模块,其中蛋白质与网络中的其他蛋白质具有相似的相互作用模式,但彼此之间可能没有密集连接。已经提出了几种块模型模块识别算法来解决这个问题,但缺乏全局最优保证和计算复杂度高成为它们在真实大规模 PPI 网络中应用的瓶颈。
在本文中,我们提出了一种新的优化公式 LCP(2)(低两跳电导集),使用图上马尔可夫随机游走的概念,通过在给定网络中基于蛋白质相互作用模式搜索 LCP(2),可以同时识别密集和稀疏模块。导出了一个谱近似算法 SLCP(2)来识别非重叠的功能模块。基于自底向上的贪婪策略,我们将 LCP(2)进一步扩展到一种新的算法(LCP(2)的贪婪算法)GLCP(2),以识别重叠的功能模块。我们将 SLCP(2)和 GLCP(2)与一系列最先进的算法在合成网络和真实 PPI 网络上进行了比较。基于几个标准(关于蛋白质复合物预测、高级基因本体论术语预测,特别是稀疏模块检测)的性能评估表明,我们基于搜索 LCP(2)的算法优于所有其他比较算法。
所有数据和代码都可在 http://www.cse.usf.edu/~xqian/fmi/slcp2hop/ 获得。