Pellegrini Marco, Baglioni Miriam, Geraci Filippo
Laboratory for Integrative Systems Medicine - Istituto di Informatica e Telematica and Istituto di Fisiologia Clinica del CNR, via Moruzzi 1, Pisa, 56124, Italy.
BMC Bioinformatics. 2016 Nov 8;17(Suppl 12):372. doi: 10.1186/s12859-016-1191-6.
Biological networks play an increasingly important role in the exploration of functional modularity and cellular organization at a systemic level. Quite often the first tools used to analyze these networks are clustering algorithms. We concentrate here on the specific task of predicting protein complexes (PC) in large protein-protein interaction networks (PPIN). Currently, many state-of-the-art algorithms work well for networks of small or moderate size. However, their performance on much larger networks, which are becoming increasingly common in modern proteome-wise studies, needs to be re-assessed.
We present a new fast algorithm for clustering large sparse networks: Core&Peel, which runs essentially in time and storage O(a(G)m+n) for a network G of n nodes and m arcs, where a(G) is the arboricity of G (which is roughly proportional to the maximum average degree of any induced subgraph in G). We evaluated Core&Peel on five PPI networks of large size and one of medium size from both yeast and homo sapiens, comparing its performance against those of ten state-of-the-art methods. We demonstrate that Core&Peel consistently outperforms the ten competitors in its ability to identify known protein complexes and in the functional coherence of its predictions. Our method is remarkably robust, being quite insensible to the injection of random interactions. Core&Peel is also empirically efficient attaining the second best running time over large networks among the tested algorithms.
Our algorithm Core&Peel pushes forward the state-of the-art in PPIN clustering providing an algorithmic solution with polynomial running time that attains experimentally demonstrable good output quality and speed on challenging large real networks.
生物网络在系统层面探索功能模块性和细胞组织方面发挥着越来越重要的作用。聚类算法常常是用于分析这些网络的首批工具。我们在此专注于在大型蛋白质 - 蛋白质相互作用网络(PPIN)中预测蛋白质复合物(PC)这一特定任务。目前,许多先进算法在中小型网络中运行良好。然而,它们在大得多的网络上的性能需要重新评估,而这种大网络在现代全蛋白质组研究中越来越常见。
我们提出了一种用于聚类大型稀疏网络的新的快速算法:核心&剥离算法(Core&Peel),对于一个具有n个节点和m条弧的网络G,该算法的运行时间和存储空间本质上为O(a(G)m + n),其中a(G)是G的树密度(大致与G中任何诱导子图的最大平均度成正比)。我们在来自酵母和智人的五个大型PPI网络和一个中型PPI网络上评估了核心&剥离算法,将其性能与十种先进方法的性能进行了比较。我们证明,在识别已知蛋白质复合物的能力以及预测的功能连贯性方面,核心&剥离算法始终优于这十种竞争对手。我们的方法非常稳健,对随机相互作用的注入相当不敏感。在测试算法中,核心&剥离算法在大型网络上的运行时间也经验性地达到了第二快。
我们的算法核心&剥离算法推动了PPIN聚类技术的发展,提供了一种具有多项式运行时间的算法解决方案,在具有挑战性的大型真实网络上实现了实验上可证明的良好输出质量和速度。