Wong Daniel, Li Xiao-Li, Wu Min, Zheng Jie, Ng See-Kiong
BMC Genomics. 2013;14 Suppl 5(Suppl 5):S15. doi: 10.1186/1471-2164-14-S5-S15. Epub 2013 Oct 16.
Many biological processes are carried out by proteins interacting with each other in the form of protein complexes. However, large-scale detection of protein complexes has remained constrained by experimental limitations. Computational detection of protein complexes, by applying clustering algorithms to the abundantly available protein-protein interaction (PPI) networks, is therefore an important alternative. Many current algorithms, however, overlook the importance of seed selection: seeds should expand into clusters that neither exclude important proteins nor include many noisy ones, while preserving a high degree of functional homogeneity among the proteins assigned to each complex.
We designed a novel method called Probabilistic Local Walks (PLW), which clusters regions of a PPI network with high functional similarity to find protein complex cores with high precision, and which runs in O(|V| log |V| + |E|) time. We devised a seed selection strategy that prioritises seeds with dense neighbourhoods. We also defined a topological measure, called common neighbour similarity, which estimates the functional similarity of two proteins from the number of neighbours they share.
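To make the two ideas above concrete, the following is a minimal sketch rather than the PLW implementation itself. The abstract does not give the exact formulas, so both functions are assumptions: `common_neighbour_similarity` uses a Jaccard-style overlap of closed neighbourhoods, and `neighbourhood_density` scores a candidate seed by the fraction of possible edges present among the protein and its neighbours. The toy edge list and all function names are hypothetical.

```python
from itertools import combinations


def build_adjacency(edges):
    """Build an undirected adjacency map {protein: set(neighbours)}."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def common_neighbour_similarity(adj, u, v):
    """Assumed form: Jaccard overlap of the closed neighbourhoods of u and v."""
    nu = adj[u] | {u}
    nv = adj[v] | {v}
    return len(nu & nv) / len(nu | nv)


def neighbourhood_density(adj, v):
    """Assumed form: edges present among v and its neighbours / edges possible."""
    nodes = adj[v] | {v}
    if len(nodes) < 2:
        return 0.0
    present = sum(1 for a, b in combinations(nodes, 2) if b in adj[a])
    possible = len(nodes) * (len(nodes) - 1) / 2
    return present / possible


def rank_seeds(adj):
    """Order candidate seeds from densest to sparsest neighbourhood (ties by name)."""
    return sorted(adj, key=lambda v: (-neighbourhood_density(adj, v), v))


# Hypothetical toy network: a triangle A-B-C with a tail C-D-E.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "E")]
adj = build_adjacency(edges)

sim_ab = common_neighbour_similarity(adj, "A", "B")  # same closed neighbourhood
sim_ad = common_neighbour_similarity(adj, "A", "D")  # share only C
seeds = rank_seeds(adj)
```

In this toy network, A and B lie in the same triangle and score higher than A and D, which share only one neighbour. A pure density score also favours degree-one proteins such as E, so a practical seed ranking would probably need a minimum-degree filter or a weighting term, which this sketch omits.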
Our proposed PLW algorithm achieved the highest F-measure (which combines recall and precision) when compared to 11 state-of-the-art methods on yeast protein interaction data, with an improvement of 16.7% over the next highest score. Our experiments also demonstrated that our seed selection strategy increases precision when applied to three previous protein complex mining techniques.
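For reference, the F-measure in its standard balanced form is the harmonic mean of precision and recall. The sketch below shows only that formula; how predicted complexes are matched to reference complexes to obtain precision and recall is not stated in this abstract and is not modelled here. The input values are illustrative and are not results from the study.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative values only (not taken from the paper).
balanced = f_measure(0.5, 0.5)
skewed = f_measure(0.6, 0.3)
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a method cannot reach a high F-measure by maximising precision alone or recall alone.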
The software, datasets and predicted complexes are available at http://wonglkd.github.io/PLW.