Cai Bingjing, Wang Haiying, Zheng Huiru, Wang Hui
School of Computing and Mathematics, Computer Sciences Research Institute, University of Ulster, N. Ireland, BT37 0QB, UK.
BMC Syst Biol. 2012;6 Suppl 3(Suppl 3):S4. doi: 10.1186/1752-0509-6-S3-S4. Epub 2012 Dec 17.
Recent advances in molecular biology have led to the accumulation of large amounts of data on protein-protein interaction networks in different species. An important challenge in analysing these data is to extract functional modules, such as protein complexes and biological processes, from networks that are characterised by the presence of a significant number of false positives. Various computational techniques have been applied in recent years. However, most of them treat protein interactions as binary. Co-complex relations derived from affinity purification/mass spectrometry (AP-MS) experiments have been largely ignored.
This paper presents a new algorithm for detecting protein complexes from AP-MS data. The algorithm aims to detect groups of prey proteins that are significantly co-associated with the same set of bait proteins. We first represent the AP-MS data as a bipartite network, where one set of nodes consists of bait proteins and the other set is composed of prey proteins. We then calculate pair-wise similarities of bait proteins based on the number of neighbours they share. A hierarchical clustering algorithm is employed to cluster bait proteins based on these similarities, yielding a set of 'seed' clusters. Starting from these 'seed' clusters, an expansion process identifies prey proteins that are significantly associated with the same set of bait proteins, from which a set of complete protein complexes is derived. In application to two real AP-MS datasets, we validate the biological significance of the predicted protein complexes using curated protein complexes and well-characterized cellular component annotation from Gene Ontology (GO). Several statistical metrics have been applied for evaluation.
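The pipeline described above (bipartite network, shared-neighbour similarity, hierarchical clustering into seeds, expansion with prey proteins) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the similarity formula, the linkage rule, or the significance test, so the Jaccard index, average linkage, the `threshold` cut-off, and the `min_fraction` rule used here are all assumptions chosen for illustration, as are the function names and the toy data.

```python
from itertools import combinations


def jaccard(a, b):
    """Shared-neighbour similarity of two prey sets (Jaccard index; an assumed choice)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def seed_clusters(bait_to_preys, threshold=0.3):
    """Average-linkage agglomerative clustering of bait proteins.

    Merging stops once the best pairwise cluster similarity drops below
    `threshold` (a hypothetical cut-off, not taken from the paper).
    """
    clusters = [{bait} for bait in sorted(bait_to_preys)]

    def linkage(c1, c2):
        pairs = [(x, y) for x in c1 for y in c2]
        total = sum(jaccard(bait_to_preys[x], bait_to_preys[y]) for x, y in pairs)
        return total / len(pairs)

    while len(clusters) > 1:
        (i, j), best = max(
            (((i, j), linkage(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best < threshold:
            break
        clusters[i] |= clusters[j]  # merge the most similar pair
        del clusters[j]
    return clusters


def expand(seed, bait_to_preys, min_fraction=0.6):
    """Add preys pulled down by at least `min_fraction` of the baits in a seed.

    This fraction rule is a simple stand-in for the statistical
    significance test mentioned in the abstract.
    """
    counts = {}
    for bait in seed:
        for prey in bait_to_preys[bait]:
            counts[prey] = counts.get(prey, 0) + 1
    preys = {p for p, c in counts.items() if c / len(seed) >= min_fraction}
    return set(seed) | preys


# Toy bipartite network: bait -> set of preys it pulled down (invented data).
toy = {
    "B1": {"P1", "P2", "P3"},
    "B2": {"P1", "P2", "P4"},
    "B3": {"P7", "P8"},
    "B4": {"P7", "P8", "P9"},
}
seeds = seed_clusters(toy)
complexes = [expand(seed, toy) for seed in seeds]
```

On this toy input the baits group into two seeds, and each seed is expanded only with the preys shared by both of its baits. A real analysis would replace the fraction rule with an explicit significance test on the bait-prey associations.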
Experimental results show that the proposed algorithm achieves a significant improvement in detecting protein complexes from AP-MS data. In comparison to the well-known MCL algorithm, our algorithm improves the accuracy rate by about 20% in detecting protein complexes in both networks and increases the F-Measure value by about 50% in the Krogan_2006 network. Greater precision and better accuracy have been achieved, and the identified complexes are demonstrated to match well with existing curated protein complexes.
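For readers unfamiliar with the F-Measure, it is the harmonic mean of precision and recall. The sketch below shows one way such a score could be computed for predicted complexes against a curated reference set. The abstract does not state how predicted and reference complexes are matched, so the Jaccard-based `overlap_score`, the `min_overlap` threshold, and the example complexes are assumptions made for illustration only.

```python
def overlap_score(a, b):
    """Jaccard overlap between two complexes (an assumed matching criterion)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def evaluate(predicted, reference, min_overlap=0.5):
    """Return (precision, recall, F-measure) for predicted vs. reference complexes.

    A predicted complex counts as correct if it overlaps some reference
    complex by at least `min_overlap`; recall is defined symmetrically.
    """
    hit_pred = sum(
        any(overlap_score(p, r) >= min_overlap for r in reference) for p in predicted
    )
    hit_ref = sum(
        any(overlap_score(p, r) >= min_overlap for p in predicted) for r in reference
    )
    precision = hit_pred / len(predicted) if predicted else 0.0
    recall = hit_ref / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Invented example: three predictions, two curated complexes.
predicted = [{"A", "B", "C"}, {"D", "E"}, {"X", "Y"}]
reference = [{"A", "B", "C", "F"}, {"D", "E"}]
precision, recall, f_measure = evaluate(predicted, reference)
```

Here two of the three predictions match a curated complex and both curated complexes are recovered, giving a precision of 2/3, a recall of 1, and an F-Measure of 0.8.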
Our study highlights the significance of taking co-complex relations into account when extracting protein complexes from AP-MS data. The algorithm proposed in this paper can be easily extended to the analysis of other biological networks that can be conveniently represented as bipartite graphs, such as drug-target networks.