Maruyama Osamu, Kuwahara Yuki
Institute of Mathematics for Industry, Kyushu University, 744 Motooka, Nishi-ku, Fukuoka, 819-0395, Japan.
Graduate School of Mathematics, Kyushu University, 744 Motooka, Nishi-ku, Fukuoka, 819-0395, Japan.
BMC Bioinformatics. 2017 Dec 6;18(Suppl 15):491. doi: 10.1186/s12859-017-1920-5.
In recent years, protein-protein interaction (PPI) networks have become well recognized as important resources for elucidating various biological processes and cellular mechanisms. In this paper, we address the problem of predicting protein complexes from a PPI network. This problem presents two difficulties. The first concerns small complexes, which contain two or three components. They are relatively difficult to identify because of their simple internal structure, yet complexes of these sizes are dominant in major protein complex databases such as CYC2008. The second difficulty is how to model overlaps between predicted complexes, that is, how to evaluate different predicted complexes that share proteins; CYC2008 and other databases include such overlapping complexes. Modeling these overlaps is therefore critical to identifying overlapping complexes simultaneously.
In this paper, we propose a sampling-based protein complex prediction method, RocSampler (Regularizing Overlapping Complexes), whose scoring function includes two regularization terms: one for the overlaps between predicted complexes and one for the distribution of the sizes of predicted complexes. We have implemented RocSampler in MATLAB, and a Windows executable is available at http://imi.kyushu-u.ac.jp/~om/software/RocSampler/ .
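To make the idea of a regularized scoring function concrete, the following is a minimal Python sketch, not RocSampler's actual scoring function. Everything in it is an assumption made for illustration: the density-based fit term, the Jaccard-based overlap penalty, the log-likelihood size penalty, the function names, and the weights `lam_overlap` and `lam_size` do not come from the paper.

```python
from itertools import combinations
import math


def density(cluster, weights):
    """Mean edge weight over all protein pairs in a cluster (0 for a singleton)."""
    pairs = list(combinations(sorted(cluster), 2))
    if not pairs:
        return 0.0
    return sum(weights.get(p, 0.0) for p in pairs) / len(pairs)


def overlap_penalty(clusters):
    """Hypothetical overlap regularizer: sum of Jaccard similarities over cluster pairs."""
    return sum(len(a & b) / len(a | b) for a, b in combinations(clusters, 2))


def size_penalty(clusters, target):
    """Hypothetical size regularizer: negative log-likelihood of the cluster
    sizes under an assumed target size distribution."""
    floor = 1e-6  # probability assigned to sizes missing from the target
    return sum(-math.log(target.get(len(c), floor)) for c in clusters)


def score(clusters, weights, target, lam_overlap=1.0, lam_size=0.1):
    """Fit term minus the two regularization terms (higher is better)."""
    fit = sum(density(c, weights) for c in clusters)
    return (fit
            - lam_overlap * overlap_penalty(clusters)
            - lam_size * size_penalty(clusters, target))
```

Here `weights` maps an alphabetically sorted protein pair to an interaction weight, and `target` maps a complex size to its assumed relative frequency. A sampling-based method could propose changes to the current set of clusters and accept or reject them according to such a score; under this sketch, heavily overlapping clusters are accepted only when the gain in the fit term outweighs the overlap penalty.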
We have applied RocSampler to five yeast PPI networks and shown that it is superior to other existing methods. This suggests that designing scoring functions that include regularization terms is an effective approach to protein complex prediction.