Department of Biological Sciences, Purdue University, West Lafayette, IN, USA.
Department of Mathematics, Purdue University, West Lafayette, IN, USA.
Bioinformatics. 2017 Jun 15;33(12):1829-1836. doi: 10.1093/bioinformatics/btx029.
Diffusion-based network models are widely used to predict protein function from protein network data and have been shown to outperform neighborhood-based and module-based methods. Recent studies have shown that integrating the hierarchical structure of the Gene Ontology (GO) data dramatically improves prediction accuracy. However, previous methods usually either used the GO hierarchy to refine the prediction results of multiple classifiers, or flattened the hierarchy into a function-function similarity kernel. No study has modeled the GO hierarchy together with the protein network as a single two-layer network.
We first construct a bi-relational graph (Birg) model comprising both a protein-protein association network and a function-function hierarchical network. We then propose two diffusion-based methods, BirgRank and AptRank, both of which use PageRank to diffuse information on this two-layer graph model. BirgRank is a direct application of traditional PageRank with fixed decay parameters. In contrast, AptRank uses an adaptive diffusion mechanism to improve on the performance of BirgRank. We evaluate the ability of both methods to predict protein function on yeast, fly and human protein datasets, and compare them with four previous methods: GeneMANIA, TMC, ProteinRank and clusDCA. To evaluate the predictive performance of all six methods comprehensively, we design four validation strategies: missing function prediction, de novo function prediction, guided function prediction and newly discovered function prediction. We find that both BirgRank and AptRank outperform the previous methods, especially in missing function prediction when using only 10% of the data for training.
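To make the idea of diffusing information over a two-layer graph concrete, the following is a minimal sketch of personalized PageRank with a fixed decay parameter on a toy graph. It is not the authors' MATLAB implementation: the five-node graph, the edge directions (protein to protein, protein to annotated function, child function to parent function), the function name `personalized_pagerank`, and the value `alpha=0.85` are all assumptions chosen for illustration, and the adaptive mechanism of AptRank is not shown.

```python
def personalized_pagerank(adj, seeds, alpha=0.85, tol=1e-10, max_iter=1000):
    """Power iteration for x = alpha * P x + (1 - alpha) * s.

    adj[i][j] is the weight of the edge j -> i; P is adj with each column
    normalised to sum to 1. Nodes with no outgoing edges return their
    mass to the seed distribution s.
    """
    n = len(adj)
    col_sums = [sum(adj[i][j] for i in range(n)) for j in range(n)]
    total = float(sum(seeds))
    s = [v / total for v in seeds]
    x = s[:]
    for _ in range(max_iter):
        new = [(1.0 - alpha) * s[i] for i in range(n)]
        for j in range(n):
            if col_sums[j] == 0:
                # Dangling node: redistribute its mass over the seeds.
                for i in range(n):
                    new[i] += alpha * x[j] * s[i]
            else:
                for i in range(n):
                    if adj[i][j]:
                        new[i] += alpha * x[j] * adj[i][j] / col_sums[j]
        diff = sum(abs(a - b) for a, b in zip(new, x))
        x = new
        if diff < tol:
            break
    return x


# Toy two-layer graph (hypothetical): nodes 0-2 are proteins p0, p1, p2;
# node 3 is a parent function f0 and node 4 is its child function f1.
n = 5
adj = [[0.0] * n for _ in range(n)]

def add_edge(src, dst, w=1.0):
    adj[dst][src] = w

# Protein layer: undirected associations p0-p1 and p1-p2.
for a, b in [(0, 1), (1, 2)]:
    add_edge(a, b)
    add_edge(b, a)
# Cross-layer edges: known annotations p1 -> f1 and p2 -> f0.
add_edge(1, 4)
add_edge(2, 3)
# Function layer: child f1 points to its parent f0.
add_edge(4, 3)

# Start the diffusion at the unannotated query protein p0.
scores = personalized_pagerank(adj, seeds=[1, 0, 0, 0, 0])
function_scores = {"f0": scores[3], "f1": scores[4]}
print(function_scores)
```

In this sketch the scores that accumulate on the function nodes serve as a ranking of candidate functions for the seed protein. Replacing the fixed `alpha` with a data-driven weighting of diffusion steps would be a step toward an adaptive scheme, but the details of how AptRank does this are not given in the abstract.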
The MATLAB code is available at https://github.rcac.purdue.edu/mgribsko/aptrank .
Supplementary data are available at Bioinformatics online.