Mathematical and Computer Sciences and Engineering Division, King Abdullah University of Science and Technology, Thuwal, 23955-6900, Saudi Arabia.
Bioinformatics. 2012 Sep 15;28(18):i444-i450. doi: 10.1093/bioinformatics/bts398.
Burgeoning sequencing technologies have generated massive amounts of genomic and proteomic data. Annotating the functions of the proteins identified in these data has become a large and crucial problem. Various computational methods have been developed to infer protein functions from either the sequences or the domains of proteins. The existing methods, however, ignore the recurrence and the order of protein domains when inferring function.
We developed two new methods to infer protein functions based on protein domain recurrence and domain order. Our first method, DRDO, calculates the posterior probability of Gene Ontology terms from domain recurrence and domain order information, whereas our second method, DRDO-NB, applies the naïve Bayes methodology to the same domain architecture information. Our large-scale benchmark comparisons show strong improvements in the accuracy of protein function inference with the new methods, demonstrating that domain recurrence and order provide important information for inferring protein functions.
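To make the idea concrete, the following is a minimal sketch of a naïve Bayes classifier over domain architecture features. It is not the published DRDO or DRDO-NB implementation. The feature encoding (one feature per domain with its copy number for recurrence, one feature per adjacent domain pair for order), the Laplace smoothing parameter `alpha`, the single-label training format, and all names such as `architecture_features` and `ArchitectureNB` are assumptions made for illustration. The domain and term labels are toy placeholders.

```python
from collections import Counter, defaultdict
import math


def architecture_features(domains):
    """Encode an ordered domain list (N- to C-terminus) as features.

    Assumed encoding, for illustration only:
      ("count", d, n): domain d occurs n times (recurrence)
      ("pair", a, b):  domain a is immediately followed by b (order)
    """
    feats = [("count", d, n) for d, n in Counter(domains).items()]
    feats += [("pair", a, b) for a, b in zip(domains, domains[1:])]
    return feats


class ArchitectureNB:
    """Multinomial naive Bayes with Laplace smoothing (illustrative sketch)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                       # smoothing strength (assumed)
        self.term_counts = Counter()             # training proteins per term
        self.feat_counts = defaultdict(Counter)  # feature counts per term
        self.vocab = set()                       # all features seen in training

    def fit(self, proteins):
        # proteins: iterable of (ordered_domain_list, term) pairs;
        # one term per protein is a simplification made in this sketch.
        for domains, term in proteins:
            self.term_counts[term] += 1
            for f in architecture_features(domains):
                self.feat_counts[term][f] += 1
                self.vocab.add(f)
        return self

    def posterior(self, domains):
        """Return P(term | architecture), normalized over the known terms."""
        feats = architecture_features(domains)
        total = sum(self.term_counts.values())
        log_scores = {}
        for term, n in self.term_counts.items():
            counts = self.feat_counts[term]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            lp = math.log(n / total)             # log prior of the term
            for f in feats:
                lp += math.log((counts[f] + self.alpha) / denom)
            log_scores[term] = lp
        # Normalize in log space for numerical stability.
        m = max(log_scores.values())
        exp = {t: math.exp(s - m) for t, s in log_scores.items()}
        z = sum(exp.values())
        return {t: v / z for t, v in exp.items()}


# Toy data: the same two domains in opposite order map to different terms.
model = ArchitectureNB().fit([
    (["A", "B"], "term_x"),
    (["B", "A"], "term_y"),
])
post = model.posterior(["A", "B"])
```

In this toy setting the two training proteins have identical domain composition, so a composition-only model could not separate them; the adjacent-pair feature is what shifts the posterior for `["A", "B"]` toward `term_x`.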
The new models are provided as open source programs at http://sfb.kaust.edu.sa/Pages/Software.aspx.
dkihara@cs.purdue.edu, xin.gao@kaust.edu.sa
Supplementary data are available at Bioinformatics Online.