Department of Medical and Molecular Genetics, Indiana University School of Medicine, Indianapolis, IN 46202, USA.
BMC Bioinformatics. 2010 May 19;11:265. doi: 10.1186/1471-2105-11-265.
A new paradigm of biological investigation takes advantage of technologies that produce large high throughput datasets, including genome sequences, interactions of proteins, and gene expression. The ability of biologists to analyze and interpret such data relies on functional annotation of the included proteins, but even in highly characterized organisms many proteins can lack the functional evidence necessary to infer their biological relevance.
Here we have applied high confidence function predictions from our automated prediction system, PFP, to three genome sequences: Escherichia coli, Saccharomyces cerevisiae, and Plasmodium falciparum (the malaria parasite). PFP increases the fraction of annotated genes to over 90% in all three genomes. Using this large coverage of function annotation, we introduce functional similarity networks, which represent the functional space of the proteomes. Four functional similarity networks are constructed for each proteome: three by considering similarity within a single Gene Ontology (GO) category, i.e. Biological Process, Cellular Component, or Molecular Function, and a fourth by considering overall similarity with the funSim score. The functional similarity networks are shown to have higher modularity than the protein-protein interaction network. Moreover, the funSim score network is distinct from the single GO-score networks in showing a higher clustering degree exponent value, and thus has a higher tendency to be hierarchical. In addition, examining function assignments in the protein-protein interaction network and in local regions of genomes has identified numerous cases where subnetworks or local regions contain functionally coherent proteins. These results will help in interpreting protein interactions and gene order in a genome. Several examples of both analyses are highlighted.
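As an illustration of how a functional similarity network of this kind might be assembled, the sketch below combines per-category GO similarity scores into one overall score and keeps protein pairs above a cutoff as edges. It is a minimal sketch under stated assumptions, not the procedure used in the study: the combining formula (mean of squared category scores, each already scaled to [0, 1]) is an assumed funSim-style form, and the protein names, scores, and threshold are hypothetical.

```python
def funsim(bp, cc, mf):
    """Overall similarity from three GO category scores in [0, 1].

    Assumed form: the mean of the squared category scores. The exact
    definition used in the study is not given in this abstract.
    """
    return (bp ** 2 + cc ** 2 + mf ** 2) / 3.0


def build_network(pair_scores, threshold):
    """Keep protein pairs whose overall score reaches the threshold.

    pair_scores maps (protein_a, protein_b) -> (bp, cc, mf).
    Returns a dict mapping an unordered pair (frozenset) to its score.
    """
    edges = {}
    for (a, b), (bp, cc, mf) in pair_scores.items():
        score = funsim(bp, cc, mf)
        if score >= threshold:
            edges[frozenset((a, b))] = score
    return edges


def degrees(edges):
    """Count how many edges touch each protein in the network."""
    counts = {}
    for pair in edges:
        for protein in pair:
            counts[protein] = counts.get(protein, 0) + 1
    return counts


# Hypothetical toy data: three proteins, three pairwise comparisons.
toy_scores = {
    ("protA", "protB"): (1.0, 1.0, 1.0),
    ("protA", "protC"): (0.9, 0.8, 0.7),
    ("protB", "protC"): (0.2, 0.1, 0.3),
}
network = build_network(toy_scores, threshold=0.5)  # hypothetical cutoff
node_degrees = degrees(network)
```

A single-category network would follow the same pattern, using only one of the three scores in place of the combined value. Properties such as modularity or the clustering degree exponent would then be computed on the resulting edge set.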
The analyses demonstrate that applying high confidence predictions from PFP can have a significant impact on researchers' ability to interpret the immense volume of biological data being generated today. The newly introduced functional similarity networks of the three organisms show different network properties from those of the protein-protein interaction networks.