Levy Emmanuel D, Ouzounis Christos A, Gilks Walter R, Audit Benjamin
Computational Genomics Group, The European Bioinformatics Institute, EMBL Cambridge Outstation, Cambridge CB10 1SD, UK.
BMC Bioinformatics. 2005 Dec 14;6:302. doi: 10.1186/1471-2105-6-302.
One of the most evident achievements of bioinformatics is the development of methods that transfer biological knowledge from characterised proteins to uncharacterised sequences. This mode of protein function assignment is mostly based on the detection of sequence similarity and on the premise that functional properties are conserved during evolution. Most automatic approaches developed to date rely on identifying clusters of homologous proteins, whose members are expected to share functional characteristics, and on mapping new proteins onto these clusters.
Here, we invert the logic of this process by mapping sequences directly to a functional classification instead of mapping functions to a sequence clustering. In this mode, the starting point is a database of proteins labelled according to a functional classification scheme, and sequence similarity is then used to define the membership of new proteins in these functional classes. In this framework, we define the Correspondence Indicators as measures of the relationship between sequence and function, and we formulate two Bayesian approaches to estimate the probability that a sequence of unknown function belongs to a functional class. This approach allows different sequence search strategies to be parametrised and provides a direct measure of annotation error rates. We validate this approach with a database of enzymes labelled by their corresponding four-digit EC numbers and analyse specific cases.
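As a rough illustration of the idea of estimating class membership probabilities from a labelled database, the sketch below applies Bayes' rule to binned similarity scores. It is not the paper's formulation: the abstract does not define the Correspondence Indicators or the two Bayesian approaches, so the function name `fit_same_class_posterior`, the score bins, and the training pairs are all hypothetical. Each training pair is assumed to record the score bin of a database hit and whether the hit shared the query's functional class.

```python
from collections import Counter


def fit_same_class_posterior(training_pairs):
    """Return a function mapping score_bin -> P(same class | score_bin).

    training_pairs is a list of (score_bin, is_same_class) tuples derived
    from proteins whose class is already known (an assumed input format).
    """
    same = Counter(b for b, is_same in training_pairs if is_same)
    diff = Counter(b for b, is_same in training_pairs if not is_same)
    n_same, n_diff = sum(same.values()), sum(diff.values())
    total = n_same + n_diff
    if total == 0:
        raise ValueError("training_pairs is empty")

    # Priors: how often a hit shares the query's class, regardless of score.
    prior_same = n_same / total
    prior_diff = n_diff / total

    def posterior(score_bin):
        # Likelihoods: how often this score bin is seen for each outcome.
        lik_same = same[score_bin] / n_same if n_same else 0.0
        lik_diff = diff[score_bin] / n_diff if n_diff else 0.0
        evidence = lik_same * prior_same + lik_diff * prior_diff
        if evidence == 0.0:
            # Score bin never observed in training: fall back to the prior.
            return prior_same
        return lik_same * prior_same / evidence

    return posterior


# Hypothetical training data: two coarse score bins.
pairs = (
    [("high", True)] * 9 + [("high", False)] * 1
    + [("low", True)] * 2 + [("low", False)] * 8
)
p_same = fit_same_class_posterior(pairs)
error_rate_high = 1.0 - p_same("high")  # estimated annotation error rate
```

Under these assumptions, one minus the posterior gives an estimate of the annotation error rate for a given score bin, which is the kind of direct error measure the text refers to.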
This method performs significantly better than the simple strategy of transferring the annotation from the highest-scoring BLAST match, and it is expected to find applications in automated functional annotation pipelines.
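For reference, the baseline strategy mentioned above can be sketched in a few lines. This is a minimal illustration, assuming the BLAST results have already been parsed into `(subject_id, bit_score)` tuples and that a dictionary maps labelled subject identifiers to EC numbers. The function name `transfer_top_hit` and the identifiers in the example are hypothetical.

```python
def transfer_top_hit(hits, labels):
    """Return the label of the highest-scoring labelled hit, or None.

    hits:   list of (subject_id, bit_score) tuples (assumed pre-parsed).
    labels: dict mapping subject_id -> EC number for labelled proteins.
    """
    labelled = [(sid, score) for sid, score in hits if sid in labels]
    if not labelled:
        return None  # no labelled match, so no annotation is transferred
    best_id, _ = max(labelled, key=lambda hit: hit[1])
    return labels[best_id]


# Illustrative identifiers and scores only.
example_hits = [("P1", 50.0), ("P2", 210.5), ("P3", 99.0)]
example_labels = {"P1": "1.1.1.1", "P2": "2.7.1.1"}
predicted = transfer_top_hit(example_hits, example_labels)
```

This baseline ignores how reliable a given score is, which is the gap a probabilistic estimate of class membership is meant to address.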