Kaplan Noam, Friedlich Moriah, Fromer Menachem, Linial Michal
Department of Biological Chemistry, Institute of Life Sciences, The Hebrew University of Jerusalem, Israel.
BMC Bioinformatics. 2004 Dec 14;5:196. doi: 10.1186/1471-2105-5-196.
A major challenge of computational biology is to provide a comprehensive functional classification of all known proteins. Most existing methods seek recurrent patterns in known proteins based on manually validated alignments of known protein families. Such methods can achieve high sensitivity, but are limited by the manual labor they require. As a result, our current view of the protein world is incomplete and biased. This paper concerns ProtoNet, an automatic, unsupervised global clustering system that generates a hierarchical tree of over 1,000,000 proteins based solely on sequence similarity.
In this paper we show that ProtoNet correctly captures functional and structural aspects of the protein world. A novel feature is an automatic procedure that reduces the tree to 12% of its original size using only parameters intrinsic to the clustering process. Despite this substantial reduction in size, the system's predictive power concerning biological functions is hardly affected. We then carry out an automatic comparison with existing functional protein annotations. As a result, 78% of the clusters in the compressed tree (5,300 clusters) are assigned a biological function with high confidence. The clustering and compression processes are unsupervised and robust.
We present an automatic, unbiased method that provides a hierarchical classification of all currently known proteins.
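To make the idea of building a hierarchical tree from pairwise sequence similarity concrete, the following is a minimal sketch of bottom-up (agglomerative) clustering. It is an illustration under stated assumptions, not ProtoNet's implementation: the average-linkage merging rule, the function name `average_linkage_tree`, the protein labels, and the toy dissimilarity values are all hypothetical, and the abstract does not specify which merging rule or similarity score the system uses.

```python
from itertools import combinations


def average_linkage_tree(names, dist):
    """Agglomerative clustering on a dissimilarity table.

    names: list of item labels.
    dist:  dict mapping frozenset({a, b}) -> dissimilarity (lower = more similar).
    Returns the list of merges as (left_cluster, right_cluster, height).
    Average linkage is an assumption made for this sketch.
    """
    # Start with every item in its own singleton cluster.
    clusters = [(n,) for n in names]
    merges = []

    def d(a, b):
        # Mean of all pairwise dissimilarities between the two clusters.
        total = sum(dist[frozenset((x, y))] for x in a for y in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        # Merge the closest pair of clusters and record the merge height.
        a, b = min(combinations(clusters, 2), key=lambda pair: d(*pair))
        merges.append((a, b, d(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


# Toy input: four hypothetical proteins with made-up dissimilarities.
names = ["P1", "P2", "P3", "P4"]
dist = {
    frozenset(("P1", "P2")): 0.10,
    frozenset(("P3", "P4")): 0.20,
    frozenset(("P1", "P3")): 0.80,
    frozenset(("P1", "P4")): 0.90,
    frozenset(("P2", "P3")): 0.85,
    frozenset(("P2", "P4")): 0.95,
}

merges = average_linkage_tree(names, dist)
for left, right, height in merges:
    print(left, "+", right, "at height", round(height, 3))
```

Each recorded merge corresponds to an internal node of the tree, so n leaves give n - 1 internal clusters. This quadratic-time sketch is only suitable for toy inputs; a tree over a million proteins would need a far more scalable procedure.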