A functional hierarchical organization of the protein sequence space.

Authors

Kaplan Noam, Friedlich Moriah, Fromer Menachem, Linial Michal

Affiliation

Department of Biological Chemistry, Institute of Life Sciences, The Hebrew University of Jerusalem, Israel.

Publication

BMC Bioinformatics. 2004 Dec 14;5:196. doi: 10.1186/1471-2105-5-196.

PMID:15596019

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC544566/

Abstract

BACKGROUND

Providing a comprehensive functional classification of all known proteins is a major challenge of computational biology. Most existing methods seek recurrent patterns in known proteins based on manually validated alignments of known protein families. Such methods can achieve high sensitivity, but they are limited by the manual labor they require. This makes our current view of the protein world incomplete and biased. This paper concerns ProtoNet, an automatic unsupervised global clustering system that generates a hierarchical tree of over 1,000,000 proteins, based solely on sequence similarity.
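To make the idea of building a hierarchical tree from pairwise similarity concrete, the following is a minimal sketch of bottom-up (agglomerative) clustering. It assumes average linkage with an arithmetic mean and a toy table of pairwise distances; the abstract does not state which merging rule or similarity measure ProtoNet uses, so the function `average_linkage`, the protein names, and the distance values are all hypothetical.

```python
from itertools import combinations


def average_linkage(names, dist):
    """Bottom-up clustering; returns the merges as (left, right, distance).

    names: list of leaf labels (hypothetical protein identifiers)
    dist:  dict mapping frozenset({a, b}) to the distance between leaves a and b
    Assumption: clusters are merged by arithmetic-mean (average) linkage.
    """
    clusters = [(name,) for name in names]  # every protein starts as its own cluster
    merges = []

    def cluster_distance(c1, c2):
        # Mean of all leaf-to-leaf distances between the two clusters.
        total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Pick the closest pair of clusters and merge it into one tree node.
        c1, c2 = min(combinations(clusters, 2), key=lambda p: cluster_distance(*p))
        merges.append((c1, c2, cluster_distance(c1, c2)))
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(tuple(sorted(c1 + c2)))
    return merges


# Toy distances (made up for illustration): A and B are close, as are C and D.
toy_dist = {
    frozenset(("A", "B")): 1, frozenset(("C", "D")): 2,
    frozenset(("A", "C")): 6, frozenset(("A", "D")): 8,
    frozenset(("B", "C")): 7, frozenset(("B", "D")): 9,
}
toy_merges = average_linkage(["A", "B", "C", "D"], toy_dist)
for left, right, d in toy_merges:
    print(left, right, d)
```

Each entry of `toy_merges` corresponds to one internal node of the tree, so a tree over n proteins has n - 1 such nodes. This naive loop recomputes every cluster distance at each step and would not scale to a million sequences; it only illustrates the structure of the output.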

RESULTS

In this paper we show that ProtoNet correctly captures functional and structural aspects of the protein world. Furthermore, a novel feature is an automatic procedure that reduces the tree to 12% of its original size. This procedure utilizes only parameters intrinsic to the clustering process. Despite the substantial reduction in size, the system's predictive power concerning biological functions is hardly affected. We then carry out an automatic comparison with existing functional protein annotations. Consequently, 78% of the clusters in the compressed tree (5,300 clusters) are assigned a biological function with high confidence. The clustering and compression processes are unsupervised and robust.
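The automatic comparison between clusters and existing annotations can be pictured with a small sketch. It assumes that a cluster receives the annotation whose set of annotated proteins overlaps it best, scored by the Jaccard index, and that a score cutoff stands in for "high confidence". The abstract does not specify the scoring scheme, so `jaccard`, `best_annotation`, the 0.5 threshold, and the protein and keyword names are illustrative assumptions only.

```python
def jaccard(cluster, annotated):
    """Intersection size over union size for two sets of protein identifiers."""
    union = cluster | annotated
    return len(cluster & annotated) / len(union) if union else 0.0


def best_annotation(cluster, annotations, threshold=0.5):
    """Return (label, score) for the best-matching annotation, or None.

    cluster:     set of protein identifiers in one tree node
    annotations: dict mapping an annotation label to the set of proteins carrying it
    threshold:   hypothetical cutoff standing in for "high confidence"
    """
    label, score = max(
        ((name, jaccard(cluster, members)) for name, members in annotations.items()),
        key=lambda item: item[1],
        default=(None, 0.0),
    )
    return (label, score) if score >= threshold else None


# Toy data (made up): three of the four cluster members are annotated as kinases.
toy_cluster = {"P1", "P2", "P3", "P4"}
toy_annotations = {
    "kinase": {"P1", "P2", "P3"},
    "transporter": {"P4", "P5", "P6"},
}
toy_result = best_annotation(toy_cluster, toy_annotations)
print(toy_result)
```

A score of this kind penalizes both cluster members that lack the annotation and annotated proteins that fall outside the cluster, which is why a single number can serve as a confidence measure in this sketch.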

CONCLUSIONS

We present an automatically generated unbiased method that provides a hierarchical classification of all currently known proteins.
