The challenge of increasing Pfam coverage of the human proteome.

Affiliation

EMBL-European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK.

Publication information

Database (Oxford). 2013 Apr 19;2013:bat023. doi: 10.1093/database/bat023. Print 2013.

DOI:10.1093/database/bat023

PMID:23603847

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC3630804/

Abstract

It is a worthy goal to completely characterize all human proteins in terms of their domains. Here, using the Pfam database, we asked how far we have progressed in this endeavour. Ninety per cent of proteins in the human proteome matched at least one of 5494 manually curated Pfam-A families. In contrast, human residue coverage by Pfam-A families was <45%, with 9418 automatically generated Pfam-B families adding a further 10%. Even after excluding predicted signal peptide regions and short regions (<50 consecutive residues) unlikely to harbour new families, for ∼38% of the human protein residues, there was no information in Pfam about conservation and evolutionary relationship with other protein regions. This uncovered portion of the human proteome was found to be distributed over almost 25 000 distinct protein regions. Comparison with proteins in the UniProtKB database suggested that the human regions that exhibited similarity to thousands of other sequences were often either divergent elements or N- or C-terminal extensions of existing families. Thirty-four per cent of regions, on the other hand, matched fewer than 100 sequences in UniProtKB. Most of these did not appear to share any relationship with existing Pfam-A families, suggesting that thousands of new families would need to be generated to cover them. Also, these latter regions were particularly rich in amino acid compositional bias such as the one associated with intrinsic disorder. This could represent a significant obstacle toward their inclusion into new Pfam families. Based on these observations, a major focus for increasing Pfam coverage of the human proteome will be to improve the definition of existing families. New families will also be built, prioritizing those that have been experimentally functionally characterized. Database URL: http://pfam.sanger.ac.uk/
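The distinction the abstract draws between sequence coverage (a protein has at least one Pfam-A match) and residue coverage (the fraction of residues inside a match), together with the 50-residue cutoff for uncovered regions, can be illustrated with a minimal Python sketch. The function names, the 1-based inclusive coordinate convention, and the example coordinates are assumptions made for illustration; this is not the authors' pipeline, and it does not model the exclusion of predicted signal peptides.

```python
from typing import List, Tuple

# Hypothetical representation: a domain match is a 1-based, inclusive
# (start, end) pair on a protein sequence, assumed to lie within 1..length.
Interval = Tuple[int, int]


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping or directly adjacent intervals."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def residue_coverage(length: int, matches: List[Interval]) -> float:
    """Fraction of residues that fall inside at least one match."""
    covered = sum(end - start + 1 for start, end in merge_intervals(matches))
    return covered / length


def uncovered_regions(
    length: int, matches: List[Interval], min_len: int = 50
) -> List[Interval]:
    """Unmatched stretches of at least min_len consecutive residues."""
    regions: List[Interval] = []
    cursor = 1  # first residue not yet accounted for
    for start, end in merge_intervals(matches):
        if start - cursor >= min_len:
            regions.append((cursor, start - 1))
        cursor = end + 1
    if length - cursor + 1 >= min_len:
        regions.append((cursor, length))
    return regions


if __name__ == "__main__":
    # Made-up example: a 300-residue protein with two overlapping matches.
    matches = [(20, 120), (100, 180)]
    print(residue_coverage(300, matches))
    print(uncovered_regions(300, matches))
```

In this made-up example the protein counts as covered at the sequence level, yet only about half of its residues lie inside a match, and the short N-terminal gap is dropped because it falls below the 50-residue threshold.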
