Using deep learning to annotate the protein universe.

Affiliations

Google Research, Cambridge, MA, USA.

The Francis Crick Institute, London, UK.

Publication information

Nat Biotechnol. 2022 Jun;40(6):932-937. doi: 10.1038/s41587-021-01179-w. Epub 2022 Feb 21.

DOI: 10.1038/s41587-021-01179-w

PMID: 35190689

Abstract

Understanding the relationship between amino acid sequence and protein function is a long-standing challenge with far-reaching scientific and translational implications. State-of-the-art alignment-based techniques cannot predict function for one-third of microbial protein sequences, hampering our ability to exploit data from diverse organisms. Here, we train deep learning models to accurately predict functional annotations for unaligned amino acid sequences across rigorous benchmark assessments built from the 17,929 families of the protein families database Pfam. The models infer known patterns of evolutionary substitutions and learn representations that accurately cluster sequences from unseen families. Combining deep models with existing methods significantly improves remote homology detection, suggesting that the deep models learn complementary information. This approach extends the coverage of Pfam by >9.5%, exceeding additions made over the last decade, and predicts function for 360 human reference proteome proteins with no previous Pfam annotation. These results suggest that deep learning models will be a core component of future protein annotation tools.

Similar articles

Transfer learning: The key to functionally annotate the protein universe.

Patterns (N Y). 2023 Feb 10;4(2):100691. doi: 10.1016/j.patter.2023.100691.

Uncovering new families and folds in the natural protein universe.

Nature. 2023 Oct;622(7983):646-653. doi: 10.1038/s41586-023-06622-3. Epub 2023 Sep 13.

The challenge of increasing Pfam coverage of the human proteome.

Database (Oxford). 2013 Apr 19;2013:bat023. doi: 10.1093/database/bat023. Print 2013.

Deep learning extends de novo protein modelling coverage of genomes using iteratively predicted structural constraints.

Nat Commun. 2019 Sep 4;10(1):3977. doi: 10.1038/s41467-019-11994-0.

AlphaFun: Structural-Alignment-Based Proteome Annotation Reveals why the Functionally Unknown Proteins (uPE1) Are So Understudied.

J Proteome Res. 2024 May 3;23(5):1593-1602. doi: 10.1021/acs.jproteome.3c00678. Epub 2024 Apr 16.

The Pfam protein families database: towards a more sustainable future.

Nucleic Acids Res. 2016 Jan 4;44(D1):D279-85. doi: 10.1093/nar/gkv1344. Epub 2015 Dec 15.

Exploring the dark foldable proteome by considering hydrophobic amino acids topology.

Sci Rep. 2017 Jan 30;7:41425. doi: 10.1038/srep41425.

Detection of orphan domains in Drosophila using "hydrophobic cluster analysis".

Biochimie. 2015 Dec;119:244-53. doi: 10.1016/j.biochi.2015.02.019. Epub 2015 Feb 28.

The Pfam protein families database in 2019.

Nucleic Acids Res. 2019 Jan 8;47(D1):D427-D432. doi: 10.1093/nar/gky995.

Cited by

Deciphering enzymatic potential in metagenomic reads through DNA language models.

Nucleic Acids Res. 2025 Aug 27;53(16). doi: 10.1093/nar/gkaf836.

DeepSEA: an alignment-free explainable approach to annotate antimicrobial resistance proteins.

BMC Bioinformatics. 2025 Sep 1;26(1):224. doi: 10.1186/s12859-025-06256-4.

Protein functional site annotation using local structure embeddings.

Proc Natl Acad Sci U S A. 2025 Aug 26;122(34):e2513219122. doi: 10.1073/pnas.2513219122. Epub 2025 Aug 20.

Application of Protein Structure Encodings and Sequence Embeddings for Transporter Substrate Prediction.

Molecules. 2025 Aug 1;30(15):3226. doi: 10.3390/molecules30153226.

Rational engineering of allosteric protein switches by in silico prediction of domain insertion sites.

Nat Methods. 2025 Aug;22(8):1698-1706. doi: 10.1038/s41592-025-02741-z. Epub 2025 Aug 4.

In silico prediction of variant effects: promises and limitations for precision plant breeding.

Theor Appl Genet. 2025 Jul 28;138(8):193. doi: 10.1007/s00122-025-04973-1.

Harnessing deep learning for proteome-scale detection of amyloid signaling motifs.

Bioinformatics. 2025 Jul 1;41(Supplement_1):i420-i428. doi: 10.1093/bioinformatics/btaf200.

ProtGO: universal protein function prediction utilizing multi-modal gene ontology knowledge.

Bioinformatics. 2025 Jul 1;41(7). doi: 10.1093/bioinformatics/btaf390.

Cutting-edge deep-learning based tools for metagenomic research.

Natl Sci Rev. 2025 Feb 19;12(6):nwaf056. doi: 10.1093/nsr/nwaf056. eCollection 2025 Jun.

PathoGraph: A Graph-Based Method for Standardized Representation of Pathology Knowledge.

Sci Data. 2025 May 27;12(1):872. doi: 10.1038/s41597-025-04906-z.
