Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences.

Affiliations

Facebook AI Research, New York, NY 10003;

Department of Computer Science, New York University, New York, NY 10012.

Publication information

Proc Natl Acad Sci U S A. 2021 Apr 13;118(15). doi: 10.1073/pnas.2016239118.

DOI: 10.1073/pnas.2016239118

PMID: 33876751

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC8053943/

Abstract

In the field of artificial intelligence, a combination of scale in data and model capacity enabled by unsupervised learning has led to major advances in representation learning and statistical generation. In the life sciences, the anticipated growth of sequencing promises unprecedented data on natural sequence diversity. Protein language modeling at the scale of evolution is a logical step toward predictive and generative artificial intelligence for biology. To this end, we use unsupervised learning to train a deep contextual language model on 86 billion amino acids across 250 million protein sequences spanning evolutionary diversity. The resulting model contains information about biological properties in its representations. The representations are learned from sequence data alone. The learned representation space has a multiscale organization reflecting structure from the level of biochemical properties of amino acids to remote homology of proteins. Information about secondary and tertiary structure is encoded in the representations and can be identified by linear projections. Representation learning produces features that generalize across a range of applications, enabling state-of-the-art supervised prediction of mutational effect and secondary structure and improving state-of-the-art features for long-range contact prediction.
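The claim that structural information "can be identified by linear projections" is commonly tested with a linear probe: a single linear map fitted on top of frozen representations. The sketch below illustrates the idea only. It uses synthetic vectors as a stand-in for per-residue embeddings (no real model output is involved), assumes three secondary-structure classes, and fits the projection by ridge-regularized least squares. The function names, dimensions, and data are hypothetical and are not taken from the paper.

```python
import numpy as np

def make_synthetic_embeddings(n_per_class=200, dim=8, n_classes=3, seed=0):
    """Synthetic stand-in for per-residue embeddings (NOT real model output).

    Each class (e.g. helix / strand / coil, an assumed labeling) gets its own
    well-separated mean so that a linear map can recover the label.
    """
    rng = np.random.default_rng(seed)
    means = 5.0 * np.eye(n_classes, dim)  # one mean vector per class
    X = np.concatenate(
        [rng.normal(means[k], 1.0, size=(n_per_class, dim)) for k in range(n_classes)]
    )
    y = np.repeat(np.arange(n_classes), n_per_class)
    perm = rng.permutation(len(y))  # shuffle so a prefix split mixes classes
    return X[perm], y[perm]

def fit_linear_probe(X, y, n_classes, ridge=1e-3):
    """Fit one linear projection (plus bias) onto one-hot class targets."""
    A = np.hstack([X, np.ones((len(X), 1))])  # append a bias column
    T = np.eye(n_classes)[y]                  # one-hot targets
    gram = A.T @ A + ridge * np.eye(A.shape[1])
    return np.linalg.solve(gram, A.T @ T)     # shape: (dim + 1, n_classes)

def predict(W, X):
    """Project the embeddings and take the highest-scoring class."""
    A = np.hstack([X, np.ones((len(X), 1))])
    return (A @ W).argmax(axis=1)

if __name__ == "__main__":
    X, y = make_synthetic_embeddings()
    W = fit_linear_probe(X[:450], y[:450], n_classes=3)
    accuracy = (predict(W, X[450:]) == y[450:]).mean()
    print(f"held-out probe accuracy: {accuracy:.3f}")
```

Because the probe has no hidden layers, high held-out accuracy can only come from information that is already linearly accessible in the representations, which is the reason a probe of this kind is used as evidence about what the representations encode.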

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/41cc/8053943/837e62287634/pnas.2016239118fig01.jpg

Similar articles

Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences.

Proc Natl Acad Sci U S A. 2021 Apr 13;118(15). doi: 10.1073/pnas.2016239118.

Learning representation hierarchies by sharing visual features: a computational investigation of Persian character recognition with unsupervised deep learning.

Cogn Process. 2017 Aug;18(3):273-284. doi: 10.1007/s10339-017-0796-7. Epub 2017 Feb 25.

Integrating unsupervised language model with multi-view multiple sequence alignments for high-accuracy inter-chain contact prediction.

Comput Biol Med. 2023 Nov;166:107529. doi: 10.1016/j.compbiomed.2023.107529. Epub 2023 Sep 20.

Unsupervised Representation Learning for Proteochemometric Modeling.

Int J Mol Sci. 2021 Nov 28;22(23):12882. doi: 10.3390/ijms222312882.

Evolutionary-scale prediction of atomic-level protein structure with a language model.

Science. 2023 Mar 17;379(6637):1123-1130. doi: 10.1126/science.ade2574. Epub 2023 Mar 16.

Mol2vec: Unsupervised Machine Learning Approach with Chemical Intuition.

J Chem Inf Model. 2018 Jan 22;58(1):27-35. doi: 10.1021/acs.jcim.7b00616. Epub 2018 Jan 10.

Remote protein homology detection using recurrence quantification analysis and amino acid physicochemical properties.

J Theor Biol. 2008 May 7;252(1):145-54. doi: 10.1016/j.jtbi.2008.01.028. Epub 2008 Feb 7.

Learning the protein language: Evolution, structure, and function.

Cell Syst. 2021 Jun 16;12(6):654-669.e3. doi: 10.1016/j.cels.2021.05.017.

Protein-RNA interface residue prediction using machine learning: an assessment of the state of the art.

BMC Bioinformatics. 2012 May 10;13:89. doi: 10.1186/1471-2105-13-89.

Machine learning methods for protein structure prediction.

IEEE Rev Biomed Eng. 2008;1:41-9. doi: 10.1109/RBME.2008.2008239.

Cited by

Graph neural network integrated with pretrained protein language model for predicting human-virus protein-protein interactions.

Brief Bioinform. 2025 Aug 31;26(5). doi: 10.1093/bib/bbaf461.

Predicting nucleic acid binding sites by attention map-guided graph convolutional network with protein language embeddings and physicochemical information.

Brief Bioinform. 2025 Aug 31;26(5). doi: 10.1093/bib/bbaf457.

Language Modelling Techniques for Analysing the Impact of Human Genetic Variation.

Bioinform Biol Insights. 2025 Sep 2;19:11779322251358314. doi: 10.1177/11779322251358314. eCollection 2025.

Pre-training Genomic Language Model with Variants for Better Modeling Functional Genomics.

bioRxiv. 2025 Aug 23:2025.02.26.640468. doi: 10.1101/2025.02.26.640468.

Alphappimi: a comprehensive deep learning framework for predicting PPI-modulator interactions.

J Cheminform. 2025 Aug 29;17(1):134. doi: 10.1186/s13321-025-01077-2.

Functional and clinical insights into nuclear receptor variants for advancing precision diagnostics in male infertility.

EBioMedicine. 2025 Aug 28;119:105899. doi: 10.1016/j.ebiom.2025.105899.

WaveSeekerNet: accurate prediction of influenza A virus subtypes and host source using attention-based deep learning.

Gigascience. 2025 Jan 6;14. doi: 10.1093/gigascience/giaf089.

Predicting the DNA binding specificity of transcription factor mutants using family-level biophysically interpretable machine learning.

Nucleic Acids Res. 2025 Aug 27;53(16). doi: 10.1093/nar/gkaf831.

IQSPred-PLM: An Interpretable Quorum Sensing Peptides Prediction Model Based on Protein Language Model.

Interdiscip Sci. 2025 Aug 26. doi: 10.1007/s12539-025-00766-8.

An efficient machine-learning framework for predicting protein post-translational modification sites.

Sci Rep. 2025 Aug 25;15(1):31179. doi: 10.1038/s41598-025-13178-x.

