Large language models generate functional protein sequences across diverse families.

Affiliations

Salesforce Research, Palo Alto, CA, USA.

Profluent Bio, San Francisco, CA, USA.

Publication information

Nat Biotechnol. 2023 Aug;41(8):1099-1106. doi: 10.1038/s41587-022-01618-2. Epub 2023 Jan 26.

Abstract

Deep-learning language models have shown promise in various biotechnological applications, including protein design and engineering. Here we describe ProGen, a language model that can generate protein sequences with a predictable function across large protein families, akin to generating grammatically and semantically correct natural language sentences on diverse topics. The model was trained on 280 million protein sequences from >19,000 families and is augmented with control tags specifying protein properties. ProGen can be further fine-tuned to curated sequences and tags to improve controllable generation performance of proteins from families with sufficient homologous samples. Artificial proteins fine-tuned to five distinct lysozyme families showed similar catalytic efficiencies as natural lysozymes, with sequence identity to natural proteins as low as 31.4%. ProGen is readily adapted to diverse protein families, as we demonstrate with chorismate mutase and malate dehydrogenase.
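The abstract describes an autoregressive protein language model whose generation is conditioned on control tags that specify protein properties such as family. As a rough illustration only, and not the authors' code, model architecture, or tokenization, the sketch below shows how such tag conditioning might look in practice: a control-tag token is prepended to the prompt of a toy causal model before amino-acid tokens are sampled one at a time. The tiny GRU model, the tag strings, and all names here are hypothetical stand-ins.

```python
# Hypothetical sketch (not the ProGen code or a released checkpoint):
# conditional, autoregressive sampling of an amino-acid sequence where a
# "control tag" token (e.g. a protein-family label) is prepended to the
# prompt, mirroring the tag-conditioned generation described in the abstract.
import torch
import torch.nn as nn

# Toy vocabulary: 20 amino acids plus special tokens and two example tags.
AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")
SPECIALS = ["<pad>", "<bos>", "<eos>"]
CONTROL_TAGS = ["<tag:lysozyme>", "<tag:chorismate_mutase>"]  # illustrative names
VOCAB = SPECIALS + CONTROL_TAGS + AMINO_ACIDS
TOKEN_TO_ID = {t: i for i, t in enumerate(VOCAB)}


class TinyCausalLM(nn.Module):
    """A deliberately tiny causal language model standing in for a real one."""

    def __init__(self, vocab_size: int, d_model: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.rnn = nn.GRU(d_model, d_model, batch_first=True)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        h, _ = self.rnn(self.embed(ids))
        return self.head(h)  # next-token logits at each position


@torch.no_grad()
def generate(model: nn.Module, tag: str, max_len: int = 50,
             temperature: float = 1.0) -> str:
    """Sample a sequence conditioned on a control tag prepended to <bos>."""
    ids = torch.tensor([[TOKEN_TO_ID[tag], TOKEN_TO_ID["<bos>"]]])
    residues = []
    for _ in range(max_len):
        logits = model(ids)[0, -1] / temperature
        next_id = torch.multinomial(torch.softmax(logits, dim=-1), 1).item()
        if VOCAB[next_id] == "<eos>":
            break
        if VOCAB[next_id] in AMINO_ACIDS:  # keep only residue tokens
            residues.append(VOCAB[next_id])
        ids = torch.cat([ids, torch.tensor([[next_id]])], dim=1)
    return "".join(residues)


if __name__ == "__main__":
    model = TinyCausalLM(len(VOCAB))  # untrained; output is random, illustration only
    print(generate(model, "<tag:lysozyme>"))
```

The sketch only illustrates the conditioning interface; per the abstract, the actual model is trained on 280 million sequences from more than 19,000 families and can be further fine-tuned on curated sequences and tags for a target family.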
