

Large language models generate functional protein sequences across diverse families.

Affiliations

Salesforce Research, Palo Alto, CA, USA.

Profluent Bio, San Francisco, CA, USA.

Publication information

Nat Biotechnol. 2023 Aug;41(8):1099-1106. doi: 10.1038/s41587-022-01618-2. Epub 2023 Jan 26.

Abstract

Deep-learning language models have shown promise in various biotechnological applications, including protein design and engineering. Here we describe ProGen, a language model that can generate protein sequences with a predictable function across large protein families, akin to generating grammatically and semantically correct natural language sentences on diverse topics. The model was trained on 280 million protein sequences from >19,000 families and is augmented with control tags specifying protein properties. ProGen can be further fine-tuned to curated sequences and tags to improve controllable generation performance of proteins from families with sufficient homologous samples. Artificial proteins fine-tuned to five distinct lysozyme families showed similar catalytic efficiencies as natural lysozymes, with sequence identity to natural proteins as low as 31.4%. ProGen is readily adapted to diverse protein families, as we demonstrate with chorismate mutase and malate dehydrogenase.
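The abstract reports sequence identity to natural proteins as low as 31.4% but does not say how identity was computed. The following is a minimal sketch of one common convention, assuming the two sequences have already been aligned to equal length with `-` as the gap character; the function name `percent_identity` and the choice of denominator (all columns where at least one sequence has a residue) are illustrative assumptions, not the authors' method.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity between two pre-aligned sequences of equal length.

    Assumption: '-' marks a gap. Columns where both sequences have a gap
    are skipped; every other column counts toward the denominator. Other
    conventions (e.g. dividing by the shorter sequence length) give
    different numbers.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to the same length")

    matches = 0
    columns = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" and b == "-":
            continue  # gap-only column carries no information
        columns += 1
        if a == b:
            matches += 1  # identical residues (a gap never equals a residue)

    if columns == 0:
        raise ValueError("no aligned columns to compare")
    return 100.0 * matches / columns


def max_identity(aligned_pairs) -> float:
    """Highest identity over (generated, natural) aligned pairs.

    Assumption: each pair was aligned separately beforehand.
    """
    return max(percent_identity(g, n) for g, n in aligned_pairs)


if __name__ == "__main__":
    # Toy sequences, not data from the paper.
    print(percent_identity("ACDE", "ACDF"))
```

Under this convention, a generated protein's identity to the natural set would be its highest value over all natural sequences, which is what `max_identity` returns for a list of pre-aligned pairs.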


https://cdn.ncbi.nlm.nih.gov/pmc/blobs/6a53/10400306/9791eb0159ec/nihms-1915837-f0001.jpg



