Large language models generate functional protein sequences across diverse families.

Affiliations

Salesforce Research, Palo Alto, CA, USA.

Profluent Bio, San Francisco, CA, USA.

Publication information

Nat Biotechnol. 2023 Aug;41(8):1099-1106. doi: 10.1038/s41587-022-01618-2. Epub 2023 Jan 26.

Abstract

Deep-learning language models have shown promise in various biotechnological applications, including protein design and engineering. Here we describe ProGen, a language model that can generate protein sequences with a predictable function across large protein families, akin to generating grammatically and semantically correct natural language sentences on diverse topics. The model was trained on 280 million protein sequences from >19,000 families and is augmented with control tags specifying protein properties. ProGen can be further fine-tuned to curated sequences and tags to improve controllable generation performance of proteins from families with sufficient homologous samples. Artificial proteins fine-tuned to five distinct lysozyme families showed similar catalytic efficiencies as natural lysozymes, with sequence identity to natural proteins as low as 31.4%. ProGen is readily adapted to diverse protein families, as we demonstrate with chorismate mutase and malate dehydrogenase.
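The abstract describes an autoregressive protein language model whose generation is conditioned on control tags that specify protein properties such as family. As a rough illustration only, and not the authors' code, model architecture, or tokenization, the sketch below shows how such tag conditioning might look in practice: a control-tag token is prepended to the prompt of a toy causal model before amino-acid tokens are sampled one at a time. The tiny GRU model, the tag strings, and all names here are hypothetical stand-ins.

```python
# Hypothetical sketch (not the ProGen code or a released checkpoint):
# conditional, autoregressive sampling of an amino-acid sequence where a
# "control tag" token (e.g. a protein-family label) is prepended to the
# prompt, mirroring the tag-conditioned generation described in the abstract.
import torch
import torch.nn as nn

# Toy vocabulary: 20 amino acids plus special tokens and two example tags.
AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")
SPECIALS = ["<pad>", "<bos>", "<eos>"]
CONTROL_TAGS = ["<tag:lysozyme>", "<tag:chorismate_mutase>"]  # illustrative names
VOCAB = SPECIALS + CONTROL_TAGS + AMINO_ACIDS
TOKEN_TO_ID = {t: i for i, t in enumerate(VOCAB)}


class TinyCausalLM(nn.Module):
    """A deliberately tiny causal language model standing in for a real one."""

    def __init__(self, vocab_size: int, d_model: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.rnn = nn.GRU(d_model, d_model, batch_first=True)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        h, _ = self.rnn(self.embed(ids))
        return self.head(h)  # next-token logits at each position


@torch.no_grad()
def generate(model: nn.Module, tag: str, max_len: int = 50,
             temperature: float = 1.0) -> str:
    """Sample a sequence conditioned on a control tag prepended to <bos>."""
    ids = torch.tensor([[TOKEN_TO_ID[tag], TOKEN_TO_ID["<bos>"]]])
    residues = []
    for _ in range(max_len):
        logits = model(ids)[0, -1] / temperature
        next_id = torch.multinomial(torch.softmax(logits, dim=-1), 1).item()
        if VOCAB[next_id] == "<eos>":
            break
        if VOCAB[next_id] in AMINO_ACIDS:  # keep only residue tokens
            residues.append(VOCAB[next_id])
        ids = torch.cat([ids, torch.tensor([[next_id]])], dim=1)
    return "".join(residues)


if __name__ == "__main__":
    model = TinyCausalLM(len(VOCAB))  # untrained; output is random, illustration only
    print(generate(model, "<tag:lysozyme>"))
```

The sketch only illustrates the conditioning interface; per the abstract, the actual model is trained on 280 million sequences from more than 19,000 families and can be further fine-tuned on curated sequences and tags for a target family.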
