Integrating genetic algorithms and language models for enhanced enzyme design.

Authors

Nana Teukam Yves Gaetan, Zipoli Federico, Laino Teodoro, Criscuolo Emanuele, Grisoni Francesca, Manica Matteo

Affiliations

IBM Research Europe, Säumerstrasse 4, CH-8803 Rüschlikon, Switzerland.

Institute for Complex Molecular Systems and Department of Biomedical Engineering, Eindhoven University of Technology, 5612 AZ Eindhoven, the Netherlands.

Publication information

Brief Bioinform. 2024 Nov 22;26(1). doi: 10.1093/bib/bbae675.

Abstract

Enzymes are molecular machines optimized by nature to allow otherwise impossible chemical processes to occur. Their design is a challenging task due to the complexity of the protein space and the intricate relationships between sequence, structure, and function. Recently, large language models (LLMs) have emerged as powerful tools for modeling and analyzing biological sequences, but their application to protein design is limited by the high cardinality of the protein space. This study introduces a framework that combines LLMs with genetic algorithms (GAs) to optimize enzymes. LLMs are trained on a large dataset of protein sequences to learn relationships between amino acid residues linked to structure and function. This knowledge is then leveraged by GAs to efficiently search for sequences with improved catalytic performance. We focused on two optimization tasks: improving the feasibility of biochemical reactions and increasing their turnover rate. Systematic evaluations on 105 biocatalytic reactions demonstrated that the LLM-GA framework generated mutants outperforming the wild-type enzymes in terms of feasibility in 90% of the instances. Further in-depth evaluation of seven reactions reveals the power of this methodology to make "the best of both worlds" and create mutants with structural features and flexibility comparable with the wild types. Our approach advances the state-of-the-art computational design of biocatalysts, ultimately opening opportunities for more sustainable chemical processes.
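The search loop described in the abstract can be illustrated with a minimal genetic algorithm over amino acid sequences. This is a sketch under stated assumptions, not the authors' implementation: the abstract does not specify the scoring model, mutation scheme, or hyperparameters, so `toy_fitness` (a hydrophobic-residue fraction), the example sequence, and every parameter value below are hypothetical placeholders. In a setup like the one described, the fitness function would presumably be a predictor built on language-model representations of the sequence.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def toy_fitness(seq: str) -> float:
    """Hypothetical stand-in for a learned scorer (NOT from the paper):
    the fraction of hydrophobic residues in the sequence."""
    return sum(aa in "AILMFVW" for aa in seq) / len(seq)


def mutate(seq: str, rng: random.Random, rate: float = 0.05) -> str:
    """Replace each residue with a random one with probability `rate`."""
    return "".join(
        rng.choice(AMINO_ACIDS) if rng.random() < rate else aa for aa in seq
    )


def crossover(a: str, b: str, rng: random.Random) -> str:
    """Single-point crossover between two equal-length sequences."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:]


def evolve(wild_type, fitness, generations=30, pop_size=40, elite=4, seed=0):
    """Evolve variants of `wild_type`, returning the best sequence found."""
    rng = random.Random(seed)
    # Seed the population with the wild type plus random point mutants.
    population = [wild_type] + [mutate(wild_type, rng) for _ in range(pop_size - 1)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]  # truncation selection
        children = []
        while len(children) < pop_size - elite:
            a, b = rng.sample(parents, 2)
            children.append(mutate(crossover(a, b, rng), rng))
        # Elitism: carry the top sequences over unchanged.
        population = population[:elite] + children
    return max(population, key=fitness)


# Arbitrary example sequence, used only for illustration.
wild_type = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
best = evolve(wild_type, toy_fitness)
print(best, toy_fitness(best), toy_fitness(wild_type))
```

Because the wild type is placed in the initial population and the top sequences are carried over unchanged each generation, the returned sequence never scores below the wild type under the chosen fitness function. Whether a higher score corresponds to a better enzyme depends entirely on the quality of the scorer, which this toy example does not address.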

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/4745/11711099/5008fc9ec50c/bbae675f1.jpg
