Nana Teukam Yves Gaetan, Zipoli Federico, Laino Teodoro, Criscuolo Emanuele, Grisoni Francesca, Manica Matteo
IBM Research Europe, Säumerstrasse 4, CH-8803 Rüschlikon, Switzerland.
Institute for Complex Molecular Systems and Department of Biomedical Engineering, Eindhoven University of Technology, 5612 AZ Eindhoven, the Netherlands.
Brief Bioinform. 2024 Nov 22;26(1). doi: 10.1093/bib/bbae675.
Enzymes are molecular machines optimized by nature to allow otherwise impossible chemical processes to occur. Their design is a challenging task due to the complexity of the protein space and the intricate relationships between sequence, structure, and function. Recently, large language models (LLMs) have emerged as powerful tools for modeling and analyzing biological sequences, but their application to protein design is limited by the high cardinality of the protein space. This study introduces a framework that combines LLMs with genetic algorithms (GAs) to optimize enzymes. LLMs are trained on a large dataset of protein sequences to learn relationships between amino acid residues linked to structure and function. This knowledge is then leveraged by GAs to efficiently search for sequences with improved catalytic performance. We focused on two optimization tasks: improving the feasibility of biochemical reactions and increasing their turnover rate. Systematic evaluations on 105 biocatalytic reactions demonstrated that the LLM-GA framework generated mutants outperforming the wild-type enzymes in terms of feasibility in 90% of the instances. Further in-depth evaluation of seven reactions showed that the methodology offers "the best of both worlds," creating mutants with structural features and flexibility comparable with the wild types. Our approach advances the state of the art in the computational design of biocatalysts, ultimately opening opportunities for more sustainable chemical processes.
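The search loop described in the abstract can be illustrated with a minimal genetic-algorithm sketch. It is not the authors' implementation: the abstract does not specify the operators, population sizes, or scoring models used. Here `toy_score` is a hypothetical stand-in for a learned feasibility or turnover predictor, and all function names, parameter values, and example sequences are assumptions made for illustration.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def toy_score(seq, target):
    """Hypothetical fitness: fraction of positions matching a target sequence.

    In an LLM-GA setting this would be replaced by a learned predictor.
    """
    return sum(a == b for a, b in zip(seq, target)) / len(target)


def mutate(seq, rate, rng):
    """Replace each residue with a random one with probability `rate`."""
    return "".join(
        rng.choice(AMINO_ACIDS) if rng.random() < rate else aa for aa in seq
    )


def crossover(a, b, rng):
    """Single-point crossover of two equal-length sequences."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:]


def optimize(wild_type, score, generations=30, pop_size=40, elite=8,
             rate=0.05, seed=0):
    """Evolve variants of `wild_type`, keeping the top `elite` each round."""
    rng = random.Random(seed)
    # Start from the wild type plus randomly mutated copies of it.
    population = [wild_type] + [
        mutate(wild_type, rate, rng) for _ in range(pop_size - 1)
    ]
    for _ in range(generations):
        ranked = sorted(population, key=score, reverse=True)
        parents = ranked[:elite]  # elitism: the best sequences always survive
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng),
                   rate, rng)
            for _ in range(pop_size - elite)
        ]
        population = parents + children
    return max(population, key=score)


# Illustrative sequences only; they do not correspond to a real enzyme.
wild_type = "MKTAYIAKQR"
target = "MKTWYIAHQR"
best = optimize(wild_type, lambda s: toy_score(s, target))
print(best, toy_score(best, target))
```

Because the top-ranked sequences are carried over unchanged each generation and the wild type is part of the initial population, the returned sequence never scores below the wild type under the chosen scoring function. Whether that translates into a better enzyme depends entirely on how well the scorer reflects real catalytic behavior.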