Semantical and geometrical protein encoding toward enhanced bioactivity and thermostability.

Author information

Tan Yang, Zhou Bingxin, Zheng Lirong, Fan Guisheng, Hong Liang

Affiliations

Shanghai-Chongqing Institute of Artificial Intelligence, Shanghai Jiao Tong University, Chongqing, China.

School of Information Science and Engineering, East China University of Science and Technology, Shanghai, China.

Publication information

eLife. 2025 May 2;13:RP98033. doi: 10.7554/eLife.98033.

DOI: 10.7554/eLife.98033

PMID: 40314227

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC12048155/

Abstract

Protein engineering is a pivotal aspect of synthetic biology: it modifies amino acids within existing protein sequences to achieve novel or enhanced functionalities and physical properties. Accurate prediction of protein variant effects requires a thorough understanding of protein sequence, structure, and function. Deep learning methods have demonstrated remarkable performance in guiding protein modification for improved functionality. However, existing approaches rely predominantly on protein sequences, which makes it difficult to encode the geometry of each amino acid's local environment efficiently, and they often fail to capture crucial details related to protein folding stability, internal molecular interactions, and biological function. Furthermore, a fundamental evaluation of how well existing methods predict protein thermostability is lacking, although thermostability is a key physical property that is frequently investigated in practice. To address these challenges, this article introduces a novel pre-training framework that integrates sequential and geometric encoders for protein primary and tertiary structures. The framework guides mutation directions toward desired traits by simulating natural selection on wild-type proteins, and it evaluates variant effects based on their fitness to perform specific functions. We assess the proposed approach using three benchmarks comprising over 300 deep mutational scanning assays. The predictions show strong performance across extensive experiments compared with other zero-shot learning methods, while keeping the cost in trainable parameters minimal. This study not only proposes an effective framework for more accurate and comprehensive predictions to facilitate efficient protein engineering, but also enhances the in silico assessment system so that future deep learning models align better with empirical requirements.
The PyTorch implementation is available at https://github.com/ai4protein/ProtSSN.
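
The abstract does not spell out how a variant's fitness is scored. As a rough illustration only, the sketch below uses a log-odds ratio between the mutant and wild-type residue probabilities, a convention often used by zero-shot variant effect predictors; whether ProtSSN scores variants exactly this way is an assumption, not something the abstract confirms. The function names (`parse_mutation`, `log_odds_score`) and the toy probabilities are hypothetical and are not taken from the ProtSSN repository.

```python
import math


def parse_mutation(mutation):
    """Split a label such as 'A2G' into (wild-type residue, 1-based position, mutant residue)."""
    return mutation[0], int(mutation[1:-1]), mutation[-1]


def log_odds_score(position_probs, wild_type, mutations):
    """Sum log p(mutant) - log p(wild type) over all mutated sites.

    position_probs: one dict per residue position, mapping amino acid -> probability.
    In practice these would come from a pre-trained model; here they are toy values.
    """
    score = 0.0
    for mutation in mutations:
        wt, pos, mt = parse_mutation(mutation)
        if wild_type[pos - 1] != wt:
            raise ValueError(f"{mutation} does not match the wild-type sequence")
        probs = position_probs[pos - 1]
        score += math.log(probs[mt]) - math.log(probs[wt])
    return score


# Toy, hand-written probabilities for a 3-residue sequence (assumed, not model output).
wild_type = "MKV"
position_probs = [
    {"M": 0.90, "L": 0.10},
    {"K": 0.50, "R": 0.40, "E": 0.10},
    {"V": 0.25, "I": 0.50, "A": 0.25},
]

favourable = log_odds_score(position_probs, wild_type, ["V3I"])    # mutant more likely than wild type
unfavourable = log_odds_score(position_probs, wild_type, ["K2E"])  # mutant less likely than wild type
double = log_odds_score(position_probs, wild_type, ["K2E", "V3I"])  # site scores add up
```

Under this convention a positive score means the model considers the substitution more plausible than the wild-type residue at that site, and multi-site variants are scored additively, which ignores any interaction between sites.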
