Sindeeva Maria, Telepov Alexander, Ivanisenko Nikita, Shashkova Tatiana, Khrabrov Kuzma, Tsypin Artem, Kadurin Artur, Kardymon Olga
Bioinformatics Group, AIRI, Moscow 121170, Russia.
Brief Bioinform. 2025 Jul 2;26(4). doi: 10.1093/bib/bbaf324.
A key challenge in protein engineering is understanding how mutations affect protein fitness and stability. Most current state-of-the-art models either fine-tune a protein structure prediction model or a protein language model, or pretrain a model of their own. Despite its widespread use in computational workflows, AlphaFold2 shows limited sensitivity to the effects of amino acid point mutations on protein structure, which constrains its utility in sequence design and protein engineering. In this work, we propose a simple modification of AlphaFold2 inference that improves the model's capacity to capture the structural impact of amino acid mutations. We achieve this by discarding the multiple sequence alignment and masking the template in the recycling stages. Moreover, we introduce AFToolkit, a framework that combines the embeddings of the modified AlphaFold2 model with simple adapter models to solve multiple protein engineering tasks. In contrast to other methods, our approach requires neither fine-tuning the AlphaFold2 model nor pretraining a new model from scratch on large datasets. It also handles multiple mutations, insertions, and deletions by directly modifying the input protein sequence. The proposed approach achieves strong performance on established benchmarks, measured by Spearman correlation: 0.68 on PTMul, 0.60 on cDNA-indel, and 0.57 on C380.
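To illustrate what "directly modifying the input protein sequence" can look like in practice, the following is a minimal Python sketch that applies substitutions, insertions, and deletions to a wild-type sequence before it is passed to a model. It is not taken from AFToolkit: the `Edit` class, the `apply_edits` function, the 1-based position convention, and the example sequence are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edit:
    """Hypothetical description of one sequence edit (not an AFToolkit type)."""
    kind: str           # "sub", "ins", or "del"
    pos: int            # 1-based wild-type position; for "ins", insert AFTER pos (0 = prepend)
    residues: str = ""  # new residue for "sub", inserted residues for "ins"
    length: int = 1     # number of residues removed for "del"


def apply_edits(wild_type: str, edits: list[Edit]) -> str:
    """Return the variant sequence obtained by applying all edits.

    Edits are applied from the highest position to the lowest, so the
    wild-type coordinates of the remaining edits stay valid.
    Assumption: edits do not overlap; overlapping edits are not checked.
    """
    n = len(wild_type)
    seq = list(wild_type)
    for e in sorted(edits, key=lambda e: e.pos, reverse=True):
        if e.kind == "sub":
            if not 1 <= e.pos <= n or len(e.residues) != 1:
                raise ValueError(f"invalid substitution: {e}")
            seq[e.pos - 1] = e.residues
        elif e.kind == "ins":
            if not 0 <= e.pos <= n or not e.residues:
                raise ValueError(f"invalid insertion: {e}")
            seq[e.pos:e.pos] = e.residues
        elif e.kind == "del":
            if e.length < 1 or not 1 <= e.pos <= n - e.length + 1:
                raise ValueError(f"invalid deletion: {e}")
            del seq[e.pos - 1:e.pos - 1 + e.length]
        else:
            raise ValueError(f"unknown edit kind: {e.kind!r}")
    return "".join(seq)


if __name__ == "__main__":
    variant = apply_edits(
        "MKTAYIAK",
        [
            Edit("sub", 3, "S"),      # T3S
            Edit("ins", 5, "GG"),     # insert GG after Y5
            Edit("del", 8),           # delete K8
        ],
    )
    print(variant)  # MKSAYGGIA
```

Under these assumptions, the resulting variant string is simply a new input sequence, so multiple mutations and indels need no special encoding beyond the edited sequence itself.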