Barnes Jonathan E, América Chi L, Marty Ytreberg F, Patel Jagdish Suresh
Institute for Modeling Collaboration and Innovation, University of Idaho, Moscow, ID, USA.
Department of Chemical and Biological Engineering, University of Idaho, Moscow, ID, USA.
bioRxiv. 2024 Sep 25:2024.09.23.614615. doi: 10.1101/2024.09.23.614615.
Proteins play a pivotal role in many biological processes, and changes in their amino acid sequences can lead to dysfunction and disease. These changes can affect protein folding or interaction with other biomolecules, for example by causing proteins to misfold or by preventing antibodies from inhibiting a viral infection. The ability to predict the effects of mutations in proteins is therefore crucial. Although experimental techniques can accurately quantify the effect of mutations on protein folding free energies and protein-protein binding free energies, they are often time-consuming and costly. By contrast, computational techniques offer fast and cost-effective alternatives for estimating free energies, but they typically suffer from lower accuracy. Enhancing the accuracy of computational predictions is therefore of high importance, with the potential to greatly impact fields ranging from drug design to understanding disease mechanisms. One widely used computational method, FoldX, is capable of rapidly predicting the relative folding stability (ΔΔG_fold) for a protein as well as the relative binding affinity (ΔΔG_bind) between proteins using a single protein structure as input. However, it can suffer from low accuracy, especially for antibody-antigen systems. In this work, we trained a neural network on FoldX output to enhance its prediction accuracy. We first performed FoldX calculations on the largest datasets available for mutations that affect binding (SKEMPIv2) and folding (ProTherm4) with experimentally measured ΔΔG. Features were then extracted from the FoldX output files, including its prediction for ΔΔG. We then developed and optimized a neural network framework to predict the difference between FoldX's estimated ΔΔG and the experimental data, creating a model capable of producing a correction factor. Our approach showed significant improvements in Pearson correlation. For single mutations affecting folding, the correlation improved from a baseline of 0.3 to 0.66.
For binding, the correlation increased from 0.37 to 0.61 for single mutations and from 0.52 to 0.81 for double mutations. For epistasis, the correlation for binding affinity (both singles and doubles) improved from 0.19 to 0.59. Our results also indicated that models trained on double mutations enhanced accuracy when predicting higher-order mutations (such as triple or quadruple mutations), whereas models trained on singles did not. This suggests that interaction energy and epistasis effects present in the FoldX output are not fully utilized by FoldX itself. Once trained, these models add minimal computational time but provide a substantial increase in performance, especially for higher-order mutations and epistasis. This makes them a valuable addition to any free energy prediction pipeline using FoldX. We also believe this technique can be further optimized and tested for predicting antibody escape, aiding in the efficient development of watch lists.
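The correction-factor idea described in the abstract (learn the gap between the FoldX estimate and experiment, then add the predicted gap back to the FoldX value) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the data are synthetic rather than drawn from SKEMPIv2 or ProTherm4, `feature` is a hypothetical single FoldX-derived energy term, and a one-feature least-squares fit stands in for the neural network, whose architecture and inputs are not specified in the abstract.

```python
import random
from statistics import mean


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def fit_residual_model(feature, residual):
    """Least-squares fit of residual ~ slope * feature + intercept.

    This linear model is only a stand-in for the neural network in the text.
    """
    mf, mr = mean(feature), mean(residual)
    slope = sum((f - mf) * (r - mr) for f, r in zip(feature, residual)) / sum(
        (f - mf) ** 2 for f in feature
    )
    return slope, mr - slope * mf


def corrected_ddg(foldx_ddg, feature, model):
    """Add the predicted correction to each FoldX estimate."""
    slope, intercept = model
    return [d + slope * f + intercept for d, f in zip(foldx_ddg, feature)]


# Synthetic data (assumption: not real FoldX output or experimental values).
random.seed(0)
n = 200
experimental = [random.gauss(0.0, 1.0) for _ in range(n)]
feature = [random.gauss(0.0, 1.0) for _ in range(n)]  # hypothetical energy term
# The simulated "FoldX" estimate carries a systematic error tied to the feature.
foldx = [t - 1.5 * f + random.gauss(0.0, 0.3) for t, f in zip(experimental, feature)]

# Train on the first 150 mutations: the target is experiment minus FoldX.
split = 150
residual_train = [e - p for e, p in zip(experimental[:split], foldx[:split])]
model = fit_residual_model(feature[:split], residual_train)

# Evaluate on the 50 held-out mutations.
held_out_corrected = corrected_ddg(foldx[split:], feature[split:], model)
r_before = pearson(foldx[split:], experimental[split:])
r_after = pearson(held_out_corrected, experimental[split:])
print(f"held-out Pearson r: FoldX {r_before:.2f} -> corrected {r_after:.2f}")
```

Predicting the residual rather than ΔΔG directly keeps the FoldX estimate as the starting point, so the learned model only has to capture what FoldX gets systematically wrong. The improvement seen in this toy example reflects how the synthetic error was constructed and says nothing about the magnitudes reported above.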