Gurusinghe Sagara N S, Wu Yibing, DeGrado William, Shifman Julia M
Department of Biological Chemistry, The Alexander Silberman Institute of Life Sciences, The Hebrew University of Jerusalem, Jerusalem 9190401, Israel.
Department of Pharmaceutical Chemistry, School of Pharmacy, University of California San Francisco, CA 94158, United States.
Bioinformatics. 2025 May 6;41(5). doi: 10.1093/bioinformatics/btaf270.
Protein-protein interactions (PPIs) govern virtually all cellular processes, and a single mutation within a PPI can significantly alter protein function, potentially leading to disease. While numerous approaches have emerged to predict mutation-induced changes in the free energy of binding (ΔΔGbind), most lack precision. Recently, protein language models (PLMs) have shown powerful predictive capabilities by leveraging both sequence and structural data from protein complexes, yet they have not been optimized specifically for ΔΔGbind prediction.
We developed an approach, ProBASS (Protein Binding Affinity from Structure and Sequence), to predict the effects of mutations on ΔΔGbind using two of the most advanced PLMs, ESM2 and ESM-IF1, which incorporate sequence and structural features, respectively. We first generated embeddings for each PPI mutant from the two PLMs and then fine-tuned ProBASS by training on a large dataset of experimental ΔΔGbind values. When training and testing were done on the same PPI, ProBASS achieved correlations with experimental ΔΔGbind values of 0.83 ± 0.05 for single mutations and 0.69 ± 0.04 for double mutations. Additionally, when evaluated on a dataset of 2,325 single mutations across 131 PPIs, ProBASS reached a correlation of 0.81 ± 0.02, substantially outperforming other PLMs in predictive accuracy. Our results demonstrate that refining pre-trained PLMs with extensive ΔΔGbind datasets across multiple PPIs is a successful approach for creating a precise and broadly applicable ΔΔGbind prediction model, facilitating future protein engineering and design studies. ProBASS's accuracy could be further improved by retraining as more experimental data become available.
ProBASS is available at: https://colab.research.google.com/github/sagagugit/ProBASS/blob/main/ProBASS.ipynb.
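As a rough illustration of the workflow summarized in the abstract (one sequence-derived and one structure-derived embedding per mutant, with predictions scored by their correlation with experimental ΔΔGbind), the sketch below may help. It is not the ProBASS implementation. The helper names `combine_embeddings` and `pearson_r` are hypothetical, joining the two embeddings by simple concatenation is an assumption, and the abstract does not say which correlation coefficient was used, so Pearson's r is assumed here.

```python
import math
from typing import List, Sequence


def combine_embeddings(seq_emb: Sequence[float], struct_emb: Sequence[float]) -> List[float]:
    """Join a sequence-based and a structure-based embedding into one feature vector.

    Assumption: plain concatenation. The abstract does not state how ProBASS
    merges the ESM2 and ESM-IF1 embeddings.
    """
    return list(seq_emb) + list(struct_emb)


def pearson_r(predicted: Sequence[float], experimental: Sequence[float]) -> float:
    """Pearson correlation between predicted and experimental ddG values.

    Assumption: the abstract reports "correlation" without naming the
    coefficient; Pearson's r is used here for illustration only.
    """
    n = len(predicted)
    if n != len(experimental) or n < 2:
        raise ValueError("need two equal-length series with at least 2 values")
    mean_p = sum(predicted) / n
    mean_e = sum(experimental) / n
    # Sum of cross-deviations and sums of squared deviations
    cov = sum((p - mean_p) * (e - mean_e) for p, e in zip(predicted, experimental))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_e = sum((e - mean_e) ** 2 for e in experimental)
    if var_p == 0 or var_e == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_p * var_e)


if __name__ == "__main__":
    # Toy numbers, invented purely to exercise the functions
    features = combine_embeddings([0.1, 0.2], [0.3])
    print(len(features))
    print(pearson_r([0.5, 1.0, 2.1, -0.3], [0.4, 1.2, 1.9, -0.1]))
```

In a real pipeline the feature vectors would come from the pre-trained models and feed a trained regression head. Here they are toy lists so the example stays self-contained.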