Dieckhaus Henry, Kuhlman Brian
Department of Biochemistry and Biophysics, University of North Carolina School of Medicine, Chapel Hill, North Carolina, USA.
Division of Chemical Biology and Medicinal Chemistry, University of North Carolina Eshelman School of Pharmacy, Chapel Hill, North Carolina, USA.
Protein Sci. 2025 Jan;34(1):e70003. doi: 10.1002/pro.70003.
There is strong interest in accurate methods for predicting changes in protein stability resulting from amino acid mutations to the protein sequence. Recombinant proteins must often be stabilized to be used as therapeutics or reagents, and destabilizing mutations are implicated in a variety of diseases. Due to increased data availability and improved modeling techniques, recent studies have shown advancements in predicting changes in protein stability when a single-point mutation is made. Less focus has been directed toward predicting changes in protein stability when there are two or more mutations. Here, we analyze the largest available dataset of double point mutation stability and benchmark several widely used protein stability models on this and other datasets. We find that additive models of protein stability perform surprisingly well on this task, achieving similar performance to comparable non-additive predictors according to most metrics. Accordingly, we find that neither artificial intelligence-based nor physics-based protein stability models consistently capture epistatic interactions between single mutations. We observe one notable deviation from this trend, which is that epistasis-aware models provide marginally better predictions than additive models on stabilizing double point mutations. We develop an extension of the ThermoMPNN framework for double mutant modeling, as well as a novel data augmentation scheme, which mitigates some of the limitations in currently available datasets. Collectively, our findings indicate that current protein stability models fail to capture the nuanced epistatic interactions between concurrent mutations due to several factors, including training dataset limitations and insufficient model sensitivity.
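As an illustration of the additive baseline discussed above, the following minimal sketch predicts a double mutant's stability change as the sum of its two single-mutant stability changes, and takes epistasis to be the difference between the measured double-mutant value and that sum. The class and function names, the numeric values, and the sign convention (negative ΔΔG taken as stabilizing) are assumptions made for this example and are not taken from the paper or from the ThermoMPNN code.

```python
from dataclasses import dataclass


@dataclass
class DoubleMutant:
    """Stability changes (ddG, kcal/mol) for one double mutant.

    All values used below are hypothetical and for illustration only.
    """
    ddg_first: float   # ddG of the first single mutation alone
    ddg_second: float  # ddG of the second single mutation alone
    ddg_double: float  # ddG of the double mutant


def additive_prediction(ddg_first: float, ddg_second: float) -> float:
    """Additive model: the double-mutant ddG is the sum of the single ddGs."""
    return ddg_first + ddg_second


def epistasis(mutant: DoubleMutant) -> float:
    """Deviation of the double-mutant ddG from the additive expectation."""
    expected = additive_prediction(mutant.ddg_first, mutant.ddg_second)
    return mutant.ddg_double - expected


# Hypothetical example: two mildly stabilizing mutations whose combination
# is more stabilizing than their sum (assuming negative ddG = stabilizing).
example = DoubleMutant(ddg_first=-0.5, ddg_second=-0.25, ddg_double=-1.0)
additive = additive_prediction(example.ddg_first, example.ddg_second)
deviation = epistasis(example)
print(f"additive prediction: {additive:.2f} kcal/mol")
print(f"epistasis: {deviation:.2f} kcal/mol")
```

Under this framing, a purely additive predictor assigns zero epistasis to every pair of mutations, so an epistasis-aware model can only improve on it to the extent that it predicts the nonzero deviations correctly.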