Department of Physics and Astronomy, Clemson University, Clemson, SC 29634, USA.
Department of Biological Sciences, Clemson University, Clemson, SC 29634, USA.
Int J Mol Sci. 2023 Jul 28;24(15):12073. doi: 10.3390/ijms241512073.
The development of methods and algorithms to predict the effect of mutations on protein stability, protein-protein interaction, and protein-DNA/RNA binding is driven by the needs of protein engineering and by the need to understand the molecular mechanisms of disease-causing variants. The vast majority of the leading methods require a database of experimentally measured folding and binding free energy changes for training. These databases collect experimental data from investigations that typically probe the role of particular residues in the above-mentioned thermodynamic characteristics; that is, the mutations are not introduced at random and do not necessarily represent mutations originating from single nucleotide variants (SNVs). Thus, the reported performance of the leading algorithms, assessed on these databases or on other limited cases, may not carry over to predicting the effect of SNVs seen in the human population. Indeed, we demonstrate that SNVs and non-SNVs are not equally represented in the corresponding databases and that their free energy change distributions differ. The Pearson correlation coefficients (PCCs) obtained for folding and binding free energy changes are smaller for SNVs than for non-SNVs, indicating that caution should be used when applying these methods to assess the effect of human SNVs. Furthermore, some methods are sensitive to the chemical nature of the mutations, resulting in PCCs that differ by a factor of four across chemically different mutations. All methods are found to underestimate the energy changes by roughly a factor of two.
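The SNV versus non-SNV comparison described above can be illustrated with a minimal sketch. It is not the authors' pipeline. It assumes that an amino acid substitution counts as SNV-compatible when some codon of the wild-type residue differs from some codon of the mutant residue at exactly one nucleotide under the standard genetic code, and that each PCC is computed between experimental and predicted free energy changes. The names `is_snv_reachable`, `pearson`, and `pcc_by_class`, and the record layout `(wild_type, mutant, experimental, predicted)`, are hypothetical and chosen only for illustration.

```python
from itertools import product
from math import sqrt

# Standard genetic code (NCBI translation table 1), codons ordered with
# bases T, C, A, G and the third position varying fastest.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def codons_for(aa):
    """Return all codons encoding the one-letter amino acid `aa`."""
    return [codon for codon, residue in CODON_TABLE.items() if residue == aa]


def is_snv_reachable(wild_type, mutant):
    """True if one nucleotide change can turn a wild-type codon into a mutant codon.

    Assumption: any codon of the wild-type residue may be the starting point,
    because the actual codon in the gene is not known from the substitution alone.
    """
    return any(
        sum(a != b for a, b in zip(cw, cm)) == 1
        for cw in codons_for(wild_type)
        for cm in codons_for(mutant)
    )


def pearson(xs, ys):
    """Pearson correlation coefficient, or None if it is undefined."""
    n = len(xs)
    if n != len(ys) or n < 2:
        return None
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None  # constant input: correlation is undefined
    return sxy / sqrt(sxx * syy)


def pcc_by_class(records):
    """Split records into SNV and non-SNV classes and compute a PCC for each.

    `records` is an iterable of (wild_type, mutant, experimental, predicted)
    tuples; this layout is a hypothetical convention for the sketch.
    """
    groups = {"SNV": ([], []), "non-SNV": ([], [])}
    for wild_type, mutant, experimental, predicted in records:
        key = "SNV" if is_snv_reachable(wild_type, mutant) else "non-SNV"
        groups[key][0].append(experimental)
        groups[key][1].append(predicted)
    return {key: pearson(exp, pred) for key, (exp, pred) in groups.items()}
```

Under this assumption, `is_snv_reachable("A", "V")` is true (GCN to GTN is a single change), while `is_snv_reachable("W", "A")` is false (TGG needs two changes to reach any GCN codon). If the wild-type codon in the gene is known, restricting the search to that codon would give a stricter classification.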