Department of Computer Science, Aalto University, Espoo, Finland.
Helsinki Institute for Information Technology, Espoo, Finland.
Bioinformatics. 2018 Jul 1;34(13):i274-i283. doi: 10.1093/bioinformatics/bty238.
Proteins are widely used by the biochemical industry in numerous processes. Refining their properties through mutations also affects their stability. An accurate computational method for predicting how mutations affect protein stability is needed to facilitate efficient protein design. However, the accuracy of predictive models is ultimately constrained by the limited availability of experimental data.
We have developed mGPfusion, a novel Gaussian process (GP) method for predicting changes in protein stability upon single and multiple mutations. This method complements the limited experimental data with large amounts of molecular simulation data. We introduce a Bayesian data fusion model that re-calibrates the experimental and in silico data sources and then learns a predictive GP model from the combined data. Our protein-specific model requires experimental data only for the protein of interest and performs well even with few experimental measurements. mGPfusion models proteins by contact maps and infers the stability effects of mutations with a mixture of graph kernels. Our results show that mGPfusion outperforms state-of-the-art methods in predicting protein stability on a dataset of 15 different proteins and that incorporating molecular simulation data improves model learning and prediction accuracy.
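The general idea of learning one GP from a few precise measurements and many noisier simulated values can be illustrated with a minimal sketch. This is not the mGPfusion implementation: it assumes a plain RBF kernel on toy one-dimensional features in place of graph kernels on contact maps, fixed per-source noise variances in place of the Bayesian re-calibration, and synthetic data throughout. The names `rbf_kernel`, `gp_posterior_mean` and `true_effect` are hypothetical.

```python
import numpy as np


def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential kernel between the rows of a and the rows of b."""
    sq_dists = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-0.5 * np.maximum(sq_dists, 0.0) / length_scale**2)


def gp_posterior_mean(x_train, y_train, noise_var, x_test, length_scale=1.0):
    """GP posterior mean with a separate noise variance per training point."""
    k_train = rbf_kernel(x_train, x_train, length_scale) + np.diag(noise_var)
    k_cross = rbf_kernel(x_test, x_train, length_scale)
    return k_cross @ np.linalg.solve(k_train, y_train)


def true_effect(x):
    """Synthetic stand-in for the unknown stability change (assumption)."""
    return np.sin(x[:, 0])


rng = np.random.default_rng(0)

# Few precise "experimental" points and many noisy "simulated" points.
x_exp = rng.uniform(-3.0, 3.0, size=(5, 1))
y_exp = true_effect(x_exp) + rng.normal(0.0, 0.05, size=5)
x_sim = rng.uniform(-3.0, 3.0, size=(60, 1))
y_sim = true_effect(x_sim) + rng.normal(0.0, 0.5, size=60)

# Combine both sources; each point carries the noise variance of its source.
x_all = np.vstack([x_exp, x_sim])
y_all = np.concatenate([y_exp, y_sim])
noise_all = np.concatenate([np.full(5, 0.05**2), np.full(60, 0.5**2)])

x_test = np.linspace(-3.0, 3.0, 50)[:, None]
pred_exp_only = gp_posterior_mean(x_exp, y_exp, np.full(5, 0.05**2), x_test)
pred_fused = gp_posterior_mean(x_all, y_all, noise_all, x_test)
```

Giving the simulated points a larger noise variance lets them inform the prediction in regions without measurements, while the model still trusts the experimental values more where both are available.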
Software implementation and datasets are available at github.com/emmijokinen/mgpfusion.
Supplementary data are available at Bioinformatics online.