Berliner Niklas, Teyra Joan, Colak Recep, Garcia Lopez Sebastian, Kim Philip M
Terrence Donnelly Centre for Cellular and Biomolecular Research (CCBR), University of Toronto, Toronto, Ontario, Canada.
Terrence Donnelly Centre for Cellular and Biomolecular Research (CCBR), University of Toronto, Toronto, Ontario, Canada; Department of Computer Science, University of Toronto, Toronto, Ontario, Canada.
PLoS One. 2014 Sep 22;9(9):e107353. doi: 10.1371/journal.pone.0107353. eCollection 2014.
Advances in sequencing have led to a rapid accumulation of mutations, some of which are associated with diseases. However, to draw mechanistic conclusions, a biochemical understanding of these mutations is necessary. For coding mutations, accurate prediction of significant changes in either the stability of proteins or their affinity to their binding partners is required. Traditional methods have used semi-empirical force fields, while newer methods employ machine learning of sequence and structural features. Here, we show how combining these two approaches leads to a marked boost in accuracy. We introduce ELASPIC, a novel ensemble machine learning approach that is able to predict stability effects upon mutation in both domain cores and domain-domain interfaces. We combine semi-empirical energy terms, sequence conservation, and a wide variety of molecular details with a Stochastic Gradient Boosting of Decision Trees (SGB-DT) algorithm. The accuracy of our predictions surpasses existing methods by a considerable margin, achieving correlation coefficients of 0.77 for stability and 0.75 for affinity predictions. Notably, we integrated homology modeling to enable proteome-wide prediction and show that accurate prediction on modeled structures is possible. Lastly, ELASPIC showed significant differences between various types of disease-associated mutations, as well as between disease and common neutral mutations. Unlike pure sequence-based prediction methods that try to predict phenotypic effects of mutations, our predictions unravel the molecular details governing protein instability and help us better understand the molecular causes of diseases.
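The general idea named in the abstract, feeding an energy term, a conservation score, and structural descriptors into stochastic gradient boosting of decision trees and scoring the result with a correlation coefficient, can be illustrated with a toy sketch. This is not the ELASPIC implementation: the feature names (`energy`, `conservation`, `burial`), the synthetic data, the use of depth-1 trees (stumps), and all hyperparameters are assumptions chosen only to keep the example short and dependency-free.

```python
import random

def stump_value(stump, x):
    """Evaluate a depth-1 regression tree (j, threshold, left, right) on x."""
    j, thr, left, right = stump
    return left if x[j] <= thr else right

def fit_stump(X, residuals, rows):
    """Pick the feature/threshold split with the lowest squared error on `rows`."""
    mean = sum(residuals[i] for i in rows) / len(rows)
    best_sse, best = float("inf"), (0, float("inf"), mean, mean)  # constant fallback
    for j in range(len(X[0])):
        values = sorted({X[i][j] for i in rows})
        # Split right after each observed value except the largest,
        # so both sides of the split are always non-empty.
        for thr in values[:-1]:
            left = [residuals[i] for i in rows if X[i][j] <= thr]
            right = [residuals[i] for i in rows if X[i][j] > thr]
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
            if sse < best_sse:
                best_sse, best = sse, (j, thr, lm, rm)
    return best

def fit_sgb(X, y, n_trees=80, learning_rate=0.2, subsample=0.5, seed=0):
    """Stochastic gradient boosting for squared loss: each stump is fitted
    to the current residuals on a random subsample of the training rows."""
    rng = random.Random(seed)
    base = sum(y) / len(y)
    pred = [base] * len(y)
    stumps = []
    k = max(2, int(subsample * len(y)))
    for _ in range(n_trees):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        rows = rng.sample(range(len(y)), k)  # the "stochastic" part
        stump = fit_stump(X, residuals, rows)
        stumps.append(stump)
        for i, x in enumerate(X):
            pred[i] += learning_rate * stump_value(stump, x)
    return base, learning_rate, stumps

def predict(model, x):
    base, lr, stumps = model
    return base + lr * sum(stump_value(s, x) for s in stumps)

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / (va * vb) ** 0.5

# Synthetic stand-in data (assumption): three hypothetical features per mutation
# and a made-up stability change that depends on two of them plus noise.
data_rng = random.Random(42)
X, y = [], []
for _ in range(150):
    energy, conservation, burial = (data_rng.uniform(-1, 1) for _ in range(3))
    X.append([energy, conservation, burial])
    y.append(1.5 * energy - 1.0 * conservation + data_rng.gauss(0, 0.3))

X_train, y_train, X_test, y_test = X[:100], y[:100], X[100:], y[100:]
model = fit_sgb(X_train, y_train)

mean_train = sum(y_train) / len(y_train)
var_train = sum((v - mean_train) ** 2 for v in y_train) / len(y_train)
mse_train = sum((predict(model, x) - v) ** 2
                for x, v in zip(X_train, y_train)) / len(y_train)
r_test = pearson(y_test, [predict(model, x) for x in X_test])
print(f"held-out Pearson r = {r_test:.2f}")
```

The held-out correlation printed here reflects only the synthetic data and has no bearing on the 0.77 and 0.75 figures reported in the abstract. In practice a library implementation with deeper trees would normally replace the hand-written stumps.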