Tu Hongwei, Han Yanqiang, Wang Zhilong, Li Jinjin
Key Laboratory of Thin Film and Microfabrication of Ministry of Education, Department of Micro/Nano Electronics, Shanghai Jiao Tong University, Shanghai, 200240, China.
Brief Bioinform. 2022 Nov 19;23(6). doi: 10.1093/bib/bbac374.
Accurate and efficient prediction of mutation-induced changes in protein energy remains a major challenge of great interest in computational biology. However, experimental techniques are resource-intensive, and structure-based prediction methods are limited by insufficient protein structural information. Here, we design a structure-independent protocol that predicts the mutation-induced change in protein folding free energy using only sequence, physicochemical and evolutionary features. The proposed clustered tree regression protocol exploits inherent data patterns by integrating unsupervised feature clustering with K-means and supervised tree regression with XGBoost, enabling fast and accurate predictions across different mutations, with an average Pearson correlation coefficient of 0.83 and an average root-mean-square error of 0.94 kcal/mol. The proposed sequence-based method not only eliminates the dependence on protein structures but also has potential applications to proteins for which structural information is scarce.
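The cluster-then-regress idea described in the abstract can be illustrated with a minimal sketch. It assumes one plausible reading of the protocol: samples are first grouped by K-means on their feature vectors, and a separate boosted tree regressor is then fitted within each cluster. To stay dependency-free, the sketch swaps XGBoost for a small gradient-boosted ensemble of one-split trees (stumps) written in plain Python. All function names (`kmeans`, `fit_boosted`, `fit_clustered`, and so on) and hyperparameters are hypothetical and are not taken from the paper.

```python
import math
import random

def nearest(centroids, p):
    """Index of the centroid closest to point p (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(centroids[i], p)))

def kmeans(X, k, iters=50, seed=0):
    """Plain Lloyd's algorithm on a list of feature vectors; returns k centroids."""
    centroids = [list(p) for p in random.Random(seed).sample(X, k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in X:
            groups[nearest(centroids, p)].append(p)
        # An empty cluster keeps its previous centroid.
        new = [[sum(col) / len(g) for col in zip(*g)] if g else centroids[i]
               for i, g in enumerate(groups)]
        if new == centroids:
            break
        centroids = new
    return centroids

def fit_stump(X, r):
    """Least-squares one-split tree on residuals r, or None if no split exists."""
    n, total, best = len(X), sum(r), None
    for j in range(len(X[0])):
        order = sorted(range(n), key=lambda i: X[i][j])
        left_sum = 0.0
        for pos in range(1, n):
            left_sum += r[order[pos - 1]]
            lo, hi = X[order[pos - 1]][j], X[order[pos]][j]
            if lo == hi:
                continue  # cannot split between identical feature values
            right_sum = total - left_sum
            # Maximizing this quantity minimizes the squared error of the split.
            gain = left_sum ** 2 / pos + right_sum ** 2 / (n - pos)
            if best is None or gain > best[0]:
                best = (gain, j, lo, left_sum / pos, right_sum / (n - pos))
    return best[1:] if best else None

def fit_boosted(X, y, rounds=50, lr=0.3):
    """Gradient boosting with stumps (a simple stand-in for XGBoost)."""
    base = sum(y) / len(y)
    pred, stumps = [base] * len(y), []
    for _ in range(rounds):
        stump = fit_stump(X, [yi - pi for yi, pi in zip(y, pred)])
        if stump is None:
            break
        stumps.append(stump)
        j, t, left, right = stump
        pred = [p + lr * (left if x[j] <= t else right) for x, p in zip(X, pred)]
    return base, lr, stumps

def predict_boosted(model, x):
    base, lr, stumps = model
    return base + lr * sum(left if x[j] <= t else right for j, t, left, right in stumps)

def fit_clustered(X, y, k=2):
    """Cluster the samples, then fit one boosted regressor per cluster."""
    centroids = kmeans(X, k)
    labels = [nearest(centroids, p) for p in X]
    fallback = fit_boosted(X, y)  # used only if a cluster ends up empty
    models = []
    for c in range(k):
        idx = [i for i, lab in enumerate(labels) if lab == c]
        models.append(fit_boosted([X[i] for i in idx], [y[i] for i in idx])
                      if idx else fallback)
    return centroids, models

def predict_clustered(model, x):
    """Route x to its nearest cluster and use that cluster's regressor."""
    centroids, models = model
    return predict_boosted(models[nearest(centroids, x)], x)

def rmse(y_true, y_pred):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))

def pearson(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    return cov / math.sqrt(sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b))
```

Given a list of feature vectors `X` and target values `y` (for instance, free energy changes in kcal/mol), `fit_clustered(X, y, k)` returns a model that `predict_clustered` can apply to a new feature vector. `pearson` and `rmse` compute the two metric types quoted in the abstract. In practice, a library K-means implementation and XGBoost itself would replace the hand-written pieces.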