Department of Data Science and AI, Faculty of IT, Monash University, Clayton, Victoria 3800, Australia.
Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Clayton, Victoria 3800, Australia.
J Chem Inf Model. 2022 Sep 12;62(17):4270-4282. doi: 10.1021/acs.jcim.2c00799. Epub 2022 Aug 16.
An essential step in engineering proteins and understanding disease-causing missense mutations is to accurately model protein stability changes when such mutations occur. Here, we developed a new sequence-based predictor for the PROtein STability (PROST) change (Gibbs free energy change, ΔΔG) upon a single-point missense mutation. PROST extracts multiple descriptors from the most promising sequence-based predictors, such as BoostDDG, SAAFEC-SEQ, and DDGun. PROST also extracts descriptors from iFeature and AlphaFold2. The extracted descriptors include sequence-based features, physicochemical properties, evolutionary information, evolutionary-based physicochemical properties, and predicted structural features. The PROST predictor is a weighted average ensemble model based on extreme gradient boosting (XGBoost) decision trees and an extra-trees regressor; PROST is trained on both direct and hypothetical reverse mutations using the S5294 data set (S2647 direct mutations + S2647 inverse mutations). The parameters of the PROST model are optimized using grid search with 5-fold cross-validation, and feature importance analysis identifies the most relevant features. The performance of PROST is evaluated in a blinded manner on nine distinct data sets and compared against existing state-of-the-art sequence-based and structure-based predictors. In blind tests, the method consistently performs well on the frataxin, S217, S349, Ssym, S669, Myoglobin, and CAGI5 data sets and performs similarly to the state-of-the-art predictors on the p53 and S276 data sets. When the performance of PROST is compared with that of the latest predictors, such as BoostDDG, SAAFEC-SEQ, ACDC-NN-seq, and DDGun, PROST outperforms these predictors. A case study of mutation scanning of the frataxin protein for nine wild-type residues demonstrates the utility of PROST. Taken together, these findings indicate that PROST is a well-suited predictor when no protein structural information is available.
The source code of PROST, data sets, examples, and pretrained models, along with usage instructions, are available at https://github.com/ShahidIqb/PROST and https://prost.erc.monash.edu/seq.
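Two ideas from the abstract can be illustrated with a short sketch: augmenting a set of direct mutations with hypothetical reverse mutations, and combining two regressors by a weighted average. The sketch below is not taken from the PROST code base. The `Mutation` record, the function names, and the `weight_a` parameter are hypothetical, and the sign flip for reverse mutations assumes the usual thermodynamic antisymmetry ΔΔG(B→A) = −ΔΔG(A→B). The actual ensemble weights and feature pipeline used by PROST are not stated in the abstract.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Mutation:
    """Hypothetical record for one single-point missense mutation."""
    wild_type: str  # one-letter code of the original residue
    position: int   # 1-based position in the sequence
    mutant: str     # one-letter code of the substituted residue
    ddg: float      # stability change in kcal/mol


def add_reverse_mutations(direct: Sequence[Mutation]) -> List[Mutation]:
    """Return the direct mutations followed by their hypothetical reverses.

    Assumption: ddG(B->A) = -ddG(A->B), so each reverse entry swaps the
    two residues and negates the label. In a real pipeline the features
    of a reverse mutation would be computed on the mutated sequence.
    """
    reverse = [
        Mutation(m.mutant, m.position, m.wild_type, -m.ddg) for m in direct
    ]
    return list(direct) + reverse


def weighted_average(
    pred_a: Sequence[float], pred_b: Sequence[float], weight_a: float
) -> List[float]:
    """Blend two prediction vectors as weight_a * a + (1 - weight_a) * b."""
    if not 0.0 <= weight_a <= 1.0:
        raise ValueError("weight_a must lie in [0, 1]")
    if len(pred_a) != len(pred_b):
        raise ValueError("prediction vectors must have the same length")
    return [weight_a * a + (1.0 - weight_a) * b for a, b in zip(pred_a, pred_b)]
```

Applied to a set of 2647 direct mutations, `add_reverse_mutations` would yield 5294 training records, matching the S2647 + S2647 composition of S5294. In `weighted_average`, `pred_a` and `pred_b` stand in for the outputs of the two base models (for example, an XGBoost regressor and an extra-trees regressor); the blending weight would typically be chosen on validation data.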