Department of Physics & Astronomy, Johns Hopkins University, Baltimore, Maryland 21218, United States.
Laboratory of Computational Biology, National Heart, Lung, and Blood Institute, National Institutes of Health, Bethesda, Maryland 20892, United States.
J Chem Theory Comput. 2022 Apr 12;18(4):2673-2686. doi: 10.1021/acs.jctc.1c01257. Epub 2022 Mar 15.
Protonation states of ionizable protein residues modulate many essential biological processes. For correct modeling and understanding of these processes, it is crucial to accurately determine their pKa values. Here, we present four tree-based machine learning models for protein pKa prediction. The four models, Random Forest, Extra Trees, eXtreme Gradient Boosting (XGBoost), and Light Gradient Boosting Machine (LightGBM), were trained on three experimental PDB and pKa datasets, two of which included a notable portion of internal residues. We observed similar performance among the four machine learning algorithms. The best model trained on the largest dataset performs 37% better than the widely used empirical pKa prediction tool PROPKA and 15% better than the published result from the pKa prediction method DelPhiPKa. The overall root-mean-square error (RMSE) for this model is 0.69, with surface and buried RMSE values of 0.56 and 0.78, respectively, when considering six residue types (Asp, Glu, His, Lys, Cys, and Tyr), and 0.63 when considering Asp, Glu, His, and Lys only. We provide pKa predictions for proteins in the human proteome from the AlphaFold Protein Structure Database and observed that 1% of Asp/Glu/Lys residues have highly shifted pKa values close to the physiological pH.
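The RMSE comparisons quoted above can be illustrated with a minimal sketch. It assumes that "X% better" means a fractional reduction in RMSE relative to the reference method, which the abstract does not state explicitly. The function names and the toy values are hypothetical and are not taken from the paper.

```python
import math


def rmse(predicted, experimental):
    """Root-mean-square error between predicted and experimental values."""
    if len(predicted) != len(experimental):
        raise ValueError("inputs must have the same length")
    if not predicted:
        raise ValueError("inputs must be non-empty")
    squared_errors = [(p - e) ** 2 for p, e in zip(predicted, experimental)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def relative_improvement(rmse_model, rmse_reference):
    """Fractional RMSE reduction versus a reference method.

    Assumed definition: (reference - model) / reference.
    """
    if rmse_reference <= 0:
        raise ValueError("reference RMSE must be positive")
    return (rmse_reference - rmse_model) / rmse_reference


# Toy numbers invented for illustration only; not data from the paper.
experimental = [3.9, 4.4, 6.5, 10.4]
model_pred = [4.1, 4.0, 6.8, 10.1]
reference_pred = [4.5, 3.6, 7.3, 9.6]

rmse_model = rmse(model_pred, experimental)
rmse_reference = rmse(reference_pred, experimental)
improvement = relative_improvement(rmse_model, rmse_reference)

print(f"model RMSE:     {rmse_model:.2f}")
print(f"reference RMSE: {rmse_reference:.2f}")
print(f"improvement:    {improvement:.0%}")
```

Under this assumed definition, the same two functions could be applied separately to surface and buried residues to obtain per-category values like those reported.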