Protein pKa Prediction by Tree-Based Machine Learning.

Affiliations

Department of Physics & Astronomy, Johns Hopkins University, Baltimore, Maryland 21218, United States.

Laboratory of Computational Biology, National Heart, Lung, and Blood Institute, National Institutes of Health, Bethesda, Maryland 20892, United States.

Publication information

J Chem Theory Comput. 2022 Apr 12;18(4):2673-2686. doi: 10.1021/acs.jctc.1c01257. Epub 2022 Mar 15.

Abstract

Protonation states of ionizable protein residues modulate many essential biological processes. For correct modeling and understanding of these processes, it is crucial to accurately determine their pKa values. Here, we present four tree-based machine learning models for protein pKa prediction. The four models, Random Forest, Extra Trees, eXtreme Gradient Boosting (XGBoost), and Light Gradient Boosting Machine (LightGBM), were trained on three experimental PDB and pKa datasets, two of which included a notable portion of internal residues. We observed similar performance among the four machine learning algorithms. The best model trained on the largest dataset performs 37% better than the widely used empirical pKa prediction tool PROPKA and 15% better than the published result from the pKa prediction method DelPhiPKa. The overall root-mean-square error (RMSE) for this model is 0.69, with surface and buried RMSE values being 0.56 and 0.78, respectively, considering six residue types (Asp, Glu, His, Lys, Cys, and Tyr), and 0.63 when considering Asp, Glu, His, and Lys only. We provide pKa predictions for proteins in the human proteome from the AlphaFold Protein Structure Database and observed that 1% of Asp/Glu/Lys residues have highly shifted pKa values close to the physiological pH.
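
The abstract compares methods by RMSE and by percentage improvement over a baseline. The following is a minimal sketch of how such figures could be computed, assuming that "X% better" means a relative reduction in RMSE (the abstract does not state the exact definition). The function names `rmse` and `relative_improvement` and all numeric values are hypothetical and chosen for illustration only; they are not data from the paper.

```python
import math


def rmse(predicted, experimental):
    """Root-mean-square error between predicted and experimental pKa values."""
    if len(predicted) != len(experimental) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - e) ** 2 for p, e in zip(predicted, experimental)]
    return math.sqrt(sum(squared) / len(squared))


def relative_improvement(rmse_model, rmse_baseline):
    """Fractional RMSE reduction of a model relative to a baseline.

    Assumption: this is one plausible reading of "performs X% better".
    """
    return (rmse_baseline - rmse_model) / rmse_baseline


# Hypothetical pKa values for four residues; not taken from the paper.
experimental = [3.9, 4.4, 6.5, 10.4]
model_pred = [4.1, 4.0, 6.9, 10.2]
baseline_pred = [4.5, 3.6, 7.3, 9.8]

model_rmse = rmse(model_pred, experimental)
baseline_rmse = rmse(baseline_pred, experimental)
improvement = relative_improvement(model_rmse, baseline_rmse)

print(f"model RMSE = {model_rmse:.2f}, baseline RMSE = {baseline_rmse:.2f}")
print(f"relative improvement = {improvement:.0%}")
```

Reporting RMSE separately for subsets, as the abstract does for surface and buried residues, would amount to calling `rmse` on the corresponding slices of the prediction and reference lists.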
