Open-source QSAR models for pKa prediction using multiple machine learning approaches.

Authors

Mansouri Kamel, Cariello Neal F, Korotcov Alexandru, Tkachenko Valery, Grulke Chris M, Sprankle Catherine S, Allen David, Casey Warren M, Kleinstreuer Nicole C, Williams Antony J

Affiliations

Integrated Laboratory Systems, Inc., P.O. Box 13501, Research Triangle Park, NC, 27709, USA.

Science Data Software LLC, 14914 Bradwill Court, Rockville, MD, 20850, USA.

Publication information

J Cheminform. 2019 Sep 18;11(1):60. doi: 10.1186/s13321-019-0384-1.

DOI:10.1186/s13321-019-0384-1

PMID:33430972

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC6749653/

Abstract

BACKGROUND

The logarithmic acid dissociation constant pKa reflects the ionization of a chemical, which affects lipophilicity, solubility, protein binding, and ability to pass through the plasma membrane. Thus, pKa affects chemical absorption, distribution, metabolism, excretion, and toxicity properties. Multiple proprietary software packages exist for the prediction of pKa, but to the best of our knowledge no free and open-source programs exist for this purpose. Using a freely available data set and three machine learning approaches, we developed open-source models for pKa prediction.
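The link between pKa and ionization can be made concrete with the Henderson-Hasselbalch relation. The sketch below is an illustration only and is not part of the published models: it assumes a single ionizable group (monoprotic acid or base), and the function name `fraction_ionized` is hypothetical.

```python
def fraction_ionized(pka: float, ph: float, kind: str = "acid") -> float:
    """Fraction of molecules in the ionized form at a given pH.

    Assumes one ionizable group and ideal Henderson-Hasselbalch behavior.
    """
    if kind == "acid":
        # Acid HA <-> A- + H+: the ionized (A-) fraction grows as pH rises above pKa.
        return 1.0 / (1.0 + 10.0 ** (pka - ph))
    if kind == "base":
        # Base B + H+ <-> BH+: the ionized (BH+) fraction grows as pH falls below pKa.
        return 1.0 / (1.0 + 10.0 ** (ph - pka))
    raise ValueError("kind must be 'acid' or 'base'")


# At pH equal to pKa, half of the molecules are ionized.
half = fraction_ionized(7.4, 7.4, "acid")
# An acid with pKa three units below the pH is almost fully ionized.
mostly = fraction_ionized(4.4, 7.4, "acid")
```

Under these assumptions, an error of one pKa unit shifts the predicted ionized-to-neutral ratio by a factor of ten, which is why prediction accuracy matters for the downstream properties listed above.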

METHODS

The experimental strongest acidic and strongest basic pKa values in water for 7912 chemicals were obtained from DataWarrior, a freely available software package. Chemical structures were curated and standardized for quantitative structure-activity relationship (QSAR) modeling using KNIME, and a subset comprising 79% of the initial set was used for modeling. To evaluate different modeling approaches, several datasets were constructed by processing chemical structures with acidic and/or basic pKas in different ways. Continuous molecular descriptors, binary fingerprints, and fragment counts were generated using PaDEL, and pKa prediction models were created using three machine learning methods: (1) support vector machines (SVM) combined with k-nearest neighbors (kNN), (2) extreme gradient boosting (XGB), and (3) deep neural networks (DNN).
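To give a feel for how binary fingerprints can feed a neighbor-based predictor, here is a minimal sketch of similarity-weighted kNN regression with Tanimoto similarity. It is not the authors' pipeline: the paper pairs kNN with SVM and generates fingerprints with PaDEL, whereas here each fingerprint is a hypothetical set of on-bit indices, the pKa values are invented toy data, and the function names are assumptions made for illustration.

```python
from typing import List, Set, Tuple


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto similarity between two fingerprints given as sets of on-bit indices."""
    shared = len(a & b)
    union = len(a) + len(b) - shared
    return shared / union if union else 0.0


def knn_predict(query: Set[int],
                training: List[Tuple[Set[int], float]],
                k: int = 3) -> float:
    """Predict a value as the similarity-weighted mean of the k most similar neighbors."""
    # Rank training chemicals from most to least similar to the query.
    ranked = sorted(training, key=lambda item: tanimoto(query, item[0]), reverse=True)
    neighbors = ranked[:k]
    weights = [tanimoto(query, fp) for fp, _ in neighbors]
    total = sum(weights)
    if total == 0.0:
        # No structural overlap at all: fall back to the unweighted mean.
        return sum(value for _, value in neighbors) / len(neighbors)
    return sum(w * value for w, (_, value) in zip(weights, neighbors)) / total


# Invented toy data: (on-bit indices, pKa value).
toy_training = [
    ({1, 2, 3}, 4.0),
    ({1, 2, 4}, 5.0),
    ({7, 8, 9}, 10.0),
    ({7, 8}, 9.0),
]
prediction = knn_predict({1, 2, 3}, toy_training, k=2)
```

Weighting by similarity lets the closest analog dominate the estimate; a real workflow would also need descriptor selection and a held-out test set, which this sketch omits.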

RESULTS

The three methods delivered comparable performance on the training and test sets, with a root-mean-squared error (RMSE) around 1.5 and a coefficient of determination (R²) around 0.80. Two commercial pKa predictors from ACD/Labs and ChemAxon were used to benchmark the three best models developed in this work, and the performance of our models compared favorably to the commercial products.
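For readers unfamiliar with the two statistics quoted above, the following sketch computes RMSE and R² from paired observed and predicted values using the standard textbook definitions. The abstract does not spell out the exact formulas the authors used, so treat this as an assumption; the numbers are invented toy values, not results from the paper.

```python
import math
from typing import Sequence


def rmse(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Root-mean-squared error between observed and predicted values."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Invented toy pKa values, chosen so the arithmetic is easy to follow.
toy_observed = [2.0, 4.0, 6.0, 8.0]
toy_predicted = [3.0, 3.0, 7.0, 7.0]
toy_rmse = rmse(toy_observed, toy_predicted)      # every error is 1 pKa unit
toy_r2 = r_squared(toy_observed, toy_predicted)   # 1 - 4/20
```

RMSE is expressed in pKa units, so a value near 1.5 means typical errors of about one and a half log units, while R² summarizes how much of the spread in the experimental values the model captures.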

CONCLUSIONS

This work provides multiple QSAR models to predict the strongest acidic and strongest basic pKas of chemicals, built using publicly available data, and provided as free and open-source software on GitHub.
