Open-source QSAR models for pKa prediction using multiple machine learning approaches.

Author information

Mansouri Kamel, Cariello Neal F, Korotcov Alexandru, Tkachenko Valery, Grulke Chris M, Sprankle Catherine S, Allen David, Casey Warren M, Kleinstreuer Nicole C, Williams Antony J

Affiliations

Integrated Laboratory Systems, Inc., P.O. Box 13501, Research Triangle Park, NC, 27709, USA.

Science Data Software LLC, 14914 Bradwill Court, Rockville, MD, 20850, USA.

Publication information

J Cheminform. 2019 Sep 18;11(1):60. doi: 10.1186/s13321-019-0384-1.

Abstract

BACKGROUND

The logarithmic acid dissociation constant pKa reflects the ionization of a chemical, which affects lipophilicity, solubility, protein binding, and ability to pass through the plasma membrane. Thus, pKa affects chemical absorption, distribution, metabolism, excretion, and toxicity properties. Multiple proprietary software packages exist for the prediction of pKa, but to the best of our knowledge no free and open-source programs exist for this purpose. Using a freely available data set and three machine learning approaches, we developed open-source models for pKa prediction.
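The link between pKa and ionization can be made concrete with the Henderson-Hasselbalch relation. The sketch below is illustrative only and is not part of the published models; it assumes a single ionizable group, and the function names are hypothetical.

```python
def fraction_ionized_acid(pka: float, ph: float) -> float:
    """Fraction of a monoprotic acid in its deprotonated (ionized) form.

    Follows from Henderson-Hasselbalch: pH = pKa + log10([A-]/[HA]).
    """
    return 1.0 / (1.0 + 10.0 ** (pka - ph))


def fraction_ionized_base(pka: float, ph: float) -> float:
    """Fraction of a monoprotic base in its protonated (ionized) form.

    Here pka is the pKa of the conjugate acid BH+.
    """
    return 1.0 / (1.0 + 10.0 ** (ph - pka))
```

When pH equals pKa, both functions return 0.5, meaning half of the molecules are ionized. Each pH unit of separation from the pKa shifts the ratio of ionized to neutral form by a factor of ten.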

METHODS

The experimental strongest acidic and strongest basic pKa values in water for 7912 chemicals were obtained from DataWarrior, a freely available software package. Chemical structures were curated and standardized for quantitative structure-activity relationship (QSAR) modeling using KNIME, and a subset comprising 79% of the initial set was used for modeling. To evaluate different approaches to modeling, several datasets were constructed based on different processing of chemical structures with acidic and/or basic pKas. Continuous molecular descriptors, binary fingerprints, and fragment counts were generated using PaDEL, and pKa prediction models were created using three machine learning methods: (1) support vector machines (SVM) combined with k-nearest neighbors (kNN), (2) extreme gradient boosting (XGB), and (3) deep neural networks (DNN).
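To illustrate how a nearest-neighbor model can turn binary fingerprints into a pKa estimate, the following is a minimal sketch and not the authors' implementation. The abstract does not state the similarity metric or weighting scheme, so the Tanimoto similarity and the similarity-weighted average used here are assumptions, and the fingerprint bits and pKa values are made-up toy data.

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto similarity between two sets of 'on' fingerprint bit indices."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def knn_predict(query: set, training: list, k: int = 3) -> float:
    """Similarity-weighted mean pKa of the k most similar training chemicals."""
    # Score every training chemical, then keep the k highest similarities.
    scored = sorted(
        ((tanimoto(query, fp), pka) for fp, pka in training), reverse=True
    )[:k]
    total = sum(sim for sim, _ in scored)
    if total == 0:
        # No neighbor shares any bit: fall back to an unweighted mean.
        return sum(pka for _, pka in scored) / len(scored)
    return sum(sim * pka for sim, pka in scored) / total


# Hypothetical toy data: (set of on-bits, experimental pKa).
training = [
    ({1, 2, 3}, 4.0),
    ({1, 2, 4}, 5.0),
    ({7, 8, 9}, 10.0),
    ({7, 8}, 9.0),
]

prediction = knn_predict({1, 2, 3}, training, k=2)
```

With k=2, the query matches the first chemical exactly (similarity 1.0) and the second partially (similarity 0.5), so the estimate is pulled toward 4.0 rather than sitting midway between the two neighbors.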

RESULTS

The three methods delivered comparable performances on the training and test sets with a root-mean-squared error (RMSE) around 1.5 and a coefficient of determination (R²) around 0.80. Two commercial pKa predictors from ACD/Labs and ChemAxon were used to benchmark the three best models developed in this work, and performance of our models compared favorably to the commercial products.
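For readers unfamiliar with the two reported statistics, the sketch below shows how RMSE and R² are commonly computed from experimental and predicted pKa values. It uses the standard definition R² = 1 - SS_res/SS_tot; the abstract does not say which variant of R² the authors used, so that choice is an assumption, and the numbers in the test data are invented.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-squared error between experimental and predicted values."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def r_squared(y_true, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot
```

An RMSE of about 1.5 means that predictions typically deviate from the experimental value by roughly one and a half pKa units, which corresponds to a factor of about 30 in the dissociation constant itself.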

CONCLUSIONS

This work provides multiple QSAR models to predict the strongest acidic and strongest basic pKas of chemicals, built using publicly available data, and provided as free and open-source software on GitHub.


