Extreme Gradient Boosting Combined with Conformal Predictors for Informative Solubility Estimation.

Authors

Jovic Ozren, Mouras Rabah

Affiliation

Pharmaceutical Manufacturing Technology Centre, Bernal Institute, Department of Chemical Sciences, University of Limerick, V94 T9PX Limerick, Ireland.

Publication

Molecules. 2023 Dec 19;29(1):19. doi: 10.3390/molecules29010019.

DOI:10.3390/molecules29010019

PMID:38202602

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC10779886/

Abstract

We used the extreme gradient boosting (XGB) algorithm to predict the experimental solubility of chemical compounds in water and organic solvents and to select significant molecular descriptors. In the first step, we evaluated our forward stepwise top-importance XGB (FSTI-XGB) on curated solubility data sets: the RMSE was 0.59-0.76 Log(S) for two water data sets, 0.69-0.79 Log(S) for the Methanol data set, 0.65-0.79 Log(S) for the Ethanol data set, and 0.62-0.70 Log(S) for the Acetone data set. In the second step, we used uncurated and curated AquaSolDB data sets for applicability domain (AD) tests of the Drugbank, PubChem, and COCONUT databases and determined that more than 95% of the ca. 500,000 compounds studied were within the AD. In the third step, we applied conformal prediction to obtain narrow prediction intervals, which we validated against the true solubility values of the test sets. In the fourth and final step, we used these prediction intervals to estimate individual error margins and the accuracy class of the solubility prediction for molecules within the AD of the three public databases, without knowledge of the experimental solubilities of the database compounds. We consider this four-step workflow novel because solubility-related studies usually address only the first step or the first two steps.
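The conformal prediction step can be illustrated with a minimal split-conformal sketch. This is not the authors' implementation: the abstract does not specify the conformal variant or nonconformity score, so the code below assumes the simplest one (absolute residuals on a held-out calibration set), uses synthetic data, and replaces the XGB regressor with a hypothetical one-descriptor least-squares fit so that it runs with the Python standard library alone. All function names are illustrative.

```python
import math
import random


def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; a stand-in for the XGB regressor (assumption)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    return lambda x: a + b * x


def conformal_half_width(model, cal_x, cal_y, alpha=0.1):
    """Split conformal half-width: the ceil((n+1)(1-alpha))-th smallest
    absolute residual on the calibration set."""
    scores = sorted(abs(y - model(x)) for x, y in zip(cal_x, cal_y))
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf  # too few calibration points for this alpha
    return scores[k - 1]


def predict_interval(model, x, q):
    """Symmetric prediction interval around the point prediction."""
    center = model(x)
    return center - q, center + q


# Synthetic "descriptor -> Log(S)" data; values are made up for illustration.
rng = random.Random(0)


def make_data(n):
    xs = [rng.uniform(-2.0, 6.0) for _ in range(n)]
    ys = [0.5 - 0.9 * x + rng.gauss(0.0, 0.7) for x in xs]
    return xs, ys


train_x, train_y = make_data(300)
cal_x, cal_y = make_data(300)
test_x, test_y = make_data(1000)

model = fit_line(train_x, train_y)
q = conformal_half_width(model, cal_x, cal_y, alpha=0.1)

# Validate the intervals against the test set's true values.
covered = 0
for x, y in zip(test_x, test_y):
    lo, hi = predict_interval(model, x, q)
    covered += lo <= y <= hi
coverage = covered / len(test_x)

print(f"half-width = {q:.2f} Log(S), empirical coverage = {coverage:.3f}")
```

With `alpha=0.1` the intervals target roughly 90% coverage on exchangeable data, which is what the test-set check measures. This plain variant yields the same half-width for every compound; obtaining per-molecule error margins, as the abstract describes, would require a score that varies with the input (for example, residuals normalized by an estimated difficulty), which is not shown here.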
