Extreme Gradient Boosting Combined with Conformal Predictors for Informative Solubility Estimation.

Author information

Jovic Ozren, Mouras Rabah

Affiliation

Pharmaceutical Manufacturing Technology Centre, Bernal Institute, Department of Chemical Sciences, University of Limerick, V94 T9PX Limerick, Ireland.

Publication information

Molecules. 2023 Dec 19;29(1):19. doi: 10.3390/molecules29010019.

Abstract

We used the extreme gradient boosting (XGB) algorithm to predict the experimental solubility of chemical compounds in water and organic solvents and to select significant molecular descriptors. In the first step, we evaluated our forward stepwise top-importance XGB (FSTI-XGB) on curated solubility data sets: the RMSE was 0.59-0.76 Log(S) for two water data sets, 0.69-0.79 Log(S) for the Methanol data set, 0.65-0.79 Log(S) for the Ethanol data set, and 0.62-0.70 Log(S) for the Acetone data set. In the second step, we used uncurated and curated AquaSolDB data sets for applicability domain (AD) tests of the Drugbank, PubChem, and COCONUT databases and determined that more than 95% of the ca. 500,000 compounds studied were within the AD. In the third step, we applied conformal prediction to obtain narrow prediction intervals, which we validated against the true solubility values of the test sets. In the fourth and final step, we used these prediction intervals to estimate individual error margins and the accuracy class of the solubility prediction for molecules within the AD of the three public databases, without knowing the experimental solubilities of the compounds in those databases. We consider this four-step workflow novel because solubility-related works usually study only the first step or the first two steps.
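The abstract does not say which conformal prediction variant was used, so the following is only a minimal sketch of the general idea, assuming the simplest split (inductive) form with absolute residuals as the nonconformity score. The data are synthetic and made up for illustration, an ordinary least-squares fit stands in for the XGB model to keep the example dependency-light, and all names (`conformal_half_width`, `predict`, `alpha`) are hypothetical.

```python
import math
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data (NOT the paper's data): five made-up descriptors
# and a noisy linear response playing the role of Log(S).
X = rng.normal(size=(600, 5))
true_w = np.array([1.0, -0.5, 0.3, 0.0, 0.8])
y = X @ true_w + rng.normal(scale=0.7, size=600)

# Disjoint training, calibration, and test splits.
X_train, y_train = X[:300], y[:300]
X_cal, y_cal = X[300:450], y[300:450]
X_test, y_test = X[450:], y[450:]


def add_intercept(A):
    """Prepend a column of ones so the linear model has a bias term."""
    return np.hstack([np.ones((A.shape[0], 1)), A])


# Assumption: ordinary least squares replaces the XGB regressor here.
coef, *_ = np.linalg.lstsq(add_intercept(X_train), y_train, rcond=None)


def predict(A):
    return add_intercept(A) @ coef


def conformal_half_width(abs_residuals, alpha):
    """Split-conformal quantile: the k-th smallest calibration residual,
    with k = ceil((n + 1) * (1 - alpha))."""
    n = len(abs_residuals)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for the requested confidence level.
        return math.inf
    return float(np.sort(abs_residuals)[k - 1])


alpha = 0.1  # target miscoverage, i.e. roughly 90% intervals
q = conformal_half_width(np.abs(y_cal - predict(X_cal)), alpha)

# Prediction intervals for the test set, then a check against true values.
y_hat = predict(X_test)
lower, upper = y_hat - q, y_hat + q
coverage = float(np.mean((y_test >= lower) & (y_test <= upper)))
print(f"half-width = {q:.2f}, empirical coverage = {coverage:.2f}")
```

Checking the empirical coverage on held-out true values mirrors the validation described in the third step. This basic variant gives every compound the same half-width; per-molecule error margins of the kind the abstract mentions would need a normalized or otherwise adaptive nonconformity score, which this sketch does not attempt.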

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/b246/10779886/bc2643e3e28e/molecules-29-00019-g001.jpg
