Effect of Data Quality and Data Quantity on the Estimation of Intrinsic Solubility: Analysis Based on a Single-Source Data Set.

Affiliations

Department of Pharmacy, Uppsala University, 751 23 Uppsala, Sweden.

Pharmaceutical & Material Sciences, Janssen Pharmaceutica NV, B-2340 Beerse, Belgium.

Publication information

Mol Pharm. 2024 Oct 7;21(10):5261-5271. doi: 10.1021/acs.molpharmaceut.4c00685. Epub 2024 Sep 13.

DOI:10.1021/acs.molpharmaceut.4c00685

PMID:39267585

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC11462503/

Abstract

Aqueous solubility is one of the most important physicochemical properties of drug molecules and a major driving force for oral drug absorption. To date, the performance of in silico models for estimating solubility in novel chemical space remains limited. To investigate possible reasons and remedies, the Johnson and Johnson in-house aqueous solubility data set of over 40,000 compounds was leveraged. All data were generated through the same high-throughput assay, providing a unique opportunity to explore the relationship between data quality, data quantity, and model estimations. Six intrinsic solubility data sets with different sizes and noise levels were generated using three different approaches: (i) inclusion or exclusion of amorphous solid residue, (ii) measured or experimental log  to identify the intrinsic solubility, and (iii) adopting or omitting a quality check process in the data processing workflow. A random forest regressor was trained on the data sets with three different sets of descriptors calculated from RDKit, ADMET predictor, or Mordred, and performance was evaluated with nested cross-validation as well as ten refined test sets. The models confirm, as expected, that for the same data set size, high-quality data lead to better model performance; however, models trained on larger data sets containing analytical variability can give estimations as accurate as those of models trained on small, clean, and diverse data sets. In contrast, noise introduced by including compounds with amorphous solid present after the solubility measurement in the training data set cannot be overcome by increasing data size, because it introduces a systematic positive error in the data set, confirming the importance of critical data review. Finally, two top-performing models were tested on the first test set from the second solubility challenge, achieving RMSE values of 0.74 and 0.72 and log S ± 0.5 of 46 and 48%, respectively. These results demonstrated improved performance compared to those reported in the findings of the competition, highlighting that a single-source curated data set can enhance the prediction of intrinsic solubility.
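
The two evaluation figures quoted in the abstract, RMSE and the percentage of predictions within ±0.5 log units, can be illustrated with a minimal Python sketch. This is not the authors' code: the function names are hypothetical, the ±0.5 metric is assumed to mean the share of compounds whose absolute prediction error is at most 0.5 log units, and the four data points are invented purely for illustration.

```python
import math


def rmse(observed, predicted):
    """Root-mean-square error between observed and predicted log-solubility values."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def percent_within(observed, predicted, tolerance=0.5):
    """Percentage of predictions whose absolute error is <= tolerance (log units)."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(1 for o, p in zip(observed, predicted) if abs(o - p) <= tolerance)
    return 100.0 * hits / len(observed)


# Invented example values (log units); not taken from the paper.
observed = [-3.0, -4.5, -2.0, -6.0]
predicted = [-3.25, -4.0, -3.0, -5.75]

print(f"RMSE: {rmse(observed, predicted):.2f}")
print(f"Within ±0.5 log units: {percent_within(observed, predicted):.0f}%")
```

Reporting both numbers together is useful because RMSE is sensitive to a few large outliers, whereas the ±0.5 percentage only counts how many compounds are predicted acceptably well.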

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/3927/11462503/61b6adb23110/mp4c00685_0001.jpg
