Department of Pharmacy, Uppsala University, 751 23 Uppsala, Sweden.
Pharmaceutical & Material Sciences, Janssen Pharmaceutica NV, B-2340 Beerse, Belgium.
Mol Pharm. 2024 Oct 7;21(10):5261-5271. doi: 10.1021/acs.molpharmaceut.4c00685. Epub 2024 Sep 13.
Aqueous solubility is one of the most important physicochemical properties of drug molecules and a major driving force for oral drug absorption. To date, in silico models estimate solubility poorly for novel chemical space. To investigate possible reasons and remedies, the Johnson and Johnson in-house aqueous solubility data set of over 40,000 compounds was leveraged. All data were generated with the same high-throughput assay, providing a unique opportunity to explore the relationship between data quality, data quantity, and model estimations. Six intrinsic solubility data sets with different sizes and noise levels were generated by varying three choices: (i) inclusion or exclusion of amorphous solid residue, (ii) measured or experimental log to identify the intrinsic solubility, and (iii) adopting or omitting a quality check process in the data processing workflow. A random forest regressor was trained on each data set with three different sets of descriptors, calculated from RDKit, ADMET Predictor, or Mordred, and performance was evaluated with nested cross-validation as well as ten refined test sets. The models confirm, as expected, that for the same data set size, high-quality data lead to better model performance. However, models trained on larger data sets containing analytical variability can give estimations as accurate as those of models trained on small, clean, and diverse data sets. In contrast, the noise introduced by including measurements with amorphous solid present after the solubility measurement cannot be overcome by increasing data size, because it adds a systematic positive error to the data set, confirming the importance of critical data review. Finally, the two top-performing models were tested on the first test set of the Second Solubility Challenge, achieving RMSE values of 0.74 and 0.72 and placing 46 and 48% of predictions within ±0.5 log units, respectively. These results improve on those reported in the findings of the competition, highlighting that a single-source curated data set can enhance the prediction of intrinsic solubility.
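The two metrics quoted for the challenge test set can be computed as in the minimal sketch below. It is an illustration only, not the authors' evaluation code: the function names `rmse` and `pct_within` are hypothetical, the solubility values are made up, and the sketch assumes that "log ± 0.5" means the percentage of predictions falling within 0.5 log units of the measured value.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-square error between measured and predicted log solubility."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - t) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def pct_within(y_true, y_pred, tol=0.5):
    """Percentage of predictions within +/- tol log units of the measurement."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(1 for t, p in zip(y_true, y_pred) if abs(p - t) <= tol)
    return 100.0 * hits / len(y_true)


# Made-up log solubility values, for illustration only (not from the paper).
measured = [-3.0, -4.5, -2.0, -5.0]
predicted = [-3.25, -3.5, -2.25, -5.75]

print(f"RMSE: {rmse(measured, predicted):.2f}")
print(f"Within +/-0.5 log units: {pct_within(measured, predicted):.0f}%")
```

Reporting both numbers is useful because RMSE is dominated by large outliers, whereas the within-tolerance percentage shows how many compounds are predicted to a practically acceptable accuracy.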