Bao Zeqing, Tom Gary, Cheng Austin, Watchorn Jeffrey, Aspuru-Guzik Alán, Allen Christine
Leslie Dan Faculty of Pharmacy, University of Toronto, Toronto, ON, M5S 3M2, Canada.
Department of Chemistry, University of Toronto, Toronto, ON, M5S 3H6, Canada.
J Cheminform. 2024 Oct 28;16(1):117. doi: 10.1186/s13321-024-00911-3.
Drug solubility is an important parameter in the drug development process, yet it is often tedious and challenging to measure, especially for drugs that are expensive or available only in small quantities. To alleviate these challenges, machine learning (ML) has been applied to predict drug solubility as an alternative approach. However, most existing ML research has focused on predicting aqueous solubility and/or solubility at specific temperatures, which restricts the applicability of the models in pharmaceutical development. To bridge this gap, we compiled a dataset of 27,000 solubility datapoints, covering the solubility of small molecules measured in a range of binary solvent mixtures at various temperatures. Next, a panel of ML models was trained on this dataset, with hyperparameters tuned using Bayesian optimization. The top-performing models, both gradient boosted decision trees (light gradient boosting machine and extreme gradient boosting), achieved mean absolute errors (MAE) of 0.33 for LogS (S in g/100 g) on the holdout set. These models were further validated through a prospective study, in which the solubility of four drug molecules was predicted by the models and then checked against in-house solubility experiments. This prospective study demonstrated that the models accurately predicted the solubility of solutes in specific binary solvent mixtures at different temperatures, especially for drugs whose features closely align with those of the solutes in the dataset (MAE < 0.5 for LogS). To support future research and facilitate advancements in the field, we have made the dataset and code openly available.

Scientific contribution
Our research advances the state of the art in predicting the solubility of small molecules by leveraging ML and a uniquely comprehensive dataset. Unlike existing ML studies, which predominantly focus on solubility in aqueous solvents at fixed temperatures, our work enables prediction of drug solubility in a variety of binary solvent mixtures over a broad temperature range, providing practical insights into the modeling of solubility for realistic pharmaceutical applications. These advancements, along with the open-access dataset and code, support significant steps in the drug development process, including new molecule discovery, drug analysis and formulation.
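To make the reported metric concrete, the following is a minimal Python sketch of how an MAE on LogS can be computed and read as a fold error in solubility. It is not the authors' code. It assumes LogS is a base-10 logarithm (the abstract does not state the base), the function names `log_s`, `mean_absolute_error`, and `fold_error` are hypothetical, and the solubility values are made up for illustration.

```python
import math


def log_s(solubility_g_per_100g: float) -> float:
    """Return LogS for a solubility given in g/100 g (base 10 is an assumption)."""
    if solubility_g_per_100g <= 0:
        raise ValueError("solubility must be positive")
    return math.log10(solubility_g_per_100g)


def mean_absolute_error(y_true, y_pred) -> float:
    """Mean of the absolute differences between paired values."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def fold_error(mae_log10: float) -> float:
    """Convert an MAE in log10 units into a multiplicative (fold) error."""
    return 10 ** mae_log10


# Illustrative, made-up solubilities in g/100 g (not taken from the dataset).
measured = [0.5, 2.0, 10.0]
predicted = [1.0, 2.0, 5.0]

mae = mean_absolute_error(
    [log_s(s) for s in measured],
    [log_s(s) for s in predicted],
)
print(f"MAE on LogS: {mae:.3f}")
print(f"Fold error for MAE = 0.33: {fold_error(0.33):.2f}")
```

Under the base-10 assumption, an MAE of 0.33 on LogS corresponds to predictions that differ from measured solubility by a factor of roughly 2.1 on a geometric-mean basis, and the MAE < 0.5 threshold reported for the prospective study corresponds to a factor of about 3.2.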