Ye Zhuyifan, Ouyang Defang
State Key Laboratory of Quality Research in Chinese Medicine, Institute of Chinese Medical Sciences (ICMS), University of Macau, Macau, China.
J Cheminform. 2021 Dec 11;13(1):98. doi: 10.1186/s13321-021-00575-3.
Rapid solvent selection is of great significance in chemistry, yet solubility prediction remains a crucial challenge. This study aimed to develop machine learning models that accurately predict compound solubility in organic solvents. A dataset of 5081 experimental data points, each pairing a temperature with the measured solubility of a compound in an organic solvent, was extracted and standardized. Molecular fingerprints were used to characterize structural features. lightGBM was compared with deep learning and traditional machine learning methods (PLS, Ridge regression, kNN, DT, ET, RF, SVM) for predicting solubility in organic solvents at different temperatures. Compared to the other models, lightGBM exhibited significantly better overall generalization (logS ± 0.20). For unseen solutes, our model gave a prediction accuracy (logS ± 0.59) close to the expected noise level of experimental solubility data. lightGBM also revealed the physicochemical relationship between solubility and structural features. Our method enables rapid solvent screening in chemistry and may be applied to solubility prediction in other solvents.
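The abstract reports two evaluation settings: overall generalization and prediction for unseen solutes. The following is a minimal sketch of how an unseen-solute split might be set up, assuming that whole solutes are held out of training and that the reported ± values can be read as a mean absolute error in logS. Both are assumptions, not details stated in the abstract. The function names, record fields, and toy values are hypothetical and purely illustrative.

```python
import random


def split_by_solute(records, test_fraction=0.2, seed=0):
    """Hold out whole solutes so that no test solute appears in training."""
    solutes = sorted({r["solute"] for r in records})
    rng = random.Random(seed)
    rng.shuffle(solutes)
    n_test = max(1, round(len(solutes) * test_fraction))
    held_out = set(solutes[:n_test])
    train = [r for r in records if r["solute"] not in held_out]
    test = [r for r in records if r["solute"] in held_out]
    return train, test


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between measured and predicted logS."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical toy records (invented values, not data from the study).
records = [
    {"solute": "A", "solvent": "ethanol", "temp_k": 298.15, "logS": -1.2},
    {"solute": "A", "solvent": "acetone", "temp_k": 308.15, "logS": -0.9},
    {"solute": "B", "solvent": "ethanol", "temp_k": 298.15, "logS": -2.4},
    {"solute": "B", "solvent": "toluene", "temp_k": 318.15, "logS": -1.8},
    {"solute": "C", "solvent": "acetone", "temp_k": 298.15, "logS": -0.5},
    {"solute": "D", "solvent": "ethanol", "temp_k": 303.15, "logS": -3.1},
]

train, test = split_by_solute(records, test_fraction=0.25, seed=0)
train_solutes = {r["solute"] for r in train}
test_solutes = {r["solute"] for r in test}
```

A random split over individual records would let the same solute appear in both sets at different temperatures or in different solvents, which would be expected to give a more optimistic error estimate than the grouped split sketched here.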