Blinded Predictions and Post Hoc Analysis of the Second Solubility Challenge Data: Exploring Training Data and Feature Set Selection for Machine and Deep Learning Models.

Affiliations

Department of Pure and Applied Chemistry, University of Strathclyde, Thomas Graham Building, 295 Cathedral Street, Glasgow G1 1XL, U.K.

Drug Metabolism and Pharmacokinetics, Research and Early Development, Respiratory & Immunology, BioPharmaceuticals R&D, AstraZeneca, Pepparedsleden 1, SE-431 83 Göteborg, Sweden.

Publication information

J Chem Inf Model. 2023 Feb 27;63(4):1099-1113. doi: 10.1021/acs.jcim.2c01189. Epub 2023 Feb 9.

Abstract

Accurate methods to predict solubility from molecular structure are highly sought after in the chemical sciences. To assess the state of the art, the American Chemical Society organized a "Second Solubility Challenge" in 2019, in which competitors were invited to submit blinded predictions of the solubilities of 132 drug-like molecules. In the first part of this article, we describe the development of two models that were submitted to the Blind Challenge in 2019 but which have not previously been reported. These models were based on computationally inexpensive molecular descriptors and traditional machine learning algorithms and were trained on a relatively small data set of 300 molecules. In the second part of the article, to test the hypothesis that predictions would improve with more advanced algorithms and higher volumes of training data, we compare these original predictions with those made after the deadline using deep learning models trained on larger solubility data sets consisting of 2999 and 5697 molecules. The results show that there are several algorithms that are able to obtain near state-of-the-art performance on the solubility challenge data sets, with the best model, a graph convolutional neural network, resulting in an RMSE of 0.86 log units. Critical analysis of the models reveals systematic differences between the performance of models using certain feature sets and training data sets. The results suggest that careful selection of high quality training data from relevant regions of chemical space is critical for prediction accuracy but that other methodological issues remain problematic for machine learning solubility models, such as the difficulty in modeling complex chemical spaces from sparse training data sets.
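The headline figure above is a root-mean-square error (RMSE) in log solubility units. As a minimal sketch of how such a figure is computed, the Python below evaluates RMSE over a handful of molecules. The numeric values and the names `observed_log_s` and `predicted_log_s` are hypothetical illustrations, not data from the article.

```python
import math


def rmse(observed, predicted):
    """Root-mean-square error between two equal-length sequences."""
    if len(observed) != len(predicted) or len(observed) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical log-solubility values, for illustration only.
observed_log_s = [-3.0, -4.5, -2.0, -5.0]
predicted_log_s = [-2.5, -4.0, -3.0, -5.5]

error = rmse(observed_log_s, predicted_log_s)
print(f"RMSE = {error:.2f} log units")
```

Because the errors are squared before averaging, a single badly mispredicted molecule raises the RMSE more than several small misses do, which is worth keeping in mind when comparing models on a test set of only 132 compounds.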

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/0c3c/9976279/f4d30cc4bc02/ci2c01189_0001.jpg
