Kombo David C, Stepp J David, Lim Sungtaek, Elshorst Bettina, Li Yi, Cato Laura, Shomali Maysoun, Fink David, LaMarche Matthew J
Integrated Drug Discovery, Sanofi, 350 Water St., Cambridge, Massachusetts 02141, United States.
CMC Synthetics Early Development Analytics, Sanofi, Industriepark Hochst, Frankfurt 65926, Germany.
ACS Omega. 2024 Jun 18;9(26):28691-28706. doi: 10.1021/acsomega.4c02886. eCollection 2024 Jul 2.
To facilitate the triage of hits from small molecule screens, we have used various AI/ML techniques and experimentally observed data sets to build models aimed at predicting colloidal aggregation of small organic molecules in aqueous solution. We have found that Naïve Bayesian and deep neural networks outperform logistic regression, recursive partitioning tree, support vector machine, and random forest techniques by having the lowest balanced error rate (BER) for the test set. The derived predictive classification models consistently and successfully discriminated aggregator molecules from nonaggregator hits. An analysis of molecular descriptors favoring colloidal aggregation confirms previous observations (hydrophobicity, molecular weight, and solubility) and identifies previously undescribed molecular descriptors such as the fraction of sp3 carbon atoms (Fsp3) and the electrotopological state of hydroxyl groups (ES_Sum_sOH). Naïve Bayesian modeling and scaffold tree analysis have revealed the chemical features/scaffolds contributing the most to colloidal aggregation and nonaggregation, respectively. These results highlight the importance of scaffolds with high Fsp3 values in promoting nonaggregation. Matched molecular pair analysis (MMPA) has also deciphered context-dependent substitutions, which can be used to design nonaggregator molecules. We found that most matched molecular pairs have a neutral effect on aggregation propensity. We have prospectively applied our predictive models to assist in chemical library triage for optimal plate selection diversity and purchase for high throughput screening (HTS) in drug discovery projects.
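The abstract ranks models by balanced error rate (BER) without defining it. A minimal sketch follows, assuming the usual definition of BER as the mean of the false positive rate and the false negative rate; the function name, the label encoding (1 = aggregator, 0 = nonaggregator), and the example labels are illustrative assumptions, not taken from the paper.

```python
def balanced_error_rate(y_true, y_pred):
    """Mean of the per-class error rates for a binary classifier.

    Assumed encoding (hypothetical): 1 = aggregator, 0 = nonaggregator.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    # Count the four confusion-matrix cells.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")

    false_negative_rate = fn / (tp + fn)  # aggregators that were missed
    false_positive_rate = fp / (tn + fp)  # nonaggregators wrongly flagged
    return 0.5 * (false_negative_rate + false_positive_rate)


# Made-up labels for illustration only (not data from the study).
example_true = [1, 1, 1, 1, 0, 0]
example_pred = [1, 1, 1, 0, 0, 1]
example_ber = balanced_error_rate(example_true, example_pred)
```

Because each class contributes equally regardless of its size, BER is less flattering than plain accuracy when aggregators and nonaggregators are unevenly represented in a test set.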