Zhang Chonghuan, Lin Qianghua, Yang Chenxi, Kong Yaxian, Yu Zhunzhun, Liao Kuangbiao
Guangzhou Municipal and Guangdong Provincial Key Laboratory of Molecular Target & Clinical Pharmacology, The NMPA and State Key Laboratory of Respiratory Disease, School of Pharmaceutical Sciences, Guangzhou Laboratory, Guangzhou Medical University, Guangzhou, Guangdong, PR China 511436
Guangzhou National Laboratory, No. 9 Xingdaohuanbei Road, Guangzhou International Bio Island, Guangzhou, Guangdong, PR China 510005
Chem Sci. 2025 Jun 5. doi: 10.1039/d5sc03364k.
Amide coupling is an important reaction widely applied in medicinal chemistry. However, recommending reaction conditions remains challenging because the condition space is broad. Recently, machine learning has emerged as a novel and efficient way to recommend suitable conditions for a desired transformation. Nonetheless, accurately predicting yields is difficult because of the complex relationships involved. Herein, we present our strategy to address this problem. Two steps were taken to ensure the quality of the dataset. First, we selected a diverse and representative set of substrates using an unbiased machine-based sampling approach, so as to capture a broad spectrum of substrate structures and reaction conditions. Second, experiments were conducted on our in-house high-throughput experimentation (HTE) platform to minimize the influence of human factors. Additionally, we proposed an intermediate knowledge-embedded strategy to enhance the model's robustness. The performance of the model was first evaluated at three different levels: random split, partial substrate novelty, and full substrate novelty. All model metrics in these cases improved dramatically, reaching an R² of 0.89, an MAE of 6.1%, and an RMSE of 8.0% on the full substrate novelty test dataset. Moreover, the generalization of our strategy was assessed using external datasets from the reported literature, giving an R² of 0.71, an MAE of 7%, and an RMSE of 10%. In addition, the model could recommend suitable conditions that raised the yields of some reactions, and it was able to identify which reaction in a pair exhibiting a reactivity cliff had the higher yield. In summary, our research demonstrated the feasibility of achieving accurate yield predictions by combining HTE with the embedding of intermediate knowledge into the model. This approach also has the potential to facilitate other related machine learning tasks.
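To make the evaluation terms above concrete, the following is a minimal sketch of how the three reported metrics and the substrate-novelty levels could be computed. It is not the authors' code. The function names, the representation of a reaction as an (acid, amine) pair, and the reading of "partial" as one unseen substrate and "full" as both substrates unseen are assumptions made for illustration. Yields are taken to be in percent, so MAE and RMSE come out in percentage points.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return R^2, MAE and RMSE for measured vs. predicted yields (in %)."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "mae": sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n,
        "rmse": math.sqrt(ss_res / n),
    }


def novelty_level(reaction, train_acids, train_amines):
    """Classify a test reaction by how many of its substrates appear in training.

    Assumption (hypothetical definition, not taken from the paper):
    both substrates seen -> "seen", exactly one seen -> "partial",
    neither seen -> "full".
    """
    acid, amine = reaction
    n_seen = int(acid in train_acids) + int(amine in train_amines)
    return {2: "seen", 1: "partial", 0: "full"}[n_seen]


if __name__ == "__main__":
    # Toy numbers chosen for illustration only; they are not data from the study.
    metrics = regression_metrics([10.0, 50.0, 90.0], [20.0, 50.0, 80.0])
    print(metrics)
    print(novelty_level(("acid_1", "amine_9"), {"acid_1"}, {"amine_1"}))
```

Under this reading, the full substrate novelty setting is the strictest of the three, since the model must predict yields for acids and amines it has never encountered during training.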