Matthew Ball, Dragos Horvath, Thierry Kogej, Mikhail Kabeshov, Alexandre Varnek
Molecular AI, Discovery Sciences, R&D, AstraZeneca, 431 83 Gothenburg, Sweden.
Laboratory of Cheminformatics, University of Strasbourg, 67081 Strasbourg, France.
Chem Sci. 2025 Aug 6. doi: 10.1039/d5sc03045e.
The selection of optimal reaction conditions is a critical challenge in synthetic chemistry, influencing the efficiency, sustainability, and scalability of chemical processes. While machine learning (ML) has emerged as a promising tool for predicting reaction conditions in computer-aided synthesis planning (CASP), existing approaches face many significant challenges, including data quality, sparsity, choice of reaction representation and method evaluation. Recent studies have suggested that these models may fail to surpass literature-derived popularity baselines, underscoring these problems. In this work, we provide a critical review of state-of-the-art ML techniques, identifying innovations which have addressed the key challenges facing researchers when modelling conditions. To illustrate how relevant reaction representations can improve existing models, we perform a case study of heteroaromatic Suzuki-Miyaura reactions, derived from US patent data (USPTO). Using Condensed Graph of Reaction-based inputs, we demonstrate how this alternative representation can enhance the predictive power of a model beyond popularity baselines. Finally, we propose future directions for the field beyond improving data quality, suggesting potential options to mitigate data issues prevalent in existing literature data. This perspective aims to guide researchers in understanding and overcoming current limitations in computational reaction condition prediction.