Bradshaw John, Zhang Anji, Mahjour Babak, Graff David E, Segler Marwin H S, Coley Connor W
Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States.
Department of Chemistry, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States.
ACS Cent Sci. 2025 Mar 12;11(4):539-549. doi: 10.1021/acscentsci.5c00055. eCollection 2025 Apr 23.
Deep learning models for anticipating the products of organic reactions have found many use cases, including validating retrosynthetic pathways and constraining synthesis-based molecular design tools. Despite compelling performance on popular benchmark tasks, strange and erroneous predictions sometimes ensue when using these models in practice. The core issue is that common benchmarks test models in an in-distribution setting, whereas many real-world uses for these models are in out-of-distribution settings and require a greater degree of extrapolation. To better understand how current reaction predictors work in out-of-distribution domains, we report a series of more challenging evaluations of a prototypical SMILES-based deep learning model. First, we illustrate how performance on randomly sampled data sets is overly optimistic compared to performance when generalizing to new patents or new authors. Second, we conduct time splits that evaluate how models perform when tested on reactions published years after those in their training set, mimicking real-world deployment. Finally, we consider extrapolation across reaction classes to reflect what would be required for the discovery of novel reaction types. This panel of tasks can reveal the capabilities and limitations of today's reaction predictors, acting as a crucial first step in the development of tomorrow's next-generation models capable of reaction discovery.
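The difference between a random split, a split by patent (or author), and a time split can be made concrete with a small sketch. This is not the authors' code: the record fields `rxn`, `patent`, and `year`, the function names, and the toy data are all assumptions made for illustration, and the reaction strings are placeholders rather than real SMILES.

```python
import random
from collections import defaultdict


def random_split(records, test_frac=0.2, seed=0):
    """In-distribution baseline: shuffle individual reactions and cut."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_frac))
    return shuffled[n_test:], shuffled[:n_test]


def group_split(records, key, test_frac=0.2, seed=0):
    """Hold out whole groups (e.g. patents or authors) so that no group
    contributes reactions to both train and test.
    Note: test_frac is a fraction of groups, not of reactions."""
    groups = defaultdict(list)
    for r in records:
        groups[r[key]].append(r)
    ids = sorted(groups)
    rng = random.Random(seed)
    rng.shuffle(ids)
    n_test = max(1, int(round(len(ids) * test_frac)))
    test_ids = set(ids[:n_test])
    train = [r for r in records if r[key] not in test_ids]
    test = [r for r in records if r[key] in test_ids]
    return train, test


def time_split(records, cutoff_year, gap_years=0):
    """Train on reactions up to cutoff_year; test on reactions published
    more than gap_years after the cutoff."""
    train = [r for r in records if r["year"] <= cutoff_year]
    test = [r for r in records if r["year"] > cutoff_year + gap_years]
    return train, test


# Hypothetical toy data: 50 reactions, 5 patents, years 2000-2009.
records = [
    {"rxn": f"R{i}>>P{i}", "patent": f"US{i % 5}", "year": 2000 + i % 10}
    for i in range(50)
]

rand_train, rand_test = random_split(records)
pat_train, pat_test = group_split(records, key="patent")
time_train, time_test = time_split(records, cutoff_year=2005, gap_years=2)
```

Under a random split, reactions from the same patent can land on both sides of the cut, which is one plausible reason such benchmarks look optimistic; the group and time splits remove that overlap by construction.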