Department of Chemical Engineering and Biotechnology, University of Cambridge, Cambridge CB3 0AS, U.K.
J Chem Inf Model. 2024 May 13;64(9):3790-3798. doi: 10.1021/acs.jcim.4c00292. Epub 2024 Apr 22.
Machine learning has the potential to bring tremendous value to the life sciences by providing models that aid in the discovery of new molecules and shorten the time for new products to reach the market. Chemical reactions play a significant role in these fields, but there is a lack of high-quality open-source chemical reaction data sets for training machine learning models. Herein, we present ORDerly, an open-source Python package for the customizable and reproducible preparation of reaction data stored in accordance with the increasingly popular Open Reaction Database (ORD) schema. We use ORDerly to clean United States patent data stored in ORD and generate data sets for forward prediction and retrosynthesis, as well as the first benchmark for reaction condition prediction. We train neural networks on data sets generated with ORDerly for condition prediction and show that data sets missing key cleaning steps can lead to silently inflated performance metrics. Additionally, we train transformers for forward and retrosynthesis prediction and demonstrate how non-patent data can be used to evaluate model generalization. By providing a customizable open-source solution for cleaning and preparing large chemical reaction data sets, ORDerly is poised to push forward the boundaries of machine learning applications in chemistry.
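One way a skipped cleaning step can inflate metrics is through duplicate reactions that land on both sides of a train/test split. The following is a minimal sketch of that effect under stated assumptions. It is not ORDerly's API: the function names (`canonical_key`, `deduplicate`, `split`, `leaked_fraction`) and the toy SMILES records are hypothetical, and reactions are compared as plain strings, whereas a real pipeline would first canonicalize the SMILES with a cheminformatics toolkit.

```python
import random

def canonical_key(reaction):
    """Order-insensitive key for a reaction given as (reactants, product)."""
    reactants, product = reaction
    return (tuple(sorted(reactants)), product)

def deduplicate(reactions):
    """Keep the first occurrence of each reaction, ignoring reactant order."""
    seen = set()
    unique = []
    for rxn in reactions:
        key = canonical_key(rxn)
        if key not in seen:
            seen.add(key)
            unique.append(rxn)
    return unique

def split(reactions, test_fraction=0.25, seed=0):
    """Shuffle and split into (train, test); the test set has at least one item."""
    shuffled = list(reactions)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def leaked_fraction(train, test):
    """Fraction of test reactions that also appear in the training set."""
    train_keys = {canonical_key(r) for r in train}
    return sum(canonical_key(r) in train_keys for r in test) / len(test)

# Hypothetical toy records: four distinct reactions, each repeated three times,
# plus one copy with the reactants listed in the opposite order.
unique_reactions = [
    (("CCO", "CC(=O)O"), "CCOC(C)=O"),
    (("Brc1ccccc1", "OB(O)c1ccccc1"), "c1ccc(-c2ccccc2)cc1"),
    (("CN", "CC(=O)Cl"), "CNC(C)=O"),
    (("CCBr", "[OH-]"), "CCO"),
]
raw = unique_reactions * 3 + [(("CC(=O)O", "CCO"), "CCOC(C)=O")]
clean = deduplicate(raw)

raw_train, raw_test = split(raw)
clean_train, clean_test = split(clean)

print("leak without deduplication:", leaked_fraction(raw_train, raw_test))
print("leak with deduplication:", leaked_fraction(clean_train, clean_test))
```

After deduplication, no test reaction can appear in the training set, so the measured leak is zero. Without it, a model can score well on test reactions simply by having memorized them, which is the kind of silent inflation the abstract describes.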