ORDerly: Data Sets and Benchmarks for Chemical Reaction Data.

Affiliation

Department of Chemical Engineering and Biotechnology, University of Cambridge, Cambridge CB3 0AS, U.K.

Publication information

J Chem Inf Model. 2024 May 13;64(9):3790-3798. doi: 10.1021/acs.jcim.4c00292. Epub 2024 Apr 22.

Abstract

Machine learning has the potential to provide tremendous value to life sciences by providing models that aid in the discovery of new molecules and reduce the time for new products to come to market. Chemical reactions play a significant role in these fields, but there is a lack of high-quality open-source chemical reaction data sets for training machine learning models. Herein, we present ORDerly, an open-source Python package for the customizable and reproducible preparation of reaction data stored in accordance with the increasingly popular Open Reaction Database (ORD) schema. We use ORDerly to clean United States patent data stored in ORD and generate data sets for forward prediction, retrosynthesis, as well as the first benchmark for reaction condition prediction. We train neural networks on data sets generated with ORDerly for condition prediction and show that data sets missing key cleaning steps can lead to silently overinflated performance metrics. Additionally, we train transformers for forward and retrosynthesis prediction and demonstrate how non-patent data can be used to evaluate model generalization. By providing a customizable open-source solution for cleaning and preparing large chemical reaction data, ORDerly is poised to push forward the boundaries of machine learning applications in chemistry.
