ORDerly: Data Sets and Benchmarks for Chemical Reaction Data.

Affiliation

Department of Chemical Engineering and Biotechnology, University of Cambridge, Cambridge CB3 0AS, U.K.

Publication information

J Chem Inf Model. 2024 May 13;64(9):3790-3798. doi: 10.1021/acs.jcim.4c00292. Epub 2024 Apr 22.

DOI:10.1021/acs.jcim.4c00292

PMID:38648077

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC11094788/

Abstract

Machine learning has the potential to provide tremendous value to life sciences by providing models that aid in the discovery of new molecules and reduce the time for new products to come to market. Chemical reactions play a significant role in these fields, but there is a lack of high-quality open-source chemical reaction data sets for training machine learning models. Herein, we present ORDerly, an open-source Python package for the customizable and reproducible preparation of reaction data stored in accordance with the increasingly popular Open Reaction Database (ORD) schema. We use ORDerly to clean United States patent data stored in ORD and generate data sets for forward prediction, retrosynthesis, as well as the first benchmark for reaction condition prediction. We train neural networks on data sets generated with ORDerly for condition prediction and show that data sets missing key cleaning steps can lead to silently overinflated performance metrics. Additionally, we train transformers for forward and retrosynthesis prediction and demonstrate how non-patent data can be used to evaluate model generalization. By providing a customizable open-source solution for cleaning and preparing large chemical reaction data, ORDerly is poised to push forward the boundaries of machine learning applications in chemistry.
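
The abstract does not say which cleaning steps matter most. As one hedged illustration of how an omitted step can silently inflate metrics, the sketch below assumes the missing step is deduplication: if the same reaction appears several times, a random split puts copies in both the training and test sets, so a model can score well by memorizing. All names here are hypothetical, the reactions are toy strings, and `canonical_key` is a plain string normalizer standing in for real cheminformatics canonicalization; none of this is ORDerly's API.

```python
import random

def canonical_key(reaction: str) -> str:
    """Toy normalizer (assumption): sort the dot-separated species on each
    side of '>>' so reorderings of the same reaction share one key."""
    reactants, _, products = reaction.strip().partition(">>")

    def side(s: str) -> str:
        return ".".join(sorted(p.strip() for p in s.split(".") if p.strip()))

    return f"{side(reactants)}>>{side(products)}"

def deduplicate(reactions):
    """Keep the first reaction seen for each canonical key."""
    unique = {}
    for r in reactions:
        unique.setdefault(canonical_key(r), r)
    return list(unique.values())

def split(reactions, test_fraction=0.2, seed=0):
    """Random train/test split with a fixed seed."""
    shuffled = list(reactions)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def count_leaked(train, test):
    """Number of test reactions whose canonical key also occurs in train."""
    train_keys = {canonical_key(r) for r in train}
    return sum(canonical_key(r) in train_keys for r in test)

# Toy data: four distinct reactions, each repeated five times, plus one
# reordered copy of the first reaction.
unique_reactions = [
    "CCO.CC(=O)O>>CCOC(C)=O",
    "CBr.[OH-]>>CO",
    "C=C.Br>>CCBr",
    "CC(=O)Cl.N>>CC(N)=O",
]
raw_reactions = [r for r in unique_reactions for _ in range(5)]
raw_reactions.append("CC(=O)O.CCO>>CCOC(C)=O")

# Split without cleaning: copies of the same reaction land on both sides.
raw_train, raw_test = split(raw_reactions)
print("leaked test reactions without deduplication:",
      count_leaked(raw_train, raw_test), "of", len(raw_test))

# Deduplicate first, then split: no test reaction is present in train.
clean_train, clean_test = split(deduplicate(raw_reactions))
print("leaked test reactions with deduplication:",
      count_leaked(clean_train, clean_test), "of", len(clean_test))
```

In this toy setup every test reaction in the uncleaned split has a copy in the training set, so any accuracy measured on it would overstate generalization. Deduplicating before splitting removes that overlap.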

Similar articles

ORDerly: Data Sets and Benchmarks for Chemical Reaction Data.

J Chem Inf Model. 2024 May 13;64(9):3790-3798. doi: 10.1021/acs.jcim.4c00292. Epub 2024 Apr 22.

Simple User-Friendly Reaction Format.

Mol Inform. 2025 Jan;44(1):e202400361. doi: 10.1002/minf.202400361.

AiZynthTrain: Robust, Reproducible, and Extensible Pipelines for Training Synthesis Prediction Models.

J Chem Inf Model. 2023 Apr 10;63(7):1841-1846. doi: 10.1021/acs.jcim.2c01486. Epub 2023 Mar 23.

ASAS-NANP symposium: mathematical modeling in animal nutrition-Making sense of big data and machine learning: how open-source code can advance training of animal scientists.

J Anim Sci. 2023 Jan 3;101. doi: 10.1093/jas/skad317.

Chemprop: A Machine Learning Package for Chemical Property Prediction.

J Chem Inf Model. 2024 Jan 8;64(1):9-17. doi: 10.1021/acs.jcim.3c01250. Epub 2023 Dec 26.

RetroComposer: Composing Templates for Template-Based Retrosynthesis Prediction.

Biomolecules. 2022 Sep 19;12(9):1325. doi: 10.3390/biom12091325.

Gnocis: An integrated system for interactive and reproducible analysis and modelling of cis-regulatory elements in Python 3.

PLoS One. 2022 Sep 9;17(9):e0274338. doi: 10.1371/journal.pone.0274338. eCollection 2022.

Chemistry-informed molecular graph as reaction descriptor for machine-learned retrosynthesis planning.

Proc Natl Acad Sci U S A. 2022 Oct 11;119(41):e2212711119. doi: 10.1073/pnas.2212711119. Epub 2022 Oct 3.

Unified Deep Learning Model for Multitask Reaction Predictions with Explanation.

J Chem Inf Model. 2022 Mar 28;62(6):1376-1387. doi: 10.1021/acs.jcim.1c01467. Epub 2022 Mar 10.

The Open Reaction Database.

J Am Chem Soc. 2021 Nov 17;143(45):18820-18826. doi: 10.1021/jacs.1c09820. Epub 2021 Nov 2.

Cited by

Predicting reaction conditions: a data-driven perspective.

Chem Sci. 2025 Aug 6. doi: 10.1039/d5sc03045e.

ReactionT5: a pre-trained transformer model for accurate chemical reaction prediction with limited data.

J Cheminform. 2025 Aug 19;17(1):126. doi: 10.1186/s13321-025-01075-4.

Challenging Reaction Prediction Models to Generalize to Novel Chemistry.

ACS Cent Sci. 2025 Mar 12;11(4):539-549. doi: 10.1021/acscentsci.5c00055. eCollection 2025 Apr 23.

Cross-disciplinary perspectives on the potential for artificial intelligence across chemistry.

Chem Soc Rev. 2025 Apr 25. doi: 10.1039/d5cs00146c.

Carbohydrate Synthesis is Entering the Data-Driven Digital Era.

Chemistry. 2025 May 14;31(27):e202500289. doi: 10.1002/chem.202500289. Epub 2025 Apr 18.

Machine learning-guided strategies for reaction conditions design and optimization.

Beilstein J Org Chem. 2024 Oct 4;20:2476-2492. doi: 10.3762/bjoc.20.212. eCollection 2024.

Catalysing (organo-)catalysis: Trends in the application of machine learning to enantioselective organocatalysis.

Beilstein J Org Chem. 2024 Sep 10;20:2280-2304. doi: 10.3762/bjoc.20.196. eCollection 2024.
