Predicting Reaction Yields via Supervised Learning.

Affiliation

Department of Chemistry, Princeton University, Princeton, New Jersey 08544, United States.

Publication information

Acc Chem Res. 2021 Apr 20;54(8):1856-1865. doi: 10.1021/acs.accounts.0c00770. Epub 2021 Mar 31.

DOI:10.1021/acs.accounts.0c00770

Abstract

Numerous disciplines, such as image recognition and language translation, have been revolutionized by using machine learning (ML) to leverage big data. In organic synthesis, providing accurate chemical reactivity predictions with supervised ML could assist chemists with reaction prediction, optimization, and mechanistic interrogation.

To apply supervised ML to chemical reactions, one needs to define the object of prediction (e.g., yield, enantioselectivity, solubility, or a recommendation) and represent reactions with descriptive data. Our group's effort has focused on representing chemical reactions using DFT-derived physical features of the reacting molecules and conditions, which serve as features for building supervised ML models.

In this Account, we present a review and perspective on three studies conducted by our group where ML models have been employed to predict reaction yield. First, we focus on a small reaction data set where 16 phosphine ligands were evaluated in a single Ni-catalyzed Suzuki-Miyaura cross-coupling reaction, and the reaction yield was modeled with linear regression. In this setting, where the regression complexity is strongly limited by the amount of available data, we emphasize the importance of identifying single features that are directly relevant to reactivity. Next, we focus on models trained on two larger data sets obtained with high-throughput experimentation (HTE). With hundreds to thousands of reactions available, more complex models can be explored, for example, models that algorithmically perform feature selection from a broad set of candidate features. We examine how a variety of ML algorithms model these data sets and how well these models generalize to out-of-sample substrates. Specifically, we compare the ML models that use DFT-based featurization to a baseline model that is obtained with features that carry no physical information, that is, random features, and to a naive non-ML model that averages yields of reactions that share the same conditions and substrate combinations. We find that for only one of the two data sets, DFT-based featurization leads to a significant, although moderate, out-of-sample prediction improvement. The source of this improvement was further isolated to specific features which allowed us to formulate a testable mechanistic hypothesis that was validated experimentally. Finally, we offer remarks on supervised ML model building on HTE data sets focusing on algorithmic improvements in model training.

Statistical methods in chemistry have a rich history, but only recently has ML gained widespread attention in reaction development. As the untapped potential of ML is explored, novel tools are likely to arise from future research. Our studies suggest that supervised ML can lead to improved predictions of reaction yield over simpler modeling methods and facilitate mechanistic understanding of reaction dynamics. However, further research and development is required to establish ML as an indispensable tool in reactivity modeling.
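The comparison described in the abstract (a single physically meaningful feature versus a random feature versus a naive averaging baseline, on a very small data set) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the data are synthetic, the "descriptor" is a made-up stand-in for a DFT-derived feature, and the function names are hypothetical. It is not the authors' code, data, or evaluation protocol. Leave-one-out validation is used here only because it is a common choice when 16 data points are all that is available.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x with a single feature."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b  # (intercept, slope)


def loo_mae(xs, ys):
    """Leave-one-out mean absolute error of the single-feature linear model."""
    errors = []
    for i in range(len(xs)):
        a, b = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        errors.append(abs(a + b * xs[i] - ys[i]))
    return statistics.fmean(errors)


def loo_mae_mean_baseline(ys):
    """Leave-one-out error of a naive model that predicts the mean training yield."""
    errors = []
    for i in range(len(ys)):
        errors.append(abs(statistics.fmean(ys[:i] + ys[i + 1:]) - ys[i]))
    return statistics.fmean(errors)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
n_ligands = 16  # same size as the small data set mentioned in the abstract

# Synthetic stand-in for one physically meaningful descriptor (assumption).
descriptor = [rng.uniform(0.0, 1.0) for _ in range(n_ligands)]
# Synthetic yields (%) that depend linearly on the descriptor plus noise.
yields = [
    min(100.0, max(0.0, 20.0 + 60.0 * d + rng.gauss(0.0, 5.0)))
    for d in descriptor
]
# A feature that carries no information about the yields.
random_feature = [rng.uniform(0.0, 1.0) for _ in range(n_ligands)]

mae_descriptor = loo_mae(descriptor, yields)
mae_random = loo_mae(random_feature, yields)
mae_mean = loo_mae_mean_baseline(yields)

print(f"LOO MAE, informative descriptor: {mae_descriptor:.1f}")
print(f"LOO MAE, random feature:         {mae_random:.1f}")
print(f"LOO MAE, mean-yield baseline:    {mae_mean:.1f}")
```

In this toy setting the informative descriptor yields a lower held-out error than either baseline by construction. The point of the sketch is the protocol: a featurization is only credited with predictive value if it beats both a no-information feature and a trivial averaging model on data it was not fit to.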

Similar articles

Molecular Machine Learning for Chemical Catalysis: Prospects and Challenges.

Acc Chem Res. 2023 Feb 7;56(3):402-412. doi: 10.1021/acs.accounts.2c00801. Epub 2023 Jan 30.

Translational Metabolomics of Head Injury: Exploring Dysfunctional Cerebral Metabolism with Ex Vivo NMR Spectroscopy-Based Metabolite Quantification.

Importance of Engineered and Learned Molecular Representations in Predicting Organic Reactivity, Selectivity, and Chemical Properties.

Acc Chem Res. 2021 Feb 16;54(4):827-836. doi: 10.1021/acs.accounts.0c00745. Epub 2021 Feb 3.

Ultrahigh-Throughput Experimentation for Information-Rich Chemical Synthesis.

Acc Chem Res. 2021 May 18;54(10):2337-2346. doi: 10.1021/acs.accounts.1c00119. Epub 2021 Apr 23.

Feedback in Flow for Accelerated Reaction Development.

Acc Chem Res. 2016 Sep 20;49(9):1786-96. doi: 10.1021/acs.accounts.6b00261. Epub 2016 Aug 15.

Assessment and statistical modeling of the relationship between remotely sensed aerosol optical depth and PM2.5 in the eastern United States.

Res Rep Health Eff Inst. 2012 May(167):5-83; discussion 85-91.

Applications of Iridium-Catalyzed Asymmetric Allylic Substitution Reactions in Target-Oriented Synthesis.

Acc Chem Res. 2017 Oct 17;50(10):2539-2555. doi: 10.1021/acs.accounts.7b00300. Epub 2017 Sep 22.

Incorporating Synthetic Accessibility in Drug Design: Predicting Reaction Yields of Suzuki Cross-Couplings by Leveraging AbbVie's 15-Year Parallel Library Data Set.

J Am Chem Soc. 2024 Jun 5;146(22):15070-15084. doi: 10.1021/jacs.4c00098. Epub 2024 May 20.

Machine Learning Yield Prediction from NiCOlit, a Small-Size Literature Data Set of Nickel Catalyzed C-O Couplings.

J Am Chem Soc. 2022 Aug 17;144(32):14722-14730. doi: 10.1021/jacs.2c05302. Epub 2022 Aug 8.

Cited by

Discovery of Ni Complexes for CO Insertion Enabled by a Machine Learning-Computational-Selection Sequence.

J Am Chem Soc. 2025 Jul 30;147(30):26149-26157. doi: 10.1021/jacs.5c00441. Epub 2025 Jul 18.

Discriminating models of trait evolution.

bioRxiv. 2025 Jun 13:2025.06.12.659377. doi: 10.1101/2025.06.12.659377.

AI Approaches to Homogeneous Catalysis with Transition Metal Complexes.

ACS Catal. 2025 May 14;15(11):9089-9105. doi: 10.1021/acscatal.5c01202. eCollection 2025 Jun 6.

Intermediate knowledge enhanced the performance of the amide coupling yield prediction model.

Chem Sci. 2025 Jun 5. doi: 10.1039/d5sc03364k.

Local reaction condition optimization via machine learning.

J Mol Model. 2025 Apr 23;31(5):143. doi: 10.1007/s00894-025-06365-0.

Transfer Learning-Enabled Ligand Prediction for Ni-Catalyzed Atroposelective Suzuki-Miyaura Cross-Coupling Based on Mechanistic Similarity: Leveraging Pd Knowledge for Ni Discovery.

J Am Chem Soc. 2025 May 7;147(18):15318-15328. doi: 10.1021/jacs.5c00838. Epub 2025 Mar 28.

Parametrization of κ²-N,O-Oxazoline Preligands for Enantioselective Cobaltaelectro-Catalyzed C-H Activations.

ACS Catal. 2025 Feb 28;15(6):4450-4459. doi: 10.1021/acscatal.5c00250. eCollection 2025 Mar 21.

Designing Target-specific Data Sets for Regioselectivity Predictions on Complex Substrates.

J Am Chem Soc. 2025 Mar 5;147(9):7476-7484. doi: 10.1021/jacs.4c15902. Epub 2025 Feb 21.

An active representation learning method for reaction yield prediction with small-scale data.

Commun Chem. 2025 Feb 10;8(1):42. doi: 10.1038/s42004-025-01434-0.

A Holistic Data-Driven Approach to Synthesis Predictions of Colloidal Nanocrystal Shapes.

J Am Chem Soc. 2025 Feb 19;147(7):6116-6125. doi: 10.1021/jacs.4c17283. Epub 2025 Feb 7.
