Data-Efficient, Chemistry-Aware Machine Learning Predictions of Diels-Alder Reaction Outcomes.

Author information

Keto Angus, Guo Taicheng, Underdue Morgan, Stuyver Thijs, Coley Connor W, Zhang Xiangliang, Krenske Elizabeth H, Wiest Olaf

Affiliations

School of Chemistry and Molecular Biosciences, The University of Queensland, Brisbane, QLD 4072, Australia.

Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, Indiana 46556, United States.

Publication information

J Am Chem Soc. 2024 Jun 12;146(23):16052-16061. doi: 10.1021/jacs.4c03131. Epub 2024 Jun 1.

Abstract

The application of machine learning models to the prediction of reaction outcomes currently needs large and/or highly featurized data sets. We show that a chemistry-aware model, NERF, which mimics the bonding changes that occur during reactions, allows for highly accurate predictions of the outcomes of Diels-Alder reactions using a relatively small training set, with no pretraining and no additional features. We establish a diverse data set of 9537 intramolecular, hetero-, aromatic, and inverse electron demand Diels-Alder reactions. This data set is used to train a NERF model, and the performance is compared against state-of-the-art classification and generative machine learning models across low- and high-data regimes, with and without pretraining. The predictive accuracy (regio- and site selectivity in the major product) achieved by NERF exceeds 90% when as little as 40% of the data set is used for training. Another high-performing model, Chemformer, requires a larger training data set (>45%) and pretraining to reach 90% Top-1 accuracy. Accurate predictions of less-represented reaction subclasses, such as those involving heteroatomic or aromatic substrates, require higher percentages of training data. We also show how NERF can use small amounts of additional training data to quickly learn new systems and improve its overall understanding of reactivity. Synthetic chemists stand to benefit as this model can be rapidly expanded and tailored to areas of chemistry corresponding to the low-data regime.
