Department of Chemistry, Princeton University, Princeton, New Jersey 08544, United States.
Acc Chem Res. 2021 Apr 20;54(8):1856-1865. doi: 10.1021/acs.accounts.0c00770. Epub 2021 Mar 31.
Numerous disciplines, such as image recognition and language translation, have been revolutionized by using machine learning (ML) to leverage big data. In organic synthesis, providing accurate chemical reactivity predictions with supervised ML could assist chemists with reaction prediction, optimization, and mechanistic interrogation.
To apply supervised ML to chemical reactions, one needs to define the object of prediction (e.g., yield, enantioselectivity, solubility, or a recommendation) and represent reactions with descriptive data. Our group's effort has focused on representing chemical reactions using DFT-derived physical features of the reacting molecules and conditions, which serve as inputs for building supervised ML models.
In this Account, we present a review and perspective on three studies conducted by our group where ML models have been employed to predict reaction yield. First, we focus on a small reaction data set where 16 phosphine ligands were evaluated in a single Ni-catalyzed Suzuki-Miyaura cross-coupling reaction, and the reaction yield was modeled with linear regression. In this setting, where the regression complexity is strongly limited by the amount of available data, we emphasize the importance of identifying single features that are directly relevant to reactivity. Next, we focus on models trained on two larger data sets obtained with high-throughput experimentation (HTE). With hundreds to thousands of reactions available, more complex models can be explored, for example, models that algorithmically perform feature selection from a broad set of candidate features. We examine how a variety of ML algorithms model these data sets and how well these models generalize to out-of-sample substrates.
Specifically, we compare the ML models that use DFT-based featurization to a baseline model that is obtained with features that carry no physical information, that is, random features, and to a naive non-ML model that averages yields of reactions that share the same conditions and substrate combinations. We find that DFT-based featurization leads to a significant, although moderate, out-of-sample prediction improvement for only one of the two data sets. The source of this improvement was further isolated to specific features, which allowed us to formulate a testable mechanistic hypothesis that was validated experimentally. Finally, we offer remarks on supervised ML model building on HTE data sets, focusing on algorithmic improvements in model training.
Statistical methods in chemistry have a rich history, but only recently has ML gained widespread attention in reaction development. As the untapped potential of ML is explored, novel tools are likely to arise from future research. Our studies suggest that supervised ML can lead to improved predictions of reaction yield over simpler modeling methods and facilitate mechanistic understanding of reaction dynamics. However, further research and development are required to establish ML as an indispensable tool in reactivity modeling.
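The comparison between a physically informed featurization, a random-feature baseline, and a naive averaging model can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the data are synthetic (a single hypothetical `descriptor` stands in for a DFT-derived feature), the model is a one-feature least-squares fit rather than any algorithm used in the studies, and the naive baseline is simplified to the mean training yield under the same condition. Each substrate is held out in turn, so the reported errors are out-of-sample with respect to substrates.

```python
import random
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one feature)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        return 0.0, my
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx


def make_toy_data(n_substrates=20, n_conditions=4, seed=0):
    """Synthetic yields (NOT real data): one informative descriptor per
    substrate, one uninformative random feature, and a condition effect."""
    rng = random.Random(seed)
    descriptor = [rng.uniform(-1, 1) for _ in range(n_substrates)]
    noise_feature = [rng.uniform(-1, 1) for _ in range(n_substrates)]
    condition_effect = [rng.uniform(-10, 10) for _ in range(n_conditions)]
    rows = []
    for s in range(n_substrates):
        for c in range(n_conditions):
            y = 50 + 20 * descriptor[s] + condition_effect[c] + rng.gauss(0, 3)
            rows.append({"substrate": s, "condition": c, "yield": y,
                         "descriptor": descriptor[s],
                         "random": noise_feature[s]})
    return rows


def out_of_sample_rmse(rows, feature=None):
    """Leave-one-substrate-out RMSE.

    feature=None gives the naive baseline: the mean training yield of
    reactions run under the same condition. Otherwise a line is fitted to
    the residuals of that baseline using the named feature."""
    sq_errors = []
    for held_out in sorted({r["substrate"] for r in rows}):
        train = [r for r in rows if r["substrate"] != held_out]
        test = [r for r in rows if r["substrate"] == held_out]
        cond_mean = {
            c: mean(r["yield"] for r in train if r["condition"] == c)
            for c in {r["condition"] for r in train}
        }
        slope, intercept = 0.0, 0.0
        if feature is not None:
            residuals = [r["yield"] - cond_mean[r["condition"]] for r in train]
            slope, intercept = fit_line([r[feature] for r in train], residuals)
        for r in test:
            x = r[feature] if feature is not None else 0.0
            pred = cond_mean[r["condition"]] + slope * x + intercept
            sq_errors.append((pred - r["yield"]) ** 2)
    return mean(sq_errors) ** 0.5


rows = make_toy_data()
results = {
    "naive": out_of_sample_rmse(rows),
    "random": out_of_sample_rmse(rows, "random"),
    "descriptor": out_of_sample_rmse(rows, "descriptor"),
}
for name, value in results.items():
    print(f"{name:>10s}: RMSE = {value:.2f}")
```

In this toy setting the informative descriptor wins by construction, because the synthetic yields depend on it directly. The sketch is only meant to show the evaluation protocol: a featurization is judged by whether it beats both baselines on substrates the model has not seen, which, as noted above, held for only one of the two HTE data sets.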