Department of Chemistry, Indian Institute of Technology Bombay, Mumbai 400076, India.
Centre for Machine Intelligence and Data Science, Indian Institute of Technology Bombay, Mumbai 400076, India.
Acc Chem Res. 2023 Feb 7;56(3):402-412. doi: 10.1021/acs.accounts.2c00801. Epub 2023 Jan 30.
Conspectus: In the domain of reaction development, one aims to obtain higher efficacies as measured in terms of yield and/or selectivity. During the empirical cycles, a mix of outcomes ranging from low to high yields/selectivities is expected. While it is not easy to identify all of the factors that might impact the reaction efficiency, a complex and nonlinear dependence on the nature of reactants, catalysts, solvents, etc. is quite likely. Developmental stages of newer reactions would typically offer a few hundred samples with variations in participating molecules and/or reaction conditions. These "observations" and their "output" can be harnessed as valuable labeled data for developing molecular machine learning (ML) models. Once a robust ML model is built for a specific reaction under development, it can predict the reaction outcome for any new choice of substrates/catalyst in a few seconds to minutes and can thus expedite the identification of promising candidates for experimental validation. Recent years have witnessed impressive applications of ML in the molecular world, most of them aimed at predicting important chemical or biological properties. We believe that integrating effective ML workflows can greatly benefit reaction discovery.
As with any new technology, direct adaptation of ML as used in well-developed domains, such as natural language processing (NLP) and image recognition, is unlikely to succeed in reaction discovery. Some of the challenges stem from ineffective featurization of the molecular space, the scarcity and distribution of quality data, and the difficulty of choosing the right ML model and deploying it in a technically robust manner. It should be noted that there is no universal ML model suitable for an inherently high-dimensional problem such as chemical reactions. Against this background, rendering ML tools conducive for reactions is both an exciting and a challenging endeavor.
With the increased availability of efficient ML algorithms, we focused on tapping their potential for small-data reaction discovery (a few hundred to a few thousand samples).
In this Account, we describe both feature engineering and feature learning approaches for molecular ML as applied to diverse reactions of high contemporary interest. Among these, catalytic asymmetric hydrogenation of imines/alkenes, β-C(sp³)-H bond functionalization, and the relay Heck reaction were studied with a feature engineering approach that uses quantum-chemically derived physical organic descriptors as the molecular features, in each case to predict the enantioselectivity. We describe how molecular features are selected and customized for a reaction of interest, and emphasize the chemical insights that can be gathered through the use of such features. Feature learning methods for predicting the yield of Buchwald-Hartwig cross-coupling and deoxyfluorination of alcohols, and the enantioselectivity of N,S-acetal formation, are found to offer excellent predictions. We propose a transfer learning protocol, wherein an ML model such as a language model is trained on a large number of molecules (10⁵-10⁶) and fine-tuned on a focused library of target task reactions, as an effective alternative for small-data reaction discovery (10²-10³ reactions). The exploitation of the deep neural network latent space for generative tasks, to identify useful substrates for a reaction, is demonstrated as a promising strategy.
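The pretrain-then-fine-tune idea behind the transfer learning protocol can be illustrated with a deliberately small toy. The sketch below is not the authors' implementation: a linear model trained by stochastic gradient descent stands in for the language model, the data are synthetic numbers generated from two related linear rules (not molecules or reactions), and every name in it (`make_data`, `sgd`, `W_SOURCE`, `W_TARGET`) is hypothetical. The dataset sizes (2000 source samples, 20 target samples) are also scaled far down from those in the text. It only shows the mechanics: weights learned on a large source set are reused as the starting point for a brief fit on a small target set, and compared with training from zero on the same small set.

```python
import random


def make_data(weights, n, rng, noise=0.05):
    """Synthetic (features, target) pairs from a noisy linear rule (toy data)."""
    data = []
    for _ in range(n):
        x = [rng.uniform(-1.0, 1.0) for _ in weights]
        y = sum(w * xi for w, xi in zip(weights, x)) + rng.gauss(0.0, noise)
        data.append((x, y))
    return data


def sgd(weights, data, epochs, lr=0.05):
    """Stochastic gradient descent on squared error; returns updated weights."""
    w = list(weights)
    for _ in range(epochs):
        for x, y in data:
            err = sum(wi * xi for wi, xi in zip(w, x)) - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
    return w


def mse(weights, data):
    """Mean squared error of the linear model on a dataset."""
    total = 0.0
    for x, y in data:
        pred = sum(wi * xi for wi, xi in zip(weights, x))
        total += (pred - y) ** 2
    return total / len(data)


rng = random.Random(0)  # fixed seed so the toy is reproducible

# Assumption: the source and target tasks follow similar, not identical, rules.
W_SOURCE = [1.0, -2.0, 0.5, 3.0, -1.5]
W_TARGET = [1.2, -2.1, 0.6, 2.8, -1.4]

source_data = make_data(W_SOURCE, 2000, rng)   # large "pretraining" set
target_train = make_data(W_TARGET, 20, rng)    # small "target task" set
target_test = make_data(W_TARGET, 200, rng)    # held-out target samples

zeros = [0.0] * len(W_SOURCE)
pretrained = sgd(zeros, source_data, epochs=5)          # step 1: pretrain
fine_tuned = sgd(pretrained, target_train, epochs=3)    # step 2: fine-tune
from_scratch = sgd(zeros, target_train, epochs=3)       # baseline: no pretraining

mse_fine_tuned = mse(fine_tuned, target_test)
mse_scratch = mse(from_scratch, target_test)
print("fine-tuned MSE:", mse_fine_tuned)
print("from-scratch MSE:", mse_scratch)
```

In this toy the fine-tuned model benefits only because the source and target rules were chosen to be similar; how much a pretrained molecular model helps on a real reaction dataset depends on how related the two tasks are.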