Roger Adams Laboratory, Department of Chemistry, University of Illinois, Urbana, Illinois 61801, United States.
Acc Chem Res. 2021 May 4;54(9):2041-2054. doi: 10.1021/acs.accounts.0c00826. Epub 2021 Apr 15.
Catalyst design in enantioselective catalysis has historically been driven by empiricism. In this endeavor, experimentalists attempt to qualitatively identify trends in structure that lead to a desired catalyst function. In this body of work, we lay the groundwork for an improved, alternative workflow that uses quantitative methods to inform decision making at every step of the process. At the outset, we define a library of synthetically accessible permutations of a catalyst scaffold with the philosophy that the library contains every potential catalyst we are willing to make. To represent these chiral molecules, we have developed general 3D representations, which can be calculated for tens of thousands of structures. This defines the total chemical space of a given catalyst scaffold; it is constructed on the basis of catalyst structure only, without regard to a specific reaction or mechanism. As such, any unsupervised algorithmic subset selection method (i.e., one that considers only catalyst structure) should provide an ideal initial screening set for any new reaction that can be catalyzed by that scaffold. Notably, because of this design strategy, the same set of catalysts can be used for any reaction that can be catalyzed with that parent catalyst scaffold. The catalysts in this screening set are tested experimentally, and statistical learning tools can be used to create a model relating catalyst structure to catalyst function. Further, this model can be used to predict the performance of each catalyst candidate in the greater database of virtual catalyst candidates. In this way, it is possible to estimate the performance of tens of thousands of catalysts by experimentally testing a smaller subset. Using error assessment metrics, it is possible to understand the confidence in new predictions. An experimentalist using this tool can balance the predicted results (reward) with the prediction confidence (risk) when deciding which catalysts to synthesize next in an optimization campaign.
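The balance between predicted reward and prediction risk can be illustrated with a minimal sketch. The abstract does not name a specific model or error metric, so everything here is an assumption made for illustration: the descriptors and measurements are random placeholders, the model is a bootstrap ensemble of ridge regressions, the ensemble spread stands in for prediction confidence, and `risk_weight` is a hypothetical tuning parameter.

```python
import numpy as np

def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge regression; the intercept is penalized too, for simplicity."""
    A = np.hstack([X, np.ones((len(X), 1))])
    return np.linalg.solve(A.T @ A + alpha * np.eye(A.shape[1]), A.T @ y)

def predict_with_spread(X_train, y_train, X_all, n_models=50, seed=0):
    """Bootstrap ensemble: the mean is the prediction, the std a crude confidence proxy."""
    rng = np.random.default_rng(seed)
    A_all = np.hstack([X_all, np.ones((len(X_all), 1))])
    preds = []
    for _ in range(n_models):
        # Resample the tested catalysts with replacement and refit.
        idx = rng.integers(0, len(y_train), size=len(y_train))
        preds.append(A_all @ fit_ridge(X_train[idx], y_train[idx]))
    preds = np.array(preds)
    return preds.mean(axis=0), preds.std(axis=0)

# Placeholder data only: random descriptors and a made-up linear response.
rng = np.random.default_rng(1)
library = rng.normal(size=(300, 5))   # virtual catalyst library (assumed descriptors)
tested = np.arange(20)                # stand-in for the initial screening set
hidden_w = np.array([1.5, -2.0, 0.5, 0.0, 1.0])
measured = library[tested] @ hidden_w + rng.normal(scale=0.1, size=len(tested))

mean, spread = predict_with_spread(library[tested], measured, library)

risk_weight = 1.0                      # hypothetical: larger values favor safer picks
score = mean - risk_weight * spread    # reward discounted by uncertainty
score[tested] = -np.inf                # do not re-propose catalysts already tested
next_batch = np.argsort(score)[::-1][:5]
```

Setting `risk_weight` to zero ranks purely on predicted performance, while a larger value steers the choice toward candidates the ensemble agrees on.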
The catalysts chosen on this basis are then synthesized and tested experimentally. At this stage, either the optimization is a success or the predicted values were incorrect and further optimization is required. In the latter case, the new results can be fed back into the statistical learning model to refine it, and this iterative process can be used to determine the optimal catalyst. In this body of work, we not only establish this workflow but also quantitatively establish how best to execute each step. Herein, we evaluate several 3D molecular representations to determine how best to represent molecules. Several selection protocols are examined to decide which set of molecules best represents the library of interest. In addition, the number of reactions needed to build accurate statistical learning models is evaluated. Taken together, these components establish a tool ready to progress from the development stage to the utility stage. As such, current research endeavors focus on applying these tools to optimize new reactions.
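As an illustration of what a structure-only selection protocol might look like, the sketch below uses greedy max-min (farthest-point) sampling over a descriptor matrix. This is one common diversity-selection technique chosen for illustration, not necessarily a protocol evaluated in the work; the descriptor matrix is random placeholder data and the function name `select_diverse_subset` is hypothetical.

```python
import numpy as np

def select_diverse_subset(descriptors, n_select):
    """Greedy max-min (farthest-point) selection using catalyst structure only."""
    X = np.asarray(descriptors, dtype=float)
    # Standardize each descriptor column so no single feature dominates distances.
    std = X.std(axis=0)
    std[std == 0] = 1.0
    Z = (X - X.mean(axis=0)) / std
    # Start from the candidate closest to the centroid of the library.
    first = int(np.argmin(np.linalg.norm(Z, axis=1)))
    chosen = [first]
    # Track each candidate's distance to its nearest already-chosen member.
    min_dist = np.linalg.norm(Z - Z[first], axis=1)
    while len(chosen) < n_select:
        # Pick the candidate farthest from everything chosen so far.
        nxt = int(np.argmax(min_dist))
        chosen.append(nxt)
        min_dist = np.minimum(min_dist, np.linalg.norm(Z - Z[nxt], axis=1))
    return chosen

# Placeholder library: 500 virtual catalysts with 8 assumed descriptors each.
rng = np.random.default_rng(0)
library = rng.normal(size=(500, 8))
screening_set = select_diverse_subset(library, 24)
```

Because no reaction outcome enters the selection, the same `screening_set` could in principle be reused for any reaction run with that scaffold, which is the property the abstract emphasizes.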