Friederich Pascal, Dos Passos Gomes Gabriel, De Bin Riccardo, Aspuru-Guzik Alán, Balcells David
Chemical Physics Theory Group, Department of Chemistry, University of Toronto, Toronto, Ontario M5S 3H6, Canada.
Institute of Nanotechnology, Karlsruhe Institute of Technology, Hermann-von-Helmholtz-Platz 1, 76344 Eggenstein-Leopoldshafen, Germany.
Chem Sci. 2020 Apr 7;11(18):4584-4601. doi: 10.1039/d0sc00445f. eCollection 2020 May 14.
Homogeneous catalysis with transition metal complexes is ubiquitous in organic synthesis and technologically relevant in applications such as water splitting and CO₂ reduction. The key steps underlying homogeneous catalysis require a specific combination of electronic and steric effects from the ligands bound to the metal center. Finding the optimal combination of ligands is challenging because of the exceedingly large number of possibilities and the non-trivial ligand-ligand interactions. The classic example of Vaska's complex, trans-[Ir(PPh₃)₂(CO)(Cl)], illustrates this scenario. The ligands of this species activate iridium for the oxidative addition of hydrogen, yielding the dihydride cis-[Ir(H)₂(PPh₃)₂(CO)(Cl)] complex. Despite the simplicity of this system, thousands of derivatives can be formulated for the activation of H₂, with a limited number of ligands belonging to the same general categories found in the original complex. In this work, we show how DFT and machine learning (ML) methods can be combined to enable the prediction of reactivity within large chemical spaces containing thousands of complexes. In a space of 2574 species derived from Vaska's complex, data from DFT calculations are used to train and test ML models that predict the H₂-activation barrier. In contrast to experiments and calculations that require several days to complete, the ML models were trained and used on a laptop on a time-scale of minutes. As a first approach, we combined Bayesian-optimized artificial neural networks (ANN) with features derived from autocorrelation and deltametric functions. The resulting ANNs achieved high accuracies, with mean absolute errors (MAE) between 1 and 2 kcal mol⁻¹, depending on the size of the training set. By using a Gaussian process (GP) model trained with a set of selected features, including fingerprints, accuracy was further enhanced.
Remarkably, this GP model reduced the MAE below 1 kcal mol⁻¹ while using only 20% or less of the available data for training. The gradient boosting (GB) method was also used to assess the relevance of the features, which served both feature selection and model interpretation. Features accounting for chemical composition, atom size and electronegativity were found to be the most influential in the predictions. Further, the ligand fragments with the strongest influence on the H₂-activation barrier were identified.
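
The autocorrelation and deltametric features mentioned above can be illustrated with a minimal sketch. It assumes the commonly used metal-centered definitions (a sum of property products, or of property differences, over all atoms a fixed number of bonds away from the metal); the property set, depths and scopes actually used in the work may differ. The molecular graph is a simplified Vaska's complex with the phenyl groups omitted, and all function and variable names are hypothetical.

```python
from collections import deque

# Simplified graph of Vaska's complex: phenyl groups on P are omitted
# (an assumption made only to keep the example short).
atoms = ["Ir", "Cl", "C", "O", "P", "P"]
bonds = [(0, 1), (0, 2), (2, 3), (0, 4), (0, 5)]  # Ir-Cl, Ir-C, C-O, Ir-P, Ir-P

# Pauling electronegativities, used here as the atomic property.
ELECTRONEGATIVITY = {"Ir": 2.20, "Cl": 3.16, "C": 2.55, "O": 3.44, "P": 2.19}
props = [ELECTRONEGATIVITY[symbol] for symbol in atoms]


def bond_distances(n_atoms, bonds, start):
    """Return {atom index: number of bonds from `start`} via breadth-first search."""
    adjacency = {i: [] for i in range(n_atoms)}
    for a, b in bonds:
        adjacency[a].append(b)
        adjacency[b].append(a)
    dist = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for neighbour in adjacency[current]:
            if neighbour not in dist:
                dist[neighbour] = dist[current] + 1
                queue.append(neighbour)
    return dist


def autocorrelation(props, bonds, start, depth):
    """Sum of P_start * P_j over atoms j exactly `depth` bonds from `start`."""
    dist = bond_distances(len(props), bonds, start)
    return sum(props[start] * props[j] for j, d in dist.items() if d == depth)


def deltametric(props, bonds, start, depth):
    """Sum of P_start - P_j over atoms j exactly `depth` bonds from `start`."""
    dist = bond_distances(len(props), bonds, start)
    return sum(props[start] - props[j] for j, d in dist.items() if d == depth)


# Metal-centered feature vector for depths 0, 1 and 2 (Ir is atom 0).
features = []
for depth in range(3):
    features.append(autocorrelation(props, bonds, 0, depth))
    features.append(deltametric(props, bonds, 0, depth))

print(features)
```

Each complex in a chemical space maps to one such fixed-length vector, which is what allows regressors such as an ANN or a GP to be trained on DFT barriers regardless of the number of atoms in each complex.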