Department of Chemical Engineering, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213, United States.
Department of Chemistry, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213, United States.
J Chem Inf Model. 2023 Dec 25;63(24):7642-7654. doi: 10.1021/acs.jcim.3c01226. Epub 2023 Dec 4.
Machine learning (ML) methods have shown promise for discovering novel catalysts but are often restricted to specific chemical domains. Generalizable ML models require large and diverse training data sets, which exist for heterogeneous catalysis but not for homogeneous catalysis. The tmQM data set, which contains properties of 86,665 transition metal complexes calculated at the TPSSh/def2-SVP level of density functional theory (DFT), provided a promising training data set for homogeneous catalyst systems. However, we find that ML models trained on tmQM consistently underpredict the energies of a chemically distinct subset of the data. To address this, we present the tmQM_wB97MV data set, which filters out several structures in tmQM found to be missing hydrogens and recomputes the energies of all other structures at the ωB97M-V/def2-SVPD level of DFT. ML models trained on tmQM_wB97MV show no pattern of consistently incorrect predictions and much lower errors than those trained on tmQM. The ML models tested on tmQM_wB97MV were, from best to worst, GemNet-T > PaiNN ≈ SpinConv > SchNet. Performance consistently improves when using only neutral structures instead of the entire data set. However, while models saturate with only neutral structures, more data continue to improve the models when including charged species, indicating the importance of accurately capturing a range of oxidation states in future data generation and model development. Furthermore, a fine-tuning approach in which weights were initialized from models trained on OC20 led to drastic improvements in model performance, indicating transferability between ML strategies of heterogeneous and homogeneous systems.