Żurański Andrzej M, Gandhi Shivaani S, Doyle Abigail G
Department of Chemistry, Princeton University, Princeton, New Jersey 08544, United States.
Department of Chemistry and Biochemistry, University of California, Los Angeles, Los Angeles, California 90095, United States.
J Am Chem Soc. 2023 Apr 12;145(14):7898-7909. doi: 10.1021/jacs.2c13093. Epub 2023 Mar 29.
The application of machine learning (ML) techniques to model high-throughput experimentation (HTE) datasets has seen a recent rise in popularity. Nevertheless, the ability to model the interplay between reaction components, known as interaction effects, with ML remains an outstanding challenge. Using a simulated HTE dataset, we find that the presence of irrelevant features poses a strong obstacle to learning interaction effects with common ML algorithms. To address this problem, we propose a two-part statistical modeling approach for HTE datasets: classical analysis of variance of the experiment to identify systematic effects that impact reaction yield across the experiment followed by regression of individual effects using chemistry-informed features. To illustrate this methodology, we use our previously published alcohol deoxyfluorination dataset comprising 740 reactions to build a compact, interpretable generalized additive model that accounts for each significant effect observed in the dataset. We achieve a sizeable performance boost compared to our previously published random forest model, reducing mean absolute error from 18 to 13% and root-mean-squared error from 22 to 17% on a newly generated validation set. Finally, we demonstrate that this approach can facilitate the generation of new mechanistic hypotheses, which, when probed experimentally, can lead to a deeper understanding of chemical reactivity.
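As a rough illustration of the first step described above (classical analysis of variance to detect main and interaction effects before any feature-based regression), the following minimal sketch runs a balanced two-way ANOVA with replication on simulated yields. It is not the authors' code or data: the factor names (`base_*`, `reagent_*`), the effect sizes, the noise level, and the helper `two_way_anova` are all hypothetical, and only the Python standard library is assumed.

```python
import random
from itertools import product


def two_way_anova(data):
    """Balanced two-way ANOVA with replication.

    data maps (level_a, level_b) -> list of replicate yields.
    Every cell is assumed to hold the same number of replicates.
    Returns sums of squares, degrees of freedom, and F statistics.
    """
    a_levels = sorted({a for a, _ in data})
    b_levels = sorted({b for _, b in data})
    na, nb = len(a_levels), len(b_levels)
    n = len(next(iter(data.values())))  # replicates per cell

    def mean(xs):
        return sum(xs) / len(xs)

    grand = mean([y for ys in data.values() for y in ys])
    cell = {k: mean(v) for k, v in data.items()}
    # In a balanced design, marginal means equal the mean of cell means.
    a_mean = {a: mean([cell[(a, b)] for b in b_levels]) for a in a_levels}
    b_mean = {b: mean([cell[(a, b)] for a in a_levels]) for b in b_levels}

    ss = {
        "A": nb * n * sum((a_mean[a] - grand) ** 2 for a in a_levels),
        "B": na * n * sum((b_mean[b] - grand) ** 2 for b in b_levels),
        "AxB": n * sum(
            (cell[(a, b)] - a_mean[a] - b_mean[b] + grand) ** 2
            for a, b in product(a_levels, b_levels)
        ),
        "error": sum((y - cell[k]) ** 2 for k, ys in data.items() for y in ys),
    }
    df = {
        "A": na - 1,
        "B": nb - 1,
        "AxB": (na - 1) * (nb - 1),
        "error": na * nb * (n - 1),
    }
    ms_error = ss["error"] / df["error"]
    f_stat = {k: (ss[k] / df[k]) / ms_error for k in ("A", "B", "AxB")}
    return ss, df, f_stat


# Simulated yields (hypothetical numbers, not from the paper):
# additive main effects plus one synergistic base/reagent pair.
random.seed(0)
bases = ["base_1", "base_2", "base_3"]
reagents = ["reagent_1", "reagent_2", "reagent_3"]
base_effect = {"base_1": 0.0, "base_2": 10.0, "base_3": -5.0}
reagent_effect = {"reagent_1": 0.0, "reagent_2": 5.0, "reagent_3": -5.0}

data = {}
for a, b in product(bases, reagents):
    synergy = 20.0 if (a, b) == ("base_1", "reagent_1") else 0.0
    mu = 50.0 + base_effect[a] + reagent_effect[b] + synergy
    data[(a, b)] = [mu + random.gauss(0.0, 2.0) for _ in range(3)]

ss, df, f_stat = two_way_anova(data)
for term in ("A", "B", "AxB"):
    print(f"{term}: SS={ss[term]:.1f}, df={df[term]}, F={f_stat[term]:.1f}")
```

A large F statistic for the `AxB` term flags an interaction worth modeling explicitly. In the two-part approach the abstract describes, effects identified this way would then be regressed on chemistry-informed features, a step this sketch does not attempt.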