Sato Akinori, Asahara Ryosuke, Miyao Tomoyuki
Data Science Center, Nara Institute of Science and Technology, 8916-5 Takayama-cho, Ikoma, Nara 630-0192, Japan.
Graduate School of Science and Technology, Nara Institute of Science and Technology, 8916-5 Takayama-cho, Ikoma, Nara 630-0192, Japan.
ACS Omega. 2024 Sep 17;9(39):40907-40919. doi: 10.1021/acsomega.4c06113. eCollection 2024 Oct 1.
Chemical reaction yield is an important factor in determining reaction conditions. Recently, many data-driven models for yield prediction using high-throughput experimentation datasets have been reported. In this study, we propose a neural network architecture based on the chemical graphs of the reaction components to predict the reaction yield. The proposed model is a sequential combination of a message-passing neural network and a transformer encoder. The first network converts the reaction components to molecular matrices; after embeddings of the compound roles in the chemical reaction are added, the second network models the interplay among the reaction components. The predictive ability of the proposed models was compared with that of state-of-the-art yield prediction models using two high-throughput experimental datasets: the Buchwald-Hartwig cross-coupling (BHC) and Suzuki-Miyaura cross-coupling (SMC) reaction datasets. Overall, the models showed high prediction accuracy for the BHC reaction datasets and some of the extrapolation-oriented SMC reaction datasets. These models also performed well when the training dataset size was relatively large. Furthermore, analysis of the poorly predicted reactions in the BHC reaction dataset revealed a limitation of the data-driven yield prediction approach based on chemical structural similarity.
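The two-stage design described in the abstract can be illustrated with a toy forward pass. This is a minimal sketch under stated assumptions, not the authors' implementation: the weights are random and untrained, a single residual self-attention layer stands in for a full transformer encoder, and the role names, hidden size, mean pooling, and sigmoid output head are all hypothetical choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # hidden size (hypothetical)
# Hypothetical role vocabulary; the paper's actual roles are not given here.
ROLES = {"aryl_halide": 0, "amine": 1, "ligand": 2, "base": 3}

def message_passing(atom_feats, adjacency, w_msg, steps=2):
    """Toy MPNN: each step mixes every atom's state with the sum of its neighbors."""
    h = atom_feats
    for _ in range(steps):
        h = np.tanh(h + adjacency @ h @ w_msg)
    return h  # (n_atoms, D) "molecular matrix"

def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention with a residual connection (stand-in for an encoder layer)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(x.shape[1])
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return x + weights @ v

def predict_yield(components, params):
    """components: list of (role, atom_feats, adjacency), one entry per reaction component."""
    rows = []
    for role, atom_feats, adjacency in components:
        h = message_passing(atom_feats, adjacency, params["w_msg"])
        rows.append(h + params["role_emb"][ROLES[role]])  # add the role embedding to every atom
    x = np.vstack(rows)  # atoms of all components form one sequence
    x = self_attention(x, params["w_q"], params["w_k"], params["w_v"])
    logit = x.mean(axis=0) @ params["w_out"]  # mean pooling is an assumption
    return float(100.0 / (1.0 + np.exp(-logit)))  # squash to a 0-100 % yield

params = {
    "w_msg": rng.normal(scale=0.3, size=(D, D)),
    "w_q": rng.normal(scale=0.3, size=(D, D)),
    "w_k": rng.normal(scale=0.3, size=(D, D)),
    "w_v": rng.normal(scale=0.3, size=(D, D)),
    "w_out": rng.normal(scale=0.3, size=D),
    "role_emb": rng.normal(scale=0.3, size=(len(ROLES), D)),
}

# Placeholder graph: a 3-atom chain with random atom features for every component.
chain = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
components = [(role, rng.normal(size=(3, D)), chain) for role in ROLES]

predicted = predict_yield(components, params)
print(f"predicted yield (untrained toy model): {predicted:.1f} %")
```

Because the weights are random, the printed number carries no chemical meaning; the sketch only shows how per-component molecular matrices, role embeddings, and cross-component attention fit together.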