Han Jongmin, Kwon Youngchun, Choi Youn-Suk, Kang Seokho
Department of Industrial Engineering, Sungkyunkwan University, 2066 Seobu-ro, Jangan-gu, Suwon, Republic of Korea.
Samsung Advanced Institute of Technology, Samsung Electronics Co. Ltd., 130 Samsung-ro, Yeongtong-gu, Suwon, Republic of Korea.
J Cheminform. 2024 Mar 1;16(1):25. doi: 10.1186/s13321-024-00818-z.
Graph neural networks (GNNs) have proven effective in predicting chemical reaction yields. However, their performance tends to deteriorate when they are trained on a dataset that is insufficient in quantity or diversity. A promising solution to alleviate this issue is to pre-train a GNN on a large-scale molecular database. In this study, we investigate the effectiveness of GNN pre-training in chemical reaction yield prediction. We present a novel GNN pre-training method for performance improvement. Given a molecular database consisting of a large number of molecules, we calculate molecular descriptors for each molecule and reduce the dimensionality of these descriptors by applying principal component analysis. We define a pre-text task by assigning a vector of principal component scores as the pseudo-label to each molecule in the database. A GNN is then pre-trained to perform the pre-text task of predicting the pseudo-label for the input molecule. For chemical reaction yield prediction, a prediction model is initialized with the pre-trained GNN and then fine-tuned on a training dataset containing chemical reactions and their yields. We demonstrate the effectiveness of the proposed method through experimental evaluation on benchmark datasets.
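The pseudo-label construction described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the descriptor matrix here is random placeholder data (in practice the descriptors would be computed from each molecule with a cheminformatics toolkit such as RDKit), and the number of retained components is an arbitrary choice.

```python
import numpy as np

# Hypothetical descriptor matrix: one row per molecule, one column per
# molecular descriptor. Random values stand in for real descriptors.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 50))  # 1000 molecules, 50 descriptors

# Standardize descriptors so each has zero mean and unit variance.
Xc = (X - X.mean(axis=0)) / X.std(axis=0)

# PCA via SVD: rows of Vt are the principal axes of the descriptor space.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 10  # number of principal components to keep (a free choice)

# Principal-component scores: one k-dimensional pseudo-label per molecule.
pseudo_labels = Xc @ Vt[:k].T

print(pseudo_labels.shape)  # (1000, 10)
```

A GNN would then be pre-trained to regress `pseudo_labels` from each molecule's graph, before being fine-tuned on reaction yields.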