Zhang Chonghuan, Arun Adarsh, Lapkin Alexei A
Department of Chemical Engineering and Biotechnology, University of Cambridge, Philippa Fawcett Drive, Cambridge CB3 0AS, U.K.
Cambridge Centre for Advanced Research and Education in Singapore, CARES Ltd., 1 CREATE Way, CREATE Tower #05-05, Singapore 138602 Singapore.
ACS Omega. 2024 Apr 10;9(16):18385-18399. doi: 10.1021/acsomega.4c00262. eCollection 2024 Apr 23.
Developing reaction routes with computer-aided synthesis planning (CASP) requires an understanding of complete reaction structures. However, most reactions in current databases are missing reaction coparticipants. Although reaction prediction and atom mapping tools can predict the major reaction participants and trace atom rearrangements, they fail to identify the molecules needed to complete a reaction, because they are data-driven models trained on those same databases of incomplete reactions. In this work, a workflow was developed to tackle the reaction completion challenge. It includes a heuristic-based method that identifies balanced reactions in reaction databases and completes some imbalanced reactions by adding candidate molecules. A machine learning masked language model (MLM) was then trained on simplified molecular input line entry system (SMILES) sentences of these completed reactions. The model predicted the missing molecules for the incomplete reactions, a task analogous to predicting missing words in sentences. The model is promising for predicting small- and middle-sized missing molecules in incomplete reaction records. The workflow combining the heuristic and machine learning methods completed more than half of the entire reaction space.
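The two ideas in the abstract, heuristic completion of imbalanced reactions and masking a molecule for a language model, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not describe their heuristics or tokenization. The sketch assumes reaction SMILES of the form `reactants>agents>products`, compares heavy-atom counts only (hydrogens, charges, and stoichiometric coefficients are ignored), and uses a small hypothetical candidate list. All function names, the `CANDIDATES` table, and the `[MASK]` token are illustrative assumptions.

```python
import re
from collections import Counter

# Atom tokens: bracket atoms, two-letter halogens, then the one-letter
# organic subset (aliphatic and aromatic). Bonds, rings, and branches are skipped.
_ATOM = re.compile(r"\[[^\]]*\]|Cl|Br|[BCNOPSFI]|[bcnops]")
# Element symbol inside a bracket atom, after an optional isotope number.
_BRACKET = re.compile(r"\[\d*([A-Z][a-z]?|[a-z]{1,2})")

# Hypothetical candidate coparticipants (name -> SMILES); an assumption for this sketch.
CANDIDATES = {
    "water": "O",
    "hydrogen chloride": "Cl",
    "ammonia": "N",
    "carbon dioxide": "O=C=O",
    "methanol": "CO",
}


def heavy_atom_counts(smiles: str) -> Counter:
    """Count non-hydrogen atoms per element in a SMILES string."""
    counts = Counter()
    for token in _ATOM.findall(smiles):
        if token.startswith("["):
            match = _BRACKET.match(token)
            if match is None:
                continue  # e.g. a wildcard atom; ignored in this sketch
            symbol = match.group(1).capitalize()
        else:
            symbol = token.capitalize()  # aromatic 'c' is counted as 'C'
        if symbol != "H":
            counts[symbol] += 1
    return counts


def suggest_completion(rxn: str, candidates=CANDIDATES):
    """Return (side lacking atoms, candidate names whose atoms match the deficit)."""
    reactants, _agents, products = rxn.split(">")
    lhs, rhs = heavy_atom_counts(reactants), heavy_atom_counts(products)
    if lhs == rhs:
        return "balanced", []
    surplus_left, surplus_right = lhs - rhs, rhs - lhs  # Counter keeps positives only
    if surplus_left and surplus_right:
        return "unresolved", []  # both sides lack atoms; out of scope here
    if surplus_left:
        side, deficit = "products", surplus_left
    else:
        side, deficit = "reactants", surplus_right
    names = [n for n, smi in candidates.items() if heavy_atom_counts(smi) == deficit]
    return side, names


def mask_molecule(rxn: str, side: str, index: int, mask: str = "[MASK]"):
    """Hide one molecule of a completed reaction; return (masked reaction, target)."""
    reactants, agents, products = rxn.split(">")
    parts = {"reactants": reactants.split("."), "products": products.split(".")}
    target = parts[side][index]
    parts[side][index] = mask
    masked = ">".join([".".join(parts["reactants"]), agents, ".".join(parts["products"])])
    return masked, target


if __name__ == "__main__":
    # Esterification recorded without its water by-product.
    print(suggest_completion("CC(=O)O.CCO>>CC(=O)OCC"))
    # A completed reaction turned into a masked training pair.
    print(mask_molecule("CC(=O)O.CCO>>CC(=O)OCC.O", "products", 1))
```

Matching on heavy atoms alone keeps the sketch free of dependencies, but it cannot distinguish candidates that differ only in hydrogens or charge. A real pipeline would likely rely on a cheminformatics toolkit for full atom and charge balance.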