Yao Lin, Yang Minjian, Song Jianfei, Yang Zhuo, Sun Hanyu, Shi Hui, Liu Xue, Ji Xiangyang, Deng Yafeng, Wang Xiaojian
CarbonSilicon AI Technology Co., Ltd., Beijing 100080, China.
State Key Laboratory of Bioactive Substances and Functions of Natural Medicines, Institute of Materia Medica, Peking Union Medical College and Chinese Academy of Medical Sciences, Beijing 100050, China.
Anal Chem. 2023 Mar 28;95(12):5393-5401. doi: 10.1021/acs.analchem.2c05817. Epub 2023 Mar 16.
Structure elucidation of unknown compounds based on nuclear magnetic resonance (NMR) remains a challenging problem in both synthetic organic and natural product chemistry. Library matching has been an efficient method to assist structure elucidation, but it is limited by the coverage of the libraries and neglects prior knowledge such as molecular fragments. To address these limitations, we propose a conditional molecular generation net (CMGNet) that accepts multiple sources of information as input. CMGNet takes not only ¹³C NMR spectral data but also molecular formulas and molecular fragments as input conditions. Our model applies large-scale pretraining for molecular understanding, followed by fine-tuning on two NMR spectral data sets of different granularity levels to accommodate structure elucidation tasks. CMGNet generates structures based on ¹³C NMR data, molecular formula, and fragment information, with a recovery rate of 94.17% in the top 10 recommendations. In addition, the generative model performed well in the generation of various classes of compounds and in the structural revision task. CMGNet has a deep understanding of molecular connectivities from ¹³C NMR, molecular formula, and fragments, paving the way for a new paradigm of deep learning-assisted inverse problem-solving.
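The abstract reports a recovery rate of 94.17% in the top 10 recommendations but does not define the metric. A minimal sketch follows, assuming that "recovery" means the true structure appears among the first k ranked candidates for a given spectrum, and that all structures are compared as already-canonicalized SMILES strings. The function name `top_k_recovery_rate` and the toy data are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def top_k_recovery_rate(
    ranked_candidates: Sequence[Sequence[str]],
    targets: Sequence[str],
    k: int = 10,
) -> float:
    """Fraction of queries whose target appears in the first k candidates.

    Assumption: every string is a canonical SMILES, so exact string
    equality stands in for structural identity. A real evaluation would
    canonicalize with a cheminformatics toolkit before comparing.
    """
    if len(ranked_candidates) != len(targets):
        raise ValueError("need exactly one candidate list per target")
    if not targets:
        return 0.0
    hits = sum(
        1
        for candidates, target in zip(ranked_candidates, targets)
        if target in candidates[:k]
    )
    return hits / len(targets)


# Toy data (hypothetical): two queries, the first is recovered at rank 2.
candidates = [["CCO", "CC"], ["CCC"]]
truth = ["CC", "CCN"]
rate_at_10 = top_k_recovery_rate(candidates, truth, k=10)
rate_at_1 = top_k_recovery_rate(candidates, truth, k=1)
```

Under this definition the metric can only stay the same or grow as k increases, which is why a top 10 figure is generally higher than the corresponding top 1 figure.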