Liu Shiyun, Cole Jacqueline M
Cavendish Laboratory, Department of Physics, University of Cambridge, J. J. Thomson Avenue, Cambridge CB3 0HE, U.K.
Science and Technology Facilities Council, Harwell Science and Innovation Campus, Didcot, Oxfordshire OX14 0FA, U.K.
J Chem Inf Model. 2025 Aug 25;65(16):8435-8447. doi: 10.1021/acs.jcim.5c00499. Epub 2025 Aug 13.
Nuclear magnetic resonance (NMR) spectroscopy is an indispensable tool for determining the structural characteristics of a molecule by analyzing its chemical shifts. A wealth of NMR spectra therefore exists and continues to accumulate daily, at an ever-increasing rate owing to the progressive automation of chemical analysis. This growth and automation have made the data analysis step in NMR spectroscopy the main bottleneck in the structural characterization of a new chemical compound. In particular, the data interpretation step is slow and prone to error because it requires manual examination by a suitably trained scientist. Machine learning (ML) methods could overcome this bottleneck, provided that they can automatically correlate the collection of peaks in an NMR spectrum with the substructures of its subject molecule. This study explores the art of the possible using three types of ML methods that are based on neural-network architectures: a multilayer perceptron (MLP) + long short-term memory (LSTM) neural network, a convolutional neural network (CNN), and an MLP + recurrent neural network (RNN). NMR spectrum-structure correlations were encoded into each type of neural network using two forms of molecular representation, one employing functional groups and the other using a novel neighbor-based method. These models were trained on 34,503 experimental ¹³C NMR spectra and 17,311 experimental ¹H NMR spectra. The influence of incorporating metadata about experimental conditions (NMR field strength, temperature, and solvent) into the neural-network model was also investigated. The models incorporated coupling constants as a proxy for spectral intensities in the case of ¹³C NMR spectra. We found that the MLP + LSTM model achieved the highest accuracy (88%) when trained on ¹³C NMR spectra with experimental metadata incorporated (compared to 77% without it). While the CNN model's accuracy was slightly lower (86%), it determined molecular substructures in one-third of the computational run time of the MLP + LSTM model. Thus, the CNN model emerged as the most practical model when considering performance, time, and cost.
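As an illustration only, the following minimal Python sketch shows one way a ¹³C peak list and the experimental metadata named above (field strength, temperature, solvent) could be packed into a single fixed-length input vector for a neural network. The abstract does not describe the actual input encoding, so everything here is an assumption: the 0 to 220 ppm range, the 1 ppm bin width, the three-entry solvent vocabulary, the scaling factors, and the function names `bin_peaks`, `encode_metadata`, and `encode_spectrum` are all hypothetical.

```python
from typing import List, Sequence

# Hypothetical settings; none of these values come from the study.
SHIFT_MIN_PPM = 0.0
SHIFT_MAX_PPM = 220.0
N_BINS = 220  # 1 ppm per bin under the range assumed above
SOLVENTS = ["CDCl3", "DMSO-d6", "D2O"]  # illustrative solvent vocabulary


def bin_peaks(shifts_ppm: Sequence[float]) -> List[float]:
    """Mark each chemical-shift bin that contains at least one peak."""
    vec = [0.0] * N_BINS
    width = (SHIFT_MAX_PPM - SHIFT_MIN_PPM) / N_BINS
    for shift in shifts_ppm:
        # Peaks outside the assumed range are silently ignored.
        if SHIFT_MIN_PPM <= shift < SHIFT_MAX_PPM:
            vec[int((shift - SHIFT_MIN_PPM) / width)] = 1.0
    return vec


def encode_metadata(field_mhz: float, temperature_k: float, solvent: str) -> List[float]:
    """Scale the numeric conditions and one-hot encode the solvent."""
    one_hot = [1.0 if solvent == name else 0.0 for name in SOLVENTS]
    return [field_mhz / 1000.0, temperature_k / 1000.0] + one_hot


def encode_spectrum(shifts_ppm: Sequence[float], field_mhz: float,
                    temperature_k: float, solvent: str) -> List[float]:
    """Concatenate the binned peak list with the encoded metadata."""
    return bin_peaks(shifts_ppm) + encode_metadata(field_mhz, temperature_k, solvent)


# Example: three made-up peaks recorded at 400 MHz and 298 K in CDCl3.
features = encode_spectrum([14.1, 22.7, 171.3], 400.0, 298.0, "CDCl3")
print(len(features))
```

A vector of this kind could be fed to any of the three architectures as a multi-label classification input, with one output per candidate substructure; dropping the metadata entries would correspond to the "without metadata" setting compared in the abstract.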