Committee on Computational and Applied Mathematics, Department of Statistics, University of Chicago, 5747 South Ellis Avenue, Chicago, Illinois 60637, United States.
Department of Computer Science, University of Chicago, 5730 South Ellis Avenue, Chicago, Illinois 60637, United States.
Anal Chem. 2023 Feb 7;95(5):2653-2663. doi: 10.1021/acs.analchem.2c02093. Epub 2023 Jan 25.
Mass spectrometry is a vital tool in the analytical chemist's toolkit, commonly used to identify the presence of known compounds and to elucidate unknown chemical structures. All of these applications rely on previously measured spectra of known substances. Computational methods that predict mass spectra from chemical structures can augment existing spectral databases with predicted spectra for previously unmeasured molecules. In this paper, we present a method for predicting electron ionization mass spectra (EI-MS) of small molecules that combines physically plausible substructure enumeration with deep learning, which we term rapid approximate subset-based spectra prediction (RASSP). The first of our two models produces a probability distribution over chemical subformulae and achieves state-of-the-art forward prediction accuracy, with a weighted (Stein) dot product of 92.9% and a database lookup recall (within the top 10 ranked spectra) of 98.0%, when evaluated against the NIST 2017 Mass Spectral Library. The second model produces a probability distribution over vertex subsets of the original molecule graph and achieves similar forward prediction accuracy with superior generalization in the high-resolution, low-data regime. Spectra predicted by our best model improve upon the previous state-of-the-art spectral database lookup error rate by a factor of 2.9, reducing the top-10 lookup error from 5.7% to 2.0%. Both models can train on and predict spectral data at arbitrary resolution. Source code and predicted EI-MS spectra for 73.2M small molecules from PubChem will be made freely accessible online.
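The two evaluation metrics quoted in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the dictionary representation of a spectrum, and the default exponents (`mass_power`, `intensity_power`) are assumptions chosen for illustration, and the abstract does not state the weighting actually used.

```python
import math


def weighted_dot_product(pred, true, mass_power=1.0, intensity_power=0.5):
    """Weighted cosine similarity between two spectra.

    Each spectrum is a dict mapping m/z to intensity. The exponent
    defaults are illustrative assumptions, not values from the paper.
    """
    def weight(spectrum):
        # Scale each peak by mass and intensity raised to the chosen powers.
        return {
            mz: (mz ** mass_power) * (inten ** intensity_power)
            for mz, inten in spectrum.items()
            if inten > 0
        }

    p, t = weight(pred), weight(true)
    shared = p.keys() & t.keys()
    numerator = sum(p[mz] * t[mz] for mz in shared)
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_t = math.sqrt(sum(v * v for v in t.values()))
    denominator = norm_p * norm_t
    return numerator / denominator if denominator > 0 else 0.0


def top_k_recall(queries, library, k=10):
    """Fraction of queries whose true ID ranks among the k best library matches.

    queries: list of (true_id, measured_spectrum) pairs.
    library: dict mapping molecule ID to a predicted spectrum.
    """
    hits = 0
    for true_id, measured in queries:
        ranked = sorted(
            library,
            key=lambda mol_id: weighted_dot_product(library[mol_id], measured),
            reverse=True,
        )
        if true_id in ranked[:k]:
            hits += 1
    return hits / len(queries)


# Toy data (hypothetical peaks, not taken from any spectral library).
library = {
    "a": {15: 100.0, 43: 50.0},
    "b": {77: 100.0, 105: 30.0},
}
queries = [
    ("a", {15: 90.0, 43: 60.0}),
    ("b", {77: 80.0, 105: 40.0}),
]
recall_at_1 = top_k_recall(queries, library, k=1)
```

The lookup error is one minus the recall, so a top-10 recall of 98.0% corresponds to the 2.0% error quoted above, and the stated improvement factor follows from 5.7 / 2.0 ≈ 2.9.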