Zhang Hailiang, Yang Qiong, Xie Ting, Wang Yue, Zhang Zhimin, Lu Hongmei
College of Chemistry and Chemical Engineering, Central South University, Changsha 410083, China.
Anal Chem. 2024 Oct 22;96(42):16599-16608. doi: 10.1021/acs.analchem.4c02426. Epub 2024 Oct 14.
Tandem mass spectrometry (MS/MS) is a powerful technique for chemical analysis in many areas of science. Liquid chromatography-mass spectrometry (LC-MS) experiments generate vast amounts of MS/MS spectral data, which require efficient analysis and interpretation methods for subsequent compound identification. In this study, we propose MSBERT, a model based on self-supervised learning strategies that embeds MS/MS spectra into meaningful vector representations for efficient compound identification. It adopts the transformer encoder as the backbone for mask learning and uses differently masked versions of the same spectrum as positive pairs for contrastive learning. MSBERT is trained on the GNPS data set and tested on the GNPS, MoNA, and MTBLS1572 data sets. It exhibits better library matching and analogous compound searching capabilities than existing methods. On a GNPS test subset whose structures do not appear in the training set, the recalls at 1, 5, and 10 are 0.7871, 0.8950, and 0.9080, respectively. These results are better than those of Spec2Vec (0.6898, 0.8276, and 0.8620) and DreaMS (0.7158, 0.8327, and 0.8635). The soundness of the embeddings is demonstrated by t-SNE visualization, structural similarity, spectra clustering, compound identification, and analogous compound searching. A user-friendly web server is provided for efficient spectral analysis, and the source code for MSBERT is available at https://github.com/zhanghailiangcsu/MSBERT.
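The recall-at-k figures quoted above can be read as the fraction of query spectra whose correct compound appears among the k most similar library entries. The following is a minimal sketch of that metric, assuming cosine similarity between embeddings and one compound identifier per spectrum. The function names, the toy two-dimensional embeddings, and the identifiers are illustrative assumptions and are not taken from the MSBERT code base.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def recall_at_k(query_embs, query_ids, lib_embs, lib_ids, k):
    """Fraction of queries whose own identifier is among the top-k library hits."""
    hits = 0
    for emb, true_id in zip(query_embs, query_ids):
        # Rank library entries from most to least similar to the query.
        ranked = sorted(
            range(len(lib_embs)),
            key=lambda i: cosine(emb, lib_embs[i]),
            reverse=True,
        )
        top_ids = {lib_ids[i] for i in ranked[:k]}
        hits += true_id in top_ids
    return hits / len(query_embs)


# Toy data (hypothetical): three library spectra and two query spectra.
library = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
library_ids = ["A", "B", "C"]
queries = [[0.9, 0.1], [0.6, 0.8]]
query_ids = ["A", "B"]

r1 = recall_at_k(queries, query_ids, library, library_ids, 1)
r2 = recall_at_k(queries, query_ids, library, library_ids, 2)
print(r1, r2)  # prints: 0.5 1.0
```

In this toy case the second query ranks library entry C above its true match B, so it counts as a miss at k = 1 and as a hit at k = 2. In practice the embeddings would come from the trained model, and the matching criterion (for example, identical structure identifiers) would follow the evaluation protocol of the paper.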