Self-supervised learning of molecular representations from millions of tandem mass spectra using DreaMS.

Authors

Bushuiev Roman, Bushuiev Anton, Samusevich Raman, Brungs Corinna, Sivic Josef, Pluskal Tomáš

Affiliations

Institute of Organic Chemistry and Biochemistry of the Czech Academy of Sciences, Prague, Czech Republic.

Czech Institute of Informatics, Robotics and Cybernetics, Czech Technical University, Prague, Czech Republic.

Publication information

Nat Biotechnol. 2025 May 23. doi: 10.1038/s41587-025-02663-3.

Abstract

Characterizing biological and environmental samples at a molecular level primarily uses tandem mass spectrometry (MS/MS), yet the interpretation of tandem mass spectra from untargeted metabolomics experiments remains a challenge. Existing computational methods for predictions from mass spectra rely on limited spectral libraries and on hard-coded human expertise. Here we introduce a transformer-based neural network pre-trained in a self-supervised way on millions of unannotated tandem mass spectra from our GNPS Experimental Mass Spectra (GeMS) dataset mined from the MassIVE GNPS repository. We show that pre-training our model to predict masked spectral peaks and chromatographic retention orders leads to the emergence of rich representations of molecular structures, which we named Deep Representations Empowering the Annotation of Mass Spectra (DreaMS). Further fine-tuning the neural network yields state-of-the-art performance across a variety of tasks. We make our new dataset and model available to the community and release the DreaMS Atlas, a molecular network of 201 million MS/MS spectra constructed using DreaMS annotations.
