Deep learning embedder method and tool for mass spectra similarity search.

Author information

Qin Chunyuan, Luo Xiyang, Deng Chuan, Shu Kunxian, Zhu Weimin, Griss Johannes, Hermjakob Henning, Bai Mingze, Perez-Riverol Yasset

Affiliations

Chongqing Key Laboratory on Big Data for Bio Intelligence, Chongqing University of Posts and Telecommunications, Chongqing, China.

State Key Laboratory of Proteomics, Beijing Proteome Research Center, National Center for Protein Sciences (Beijing), Beijing Institute of Life Omics, Beijing 102206, China.

Publication information

J Proteomics. 2021 Feb 10;232:104070. doi: 10.1016/j.jprot.2020.104070. Epub 2020 Dec 8.

Abstract

Spectral similarity calculation is widely used in protein identification tools and mass spectra clustering algorithms when comparing theoretical or experimental spectra. The performance of the spectral similarity calculation plays an important role in these tools and algorithms, especially in the analysis of large-scale datasets. Recently, deep learning methods have been proposed to improve the performance of clustering algorithms and protein identification by training the algorithms on existing data and by using multiple spectra and identified peptide features. While the efficiency of these algorithms compared with traditional approaches is still under study, their application in proteomics data analysis is becoming more common. Here, we propose the use of deep learning to improve spectral similarity comparison. We assessed the performance of deep learning for spectral similarity with GLEAMS and a newly trained embedder model (DLEAMSE), which uses high-quality spectra from PRIDE Cluster. We also developed a new bioinformatics tool (mslookup, https://github.com/bigbio/DLEAMSE/) that allows users to quickly search for spectra among previously identified mass spectra published in public repositories and spectral libraries. Finally, we released a human database to enable bioinformaticians and biologists to search for identified spectra on their own machines.

SIGNIFICANCE STATEMENT: Spectral similarity calculation plays an important role in proteomics data analysis. Because deep learning can learn implicit and effective features from large-scale training datasets, deep learning-based MS/MS spectra embedding models have emerged as a solution to improve the similarity calculations used in mass spectral clustering. We compare multiple similarity scoring and deep learning methods in terms of accuracy (computing the similarity for a pair of mass spectra) and computing-time performance. The benchmark results showed no major differences in accuracy between DLEAMSE and the normalized dot product (NDP) for spectrum similarity calculations. The DLEAMSE GPU implementation is faster than NDP in preprocessing on the GPU server, and the DLEAMSE similarity calculation (Euclidean distance on 32-D vectors) takes about one third of the time of the dot product calculation. The DLEAMSE encoding and embedding steps need to run only once for each spectrum, and the embedded 32-D points can be persisted in the repository, which speeds up later comparisons on large-scale data. Based on these results, we propose a new tool, mslookup, that enables researchers to find spectra previously identified in public data. The tool can also be used to generate in-house databases of previously identified spectra to share with other laboratories and consortia.
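To make the two similarity measures compared above concrete, here is a minimal Python sketch. It is not the DLEAMSE or mslookup code. The function names, the 1 m/z bin width, the 2000 m/z upper limit and the placeholder 32-D vectors are all assumptions chosen for illustration; a real embedding would come from the trained model. The sketch computes a normalized dot product on binned spectra, a Euclidean distance on 32-D vectors, and a brute-force lookup over a small set of stored embeddings.

```python
import math

def bin_spectrum(peaks, bin_width=1.0, max_mz=2000.0):
    """Sum peak intensities into fixed-width m/z bins.
    bin_width and max_mz are illustrative assumptions, not values from the paper."""
    n_bins = int(max_mz / bin_width)
    vec = [0.0] * n_bins
    for mz, intensity in peaks:
        idx = int(mz / bin_width)
        if 0 <= idx < n_bins:
            vec[idx] += intensity
    return vec

def normalized_dot_product(a, b):
    """Cosine-style score for two binned spectra; returns 0.0 if either is empty."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0

def euclidean_distance(u, v):
    """Distance between two embedding vectors of equal length (e.g. 32-D)."""
    if len(u) != len(v):
        raise ValueError("embeddings must have the same dimension")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

def nearest(query, library):
    """Brute-force search: return (spectrum_id, distance) of the closest stored embedding.
    `library` is a hypothetical dict mapping spectrum id -> embedding."""
    return min(
        ((sid, euclidean_distance(query, emb)) for sid, emb in library.items()),
        key=lambda item: item[1],
    )

# Two toy spectra as (m/z, intensity) pairs; the values are made up.
spec_a = [(175.1, 30.0), (300.2, 100.0), (450.3, 55.0)]
spec_b = [(175.1, 28.0), (300.2, 95.0), (612.4, 10.0)]
ndp = normalized_dot_product(bin_spectrum(spec_a), bin_spectrum(spec_b))
print("normalized dot product:", round(ndp, 3))

# Placeholder 32-D vectors standing in for model output (NOT real embeddings).
library = {"lib_1": [0.0] * 32, "lib_2": [1.0] * 32}
query = [0.1] * 32
print("closest stored spectrum:", nearest(query, library))
```

With these assumed settings the dot product runs over a 2000-element binned vector while the distance runs over 32 values, which gives some intuition for why comparing stored embeddings can be cheaper per pair. The sketch does not reproduce the timing ratio reported in the abstract.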
