Qin Chunyuan, Luo Xiyang, Deng Chuan, Shu Kunxian, Zhu Weimin, Griss Johannes, Hermjakob Henning, Bai Mingze, Perez-Riverol Yasset
Chongqing Key Laboratory on Big Data for Bio Intelligence, Chongqing University of Posts and Telecommunications, Chongqing, China.
State Key Laboratory of Proteomics, Beijing Proteome Research Center, National Center for Protein Sciences (Beijing), Beijing Institute of Life Omics, Beijing 102206, China.
J Proteomics. 2021 Feb 10;232:104070. doi: 10.1016/j.jprot.2020.104070. Epub 2020 Dec 8.
Spectral similarity calculation is widely used in protein identification tools and mass spectra clustering algorithms to compare theoretical or experimental spectra. Its performance plays an important role in these tools and algorithms, especially in the analysis of large-scale datasets. Recently, deep learning methods have been proposed to improve the performance of clustering algorithms and protein identification by training the algorithms on existing data and using multiple spectra and identified peptide features. While the efficiency of these algorithms relative to traditional approaches is still under study, their application in proteomics data analysis is becoming more common. Here, we propose the use of deep learning to improve spectral similarity comparison. We assessed the performance of deep learning for spectral similarity with GLEAMS and a newly trained embedder model (DLEAMSE), which uses high-quality spectra from PRIDE Cluster. We also developed a new bioinformatics tool (mslookup - https://github.com/bigbio/DLEAMSE/) that allows users to quickly search for spectra among previously identified mass spectra published in public repositories and spectral libraries. Finally, we released a human database to enable bioinformaticians and biologists to search for identified spectra on their own machines.
SIGNIFICANCE STATEMENT: Spectral similarity calculation plays an important role in proteomics data analysis. Because deep learning can learn implicit and effective features from large-scale training datasets, deep learning-based MS/MS spectra embedding models have emerged as a solution to improve the similarity calculations used in mass spectral clustering algorithms. We compare multiple similarity scoring and deep learning methods in terms of accuracy (computing the similarity for a pair of mass spectra) and computing time. The benchmark results showed no major differences in accuracy between DLEAMSE and the normalized dot product (NDP) for spectrum similarity calculations. The DLEAMSE GPU implementation is faster than NDP in preprocessing on the GPU server, and the DLEAMSE similarity calculation (Euclidean distance on 32-D vectors) takes about one third of the time of the dot product calculation. The DLEAMSE encoding and embedding steps need to run only once for each spectrum, and the embedded 32-D points can be persisted in the repository, which speeds up future comparisons and large-scale analyses. Based on these results, we propose a new tool, mslookup, that enables researchers to find spectra previously identified in public data. The tool can also be used to generate in-house databases of previously identified spectra to share with other laboratories and consortia.
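The two similarity measures compared above, and the idea of looking up a query against persisted embeddings, can be illustrated with a minimal Python sketch. It is not the DLEAMSE or mslookup implementation. All function names (`bin_spectrum`, `normalized_dot_product`, `nearest_embeddings`), the binning range and bin width, and the example peaks are hypothetical, and the stored 32-D vectors are random stand-ins rather than embeddings produced by a trained model.

```python
import math
import random


def bin_spectrum(peaks, min_mz=100.0, max_mz=2000.0, bin_width=1.0):
    """Sum peak intensities into fixed-width m/z bins.

    The m/z range and bin width are illustrative assumptions.
    """
    n_bins = math.ceil((max_mz - min_mz) / bin_width)
    vec = [0.0] * n_bins
    for mz, intensity in peaks:
        if min_mz <= mz < max_mz:
            idx = min(int((mz - min_mz) / bin_width), n_bins - 1)
            vec[idx] += intensity
    return vec


def normalized_dot_product(a, b):
    """Cosine-style similarity between two binned spectra (0.0 if either is empty)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def nearest_embeddings(query, store, k=3):
    """Return the k (spectrum_id, distance) pairs closest to the query.

    `store` maps a spectrum id to its persisted embedding vector; distance is
    Euclidean, so smaller values mean more similar spectra.
    """
    scored = [(sid, math.dist(query, vec)) for sid, vec in store.items()]
    return sorted(scored, key=lambda item: item[1])[:k]


# Normalized dot product on two made-up peak lists (m/z, intensity).
spectrum_a = [(175.119, 30.0), (304.161, 100.0), (433.204, 55.0)]
spectrum_b = [(175.118, 28.0), (304.162, 90.0), (520.300, 40.0)]
ndp = normalized_dot_product(bin_spectrum(spectrum_a), bin_spectrum(spectrum_b))

# Euclidean lookup over a store of random 32-D vectors standing in for
# embeddings that would be computed once per spectrum and then persisted.
rng = random.Random(0)
store = {f"spec_{i}": [rng.gauss(0, 1) for _ in range(32)] for i in range(1000)}
query = [x + 0.01 for x in store["spec_42"]]  # slightly perturbed copy
hits = nearest_embeddings(query, store, k=3)

print(f"NDP similarity: {ndp:.3f}")
print("Nearest stored embeddings:", hits)
```

The sketch shows the size difference between the two representations: with these assumed binning parameters the dot product runs over 1900 bins, while the Euclidean distance runs over 32 values that only have to be computed once per spectrum. A linear scan is used here for simplicity; a large repository would more plausibly use an indexed nearest-neighbour search.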