Department of Computer Science, Luddy School of Informatics, Computing and Engineering, Indiana University, IN 47408, United States.
Bioinformatics. 2024 Jun 28;40(Suppl 1):i257-i265. doi: 10.1093/bioinformatics/btae220.
Tandem mass spectrometry (MS/MS) is a crucial technology for large-scale proteomic analysis. Protein database search and spectral library search are commonly used to identify peptides from MS/MS spectra. Both approaches, however, are challenged by experimental variation between replicate spectra and by similar fragmentation patterns among distinct peptides. To address these challenges, we present SpecEncoder, a deep metric learning approach that transforms MS/MS spectra into robust and sensitive embedding vectors in a latent space. The SpecEncoder model can also embed predicted MS/MS spectra of peptides, enabling a hybrid search approach that combines spectral library and protein database searches for peptide identification.
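To make the idea of searching in an embedding space concrete, the following is a minimal sketch, not SpecEncoder's actual implementation. It assumes that spectra have already been mapped to fixed-length vectors by some encoder and that candidates are ranked by cosine similarity; the similarity measure, the function names, the toy peptide sequences, and the three-dimensional vectors are all illustrative assumptions. The library mixes entries labeled as experimental and predicted to mirror the hybrid search described above.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def best_match(query_embedding, library):
    """Return (peptide, source, score) for the library entry most similar to the query.

    `library` is a list of (peptide, source, embedding) tuples, where `source`
    records whether the embedding came from an experimental or a predicted
    spectrum. This layout is a hypothetical choice made for the sketch.
    """
    best = None
    for peptide, source, embedding in library:
        score = cosine_similarity(query_embedding, embedding)
        if best is None or score > best[2]:
            best = (peptide, source, score)
    return best


# Toy embeddings standing in for encoder output (assumed values, not real data).
library = [
    ("PEPTIDEK", "experimental", [0.9, 0.1, 0.0]),
    ("SAMPLER", "predicted", [0.1, 0.8, 0.3]),
    ("ELVISK", "predicted", [0.0, 0.2, 0.9]),
]
query = [0.2, 0.9, 0.2]

peptide, source, score = best_match(query, library)
print(peptide, source, round(score, 3))
```

A linear scan is enough for a toy library; a library of realistic size would more plausibly use an approximate nearest-neighbor index, and accepted matches would still need false discovery rate control, neither of which is shown here.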
We evaluated SpecEncoder on three large human proteomics datasets, and the results showed a consistent improvement in peptide identification. For spectral library search, SpecEncoder identified 1%-2% more unique peptides (and peptide-spectrum matches, PSMs) than SpectraST. For protein database search, it identified 6%-15% more unique peptides than MSGF+ enhanced by Percolator. Furthermore, SpecEncoder identified 6%-12% additional unique peptides when using a combined library of experimental and predicted spectra. SpecEncoder also identified more peptides than a deep-learning-enhanced method (MSFragger boosted by MSBooster). These results demonstrate SpecEncoder's potential to enhance peptide identification in proteomic data analyses.
The source code and scripts for SpecEncoder and peptide identification are available on GitHub at https://github.com/lkytal/SpecEncoder. Contact: hatang@iu.edu.