He Qingzu, Li Xiang, Zhong Jinjin, Yang Gen, Han Jiahuai, Shuai Jianwei
Department of Physics, National Institute for Data Science in Health and Medicine, Xiamen University, Xiamen, China.
Wenzhou Key Laboratory of Biophysics, Wenzhou Institute, University of Chinese Academy of Sciences, Wenzhou, Zhejiang, China.
Smart Med. 2024 Aug 27;3(3):e20240014. doi: 10.1002/SMMD.20240014. eCollection 2024 Sep.
Peptide spectrum matching is the process of linking mass spectrometry data with peptide sequences. An experimental spectrum can match thousands of candidate peptides, and variable modifications lead to an exponential increase in the number of candidates. Completing the search within a limited time is therefore a key challenge. Traditional searches expedite the process by restricting peptide mass errors and variable modifications, but this limits interpretive capability. To address this challenge, we propose Dear-PSM, a peptide search engine that supports full database searching. Dear-PSM does not restrict peptide mass errors: it matches each spectrum to all peptides in the database and increases the number of variable modifications per peptide from the conventional 3 to 20. Leveraging inverted index technology, Dear-PSM creates a high-performance index table of experimental spectra and uses deep learning algorithms for peptide validation. Through these techniques, Dear-PSM runs 7 times faster than mainstream search engines on a regular desktop computer, with a 240-fold reduction in memory consumption. Benchmark results demonstrate that Dear-PSM, in full database search mode, reproduces over 90% of the results obtained by mainstream search engines when handling complex mass spectrometry data collected from different species using various instruments. Furthermore, it uncovers a substantial number of new peptides and proteins. Dear-PSM is publicly available on GitHub at https://github.com/jianweishuai/Dear-PSM.
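The idea of an inverted index over experimental spectra can be illustrated with a minimal Python sketch. This is not the Dear-PSM implementation, and it makes several assumptions not stated in the abstract: peaks are binned at a fixed width (`BIN_WIDTH`, a hypothetical tolerance), only singly charged b- and y-ions of unmodified peptides are considered, and the score is a plain count of matched fragments. The function names (`build_spectrum_index`, `fragment_mzs`, `match_counts`) are illustrative only.

```python
from collections import defaultdict

# Standard monoisotopic residue masses (Da) for a small subset of amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "L": 113.08406, "K": 128.09496, "R": 156.10111,
}
PROTON = 1.007276
WATER = 18.010565
BIN_WIDTH = 0.02  # Da; assumed fragment tolerance, not taken from the paper


def to_bin(mz):
    """Convert an m/z value to an integer bin number."""
    return int(round(mz / BIN_WIDTH))


def build_spectrum_index(spectra):
    """Map each m/z bin to the set of spectrum ids that have a peak in it."""
    index = defaultdict(set)
    for spec_id, peaks in spectra.items():
        for mz in peaks:
            index[to_bin(mz)].add(spec_id)
    return index


def fragment_mzs(peptide):
    """Singly charged b- and y-ion m/z values for an unmodified peptide."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    total = sum(masses)
    frags = []
    prefix = 0.0
    for m in masses[:-1]:
        prefix += m
        frags.append(prefix + PROTON)                  # b ion
        frags.append(total - prefix + WATER + PROTON)  # y ion
    return frags


def match_counts(index, peptide):
    """Count, per spectrum, how many theoretical fragments hit a peak bin."""
    counts = defaultdict(int)
    for mz in fragment_mzs(peptide):
        b = to_bin(mz)
        hits = set()
        # Check neighbouring bins so peaks near a bin edge are not missed.
        for neighbour in (b - 1, b, b + 1):
            hits |= index.get(neighbour, set())
        for spec_id in hits:
            counts[spec_id] += 1  # each fragment counts once per spectrum
    return dict(counts)


# Toy data: one spectrum built from the fragments of GASK, one unrelated.
spectra = {"s1": fragment_mzs("GASK"), "s2": [100.0, 200.0, 300.0]}
index = build_spectrum_index(spectra)
print(match_counts(index, "GASK"))
print(match_counts(index, "GALK"))
```

Because the index is keyed by fragment bin rather than by precursor mass, every peptide is compared against every spectrum without a precursor mass filter, which is the sense in which a search of this kind is "full database". A real engine would also need to handle charge states, modifications, and statistical validation.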