Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, WA 98195, USA.
Department of Genome Sciences, University of Washington, Seattle, WA 98195, USA.
Bioinformatics. 2024 Jun 28;40(Suppl 1):i410-i417. doi: 10.1093/bioinformatics/btae218.
One of the core problems in the analysis of protein tandem mass spectrometry data is the peptide assignment problem: determining, for each observed spectrum, the peptide sequence that generated it. Two primary classes of methods are used to solve this problem: database search and de novo peptide sequencing. State-of-the-art de novo sequencing methods rely on machine learning, whereas most database search engines use hand-designed score functions to evaluate the quality of a match between an observed spectrum and a candidate peptide from the database. We hypothesized that machine learning models for de novo sequencing implicitly learn a score function that captures the relationship between peptides and spectra, and thus may be repurposed as a score function for database search. Because this score function is trained on massive amounts of mass spectrometry data, it could potentially outperform existing, hand-designed database search tools.
To test this hypothesis, we re-engineered Casanovo, which has been shown to provide state-of-the-art de novo sequencing capabilities, to assign scores to given peptide-spectrum pairs. We then evaluated the statistical power of this Casanovo score function, Casanovo-DB, to detect peptides on a benchmark of three mass spectrometry runs from three different species. In addition, we show that re-scoring with the Percolator post-processor benefits Casanovo-DB more than other score functions, further increasing the number of detected peptides.
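The idea of reusing a de novo sequencing model as a database-search score function can be illustrated with a minimal sketch. The abstract does not specify how Casanovo-DB computes its score, so everything here is an assumption made for illustration: the `NextResidueModel` interface, the choice of a mean per-residue log-probability, and the function names are hypothetical and are not Casanovo's API.

```python
import math
from typing import Callable, Iterable, Mapping

# Hypothetical interface (an assumption, not Casanovo's API): given a spectrum
# and the peptide prefix decoded so far, return a probability for each
# possible next amino acid.
NextResidueModel = Callable[[object, str], Mapping[str, float]]

# Floor applied to probabilities so that log() is always defined.
MIN_PROB = 1e-12


def score_peptide(model: NextResidueModel, spectrum: object, peptide: str) -> float:
    """Score a given peptide-spectrum pair.

    The candidate peptide is fed to the model one residue at a time, and the
    score is the mean log-probability the model assigns to each residue given
    the preceding ones. This aggregation is an illustrative choice.
    """
    if not peptide:
        raise ValueError("peptide must contain at least one residue")
    total = 0.0
    for i, residue in enumerate(peptide):
        probs = model(spectrum, peptide[:i])
        total += math.log(max(probs.get(residue, 0.0), MIN_PROB))
    return total / len(peptide)


def best_match(model: NextResidueModel, spectrum: object,
               candidates: Iterable[str]) -> str:
    """Return the database candidate with the highest score for the spectrum."""
    return max(candidates, key=lambda p: score_peptide(model, spectrum, p))
```

In a real search, `candidates` would typically be limited to database peptides whose mass is compatible with the spectrum's precursor mass, and the resulting scores would then be passed to a post-processor such as Percolator.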