A learned score function improves the power of mass spectrometry database search.

Affiliations

Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, WA 98195, USA.

Department of Genome Sciences, University of Washington, Seattle, WA 98195, USA.

Publication information

Bioinformatics. 2024 Jun 28;40(Suppl 1):i410-i417. doi: 10.1093/bioinformatics/btae218.

DOI:10.1093/bioinformatics/btae218

PMID:38940129

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC11211853/

Abstract

MOTIVATION

One of the core problems in the analysis of protein tandem mass spectrometry data is the peptide assignment problem: determining, for each observed spectrum, the peptide sequence that was responsible for generating the spectrum. Two primary classes of methods are used to solve this problem: database search and de novo peptide sequencing. State-of-the-art methods for de novo sequencing use machine learning methods, whereas most database search engines use hand-designed score functions to evaluate the quality of a match between an observed spectrum and a candidate peptide from the database. We hypothesized that machine learning models for de novo sequencing implicitly learn a score function that captures the relationship between peptides and spectra, and thus may be re-purposed as a score function for database search. Because this score function is trained from massive amounts of mass spectrometry data, it could potentially outperform existing, hand-designed database search tools.

RESULTS

To test this hypothesis, we re-engineered Casanovo, which has been shown to provide state-of-the-art de novo sequencing capabilities, to assign scores to given peptide-spectrum pairs. We then evaluated the statistical power of this Casanovo score function, Casanovo-DB, to detect peptides on a benchmark of three mass spectrometry runs from three different species. In addition, we show that re-scoring with the Percolator post-processor benefits Casanovo-DB more than other score functions, further increasing the number of detected peptides.
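The abstract does not say how the re-engineered model turns a peptide-spectrum pair into a score. As a minimal sketch only, one plausible approach is to feed the candidate peptide to an autoregressive sequencing model one residue at a time and average the log-probabilities the model assigns to each residue. Everything named below (`score_peptide`, `best_match`, `toy_model`, and representing a spectrum as a string) is hypothetical and is not Casanovo's API.

```python
import math
from typing import Callable, Dict, Iterable, Sequence

# Hypothetical interface (an assumption, not Casanovo's API): given a spectrum
# and the residues emitted so far, return a probability for each next residue.
NextResidueModel = Callable[[object, Sequence[str]], Dict[str, float]]

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def score_peptide(model: NextResidueModel, spectrum: object, peptide: str) -> float:
    """Mean per-residue log-probability of `peptide` given `spectrum`."""
    if not peptide:
        raise ValueError("peptide must be non-empty")
    total = 0.0
    for i, residue in enumerate(peptide):
        probs = model(spectrum, peptide[:i])   # condition on the true prefix
        p = probs.get(residue, 0.0)
        total += math.log(max(p, 1e-12))       # floor avoids log(0)
    return total / len(peptide)


def best_match(model: NextResidueModel, spectrum: object,
               candidates: Iterable[str]) -> str:
    """Return the highest-scoring candidate peptide for one spectrum."""
    return max(candidates, key=lambda pep: score_peptide(model, spectrum, pep))


def toy_model(spectrum: str, prefix: Sequence[str]) -> Dict[str, float]:
    """Toy stand-in for a trained model: the 'spectrum' is just a string that
    spells the true peptide, and the model favours the residue at each position."""
    if len(prefix) >= len(spectrum):
        return {aa: 1.0 / len(ALPHABET) for aa in ALPHABET}
    probs = {aa: 0.01 for aa in ALPHABET}
    if spectrum[len(prefix)] in probs:
        probs[spectrum[len(prefix)]] = 0.81    # 19 * 0.01 + 0.81 = 1.0
    return probs


winner = best_match(toy_model, "PEPTIDE", ["PEPTIDA", "AAAAAAA", "PEPTIDE"])
```

In a real database search the candidate list would come from the peptide database, typically restricted to peptides whose mass is close to the spectrum's precursor mass, and the resulting scores would then be passed to a false discovery rate procedure or a post-processor such as Percolator.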

