Leveraging the partition selection bias to achieve a high-quality clustering of mass spectra.

Author affiliations

Laboratory of Structural and Computational Proteomics, Carlos Chagas Institute, Fiocruz Paraná, Brazil.

Department of Chemical Biology, Leibniz - Forschungsinstitut für Molekulare Pharmakologie (FMP), Berlin, Germany.

Publication information

J Proteomics. 2021 Aug 15;245:104282. doi: 10.1016/j.jprot.2021.104282. Epub 2021 Jun 2.

DOI:10.1016/j.jprot.2021.104282

PMID:34089898

Abstract

In proteomics, the identification of peptides from mass spectral data can be mathematically described as the partitioning of mass spectra into clusters (i.e., groups of spectra derived from the same peptide). The way partitions are validated is just as important, having evolved side by side with the clustering algorithms themselves and given rise to many partition assessment measures. An assessment measure is said to have a selection bias if, and only if, the probability that a randomly chosen partition scores a high value depends on the number of clusters in the partition. In the context of clustering mass spectra, this might mislead the validation process to favor clustering algorithms that generate too many (or too few) spectral clusters, regardless of the underlying peptide sequence. A selection bias toward the number of peptides is desirable for proteomics as it estimates the number of peptides in a complex protein mixture. Here, we introduce an assessment measure that is purposely biased toward the number of peptide ion species. We also introduce a partition assessment framework for proteomics, called the Partition Assessment Tool, and demonstrate its importance by evaluating the performance of eight clustering algorithms on seven proteomics datasets while discussing the trade-offs involved.

SIGNIFICANCE: Clustering algorithms are widely adopted in proteomics for several tasks, such as speeding up search engines, generating consensus mass spectra, and aiding in the classification of proteomic profiles. Choosing which algorithm is most fit for the task at hand is not simple, as each algorithm has advantages and disadvantages; furthermore, specifying clustering parameters is also a necessary and fundamental step, for example, deciding whether to generate "pure clusters" or fewer clusters at the cost of accepting noise. With this as motivation, we verify the performance of several widely adopted algorithms on proteomic datasets and introduce a theoretical framework for drawing conclusions on which approach is suitable for the task at hand.
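The selection-bias definition in the abstract can be made concrete with a small simulation. The sketch below is only an illustration under stated assumptions: it uses cluster purity, a generic assessment measure chosen here for convenience (it is not the measure introduced in the paper, and it is not part of the Partition Assessment Tool), and a toy ground truth of 10 hypothetical peptides with 20 spectra each. It scores randomly generated partitions and shows that the average score rises with the number of clusters, which is the kind of dependence the definition describes.

```python
import random
from collections import Counter

def purity(truth, clusters):
    """Fraction of items whose true label is the majority label of their cluster."""
    by_cluster = {}
    for label, cluster in zip(truth, clusters):
        by_cluster.setdefault(cluster, Counter())[label] += 1
    majority_total = sum(max(counts.values()) for counts in by_cluster.values())
    return majority_total / len(truth)

def mean_random_purity(truth, k, trials=200, seed=0):
    """Average purity of `trials` random partitions that use at most k clusters."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # Assign every spectrum to one of k clusters uniformly at random.
        clusters = [rng.randrange(k) for _ in truth]
        total += purity(truth, clusters)
    return total / trials

# Toy ground truth (an assumption for illustration): 10 peptides, 20 spectra each.
truth = [peptide for peptide in range(10) for _ in range(20)]

for k in (2, 10, 50, 200):
    print(k, round(mean_random_purity(truth, k), 3))
```

Because the random partitions carry no information about the peptides, any increase in the mean score with k reflects the measure itself rather than clustering quality. A measure with this behavior would tend to reward algorithms that over-split the spectra.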

Similar articles

msCRUSH: Fast Tandem Mass Spectral Clustering Using Locality Sensitive Hashing.

J Proteome Res. 2019 Jan 4;18(1):147-158. doi: 10.1021/acs.jproteome.8b00448. Epub 2018 Dec 14.

In-depth analysis of protein inference algorithms using multiple search engines and well-defined metrics.

J Proteomics. 2017 Jan 6;150:170-182. doi: 10.1016/j.jprot.2016.08.002. Epub 2016 Aug 4.

Deep learning embedder method and tool for mass spectra similarity search.

J Proteomics. 2021 Feb 10;232:104070. doi: 10.1016/j.jprot.2020.104070. Epub 2020 Dec 8.

The APEX Quantitative Proteomics Tool: generating protein quantitation estimates from LC-MS/MS proteomics results.

BMC Bioinformatics. 2008 Dec 9;9:529. doi: 10.1186/1471-2105-9-529.

Comparative database search engine analysis on massive tandem mass spectra of pork-based food products for halal proteomics.

J Proteomics. 2021 Jun 15;241:104240. doi: 10.1016/j.jprot.2021.104240. Epub 2021 Apr 21.

Comparison and Evaluation of Clustering Algorithms for Tandem Mass Spectra.

J Proteome Res. 2017 Nov 3;16(11):4035-4044. doi: 10.1021/acs.jproteome.7b00427.

ClusterSheep: A Graphics Processing Unit-Accelerated Software Tool for Large-Scale Clustering of Tandem Mass Spectra from Shotgun Proteomics.

J Proteome Res. 2021 Dec 3;20(12):5359-5367. doi: 10.1021/acs.jproteome.1c00485. Epub 2021 Nov 4.

Implementation and application of a versatile clustering tool for tandem mass spectrometry data.

Proteomics. 2007 Sep;7(18):3245-58. doi: 10.1002/pmic.200700160.

A novel approach for clustering proteomics data using Bayesian fast Fourier transform.

Bioinformatics. 2005 May 15;21(10):2210-24. doi: 10.1093/bioinformatics/bti383. Epub 2005 Mar 15.
