Laboratory of Structural and Computational Proteomics, Carlos Chagas Institute, Fiocruz Paraná, Brazil.
Department of Chemical Biology, Leibniz-Forschungsinstitut für Molekulare Pharmakologie (FMP), Berlin, Germany.
J Proteomics. 2021 Aug 15;245:104282. doi: 10.1016/j.jprot.2021.104282. Epub 2021 Jun 2.
In proteomics, the identification of peptides from mass spectral data can be described mathematically as the partitioning of mass spectra into clusters (i.e., groups of spectra derived from the same peptide). How partitions are validated is just as important; validation has evolved side by side with the clustering algorithms themselves and has given rise to many partition assessment measures. An assessment measure is said to have a selection bias if, and only if, the probability that a randomly chosen partition scores a high value depends on the number of clusters in the partition. In the context of clustering mass spectra, this might mislead the validation process into favoring clustering algorithms that generate too many (or too few) spectral clusters, regardless of the underlying peptide sequence. A selection bias toward the number of peptides is desirable for proteomics, as it estimates the number of peptides in a complex protein mixture. Here, we introduce an assessment measure that is purposely biased toward the number of peptide ion species. We also introduce a partition assessment framework for proteomics, called the Partition Assessment Tool, and demonstrate its importance by evaluating the performance of eight clustering algorithms on seven proteomics datasets while discussing the trade-offs involved. SIGNIFICANCE: Clustering algorithms are widely adopted in proteomics for several tasks, such as speeding up search engines, generating consensus mass spectra, and aiding in the classification of proteomic profiles. Choosing the algorithm best suited to the task at hand is not simple, as each algorithm has advantages and disadvantages; furthermore, specifying clustering parameters is a necessary and fundamental step. For example, one must decide whether to generate "pure clusters" or fewer clusters at the cost of accepting noise.
With this as motivation, we verify the performance of several widely adopted algorithms on proteomic datasets and introduce a theoretical framework for drawing conclusions on which approach is suitable for the task at hand.
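
The notion of selection bias defined in the abstract can be illustrated with a minimal Python sketch. It assumes cluster purity as the assessment measure, chosen only because it is a familiar example of a measure whose score on random partitions grows with the number of clusters; it is not the measure introduced in this work, and the toy data (200 "spectra" drawn from 10 "peptides") and all function names are hypothetical.

```python
import random
from collections import Counter


def purity(truth, clusters):
    """Fraction of items that carry the majority true label of their cluster."""
    by_cluster = {}
    for label, cluster in zip(truth, clusters):
        by_cluster.setdefault(cluster, Counter())[label] += 1
    majority_total = sum(max(counts.values()) for counts in by_cluster.values())
    return majority_total / len(truth)


def mean_random_purity(truth, k, trials, rng):
    """Average purity over `trials` random partitions into at most k clusters."""
    total = 0.0
    for _ in range(trials):
        # Each item is assigned to a cluster uniformly at random,
        # so the partition carries no information about the true labels.
        clusters = [rng.randrange(k) for _ in truth]
        total += purity(truth, clusters)
    return total / trials


rng = random.Random(0)
# Hypothetical toy data: 200 "spectra", each drawn from one of 10 "peptides".
truth = [rng.randrange(10) for _ in range(200)]

scores = {k: mean_random_purity(truth, k, trials=50, rng=rng)
          for k in (2, 10, 50, 200)}
for k, score in scores.items():
    print(f"k={k:3d}  mean purity of random partitions = {score:.3f}")
```

Because the partitions are random, any systematic rise in the score with k reflects the measure rather than clustering quality, which is the dependence the definition of selection bias describes.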