Comparison and Evaluation of Clustering Algorithms for Tandem Mass Spectra.

Author information

Department of Statistics, TU Dortmund University, 44221 Dortmund, Germany.

Medizinische Fakultät, Medizinisches Proteom-Center, Ruhr-University Bochum, 44801 Bochum, Germany.

Publication information

J Proteome Res. 2017 Nov 3;16(11):4035-4044. doi: 10.1021/acs.jproteome.7b00427.

DOI:10.1021/acs.jproteome.7b00427

PMID:28959885

Abstract

In proteomics, liquid chromatography-tandem mass spectrometry (LC-MS/MS) is established for identifying peptides and proteins. Duplicated spectra, that is, multiple spectra of the same peptide, occur both in single MS/MS runs and in large spectral libraries. Clustering tandem mass spectra is used to find consensus spectra, with manifold applications. First, it speeds up database searches, as performed for instance by Mascot. Second, it helps to identify novel peptides across species. Third, it is used for quality control to detect wrongly annotated spectra. We compare different clustering algorithms based on the cosine distance between spectra. CAST, MS-Cluster, and PRIDE Cluster are popular algorithms to cluster tandem mass spectra. We add well-known algorithms for large data sets, hierarchical clustering, DBSCAN, and connected components of a graph, as well as the new method N-Cluster. All algorithms are evaluated on real data with varied parameter settings. Cluster results are compared with each other and with peptide annotations based on validation measures such as purity. Quality control, regarding the detection of wrongly (un)annotated spectra, is discussed for exemplary resulting clusters. N-Cluster proves to be highly competitive. All clustering results benefit from the so-called DISMS2 filter that integrates additional information, for example, on precursor mass.
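The abstract names three ingredients that are easy to illustrate on toy data: the cosine distance between spectra, clustering by connected components of a graph, and purity as a validation measure against peptide annotations. The following is a minimal sketch and not the authors' implementation. The function names, the fixed-width m/z binning, the bin width of 1.0, and the distance threshold of 0.1 are all assumptions made for illustration, and the DISMS2 filter is not modeled.

```python
import math
from collections import Counter, defaultdict


def bin_spectrum(peaks, bin_width=1.0):
    """Sum peak intensities into fixed-width m/z bins (assumed preprocessing)."""
    binned = defaultdict(float)
    for mz, intensity in peaks:
        binned[int(mz // bin_width)] += intensity
    return dict(binned)


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two sparse binned spectra."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # treat an empty spectrum as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def connected_components(spectra, threshold):
    """Link spectra closer than `threshold`; return one cluster label per spectrum."""
    parent = list(range(len(spectra)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(spectra)):
        for j in range(i + 1, len(spectra)):
            if cosine_distance(spectra[i], spectra[j]) < threshold:
                parent[find(i)] = find(j)
    return [find(i) for i in range(len(spectra))]


def purity(labels, annotations):
    """Fraction of spectra that carry the majority annotation of their cluster."""
    clusters = defaultdict(list)
    for label, annotation in zip(labels, annotations):
        clusters[label].append(annotation)
    majority = sum(Counter(m).most_common(1)[0][1] for m in clusters.values())
    return majority / len(labels)


# Toy data (invented): (m/z, intensity) peak lists for two hypothetical peptides.
raw = [
    [(100.2, 10.0), (200.7, 5.0), (300.1, 1.0)],  # peptide A
    [(100.4, 9.0), (200.5, 6.0), (300.3, 1.5)],   # peptide A, duplicate spectrum
    [(150.1, 8.0), (250.9, 7.0)],                 # peptide B
    [(150.3, 7.5), (250.6, 7.2), (400.0, 0.5)],   # peptide B, duplicate spectrum
]
annotations = ["A", "A", "B", "B"]

spectra = [bin_spectrum(peaks) for peaks in raw]
labels = connected_components(spectra, threshold=0.1)
print(labels, purity(labels, annotations))
```

The pairwise loop is quadratic in the number of spectra, so this form only suits small examples. A cluster that mixes annotations lowers purity, which is the kind of signal the abstract relates to detecting wrongly annotated spectra.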
