CAMS-RS: Clustering Algorithm for Large-Scale Mass Spectrometry Data Using Restricted Search Space and Intelligent Random Sampling.

Author information

Saeed Fahad, Hoffert Jason D, Knepper Mark A

Publication information

IEEE/ACM Trans Comput Biol Bioinform. 2014 Jan-Feb;11(1):128-41. doi: 10.1109/TCBB.2013.152.

DOI: 10.1109/TCBB.2013.152

PMID: 26355513

Full text link: https://pmc.ncbi.nlm.nih.gov/articles/PMC6143137/

Abstract

High-throughput mass spectrometers can produce massive amounts of redundant data at an astonishing rate, and many of the resulting spectra have a poor signal-to-noise (S/N) ratio. These low-S/N spectra may not be interpretable using conventional spectra-to-database matching techniques. In this paper, we present an efficient algorithm, CAMS-RS (Clustering Algorithm for Mass Spectra using Restricted Space and Sampling), for clustering raw mass spectrometry data. CAMS-RS utilizes a novel metric (called F-set) that exploits temporal and spatial patterns to accurately assess the similarity between two given spectra. The F-set similarity metric is independent of retention time and allows clustering of mass spectrometry data from independent LC-MS/MS runs. A novel restricted search space strategy is devised to limit the number of spectrum comparisons. An intelligent sampling method is executed on individual bins, and the per-bin results are merged to make the final clusters. Our experiments, using experimentally generated data sets, show that the proposed algorithm is able to cluster spectra with high accuracy and is helpful in interpreting low-S/N spectra. The CAMS-RS algorithm is highly scalable with an increasing number of spectra, and our implementation allows clustering of up to a million spectra within minutes.
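
To make the benefit of a restricted search space concrete, the sketch below counts how many pairwise comparisons remain when spectra are compared only within bins rather than all-against-all. It is an illustration under stated assumptions, not the CAMS-RS method: the abstract does not say how bins are formed, so binning by precursor m/z with a fixed width is assumed here, and the names bin_by_precursor, pair_count, and comparison_counts are hypothetical. The F-set metric and the sampling step are not modeled.

```python
from collections import defaultdict


def bin_by_precursor(precursor_mzs, bin_width):
    """Group spectrum indices into fixed-width m/z bins.

    Assumption: precursor m/z is used as the binning key purely for
    illustration; the abstract does not specify the actual key.
    """
    bins = defaultdict(list)
    for idx, mz in enumerate(precursor_mzs):
        bins[int(mz // bin_width)].append(idx)
    return dict(bins)


def pair_count(n):
    """Number of unordered pairs among n items."""
    return n * (n - 1) // 2


def comparison_counts(precursor_mzs, bin_width):
    """Return (all-against-all pairs, pairs when restricted to within-bin)."""
    bins = bin_by_precursor(precursor_mzs, bin_width)
    restricted = sum(pair_count(len(members)) for members in bins.values())
    return pair_count(len(precursor_mzs)), restricted


# Hypothetical precursor m/z values for six spectra.
mzs = [400.2, 400.7, 401.1, 650.3, 650.9, 900.5]
all_pairs, restricted_pairs = comparison_counts(mzs, bin_width=2.0)
print(all_pairs, restricted_pairs)
```

With these made-up values the six spectra fall into bins of size 3, 2, and 1, so only 4 of the 15 possible pairs would be compared. Fixed-width bins can separate near-identical spectra that straddle a bin edge, which is one reason a real implementation would need a more careful binning and merging strategy.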

Similar articles

An Efficient Algorithm for Clustering of Large-Scale Mass Spectrometry Data.

Proceedings (IEEE Int Conf Bioinformatics Biomed). 2012 Oct 4:1-4. doi: 10.1109/BIBM.2012.6392738.

Exploiting Thread-Level and Instruction-Level Parallelism to Cluster Mass Spectrometry Data using Multicore Architectures.

Netw Model Anal Health Inform Bioinform. 2014 Apr;3:54. doi: 10.1007/s13721-014-0054-1.

msCRUSH: Fast Tandem Mass Spectral Clustering Using Locality Sensitive Hashing.

J Proteome Res. 2019 Jan 4;18(1):147-158. doi: 10.1021/acs.jproteome.8b00448. Epub 2018 Dec 14.

Enhanced peptide quantification using spectral count clustering and cluster abundance.

BMC Bioinformatics. 2011 Oct 28;12:423. doi: 10.1186/1471-2105-12-423.

Clustering and filtering tandem mass spectra acquired in data-independent mode.

J Am Soc Mass Spectrom. 2013 Dec;24(12):1862-71. doi: 10.1007/s13361-013-0720-z. Epub 2013 Sep 5.

Implementation and application of a versatile clustering tool for tandem mass spectrometry data.

Proteomics. 2007 Sep;7(18):3245-58. doi: 10.1002/pmic.200700160.

Comparison and Evaluation of Clustering Algorithms for Tandem Mass Spectra.

J Proteome Res. 2017 Nov 3;16(11):4035-4044. doi: 10.1021/acs.jproteome.7b00427.

Deep learning embedder method and tool for mass spectra similarity search.

J Proteomics. 2021 Feb 10;232:104070. doi: 10.1016/j.jprot.2020.104070. Epub 2020 Dec 8.

Response to "Comparison and Evaluation of Clustering Algorithms for Tandem Mass Spectra".

J Proteome Res. 2018 May 4;17(5):1993-1996. doi: 10.1021/acs.jproteome.7b00824. Epub 2018 Apr 25.

Cited by

CHICKN: extraction of peptide chromatographic elution profiles from large scale mass spectrometry data by means of Wasserstein compressive hierarchical cluster analysis.

BMC Bioinformatics. 2021 Feb 12;22(1):68. doi: 10.1186/s12859-021-03969-0.

Methods for Proteogenomics Data Analysis, Challenges, and Scalability Bottlenecks: A Survey.

IEEE Access. 2021;9:5497-5516. doi: 10.1109/ACCESS.2020.3047588. Epub 2020 Dec 25.

GPU-DAEMON: GPU algorithm design, data management & optimization template for array based big omics data.

Comput Biol Med. 2018 Oct 1;101:163-173. doi: 10.1016/j.compbiomed.2018.08.015. Epub 2018 Aug 16.

An Out-of-Core GPU based dimensionality reduction algorithm for Big Mass Spectrometry Data and its application in bottom-up Proteomics.

ACM BCB. 2017 Aug;2017:550-555. doi: 10.1145/3107411.3107466.

Soil and leaf litter metaproteomics-a brief guideline from sampling to understanding.

FEMS Microbiol Ecol. 2016 Nov;92(11). doi: 10.1093/femsec/fiw180. Epub 2016 Aug 21.

Exploiting Thread-Level and Instruction-Level Parallelism to Cluster Mass Spectrometry Data using Multicore Architectures.

Netw Model Anal Health Inform Bioinform. 2014 Apr;3:54. doi: 10.1007/s13721-014-0054-1.

