Large-scale tandem mass spectrum clustering using fast nearest neighbor searching.

Authors

Wout Bittremieux, Kris Laukens, William Stafford Noble, Pieter C. Dorrestein

Affiliations

Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California San Diego, La Jolla, California, United States.

Department of Computer Science, University of Antwerp, Antwerp, Belgium.

Publication information

Rapid Commun Mass Spectrom. 2025 May;39 Suppl 1(Suppl 1):e9153. doi: 10.1002/rcm.9153. Epub 2021 Jul 20.

DOI:10.1002/rcm.9153

PMID:34169593

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC8709870/

Abstract

RATIONALE

Advanced algorithmic solutions are necessary to process the ever-increasing amounts of mass spectrometry data that are being generated. In this study, we describe the falcon spectrum clustering tool for efficient clustering of millions of MS/MS spectra.

METHODS

falcon succeeds in efficiently clustering large amounts of mass spectral data using advanced techniques for fast spectrum similarity searching. First, high-resolution spectra are binned and converted to low-dimensional vectors using feature hashing. Next, the spectrum vectors are used to construct nearest neighbor indexes for fast similarity searching. The nearest neighbor indexes are used to efficiently compute a sparse pairwise distance matrix without having to exhaustively perform all pairwise spectrum comparisons within the relevant precursor mass tolerance. Finally, density-based clustering is performed to group similar spectra into clusters.
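The steps above can be illustrated with a minimal, standard-library-only sketch. It is not falcon's implementation: all function names, the bin width, the vector dimension, and the thresholds are hypothetical choices made for illustration. In particular, the nearest neighbor index is replaced here by a plain scan over spectra sorted by precursor m/z, which is exactly the exhaustive comparison that falcon avoids, and CRC32 stands in for whatever hash function the tool actually uses.

```python
import math
import zlib
from collections import defaultdict


def hash_spectrum(peaks, bin_width=0.05, dim=64):
    """Bin (m/z, intensity) peaks and fold the bins into a unit vector of length `dim`."""
    vec = [0.0] * dim
    for mz, intensity in peaks:
        bin_index = int(mz / bin_width)
        # Feature hashing: a deterministic hash of the bin index picks the slot.
        slot = zlib.crc32(str(bin_index).encode()) % dim
        vec[slot] += intensity
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else vec


def cosine_distance(u, v):
    """Cosine distance between two unit-length vectors."""
    return 1.0 - sum(a * b for a, b in zip(u, v))


def sparse_distances(precursors, vectors, mz_tol=0.5, max_dist=0.3):
    """Return {(i, j): distance} only for pairs within the precursor tolerance
    and below `max_dist`. Assumption: a brute-force scan replaces the index."""
    order = sorted(range(len(precursors)), key=lambda i: precursors[i])
    pairs = {}
    for a, i in enumerate(order):
        for j in order[a + 1:]:
            if precursors[j] - precursors[i] > mz_tol:
                break  # sorted by precursor m/z, so no later spectrum can match
            d = cosine_distance(vectors[i], vectors[j])
            if d <= max_dist:
                pairs[(min(i, j), max(i, j))] = d
    return pairs


def dbscan(n, pairs, min_samples=2):
    """Density-based clustering on the sparse pairs; label -1 marks noise."""
    neighbors = defaultdict(set)
    for i, j in pairs:
        neighbors[i].add(j)
        neighbors[j].add(i)
    labels = [-1] * n
    cluster = 0
    for i in range(n):
        # Start a new cluster only from an unlabelled core point.
        if labels[i] != -1 or len(neighbors[i]) + 1 < min_samples:
            continue
        labels[i] = cluster
        stack = [i]
        while stack:
            p = stack.pop()
            if len(neighbors[p]) + 1 < min_samples:
                continue  # border point: keep its label but do not expand
            for q in neighbors[p]:
                if labels[q] == -1:
                    labels[q] = cluster
                    stack.append(q)
        cluster += 1
    return labels


# Toy data: spectra 0 and 1 are near-duplicates, spectrum 2 has different peaks,
# and spectrum 3 repeats the peaks of spectrum 0 at a distant precursor m/z.
precursors = [500.25, 500.26, 500.30, 800.00]
spectra = [
    [(100.02, 10.0), (200.02, 50.0), (300.02, 30.0)],
    [(100.03, 12.0), (200.03, 48.0), (300.03, 29.0)],
    [(150.02, 40.0), (250.02, 60.0), (350.02, 5.0)],
    [(100.02, 10.0), (200.02, 50.0), (300.02, 30.0)],
]
vectors = [hash_spectrum(s) for s in spectra]
pairs = sparse_distances(precursors, vectors)
labels = dbscan(len(spectra), pairs)
print(labels)
```

In this toy run, spectra 0 and 1 fall into the same cluster, while spectrum 3 is never compared with them because its precursor m/z lies outside the tolerance, so it stays unclustered. Hash collisions can make unrelated bins share a slot, which is the price of the reduced dimensionality; a larger `dim` makes this less likely.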

RESULTS

Several state-of-the-art spectrum clustering tools were evaluated using a large draft human proteome data set consisting of 25 million spectra, indicating that alternative tools produce clustering results with different characteristics. Notably, falcon generates larger highly pure clusters than alternative tools, leading to a larger reduction in data volume without the loss of relevant information for more efficient downstream processing.

CONCLUSIONS

falcon is a highly efficient spectrum clustering tool that is publicly available as open source under the permissive BSD license at https://github.com/bittremieux/falcon.
