Large-scale tandem mass spectrum clustering using fast nearest neighbor searching.

Author information

Bittremieux Wout, Laukens Kris, Noble William Stafford, Dorrestein Pieter C

Affiliations

Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California San Diego, La Jolla, California, United States.

Department of Computer Science, University of Antwerp, Antwerp, Belgium.

Publication information

Rapid Commun Mass Spectrom. 2025 May;39 Suppl 1(Suppl 1):e9153. doi: 10.1002/rcm.9153. Epub 2021 Jul 20.

Abstract

RATIONALE

Advanced algorithmic solutions are necessary to process the ever-increasing amounts of mass spectrometry data that are being generated. In this study, we describe the falcon spectrum clustering tool for efficient clustering of millions of MS/MS spectra.

METHODS

falcon succeeds in efficiently clustering large amounts of mass spectral data using advanced techniques for fast spectrum similarity searching. First, high-resolution spectra are binned and converted to low-dimensional vectors using feature hashing. Next, the spectrum vectors are used to construct nearest neighbor indexes for fast similarity searching. The nearest neighbor indexes are used to efficiently compute a sparse pairwise distance matrix without having to exhaustively perform all pairwise spectrum comparisons within the relevant precursor mass tolerance. Finally, density-based clustering is performed to group similar spectra into clusters.
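The four steps above can be illustrated with a minimal, standard-library-only sketch. It is not falcon's implementation. The bin width, vector dimension, precursor tolerance, distance threshold, the MD5-based hash, and all function names are assumptions chosen for illustration. A brute-force scan over precursor-sorted spectra stands in for the nearest neighbor index, and a small hand-written DBSCAN stands in for the density-based clustering step.

```python
import hashlib
import math
from collections import defaultdict


def hash_spectrum(peaks, bin_width=0.05, dim=64):
    """Bin (m/z, intensity) peaks, then hash each bin into a dim-sized unit vector."""
    vec = [0.0] * dim
    for mz, intensity in peaks:
        bin_idx = int(mz / bin_width)
        # MD5 is only a convenient, deterministic stand-in hash function.
        digest = hashlib.md5(str(bin_idx).encode()).digest()
        vec[int.from_bytes(digest[:4], "little") % dim] += intensity
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else vec


def cosine_distance(u, v):
    """Cosine distance between two unit-length vectors."""
    return 1.0 - sum(a * b for a, b in zip(u, v))


def sparse_neighbors(precursors, vectors, mz_tol=0.5, eps=0.1):
    """Collect pairs within eps, comparing only spectra within the precursor tolerance.

    Brute force within the tolerance window; a real nearest neighbor index
    would avoid scanning every candidate in the window.
    """
    order = sorted(range(len(precursors)), key=lambda i: precursors[i])
    neighbors = defaultdict(set)
    for pos, i in enumerate(order):
        for j in order[pos + 1:]:
            if precursors[j] - precursors[i] > mz_tol:
                break  # sorted by precursor m/z, so later spectra are farther away
            if cosine_distance(vectors[i], vectors[j]) <= eps:
                neighbors[i].add(j)
                neighbors[j].add(i)
    return neighbors


def dbscan(n, neighbors, min_samples=2):
    """Minimal DBSCAN over precomputed neighbor sets; label -1 marks noise."""
    labels = [-1] * n
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or len(neighbors[i]) + 1 < min_samples:
            continue  # already assigned, or not a core point
        labels[i] = cluster
        stack = [i]
        while stack:
            p = stack.pop()
            if len(neighbors[p]) + 1 < min_samples:
                continue  # border point: belongs to the cluster but does not expand it
            for q in neighbors[p]:
                if labels[q] == -1:
                    labels[q] = cluster
                    stack.append(q)
        cluster += 1
    return labels


# Toy input: (precursor m/z, [(fragment m/z, intensity), ...]); values are made up.
spectra = [
    (500.25, [(100.02, 1.0), (200.02, 0.5), (300.02, 0.8)]),
    (500.26, [(100.03, 0.9), (200.03, 0.6), (300.03, 0.8)]),
    (500.27, [(150.02, 1.0), (250.02, 0.7), (350.02, 0.4)]),
    (800.00, [(100.02, 1.0), (200.02, 0.5), (300.02, 0.8)]),
]
precursors = [precursor for precursor, _ in spectra]
vectors = [hash_spectrum(peaks) for _, peaks in spectra]
labels = dbscan(len(spectra), sparse_neighbors(precursors, vectors))
print(labels)
```

In this toy input the first two spectra share fragment bins and a precursor m/z, so they end up in the same cluster. The last spectrum has the same fragments as the first but a distant precursor m/z, so it is never compared and stays labeled as noise. The third spectrum is compared but has different fragments, so it should normally stay unclustered unless hash collisions happen to make it look similar.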

RESULTS

Several state-of-the-art spectrum clustering tools were evaluated using a large draft human proteome data set consisting of 25 million spectra, indicating that alternative tools produce clustering results with different characteristics. Notably, falcon generates larger highly pure clusters than alternative tools, leading to a larger reduction in data volume without the loss of relevant information for more efficient downstream processing.

CONCLUSIONS

falcon is a highly efficient spectrum clustering tool, which is publicly available as open source software under the permissive BSD license at https://github.com/bittremieux/falcon.
