Bittremieux Wout, Laukens Kris, Noble William Stafford, Dorrestein Pieter C
Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California San Diego, La Jolla, California, United States.
Department of Computer Science, University of Antwerp, Antwerp, Belgium.
Rapid Commun Mass Spectrom. 2025 May;39 Suppl 1(Suppl 1):e9153. doi: 10.1002/rcm.9153. Epub 2021 Jul 20.
Advanced algorithmic solutions are necessary to process the ever-increasing amounts of mass spectrometry data that are being generated. In this study, we describe the falcon spectrum clustering tool for efficient clustering of millions of MS/MS spectra.
falcon succeeds in efficiently clustering large amounts of mass spectral data using advanced techniques for fast spectrum similarity searching. First, high-resolution spectra are binned and converted to low-dimensional vectors using feature hashing. Next, the spectrum vectors are used to construct nearest neighbor indexes for fast similarity searching. The nearest neighbor indexes are used to efficiently compute a sparse pairwise distance matrix without having to exhaustively perform all pairwise spectrum comparisons within the relevant precursor mass tolerance. Finally, density-based clustering is performed to group similar spectra into clusters.
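The steps above can be illustrated with a minimal, pure-Python sketch. It is not falcon's implementation: the function names (`hash_vector`, `sparse_distances`, `density_cluster`), the use of `zlib.crc32` as the hash function, the bin width, vector dimension, and distance thresholds are all assumptions chosen for illustration. In particular, the brute-force scan over a precursor-sorted window stands in for the nearest neighbor index described in the abstract, and precursor charge is ignored.

```python
import math
import zlib
from collections import deque


def hash_vector(peaks, bin_width=0.05, dim=64):
    """Bin (m/z, intensity) peaks, then hash each bin index into a unit-length vector of size `dim`."""
    vec = [0.0] * dim
    for mz, intensity in peaks:
        bin_index = int(mz / bin_width)
        # Feature hashing: map the (potentially huge) bin index to one of `dim` slots.
        # crc32 is an assumed stand-in for whatever hash function a real tool uses.
        slot = zlib.crc32(str(bin_index).encode()) % dim
        vec[slot] += intensity
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else vec


def cosine_distance(u, v):
    """Cosine distance between two unit-length vectors."""
    return 1.0 - sum(a * b for a, b in zip(u, v))


def sparse_distances(precursors, vectors, mz_tol=0.05, max_dist=0.1):
    """Return {i: {j: distance}} for pairs within the precursor tolerance and `max_dist`.

    Brute force inside a precursor-sorted window; a real tool would query a
    nearest neighbor index here instead.
    """
    order = sorted(range(len(precursors)), key=lambda i: precursors[i])
    neighbors = {i: {} for i in order}
    for pos, i in enumerate(order):
        for j in order[pos + 1:]:
            if precursors[j] - precursors[i] > mz_tol:
                break  # sorted by precursor m/z, so no later spectrum can match
            dist = cosine_distance(vectors[i], vectors[j])
            if dist <= max_dist:
                neighbors[i][j] = dist
                neighbors[j][i] = dist
    return neighbors


def density_cluster(neighbors, min_samples=2):
    """DBSCAN-style clustering on the sparse neighbor map; label -1 marks noise."""
    labels = {i: -1 for i in neighbors}
    cluster = 0
    for seed in sorted(neighbors):
        # A core point has at least `min_samples` points in its neighborhood, itself included.
        if labels[seed] != -1 or len(neighbors[seed]) + 1 < min_samples:
            continue
        labels[seed] = cluster
        queue = deque([seed])
        while queue:
            point = queue.popleft()
            if len(neighbors[point]) + 1 < min_samples:
                continue  # border point: keeps its label but does not expand the cluster
            for other in neighbors[point]:
                if labels[other] == -1:
                    labels[other] = cluster
                    queue.append(other)
        cluster += 1
    return [labels[i] for i in range(len(neighbors))]


# Toy input: (precursor m/z, [(fragment m/z, intensity), ...]); values are made up.
spectra = [
    (500.25, [(100.02, 10.0), (200.02, 50.0), (300.02, 20.0)]),
    (500.26, [(100.03, 11.0), (200.03, 48.0), (300.03, 21.0)]),
    (800.40, [(150.02, 5.0), (420.02, 60.0)]),
]
precursors = [precursor for precursor, _ in spectra]
vectors = [hash_vector(peaks) for _, peaks in spectra]
labels = density_cluster(sparse_distances(precursors, vectors))
print(labels)  # [0, 0, -1]
```

The first two toy spectra share a precursor window and nearly identical binned peaks, so they form one cluster. The third has no neighbor within the precursor tolerance and is labeled as noise. Because only pairs inside the tolerance window are compared, the distance map stays sparse as the number of spectra grows.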
Several state-of-the-art spectrum clustering tools were evaluated using a large draft human proteome data set consisting of 25 million spectra. The evaluation indicated that the tools produce clustering results with different characteristics. Notably, falcon generates larger highly pure clusters than alternative tools, leading to a larger reduction in data volume without the loss of relevant information for more efficient downstream processing.
falcon is a highly efficient spectrum clustering tool that is publicly available as open source software under the permissive BSD license at https://github.com/bittremieux/falcon.