Department of Computer Science and Engineering, University of California, San Diego, La Jolla, California 92093, United States.
Department of Electrical and Computer Engineering, University of California, San Diego, La Jolla, California 92093, United States.
J Proteome Res. 2023 Jun 2;22(6):1639-1648. doi: 10.1021/acs.jproteome.2c00612. Epub 2023 May 11.
Current shotgun proteomics experiments can produce gigabytes of mass spectrometry data per hour, so processing these data volumes has become progressively more challenging. Spectral clustering is an effective way to speed up downstream data processing: it merges highly similar spectra to minimize data redundancy. However, because state-of-the-art spectral clustering tools fail to achieve optimal runtimes, clustering simply moves the processing bottleneck rather than removing it. In this work, we present HyperSpec, a fast spectral clustering tool based on hyperdimensional computing (HDC). HDC shows promising clustering capability while requiring only lightweight, highly parallel binary operations that can be optimized for low-level hardware architectures. This makes it possible to run HyperSpec on graphics processing units and achieve extremely efficient spectral clustering. Additionally, HyperSpec includes optimized data preprocessing modules that reduce spectrum preprocessing time, a critical bottleneck during spectral clustering. In experiments on various mass spectrometry data sets, HyperSpec produces clustering quality comparable to that of state-of-the-art spectral clustering tools while achieving speedups by orders of magnitude, shortening the clustering runtime for over 21 million spectra from 4 h to only 24 min.
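To make the "lightweight binary operations" concrete, the following is a minimal sketch of a generic HDC encoding and comparison, not HyperSpec's actual implementation. Everything in it is an assumption for illustration: the dimensionality `DIM`, the function names, the representation of a spectrum as (m/z bin, intensity level) pairs, and the use of independent random vectors for intensity levels (practical HDC encoders typically use correlated level vectors). Each peak is bound to its intensity level with XOR, the bound vectors are bundled by a per-bit majority vote, and spectra are compared by Hamming distance.

```python
import random

DIM = 2048  # hypervector dimensionality (assumed value, for illustration only)


def random_hv(rng):
    """Return a random binary hypervector stored as a DIM-bit Python int."""
    return rng.getrandbits(DIM)


def encode(peaks, bin_hvs, level_hvs):
    """Encode a spectrum given as (mz_bin, intensity_level) pairs.

    Binding: XOR of the m/z-bin vector with the intensity-level vector.
    Bundling: per-bit majority vote over all bound peak vectors.
    """
    counts = [0] * DIM
    for mz_bin, level in peaks:
        bound = bin_hvs[mz_bin] ^ level_hvs[level]
        for i in range(DIM):
            counts[i] += (bound >> i) & 1
    threshold = len(peaks) / 2  # an odd peak count avoids ties
    hv = 0
    for i, count in enumerate(counts):
        if count > threshold:
            hv |= 1 << i
    return hv


def hamming(a, b):
    """Number of differing bits between two hypervectors."""
    return bin(a ^ b).count("1")


rng = random.Random(0)
bin_hvs = [random_hv(rng) for _ in range(100)]   # one vector per m/z bin
level_hvs = [random_hv(rng) for _ in range(4)]   # one vector per intensity level

spec_a = [(10, 3), (25, 1), (40, 2), (57, 3), (80, 0)]
spec_b = [(10, 3), (25, 1), (40, 2), (57, 3), (81, 0)]  # one peak shifted
spec_c = [(3, 0), (17, 2), (33, 1), (64, 3), (92, 2)]   # unrelated peaks

hv_a = encode(spec_a, bin_hvs, level_hvs)
hv_b = encode(spec_b, bin_hvs, level_hvs)
hv_c = encode(spec_c, bin_hvs, level_hvs)

print("a vs b:", hamming(hv_a, hv_b))
print("a vs c:", hamming(hv_a, hv_c))
```

In this toy setup, the two spectra that share four of five peaks should land much closer in Hamming distance than the unrelated pair, which stays near DIM/2. Because the comparison reduces to XOR and a bit count, it maps naturally onto wide parallel hardware such as GPUs.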