Department of Chemical and Biological Engineering, The Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong.
J Proteome Res. 2021 Dec 3;20(12):5359-5367. doi: 10.1021/acs.jproteome.1c00485. Epub 2021 Nov 4.
Modern shotgun proteomics experiments generate gigabytes of spectra every hour, only a fraction of which are used to draw biological conclusions. Rather than being stored as flat files in public data repositories, these data can be organized to facilitate reuse. Clustering spectra by similarity can help in building high-quality spectral libraries, correcting identification errors, and highlighting frequently observed but unidentified spectra. However, large-scale clustering is time-consuming. Here, we present ClusterSheep, a method that uses graphics processing units (GPUs) to accelerate the process. Unlike previously proposed algorithms for this purpose, our method performs true pairwise comparison of all spectra within a precursor mass-to-charge ratio tolerance, thereby preserving the full cluster structures. ClusterSheep was benchmarked against three previously reported clustering tools: MS-Cluster, MaRaCluster, and msCRUSH. The software also serves as an interactive visualization tool with a persistent state, enabling the user to explore the resulting clusters visually and retrieve the clustering results as desired.
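To make the idea of pairwise comparison within a precursor mass-to-charge tolerance concrete, the following is a minimal CPU sketch in Python. It is not the ClusterSheep implementation and does not use a GPU. The abstract does not specify the similarity measure, the thresholds, or how clusters are extracted, so the cosine similarity on binned peaks, the `mz_tol` and `min_sim` parameters, and the union-find grouping of connected spectra are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity of two binned spectra given as {bin: intensity} dicts."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def cluster_spectra(spectra, mz_tol=1.0, min_sim=0.7):
    """Group spectra into connected components of the similarity graph.

    spectra: list of (precursor_mz, {bin: intensity}) tuples.
    mz_tol, min_sim: hypothetical parameters, not taken from the paper.
    Returns a sorted list of clusters, each a list of spectrum indices.
    """
    # Sort by precursor m/z so every in-tolerance partner is contiguous.
    order = sorted(range(len(spectra)), key=lambda i: spectra[i][0])
    parent = list(range(len(spectra)))

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for pos, i in enumerate(order):
        for q in range(pos + 1, len(order)):
            j = order[q]
            if spectra[j][0] - spectra[i][0] > mz_tol:
                break  # all later spectra are even further away in m/z
            # Every pair inside the tolerance window is compared.
            if cosine(spectra[i][1], spectra[j][1]) >= min_sim:
                parent[find(i)] = find(j)

    groups = defaultdict(list)
    for i in range(len(spectra)):
        groups[find(i)].append(i)
    return sorted(groups.values())
```

In this sketch, no pair inside the tolerance window is skipped, so an edge is recorded for every pair that passes the threshold. The inner comparison loop is the part that grows quadratically within each window, which is presumably the kind of workload that benefits from GPU acceleration.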