Saeed Fahad, Hoffert Jason D, Knepper Mark A
IEEE/ACM Trans Comput Biol Bioinform. 2014 Jan-Feb;11(1):128-41. doi: 10.1109/TCBB.2013.152.
High-throughput mass spectrometers can produce massive amounts of redundant data at an astonishing rate, and many of the resulting spectra have a poor signal-to-noise (S/N) ratio. These low-S/N spectra may not be interpreted by conventional spectra-to-database matching techniques. In this paper, we present an efficient algorithm, CAMS-RS (Clustering Algorithm for Mass Spectra using Restricted Space and Sampling), for clustering raw mass spectrometry data. CAMS-RS uses a novel metric, called F-set, that exploits temporal and spatial patterns to accurately assess the similarity between two given spectra. The F-set similarity metric is independent of retention time and allows clustering of mass spectrometry data from independent LC-MS/MS runs. A novel restricted search space strategy limits the number of spectrum-to-spectrum comparisons. An intelligent sampling method is executed on individual bins, and the per-bin results are merged to form the final clusters. Our experiments on experimentally generated data sets show that the proposed algorithm clusters spectra with high accuracy and helps in interpreting low-S/N spectra. CAMS-RS scales well as the number of spectra increases, and our implementation clusters up to a million spectra within minutes.
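The restricted-search-space idea can be illustrated with a minimal sketch, which is not the CAMS-RS implementation. The abstract does not define the F-set metric or the sampling step, so this sketch substitutes a plain Jaccard overlap of discretized fragment peaks for the similarity and omits sampling entirely. The function names, the choice of precursor m/z as the binning key, the bin widths, and the threshold are all assumptions made for illustration.

```python
from collections import defaultdict


def peak_set(fragment_mzs, width=1.0):
    """Discretize fragment m/z values into integer bins.

    Assumption: a simple stand-in representation, NOT the paper's F-set.
    """
    return {int(mz / width) for mz in fragment_mzs}


def similarity(a, b):
    """Jaccard overlap of two discretized peak sets (assumed metric)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster(spectra, precursor_width=2.0, threshold=0.5):
    """Greedy clustering with comparisons restricted to one bin at a time.

    spectra: list of (precursor_mz, [fragment m/z values]).
    Returns a list of clusters, each a list of indices into `spectra`.
    """
    # Restrict the search space: spectra are compared only with others
    # that fall into the same precursor m/z bin (hypothetical binning key).
    bins = defaultdict(list)
    for idx, (precursor_mz, _) in enumerate(spectra):
        bins[int(precursor_mz / precursor_width)].append(idx)

    clusters = []
    for key in sorted(bins):
        local = []  # (representative peak set, member indices) per cluster
        for idx in bins[key]:
            peaks = peak_set(spectra[idx][1])
            for representative, members in local:
                if similarity(representative, peaks) >= threshold:
                    members.append(idx)
                    break
            else:
                # No sufficiently similar cluster in this bin: start one.
                local.append((peaks, [idx]))
        # Merge the per-bin results into the final cluster list.
        clusters.extend(members for _, members in local)
    return clusters


if __name__ == "__main__":
    example = [
        (500.2, [100.1, 200.2, 300.3]),
        (500.4, [100.3, 200.4, 300.1]),
        (500.3, [150.0, 250.0, 350.0]),
        (800.0, [100.1, 200.2, 300.3]),
    ]
    print(cluster(example))
```

Because each spectrum is compared only against cluster representatives in its own bin, the number of comparisons grows with bin size, not with the total number of spectra. Fixed-width binning can split similar spectra that straddle a bin edge. The sketch does not handle that case.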