College of Computer Science and Electronic Engineering, Hunan University, Lushannan Road, Changsha, 410082, China.
School of Computer Science and Engineering, Nanyang Technological University, Nanyang Road, Singapore, 639798, Singapore.
BMC Bioinformatics. 2019 Jul 17;20(1):397. doi: 10.1186/s12859-019-2980-5.
Tandem mass spectrometry (MS/MS)-based database searching is a widely accepted and widely used method for peptide identification in shotgun proteomics. However, because of the rapid growth of spectral data produced by advanced mass spectrometry and the sharp increase in the number of modified and digested peptides identified in recent years, current peptide database search methods cannot process large MS/MS spectra datasets both rapidly and thoroughly. A breakthrough in efficient database search algorithms is therefore crucial for peptide identification in computational proteomics.
This paper presents MCtandem, an efficient tool for large-scale peptide identification on the Intel Many Integrated Core (MIC) architecture. To support big-data processing, MCtandem's design introduces a novel parallel match scoring algorithm, named MIC-SDP (spectrum dot product), together with its two-level parallelization. In addition, a series of optimization strategies on both the host CPU side and the MIC side, including pre-fetching, an optimized communication-overlapping scheme, multithreading, and hyper-threading, are exploited to improve execution performance.
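For readers unfamiliar with the scoring step, the following is a minimal serial sketch of a generic spectrum dot product in C++. It is not the MIC-SDP implementation: the `Peak` struct, the helper names `bin_spectrum`, `spectrum_dot_product`, and `score_candidates`, and the fixed-width binning scheme are all assumptions made for illustration. The loop over candidates and the loop over bins mark where a two-level parallel scheme could plausibly distribute work across threads and vector lanes.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// One spectral peak: mass-to-charge ratio (m/z) and intensity.
struct Peak {
    double mz;
    double intensity;
};

// Hypothetical helper: bin a peak list into a dense intensity vector.
// Peaks outside [0, max_mz) are ignored; intensities falling into the
// same bin are summed. The fixed bin width is an illustrative assumption.
std::vector<double> bin_spectrum(const std::vector<Peak>& peaks,
                                 double bin_width, double max_mz) {
    assert(bin_width > 0.0 && max_mz > 0.0);
    const std::size_t n_bins =
        static_cast<std::size_t>(std::ceil(max_mz / bin_width));
    std::vector<double> bins(n_bins, 0.0);
    for (const Peak& p : peaks) {
        if (p.mz < 0.0 || p.mz >= max_mz) continue;
        const std::size_t idx = static_cast<std::size_t>(p.mz / bin_width);
        if (idx < n_bins) bins[idx] += p.intensity;
    }
    return bins;
}

// Dot product of two binned spectra of equal length.
// This inner loop is the fine-grained level (a candidate for vectorization).
double spectrum_dot_product(const std::vector<double>& a,
                            const std::vector<double>& b) {
    assert(a.size() == b.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) sum += a[i] * b[i];
    return sum;
}

// Score one experimental spectrum against every candidate spectrum.
// This outer loop is the coarse-grained level (a candidate for threading);
// it runs serially here.
std::vector<double> score_candidates(
        const std::vector<double>& experimental,
        const std::vector<std::vector<double>>& candidates) {
    std::vector<double> scores;
    scores.reserve(candidates.size());
    for (const auto& c : candidates) {
        scores.push_back(spectrum_dot_product(experimental, c));
    }
    return scores;
}
```

In this sketch, a higher score means that more intensity in the experimental spectrum coincides with peaks predicted for the candidate peptide. A real search engine would typically add mass tolerance handling, intensity normalization, and statistical scoring on top of the raw dot product.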
For a fair comparison, we first set up experiments and verified a 28-fold speedup on a single MIC over the original CPU-based implementation. We then executed MCtandem on a very large dataset on an MIC cluster (a component of the Tianhe-2 supercomputer) and achieved much higher scalability than a benchmark MapReduce-based program, MR-Tandem. MCtandem is an open-source software tool implemented in C++. The source code and the parameter settings are available at https://github.com/LogicZY/MCtandem.