Kumar Sumesh, Saeed Fahad
Knight Foundation School of Computing and Information Sciences, Florida International University (FIU), Miami, FL USA 33199.
Int Conf Field Program Log Appl. 2021 Aug-Sep;2021:99-103. doi: 10.1109/fpl53798.2021.00024. Epub 2021 Oct 12.
Database search algorithms play a crucial part in systems biology studies by identifying proteins from mass spectrometry (MS) data. Many of these algorithms incur huge computational costs because they compute a similarity score for every pair of sparse experimental spectrum and candidate theoretical spectrum vectors. Modern MS instruments that generate high-resolution spectra require comparison against an enormous search space, further emphasizing the need for efficient accelerators. Recent research has shown that the overall cost of scoring and deducing peptides is dominated by communication between the levels of the memory hierarchy and the processing units. However, these communication costs are seldom considered in accelerator-based architectures, which leads to inefficient DRAM accesses and poor data utilization caused by irregular memory access patterns. In this paper, we propose a novel communication-avoiding micro-architecture that computes a cross-correlation based similarity score. It uses an efficient local cache and peptide pre-fetching to minimize DRAM accesses, and a custom-designed peptide broadcast bus to allow input reuse. An efficient bus arbitration scheme was designed and implemented to minimize synchronization cost and exploit the parallelism of the processing elements. Our simulation results show that the proposed micro-architecture performs on average 24x better than a CPU implementation running on a 3.6 GHz Intel i7-4970 processor with 16GB memory.
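As a rough software illustration of the input reuse described in the abstract, the sketch below scores sparse spectra against theoretical peptides with a simplified cross-correlation: the zero-shift correlation minus the mean correlation over nearby shifts. It is a minimal sketch under stated assumptions, not the paper's micro-architecture. The function names, the dictionary representation of a spectrum, and the `max_shift` window are all hypothetical choices made for this example.

```python
from typing import Dict, List

def sparse_dot(spectrum: Dict[int, float], peaks: List[int], shift: int = 0) -> float:
    """Sum experimental intensities found at the theoretical peak bins, offset by `shift`."""
    return sum(spectrum.get(b + shift, 0.0) for b in peaks)

def xcorr_score(spectrum: Dict[int, float], peaks: List[int], max_shift: int = 2) -> float:
    """Zero-shift correlation minus the mean correlation over the nonzero shifts.

    `max_shift` is an illustrative window size, not a value taken from the paper.
    """
    shifts = [s for s in range(-max_shift, max_shift + 1) if s != 0]
    background = sum(sparse_dot(spectrum, peaks, s) for s in shifts) / len(shifts)
    return sparse_dot(spectrum, peaks) - background

def score_batch(spectra: List[Dict[int, float]],
                peptides: List[List[int]],
                max_shift: int = 2) -> List[List[float]]:
    """Fetch each theoretical peptide once and reuse it across every held spectrum.

    The outer loop stands in for one broadcast of a peptide; the inner loop stands
    in for processing elements that each hold one experimental spectrum locally.
    """
    scores = [[0.0] * len(peptides) for _ in spectra]
    for j, peaks in enumerate(peptides):
        for i, spectrum in enumerate(spectra):
            scores[i][j] = xcorr_score(spectrum, peaks, max_shift)
    return scores

if __name__ == "__main__":
    # Spectra map m/z bin index -> intensity; peptides list theoretical peak bins.
    spectra = [{10: 1.0, 20: 2.0}, {11: 4.0}]
    peptides = [[10, 20], [11]]
    for row in score_batch(spectra, peptides):
        print(row)
```

Ordering the loops so that the peptide is the outer variable means each theoretical spectrum is read from the slow store once per batch rather than once per comparison, which is the software analogue of the reuse that a broadcast bus provides.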