Dandass Yoginder S, Burgess Shane C, Lawrence Mark, Bridges Susan M
Institute of Digital Biology, Mississippi State University, Mississippi 39762, USA.
BMC Bioinformatics. 2008 Apr 15;9:197. doi: 10.1186/1471-2105-9-197.
This paper describes techniques for accelerating the performance of the string set matching problem with particular emphasis on applications in computational proteomics. The process of matching peptide sequences against a genome translated in six reading frames is part of a proteogenomic mapping pipeline that is used as a case-study. The Aho-Corasick algorithm is adapted for execution in field programmable gate array (FPGA) devices in a manner that optimizes space and performance. In this approach, the traditional Aho-Corasick finite state machine (FSM) is split into smaller FSMs, operating in parallel, each of which matches up to 20 peptides in the input translated genome. Each of the smaller FSMs is further divided into five simpler FSMs such that each simple FSM operates on a single bit position in the input (five bits are sufficient for representing all amino acids and special symbols in protein sequences).
This bit-split organization of the Aho-Corasick implementation enables efficient utilization of the limited random access memory (RAM) resources available in typical FPGAs. The use of on-chip RAM as opposed to FPGA logic resources for FSM implementation also enables rapid reconfiguration of the FPGA without the place and routing delays associated with complex digital designs.
Experimental results show storage efficiencies of over 80% for several data sets. Furthermore, the FPGA implementation executing at 100 MHz is nearly 20 times faster than an implementation of the traditional Aho-Corasick algorithm executing on a 2.67 GHz workstation.
本文描述了加速字符串集匹配问题性能的技术,特别强调了在计算蛋白质组学中的应用。将肽序列与六个阅读框翻译的基因组进行匹配的过程是蛋白质基因组图谱绘制流程的一部分,该流程用作案例研究。Aho-Corasick算法经过调整,以便在现场可编程门阵列(FPGA)设备中执行,从而优化空间和性能。在这种方法中,传统的Aho-Corasick有限状态机(FSM)被拆分为更小的并行运行的FSM,每个FSM在输入的翻译基因组中最多匹配20个肽。每个较小的FSM进一步分为五个更简单的FSM,以便每个简单FSM在输入的单个位位置上运行(五位足以表示蛋白质序列中的所有氨基酸和特殊符号)。
Aho-Corasick实现的这种位拆分组织能够有效利用典型FPGA中有限的随机存取存储器(RAM)资源。与使用FPGA逻辑资源实现FSM相比,使用片上RAM还能使FPGA快速重新配置,而不会出现与复杂数字设计相关的布局和布线延迟。
实验结果表明,几个数据集的存储效率超过80%。此外,在100 MHz运行的FPGA实现比在2.67 GHz工作站上运行的传统Aho-Corasick算法实现快近20倍。