Accelerating string set matching in FPGA hardware for bioinformatics research.

Author information

Dandass Yoginder S, Burgess Shane C, Lawrence Mark, Bridges Susan M

Affiliation

Institute of Digital Biology, Mississippi State University, Mississippi 39762, USA.

Publication information

BMC Bioinformatics. 2008 Apr 15;9:197. doi: 10.1186/1471-2105-9-197.

DOI:10.1186/1471-2105-9-197

PMID:18412963

Original article: https://pmc.ncbi.nlm.nih.gov/articles/PMC2374783/

Abstract

BACKGROUND

This paper describes techniques for accelerating the performance of the string set matching problem with particular emphasis on applications in computational proteomics. The process of matching peptide sequences against a genome translated in six reading frames is part of a proteogenomic mapping pipeline that is used as a case-study. The Aho-Corasick algorithm is adapted for execution in field programmable gate array (FPGA) devices in a manner that optimizes space and performance. In this approach, the traditional Aho-Corasick finite state machine (FSM) is split into smaller FSMs, operating in parallel, each of which matches up to 20 peptides in the input translated genome. Each of the smaller FSMs is further divided into five simpler FSMs such that each simple FSM operates on a single bit position in the input (five bits are sufficient for representing all amino acids and special symbols in protein sequences).
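The bit-split idea described above can be modelled in software. The following is a minimal sketch, not the authors' hardware design: it builds one Aho-Corasick machine for a small peptide group, derives five binary FSMs (one per bit of a 5-bit symbol code) by subset construction, and reports a match only when all five partial match sets agree. The symbol encoding in `ALPHABET` and the names `build_ac`, `bit_split`, and `find_matches` are assumptions made for illustration; the paper's actual encoding and memory layout are not given in the abstract.

```python
from collections import deque

# Assumed encoding: 20 amino acids plus '*' (stop), mapped to 5-bit codes 0..20.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY*"
CODE = {ch: i for i, ch in enumerate(ALPHABET)}
BITS = 5


def build_ac(patterns):
    """Build a full Aho-Corasick DFA: goto[state][code] and out[state]."""
    nsym = len(ALPHABET)
    goto = [[None] * nsym]
    out = [set()]
    for pid, pattern in enumerate(patterns):
        s = 0
        for ch in pattern:
            c = CODE[ch]
            if goto[s][c] is None:
                goto.append([None] * nsym)
                out.append(set())
                goto[s][c] = len(goto) - 1
            s = goto[s][c]
        out[s].add(pid)
    fail = [0] * len(goto)
    queue = deque()
    for c in range(nsym):
        t = goto[0][c]
        if t is None:
            goto[0][c] = 0          # missing root edges loop back to the root
        else:
            queue.append(t)
    while queue:                    # breadth-first, so fail[s] is done before s
        s = queue.popleft()
        out[s] |= out[fail[s]]
        for c in range(nsym):
            t = goto[s][c]
            if t is None:
                goto[s][c] = goto[fail[s]][c]
            else:
                fail[t] = goto[fail[s]][c]
                queue.append(t)
    return goto, out


def bit_split(goto, out, bit):
    """Subset-construct a binary FSM that sees only one bit of each code."""
    start = frozenset([0])
    index = {start: 0}
    trans, pmv = [], []             # pmv = partial match set per binary state
    work = deque([start])
    while work:
        states = work.popleft()
        pmv.append(set().union(*(out[s] for s in states)))
        row = []
        for v in (0, 1):
            nxt = frozenset(goto[s][c] for s in states
                            for c in range(len(ALPHABET))
                            if ((c >> bit) & 1) == v)
            if nxt not in index:
                index[nxt] = len(index)
                work.append(nxt)
            row.append(index[nxt])
        trans.append(row)
    return trans, pmv


def find_matches(patterns, text):
    """Return (start, pattern) pairs found by the five bit-level FSMs."""
    goto, out = build_ac(patterns)
    machines = [bit_split(goto, out, b) for b in range(BITS)]
    states = [0] * BITS
    hits = []
    for pos, ch in enumerate(text):
        code = CODE[ch]
        common = None
        for b, (trans, pmv) in enumerate(machines):
            states[b] = trans[states[b]][(code >> b) & 1]
            cur = pmv[states[b]]
            common = cur if common is None else common & cur
        for pid in sorted(common):  # a pattern matches only if all bits agree
            hits.append((pos - len(patterns[pid]) + 1, patterns[pid]))
    return hits
```

Calling `find_matches(["MKV", "KVL", "VLA"], "AMKVLAG*")` scans the text once. In this software model the five binary machines are stepped in a loop; in hardware they would run in parallel, and each binary state needs only two next-state pointers instead of one per symbol.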

RESULTS

This bit-split organization of the Aho-Corasick implementation enables efficient utilization of the limited random access memory (RAM) resources available in typical FPGAs. The use of on-chip RAM as opposed to FPGA logic resources for FSM implementation also enables rapid reconfiguration of the FPGA without the place and routing delays associated with complex digital designs.

CONCLUSION

Experimental results show storage efficiencies of over 80% for several data sets. Furthermore, the FPGA implementation executing at 100 MHz is nearly 20 times faster than an implementation of the traditional Aho-Corasick algorithm executing on a 2.67 GHz workstation.
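To put the reported speedup in perspective, the figures above can be combined into a rough per-clock-cycle comparison. This is a back-of-the-envelope sketch only: it treats "nearly 20 times" as exactly 20 and assumes the speedup is measured as a ratio of run times on the same workload, neither of which the abstract states precisely.

```python
fpga_clock_hz = 100e6     # FPGA clock from the abstract
cpu_clock_hz = 2.67e9     # workstation clock from the abstract
speedup = 20              # "nearly 20 times", taken as 20 (assumption)

# The workstation clock is faster by this factor...
clock_ratio = cpu_clock_hz / fpga_clock_hz

# ...so per clock cycle, the FPGA design does roughly this much more work.
per_cycle_advantage = speedup * clock_ratio

print(f"clock ratio: {clock_ratio:.1f}, per-cycle advantage: {per_cycle_advantage:.0f}")
```

Under these assumptions the FPGA completes on the order of several hundred times more matching work per clock cycle than the software implementation, which is consistent with many small FSMs operating in parallel.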
