muBLASTP: database-indexed protein sequence search on multicore CPUs.

Authors

Zhang Jing, Misra Sanchit, Wang Hao, Feng Wu-Chun

Affiliations

Department of Computer Science, Virginia Tech, 225 Stanger Street, Blacksburg, 24060, VA, USA.

Parallel Computing Lab, Intel Corporation, Bengaluru, Karnataka, 560102, India.

Publication information

BMC Bioinformatics. 2016 Nov 4;17(1):443. doi: 10.1186/s12859-016-1302-4.

DOI:10.1186/s12859-016-1302-4

PMID:27809763

Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC5096327/

Abstract

BACKGROUND

The Basic Local Alignment Search Tool (BLAST) is a fundamental program in the life sciences that searches databases for the sequences most similar to a query sequence. Currently, the BLAST algorithm uses a query-indexed approach. Although many tools suggest that sequence search with a database index can achieve much higher throughput (e.g., BLAT, SSAHA, and CAFE), they either cannot deliver the same level of sensitivity as query-indexed BLAST, i.e., NCBI BLAST, or support only nucleotide sequence search, e.g., MegaBLAST. Because query indexing and database indexing pose different challenges and have different characteristics, the existing techniques for query-indexed search cannot be applied to database-indexed search.
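To make the distinction concrete, the following is a minimal sketch of database-indexed seeding: the word index is built once over the database and then reused for every query, whereas a query-indexed search rebuilds its lookup table per query and scans the database. This is an illustration only and not muBLASTP's actual index structure; the function names, the word length, and the toy sequences are assumptions, and it matches exact words only, whereas BLASTP also seeds on neighboring words that score above a threshold.

```python
from collections import defaultdict

W = 3  # word length, chosen here for illustration only


def build_database_index(db_seqs, w=W):
    """Map each length-w word to every (sequence id, offset) where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in enumerate(db_seqs):
        for pos in range(len(seq) - w + 1):
            index[seq[pos:pos + w]].append((seq_id, pos))
    return index


def find_seed_hits(query, index, w=W):
    """Return (query offset, sequence id, subject offset) for exact word matches."""
    hits = []
    for q_pos in range(len(query) - w + 1):
        # .get avoids inserting empty entries into the defaultdict
        for seq_id, s_pos in index.get(query[q_pos:q_pos + w], ()):
            hits.append((q_pos, seq_id, s_pos))
    return hits


# Hypothetical toy database: the index is built once and shared by all queries.
database = ["MKVLAAGIV", "GGMKVLTT"]
index = build_database_index(database)
hits = find_seed_hits("AMKVLA", index)
```

Each hit is only a candidate seed; in a BLAST-style pipeline such seeds would still go through ungapped and gapped extension before being reported.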

RESULTS

muBLASTP, a novel database-indexed BLAST for protein sequence search, returns hits identical to those of NCBI BLAST. On Intel Haswell multicore CPUs, for a single query, single-threaded muBLASTP achieves up to a 4.41-fold speedup for the alignment stages and up to a 1.75-fold end-to-end speedup over single-threaded NCBI BLAST. For a batch of queries, multithreaded muBLASTP achieves up to a 5.7-fold speedup for the alignment stages and up to a 4.56-fold end-to-end speedup over multithreaded NCBI BLAST.

CONCLUSIONS

With a newly designed index structure for protein databases and associated optimizations to the BLASTP algorithm, we refactored BLASTP for modern multicore processors, achieving much higher throughput with an acceptable memory footprint for the database index.

https://cdn.ncbi.nlm.nih.gov/pmc/blobs/ec67/5096327/e2467c105bab/12859_2016_1302_Fig1_HTML.jpg
