Nowicki Marek, Bzhalava Davit, Bała Piotr
1 Faculty of Mathematics and Computer Science, Nicolaus Copernicus University in Toruń, Poland.
2 Department of Laboratory Medicine, Karolinska Institutet, Stockholm, Sweden.
J Comput Biol. 2018 Aug;25(8):871-881. doi: 10.1089/cmb.2018.0079. Epub 2018 Jul 13.
Basic Local Alignment Search Tool (BLAST) is an essential algorithm that researchers use for sequence alignment analysis. The National Center for Biotechnology Information (NCBI)-BLAST application is the most popular implementation of the BLAST algorithm. It can run on a single multithreaded node. However, the volume of nucleotide and protein data is growing fast, making a single node insufficient. It is increasingly important to develop high-performance computing solutions that help researchers analyze genetic data in a fast and scalable way. This article presents the execution of the BLAST algorithm on high-performance computing (HPC) clusters and supercomputers in a massively parallel manner using thousands of processors. The Parallel Computing in Java (PCJ) library has been used to implement the optimal splitting of the input queries, the work distribution, and search management. It is used with the unmodified NCBI-BLAST package, which is an additional advantage for users. The resulting application, PCJ-BLAST, is responsible for reading the sequences for comparison, splitting them up, and starting multiple NCBI-BLAST executables. Since I/O performance could limit sequence analysis performance, the article includes an investigation of this problem. The results show that, using Java and the PCJ library, it is possible to perform sequence analysis on hundreds of nodes in parallel. We achieved excellent performance and efficiency and significantly reduced the time required for sequence analysis. Our work also shows that the PCJ library can be used as an effective tool for the fast development of scalable applications.
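The workflow described in the abstract (read the query sequences, split them up, start one unmodified NCBI-BLAST executable per share of the work) can be illustrated with a minimal Java sketch. It is not the authors' PCJ-BLAST code and does not use the PCJ API. The class name `QuerySplitSketch`, the chunk file names, the database name `nt`, and the thread count are hypothetical. Simple round-robin distribution stands in for the "optimal splitting" the article mentions, whose details are not given here. The `blastn` options `-query`, `-db`, `-out`, and `-num_threads` are standard NCBI BLAST+ flags.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: plain Java, no PCJ calls, hypothetical names throughout.
final class QuerySplitSketch {

    /** Splits FASTA text into records; each record starts at a '>' header line. */
    static List<String> parseFasta(String fasta) {
        List<String> records = new ArrayList<>();
        StringBuilder current = null;
        for (String line : fasta.split("\\R")) {
            if (line.isEmpty()) {
                continue; // ignore blank lines
            }
            if (line.startsWith(">")) {
                if (current != null) {
                    records.add(current.toString());
                }
                current = new StringBuilder();
            }
            if (current != null) {
                current.append(line).append('\n');
            }
        }
        if (current != null) {
            records.add(current.toString());
        }
        return records;
    }

    /** Deals records round-robin into one chunk per worker (an assumed strategy). */
    static List<List<String>> distribute(List<String> records, int workers) {
        List<List<String>> chunks = new ArrayList<>();
        for (int i = 0; i < workers; i++) {
            chunks.add(new ArrayList<>());
        }
        for (int i = 0; i < records.size(); i++) {
            chunks.get(i % workers).add(records.get(i));
        }
        return chunks;
    }

    /** Builds the command line for an unmodified blastn executable. */
    static List<String> blastCommand(String queryFile, String db, String outFile, int threads) {
        return List.of("blastn", "-query", queryFile, "-db", db,
                "-out", outFile, "-num_threads", String.valueOf(threads));
    }

    public static void main(String[] args) {
        String fasta = ">seq1\nACGTACGT\n>seq2\nGGCC\nTTAA\n>seq3\nAAAA\n";
        List<List<String>> chunks = distribute(parseFasta(fasta), 2);
        for (int w = 0; w < chunks.size(); w++) {
            // Hypothetical per-worker file names; a real run would write the chunk to disk first.
            List<String> cmd = blastCommand("chunk" + w + ".fasta", "nt", "chunk" + w + ".out", 4);
            System.out.println("worker " + w + ": " + chunks.get(w).size()
                    + " sequences, command: " + String.join(" ", cmd));
            // To launch BLAST for real: new ProcessBuilder(cmd).inheritIO().start();
        }
    }
}
```

In a cluster setting, each chunk would go to a separate node and each node would launch its own BLAST process. The sketch only prints the commands, so it runs without BLAST installed.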